Commit Graph

5342 Commits

Author SHA1 Message Date
Changyu Bi 76ed9a3990 Add missing status check when compiling with `ASSERT_STATUS_CHECKED=1` (#11686)
Summary:
It seems the flag `-fno-elide-constructors` is incorrectly overwritten in Makefile by 9c2ebcc2c3/Makefile (L243)
Applying the change in PR https://github.com/facebook/rocksdb/issues/11675 shows a lot of missing status checks. This PR adds the missing status checks.

Most of changes are just adding asserts in unit tests. I'll add pr comment around more interesting changes that need review.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11686

Test Plan: change Makefile as in https://github.com/facebook/rocksdb/issues/11675, and run `ASSERT_STATUS_CHECKED=1 TEST_UINT128_COMPAT=1 ROCKSDB_MODIFY_NPHASH=1 LIB_MODE=static OPT="-DROCKSDB_NAMESPACE=alternative_rocksdb_ns" make V=1 -j24 J=24 check`

Reviewed By: hx235

Differential Revision: D48176132

Pulled By: cbi42

fbshipit-source-id: 6758946cfb1c6ff84c4c1e0ca540d05e6fc390bd
2023-08-09 15:46:44 -07:00
Yu Zhang c751583c03 Set default cf ts sz for a reused transaction (#11685)
Summary:
Set up the default column family timestamp size for a reused write committed transaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11685

Test Plan: Added unit test.

Reviewed By: ltamasi

Differential Revision: D48195129

Pulled By: jowlyzhang

fbshipit-source-id: 54faa900c123fc6daa412c01490e36c10a24a678
2023-08-09 13:49:42 -07:00
Hui Xiao 9a034801ce Group rocksdb.sst.read.micros stat by different user read IOActivity + misc (#11444)
Summary:
**Context/Summary:**
- Similar to https://github.com/facebook/rocksdb/pull/11288 but for user read such as `Get(), MultiGet(), DBIterator::XXX(), Verify(File)Checksum()`.
   - For this, I refactored some user-facing `MultiGet` calls in `TransactionBase` and various types of `DB` so that it does not call a user-facing `Get()` but `GetImpl()` for passing the `ReadOptions::io_activity` check (see PR conversation)
   - New user read stats breakdown are guarded by `kExceptDetailedTimers` since measurement shows they have 4-5% regression to the upstream/main.

- Misc
   - More refactoring: with https://github.com/facebook/rocksdb/pull/11288, we complete passing `ReadOptions/IOOptions` to FS level. So we can now replace the previously [added](https://github.com/facebook/rocksdb/pull/9424) `rate_limiter_priority` parameter in `RandomAccessFileReader`'s `Read/MultiRead/Prefetch()` with `IOOptions::rate_limiter_priority`
   - Also, `ReadAsync()` call time is measured in `SST_READ_MICRO` now

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11444

Test Plan:
- CI fake db crash/stress test
- Microbenchmarking

**Build** `make clean && ROCKSDB_NO_FBCODE=1 DEBUG_LEVEL=0 make -jN db_basic_bench`
- google benchmark version: 604f6fd3f4
- db_basic_bench_base: upstream
- db_basic_bench_pr: db_basic_bench_base + this PR
- asyncread_db_basic_bench_base: upstream + [db basic bench patch for IteratorNext](https://github.com/facebook/rocksdb/compare/main...hx235:rocksdb:micro_bench_async_read)
- asyncread_db_basic_bench_pr: asyncread_db_basic_bench_base + this PR

**Test**

Get
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{null_stat|base|pr} --benchmark_filter=DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/negative_query:0/enable_filter:0/mmap:1/threads:1 --benchmark_repetitions=1000
```

Result
```
Coming soon
```

AsyncRead
```
TEST_TMPDIR=/dev/shm ./asyncread_db_basic_bench_{base|pr} --benchmark_filter=IteratorNext/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/async_io:1/include_detailed_timers:0 --benchmark_repetitions=1000 > syncread_db_basic_bench_{base|pr}.out
```

Result
```
Base:
1956,1956,1968,1977,1979,1986,1988,1988,1988,1990,1991,1991,1993,1993,1993,1993,1994,1996,1997,1997,1997,1998,1999,2001,2001,2002,2004,2007,2007,2008,

PR (2.3% regression, due to measuring `SST_READ_MICRO` that wasn't measured before):
1993,2014,2016,2022,2024,2027,2027,2028,2028,2030,2031,2031,2032,2032,2038,2039,2042,2044,2044,2047,2047,2047,2048,2049,2050,2052,2052,2052,2053,2053,
```

Reviewed By: ajkr

Differential Revision: D45918925

Pulled By: hx235

fbshipit-source-id: 58a54560d9ebeb3a59b6d807639692614dad058a
2023-08-08 17:26:50 -07:00
Yu Zhang 9c2ebcc2c3 Log user_defined_timestamps_persisted flag in event logger (#11683)
Summary:
As titled, and also removed an undefined and unused member function in for ColumnFamilyData

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11683

Reviewed By: ajkr

Differential Revision: D48156290

Pulled By: jowlyzhang

fbshipit-source-id: cc99aaafe69db6611af3854cb2b2ebc5044941f7
2023-08-08 12:25:21 -07:00
Peter Dillinger e214964f40 Fix a potential memory leak on row_cache insertion failure (#11682)
Summary:
Although the built-in Cache implementations never return failure on Insert without keeping a reference (Handle), a custom implementation could. The code for inserting into row_cache does not keep a reference but does not clean up appropriately on non-OK. This is a fix.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11682

Test Plan: unit test added that previously fails under ASAN

Reviewed By: ajkr

Differential Revision: D48153831

Pulled By: pdillinger

fbshipit-source-id: 86eb7387915c5b38b6ff5dd8deb4e1e223b7d020
2023-08-08 11:34:41 -07:00
tabokie 6d1effaf01 exclude uninitialized files when estimating compression ratio (#11664)
Summary:
Exclude files with uninitialized table properties when estimating compression ratio.

Cherry-picking downstream PR: https://github.com/tikv/rocksdb/pull/335

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11664

Reviewed By: cbi42

Differential Revision: D48002518

Pulled By: ajkr

fbshipit-source-id: 931fac8a06b4ed7b7b605cf79903302f1b8babfd
2023-08-07 12:35:42 -07:00
Xinye Tao d2b0652b32 compute compaction score once for a batch of range file deletes (#10744)
Summary:
Only re-calculate compaction score once for a batch of deletions. Fix performance regression brought by https://github.com/facebook/rocksdb/pull/8434.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10744

Test Plan:
In one of our production cluster that recently upgraded to RocksDB 6.29, it takes more than 10 minutes to delete files in 30,000 ranges. The RocksDB instance contains approximately 80,000 files. After this patch, the duration reduces to 100+ ms, which is on par with RocksDB 6.4.

Cherry-picking downstream PR: https://github.com/tikv/rocksdb/pull/316

Signed-off-by: tabokie <xy.tao@outlook.com>

Reviewed By: cbi42

Differential Revision: D48002581

Pulled By: ajkr

fbshipit-source-id: 7245607ee3ad79c53b648a6396c9159f166b9437
2023-08-07 12:29:31 -07:00
Andrew Kryczka 4500a0d6ec Avoid an std::map copy in persistent stats (#11681)
Summary:
An internal user reported this copy showing up in a CPU profile. We can use move instead.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11681

Differential Revision: D48103170

Pulled By: ajkr

fbshipit-source-id: 083d6470181a0041bb5275b657aa61bee23a3729
2023-08-06 18:01:08 -07:00
Changyu Bi eca48bc166 Avoid shifting component too large error in FileTtlBooster (#11673)
Summary:
When `num_levels` > 65, we may be shifting more than 63 bits in FileTtlBooster. This can give errors like: `runtime error: shift exponent 98 is too large for 64-bit type 'uint64_t' (aka 'unsigned long')`. This PR makes a quick fix for this issue by taking a min in the shifting component. This issue should be rare since it requires a user using a large `num_levels`. I'll follow up with a more complex fix if needed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11673

Test Plan: * Add a unit test that produce the above error before this PR. Need to compile it with ubsan: `COMPILE_WITH_UBSAN=1 OPT="-fsanitize-blacklist=.circleci/ubsan_suppression_list.txt" ROCKSDB_DISABLE_ALIGNED_NEW=1 USE_CLANG=1 make V=1 -j32 compaction_picker_test`

Reviewed By: hx235

Differential Revision: D48074386

Pulled By: cbi42

fbshipit-source-id: 25e59df7e93f20e0793cffb941de70ac815d9392
2023-08-04 14:29:50 -07:00
Hui Xiao 09882a52d6 Prepare for deprecation of Options::access_hint_on_compaction_start (#11658)
Summary:
**Context/Summary:**
After https://github.com/facebook/rocksdb/pull/11631, file hint is not longer needed for compaction read. Therefore we can deprecate `Options::access_hint_on_compaction_start`. As this is a public API change, we should first mark the relevant APIs (including the Java's) deprecated and remove it in next major release 9.0.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11658

Test Plan: No code change

Reviewed By: ajkr

Differential Revision: D47997856

Pulled By: hx235

fbshipit-source-id: 16e015ae7728c224b1caef73143aa9915668f4ac
2023-08-03 17:23:02 -07:00
Vardhan 87a21d08fe Add an option to trigger flush when the number of range deletions reach a threshold (#11358)
Summary:
Add a mutable column family option `memtable_max_range_deletions`. When non-zero, RocksDB will try to flush the current memtable after it has at least `memtable_max_range_deletions` range deletions. Java API is added and crash test is updated accordingly to randomly enable this option.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11358

Test Plan:
* New unit test: `DBRangeDelTest.MemtableMaxRangeDeletions`
* Ran crash test `python3 ./tools/db_crashtest.py whitebox --simple --memtable_max_range_deletions=20` and saw logs showing flushed memtables usually with 20 range deletions.

Reviewed By: ajkr

Differential Revision: D46582680

Pulled By: cbi42

fbshipit-source-id: f23d6fa8d8264ecf0a18d55c113ba03f5e2504da
2023-08-02 19:58:56 -07:00
amatveev-cf 946d1009bc Expand Statistics support in the C API (#11263)
Summary:
Adds a few missing features to the C API:
1) Statistics level
2) Getting individual values instead of a serialized string

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11263

Test Plan: unit tests

Reviewed By: ajkr

Differential Revision: D47309963

Pulled By: hx235

fbshipit-source-id: 84df59db4045fc0fb3ea4aec451bc5c2afd2a248
2023-08-02 10:53:40 -07:00
Peter Dillinger 7a1b0207e6 format_version=6 and context-aware block checksums (#9058)
Summary:
## Context checksum
All RocksDB checksums currently use 32 bits of checking
power, which should be 1 in 4 billion false negative (FN) probability (failing to
detect corruption). This is true for random corruptions, and in some cases
small corruptions are guaranteed to be detected. But some possible
corruptions, such as in storage metadata rather than storage payload data,
would have a much higher FN rate. For example:
* Data larger than one SST block is replaced by data from elsewhere in
the same or another SST file. Especially with block_align=true, the
probability of exact block size match is probably around 1 in 100, making
the FN probability around that same. Without `block_align=true` the
probability of same block start location is probably around 1 in 10,000,
for FN probability around 1 in a million.

To solve this problem in new format_version=6, we add "context awareness"
to block checksum checks. The stored and expected checksum value is
modified based on the block's position in the file and which file it is in. The
modifications are cleverly chosen so that, for example
* blocks within about 4GB of each other are guaranteed to use different context
* blocks that are offset by exactly some multiple of 4GiB are guaranteed to use
different context
* files generated by the same process are guaranteed to use different context
for the same offsets, until wrap-around after 2^32 - 1 files

Thus, with format_version=6, if a valid SST block and checksum is misplaced,
its checksum FN probability should be essentially ideal, 1 in 4B.

## Footer checksum
This change also adds checksum protection to the SST footer (with
format_version=6), for the first time without relying on whole file checksum.
To prevent a corruption of the format_version in the footer (e.g. 6 -> 5) to
defeat the footer checksum, we change much of the footer data format
including an "extended magic number" in format_version 6 that would be
interpreted as empty index and metaindex block handles in older footer
versions. We also change the encoding of handles to free up space for
other new data in footer.

## More detail: making space in footer
In order to keep footer the same size in format_version=6 (avoid change to IO
patterns), we have to free up some space for new data. We do this two ways:
* Metaindex block handle is encoded down to 4 bytes (from 10) by assuming
it immediately precedes the footer, and by assuming it is < 4GB.
* Index block handle is moved into metaindex. (I don't know why it was
in footer to begin with.)

## Performance
In case of small performance penalty, I've made a "pay as you go" optimization
to compensate: replace `MutableCFOptions` in BlockBasedTableBuilder::Rep
with the only field used in that structure after construction: `prefix_extractor`.
This makes the PR an overall performance improvement (results below).

Nevertheless I'm seeing essentially no difference going from fv=5 to fv=6,
even including that improvement for both. That's based on extreme case table
write performance testing, many files with many blocks. This is relatively
checksum intensive (small blocks) and salt generation intensive (small files).

```
(for I in `seq 1 100`; do TEST_TMPDIR=/dev/shm/dbbench2 ./db_bench -benchmarks=fillseq -memtablerep=vector -disable_wal=1 -allow_concurrent_memtable_write=false -num=3000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -write_buffer_size=100000 -compression_type=none -block_size=1000; done) 2>&1 | grep micros/op | tee out
awk '{ tot += $5; n += 1; } END { print int(1.0 * tot / n) }' < out
```

Each value below is ops/s averaged over 100 runs, run simultaneously with competing
configuration for load fairness

Before -> after (both fv=5): 483530 -> 483673 (negligible)
Re-run 1: 480733 -> 485427 (1.0% faster)
Re-run 2: 483821 -> 484541 (0.1% faster)
Before (fv=5) -> after (fv=6): 482006 -> 485100 (0.6% faster)
Re-run 1: 482212 -> 485075 (0.6% faster)
Re-run 2: 483590 -> 484073 (0.1% faster)
After fv=5 -> after fv=6: 483878 -> 485542 (0.3% faster)
Re-run 1: 485331 -> 483385 (0.4% slower)
Re-run 2: 485283 -> 483435 (0.4% slower)
Re-run 3: 483647 -> 486109 (0.5% faster)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9058

Test Plan:
unit tests included (table_test, db_properties_test, salt in env_test). General DB tests
and crash test updated to test new format_version.

Also temporarily updated the default format version to 6 and saw some test failures. Almost all
were due to an inadvertent additional read in VerifyChecksum to verify the index block checksum,
though it's arguably a bug that VerifyChecksum does not appear to (re-)verify the index block
checksum, just assuming it was verified in opening the index reader (probably *usually* true but
probably not always true). Some other concerns about VerifyChecksum are left in FIXME
comments. The only remaining test failure on change of default (in block_fetcher_test) now
has a comment about how to upgrade the test.

The format compatibility test does not need updating because we have not updated the default
format_version.

Reviewed By: ajkr, mrambacher

Differential Revision: D33100915

Pulled By: pdillinger

fbshipit-source-id: 8679e3e572fa580181a737fd6d113ed53c5422ee
2023-07-30 16:40:01 -07:00
Changyu Bi 6a0f637633 Compare the number of input keys and processed keys for compactions (#11571)
Summary:
... to improve data integrity validation during compaction.

A new option `compaction_verify_record_count` is introduced for this verification and is enabled by default. One exception when the verification is not done is when a compaction filter returns kRemoveAndSkipUntil which can cause CompactionIterator to seek until some key and hence not able to keep track of the number of keys processed.

For expected number of input keys, we sum over the number of total keys - number of range tombstones across compaction input files (`CompactionJob::UpdateCompactionStats()`). Table properties are consulted if `FileMetaData` is not initialized for some input file. Since table properties for all input files were also constructed during `DBImpl::NotifyOnCompactionBegin()`, `Compaction::GetTableProperties()` is introduced to reduce duplicated code.

For actual number of keys processed, each subcompaction will record its number of keys processed to `sub_compact->compaction_job_stats.num_input_records` and aggregated when all subcompactions finish (`CompactionJob::AggregateCompactionStats()`). In the case when some subcompaction encountered kRemoveAndSkipUntil from compaction filter and does not have accurate count, it propagates this information through `sub_compact->compaction_job_stats.has_num_input_records`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11571

Test Plan:
* Add a new unit test `DBCompactionTest.VerifyRecordCount` for the corruption case.
* All other unit tests for non-corrupted case.
* Ran crash test for a few hours: `python3 ./tools/db_crashtest.py whitebox --simple`

Reviewed By: ajkr

Differential Revision: D47131965

Pulled By: cbi42

fbshipit-source-id: cc8e94565dd526c4347e9d3843ecf32f6727af92
2023-07-28 09:47:31 -07:00
Yu Zhang c24ef26ca7 Support switching on / off UDT together with in-Memtable-only feature (#11623)
Summary:
Add support to allow enabling / disabling user-defined timestamps feature for an existing column family in combination with the in-Memtable only feature.

To do this, this PR includes:
1) Log the `persist_user_defined_timestamps` option per column family in Manifest to facilitate detecting an attempt to enable / disable UDT. This entry is enforced to be logged in the same VersionEdit as the user comparator name entry.

2) User-defined timestamps related options are validated when re-opening a column family, including user comparator name and the `persist_user_defined_timestamps` flag. These type of settings and settings change are considered valid:
     a) no user comparator change and no effective `persist_user_defined_timestamp` flag change.
     b) switch user comparator to enable UDT provided the immediately effective `persist_user_defined_timestamps` flag
         is false.
     c) switch user comparator to disable UDT provided that the before-change `persist_user_defined_timestamps` is
         already false.
3) when an attempt to enable UDT is detected, we mark all its existing SST files as "having no UDT" by marking its `FileMetaData.user_defined_timestamps_persisted` flag to false and handle their file boundaries `FileMetaData.smallest`, `FileMetaData.largest` by padding a min timestamp.

4) while enabling / disabling UDT feature, timestamp size inconsistency in existing WAL logs are handled to make it compatible with the running user comparator.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11623

Test Plan:
```
make all check
./db_with_timestamp_basic_test --gtest-filter="*EnableDisableUDT*"
./db_wal_test --gtest_filter="*EnableDisableUDT*"
```

Reviewed By: ltamasi

Differential Revision: D47636862

Pulled By: jowlyzhang

fbshipit-source-id: dcd19f67292da3c3cc9584c09ad00331c9ab9322
2023-07-26 20:16:32 -07:00
Yu Zhang 4ea7b796b7 Respect cutoff timestamp during flush (#11599)
Summary:
Make flush respect the cutoff timestamp `full_history_ts_low` as much as possible for the user-defined timestamps in Memtables only feature. We achieve this by not proceeding with the actual flushing but instead reschedule the same `FlushRequest` so a follow up flush job can continue with the check after some interval.

This approach doesn't work well for atomic flush, so this feature currently is not supported in combination with atomic flush. Furthermore, this approach also requires a customized method to get the next immediately bigger user-defined timestamp. So currently it's limited to comparator that use uint64_t as the user-defined timestamp format. This support can be extended when we add such a customized method to `AdvancedColumnFamilyOptions`.

For non atomic flush request, at any single time, a column family can only have as many as one FlushRequest for it in the `flush_queue_`. There is deduplication done at `FlushRequest` enqueueing(`SchedulePendingFlush`) and dequeueing time (`PopFirstFromFlushQueue`). We hold the db mutex between when a `FlushRequest` is popped from the queue and the same FlushRequest get rescheduled, so no other `FlushRequest` with a higher `max_memtable_id` can be added to the `flush_queue_` blocking us from re-enqueueing the same `FlushRequest`.

Flush is continued nevertheless if there is risk of entering write stall mode had the flush being postponed, e.g. due to accumulation of write buffers, exceeding the `max_write_buffer_number` setting. When this happens, the newest user-defined timestamp in the involved Memtables need to be tracked and we use it to increase the `full_history_ts_low`, which is an inclusive cutoff timestamp for which RocksDB promises to keep all user-defined timestamps equal to and newer than it.

Tet plan:
```
./column_family_test --gtest_filter="*RetainUDT*"
./memtable_list_test --gtest_filter="*WithTimestamp*"
./flush_job_test --gtest_filter="*WithTimestamp*"
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11599

Reviewed By: ajkr

Differential Revision: D47561586

Pulled By: jowlyzhang

fbshipit-source-id: 9400445f983dd6eac489e9dd0fb5d9b99637fe89
2023-07-26 16:25:06 -07:00
Changyu Bi 5c2a063c49 Clarify usage for options `ttl` and `periodic_compaction_seconds` for universal compaction (#11552)
Summary:
this is stacked on https://github.com/facebook/rocksdb/issues/11550 to further clarify usage of these two options for universal compaction. Similar to FIFO, the two options have the same meaning for universal compaction, which can be confusing to use. For example, for universal compaction, dynamically changing the value of `ttl` has no impact on periodic compactions. Users should dynamically change `periodic_compaction_seconds` instead. From feature matrix (https://fburl.com/daiquery/5s647hwh), there are instances where users set `ttl` to non-zero value and `periodic_compaction_seconds` to 0. For backward compatibility reason, instead of deprecating `ttl`, comments are added to mention that `periodic_compaction_seconds` are preferred. In `SanitizeOptions()`, we update the value of `periodic_compaction_seconds` to take into account value of `ttl`. The logic is documented in relevant option comment.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11552

Test Plan: * updated existing unit test `DBTestUniversalCompaction2.PeriodicCompactionDefault`

Reviewed By: ajkr

Differential Revision: D47381434

Pulled By: cbi42

fbshipit-source-id: bc41f29f77318bae9a96be84dd89bf5617c7fd57
2023-07-26 11:31:54 -07:00
Rémi Calixte 6628ff12d6 Extend C API to expose base db of transaction db (#11562)
Summary:
Add `rocksdb_transactiondb_get_base_db` and `rocksdb_transactiondb_close_base_db` functions to the C API modeled after `rocksdb_optimistictransactiondb_get_base_db` and `rocksdb_optimistictransactiondb_close_base_db`:
ca50ccc71a/include/rocksdb/c.h (L2711-L2716)

With this pair of functions, it is possible to get a `rocksdb_t *` from a `rocksdb_transactiondb_t *`. The main goal is to be able to use the approximate memory usage API, only accessible to the `rocksdb_t *` type:
ca50ccc71a/include/rocksdb/c.h (L2821-L2833)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11562

Reviewed By: ajkr

Differential Revision: D47603343

Pulled By: jowlyzhang

fbshipit-source-id: c70cf6af5834026e232fe7791634db3a396f7d5e
2023-07-20 12:03:09 -07:00
shuzz 2f712235ab optimized code (#11614)
Summary:
improvement code by std::move and c++17

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11614

Reviewed By: ajkr

Differential Revision: D47599519

Pulled By: jowlyzhang

fbshipit-source-id: 6b897876f4e87e94a74c53d8db2a01303d500bff
2023-07-19 12:52:39 -07:00
huangmengbin 98d0f6ec08 fix: VersionSet::DumpManifest (#11605)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/11604

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11605

Reviewed By: jowlyzhang

Differential Revision: D47459254

Pulled By: ajkr

fbshipit-source-id: 4420e443fbf4bd01ddaa2b47285fc4445bf36246
2023-07-19 10:44:10 -07:00
Dan Wang 8a7b9888d4 Fix the sync point SanitizeOptions::AfterChangeMaxOpenFiles which is not executed in db_compaction_test (#11583)
Summary:
In [db_impl_open.cc](https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_open.cc), the sync point `SanitizeOptions::AfterChangeMaxOpenFiles` is used to set `max_open_files` with some specified "**invalid**" value even if it has been sanitized.

However,  in [db_compaction_test.cc](https://github.com/facebook/rocksdb/blob/main/db/db_compaction_test.cc), `SanitizeOptions::AfterChangeMaxOpenFiles` would not be executed since `SyncPoint::EnableProcessing()` is run after `DBTestBase::Reopen()`.  To enable `SanitizeOptions::AfterChangeMaxOpenFiles`,  `SyncPoint::EnableProcessing()` should be put ahead of `DBTestBase::Reopen()`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11583

Test Plan:
run unit tests locally as below:
```
make J=1 check

[ RUN      ] DBCompactionTest.LevelTtlCascadingCompactions
[       OK ] DBCompactionTest.LevelTtlCascadingCompactions (85 ms)
[ RUN      ] DBCompactionTest.LevelPeriodicCompaction
[       OK ] DBCompactionTest.LevelPeriodicCompaction (57 ms)
```

Reviewed By: jowlyzhang

Differential Revision: D47311827

Pulled By: ajkr

fbshipit-source-id: 99165e87a8129e404af06fdf9b4c96eca540fd23
2023-07-19 10:41:09 -07:00
Andrew Kryczka 05c3b8ecac Prepare for specialized interface for row cache (#11620)
Summary:
An internal user wants to implement a key-aware row cache policy. For that, they need to know the components of the cache key, especially the user key component. With a specialized `RowCache` interface, we will be able to tell them the components so they won't have to make assumptions about our internal key schema.

This PR prepares for the specialized `RowCache` interface by updating the migration plan of https://github.com/facebook/rocksdb/issues/11450. I added a release note for the removed APIs and didn't mention the added ones for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11620

Reviewed By: pdillinger

Differential Revision: D47536962

Pulled By: ajkr

fbshipit-source-id: bbee0fc4ad67fc699a66b8f2b4ea4544dd003691
2023-07-18 19:12:58 -07:00
Changyu Bi 662a1c99f6 Verify number of keys flushed during DB open (#11611)
Summary:
Extend the coverage for option `flush_verify_memtable_count`. The verification code is similar to the ones for regular flush: c3c84b3397/db/flush_job.cc (L956-L965)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11611

Test Plan: existing tests.

Reviewed By: ajkr

Differential Revision: D47478893

Pulled By: cbi42

fbshipit-source-id: ca580c9dbcd6e91facf2e49210661336a79a248e
2023-07-18 10:39:11 -07:00
leipeng bc0db33483 Optimize about sstableKeyCompare (#11610)
Summary:
We observed `CompactionOutputs::UpdateGrandparentBoundaryInfo` consumes much time for `InternalKey::DecodeFrom` and `InternalKey::~InternalKey` in flame graph.

This PR omit the InternalKey object in `CompactionOutputs::UpdateGrandparentBoundaryInfo` .

![image](https://github.com/facebook/rocksdb/assets/1574991/661eaeec-2f46-46c6-a6a8-9738d6c191de)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11610

Reviewed By: ajkr

Differential Revision: D47426971

Pulled By: cbi42

fbshipit-source-id: f0d3a8186d778294515c0685032f5b395c4d6a62
2023-07-13 22:26:55 -07:00
Changyu Bi 854eb76a8c Improve error message when an SST file in MANIFEST is not found (#11573)
Summary:
I got the following error message when an SST file is recorded in MANIFEST but is missing from the db folder.
It's confusing in two ways:
1. The part about file "./074837.ldb" which RocksDB will attempt to open only after ./074837.sst is not found.
2. The last part about "No such file or directory in file ./MANIFEST-074507" sounds like `074837.ldb` is not found in manifest.

```
ldb --hex --db=. get some_key

Failed: Corruption: Corruption: IO error: No such file or directory: While open a file for random read: ./074837.ldb: No such file or directory in file ./MANIFEST-074507
```

Improving the error message a little bit:

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11573

Test Plan:
run the same command after this PR
```
Failed: Corruption: Corruption: IO error: No such file or directory: While open a file for random read: ./074837.sst: No such file or directory  The file ./MANIFEST-074507 may be corrupted.
```

Reviewed By: ajkr

Differential Revision: D47192056

Pulled By: cbi42

fbshipit-source-id: 06863f376cc4455803cffb2250c41399b4c39467
2023-07-10 15:52:38 -07:00
weedge 1a7c741977 fix: std::optional value() build error on older macOS SDK (#11574)
Summary:
`PORTABLE=1 USE_SSE=1 USE_PCLMUL=1 WITH_JEMALLOC_FLAG=1 JEMALLOC=1 make static_lib`  on MacOS

clang --version:

Apple clang version 12.0.0 (clang-1200.0.32.29)
Target: x86_64-apple-darwin22.4.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

compile err like this:

util/udt_util.cc:39:39: error: 'value' is unavailable: introduced in macOS 10.14
  if (running_ts_sz != recorded_ts_sz.value()) {
                                      ^
/Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/optional:944:33: note: 'value' has been explicitly marked
      unavailable here
    constexpr value_type const& value() const&
                                ^
util/udt_util.cc:217:62: error: 'value' is unavailable: introduced in macOS 10.14
      *new_key = StripTimestampFromUserKey(key, record_ts_sz.value());
                                                             ^
/Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/optional:953:27: note: 'value' has been explicitly marked
      unavailable here
    constexpr value_type& value() &
                          ^
2 errors generated.
make: *** [util/udt_util.o] Error 1

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11574

Reviewed By: ajkr

Differential Revision: D47269519

Pulled By: cbi42

fbshipit-source-id: da49d90cdf00a0af519f91c0cf7d257401eb395f
2023-07-10 14:21:34 -07:00
Yu Zhang f74526341d Handle file boundaries when timestamps should not be persisted (#11578)
Summary:
Handle file boundaries `FileMetaData.smallest`, `FileMetaData.largest` for when `persist_user_defined_timestamps` is false:
    1) on the manifest write path, the original user-defined timestamps in file boundaries are stripped. This stripping is done during `VersionEdit::Encode` to limit the effect of the stripping to only the persisted version of the file boundaries.
    2) on the manifest read path during DB open, a a min timestamp is padded to the file boundaries. Ideally, this padding should happen during `VersionEdit::Decode` so that all in memory file boundaries have a compatible user key format as the running user comparator. However, because the user-defined timestamp size information is not available at that time. This change is added to `VersionEditHandler::OnNonCfOperation`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11578

Test Plan:
```
make all check
./version_edit_test --gtest_filter="*EncodeDecodeNewFile4HandleFileBoundary*".
./db_with_timestamp_basic_test --gtest_filter="*HandleFileBoundariesTest*"
```

Reviewed By: pdillinger

Differential Revision: D47309399

Pulled By: jowlyzhang

fbshipit-source-id: 21b4d54d2089a62826b31d779094a39cb2bbbd51
2023-07-10 11:03:25 -07:00
Yu Zhang baf37a0e81 Fix a unit test hole for recovering UDTs with WAL files (#11577)
Summary:
Thanks pdillinger for pointing out this test hole. The test `DBWALTestWithTimestamp.Recover` that is intended to test recovery from WAL including user-defined timestamps doesn't achieve its promised coverage. Specifically, after https://github.com/facebook/rocksdb/issues/11557, timestamps will be removed during flush, and RocksDB by default flush memtables during recovery with `avoid_flush_during_recovery` defaults to false.  This test didn't fail even if all the timestamps are quickly lost due to the default flush behavior.

This PR renamed test `Recover` to `RecoverAndNoFlush`, and updated it to verify timestamps are successfully recovered from WAL with some time-travel reads. `avoid_flush_during_recovery` is set to true to help do this verification.

On the other hand, for test `DBWALTestWithTimestamp.RecoverAndFlush`, since flush on reopen is DB's default behavior. Setting the flags `max_write_buffer` and `arena_block_size` are not really the factors that enforces the flush, so these flags are removed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11577

Test Plan: ./db_wal_test

Reviewed By: pdillinger

Differential Revision: D47142892

Pulled By: jowlyzhang

fbshipit-source-id: 9465e278806faa5885b541b4e32d99e698edef7d
2023-07-07 16:47:49 -07:00
Changyu Bi 1f410ff95f Make `rocksdb_options_add_compact_on_deletion_collector_factory` backward compatible (#11593)
Summary:
https://github.com/facebook/rocksdb/issues/11542 added a parameter to the C API `rocksdb_options_add_compact_on_deletion_collector_factory` which causes some internal builds to fail. External users using this API would also require code change. Making the API backward compatible by restoring the old C API and add the parameter to a new C API `rocksdb_options_add_compact_on_deletion_collector_factory_del_ratio`.

Also updated change log for 8.4 and will backport this change to 8.4 branch once landed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11593

Test Plan: `make c_test && ./c_test`

Reviewed By: akankshamahajan15

Differential Revision: D47299555

Pulled By: cbi42

fbshipit-source-id: 517dc093ef4cf02cac2fe4af4f1af13754bbda63
2023-07-07 13:16:20 -07:00
Changyu Bi df082c8d1d Deprecate option `periodic_compaction_seconds` for FIFO compaction (#11550)
Summary:
both options `ttl` and `periodic_compaction_seconds` have the same meaning for FIFO compaction, which is redundant and can be confusing to use. For example, setting TTL to 0 does not disable TTL: user needs to also set periodic_compaction_seconds to 0. Another example is that dynamically setting `periodic_compaction_seconds` (surprisingly) has no effect on TTL compaction. This is because FIFO compaction picker internally only looks at value of `ttl`. The value of `ttl` is in `SanitizeOptions()` which take into account the value of `periodic_compaction_seconds`, but dynamically setting an option does not invoke this method.

This PR clarifies the usage of both options for FIFO compaction: only `ttl` should be used, `periodic_compaction_seconds` will not have any effect on FIFO compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11550

Test Plan:
- updated existing unit test `DBOptionsTest.SanitizeFIFOPeriodicCompaction`
- checked existing values of both options in feature matrix: https://fburl.com/daiquery/xxd0gs9w. All current uses cases either have `periodic_compaction_seconds = 0` or have `periodic_compaction_seconds > ttl`, so should not cause change of behavior.

Reviewed By: ajkr

Differential Revision: D46902959

Pulled By: cbi42

fbshipit-source-id: a9ede235b276783b4906aaec443551fa62ceff4c
2023-07-05 14:40:45 -07:00
leipeng 25b08eb438 MemTable::Add: first_seqno_.compare_exchange_weak to earliest_seqno_ (#11398)
Summary:
This should be a benign bug caused by a long lived typo, this PR fix this issue.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11398

Reviewed By: ajkr

Differential Revision: D47163379

Pulled By: cbi42

fbshipit-source-id: 531728cae496fd7ac1371bbbd64fc103c3a90dcf
2023-07-03 15:05:38 -07:00
darionyaphet f4e304f987 Simplify conditional judgment (#11580)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11580

Reviewed By: ajkr

Differential Revision: D47158687

Pulled By: cbi42

fbshipit-source-id: 4841b77eee78ddcf35da6ea33da71861c5f1e773
2023-07-03 09:41:48 -07:00
Yu Zhang 15053f3ab4 Logically strip timestamp during flush (#11557)
Summary:
Logically strip the user-defined timestamp when L0 files are created during flush when `AdvancedColumnFamilyOptions.persist_user_defined_timestamps` is false. Logically stripping timestamp here means replacing the original user-defined timestamp with a mininum timestamp, which for now is hard coded to be all zeros bytes.

While working on this, I caught a missing piece on the `BlockBuilder` level for this feature. The current quick path `std::min(buffer_size, last_key_size)` needs a bit tweaking to work for this feature. When user-defined timestamp is stripped during block building, on writing first entry or right after resetting, `buffer` is empty and `buffer_size` is zero as usual. However, in follow-up writes, depending on the size of the stripped user-defined timestamp, and the size of the value, what's in `buffer` can sometimes be smaller than `last_key_size`, leading `std::min(buffer_size, last_key_size)` to truncate the `last_key`. Previous test doesn't caught the bug because in those tests, the size of the stripped user-defined timestamps bytes is smaller than the length of the value. In order to avoid the conditional operation, this PR changed the original trivial `std::min` operation into an arithmetic operation. Since this is a change in a hot and performance critical path, I did the following benchmark to check no observable regression is introduced.
```TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=50000000```
Compiled with DEBUG_LEVEL=0
Test vs. control runs simulaneous for better accuracy, units = ops/sec
                       PR  vs base:
Round 1: 350652 vs 349055
Round 2: 365733 vs 364308
Round 3: 355681 vs 354475

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11557

Test Plan:
New timestamp specific test added or existing tests augmented, both are parameterized with `UserDefinedTimestampTestMode`:
`UserDefinedTimestampTestMode::kNormal` -> UDT feature enabled, write / read with min timestamp
`UserDefinedTimestampTestMode::kStripUserDefinedTimestamps` -> UDT feature enabled, write / read with min timestamp, set Options.persist_user_defined_timestamps to false.

```
make all check
./db_wal_test --gtest_filter="*WithTimestamp*"
./flush_job_test --gtest_filter="*WithTimestamp*"
./repair_test --gtest_filter="*WithTimestamp*"
./block_based_table_reader_test
```

Reviewed By: pdillinger

Differential Revision: D47027664

Pulled By: jowlyzhang

fbshipit-source-id: e729193b6334dfc63aaa736d684d907a022571f5
2023-06-29 15:50:50 -07:00
Griffin Smith bfdc91017c C-API: Expose remaining PlainTableOptions (#11442)
Summary:
Expose the remaining fields of PlainTableOptions as arguments to `rocksdb_options_set_plain_table_factory` in the C API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11442

Reviewed By: ajkr

Differential Revision: D46786962

Pulled By: hx235

fbshipit-source-id: 8862083dde332bfecc5ff02f9375776ad35c11f5
2023-06-27 12:30:28 -07:00
Jay Schmidek f7aa70a72f Add create_column_families to C api (#9527)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9527

Reviewed By: akankshamahajan15

Differential Revision: D47007647

Pulled By: ajkr

fbshipit-source-id: e13544130b2731e07fa5fa4b9a2aa5f75b548c7e
2023-06-27 11:58:33 -07:00
Changyu Bi ca50ccc71a Add CreateColumnFamilyWithImport to `StackableDB` and `DBImplReadOnly` (#11556)
Summary:
https://github.com/facebook/rocksdb/issues/11378 added a new overloaded `CreateColumnFamilyWithImport` API and updated the virtual function in `StackableDB` and `DBImplReadOnly` to the newly overloaded one. This caused internal error when there is a derived class that tries to override the original `CreateColumnFamilyWithImport` function. This PR adds the original `CreateColumnFamilyWithImport` function back to `StackableDB` and `DBImplReadOnly`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11556

Test Plan: check if this fixes an internal build

Reviewed By: akankshamahajan15

Differential Revision: D46980506

Pulled By: cbi42

fbshipit-source-id: 975a6c5748bf9481499a62ee5997ca59e542e3bc
2023-06-23 13:56:26 -07:00
akankshamahajan fbd2f563bb Add an interface to provide support for underlying FS to pass their own buffer during reads (#11324)
Summary:
1. Public API change: Replace `use_async_io`  API in file_system with `SupportedOps` API which is used by underlying FileSystem to indicate to upper layers whether the FileSystem supports different operations introduced in `enum FSSupportedOps `. Right now operations are `async_io` and whether FS will provide its own buffer during reads or not. The api is changed to extend it to various FileSystem operations in one API rather than creating a separate API for each operation.

2. Provide support for underlying FS to pass their own buffer during Reads (async and sync read) instead of using RocksDB provided `scratch` (buffer) in `FSReadRequest`. Currently only MultiRead supports it and later will be extended to other reads as well (point lookup, scan etc). More details in how to enable in file_system.h

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11324

Test Plan: Tested locally

Reviewed By: anand1976

Differential Revision: D44465322

Pulled By: akankshamahajan15

fbshipit-source-id: 9ec9e08f839b5cc815e75d5dade6cd549998d0ec
2023-06-23 11:48:49 -07:00
Yu Zhang 7521478b43 Record the `persist_user_defined_timestamps` flag in manifest (#11515)
Summary:
Start to record the value of the flag `AdvancedColumnFamilyOptions.persist_user_defined_timestamps` in the Manifest and table properties for a SST file when it is created. And use the recorded flag when creating a table reader for the SST file. This flag's default value is true, it is only explicitly recorded if it's false.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11515

Test Plan:
```
make all check
./version_edit_test
```

Reviewed By: ltamasi

Differential Revision: D46920386

Pulled By: jowlyzhang

fbshipit-source-id: 075c20363d3d2cc1368422ecc805617ed135cc26
2023-06-21 21:49:01 -07:00
Alexandre Lavigne 2926e0718c Add missing parameter in C API (#11542)
Summary:
The class `NewCompactOnDeletionCollectorFactory` exposes the parameter `delete_ratio`.

The C API `rocksdb_options_add_compact_on_deletion_collector_factory` does not allow a user to pass a delete ration to be passed down the the C++ class bellow.

The class has default value for the delete ratio which makes it pass the compilation and the tests.

closes https://github.com/facebook/rocksdb/issues/11541

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11542

Reviewed By: ajkr

Differential Revision: D46770908

Pulled By: cbi42

fbshipit-source-id: 7b5162fe459896052e392e2d85a8f6c01db3b464
2023-06-20 13:18:04 -07:00
Levi Tamasi 022d89549d Attempt to deflake DBWALTestWithEnrichedEnv.SkipDeletedWALs (#11537)
Summary:
Calling `Flush` (even with `wait==true`) does not guarantee that obsolete WAL files are physically deleted before the call returns. The patch attempts to fix the resulting flakiness by using `SyncPoint`s to make sure `PurgeObsoleteFiles` finishes before checking for WAL deletions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11537

Test Plan:
```
gtest-parallel --repeat=1000 ./db_wal_test --gtest_filter="*SkipDeletedWALs*"
```

Reviewed By: pdillinger

Differential Revision: D46736050

Pulled By: ltamasi

fbshipit-source-id: 47a931b7a3a03ef681fbf4adb5a0b223d452703e
2023-06-19 16:04:49 -07:00
Changyu Bi bc04ec85db Make option `level_compaction_dynamic_level_bytes` true by default (#11525)
Summary:
after https://github.com/facebook/rocksdb/issues/11321 and https://github.com/facebook/rocksdb/issues/11340 (both included in RocksDB v8.2), migration from `level_compaction_dynamic_level_bytes=false` to `level_compaction_dynamic_level_bytes=true` is automatic by RocksDB and requires no manual compaction from user. Making the option true by default as it has several advantages: 1. better space amplification guarantee (a more stable LSM shape). 2. compaction is more adaptive to write traffic. 3. automatic draining of unneeded levels. Wiki is updated with more detail: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction#option-level_compaction_dynamic_level_bytes-and-levels-target-size.

The PR mostly contains fixes for unit tests as they assumed `level_compaction_dynamic_level_bytes=false`. Most notable change is commit f742be330c and b1928e42b3 which override the default option in DBTestBase to still set `level_compaction_dynamic_level_bytes=false` by default. This helps to reduce the change needed for unit tests. I think this default option override in unit tests is okay since the behavior of `level_compaction_dynamic_level_bytes=true` is tested by explicitly setting this option. Also, `level_compaction_dynamic_level_bytes=false` may be more desired in unit tests as it makes it easier to create a desired LSM shape.

Comment for option `level_compaction_dynamic_level_bytes` is updated to reflect this change and change made in https://github.com/facebook/rocksdb/issues/10057.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11525

Test Plan: `make -j32 J=32 check` several times to try to catch flaky tests due to this option change.

Reviewed By: ajkr

Differential Revision: D46654256

Pulled By: cbi42

fbshipit-source-id: 6b5827dae124f6f1fdc8cca2ac6f6fcd878830e1
2023-06-15 21:12:39 -07:00
yaphet 253bc91953 Move the status judgment into the block (#11534)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11534

Reviewed By: cbi42

Differential Revision: D46732248

Pulled By: ajkr

fbshipit-source-id: c9360866c35a2c436ab83b85a14ed7bd43a88d95
2023-06-15 16:55:38 -07:00
darionyaphet 9f774baaa8 Support Error Recovery Retry Flush in GetFlushReasonString (#11536)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11536

Reviewed By: cbi42

Differential Revision: D46732297

Pulled By: ajkr

fbshipit-source-id: 82bb078f189da233addc4f483eaa6eaf7fdd3910
2023-06-15 16:53:44 -07:00
mayue.fight fa878a0107 Support to create a CF by importing multiple non-overlapping CFs (#11378)
Summary:
The original Feature Request is from [https://github.com/facebook/rocksdb/issues/11317](https://github.com/facebook/rocksdb/issues/11317).
Flink uses rocksdb as the state backend,  all DB options are the same, and the keys of each DB instance are adjacent and there is no key overlap between two db instances.
In the Flink rescaling scenario, it is necessary to quickly split the DB according to a certain key range or quickly merge multiple DBs into one.

This PR is mainly used to quickly merge multiple DBs into one.

We hope to extend the function of `CreateColumnFamilyWithImports` to support creating ColumnFamily by importing multiple ColumnFamily with no overlapping keys.

The import logic is almost the same as `CreateColumnFamilyWithImport`, but it will check whether there is key overlap between CF when importing. The import will fail if there are key overlaps.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11378

Reviewed By: ajkr

Differential Revision: D46413709

Pulled By: cbi42

fbshipit-source-id: 846d0049fad11c59cf460fa846c345b26c658dfb
2023-06-15 12:25:04 -07:00
Changyu Bi 15e8a843d9 Do not include last level in compaction when `allow_ingest_behind=true` (#11489)
Summary:
when a DB is configured with `allow_ingest_behind = true`, the last level should be reserved for ingested files and these files should not be included in any compaction. Currently, a major compaction can compact these files to smaller levels. This can cause future files to be rejected for ingest behind (see `ExternalSstFileIngestionJob::CheckLevelForIngestedBehindFile()`). This PR fixes the issue such that files in the last level is not included in any compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11489

Test Plan: * Updated unit test `ExternalSSTFileTest.IngestBehind` to test that last level is not included in manual and auto-compaction.

Reviewed By: ajkr

Differential Revision: D46455711

Pulled By: cbi42

fbshipit-source-id: 5e2142c2a709ef932ad797897795021c06c4ac8c
2023-06-14 11:28:56 -07:00
Andrew Kryczka cac3240cbf add property "rocksdb.obsolete-sst-files-size" (#11533)
Summary:
See "unreleased_history/new_features/obsolete_sst_files_size.md" for description

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11533

Test Plan: updated unit test

Reviewed By: jowlyzhang

Differential Revision: D46703152

Pulled By: ajkr

fbshipit-source-id: ea5e31cd6293eccc154130c13e66b5271f57c102
2023-06-13 15:52:45 -07:00
Ignat Loskutov 7c67aee4a0 statistics.cc: fix mistype (#11509)
Summary:
Add new tickers: `rocksdb.error.handler.bg.error.count`, `rocksdb.error.handler.bg.io.error.count`, `rocksdb.error.handler.bg.retryable.io.error.count` to replace the misspelled ones: `rocksdb.error.handler.bg.errro.count`, `rocksdb.error.handler.bg.io.errro.count`, `rocksdb.error.handler.bg.retryable.io.errro.count` ('error' instead of 'errro'). Users should switch to use the new tickers before 9.0 release as the misspelled old tickers will be completely removed then.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11509

Reviewed By: ltamasi

Differential Revision: D46542809

Pulled By: jowlyzhang

fbshipit-source-id: a2a6d8354af46a060de81d40ef6f5336a80bd32e
2023-06-09 13:25:57 -07:00
Ignat Loskutov 05fcacdb42 Add missing stopwatch and perf timer to DBImplReadOnly (#11521)
Summary:
The `rocksdb.db.get.micros` histogram is never updated if the DB is open in ReadOnly mode, as well as the `get_cpu_nanos` perf counter. An earlier PR (https://github.com/facebook/rocksdb/issues/4260) for some reason has only added the TODO line, not the accounting itself, so this one is intended to fix it, adding two lines to match [DBImplSecondary](4dafa5b220/db/db_impl/db_impl_secondary.cc (L366-L367)).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11521

Reviewed By: cbi42

Differential Revision: D46577330

Pulled By: jowlyzhang

fbshipit-source-id: be147923e763af32bbc18fd6bdf3aff8ebf08aee
2023-06-08 17:40:56 -07:00
Changyu Bi 2e8cc98ab2 Fix subcompaction bug to allow running two subcompactions (#11501)
Summary:
as reported in https://github.com/facebook/rocksdb/issues/11476, RocksDB currently does not execute compactions in two subcompactions even when they qualify. This PR fixes this issue.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11501

Test Plan:
* Add a new unit test.
* Run crash test with max_subcompactions=2: `python3 tools/db_crashtest.py blackbox --simple  --subcompactions=2 --target_file_size_base=1048576 --compaction_style=0`
  * saw logs showing compactions being executed as 2 subcompactions
```
2023/06/01-17:28:44.028470 3147486 (Original Log Time 2023/06/01-17:28:44.025972) EVENT_LOG_v1 {"time_micros": 1685665724025939, "job": 6, "event": "compaction_finished", "compaction_time_micros": 34539, "compaction_time_cpu_micros": 26404, "output_level": 6, "num_output_files": 2, "total_output_size": 1109796, "num_input_records": 13188, "num_output_records": 13021, "num_subcompactions": 2, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 0, 0, 0, 0, 0, 13]}
```

Reviewed By: ajkr

Differential Revision: D46411497

Pulled By: cbi42

fbshipit-source-id: 3ebfc02e19f78f782e114a9546dc3d481d496258
2023-06-06 13:36:02 -07:00
leipeng ddfcbea3e1 IterKey: change space_[32] to 39 to utilize padding space (#10633)
Summary:
This PR utilize padding space.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10633

Reviewed By: pdillinger

Differential Revision: D46410978

Pulled By: ajkr

fbshipit-source-id: 23ec757b1eea9221c1390971e39d341c6b7f2003
2023-06-06 11:19:15 -07:00