Commit Graph

12654 Commits

Author SHA1 Message Date
Andrew Kryczka a43481b3d0 Decouple `RateLimiter` burst size and refill period (#12379)
Summary:
When the rate limiter does not have any waiting requests, the first request to arrive may consume all of the available bandwidth, despite potentially having lower priority than requests that arrive later in the same refill interval. Then, those higher priority requests must wait for a refill. So even in scenarios in which we have an overall bandwidth surplus, the highest priority requests can be sporadically delayed up to a whole refill period.

Alone, this isn't necessarily problematic as the refill period is configurable via `refill_period_us` and can be tuned down as needed until the max sporadic delay is tolerable. However, tuning down `refill_period_us` had a side effect of reducing burst size. Some users require a certain burst size to issue optimal I/O sizes to the underlying storage system.

To satisfy those users, this PR decouples the refill period from the burst size. That way, the max sporadic delay can be limited without impacting I/O sizes issued to the underlying storage system.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12379

Test Plan:
The goal is to show we can now limit the max sporadic delay without impacting compaction's I/O size.

The benchmark runs compaction with a large I/O size, while user reads simultaneously run at a low rate that does not consume all of the available bandwidth. The max sporadic delay is measured using the P100 of rocksdb.file.read.get.micros. I just used strace to verify the compaction reads follow `rate_limiter_single_burst_bytes`

Setup: `./db_bench -benchmarks=fillrandom,flush -write_buffer_size=67108864 -disable_auto_compactions=true -value_size=256 -num=1048576`

Benchmark: `./db_bench -benchmarks=readrandom -use_existing_db=true -num=1048576 -duration=10 -benchmark_read_rate_limit=4096 -rate_limiter_bytes_per_sec=67108864 -rate_limiter_refill_period_us=$refill_micros -rate_limiter_single_burst_bytes=16777216 -rate_limit_bg_reads=true -rate_limit_user_ops=true -statistics=true -cache_size=0 -stats_level=5 -compaction_readahead_size=16777216 -use_direct_reads=true`

Results:

refill_micros | rocksdb.file.read.get.micros (P100)
-- | --
10000 | 10802
100000 | 100240
1000000 | 922061

For verifying compaction read sizes: `strace -fye pread64 ./db_bench -benchmarks=compact -use_existing_db=true -rate_limiter_bytes_per_sec=67108864 -rate_limiter_refill_period_us=$refill_micros -rate_limiter_single_burst_bytes=16777216 -rate_limit_bg_reads=true -compaction_readahead_size=16777216 -use_direct_reads=true`

Reviewed By: hx235

Differential Revision: D54165675

Pulled By: ajkr

fbshipit-source-id: c5968486316cbfb7ff8e5b7d75d3589883dd1105
2024-02-26 16:55:13 -08:00
Peter Dillinger 41849210e9 Fix ArenaTest.UnmappedAllocation in some cases (#12378)
Summary:
Fix compatibility with transparent huge pages by allocating in increments (1MiB) smaller than the
typical smallest huge page size of 2MiB.

Also, bypass the test when jemalloc config.fill is used, which means the allocator is explicitly
configured to write to memory before we get it, which is not what this test expects.

Fixes https://github.com/facebook/rocksdb/issues/12351

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12378

Test Plan:
```
sudo bash -c 'echo "always" > /sys/kernel/mm/transparent_hugepage/enabled'
```
And see unit test fails before this change, passes after this change

Also tested internal buck build with dbg mode (previously failing).

Reviewed By: jaykorean, hx235

Differential Revision: D54139634

Pulled By: pdillinger

fbshipit-source-id: 179accebe918d8eecd46a979fcf21d356f9b5519
2024-02-26 16:08:21 -08:00
Richard Barnes a4ff83d1b2 Fix deprecated use of 0/NULL in internal_repo_rocksdb/repo/utilities/transactions/lock/range/range_tree/lib/locktree/wfg.cc + 3
Summary:
`nullptr` is typesafe. `0` and `NULL` are not. In the future, only `nullptr` will be allowed.

This diff helps us embrace the future _now_ in service of enabling `-Wzero-as-null-pointer-constant`.

Reviewed By: meyering

Differential Revision: D54163069

fbshipit-source-id: e5bb4b6ee79d82f1437ffed602bdb41dcfc0e59a
2024-02-25 22:17:04 -08:00
Yu Zhang 2940acac00 Persist table options use_delta_encoding in options file (#11987)
Summary:
This option is used for encoding keys in block based table files. It has been having a default true value since its introduction.

Users may not notice this option is not persisted in options file unless they are explicitly setting it to false. If the users expect `Iterator::GetProperty("rocksdb.iterator.is-key-pinned")` to return 1 when setting `ReadOptions.pin_data = true`, they should have noticed loading options file won't work and have work around for this by always explicitly set this option to false for opening DB. This change won't impact those users except that now they can remove their work around. If the users are not relying on key pinning behavior at all and as a result didn't notice the option is not persisted, this change shouldn't have any visible behavior impact either.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11987

Reviewed By: hx235

Differential Revision: D54093238

Pulled By: jowlyzhang

fbshipit-source-id: 256a3348c44cf91349034d1f6e242c437b32b9a5
2024-02-23 14:13:28 -08:00
Jay Huh f300438c20 Mark offpeak feature production-ready (#12375)
Summary:
The feature was released in 8.9.0 and verified at Meta internally (via ZippyDB test tier). Marking the feature ready in production.

Wiki has been added in https://github.com/facebook/rocksdb/wiki/Daily-Off%E2%80%90peak-Time-Option

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12375

Test Plan: No code change. N/A

Reviewed By: cbi42

Differential Revision: D54128890

Pulled By: jaykorean

fbshipit-source-id: a6c728ab87657fc5263048e21c366053ec5717af
2024-02-23 13:26:22 -08:00
Alan Paxton d1386de632 Java FFI blog post - Post-publication issues with images (2) (#12372)
Summary:
Replace unreliable-in-chrome PDF w/PNG of same graph

jmh-result-pinnable-vs-output-plot.pdf is showing as thumbnail on Chrome, rendering OK on Safari for some; I have converted it to PNG in the hope that will display correctly in all environments.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12372

Reviewed By: cbi42

Differential Revision: D54076718

Pulled By: jowlyzhang

fbshipit-source-id: 2eff995f0239ab7850a40063d841380738953533
2024-02-22 15:01:55 -08:00
raffertyyu e09b9d0cb9 Fix zstd typo in cmake (#12309)
Summary:
https://github.com/facebook/rocksdb/issues/12247 imported another typo in cmakelists.txt and findzstd.cmake.
cmake report ZSTD_INCLUDE_DIRS not found.
Actually it should be
aacf60dda2/cmake/modules/Findzstd.cmake (L8)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12309

Reviewed By: hx235

Differential Revision: D54070348

Pulled By: ajkr

fbshipit-source-id: eaf6e260ea3669b8ea38e4c74a375bb885761b51
2024-02-22 14:39:05 -08:00
anand76 d9c0d44dab Add a perf level for measuring user thread block time (#12368)
Summary:
Enabling time PerfCounter stats in RocksDB is currently very expensive, as it enables all sorts of relatively uninteresting stats, such as iteration, point lookup breakdown etc. This PR adds a new perf level between `kEnableCount` and `kEnableTimeExceptForMutex` to enable stats for time spent by user (i.e a RocksDB user) threads blocked by other RocksDB threads or events, such as a write group leader, write delay or stalls etc. It does not include time spent waiting to acquire mutexes, or waiting for IO.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12368

Test Plan: Add a unit test for write_thread_wait_nanos

Reviewed By: ajkr

Differential Revision: D54021583

Pulled By: anand1976

fbshipit-source-id: 3f6fcf71010132ffffca0391a5565f3b59fddd48
2024-02-22 12:14:53 -08:00
Alan Paxton cb4f4381f6 Java FFI blog post - Post-publication issues with images (#12366)
Summary:
Review comments
Broken image links

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12366

Reviewed By: hx235

Differential Revision: D53999663

Pulled By: ajkr

fbshipit-source-id: 72546f468367dc950eb61a876c4f763a580eb76d
2024-02-21 15:50:57 -08:00
jrchyang 70cb330a4a optimize file size statistics in benchmark script (#12363)
Summary:
Execute `ls` once when counting the file size of the `DB_DIR` and remove unused file number counter variable `c` . The test information as follow :

```Shell
# benchmark command

NUM_KEYS=30000000 CACHE_SIZE=6442450944 DB_DIR=/mnt/rocksdb_test WAL_DIR=/mnt/rocksdb_test ../tools/benchmark.sh fillseq_disable_wal

# before modification

cat /tmp/benchmark_fillseq.wal_disabled.v400.log.stats.sizes
0.0	0.0	0.0	0.0	195250
1.1	1.1	0.0	0.0	195300
2.5	2.5	0.0	0.0	195310
3.8	3.7	0.0	0.0	195320
5.1	5.1	0.0	0.0	195330
max sizes (GB): 5.1 all, 5.1 sst, 0.0 log, 0.0 blob

# after modification

cat /tmp/benchmark_fillseq.wal_disabled.v400.log.stats.sizes
0.0	0.0	0.0	0.0	194839
1.2	1.2	0.0	0.0	194849
2.6	2.6	0.0	0.0	194859
4.0	4.0	0.0	0.0	194909
5.4	5.4	0.0	0.0	194919
max sizes (GB): 5.4 all, 5.4 sst, 0.0 log, 0.0 blob
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12363

Reviewed By: hx235

Differential Revision: D54005427

Pulled By: ajkr

fbshipit-source-id: fae149705eb3fcda48d7381c42836a150f35ddc4
2024-02-21 15:45:18 -08:00
Yu Zhang f1ca47b904 Add support to bulk load external files for UDT in memtable only feature (#12356)
Summary:
This PR expands on the capabilities added in https://github.com/facebook/rocksdb/issues/12343. It adds sanity checks for external file's comparator name and user-defined timestamps related flag. With this, it now supports ingesting files to a column family that enables user-defined timestamps in Memtable only feature.

Two fields in the table properties are used for aformentioned check: 1) the comparator name, it records what comparator is used to create this external sst file, 2) the flag `user_defined_timestamps_persisted`.  We compare these two fields with the column family's settings. The details are in util function `ValidateUserDefinedTimestampsOptions`.

To optimize for the majority of the cases where sanity check should pass and the table properties read should not affect how `TableReader` is constructed, instead of read the table properties block separately and use it for sanity check before creating a `TableReader`. We continue using the current flow to first create a `TableReader`, use it for reading table properties and do sanity checks, and reset the`TableReader` for the case where the column family enables UDTs in memtable only feature, and the external file does not contain user-defined timestamps.

This PR also groups other table properties related sanity check in function `GetIngestedFileInfo` into the newly added `SanityCheckTableProperties` function.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12356

Test Plan:
added unit test
existing unit test

Reviewed By: cbi42

Differential Revision: D54025116

Pulled By: jowlyzhang

fbshipit-source-id: a918276c15f9908bd9df8513ce667638882e1554
2024-02-21 15:41:53 -08:00
Andrew Kryczka 8e29f243c9 No filesystem reads during `Merge()` writes (#12365)
Summary:
This occasional filesystem read in the write path has caused user pain. It doesn't seem very useful considering it only limits one component's merge chain length, and only helps merge uncached (i.e., infrequently read) values. This PR proposes allowing `max_successive_merges` to be exceeded when the value cannot be read from in-memory components. I included a rollback flag (`strict_max_successive_merges`) just in case.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12365

Test Plan:
"rocksdb.block.cache.data.add" is number of data blocks read from filesystem. Since the benchmark is write-only, compaction is disabled, and flush doesn't read data blocks, any nonzero value means the user write issued the read.

```
$ for s in false true; do echo -n "strict_max_successive_merges=$s: " && ./db_bench -value_size=64 -write_buffer_size=131072 -writes=128 -num=1 -benchmarks=mergerandom,flush,mergerandom -merge_operator=stringappend -disable_auto_compactions=true -compression_type=none -strict_max_successive_merges=$s -max_successive_merges=100 -statistics=true |& grep 'block.cache.data.add COUNT' ; done
strict_max_successive_merges=false: rocksdb.block.cache.data.add COUNT : 0
strict_max_successive_merges=true: rocksdb.block.cache.data.add COUNT : 1
```

Reviewed By: hx235

Differential Revision: D53982520

Pulled By: ajkr

fbshipit-source-id: e40f761a60bd601f232417ac0058e4a33ee9c0f4
2024-02-21 13:15:27 -08:00
Jeff Palm 5950907a82 switch to using centos8-native (#12367)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12367

switch to using centos8-native for rocks-db

Reviewed By: jowlyzhang

Differential Revision: D53971368

fbshipit-source-id: 635885dfb9e0ec6daa7623627a50e6b2897725ba
2024-02-21 12:03:40 -08:00
Alan Paxton 003197f005 Foreign function interface (Panama) blog (#11760)
Summary:
We did some experimental work with FFI and native memory as a potential improvement to the Java API.
The work lives (unmerged) in https://github.com/facebook/rocksdb/pull/11095

This is the report text from that branch, extract as a blog post.
Along with some supporting files (png, pdf of graphs).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11760

Reviewed By: hx235

Differential Revision: D53943442

Pulled By: ajkr

fbshipit-source-id: 7c9f800e25be22c10e736cdd3b0d65422ecfc826
2024-02-20 13:44:35 -08:00
leedonggyu ca99a8f153 Add function to check if the RocksDB instance is closed or not (#11337)
Summary:
In RocksDb jni threre is no method to know if the instance is closed or not.
so when using a closed instance it makes jvm crash.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11337

Reviewed By: jaykorean

Differential Revision: D53941387

Pulled By: ajkr

fbshipit-source-id: e3e4e6fe48409fa70a312810e467ec0c4ce356ef
2024-02-20 11:36:28 -08:00
Yu Zhang 31dfc81e18 Start 9.1.0 release (#12360)
Summary:
with release notes for 9.0.fb, format_compatible test update, and version.h update.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12360

Test Plan: CI

Reviewed By: cbi42

Differential Revision: D53879416

Pulled By: jowlyzhang

fbshipit-source-id: 29598893d9ce2d0bb181345ddb78f9b1529aee75
2024-02-16 18:26:48 -08:00
Alex Wied f2732d0586 Export GetSequenceNumber functionality for Snapshots (#12354)
Summary:
This PR adds `Snapshot->GetSequenceNumber()` functionality to the C API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12354

Reviewed By: akankshamahajan15

Differential Revision: D53836085

Pulled By: cbi42

fbshipit-source-id: 4a14daeba9210a69bcb74e4c1c0666deff1b4837
2024-02-16 10:28:41 -08:00
Adam Retter 055b21ab11 Update ZLib to 1.3.1 (#12358)
Summary:
pdillinger This fixes the RocksJava build, is also needed in the 8.10.fb and 8.11.fb branches please?

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12358

Reviewed By: jaykorean

Differential Revision: D53859743

Pulled By: pdillinger

fbshipit-source-id: b8417fccfee931591805f9aecdfae7c086fee708
2024-02-16 10:26:32 -08:00
anand76 d227276147 Deprecate some variants of Get and MultiGet (#12327)
Summary:
A lot of variants of Get and MultiGet have been added to `include/rocksdb/db.h` over the years. Try to consolidate them by marking variants that don't return timestamps as deprecated. The underlying DB implementation will check and return Status::NotSupported() if it doesn't support returning timestamps and the caller asks for it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12327

Reviewed By: pdillinger

Differential Revision: D53828151

Pulled By: anand1976

fbshipit-source-id: e0b5ca42d32daa2739d5f439a729815a2d4ff050
2024-02-16 09:21:06 -08:00
Akanksha Mahajan 956f1dfde3 Change ReadAsync callback API to remove const from FSReadRequest (#11649)
Summary:
Modify ReadAsync callback API to remove const from FSReadRequest as const doesn't let to fs_scratch to move the ownership.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11649

Test Plan: CircleCI jobs

Reviewed By: anand1976

Differential Revision: D53585309

Pulled By: akankshamahajan15

fbshipit-source-id: 3bff9035db0e6fbbe34721a5963443355807420d
2024-02-16 09:14:55 -08:00
anand76 28c1c15c29 Sync tickers and histograms across C++ and Java (#12355)
Summary:
The RocksDB ticker and histogram statistics were out of sync between the C++ and Java code, with a number of newer stats missing in TickerType.java and HistogramType.java. Also, there were gaps in numbering in portal.h, which could soon become an issue due to the number of tickers and the fact that we're limited to 1 byte in Java. This PR adds the missing stats, and re-numbers all of them. It also moves some stats around to try to group related stats together. Since this will go into a major release, compatibility shouldn't be an issue.

This should be automated at some point, since the current process is somewhat error prone.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12355

Reviewed By: jaykorean

Differential Revision: D53825324

Pulled By: anand1976

fbshipit-source-id: 298c180872f4b9f1ee54b8bb22f4e280458e7e09
2024-02-15 17:22:03 -08:00
Peter Dillinger 12018136d8 KeySegmentsExtractor and prototype higher-dimensional filtering (#12075)
Summary:
This change contains a prototype new API for "higher dimensional" filtering of read queries. Existing filters treat keys as one-dimensional, either as distinct points (whole key) or as contiguous ranges in comparator order (prefix filters). The proposed KeySegmentsExtractor allows treating keys as multi-dimensional for filtering purposes even though they still have a single total order across dimensions. For example, consider these keys in different LSM levels:

L0:
abc_0123
abc_0150
def_0114
ghi_0134

L1:
abc_0045
bcd_0091
def_0077
xyz_0080

If we get a range query for [def_0100, def_0200), a prefix filter (up to the underscore) will tell us that both levels are potentially relevant. However, if each SST file stores a simple range of the values for the second segment of the key, we would see that L1 only has [0045, 0091] which (under certain required assumptions) we are sure does not overlap with the given range query. Thus, we can filter out processing or reading any index or data blocks from L1 for the query.

This kind of case shows up with time-ordered data but is more general than filtering based on user timestamp. See https://github.com/facebook/rocksdb/issues/11332 . Here the "time" segments of the keys are meaningfully ordered with respect to each other even when the previous segment is different, so summarizing data along an alternate dimension of the key like this can work well for filtering.

This prototype implementation simply leverages existing APIs for user table properties and table filtering, which is not very CPU efficient. Eventually, we expect to create a native implementation. However, I have put some significant
thought and engineering into the new APIs overall, which I expect to be close to refined enough for production.

For details, see new public APIs in experimental.h. For a detailed example, see the new unit test in db_bloom_filter_test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12075

Test Plan: Unit test included

Reviewed By: jowlyzhang

Differential Revision: D53619406

Pulled By: pdillinger

fbshipit-source-id: 9e6e7b82b4db8d815db76a6ab340e90db2c191f2
2024-02-15 15:39:55 -08:00
Peter Dillinger bfd00bba9c Use format_version=6 by default (#12352)
Summary:
It's in production for a large storage service, and it was initially released 6 months ago (8.6.0). IMHO that's enough room for "easy downgrade" to most any user's previously integrated version, even if they only update a few times a year.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12352

Test Plan:
tests updated, including format capatibility test

table_test: ApproximateOffsetOfCompressed is affected because adding index block to metaindex adds about 13 bytes
to SST files in format_version 6. This test has historically been problematic and one reason is that, apparently, not only
could it pass/fail depending on snappy compression version, but also how long your host name is, because of db_host_id.
I've cleared that out for the test, which takes care of format_version=6 and hopefully improves long-term reliability.

Suggested follow-up: FinishImpl in table_test.cc takes a table_options that is ignored in some cases and might not match
the ioptions.table_factory configuration unless the caller is very careful. This should be cleaned up somehow.

Reviewed By: anand1976

Differential Revision: D53786884

Pulled By: pdillinger

fbshipit-source-id: 1964cbd40d3ab0a821fdc01c458031df716fcf51
2024-02-15 11:23:48 -08:00
Changyu Bi 6e57135a65 Add a changelog entry for PR 12322 (#12353)
Summary:
.. for public api change related to sst_dump.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12353

Reviewed By: jaykorean

Differential Revision: D53791123

Pulled By: cbi42

fbshipit-source-id: 3fbe9c7a3eb0a30dc1a00d39bc8a46028baa3779
2024-02-15 09:53:20 -08:00
Gilbert Liu d201e59941 Update llvm-fb to 15 (#12342)
Summary:
Update llvm-fb to 15 and some other dependency versions.

## Test

Copied over the two script files to tp2 librocksdb source and ran tp2_build, it succeeded.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12342

Reviewed By: ltamasi

Differential Revision: D53690631

Pulled By: bunnypak

fbshipit-source-id: 68f884b2a565f98bc3510290b411a901ef781adb
2024-02-14 14:40:05 -08:00
Yu Zhang f405e55cfa Add support in SstFileWriter to not persist user defined timestamps (#12348)
Summary:
This PR adds support in `SstFileWriter` to create SST files without persisting timestamps when the column family has enabled UDTs in Memtable only feature. The sst files created from flush and compaction do not contain timestamps, we want to make the sst files created by `SstFileWriter` to follow the same pattern and not persist timestamps. This is to prepare for ingesting external SST files for this type of column family.

There are timestamp-aware APIs and non timestamp-aware APIs in `SstFileWriter`. The former are exclusively used for when the column family's comparator is timestamp-aware, a.k.a `Comparator::timestamp_size() > 0`, while the latter are exclusively used for the column family's comparator is non timestamp-aware, a.k.a `Comparator::timestamp_size() == 0`.  There are sanity checks to make sure these APIs are correctly used.

In this PR, the APIs usage continue with above enforcement, where even though timestamps are not eventually persisted, users are still asked to use only the timestamp-aware APIs. But because data points will logically all have minimum timestamps, we don't allow multiple versions of the same user key (without timestamp) to be added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12348

Test Plan:
Added unit tests
Manual inspection of generated sst files with `sst_dump`

Reviewed By: ltamasi

Differential Revision: D53732667

Pulled By: jowlyzhang

fbshipit-source-id: e43beba0d3a1736b94ee5c617163a6280efd65b7
2024-02-13 20:30:07 -08:00
Yu Zhang 4bea83aa44 Remove the force mode for EnableFileDeletions API (#12337)
Summary:
There is no strong reason for user to need this mode while on the other hand, its behavior is destructive.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12337

Reviewed By: hx235

Differential Revision: D53630393

Pulled By: jowlyzhang

fbshipit-source-id: ce94b537258102cd98f89aa4090025663664dd78
2024-02-13 18:36:25 -08:00
Jay Huh 8c7c0a38f1 Minor refactor with printing stdout in blackbox tests (#12350)
Summary:
As title. Adding a missing stdout printing in `blackbox_crash_main()`

# Test

**Blackbox**
```
$> python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304
```
```
...
stdout:
 Choosing random keys with no overwrite
DB path: [/tmp/jewoongh/rocksdb_crashtest_blackbox34jwn9of]
(Re-)verified 0 unique IDs
2024/02/13-12:27:33  Initializing worker threads
Crash-recovery verification passed :)
2024/02/13-12:27:36  Starting database operations
...
jewoongh stdout test
jewoongh stdout test
...
jewoongh stdout test
stderr:
 jewoongh injected error
```

**Whitebox**
```
$> python3 tools/db_crashtest.py whitebox --simple --max_key=25000000 --write_buffer_size=4194304
```
```
...
stdout:
 Choosing random keys with no overwrite
Creating 24415 locks
...
2024/02/13-12:31:51  Initializing worker threads
Crash-recovery verification passed :)
2024/02/13-12:31:54  Starting database operations
jewoongh stdout test
jewoongh stdout test
jewoongh stdout test
...
stderr:
 jewoongh injected error
...
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12350

Reviewed By: akankshamahajan15, cbi42

Differential Revision: D53728910

Pulled By: jaykorean

fbshipit-source-id: ec90ed3b5e6a1102d1fb55d357d0371e5072a173
2024-02-13 14:15:52 -08:00
Yu Zhang 10d02456b6 Add support to bulk load external files with user-defined timestamps (#12343)
Summary:
This PR adds initial support to bulk loading external sst files with user-defined timestamps.

To ensure this invariant is met while ingesting external files:
     assume there are two internal keys: <K, ts1, seq1> and <K, ts2, seq2>, the following should hold:
     ts1 < ts2 iff. seq1 < seq2

These extra requirements are added for ingesting external files with user-defined timestamps:
1) A file with overlapping user key (without timestamp) range with the db cannot be ingested. This is because we cannot ensure above invariant is met without checking each overlapped key's timestamp and compare it with the timestamp from the db. This is an expensive step. This bulk loading feature will be used by MyRocks and currently their usage can guarantee ingested file's key range doesn't overlap with db.
4f3a57a13f/storage/rocksdb/ha_rocksdb.cc (L3312)
We can consider loose this requirement by doing this check in the future, this initial support just disallow this.

2) Files with overlapping user key (without timestamp) range are not allowed to be ingested. For similar reasons, it's hard to ensure above invariant is met. For example, if we have two files where user keys are interleaved like this:
file1: [c10, c8, f10, f5]
file2: [b5, c11, f4]
Either file1 gets a bigger global seqno than file2, or the other way around, above invariant cannot be met.
So we disallow this.

2) When a column family enables user-defined timestamps, it doesn't support ingestion behind mode. Ingestion behind currently simply puts the file at the bottommost level, and assign a global seqno 0 to the file. We need to do similar search though the LSM tree for key range overlap checks to make sure aformentioned invariant is met. So this initial support disallow this mode. We can consider adding it in the future.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12343

Test Plan: Add unit tests

Reviewed By: cbi42

Differential Revision: D53686182

Pulled By: jowlyzhang

fbshipit-source-id: f05e3fb27967f7974ed40179d78634c40ecfb136
2024-02-13 11:15:28 -08:00
马越 45668a05f5 add unit test for compactRangeWithNullBoundaries java api (#12333)
Summary:
The purpose of this PR is to supplement a set of unit tests for https://github.com/facebook/rocksdb/pull/12328

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12333

Reviewed By: ltamasi

Differential Revision: D53553830

Pulled By: cbi42

fbshipit-source-id: d21490f7ce7b30f42807ee37eda455ca6abdd072
2024-02-13 10:48:31 -08:00
Levi Tamasi de1e3ff6ea Fix a data race in DBImpl::RenameTempFileToOptionsFile (#12347)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12347

`DBImpl::disable_delete_obsolete_files_` should only be accessed while holding the DB mutex to prevent data races. There's a piece of logic in `DBImpl::RenameTempFileToOptionsFile` where this synchronization was previously missing. The patch fixes this issue similarly to how it's handled in `DisableFileDeletions` and `EnableFileDeletions`, that is, by saving the counter value while holding the mutex and then performing the actual file deletion outside the critical section. Note: this PR only fixes the race itself; as a followup, we can also look into cleaning up and optimizing the file deletion logic (which is currently inefficient on multiple different levels).

Reviewed By: jowlyzhang

Differential Revision: D53675153

fbshipit-source-id: 5358e894ee6829d3edfadac50a93d97f8819e481
2024-02-12 13:26:09 -08:00
Yaroslav Stepanchuk 395d24f0fa Fix build on alpine 3.19 (#12345)
Summary:
Add missing include of the cstdint header.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12345

Reviewed By: ltamasi

Differential Revision: D53672261

Pulled By: cbi42

fbshipit-source-id: 758944c0b51b9701a129e7b88f692103bbce11d3
2024-02-12 11:24:56 -08:00
chuhao zeng daf06f1361 Continue format script when changes detected by clang-format-diff.py (#12329)
Summary:
The original [clang-format-diff.py script](https://raw.githubusercontent.com/llvm/llvm-project/main/clang/tools/clang-format/clang-format-diff.py), referenced in format.sh, exits with a status of 1 at the end after writing diffs to stderr.  Consequently, the format.sh script terminates after initializing the 'diffs' variable.

Implemented additional logic in format-diff.sh to ensure continuous execution, even when changes are detected and further formatting is required.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12329

Reviewed By: pdillinger

Differential Revision: D53483185

Pulled By: cbi42

fbshipit-source-id: b7adff26f129220941258fd6ee83d053fa12b077
2024-02-09 16:46:38 -08:00
Changyu Bi b46f5707c4 Fix unexpected keyword argument 'print_as_stderr' in crash test (#12339)
Summary:
Fix crash test failure like https://github.com/facebook/rocksdb/actions/runs/7821514511/job/21338625372#step:5:530

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12339

Test Plan: CI

Reviewed By: jaykorean

Differential Revision: D53545053

Pulled By: cbi42

fbshipit-source-id: b466a8dc9c0ded0377e8677937199c6f959f96ef
2024-02-07 15:44:17 -08:00
Peter Dillinger 58d55b7f4e Mark wal_compression feature as production-ready (#12336)
Summary:
(as title)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12336

Test Plan: in use at Meta for a large service; in crash test

Reviewed By: hx235

Differential Revision: D53537628

Pulled By: pdillinger

fbshipit-source-id: 69e7ac9ab7b59b928d1144105667a7fde8a55a5a
2024-02-07 15:06:04 -08:00
Changyu Bi 42a8e583c9 Print zstd warning to stdout in stress test (#12338)
Summary:
so the stress test does not fail.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12338

Reviewed By: jaykorean

Differential Revision: D53542941

Pulled By: cbi42

fbshipit-source-id: 83b2eb3cb5cc4c5a268da386c22c4aadeb039a74
2024-02-07 14:17:51 -08:00
Akanksha Mahajan 9a2d7485f0 Print stderr in crash test script and exit on stderr (#12335)
Summary:
Some of the errors like data race and heap-after-use are error out based on crash test reporting them as error by relying on stderr. So reverting back to original form unless we come up with a more reliable solution to error out.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12335

Reviewed By: cbi42

Differential Revision: D53534781

Pulled By: akankshamahajan15

fbshipit-source-id: b19aa560d1560ac2281f7bc04e13961ed751f178
2024-02-07 12:34:40 -08:00
Peter Dillinger 54cb9c77d9 Prefer static_cast in place of most reinterpret_cast (#12308)
Summary:
The following are risks associated with pointer-to-pointer reinterpret_cast:
* Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do.
* Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally.

I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement:
* Going through `void*` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have
`struct Derived: public Base1, public Base2`.  If we have `Derived*` -> `void*` -> `Base2*` -> `Derived*` through reinterpret casts, this could plausibly work (though technical UB) assuming the `Base2*` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic.
* Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived*` -> `Base2*` -> `Derived*` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance.

With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain.

A couple of related interventions included here:
* Previously Cache::Handle was not actually derived from in the implementations and just used as a `void*` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle.
* Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse).

Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work.

I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308

Test Plan: existing tests, CI

Reviewed By: ltamasi

Differential Revision: D53204947

Pulled By: pdillinger

fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2
2024-02-07 10:44:11 -08:00
Yu Zhang e3e8fbb497 Add a separate range classes for internal usage (#12071)
Summary:
Introduce some different range classes `UserKeyRange` and `UserKeyRangePtr` to be used by internal implementation. The `Range` class is used in both public APIs like `DB::GetApproximateSizes`, `DB::GetApproximateMemTableStats`, `DB::GetPropertiesOfTablesInRange` etc and internal implementations like `ColumnFamilyData::RangesOverlapWithMemtables`, `VersionSet::GetPropertiesOfTablesInRange`.

These APIs have different expectations of what keys this range class contain.  Public API users are supposed to populate the range with the user keys without timestamp, in the same way that point lookup and range scan APIs' key input only expect the user key without timestamp. The internal APIs implementation expect a user key whose format is compatible with the user comparator, a.k.a a user key with the timestamp.

This PR contains:
1) introducing counterpart range class `UserKeyRange` `UserKeyRangePtr` for internal implementation while leave the existing `Range` and `RangePtr` class only for public APIs. Internal implementations are updated to use this new class instead.
2) add user-defined timestamp support for `DB::GetPropertiesOfTablesInRange` API and `DeleteFilesInRanges` API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12071

Test Plan:
existing tests
Added test for `DB::GetPropertiesOfTablesInRange` and `DeleteFilesInRanges` APIs for when user-defined timestamp is enabled.
The change in external_file_ingestion_job doesn't have a user-defined timestamp enabled test case coverage, will add one in a follow up PR that adds file ingestion support for UDT.

Reviewed By: ltamasi

Differential Revision: D53292608

Pulled By: jowlyzhang

fbshipit-source-id: 9a9279e23c640a6d8f8232636501a95aef7638b8
2024-02-06 18:35:36 -08:00
Jay Huh 0088f77788 Multiget LDB Followup (#12332)
Summary:
# Summary

Following up jowlyzhang 's comment in https://github.com/facebook/rocksdb/issues/12283 .
- Remove `ARG_TTL` from help which is not relevant to `multi_get` command
- Treat NotFound status as non-error case for both `Get` and `MultiGet` and updated the unit test, `ldb_test.py`
- Print key along with value in `multi_get` command

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12332

Test Plan:
**Unit Test**
```
$>python3 tools/ldb_test.py
...
Ran 25 tests in 17.447s

OK
```

**Manual Run**

```
$> ./ldb --db=/data/users/jewoongh/rocksdb_test/T173992396/rocksdb_crashtest_blackbox --hex multi_get 0x0000000000000009000000000000012B00000000000000D8 0x0000000000000009000000000000002678787878BEEF
0x0000000000000009000000000000012B00000000000000D8 ==> 0x47000000434241404F4E4D4C4B4A494857565554535251505F5E5D5C5B5A595867666564636261606F6E6D6C6B6A696877767574737271707F7E7D7C7B7A797807060504030201000F0E0D0C0B0A090817161514131211101F1E1D1C1B1A1918
Key not found: 0x0000000000000009000000000000002678787878BEEF
```

```
$> ./ldb --db=/data/users/jewoongh/rocksdb_test/T173992396/rocksdb_crashtest_blackbox --hex get 0x00000000000000090000000000
Key not found
```

Reviewed By: jowlyzhang

Differential Revision: D53450164

Pulled By: jaykorean

fbshipit-source-id: 9ccec78ad3695e65b1ed0c147c7cbac502a1bd48
2024-02-05 20:11:35 -08:00
Hui Xiao 1a885fe730 Remove deprecated Options::access_hint_on_compaction_start (#11654)
Summary:
**Context:**
`Options::access_hint_on_compaction_start ` is marked deprecated and now ready to be removed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11654

Test Plan:
Multiple db_stress runs with pre-PR and post-PR binary randomly to ensure forward/backward compatibility on options 36a5686ec0?fbclid=IwAR2IcdAUdTvw9O9V5GkHEYJRGMVR9p7Ei-LMa-9qiXlj3z80DxjkxlGnP1E
`python3 tools/db_crashtest.py --simple blackbox --interval=30`

Reviewed By: cbi42

Differential Revision: D47892459

Pulled By: hx235

fbshipit-source-id: a62f46a0377fe143be7638e218978d5431c15c56
2024-02-05 13:35:19 -08:00
马越 3a287796e3 Fix the problem that wrong Key may be passed when using CompactRange JAVA API (#12328)
Summary:
When using the Rocksdb Java API.

When we use Java code to call `db.compactRange (columnFamilyHandle, start, null)` which means we hope to perform range compaction on keys bigger than **start**.
we expected call to the corresponding C++ code : `db->compactRange (columnFamilyHandle, &start, nullptr)`
But in reality, what is being called is
`db ->compactRange (columnFamilyHandle,start,"")`

The problem here is the `null` in Java are not converted to `nullptr`, but rather to `""`, which may result in some unexpected results

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12328

Reviewed By: jowlyzhang

Differential Revision: D53432749

Pulled By: cbi42

fbshipit-source-id: eeadd19d05667230568668946d2ef1d5b2568268
2024-02-05 11:05:57 -08:00
Peter Dillinger 6e88126dd3 Don't log an error when an auxiliary dir is missing (#12326)
Summary:
info_log gets an error logged when wal_dir or a db_path/cf_path is missing. Under this condition, the directory is created later (in DBImpl::Recover -> Directories::SetDirectories) with no error status returned.

To avoid error spam in logs, change these to a descriptive "header" log entry.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12326

Test Plan: manual with DBBasicTest.DBCloseAllDirectoryFDs which exercises this code

Reviewed By: jowlyzhang

Differential Revision: D53374743

Pulled By: pdillinger

fbshipit-source-id: 32d1ce18809da13a25bdd6183d661f66a3b6a111
2024-02-05 10:26:41 -08:00
Yu Zhang 4eaa771c01 Refactor external sst file ingestion job (#12305)
Summary:
Updates some documentations and invariant assertions after https://github.com/facebook/rocksdb/issues/12257 and https://github.com/facebook/rocksdb/issues/12284. Also refactored some duplicate code and improved some error message and preconditions for errors.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12305

Test Plan: Existing unit tests

Reviewed By: hx235

Differential Revision: D53371325

Pulled By: jowlyzhang

fbshipit-source-id: fb0edcb3a3602cdf0a292ef437cfdfe897fc6c99
2024-02-02 18:07:57 -08:00
Changyu Bi 5620efc794 Remove deprecated option `ignore_max_compaction_bytes_for_input` (#12323)
Summary:
The option is introduced in https://github.com/facebook/rocksdb/issues/10835 to allow disabling the new compaction behavior if it's not safe. The option is enabled by default and there has not been a need to disable it. So it should be safe to remove now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12323

Reviewed By: ajkr

Differential Revision: D53330336

Pulled By: cbi42

fbshipit-source-id: 36eef4664ac96b3a7ed627c48bd6610b0a7eafc5
2024-02-02 17:09:42 -08:00
Changyu Bi ace1721b28 Remove deprecated option `level_compaction_dynamic_file_size` (#12325)
Summary:
The option is introduced in https://github.com/facebook/rocksdb/issues/10655 to allow reverting to old behavior. The option is enabled by default and there has not been a need to disable it. Remove it for 9.0 release. Also fixed and improved a few unit tests that depended on setting this option to false.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12325

Test Plan: existing tests.

Reviewed By: hx235

Differential Revision: D53369430

Pulled By: cbi42

fbshipit-source-id: 0ec2440ca8d88db7f7211c581542c7581bd4d3de
2024-02-02 15:37:40 -08:00
Peter Dillinger 1d6dbfb8b7 Rename IntTblPropCollector -> InternalTblPropColl (#12320)
Summary:
I've always found this name difficult to read, because it sounds like it's for collecting int(eger)
table properties.

I'm fixing this now to set up for a change that I have stubbed out in the public API (table_properties.h):
a new adapter function `TablePropertiesCollector::AsInternal()` that allows RocksDB-provided
TablePropertiesCollectors (such as CompactOnDeletionCollector) to implement the easier-to-upgrade
internal interface while still (superficially) implementing the public interface. In addition to added flexibility,
this should be a performance improvement as the adapter class UserKeyTablePropertiesCollector can be
avoided for such cases where a RocksDB-provided collector is used (AsInternal() returns non-nullptr).

table_properties.h is the only file with changes that aren't simple find-replace renaming.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12320

Test Plan: existing tests, CI

Reviewed By: ajkr

Differential Revision: D53336945

Pulled By: pdillinger

fbshipit-source-id: 02535bcb30bbfb00e29e8478af62e5dad50a63b8
2024-02-02 14:14:43 -08:00
anand76 95b41eec6d Fix potential incorrect result for duplicate key in MultiGet (#12295)
Summary:
The RocksDB correctness testing has recently discovered a possible, but very unlikely, correctness issue with MultiGet. The issue happens when all of the below conditions are met -
1. Duplicate keys in a MultiGet batch
2. Key matches the last key in a non-zero, non-bottommost level file
3. Final value is not in the file (merge operand, not snapshot visible etc)
4. Multiple entries exist for the key in the file spanning more than 1 data block. This can happen due to snapshots, which would force multiple versions of the key in the file, and they may spill over to another data block
5. Lookup attempt in the SST for the first of the duplicates fails with IO error on a data block (NOT the first data block, but the second or subsequent uncached block), but no errors for the other duplicates
6. Value or merge operand for the key is present in the very next level

The problem is, in FilePickerMultiGet, when looking up keys in a level we use FileIndexer and the overlapping file in the current level to determine the search bounds for that key in the file list in the next level. If the next level is empty, the search bounds are reset and we do a full binary search in the next non-empty level's LevelFilesBrief. However, under the  conditions https://github.com/facebook/rocksdb/issues/1 and https://github.com/facebook/rocksdb/issues/2 listed above, only the first of the duplicates has its next-level search bounds updated, and the remaining duplicates are skipped.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12295

Test Plan: Add unit tests that fail an assertion or return wrong result without the fix

Reviewed By: hx235

Differential Revision: D53187634

Pulled By: anand1976

fbshipit-source-id: a5eadf4fede9bbdec784cd993b15e3341436d1ea
2024-02-02 11:48:35 -08:00
Sanket Sanjeev Karnik 046ac91a7f Mark destructors as overridden (#12324)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12324

We are trying to use rocksdb inside Hedwig. This is causing some builds to fail D53033764. Hence fixing -Wsuggest-destructor-override warning.

Reviewed By: jowlyzhang

Differential Revision: D53328538

fbshipit-source-id: d5b9865442de049b18f9ed086df5fa58fb8880d5
2024-02-01 19:09:25 -08:00
Changyu Bi c6b1f6d182 Augment sst_dump tool to verify num_entries in table property (#12322)
Summary:
sst_dump --command=check can now compare number of keys in a file with num_entries in table property and reports corruption is there is a mismatch.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12322

Test Plan:
- new unit test for API `SstFileDumper::ReadSequential`
- ran sst_dump on a good and a bad file:
```
sst_dump --file=./32316112.sst
options.env is 0x7f68bfcb5000
Process ./32316112.sst
Sst file format: block-based
from [] to []

sst_dump --file=./32316115.sst
options.env is 0x7f6d0d2b5000
Process ./32316115.sst
Sst file format: block-based
from [] to []
./32316115.sst: Corruption: Table property has num_entries = 6050408 but scanning the table returns 6050406 records.
```

Reviewed By: jowlyzhang

Differential Revision: D53320481

Pulled By: cbi42

fbshipit-source-id: d84c996346a9575a5a2ea5f5fb09a9d3ee672cd6
2024-02-01 14:35:03 -08:00