Commit Graph

12522 Commits

Author SHA1 Message Date
Radek Hubner 583fded565 Fix regression for Javadoc jar build (#12404)
Summary:
https://github.com/facebook/rocksdb/issues/12371 introduced a regression by not defining the dependency between the `create_javadoc` and `rocksdb_javadocs_jar` build targets.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12404

Reviewed By: pdillinger

Differential Revision: D54516862

Pulled By: ajkr

fbshipit-source-id: 785a99b2caf979395ae0de60e40e7d1b93059adb
2024-03-06 10:33:17 -08:00
Peter Dillinger a53ed91691 Fix/improve temperature handling for file ingestion (#12402)
Summary:
Partly following up on leftovers from https://github.com/facebook/rocksdb/issues/12388

In terms of public API:
* Make it clear that IngestExternalFileArg::file_temperature is just a hint for opening the existing file, though it was previously used for both copy-from temp hint and copy-to temp, which was bizarre.
* Specify how IngestExternalFile assigns temperature to file ingested into DB. (See details in comments.) This approach is not perfect in terms of matching how the DB assigns temperatures, but was the simplest way to get close. The key complication for matching DB temperature assignments is that ingestion files are copied (to a destination temp) before their target level is determined (in general).
* Add a temperature option to SstFileWriter::Open so that files intended for ingestion can be initially written to a chosen temperature.
* Note that "fail_if_not_bottommost_level" is an obsolete/confusing use of "bottommost".

In terms of the implementation, there was a similar bit of oddness with the internal CopyFile API, which only took one temperature, ambiguously applicable to the source, destination, or both. This is also fixed.

Eventual suggested follow-up:
* Before copying files for ingestion, determine a tentative level assignment to use for destination temperature, and keep that even if final level assignment happens to be different at commit time (rare).
* More temperature handling for CreateColumnFamilyWithImport and Checkpoints.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12402

Test Plan:
Deeply revamped ExternalSSTFileBasicTest.IngestWithTemperature to test the new changes. Previously this test was insufficient because it was only looking at temperatures according to the DB manifest. Incorporating FileTemperatureTestFS allows us to also test the temperatures in the storage layer.

Used macros instead of functions for better tracing to critical source location on test failures.

Some enhancements to FileTemperatureTestFS in the process of developing the revamped test.

Reviewed By: jowlyzhang

Differential Revision: D54442794

Pulled By: pdillinger

fbshipit-source-id: 41d9d0afdc073e6a983304c10bbc07c70cc7e995
2024-03-05 16:56:08 -08:00
Jay Huh 3412195367 Introduce MultiCfIterator (#12153)
Summary:
This PR introduces a new implementation of `Iterator` via a new public API called `NewMultiCfIterator()`. The new API takes a vector of column family handles to build a cross-column-family iterator, which internally maintains multiple `DBIter`s as child iterators from a consistent database state. When a key exists in multiple column families, the iterator selects the value (and wide columns) from the first column family containing the key, following the order provided in the `column_families` parameter. Similar to the merging iterator, a min heap is used to iterate across the child iterators. Backward iteration and direction change functionalities will be implemented in future PRs.

The comparator used to compare keys across different column families will be derived from the iterator of the first column family specified in `column_families`. This comparator will be checked against the comparators from all other column families that the iterator will traverse. If there's a mismatch with any of the comparators, the initialization of the iterator will fail.

Please note that this PR is not enough for users to start using `MultiCfIterator`. The `MultiCfIterator` and related APIs are still marked as "**DO NOT USE - UNDER CONSTRUCTION**". This PR is just the first of many PRs that will follow soon.

This PR includes the following:
- Introduction and partial implementation of the `MultiCfIterator`, which implements the generic `Iterator` interface. The implementation includes the construction of the iterator, `SeekToFirst()`, `Next()`, `Valid()`, `key()`, `value()`, and `columns()`.
- Unit tests to verify iteration across multiple column families in two distinct scenarios: (1) keys are unique across all column families, and (2) the same keys exist in multiple column families.
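
The heap-based selection described above can be illustrated with a minimal, standalone sketch. This is not the `MultiCfIterator` implementation: `SortedCf`, `Cursor`, and `MergeColumnFamilies` are hypothetical names, each column family is modeled as a sorted vector of key/value pairs, and plain string comparison stands in for the column family comparator.

```cpp
#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one column family: key/value pairs sorted by key.
using SortedCf = std::vector<std::pair<std::string, std::string>>;

// Merges several sorted column families in key order. When a key exists in
// more than one family, the value from the family listed first is kept.
std::vector<std::pair<std::string, std::string>> MergeColumnFamilies(
    const std::vector<SortedCf>& cfs) {
  struct Cursor {
    std::size_t cf;   // index of the family in the caller-provided order
    std::size_t pos;  // current offset within that family
  };
  // Ordering for a min-heap: smallest key on top, ties broken by the
  // lower family index so the first listed family wins.
  auto later = [&cfs](const Cursor& a, const Cursor& b) {
    const std::string& ka = cfs[a.cf][a.pos].first;
    const std::string& kb = cfs[b.cf][b.pos].first;
    if (ka != kb) return ka > kb;
    return a.cf > b.cf;
  };
  std::priority_queue<Cursor, std::vector<Cursor>, decltype(later)> heap(later);
  for (std::size_t i = 0; i < cfs.size(); ++i) {
    if (!cfs[i].empty()) heap.push(Cursor{i, 0});
  }

  std::vector<std::pair<std::string, std::string>> out;
  while (!heap.empty()) {
    Cursor top = heap.top();
    heap.pop();
    const auto& entry = cfs[top.cf][top.pos];
    // Skip a key that was already emitted from an earlier family.
    if (out.empty() || out.back().first != entry.first) out.push_back(entry);
    // Advance this child and put it back on the heap if it has more entries.
    if (top.pos + 1 < cfs[top.cf].size()) heap.push(Cursor{top.cf, top.pos + 1});
  }
  return out;
}
```

Only forward iteration is modeled here, matching the scope described in this PR.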

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12153

Reviewed By: pdillinger

Differential Revision: D52308697

Pulled By: jaykorean

fbshipit-source-id: b03e69f13b40af5a8f0598d0f43a0bec01ef8294
2024-03-05 10:22:43 -08:00
jsteemann 3fff57fa6a fix linking without thread status support (#12400)
Summary:
When compiling with `-DNROCKSDB_THREAD_STATUS`, some functions in ThreadStatusUtil are declared but their definition is missing. Their definitions are only compiled when not defining `NROCKSDB_THREAD_STATUS`. This causes problems on linking, when the linker cannot find the definitions of

- ThreadStatusUtil::GetThreadOperation
- ThreadStatusUtil::SetEnableTracking

This PR fixes it by adding stubs for these functions in case `NROCKSDB_THREAD_STATUS` is defined.
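
As a rough illustration of the pattern (not the actual RocksDB code), the sketch below declares two functions unconditionally and provides either full definitions or stubs depending on a build macro. `DEMO_NO_THREAD_STATUS`, `DemoThreadStatus`, and the returned strings are hypothetical and exist only for this example.

```cpp
#include <string>

// Hypothetical macro standing in for a "feature compiled out" build flag.
#define DEMO_NO_THREAD_STATUS 1

class DemoThreadStatus {
 public:
  // Declared unconditionally, so callers compile in every configuration.
  static std::string GetThreadOperation();
  static void SetEnableTracking(bool enabled);
};

#ifndef DEMO_NO_THREAD_STATUS
// Full definitions, compiled only when the feature is enabled.
namespace {
bool g_tracking = false;
}
std::string DemoThreadStatus::GetThreadOperation() {
  return g_tracking ? "tracked-op" : "untracked";
}
void DemoThreadStatus::SetEnableTracking(bool enabled) { g_tracking = enabled; }
#else
// Stubs: without them, any caller would fail at link time with an
// undefined reference to the declared functions.
std::string DemoThreadStatus::GetThreadOperation() { return "unknown"; }
void DemoThreadStatus::SetEnableTracking(bool /*enabled*/) {}
#endif
```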

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12400

Reviewed By: ajkr

Differential Revision: D54510769

Pulled By: cbi42

fbshipit-source-id: e79e9257492d3dba59615e9e306df7e79838d73b
2024-03-04 17:39:03 -08:00
yuzhangyu@fb.com 1cfdece85d Run internal cpp modernizer on RocksDB repo (#12398)
Summary:
When the internal cpp modernizer attempts to format RocksDB code, it replaces the macro `ROCKSDB_NAMESPACE` with its default definition `rocksdb` when collapsing nested namespaces. We filed feedback for the tool (T180254030), and the team filed a bug for this: https://github.com/llvm/llvm-project/issues/83452. In the meantime, they suggested that we run the modernizer tool ourselves so that future auto codemod attempts will be smaller. This diff contains:

Running
`xplat/scripts/codemod_service/cpp_modernizer.sh`
in fbcode/internal_repo_rocksdb/repo (excluding some directories in utilities/transactions/lock/range/range_tree/lib that have a non-Meta copyright comment)
without swapping out the namespace macro `ROCKSDB_NAMESPACE`

Followed by RocksDB's own
`make format`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12398

Test Plan: Auto tests

Reviewed By: hx235

Differential Revision: D54382532

Pulled By: jowlyzhang

fbshipit-source-id: e7d5b40f9b113b60e5a503558c181f080b9d02fa
2024-03-04 10:08:32 -08:00
Richard Barnes d7b8756976 Remove extra semi colon from internal_repo_rocksdb/repo/db/table_cache_sync_and_async.h
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: palmje

Differential Revision: D54362208

fbshipit-source-id: a47acd4c794c899fccb65285b116b50d9566ea12
2024-03-04 06:34:44 -08:00
Richard Barnes ced333ee45 Remove extra semi colon from instagram/ranking/mezql/shots/parser/fast/Token.cpp
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: palmje

Differential Revision: D54362213

fbshipit-source-id: 0bbc9e5fce917fc4f72423f0a4c8cb2c2b1759dd
2024-03-04 06:32:50 -08:00
jsteemann 965364972d fix compile warning (#12399)
Summary:
Fix compile warning
```
monitoring/thread_status_util.cc: In static member function ‘static void rocksdb::ThreadStatusUtil::NewColumnFamilyInfo(const rocksdb::DB*, const rocksdb::ColumnFamilyData*, const std::string&, const rocksdb::Env*)’: monitoring/thread_status_util.cc:193:55: warning: unused parameter ‘env’ [-Wunused-parameter]
  193 |                                            const Env* env) {}
      |                                            ~~~~~~~~~~~^~~
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12399

Reviewed By: jaykorean

Differential Revision: D54424333

Pulled By: cbi42

fbshipit-source-id: 3dcb89f85d3a63b1b0d0d6a8b277f49ce03b6d1a
2024-03-01 11:25:16 -08:00
Jay Huh c00c16855d Access DBImpl* and CFD* by CFHImpl* in Iterators (#12395)
Summary:
In the current implementation of iterators, `DBImpl*` and `ColumnFamilyData*` are held in `DBIter` and `ArenaWrappedDBIter` for two purposes: tracing and the Refresh() API. With the introduction of a new iterator called MultiCfIterator in PR https://github.com/facebook/rocksdb/issues/12153, which is a cross-column-family iterator that maintains multiple DBIters as child iterators from a consistent database state, we need to make some changes to the existing implementation. The new iterator will still be exposed through the generic Iterator interface, with an additional capability to return AttributeGroups (via `attribute_groups()`), which is a list of wide columns grouped by column family. For more information about AttributeGroup, please refer to previous PRs: https://github.com/facebook/rocksdb/issues/11925, #11943, and https://github.com/facebook/rocksdb/issues/11977.

To be able to return AttributeGroup in the default single CF iterator created, access to `ColumnFamilyHandle*` within `DBIter` is necessary. However, this is not currently available in `DBIter`. Since `DBImpl*` and `ColumnFamilyData*` can be easily accessed via `ColumnFamilyHandleImpl*`, we have decided to replace the pointers to `ColumnFamilyData` and `DBImpl` in `DBIter` with a pointer to `ColumnFamilyHandleImpl`.
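
A minimal sketch of this refactoring idea, using simplified stand-in types rather than the real RocksDB classes: `DemoDb`, `DemoCfData`, `DemoCfHandle`, and `DemoIter` are hypothetical. The iterator keeps one handle pointer and derives the other two pointers from it.

```cpp
#include <string>

// Hypothetical, simplified stand-ins; these are not the RocksDB classes.
struct DemoDb {
  std::string name;
};
struct DemoCfData {
  int id;
};

// The handle already knows both the database and the column family data.
struct DemoCfHandle {
  DemoDb* db;
  DemoCfData* cfd;
};

class DemoIter {
 public:
  explicit DemoIter(DemoCfHandle* handle) : handle_(handle) {}

  // Previously separate members; now both are reached through the handle.
  DemoDb* db() const { return handle_->db; }
  DemoCfData* cfd() const { return handle_->cfd; }

  // The handle itself is available for features that need it.
  DemoCfHandle* handle() const { return handle_; }

 private:
  DemoCfHandle* handle_;  // replaces separate db and cfd pointer members
};
```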

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12395

Test Plan:
There should be no behavior changes. Existing tests and CI cover correctness.

**Test for Perf Regression**
Build
```
$> make -j64 release
```
Setup
```
$> TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=1000000 -compression_type=none
```
Run
```
TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="newiterator,seekrandom" -cache_size=10485760000
```

Before the change
```
DB path: [/dev/shm/db_bench/dbbench]
newiterator  :       0.552 micros/op 1810157 ops/sec 0.552 seconds 1000000 operations;
DB path: [/dev/shm/db_bench/dbbench]
seekrandom   :       4.502 micros/op 222143 ops/sec 4.502 seconds 1000000 operations; (0 of 1000000 found)
```
After the change
```
DB path: [/dev/shm/db_bench/dbbench]
newiterator  :       0.520 micros/op 1924401 ops/sec 0.520 seconds 1000000 operations;
DB path: [/dev/shm/db_bench/dbbench]
seekrandom   :       4.532 micros/op 220657 ops/sec 4.532 seconds 1000000 operations; (0 of 1000000 found)
```

Reviewed By: pdillinger

Differential Revision: D54332713

Pulled By: jaykorean

fbshipit-source-id: b28d897ad519e58b1ca82eb068a6319544a4fae5
2024-03-01 10:28:20 -08:00
Jay Huh 5bcc184975 Update APIs to support generic unique identifier format (#12384)
Summary:
The current design proposes using a combination of `job_id`, `db_id`, and `db_session_id` to create a unique identifier for remote compaction jobs. However, this approach may not be suitable for users who prefer a different format for the unique identifier.

At Meta, we are utilizing generic compute offload to offload compaction tasks to remote workers. The compute offload client generates a UUID for each task, which requires an update to the current RocksDB API for onboarding purposes.

Users still have the option to create the unique identifier by combining `job_id`, `db_id`, and `db_session_id` if they prefer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12384

Test Plan:
```
$> ./compaction_service_test
[==========] Running 14 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 14 tests from CompactionServiceTest
[ RUN      ] CompactionServiceTest.BasicCompactions
[       OK ] CompactionServiceTest.BasicCompactions (2642 ms)
[ RUN      ] CompactionServiceTest.ManualCompaction
[       OK ] CompactionServiceTest.ManualCompaction (454 ms)
[ RUN      ] CompactionServiceTest.CancelCompactionOnRemoteSide
[       OK ] CompactionServiceTest.CancelCompactionOnRemoteSide (1643 ms)
[ RUN      ] CompactionServiceTest.FailedToStart
[       OK ] CompactionServiceTest.FailedToStart (1332 ms)
[ RUN      ] CompactionServiceTest.InvalidResult
[       OK ] CompactionServiceTest.InvalidResult (1516 ms)
[ RUN      ] CompactionServiceTest.SubCompaction
[       OK ] CompactionServiceTest.SubCompaction (551 ms)
[ RUN      ] CompactionServiceTest.CompactionFilter
[       OK ] CompactionServiceTest.CompactionFilter (563 ms)
[ RUN      ] CompactionServiceTest.Snapshot
[       OK ] CompactionServiceTest.Snapshot (124 ms)
[ RUN      ] CompactionServiceTest.ConcurrentCompaction
[       OK ] CompactionServiceTest.ConcurrentCompaction (660 ms)
[ RUN      ] CompactionServiceTest.CompactionInfo
[       OK ] CompactionServiceTest.CompactionInfo (984 ms)
[ RUN      ] CompactionServiceTest.FallbackLocalAuto
[       OK ] CompactionServiceTest.FallbackLocalAuto (343 ms)
[ RUN      ] CompactionServiceTest.FallbackLocalManual
[       OK ] CompactionServiceTest.FallbackLocalManual (380 ms)
[ RUN      ] CompactionServiceTest.RemoteEventListener
[       OK ] CompactionServiceTest.RemoteEventListener (491 ms)
[ RUN      ] CompactionServiceTest.TablePropertiesCollector
[       OK ] CompactionServiceTest.TablePropertiesCollector (169 ms)
[----------] 14 tests from CompactionServiceTest (11854 ms total)

[----------] Global test environment tear-down
[==========] 14 tests from 1 test case ran. (11855 ms total)
[  PASSED  ] 14 tests.
```

Reviewed By: hx235

Differential Revision: D54220339

Pulled By: jaykorean

fbshipit-source-id: 5a9054f31933d1996adca02082eb37b6d5353224
2024-03-01 09:55:30 -08:00
Changyu Bi 4aed229fa7 Add `write_memtable_time` to perf level `kEnableWait` (#12394)
Summary:
.. so write time can be measured under the new perf level for single-threaded writes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12394

Test Plan: * add a new UT `PerfContextTest.WriteMemtableTimePerfLevel`

Reviewed By: anand1976

Differential Revision: D54326263

Pulled By: cbi42

fbshipit-source-id: d0e334d9581851ba6cf53c776c0bd876365d1e00
2024-02-29 15:08:26 -08:00
Peter Dillinger 13ef21c22e default_write_temperature option (#12388)
Summary:
Currently SST files that aren't applicable to last_level_temperature nor file_temperature_age_thresholds are written with temperature kUnknown, which is a little weird and doesn't support CF-based tiering. The default_temperature option only affects how kUnknown is interpreted for stats.

This change adds a new per-CF option default_write_temperature that determines the temperature of new SST files when those other options do not apply.

Also made a change to ignore last_level_temperature with FIFO compaction, because I found that could lead to an infinite loop in compaction.

Needed follow-up: Fix temperature handling with external file ingestion

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12388

Test Plan: unit tests extended appropriately. (Ignore whitespace changes when reviewing.)

Reviewed By: jowlyzhang

Differential Revision: D54266574

Pulled By: pdillinger

fbshipit-source-id: c9ec9a74dbf22be6e986f77f9689d05fea8ef0bb
2024-02-28 14:36:13 -08:00
Adam Retter 5458eda5f0 Pass build parallelism flag to Docker builds (#12392)
Summary:
Passed the `-j` flag through to builds happening inside Docker containers.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12392

Reviewed By: cbi42

Differential Revision: D54311937

Pulled By: ajkr

fbshipit-source-id: 5cf1bfe4b9059cc2d078fb5331812f32cf9e89ab
2024-02-28 12:51:00 -08:00
Greg Sadetsky eab876bb49 fix out of date macos instructions in INSTALL.md (#12393)
Summary:
closes https://github.com/facebook/rocksdb/issues/12349

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12393

Reviewed By: cbi42

Differential Revision: D54311983

Pulled By: ajkr

fbshipit-source-id: 3109ad80bdd5d656756364d3d2a60dd15c339fcc
2024-02-28 12:38:15 -08:00
Adam Retter 99cc36be9b Correct CMake Javadoc and source jar builds (#12371)
Summary:
Fix some issues introduced in https://github.com/facebook/rocksdb/pull/12199 (CC rhubner)
1. Previous `jar -v -c -f` was not valid command syntax.
2. Javadoc and source Jar files were prefixed `rocksdb-`, now corrected to `rocksdbjni-`

pdillinger This needs to be merged to `main` and also `8.11.fb` (to fix the Windows build for the RocksJava release of 8.11.2) please.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12371

Reviewed By: pdillinger, jowlyzhang

Differential Revision: D54136834

Pulled By: hx235

fbshipit-source-id: f356f2401042af359ada607e5f0be627418ccd6c
2024-02-27 15:46:12 -08:00
奏之章 1fa5dff7d1 WriteThread::EnterAsBatchGroupLeader reorder writers (#12138)
Summary:
Reorder the writers list so that a leader can take as many commits as possible, maximizing the throughput of the system and reducing IOPS.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12138

Reviewed By: hx235

Differential Revision: D53955592

Pulled By: ajkr

fbshipit-source-id: 4d899d038faef691b63801d9d85f5cc079b7bbb5
2024-02-27 15:23:54 -08:00
zaidoon 3104e55f29 update DB::DumpSupportInfo to log whether jemalloc is supported or not (#12386)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12386

Reviewed By: cbi42

Differential Revision: D54231896

Pulled By: ajkr

fbshipit-source-id: 6b3357b2e97d3599955e303810088bb5d5896199
2024-02-27 15:07:00 -08:00
Peter Dillinger d780e7a561 Remove `bottommost_temperature` (#12389)
Summary:
deprecated option already replaced by `last_level_temperature`. (Keeping recognition of the option in old options files.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12389

Test Plan: tests updated

Reviewed By: jowlyzhang, cbi42

Differential Revision: D54267946

Pulled By: pdillinger

fbshipit-source-id: 65c49b15e7394829c1f3b44edd4179d2daff6017
2024-02-27 14:48:00 -08:00
Andrew Kryczka a43481b3d0 Decouple `RateLimiter` burst size and refill period (#12379)
Summary:
When the rate limiter does not have any waiting requests, the first request to arrive may consume all of the available bandwidth, despite potentially having lower priority than requests that arrive later in the same refill interval. Then, those higher priority requests must wait for a refill. So even in scenarios in which we have an overall bandwidth surplus, the highest priority requests can be sporadically delayed up to a whole refill period.

Alone, this isn't necessarily problematic as the refill period is configurable via `refill_period_us` and can be tuned down as needed until the max sporadic delay is tolerable. However, tuning down `refill_period_us` had a side effect of reducing burst size. Some users require a certain burst size to issue optimal I/O sizes to the underlying storage system.

To satisfy those users, this PR decouples the refill period from the burst size. That way, the max sporadic delay can be limited without impacting I/O sizes issued to the underlying storage system.
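
To make the decoupling concrete, here is a small token-bucket sketch in which the refill amount depends on the rate and refill period while the burst cap is a separate parameter. This is a generic illustration, not RocksDB's rate limiter: `DemoRateLimiter` and its methods are hypothetical, and `Refill()` is called manually instead of being driven by a clock.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical token bucket; not the RocksDB RateLimiter implementation.
class DemoRateLimiter {
 public:
  DemoRateLimiter(std::int64_t bytes_per_sec, std::int64_t refill_period_us,
                  std::int64_t single_burst_bytes)
      : refill_bytes_(bytes_per_sec * refill_period_us / 1000000),
        single_burst_bytes_(single_burst_bytes),
        available_(0) {}

  // Simulates one elapsed refill period. A shorter period adds fewer bytes
  // per call, but the burst cap stays whatever the caller configured.
  void Refill() {
    available_ = std::min(available_ + refill_bytes_, single_burst_bytes_);
  }

  // Grants the request only if enough bytes have accumulated.
  bool TryRequest(std::int64_t bytes) {
    if (bytes > available_) return false;
    available_ -= bytes;
    return true;
  }

  std::int64_t available() const { return available_; }

 private:
  std::int64_t refill_bytes_;        // bytes added per refill period
  std::int64_t single_burst_bytes_;  // maximum bytes available at once
  std::int64_t available_;           // bytes currently available
};
```

With a 64 MiB/s rate and a 10 ms refill period, each refill adds well under 1 MiB, yet a 16 MiB request can still be granted once enough periods have elapsed.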

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12379

Test Plan:
The goal is to show we can now limit the max sporadic delay without impacting compaction's I/O size.

The benchmark runs compaction with a large I/O size, while user reads simultaneously run at a low rate that does not consume all of the available bandwidth. The max sporadic delay is measured using the P100 of rocksdb.file.read.get.micros. I just used strace to verify the compaction reads follow `rate_limiter_single_burst_bytes`

Setup: `./db_bench -benchmarks=fillrandom,flush -write_buffer_size=67108864 -disable_auto_compactions=true -value_size=256 -num=1048576`

Benchmark: `./db_bench -benchmarks=readrandom -use_existing_db=true -num=1048576 -duration=10 -benchmark_read_rate_limit=4096 -rate_limiter_bytes_per_sec=67108864 -rate_limiter_refill_period_us=$refill_micros -rate_limiter_single_burst_bytes=16777216 -rate_limit_bg_reads=true -rate_limit_user_ops=true -statistics=true -cache_size=0 -stats_level=5 -compaction_readahead_size=16777216 -use_direct_reads=true`

Results:

refill_micros | rocksdb.file.read.get.micros (P100)
-- | --
10000 | 10802
100000 | 100240
1000000 | 922061

For verifying compaction read sizes: `strace -fye pread64 ./db_bench -benchmarks=compact -use_existing_db=true -rate_limiter_bytes_per_sec=67108864 -rate_limiter_refill_period_us=$refill_micros -rate_limiter_single_burst_bytes=16777216 -rate_limit_bg_reads=true -compaction_readahead_size=16777216 -use_direct_reads=true`

Reviewed By: hx235

Differential Revision: D54165675

Pulled By: ajkr

fbshipit-source-id: c5968486316cbfb7ff8e5b7d75d3589883dd1105
2024-02-26 16:55:13 -08:00
Peter Dillinger 41849210e9 Fix ArenaTest.UnmappedAllocation in some cases (#12378)
Summary:
Fix compatibility with transparent huge pages by allocating in increments (1MiB) smaller than the
typical smallest huge page size of 2MiB.

Also, bypass the test when jemalloc config.fill is used, which means the allocator is explicitly
configured to write to memory before we get it, which is not what this test expects.

Fixes https://github.com/facebook/rocksdb/issues/12351

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12378

Test Plan:
```
sudo bash -c 'echo "always" > /sys/kernel/mm/transparent_hugepage/enabled'
```
The unit test fails before this change and passes after it.

Also tested internal buck build with dbg mode (previously failing).

Reviewed By: jaykorean, hx235

Differential Revision: D54139634

Pulled By: pdillinger

fbshipit-source-id: 179accebe918d8eecd46a979fcf21d356f9b5519
2024-02-26 16:08:21 -08:00
Richard Barnes a4ff83d1b2 Fix deprecated use of 0/NULL in internal_repo_rocksdb/repo/utilities/transactions/lock/range/range_tree/lib/locktree/wfg.cc + 3
Summary:
`nullptr` is typesafe. `0` and `NULL` are not. In the future, only `nullptr` will be allowed.

This diff helps us embrace the future _now_ in service of enabling `-Wzero-as-null-pointer-constant`.
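
A short sketch of why the distinction matters: with an integer overload and a pointer overload, the literal `0` selects the integer one, while `nullptr` can only select the pointer one. `Describe` is a hypothetical function used only for this example.

```cpp
#include <string>

// Two overloads: one takes an integer, the other takes a pointer.
std::string Describe(int) { return "int"; }
std::string Describe(const char*) { return "pointer"; }

// Describe(0) picks the int overload, because 0 is an int literal first.
// Describe(nullptr) picks the pointer overload, because nullptr has its own
// type (std::nullptr_t) that converts to pointers but not to int.
```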

Reviewed By: meyering

Differential Revision: D54163069

fbshipit-source-id: e5bb4b6ee79d82f1437ffed602bdb41dcfc0e59a
2024-02-25 22:17:04 -08:00
Yu Zhang 2940acac00 Persist table options use_delta_encoding in options file (#11987)
Summary:
This option is used for encoding keys in block-based table files. It has defaulted to true since its introduction.

Users may not notice that this option is not persisted in the options file unless they explicitly set it to false. Users who expect `Iterator::GetProperty("rocksdb.iterator.is-key-pinned")` to return 1 when setting `ReadOptions.pin_data = true` should already have noticed that loading the options file does not work, and should have worked around it by always explicitly setting this option to false when opening the DB. This change won't impact those users, except that they can now remove their workaround. For users who do not rely on key pinning behavior at all, and who therefore didn't notice that the option is not persisted, this change shouldn't have any visible behavior impact either.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11987

Reviewed By: hx235

Differential Revision: D54093238

Pulled By: jowlyzhang

fbshipit-source-id: 256a3348c44cf91349034d1f6e242c437b32b9a5
2024-02-23 14:13:28 -08:00
Jay Huh f300438c20 Mark offpeak feature production-ready (#12375)
Summary:
The feature was released in 8.9.0 and verified at Meta internally (via ZippyDB test tier). Marking the feature ready in production.

Wiki has been added in https://github.com/facebook/rocksdb/wiki/Daily-Off%E2%80%90peak-Time-Option

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12375

Test Plan: No code change. N/A

Reviewed By: cbi42

Differential Revision: D54128890

Pulled By: jaykorean

fbshipit-source-id: a6c728ab87657fc5263048e21c366053ec5717af
2024-02-23 13:26:22 -08:00
Alan Paxton d1386de632 Java FFI blog post - Post-publication issues with images (2) (#12372)
Summary:
Replace unreliable-in-chrome PDF w/PNG of same graph

jmh-result-pinnable-vs-output-plot.pdf is showing as thumbnail on Chrome, rendering OK on Safari for some; I have converted it to PNG in the hope that will display correctly in all environments.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12372

Reviewed By: cbi42

Differential Revision: D54076718

Pulled By: jowlyzhang

fbshipit-source-id: 2eff995f0239ab7850a40063d841380738953533
2024-02-22 15:01:55 -08:00
raffertyyu e09b9d0cb9 Fix zstd typo in cmake (#12309)
Summary:
https://github.com/facebook/rocksdb/issues/12247 introduced another typo in CMakeLists.txt and Findzstd.cmake.
CMake reports that ZSTD_INCLUDE_DIRS is not found.
Actually it should be
aacf60dda2/cmake/modules/Findzstd.cmake (L8)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12309

Reviewed By: hx235

Differential Revision: D54070348

Pulled By: ajkr

fbshipit-source-id: eaf6e260ea3669b8ea38e4c74a375bb885761b51
2024-02-22 14:39:05 -08:00
anand76 d9c0d44dab Add a perf level for measuring user thread block time (#12368)
Summary:
Enabling time PerfCounter stats in RocksDB is currently very expensive, as it enables all sorts of relatively uninteresting stats, such as iteration and point lookup breakdown. This PR adds a new perf level between `kEnableCount` and `kEnableTimeExceptForMutex` to enable stats for time spent by user (i.e., a RocksDB user) threads blocked by other RocksDB threads or events, such as a write group leader, write delay, or stalls. It does not include time spent waiting to acquire mutexes or waiting for IO.
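
The idea of an intermediate level can be sketched with an ordered enum and threshold checks. The enum and function names below are hypothetical and only mirror the ordering described above; they are not the RocksDB `PerfLevel` definitions.

```cpp
// Hypothetical ordered levels; a higher value enables more measurements.
enum DemoPerfLevel {
  kDemoDisable = 0,
  kDemoEnableCount,
  kDemoEnableWait,  // the new, cheaper timing level sits here
  kDemoEnableTimeExceptForMutex
};

// Time spent by user threads blocked on other threads or events is
// measured from the new level upward.
bool TimesUserThreadWait(DemoPerfLevel level) {
  return level >= kDemoEnableWait;
}

// The more expensive general timers stay off until the next level.
bool TimesGeneralOperations(DemoPerfLevel level) {
  return level >= kDemoEnableTimeExceptForMutex;
}
```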

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12368

Test Plan: Add a unit test for write_thread_wait_nanos

Reviewed By: ajkr

Differential Revision: D54021583

Pulled By: anand1976

fbshipit-source-id: 3f6fcf71010132ffffca0391a5565f3b59fddd48
2024-02-22 12:14:53 -08:00
Alan Paxton cb4f4381f6 Java FFI blog post - Post-publication issues with images (#12366)
Summary:
Review comments
Broken image links

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12366

Reviewed By: hx235

Differential Revision: D53999663

Pulled By: ajkr

fbshipit-source-id: 72546f468367dc950eb61a876c4f763a580eb76d
2024-02-21 15:50:57 -08:00
jrchyang 70cb330a4a optimize file size statistics in benchmark script (#12363)
Summary:
Execute `ls` once when counting the file size of the `DB_DIR` and remove the unused file number counter variable `c`. The test information is as follows:

```Shell
# benchmark command

NUM_KEYS=30000000 CACHE_SIZE=6442450944 DB_DIR=/mnt/rocksdb_test WAL_DIR=/mnt/rocksdb_test ../tools/benchmark.sh fillseq_disable_wal

# before modification

cat /tmp/benchmark_fillseq.wal_disabled.v400.log.stats.sizes
0.0	0.0	0.0	0.0	195250
1.1	1.1	0.0	0.0	195300
2.5	2.5	0.0	0.0	195310
3.8	3.7	0.0	0.0	195320
5.1	5.1	0.0	0.0	195330
max sizes (GB): 5.1 all, 5.1 sst, 0.0 log, 0.0 blob

# after modification

cat /tmp/benchmark_fillseq.wal_disabled.v400.log.stats.sizes
0.0	0.0	0.0	0.0	194839
1.2	1.2	0.0	0.0	194849
2.6	2.6	0.0	0.0	194859
4.0	4.0	0.0	0.0	194909
5.4	5.4	0.0	0.0	194919
max sizes (GB): 5.4 all, 5.4 sst, 0.0 log, 0.0 blob
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12363

Reviewed By: hx235

Differential Revision: D54005427

Pulled By: ajkr

fbshipit-source-id: fae149705eb3fcda48d7381c42836a150f35ddc4
2024-02-21 15:45:18 -08:00
Yu Zhang f1ca47b904 Add support to bulk load external files for UDT in memtable only feature (#12356)
Summary:
This PR expands on the capabilities added in https://github.com/facebook/rocksdb/issues/12343. It adds sanity checks for external file's comparator name and user-defined timestamps related flag. With this, it now supports ingesting files to a column family that enables user-defined timestamps in Memtable only feature.

Two fields in the table properties are used for the aforementioned check: 1) the comparator name, which records the comparator used to create this external SST file, and 2) the flag `user_defined_timestamps_persisted`. We compare these two fields with the column family's settings. The details are in the utility function `ValidateUserDefinedTimestampsOptions`.

To optimize for the majority of cases, where the sanity check should pass and reading the table properties should not affect how the `TableReader` is constructed, we do not read the table properties block separately for the sanity check before creating a `TableReader`. Instead, we continue using the current flow: first create a `TableReader`, use it to read the table properties and do the sanity checks, and reset the `TableReader` only for the case where the column family enables the UDTs in memtable only feature and the external file does not contain user-defined timestamps.

This PR also groups other table properties related sanity check in function `GetIngestedFileInfo` into the newly added `SanityCheckTableProperties` function.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12356

Test Plan:
added unit test
existing unit test

Reviewed By: cbi42

Differential Revision: D54025116

Pulled By: jowlyzhang

fbshipit-source-id: a918276c15f9908bd9df8513ce667638882e1554
2024-02-21 15:41:53 -08:00
Andrew Kryczka 8e29f243c9 No filesystem reads during `Merge()` writes (#12365)
Summary:
This occasional filesystem read in the write path has caused user pain. It doesn't seem very useful considering it only limits one component's merge chain length, and only helps merge uncached (i.e., infrequently read) values. This PR proposes allowing `max_successive_merges` to be exceeded when the value cannot be read from in-memory components. I included a rollback flag (`strict_max_successive_merges`) just in case.
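
One way to picture the proposed behavior is as a small decision function. This is a simplified sketch under stated assumptions, not the actual write-path code: `MergeAction`, `DecideMergeAction`, and the parameter names are hypothetical, and only the two option names come from the description above.

```cpp
// Hypothetical outcomes for a Merge() write in this simplified model.
enum class MergeAction {
  kAppendOperand,            // just add the operand to the memtable
  kMergeInMemory,            // collapse the chain using an in-memory value
  kMergeWithFilesystemRead   // collapse the chain after reading from files
};

MergeAction DecideMergeAction(int successive_merges,
                              int max_successive_merges,
                              bool base_value_in_memory,
                              bool strict_max_successive_merges) {
  // Below the limit (or limit disabled): nothing special happens.
  if (max_successive_merges <= 0 ||
      successive_merges < max_successive_merges) {
    return MergeAction::kAppendOperand;
  }
  // At the limit with the value available in memory: merge without I/O.
  if (base_value_in_memory) return MergeAction::kMergeInMemory;
  // At the limit but the value is only on the filesystem: the strict flag
  // restores the old read; otherwise the limit is allowed to be exceeded.
  return strict_max_successive_merges
             ? MergeAction::kMergeWithFilesystemRead
             : MergeAction::kAppendOperand;
}
```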

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12365

Test Plan:
"rocksdb.block.cache.data.add" is number of data blocks read from filesystem. Since the benchmark is write-only, compaction is disabled, and flush doesn't read data blocks, any nonzero value means the user write issued the read.

```
$ for s in false true; do echo -n "strict_max_successive_merges=$s: " && ./db_bench -value_size=64 -write_buffer_size=131072 -writes=128 -num=1 -benchmarks=mergerandom,flush,mergerandom -merge_operator=stringappend -disable_auto_compactions=true -compression_type=none -strict_max_successive_merges=$s -max_successive_merges=100 -statistics=true |& grep 'block.cache.data.add COUNT' ; done
strict_max_successive_merges=false: rocksdb.block.cache.data.add COUNT : 0
strict_max_successive_merges=true: rocksdb.block.cache.data.add COUNT : 1
```

Reviewed By: hx235

Differential Revision: D53982520

Pulled By: ajkr

fbshipit-source-id: e40f761a60bd601f232417ac0058e4a33ee9c0f4
2024-02-21 13:15:27 -08:00
Jeff Palm 5950907a82 switch to using centos8-native (#12367)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12367

switch to using centos8-native for rocks-db

Reviewed By: jowlyzhang

Differential Revision: D53971368

fbshipit-source-id: 635885dfb9e0ec6daa7623627a50e6b2897725ba
2024-02-21 12:03:40 -08:00
Alan Paxton 003197f005 Foreign function interface (Panama) blog (#11760)
Summary:
We did some experimental work with FFI and native memory as a potential improvement to the Java API.
The work lives (unmerged) in https://github.com/facebook/rocksdb/pull/11095

This is the report text from that branch, extracted as a blog post,
along with some supporting files (png, pdf of graphs).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11760

Reviewed By: hx235

Differential Revision: D53943442

Pulled By: ajkr

fbshipit-source-id: 7c9f800e25be22c10e736cdd3b0d65422ecfc826
2024-02-20 13:44:35 -08:00
leedonggyu ca99a8f153 Add function to check if the RocksDB instance is closed or not (#11337)
Summary:
In the RocksDB JNI there is no method to know whether the instance is closed,
so using a closed instance crashes the JVM.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11337

Reviewed By: jaykorean

Differential Revision: D53941387

Pulled By: ajkr

fbshipit-source-id: e3e4e6fe48409fa70a312810e467ec0c4ce356ef
2024-02-20 11:36:28 -08:00
Yu Zhang 31dfc81e18 Start 9.1.0 release (#12360)
Summary:
with release notes for 9.0.fb, format_compatible test update, and version.h update.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12360

Test Plan: CI

Reviewed By: cbi42

Differential Revision: D53879416

Pulled By: jowlyzhang

fbshipit-source-id: 29598893d9ce2d0bb181345ddb78f9b1529aee75
2024-02-16 18:26:48 -08:00
Alex Wied f2732d0586 Export GetSequenceNumber functionality for Snapshots (#12354)
Summary:
This PR adds `Snapshot->GetSequenceNumber()` functionality to the C API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12354

Reviewed By: akankshamahajan15

Differential Revision: D53836085

Pulled By: cbi42

fbshipit-source-id: 4a14daeba9210a69bcb74e4c1c0666deff1b4837
2024-02-16 10:28:41 -08:00
Adam Retter 055b21ab11 Update ZLib to 1.3.1 (#12358)
Summary:
pdillinger This fixes the RocksJava build; it is also needed in the 8.10.fb and 8.11.fb branches, please.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12358

Reviewed By: jaykorean

Differential Revision: D53859743

Pulled By: pdillinger

fbshipit-source-id: b8417fccfee931591805f9aecdfae7c086fee708
2024-02-16 10:26:32 -08:00
anand76 d227276147 Deprecate some variants of Get and MultiGet (#12327)
Summary:
A lot of variants of Get and MultiGet have been added to `include/rocksdb/db.h` over the years. Try to consolidate them by marking variants that don't return timestamps as deprecated. The underlying DB implementation will check and return Status::NotSupported() if it doesn't support returning timestamps and the caller asks for it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12327

Reviewed By: pdillinger

Differential Revision: D53828151

Pulled By: anand1976

fbshipit-source-id: e0b5ca42d32daa2739d5f439a729815a2d4ff050
2024-02-16 09:21:06 -08:00
Akanksha Mahajan 956f1dfde3 Change ReadAsync callback API to remove const from FSReadRequest (#11649)
Summary:
Modify the ReadAsync callback API to remove const from FSReadRequest, because const prevents ownership of fs_scratch from being moved.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11649

Test Plan: CircleCI jobs

Reviewed By: anand1976

Differential Revision: D53585309

Pulled By: akankshamahajan15

fbshipit-source-id: 3bff9035db0e6fbbe34721a5963443355807420d
2024-02-16 09:14:55 -08:00
anand76 28c1c15c29 Sync tickers and histograms across C++ and Java (#12355)
Summary:
The RocksDB ticker and histogram statistics were out of sync between the C++ and Java code, with a number of newer stats missing in TickerType.java and HistogramType.java. Also, there were gaps in numbering in portal.h, which could soon become an issue due to the number of tickers and the fact that we're limited to 1 byte in Java. This PR adds the missing stats, and re-numbers all of them. It also moves some stats around to try to group related stats together. Since this will go into a major release, compatibility shouldn't be an issue.

This should be automated at some point, since the current process is somewhat error prone.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12355

Reviewed By: jaykorean

Differential Revision: D53825324

Pulled By: anand1976

fbshipit-source-id: 298c180872f4b9f1ee54b8bb22f4e280458e7e09
2024-02-15 17:22:03 -08:00
Peter Dillinger 12018136d8 KeySegmentsExtractor and prototype higher-dimensional filtering (#12075)
Summary:
This change contains a prototype new API for "higher dimensional" filtering of read queries. Existing filters treat keys as one-dimensional, either as distinct points (whole key) or as contiguous ranges in comparator order (prefix filters). The proposed KeySegmentsExtractor allows treating keys as multi-dimensional for filtering purposes even though they still have a single total order across dimensions. For example, consider these keys in different LSM levels:

L0:
abc_0123
abc_0150
def_0114
ghi_0134

L1:
abc_0045
bcd_0091
def_0077
xyz_0080

If we get a range query for [def_0100, def_0200), a prefix filter (up to the underscore) will tell us that both levels are potentially relevant. However, if each SST file stores a simple range of the values for the second segment of the key, we would see that L1 only has [0045, 0091] which (under certain required assumptions) we are sure does not overlap with the given range query. Thus, we can filter out processing or reading any index or data blocks from L1 for the query.
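
The example above can be worked through with a small sketch. It is not the KeySegmentsExtractor API: the helper names are hypothetical, keys are assumed to split on an underscore, and the second segment is assumed to be fixed-width and zero-padded so that string order matches numeric order.

```python
def second_segment_range(keys, sep="_"):
    """Return the inclusive (min, max) of the second key segment."""
    values = [key.split(sep)[1] for key in keys]
    return min(values), max(values)


def may_overlap(segment_range, query_lo, query_hi):
    """Check an inclusive [lo, hi] range against a half-open query [lo, hi)."""
    lo, hi = segment_range
    return lo < query_hi and hi >= query_lo


levels = {
    "L0": ["abc_0123", "abc_0150", "def_0114", "ghi_0134"],
    "L1": ["abc_0045", "bcd_0091", "def_0077", "xyz_0080"],
}

# Query [def_0100, def_0200): both bounds share the first segment,
# so the second segment is constrained to [0100, 0200).
for name, keys in levels.items():
    segment_range = second_segment_range(keys)
    print(name, segment_range, may_overlap(segment_range, "0100", "0200"))
```

Here L0 still has to be consulted, while L1 can be skipped because its stored range ends below the start of the query.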

This kind of case shows up with time-ordered data but is more general than filtering based on user timestamp. See https://github.com/facebook/rocksdb/issues/11332 . Here the "time" segments of the keys are meaningfully ordered with respect to each other even when the previous segment is different, so summarizing data along an alternate dimension of the key like this can work well for filtering.

This prototype implementation simply leverages existing APIs for user table properties and table filtering, which is not very CPU efficient. Eventually, we expect to create a native implementation. However, I have put some significant
thought and engineering into the new APIs overall, which I expect to be close to refined enough for production.

For details, see new public APIs in experimental.h. For a detailed example, see the new unit test in db_bloom_filter_test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12075

Test Plan: Unit test included

Reviewed By: jowlyzhang

Differential Revision: D53619406

Pulled By: pdillinger

fbshipit-source-id: 9e6e7b82b4db8d815db76a6ab340e90db2c191f2
2024-02-15 15:39:55 -08:00
Peter Dillinger bfd00bba9c Use format_version=6 by default (#12352)
Summary:
It's in production for a large storage service, and it was initially released 6 months ago (8.6.0). IMHO that's enough room for "easy downgrade" to most any user's previously integrated version, even if they only update a few times a year.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12352

Test Plan:
tests updated, including format compatibility test

table_test: ApproximateOffsetOfCompressed is affected because adding index block to metaindex adds about 13 bytes
to SST files in format_version 6. This test has historically been problematic and one reason is that, apparently, not only
could it pass/fail depending on snappy compression version, but also how long your host name is, because of db_host_id.
I've cleared that out for the test, which takes care of format_version=6 and hopefully improves long-term reliability.

Suggested follow-up: FinishImpl in table_test.cc takes a table_options that is ignored in some cases and might not match
the ioptions.table_factory configuration unless the caller is very careful. This should be cleaned up somehow.

Reviewed By: anand1976

Differential Revision: D53786884

Pulled By: pdillinger

fbshipit-source-id: 1964cbd40d3ab0a821fdc01c458031df716fcf51
2024-02-15 11:23:48 -08:00
Changyu Bi 6e57135a65 Add a changelog entry for PR 12322 (#12353)
Summary:
.. for public api change related to sst_dump.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12353

Reviewed By: jaykorean

Differential Revision: D53791123

Pulled By: cbi42

fbshipit-source-id: 3fbe9c7a3eb0a30dc1a00d39bc8a46028baa3779
2024-02-15 09:53:20 -08:00
Gilbert Liu d201e59941 Update llvm-fb to 15 (#12342)
Summary:
Update llvm-fb to 15 and some other dependency versions.

## Test

Copied over the two script files to tp2 librocksdb source and ran tp2_build, it succeeded.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12342

Reviewed By: ltamasi

Differential Revision: D53690631

Pulled By: bunnypak

fbshipit-source-id: 68f884b2a565f98bc3510290b411a901ef781adb
2024-02-14 14:40:05 -08:00
Yu Zhang f405e55cfa Add support in SstFileWriter to not persist user defined timestamps (#12348)
Summary:
This PR adds support in `SstFileWriter` to create SST files without persisting timestamps when the column family has enabled the UDTs in Memtable only feature. The sst files created from flush and compaction do not contain timestamps, and we want the sst files created by `SstFileWriter` to follow the same pattern and not persist timestamps. This is to prepare for ingesting external SST files for this type of column family.

There are timestamp-aware APIs and non timestamp-aware APIs in `SstFileWriter`. The former are exclusively used when the column family's comparator is timestamp-aware, a.k.a. `Comparator::timestamp_size() > 0`, while the latter are exclusively used when the column family's comparator is non timestamp-aware, a.k.a. `Comparator::timestamp_size() == 0`. There are sanity checks to make sure these APIs are correctly used.

In this PR, API usage continues with the above enforcement: even though timestamps are not eventually persisted, users are still asked to use only the timestamp-aware APIs. But because data points will logically all have minimum timestamps, we don't allow multiple versions of the same user key (without timestamp) to be added.
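
The rules above can be illustrated with a toy writer. This is a sketch only, not the `SstFileWriter` interface: the class, its method names, and its constructor arguments are hypothetical stand-ins, and key ordering checks are left out.

```python
class ToySstFileWriter:
    """Toy model of the API checks; all names here are illustrative."""

    def __init__(self, timestamp_size, persist_timestamps=True):
        self.timestamp_size = timestamp_size
        self.persist_timestamps = persist_timestamps
        self.entries = []
        self._last_user_key = None

    def put(self, user_key, value):
        # Non timestamp-aware API: only valid when timestamp_size == 0.
        if self.timestamp_size != 0:
            raise ValueError("comparator is timestamp-aware")
        self.entries.append((user_key, value))

    def put_with_timestamp(self, user_key, timestamp, value):
        # Timestamp-aware API: only valid when timestamp_size > 0.
        if self.timestamp_size == 0:
            raise ValueError("comparator is not timestamp-aware")
        if len(timestamp) != self.timestamp_size:
            raise ValueError("unexpected timestamp size")
        if self.persist_timestamps:
            self.entries.append((user_key + timestamp, value))
        else:
            # Every entry logically has the minimum timestamp, so a second
            # version of the same user key would be indistinguishable.
            if user_key == self._last_user_key:
                raise ValueError("multiple versions of the same user key")
            self.entries.append((user_key, value))
        self._last_user_key = user_key


writer = ToySstFileWriter(timestamp_size=8, persist_timestamps=False)
writer.put_with_timestamp(b"a", b"\x00" * 8, b"v1")
print(writer.entries)
```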

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12348

Test Plan:
Added unit tests
Manual inspection of generated sst files with `sst_dump`

Reviewed By: ltamasi

Differential Revision: D53732667

Pulled By: jowlyzhang

fbshipit-source-id: e43beba0d3a1736b94ee5c617163a6280efd65b7
2024-02-13 20:30:07 -08:00
Yu Zhang 4bea83aa44 Remove the force mode for EnableFileDeletions API (#12337)
Summary:
There is no strong reason for a user to need this mode, and its behavior is destructive.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12337

Reviewed By: hx235

Differential Revision: D53630393

Pulled By: jowlyzhang

fbshipit-source-id: ce94b537258102cd98f89aa4090025663664dd78
2024-02-13 18:36:25 -08:00
Jay Huh 8c7c0a38f1 Minor refactor with printing stdout in blackbox tests (#12350)
Summary:
As title. Adding a missing stdout printing in `blackbox_crash_main()`

# Test

**Blackbox**
```
$> python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304
```
```
...
stdout:
 Choosing random keys with no overwrite
DB path: [/tmp/jewoongh/rocksdb_crashtest_blackbox34jwn9of]
(Re-)verified 0 unique IDs
2024/02/13-12:27:33  Initializing worker threads
Crash-recovery verification passed :)
2024/02/13-12:27:36  Starting database operations
...
jewoongh stdout test
jewoongh stdout test
...
jewoongh stdout test
stderr:
 jewoongh injected error
```

**Whitebox**
```
$> python3 tools/db_crashtest.py whitebox --simple --max_key=25000000 --write_buffer_size=4194304
```
```
...
stdout:
 Choosing random keys with no overwrite
Creating 24415 locks
...
2024/02/13-12:31:51  Initializing worker threads
Crash-recovery verification passed :)
2024/02/13-12:31:54  Starting database operations
jewoongh stdout test
jewoongh stdout test
jewoongh stdout test
...
stderr:
 jewoongh injected error
...
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12350

Reviewed By: akankshamahajan15, cbi42

Differential Revision: D53728910

Pulled By: jaykorean

fbshipit-source-id: ec90ed3b5e6a1102d1fb55d357d0371e5072a173
2024-02-13 14:15:52 -08:00
Yu Zhang 10d02456b6 Add support to bulk load external files with user-defined timestamps (#12343)
Summary:
This PR adds initial support to bulk loading external sst files with user-defined timestamps.

The following invariant must hold while ingesting external files:
     for any two internal keys <K, ts1, seq1> and <K, ts2, seq2>:
     ts1 < ts2 iff. seq1 < seq2

These extra requirements are added for ingesting external files with user-defined timestamps:
1) A file whose user key (without timestamp) range overlaps with the db cannot be ingested. This is because we cannot ensure the above invariant is met without checking each overlapped key's timestamp and comparing it with the timestamp from the db, which is an expensive step. This bulk loading feature will be used by MyRocks, and currently their usage can guarantee that an ingested file's key range doesn't overlap with the db.
4f3a57a13f/storage/rocksdb/ha_rocksdb.cc (L3312)
We can consider loosening this requirement by doing this check in the future; this initial support just disallows it.

2) Files whose user key (without timestamp) ranges overlap with each other are not allowed to be ingested. For similar reasons, it's hard to ensure the above invariant is met. For example, if we have two files where user keys are interleaved like this:
file1: [c10, c8, f10, f5]
file2: [b5, c11, f4]
Whether file1 gets a bigger global seqno than file2 or the other way around, the above invariant cannot be met.
So we disallow this.

3) When a column family enables user-defined timestamps, it doesn't support ingestion behind mode. Ingestion behind currently simply puts the file at the bottommost level and assigns a global seqno of 0 to the file. We would need to do a similar search through the LSM tree for key range overlap checks to make sure the aforementioned invariant is met, so this initial support disallows this mode. We can consider adding it in the future.
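
The interleaved-file example in the second requirement can be checked mechanically. The sketch below is illustrative only: entries such as `c10` are assumed to mean user key `c` with timestamp 10, every key in a file is assumed to share that file's global seqno, and only pairs of entries from different files are compared.

```python
from itertools import product


def parse(entry):
    """Split an entry like 'c10' into (user_key, timestamp)."""
    return entry[0], int(entry[1:])


def invariant_holds(file_a, seqno_a, file_b, seqno_b):
    """Check ts1 < ts2 iff seq1 < seq2 for same-key entries across two files."""
    for (key_a, ts_a), (key_b, ts_b) in product(map(parse, file_a),
                                                map(parse, file_b)):
        if key_a == key_b and (ts_a < ts_b) != (seqno_a < seqno_b):
            return False
    return True


file1 = ["c10", "c8", "f10", "f5"]
file2 = ["b5", "c11", "f4"]

# Try both possible orderings of the two global seqnos.
print(invariant_holds(file1, 2, file2, 1))
print(invariant_holds(file1, 1, file2, 2))
```

Giving file1 the larger seqno breaks on key `c` (file2 holds the newer timestamp 11), and giving file2 the larger seqno breaks on key `f` (file2 holds the older timestamp 4).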

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12343

Test Plan: Add unit tests

Reviewed By: cbi42

Differential Revision: D53686182

Pulled By: jowlyzhang

fbshipit-source-id: f05e3fb27967f7974ed40179d78634c40ecfb136
2024-02-13 11:15:28 -08:00
马越 45668a05f5 add unit test for compactRangeWithNullBoundaries java api (#12333)
Summary:
The purpose of this PR is to supplement a set of unit tests for https://github.com/facebook/rocksdb/pull/12328

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12333

Reviewed By: ltamasi

Differential Revision: D53553830

Pulled By: cbi42

fbshipit-source-id: d21490f7ce7b30f42807ee37eda455ca6abdd072
2024-02-13 10:48:31 -08:00
Levi Tamasi de1e3ff6ea Fix a data race in DBImpl::RenameTempFileToOptionsFile (#12347)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12347

`DBImpl::disable_delete_obsolete_files_` should only be accessed while holding the DB mutex to prevent data races. There's a piece of logic in `DBImpl::RenameTempFileToOptionsFile` where this synchronization was previously missing. The patch fixes this issue similarly to how it's handled in `DisableFileDeletions` and `EnableFileDeletions`, that is, by saving the counter value while holding the mutex and then performing the actual file deletion outside the critical section. Note: this PR only fixes the race itself; as a followup, we can also look into cleaning up and optimizing the file deletion logic (which is currently inefficient on multiple different levels).
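
The synchronization pattern described above (read the counter under the mutex, act on the saved value outside the critical section) can be sketched generically. This is a Python illustration with hypothetical class and method names, not the `DBImpl` code, and the file deletion is simulated by appending to a list.

```python
import threading


class ToyDB:
    """Toy model; names mirror the description but are illustrative only."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._disable_delete_obsolete_files = 0  # guarded by _mutex
        self.deleted = []

    def disable_file_deletions(self):
        with self._mutex:
            self._disable_delete_obsolete_files += 1

    def enable_file_deletions(self):
        with self._mutex:
            if self._disable_delete_obsolete_files > 0:
                self._disable_delete_obsolete_files -= 1

    def rename_temp_file_to_options_file(self, obsolete_files):
        # Save the counter-derived decision while holding the mutex...
        with self._mutex:
            deletions_enabled = self._disable_delete_obsolete_files == 0
        # ...then do the slow file deletion outside the critical section.
        if deletions_enabled:
            self.deleted.extend(obsolete_files)
        return deletions_enabled


db = ToyDB()
db.disable_file_deletions()
print(db.rename_temp_file_to_options_file(["OPTIONS-000001"]))
db.enable_file_deletions()
print(db.rename_temp_file_to_options_file(["OPTIONS-000001"]))
```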

Reviewed By: jowlyzhang

Differential Revision: D53675153

fbshipit-source-id: 5358e894ee6829d3edfadac50a93d97f8819e481
2024-02-12 13:26:09 -08:00
Yaroslav Stepanchuk 395d24f0fa Fix build on alpine 3.19 (#12345)
Summary:
Add missing include of the cstdint header.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12345

Reviewed By: ltamasi

Differential Revision: D53672261

Pulled By: cbi42

fbshipit-source-id: 758944c0b51b9701a129e7b88f692103bbce11d3
2024-02-12 11:24:56 -08:00