Summary:
In C++, `extern` is redundant in a number of cases:
* "Global" function declarations and definitions
* "Global" variable definitions when already declared `extern`
For consistency and simplicity, I've removed these in code that *we own*. In a couple of cases, I removed obsolete declarations, and for MagicNumber constants, I have consolidated the declarations into a header file (format.h)
as standard best practice would prescribe.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12300
Test Plan: no functional changes, CI
Reviewed By: ajkr
Differential Revision: D53148629
Pulled By: pdillinger
fbshipit-source-id: fb8d927959892e03af09b0c0d542b0a3b38fd886
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12269
`-Wextra-semi` or `-Wextra-semi-stmt`
If the code compiles, this is safe to land.
Reviewed By: jaykorean
Differential Revision: D52969093
fbshipit-source-id: 0520085819fa785679c859b63b877931d3f71f2c
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12270
`-Wextra-semi` or `-Wextra-semi-stmt`
If the code compiles, this is safe to land.
Reviewed By: jaykorean
Differential Revision: D52965944
fbshipit-source-id: 625d47662e984db9ce06e72ff39025b8a24aa246
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`
If the code compiles, this is safe to land.
Reviewed By: dmm-fb
Differential Revision: D52967247
fbshipit-source-id: 4a67cb9719e092ad9bbe9c7e1d060e3f9042ecf7
Summary:
## Overview
In this PR, we introduce support for setting the RocksDB native logger through Java. As mentioned in the discussion on the [Google Group discussion](https://groups.google.com/g/rocksdb/c/xYmbEs4sqRM/m/e73E4whJAQAJ), this work is primarily motivated by the JDK 17 [performance regression in JNI thread attach/detach calls](https://bugs.openjdk.org/browse/JDK-8314859): the only existing RocksJava logging configuration call, `setLogger`, invokes the provided logger over the JNI.
## Changes
Specifically, these changes add support for the `devnull` and `stderr` native loggers. For the `stderr` logger, we add the ability to prefix every log with a `logPrefix`, so that it becomes possible know which database a particular log is coming from (if multiple databases are in use). The API looks like the following:
```java
Options opts = new Options();
NativeLogger stderrNativeLogger = NativeLogger.newStderrLogger(
InfoLogLevel.DEBUG_LEVEL, "[my prefix here]");
options.setLogger(stderrNativeLogger);
try (final RocksDB db = RocksDB.open(options, ...)) {...}
// Cleanup
stderrNativeLogger.close()
opts.close();
```
Note that the API to set the logger is the same, via `Options::setLogger` (or `DBOptions::setLogger`). However, it will set the RocksDB logger to be native when the provided logger is an instance of `NativeLogger`.
## Testing
Two tests have been added in `NativeLoggerTest.java`. The first test creates both the `devnull` and `stderr` loggers, and sets them on the associated `Options`. However, to avoid polluting the testing output with logs from `stderr`, only the `devnull` logger is actually used in the test. The second test does the same logic, but for `DBOptions`.
It is possible to manually verify the `stderr` logger by modifying the tests slightly, and observing that the console indeed gets cluttered with logs from `stderr`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12213
Reviewed By: cbi42
Differential Revision: D52772306
Pulled By: ajkr
fbshipit-source-id: 4026895f78f9cc250daf6bfa57427957e2d8b053
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
Summary:
* Largely based on https://github.com/facebook/rocksdb/issues/12085 but grouped into one large workflow because of bad GHA UI design (see comments).
* Windows job details consolidated into an action file so that those jobs can easily move between per-pr-push and nightly.
* Simplify some handling of "CIRCLECI" environment and add "GITHUB_ACTIONS" in the same places
* For jobs that we want to go in pr-jobs or nightly there are disabled "candidate" workflows with draft versions of those jobs.
* ARM jobs are disabled waiting on full GHA support.
* build-linux-java-static needed some special attention to work, due to GLIBC compatibility issues (see comments).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12163
Test Plan:
Nightly jobs can be seen passing between these two links:
https://github.com/facebook/rocksdb/actions/runs/7266835435/job/19799390061?pr=12163https://github.com/facebook/rocksdb/actions/runs/7269697823/job/19807724471?pr=12163
And per-PR jobs of course passing on this PR.
Reviewed By: hx235
Differential Revision: D52335810
Pulled By: pdillinger
fbshipit-source-id: bbb95196f33eabad8cddf3c6b52f4413c80e034d
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`
If the code compiles, this is safe to land.
Reviewed By: palmje
Differential Revision: D51995065
fbshipit-source-id: 9b55a0d8abd0927b76376cb7751bf0fcab10518c
Summary:
See new atomic.h file comments for motivation.
I have updated HyperClockCache to use the new atomic wrapper, fixing a few cases where an implicit conversion was accidentally used and therefore mixing std::memory_order_seq_cst where release/acquire ordering (or relaxed) was intended. There probably wasn't a real bug because I think all the cases happened to be in single-threaded contexts like constructors/destructors or statistical ops like `GetCapacity()` that don't need any particular ordering constraints.
Recommended follow-up:
* Replace other uses of std::atomic to help keep them safe from bugs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12051
Test Plan:
Did some local correctness stress testing with cache_bench. Also triggered 15 runs of fbcode_blackbox_crash_test and saw no related failures (just 3 failures in ~CacheWithSecondaryAdapter(), already known)
No performance difference seen before & after running simultaneously:
```
(while ./cache_bench -cache_type=fixed_hyper_clock_cache -populate_cache=0 -cache_size=3000000000 -ops_per_thread=500000 -threads=12 -histograms=0 2>&1 | grep parallel; do :; done) | awk '{ s += $3; c++; print "Avg time: " (s/c);}'
```
... for both fixed_hcc and auto_hcc.
Reviewed By: jowlyzhang
Differential Revision: D51090518
Pulled By: pdillinger
fbshipit-source-id: eeb324facb3185584603f9ea0c4de6f32919a2d7
Summary:
As titled. The most notable place that marks the feature as experimental is its wiki page. That was updated. And this PR removes the experimental marker from a few places for this feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11974
Reviewed By: ltamasi
Differential Revision: D50383640
Pulled By: jowlyzhang
fbshipit-source-id: 0bfe26ceda0793515f54b602cf3cd13d0737ec25
Summary:
In follow-up to https://github.com/facebook/rocksdb/issues/11922, fix a race in functions like CreateColumnFamily and SetDBOptions where the DB reports one option setting but a different one is left in effect.
To fix, we can add an extra mutex around these rare operations. We don't want to hold the DB mutex during I/O or other slow things because of the many purposes it serves, but a mutex more limited to these cases should be fine.
I believe this would fix a write-write race in https://github.com/facebook/rocksdb/issues/10079 but not the read-write race.
Intended follow-up to this:
* Should be able to remove write thread synchronization from DBImpl::WriteOptionsFile
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11929
Test Plan:
Added two mini-stress style regression tests that fail with >1% probability before this change:
DBOptionsTest::SetStatsDumpPeriodSecRace
ColumnFamilyTest::CreateAndDropPeriodicRace
I haven't reproduced such an inconsistency between in-memory options and on disk latest options, but this change at least improves safety and adds a test anyway:
DBOptionsTest::SetStatsDumpPeriodSecRace
Reviewed By: ajkr
Differential Revision: D50024506
Pulled By: pdillinger
fbshipit-source-id: 1e99a9ed4d96fdcf3ac5061ec6b3cee78aecdda4
Summary:
This change has two primary goals (follow-up to https://github.com/facebook/rocksdb/issues/11917, https://github.com/facebook/rocksdb/issues/11920):
* Ensure the DB seqno_to_time_mapping has entries that allow us to put a good time lower bound on any writes that happen after setting up preserve/preclude options (either in a new DB, new CF, SetOptions, etc.) and haven't yet aged out of that time window. This allows us to remove a bunch of work-arounds in tests.
* For new DBs using preserve/preclude options, automatically reserve some sequence numbers and pre-map them to cover the time span back to the preserve/preclude cut-off time. In the future, this will allow us to import data from another DB by key, value, and write time by assigning an appropriate seqno in this DB for that write time.
Note that the pre-population (historical mappings) does not happen if the original options at DB Open time do not have preserve/preclude, so it is recommended to create initial column families at that time with create_missing_column_families, to take advantage of this (future) feature. (Adding these historical mappings after DB Open would risk non-monotonic seqno_to_time_mapping, which is dubious if not dangerous.)
Recommended follow-up:
* Solve existing race conditions (not memory safety) where parallel operations like CreateColumnFamily or SetDBOptions could leave the wrong setting in effect.
* Make SeqnoToTimeMapping more gracefully handle a possible case in which too many mappings are added for the time range of concern. It seems like there could be cases where data is massively excluded from the cold tier because of entries falling off the front of the mapping list (causing GetProximalSeqnoBeforeTime() to return 0). (More investigation needed.)
No release note for the minor bug fix because this is still an experimental feature with limited usage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11922
Test Plan: tests added / updated
Reviewed By: jowlyzhang
Differential Revision: D49956563
Pulled By: pdillinger
fbshipit-source-id: 92beb918c3a298fae9ca8e509717b1067caa1519
Summary:
In preparing some seqno_to_time_mapping improvements, I found that some of the wrap-up work for creating column families was unnecessarily repeated in the case of DB::Open with create_missing_column_families. This change fixes that (`CreateColumnFamily()` -> `CreateColumnFamilyImpl()` in `DBImpl::Open()`), motivated by avoiding repeated calls to `RegisterRecordSeqnoTimeWorker()` but with the side benefit of avoiding repeated calls to `WriteOptionsFile()` for each CF.
Also in this change:
* Add a `Status::UpdateIfOk()` function for combining statuses in a common pattern
* Rename `max_time_duration` -> `min_preserve_seconds` (include units as much as possible)
* Improved comments in several places
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11920
Test Plan: tests added / updated
Reviewed By: jaykorean
Differential Revision: D49919147
Pulled By: pdillinger
fbshipit-source-id: 3d0318c1d070c842c5331da0a5b415caedc104f1
Summary:
RocksDB's primary function is to facilitate read and write operations. Compactions, while essential for minimizing read amplifications and optimizing storage, can sometimes compete with these primary tasks. Especially during periods of high read/write traffic, it's vital to ensure that primary operations receive priority, avoiding any potential disruptions or slowdowns. Conversely, during off-peak times when traffic is minimal, it's an opportune moment to tackle low-priority tasks like TTL based compactions, optimizing resource usage.
In this PR, we are incorporating the concept of off-peak time into RocksDB by introducing `daily_offpeak_time_utc` within the DBOptions. This setting is formatted as "HH:mm-HH:mm" where the first one before "-" is the start time and the second one is the end time, inclusive. It will be later used for resource optimization in subsequent PRs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11893
Test Plan:
- New Unit Test Added - `DBOptionsTest::OffPeakTimes`
- Existing Unit Test Updated - `OptionsTest`, `OptionsSettableTest`
Reviewed By: pdillinger
Differential Revision: D49714553
Pulled By: jaykorean
fbshipit-source-id: fef51ea7c0fede6431c715bff116ddbb567c8752
Summary:
Added some util function APIs to facilitate using the U64Ts.
The U64Ts format for encoding a timestamp is not entirely RocksDB internal. When users are using the user-defined timestamp feature from the transaction layer, its public APIs including `SetCommitTimestamp`, `GetCommitTimestamp`, `SetReadTimestampForValidation` are taking and returning timestamps as uint64_t. But if users want to use the APIs from the DB layer, including populating `ReadOptions.timestamp`, interpreting `Iterator::timestamp()`, these APIs are using and returning U64Ts timestamps as an encoded Slice. So these util functions are added to facilitate the usage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11888
Reviewed By: ltamasi
Differential Revision: D49620709
Pulled By: jowlyzhang
fbshipit-source-id: ace8d782ee7c3372cf410abf761320d373e495e1
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11870
Having a large number of merge operands applied at query time can have a significant effect on performance; therefore, applications might want limit the number of deltas for any given key. However, there is currently no way to establish the number of operands for certain types of queries. The ticker `READ_NUM_MERGE_OPERANDS` only provides aggregate (not per-read) information. The `PerfContext` counters `internal_merge_count` and `internal_merge_point_lookup_count` can be used to get this information on a per-query basis for iterators and single point lookups; however, there is no per-key breakdown for `MultiGet` type APIs. The patch addresses this issue by introducing a special kind of OK status which signals that an application-defined threshold on the number of merge operands has been exceeded for a given key. The threshold can be specified on a per-query basis using a new field in `ReadOptions`.
Reviewed By: jaykorean
Differential Revision: D49522786
fbshipit-source-id: 4265b3848d1be5ff313a3e8fb604ddf56411dd2c
Summary:
An internal user wants to be able to dynamically switch between Bloom and Ribbon filters, without a custom FilterPolicy. Making `filter_policy` mutable would actually make issue https://github.com/facebook/rocksdb/issues/10079 worse, because it would be a race on a pointer field, not just on scalars.
As a reasonable compromise until that is fixed, I am enabling dynamic control over Bloom vs. Ribbon choice by making
RibbonFilterPolicy::bloom_before_level mutable, and doing that safely by using an atomic.
I've also slightly tweaked the interpretation of that field so that setting it to INT_MAX really means "always Bloom."
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11838
Test Plan: unit tests added/extended. crash test updated for SetOptions call and tested under TSAN with amplified probability (lower set_options_one_in).
Reviewed By: ajkr
Differential Revision: D49296284
Pulled By: pdillinger
fbshipit-source-id: e4251c077510df9a9c719876f482448c0d15402a
Summary:
`GetEntity` API support for ReadOnly DB and Secondary DB.
- Introduced `GetImpl()` with `GetImplOptions` in `db_impl_readonly` and refactored current `Get()` logic into `GetImpl()` so that look up logic can be reused for `GetEntity()` (Following the same pattern as `DBImpl::Get()` and `DBImpl::GetEntity()`)
- Introduced `GetImpl()` with `GetImplOptions` in `db_impl_secondary` and refactored current `GetImpl()` logic. This is to make `DBImplSecondary::Get/GetEntity` consistent with `DBImpl::Get/GetEntity` and `DBImplReadOnly::Get/GetEntity`
- `GetImpl()` in `db_impl` is now virtual. both `db_impl_readonly` and `db_impl_secondary`'s `Get()` override are no longer needed since all three dbs now have the same `Get()` which calls `GetImpl()` internally.
- `GetImpl()` in `DBImplReadOnly` and `DBImplSecondary` now pass in `columns` instead of `nullptr` in lookup functions like `memtable->get()`
- Introduced `GetEntity()` API in `DBImplReadOnly` and `DBImplSecondary` which simply calls `GetImpl()` with `columns` set in `GetImplOptions`.
- Introduced `Env::IOActivity::kGetEntity` and set read_options.io_activity to `Env::IOActivity::kGetEntity` for `GetEntity()` operations (in db_impl)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11799
Test Plan:
**Unit Tests**
- Added verification in `DBWideBasicTest::PutEntity` by Reopening DB as ReadOnly with the same setup.
- Added verification in `DBSecondaryTest::ReopenAsSecondary` by calling `PutEntity()` and `GetEntity()` on top of existing `Put()` and `Get()`
- `make -j64 check`
**Crash Tests**
- `python3 tools/db_crashtest.py blackbox --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --inter
val=10`
- `python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 `
- `python3 tools/db_crashtest.py blackbox --cf_consistency --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --inter
val=10`
Reviewed By: ltamasi
Differential Revision: D49037040
Pulled By: jaykorean
fbshipit-source-id: a0648253ded6e91af7953de364ed3c6bf163626b
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11807
For now, RocksDB has limited support for using merge with wide columns: when a bunch of merge operands have to be applied to a wide-column base value, RocksDB currently passes only the value of the default column to the application's `MergeOperator`, which means there is no way to update any other columns during a merge. As a first step in making this more general, the patch adds a new API `FullMergeV3` to `MergeOperator`.
`FullMergeV3`'s interface enables applications to receive a plain, wide-column, or non-existent base value as merge input, and to produce a new plain value, a new wide-column value, or an existing operand as merge output. Note that there are no limitations on the column names and values if the merge result is a wide-column entity. Also, the interface is general in the sense that it makes it possible e.g. for a merge that takes a plain base value and some deltas to produce a wide-column entity as a result.
For backward compatibility, the default implementation of `FullMergeV3` falls back to `FullMergeV2` and implements the current logic where merge operands are applied to the default column of the base entity and any other columns are unchanged. (Note that with `FullMergeV3` in the `MergeOperator` interface, this behavior will become customizable.)
This patch just introduces the new API and the default backward compatible implementation. I plan to integrate `FullMergeV3` into the query and compaction logic in subsequent diffs.
Reviewed By: jaykorean
Differential Revision: D49117253
fbshipit-source-id: 109e016f25cd130fc504790818d927bae7fec6bd
Summary:
There was a `#include "port/lang.h"` situated inside an `extern "C" {` which just started causing the header to be unusuable in some contexts. This was a regression on the CircleCI job build-linux-unity-and-headers in https://github.com/facebook/rocksdb/issues/11792
The include, and another like it, now appears obsolete so removed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11797
Test Plan: local `make check-headers` and `make`, CI
Reviewed By: jaykorean
Differential Revision: D48976826
Pulled By: pdillinger
fbshipit-source-id: 131d66969e045f2ded0f8936924ee30c9ef2655a
Summary:
For a SST file that uses user-defined timestamp aware comparators, if a lower or upper bound is set, sst_dump tool doesn't handle it well. This PR adds support for that. While working on this `MaybeAddTimestampsToRange` is moved to the udt_util.h file to be shared.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11757
Test Plan:
make all check
for changes in db_impl.cc and db_impl_compaction_flush.cc
for changes in sst_file_dumper.cc, I manually tested this change handles specifying bounds for UDT use cases. It probably should have a unit test file eventually.
Reviewed By: ltamasi
Differential Revision: D48668048
Pulled By: jowlyzhang
fbshipit-source-id: 1560465f40e44668d6d82a7439fe9012be0e74a8
Summary:
This PR adds a missing piece for the UDT in memtable only feature, which is to automatically increase `full_history_ts_low` when flush happens during recovery.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11774
Test Plan:
Added unit test
make all check
Reviewed By: ltamasi
Differential Revision: D48799109
Pulled By: jowlyzhang
fbshipit-source-id: fd681ed66d9d40904ca2c919b2618eb692686035
Summary:
Having a synthetic implementation of `TimedWait()` in `SystemClock` will allow us to add `SyncPoint`s while mutex is released, which was previously impossible since the lock was released and reacquired all within `pthread_cond_timedwait()`. Additionally, integrating `TimedWait()` with `MockSystemClock` allows us to cleanup some workarounds in the test code. In this PR I only cleaned up the `GenericRateLimiter` test workaround.
This is related to the intended follow-up mentioned in https://github.com/facebook/rocksdb/issues/7101's description. There are a couple differences:
(1) This PR does not include removing the particular workaround that initially motivated it. Actually, the `Timer` class uses `InstrumentedCondVar`, so the interface introduced here is inadequate to remove that workaround. On the bright side, the interface introduced in this PR can be changed as needed since it can neither be used nor extended externally, due to using forward-declared `port::CondVar*` in the interface.
(2) This PR only makes the change in `SystemClock` not `Env`. Older revisions of this PR included `Env::TimedWait()` and `SpecialEnv::TimedWait()`; however, since they were unused it probably makes sense to defer adding them until when they are needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11753
Reviewed By: pdillinger
Differential Revision: D48654995
Pulled By: ajkr
fbshipit-source-id: 15e19f2454b64d4ec7f50e328691c66ca9911122
Summary:
* JemallocNodumpAllocator was passing a size_t to FastRange32, which could cause compilation errors or warnings (seen with clang)
* Fixed the order of arguments to match what would be used with modulo operator (%), for clarity.
Fixes https://github.com/facebook/rocksdb/issues/11006
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11707
Test Plan: no functional change, existing tests
Reviewed By: ajkr
Differential Revision: D48435149
Pulled By: pdillinger
fbshipit-source-id: e6e8b107ded4eceda37db20df59985c846a2546b
Summary:
As titled. User-defined timestamp feature users sometimes directly call the user comparator to do validation on their side too. Having access to the root comparator can help make their code consistent for when UDT is enabled and disabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11704
Reviewed By: ltamasi
Differential Revision: D48355090
Pulled By: jowlyzhang
fbshipit-source-id: 26bc73543bfb379ef548d1361803d6f8c308cef6
Summary:
**Context/Summary:**
- Similar to https://github.com/facebook/rocksdb/pull/11288 but for user read such as `Get(), MultiGet(), DBIterator::XXX(), Verify(File)Checksum()`.
- For this, I refactored some user-facing `MultiGet` calls in `TransactionBase` and various types of `DB` so that it does not call a user-facing `Get()` but `GetImpl()` for passing the `ReadOptions::io_activity` check (see PR conversation)
- New user read stats breakdown are guarded by `kExceptDetailedTimers` since measurement shows they have 4-5% regression to the upstream/main.
- Misc
- More refactoring: with https://github.com/facebook/rocksdb/pull/11288, we complete passing `ReadOptions/IOOptions` to FS level. So we can now replace the previously [added](https://github.com/facebook/rocksdb/pull/9424) `rate_limiter_priority` parameter in `RandomAccessFileReader`'s `Read/MultiRead/Prefetch()` with `IOOptions::rate_limiter_priority`
- Also, `ReadAsync()` call time is measured in `SST_READ_MICRO` now
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11444
Test Plan:
- CI fake db crash/stress test
- Microbenchmarking
**Build** `make clean && ROCKSDB_NO_FBCODE=1 DEBUG_LEVEL=0 make -jN db_basic_bench`
- google benchmark version: 604f6fd3f4
- db_basic_bench_base: upstream
- db_basic_bench_pr: db_basic_bench_base + this PR
- asyncread_db_basic_bench_base: upstream + [db basic bench patch for IteratorNext](https://github.com/facebook/rocksdb/compare/main...hx235:rocksdb:micro_bench_async_read)
- asyncread_db_basic_bench_pr: asyncread_db_basic_bench_base + this PR
**Test**
Get
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{null_stat|base|pr} --benchmark_filter=DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/negative_query:0/enable_filter:0/mmap:1/threads:1 --benchmark_repetitions=1000
```
Result
```
Coming soon
```
AsyncRead
```
TEST_TMPDIR=/dev/shm ./asyncread_db_basic_bench_{base|pr} --benchmark_filter=IteratorNext/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/async_io:1/include_detailed_timers:0 --benchmark_repetitions=1000 > syncread_db_basic_bench_{base|pr}.out
```
Result
```
Base:
1956,1956,1968,1977,1979,1986,1988,1988,1988,1990,1991,1991,1993,1993,1993,1993,1994,1996,1997,1997,1997,1998,1999,2001,2001,2002,2004,2007,2007,2008,
PR (2.3% regression, due to measuring `SST_READ_MICRO` that wasn't measured before):
1993,2014,2016,2022,2024,2027,2027,2028,2028,2030,2031,2031,2032,2032,2038,2039,2042,2044,2044,2047,2047,2047,2048,2049,2050,2052,2052,2052,2053,2053,
```
Reviewed By: ajkr
Differential Revision: D45918925
Pulled By: hx235
fbshipit-source-id: 58a54560d9ebeb3a59b6d807639692614dad058a
Summary:
BottomNBits() - there is a single fast instruction for this on x86 since BMI2, but testing with godbolt indicates you need at least GCC 10 for the compiler to choose that instruction from the obvious C++ code. https://godbolt.org/z/5a7Ysd41h
BitwiseAnd() - this is a convenience function that works around the language flaw that the type of the result of x & y is the larger of the two input types, when it should be the smaller. This can save some ugly static_cast.
I expect to use both of these in coming HyperClockCache developments, and have applied them in a couple of places in existing code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11660
Test Plan: unit tests added
Reviewed By: jowlyzhang
Differential Revision: D47935531
Pulled By: pdillinger
fbshipit-source-id: d148c43a1e51df4a1c549b93aaf2725a3f8d3bd6
Summary:
... to improve data integrity validation during compaction.
A new option `compaction_verify_record_count` is introduced for this verification and is enabled by default. One exception when the verification is not done is when a compaction filter returns kRemoveAndSkipUntil which can cause CompactionIterator to seek until some key and hence not able to keep track of the number of keys processed.
For expected number of input keys, we sum over the number of total keys - number of range tombstones across compaction input files (`CompactionJob::UpdateCompactionStats()`). Table properties are consulted if `FileMetaData` is not initialized for some input file. Since table properties for all input files were also constructed during `DBImpl::NotifyOnCompactionBegin()`, `Compaction::GetTableProperties()` is introduced to reduce duplicated code.
For actual number of keys processed, each subcompaction will record its number of keys processed to `sub_compact->compaction_job_stats.num_input_records` and aggregated when all subcompactions finish (`CompactionJob::AggregateCompactionStats()`). In the case when some subcompaction encountered kRemoveAndSkipUntil from compaction filter and does not have accurate count, it propagates this information through `sub_compact->compaction_job_stats.has_num_input_records`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11571
Test Plan:
* Add a new unit test `DBCompactionTest.VerifyRecordCount` for the corruption case.
* All other unit tests for non-corrupted case.
* Ran crash test for a few hours: `python3 ./tools/db_crashtest.py whitebox --simple`
Reviewed By: ajkr
Differential Revision: D47131965
Pulled By: cbi42
fbshipit-source-id: cc8e94565dd526c4347e9d3843ecf32f6727af92
Summary:
Add a built-in comparator that supports uint64_t style user-defined timestamps for ReverseBytewiseComparator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11647
Test Plan:
Added a test wrapper for retrieving this comparator from registry and used it in this test:
`./udt_util_test`
Reviewed By: ltamasi
Differential Revision: D47848303
Pulled By: jowlyzhang
fbshipit-source-id: 5af5534a8c2d9195997d0308c8e194c1c797548c
Summary:
Fix use_after_free bug in async_io MultiReads when underlying FS enabled kFSBuffer. kFSBuffer is when underlying FS pass their own buffer instead of using RocksDB scratch in FSReadRequest
Since it's an experimental feature, added a hack for now to fix the bug.
Planning to make public API change to remove const from the callback as it doesn't make sense to use const.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11645
Test Plan: tested locally
Reviewed By: ltamasi
Differential Revision: D47819907
Pulled By: akankshamahajan15
fbshipit-source-id: 1faf5ef795bf27e2b3a60960374d91274931df8d
Summary:
Add support to allow enabling / disabling user-defined timestamps feature for an existing column family in combination with the in-Memtable only feature.
To do this, this PR includes:
1) Log the `persist_user_defined_timestamps` option per column family in Manifest to facilitate detecting an attempt to enable / disable UDT. This entry is enforced to be logged in the same VersionEdit as the user comparator name entry.
2) User-defined timestamps related options are validated when re-opening a column family, including user comparator name and the `persist_user_defined_timestamps` flag. These type of settings and settings change are considered valid:
a) no user comparator change and no effective `persist_user_defined_timestamp` flag change.
b) switch user comparator to enable UDT provided the immediately effective `persist_user_defined_timestamps` flag
is false.
c) switch user comparator to disable UDT provided that the before-change `persist_user_defined_timestamps` is
already false.
3) when an attempt to enable UDT is detected, we mark all its existing SST files as "having no UDT" by marking its `FileMetaData.user_defined_timestamps_persisted` flag to false and handle their file boundaries `FileMetaData.smallest`, `FileMetaData.largest` by padding a min timestamp.
4) while enabling / disabling UDT feature, timestamp size inconsistency in existing WAL logs are handled to make it compatible with the running user comparator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11623
Test Plan:
```
make all check
./db_with_timestamp_basic_test --gtest-filter="*EnableDisableUDT*"
./db_wal_test --gtest_filter="*EnableDisableUDT*"
```
Reviewed By: ltamasi
Differential Revision: D47636862
Pulled By: jowlyzhang
fbshipit-source-id: dcd19f67292da3c3cc9584c09ad00331c9ab9322
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11624
<queue> must be included to use std::queue.
Reviewed By: pdillinger
Differential Revision: D47562433
fbshipit-source-id: 7c5b19fd9e411694c782dfc0dff0231d4f92ef24
Summary:
I made some changes to add OpenBSD support.
Second time doing something like this, so I apologize in advance if I'm doing something wrong (had some minor hiccups with how github worked).
Fixes https://github.com/facebook/rocksdb/issues/11220
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11255
Reviewed By: akankshamahajan15
Differential Revision: D46361706
Pulled By: ajkr
fbshipit-source-id: 90922fa30197fe6d6f3c0e3ecca2dbb92c337277
Summary:
Switch from std::unordered_map to RocksDB UnorderedMap for all the places that logging user-defined timestamp size in WAL used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11507
Test Plan:
```
make all check
```
Reviewed By: ltamasi
Differential Revision: D46448975
Pulled By: jowlyzhang
fbshipit-source-id: bdb4d56a723b697a33daaf0f856a61d49a367a99
Summary:
Start logging the timestamp size record in WAL and use the record during recovery. Currently, user comparator cannot be different from what was used to create a column family, so the timestamp size record is just used to confirm it's consistent with the timestamp size the running user comparator indicates.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11471
Test Plan:
```
make all check
./db_secondary_test
./db_wal_test --gtest_filter="*WithTimestamp*"
./repair_test --gtest_filter="*WithTimestamp*"
```
Reviewed By: ltamasi
Differential Revision: D46236769
Pulled By: jowlyzhang
fbshipit-source-id: f6c60b5c8defdb05021c63df302ccc0be1275ad0
Summary:
Currently it's easy to use a ton of memory with many small OptimisticTransactionDB instances, because each one by default allocates a million mutexes (40 bytes each on my compiler) for validating transactions. It even puts a lot of pressure on the allocator by allocating each one individually!
In this change:
* Create a new object and option that enables sharing these buckets of mutexes between instances. This is generally good for load balancing potential contention as various DBs become hotter or colder with txn writes. About the only cases where this sharing wouldn't make sense (e.g. each DB usually written by one thread) are cases that would be better off with OccValidationPolicy::kValidateSerial which doesn't use the buckets anyway.
* Allocate the mutexes in a contiguous array, for efficiency
* Add an option to ensure the mutexes are cache-aligned. In several other places we use cache-aligned mutexes but OptimisticTransactionDB historically does not. It should be a space-time trade-off the user can choose.
* Provide some visibility into the memory used by the mutex buckets with an ApproximateMemoryUsage() function (also used in unit testing)
* Share code with other users of "striped" mutexes, appropriate refactoring for customization & efficiency (e.g. using FastRange instead of modulus)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11439
Test Plan: unit tests added. Ran sized-up versions of stress test in unit test, including a before-and-after performance test showing no consistent difference. (NOTE: OptimisticTransactionDB not currently covered by db_stress!)
Reviewed By: ltamasi
Differential Revision: D45796393
Pulled By: pdillinger
fbshipit-source-id: ae2b3a26ad91ceeec15debcdc63ff48df6736a54
Summary:
Add a util method `HandleWriteBatchTimestampSizeDifference` to handle a `WriteBatch` read from WAL log when user-defined timestamp size record is written and read. Two check modes are added: `kVerifyConsistency` that just verifies the recorded timestamp size are consistent with the running ones. This mode is to be used by `db_impl_secondary` for opening a DB as secondary instance. It will also be used by `db_impl_open` before the user comparator switch support is added to make a column switch between enabling/disable UDT feature. The other mode `kReconcileInconsistency` will be used by `db_impl_open` later when user comparator can be changed.
Another change is to extract a method `CollectColumnFamilyIdsFromWriteBatch` in db_secondary_impl.h into its standalone util file so it can be shared.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11451
Test Plan:
```
make check
./udt_util_test
```
Reviewed By: ltamasi
Differential Revision: D45894386
Pulled By: jowlyzhang
fbshipit-source-id: b96790777f154cddab6d45d9ba2e5d20ebc6fe9d
Summary:
We want to know more about opportunities for better range filters, and the effectiveness of our own range filters. Currently the stats are very limited, essentially logging just hits and misses against prefix filters for range scans in BLOOM_FILTER_PREFIX_* without tracking the false positive rate. Perhaps confusingly, when prefix filters are used for point queries, the stats are currently going into the non-PREFIX tickers.
This change does several things:
* Introduce new stat tickers for seeks and related filtering, \*LEVEL_SEEK\*
* Most importantly, allows us to see opportunities for range filtering. Specifically, we can count how many times a seek in an SST file accesses at least one data block, and how many times at least one value() is then accessed. If a data block was accessed but no value(), we can generally assume that the key(s) seen was(were) not of interest so could have been filtered with the right kind of filter, avoiding the data block access.
* We can get the same level of detail when a filter (for now, prefix Bloom/ribbon) is used, or not. Specifically, we can infer a false positive rate for prefix filters (not available before) from the seek "false positive" rate: when a data block is accessed but no value() is called. (There can be other explanations for a seek false positive, but in typical iterator usage it would indicate a filter false positive.)
* For efficiency, I wanted to avoid making additional calls to the prefix extractor (or key comparisons, etc.), which would be required if we wanted to more precisely detect filter false positives. I believe that instrumenting value() is the best balance of efficiency vs. accurately measuring what we are often interested in.
* The stats are divided between last level and non-last levels, to help understand potential tiered storage use cases.
* The old BLOOM_FILTER_PREFIX_* stats have a different meaning: no longer referring to iterators but to point queries using prefix filters. BLOOM_FILTER_PREFIX_TRUE_POSITIVE is added for computing the prefix false positive rate on point queries, which can be due to filter false positives as well as different keys with the same prefix.
* Similarly, the non-PREFIX BLOOM_FILTER stats are now for whole key filtering only.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11460
Test Plan:
unit tests updated, including updating many to pop the stat value since last read to improve test
readability and maintainability.
Performance test shows a consistent small improvement with these changes, both with clang and with gcc. CPU profile indicates that RecordTick is using less CPU, and this makes sense at least for a high filter miss rate. Before, we were recording two ticks per filter miss in iterators (CHECKED & USEFUL) and now recording just one (FILTERED).
Create DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8
```
And run simultaneous before&after with
```
TEST_TMPDIR=/dev/shm ./db_bench -readonly -benchmarks=seekrandom[-X1000] -num=10000000 -bloom_bits=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -seek_nexts=1 -duration=20 -seed=43 -threads=8 -cache_size=1000000000 -statistics
```
Before: seekrandom [AVG 275 runs] : 189680 (± 222) ops/sec; 18.4 (± 0.0) MB/sec
After: seekrandom [AVG 275 runs] : 197110 (± 208) ops/sec; 19.1 (± 0.0) MB/sec
Reviewed By: ajkr
Differential Revision: D46029177
Pulled By: pdillinger
fbshipit-source-id: cdace79a2ea548d46c5900b068c5b7c3a02e5822
Summary:
Apply a small (and automatic) set of IntelliJ Java inspections/repairs to the Java interface to RocksDB Java and its tests.
Partly enabled by the fact that we now (from RocksDB7) require java 8.
Explicit <p> in empty lines in javadoc comments.
Parameters and variables made final where possible.
Anonymous subclasses converted lambdas.
Some tests which previously used other assertion models were converted to assertj, e.g. (assertThat(actual).isEqualTo(expected)
In a very few cases tests were found to be inoperative or broken, and were repaired. No problems with actual RocksDB behaviour were observed.
This PR is intended to replace https://github.com/facebook/rocksdb/pull/9618 - that PR was not merged, and attempts to rebase it have yielded a questionable looking diff, so we choose to go back to square 1 here, and implement a conservative set of changes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10951
Reviewed By: anand1976
Differential Revision: D45057849
Pulled By: ajkr
fbshipit-source-id: e4ea46bfc80518ae86f37702b03ca9352bc11c3d