Commit graph

11391 commits

Author SHA1 Message Date
Baptiste Lemaire 5879053fd0 Dynamically changeable MemPurge option (#10011)
Summary:
**Summary**
Make the mempurge option flag a Mutable Column Family option flag. Therefore, the mempurge feature can be dynamically toggled.

**Motivation**
RocksDB users prefer having the ability to switch features on and off without having to close and reopen the DB. This is particularly important if the feature causes issues and needs to be turned off. Dynamically changing a DB option flag does not seem currently possible.
Moreover, with this new change, the MemPurge feature can be toggled on or off independently between column families, which we see as a major improvement.

**Content of this PR**
This PR includes removal of the `experimental_mempurge_threshold` flag as a DB option flag, and its re-introduction as a `MutableCFOption` flag. I updated the code to handle dynamic changes of the flag (in particular inside the `FlushJob` file). Additionally, this PR includes a new test to demonstrate the capacity of the code to toggle the MemPurge feature on and off, as well as the addition in the `db_stress` module of 2 different mempurge threshold values (0.0 and 1.0) that can be randomly changed with the `set_option_one_in` flag. This is useful to stress test the dynamic changes.

**Benchmarking**
I will add numbers to prove that there is no performance impact within the next 12 hours.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10011

Reviewed By: pdillinger

Differential Revision: D36462357

Pulled By: bjlemaire

fbshipit-source-id: 5e3d63bdadf085c0572ecc2349e7dd9729ce1802
2022-06-23 09:42:18 -07:00
Gang Liao 2352e2dfda Add the blob cache to the stress tests and the benchmarking tool (#10202)
Summary:
In order to facilitate correctness and performance testing, we would like to add the new blob cache to our stress test tool `db_stress` and our continuously running crash test script `db_crashtest.py`, as well as our synthetic benchmarking tool `db_bench` and the BlobDB performance testing script `run_blob_bench.sh`.
As part of this task, we would also like to utilize these benchmarking tools to get some initial performance numbers about the effectiveness of caching blobs.

This PR is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10202

Reviewed By: ltamasi

Differential Revision: D37325739

Pulled By: gangliao

fbshipit-source-id: deb65d0d414502270dd4c324d987fd5469869fa8
2022-06-22 16:04:03 -07:00
Bo Wang c073ed7601 Fix typo in comments and code (#10233)
Summary:
Fix typo in comments and code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10233

Test Plan: Existing unit tests should pass.

Reviewed By: jay-zhuang, anand1976

Differential Revision: D37356702

Pulled By: gitbw95

fbshipit-source-id: 32c019adcc6dcc95a9882b38147a310091368e51
2022-06-22 15:45:21 -07:00
Yueh-Hsuan Chiang e103b87296 Add get_column_family_metadata() and related functions to C API (#10207)
Summary:
* Add metadata related structs and functions in C API, including
  - `rocksdb_get_column_family_metadata()` and `rocksdb_get_column_family_metadata_cf()`
     that returns `rocksdb_column_family_metadata_t`.
  - `rocksdb_column_family_metadata_t` and its get functions & destroy function.
  - `rocksdb_level_metadata_t` and its and its get functions & destroy function.
  - `rocksdb_file_metadata_t` and its and get functions & destroy functions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10207

Test Plan:
Extend the existing c_test.c to include additional checks for column_family_metadata
inside CheckCompaction.

Reviewed By: riversand963

Differential Revision: D37305209

Pulled By: ajkr

fbshipit-source-id: 0a5183206353acde145f5f9b632c3bace670aa6e
2022-06-22 15:00:28 -07:00
Alan Paxton a16e2ff82a Adapt benchmark result script to new fields. (#10120)
Summary:
Recently merged CI benchmark scripts were failing.

There has clearly been a major revision of the fields of benchmark output. The upload script expects and sanity-checks the existence of some fields (changes date to conform to OpenSearch format)..., so the script needs to change.

Also add a bit more exception checking to make it more obvious when this happens again.

We have deleted the existing report.tsv from the benchmark machine. An existing report.tsv is appended to by default, so that if the fields change, later rows no longer match the header. This makes for an upload that dies half way through the report file, when the format no longer matches the header.

Re-instate the config.yml for running the benchmarks, so we can once again test it in situ.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10120

Reviewed By: pdillinger

Differential Revision: D37314908

Pulled By: jay-zhuang

fbshipit-source-id: 34f5243fee694b75c6838eb55d3398e4273254b2
2022-06-22 09:26:13 -07:00
Yanqin Jin 36fefd7e22 Continue to deflake BackupEngineTest.Concurrency (#10228)
Summary:
Even after https://github.com/facebook/rocksdb/issues/10069, `BackupEngineTest.Concurrency` is still flaky with decreased probability of failure.

Repro steps as follows
```bash
make backup_engine_test
gtest-parallel -r 1000 -w 64 ./backup_engine_test --gtest_filter=BackupEngineTest.Concurrency
```

The first two commits of this PR demonstrate how the test is flaky. https://github.com/facebook/rocksdb/issues/10069 handles the case in which
`Rename()` file returns `IOError` with subcode `PathNotFound`, and `CreateLoggerFromOptions()`
allows the operation to succeed, as expected by the test. However, `BackupEngineTest` uses
`RemapFileSystem` on top of `ChrootFileSystem` which can return `NotFound` instead of `IOError`.

This behavior is different from `Env::Default()` which returns PathNotFound if the src of `rename()`
does not exist. We should make the behaviors of the test Env/FS match a real Env/FS.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10228

Test Plan:
```bash
make check
gtest-parallel -r 1000 -w 64 ./backup_engine_test --gtest_filter=BackupEngineTest.Concurrency
```

Reviewed By: pdillinger

Differential Revision: D37337241

Pulled By: riversand963

fbshipit-source-id: 07a53115e424467b55a731866e571f0ad4c6635d
2022-06-22 08:50:05 -07:00
Yanqin Jin 9586dcf1ce Expose the initial logger creation error (#10223)
Summary:
https://github.com/facebook/rocksdb/issues/9984 changes the behavior of RocksDB: if logger creation failed during `SanitizeOptions()`,
`DB::Open()` will fail. However, since `SanitizeOptions()` is called in `DBImpl::DBImpl()`, we cannot
directly expose the error to caller without some additional work.
This is a first version proposal which:
- Adds a new member `init_logger_creation_s` to `DBImpl` to store the result of init logger creation
- Checks the error during `DB::Open()` and return it to caller if non-ok

This is not very ideal. We can alternatively move the logger creation logic out of the `SanitizeOptions()`.
Since `SanitizeOptions()` is used in other places, we need to check whether this change breaks anything
in case other callers of `SanitizeOptions()` assumes that a logger should be created.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10223

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D37321717

Pulled By: riversand963

fbshipit-source-id: 58042358a86369d606549dd9938933dd47591c4b
2022-06-22 08:26:38 -07:00
Yanqin Jin 42c631b339 Update API comment about Options::best_efforts_recovery (#10180)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10180

Reviewed By: pdillinger

Differential Revision: D37182037

Pulled By: riversand963

fbshipit-source-id: a8dc865b86e2249beb7a543c317e94a14781e910
2022-06-21 23:34:39 -07:00
Peter Dillinger 84210c9489 Add data block hash index to crash test, fix MultiGet issue (#10220)
Summary:
There was a bug in the MultiGet enhancement in https://github.com/facebook/rocksdb/issues/9899 with data
block hash index, which was not caught because data block hash index was
never added to stress tests. This change fixes both issues.

Fixes https://github.com/facebook/rocksdb/issues/10186

I intend to pick this into the 7.4.0 release candidate

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10220

Test Plan:
Failure quickly reproduces in crash test with
kDataBlockBinaryAndHash, and does not seem to with the fix. Reproducing
the failure with a unit test I believe would be too tricky and fragile
to be worthwhile.

Reviewed By: anand1976

Differential Revision: D37315647

Pulled By: pdillinger

fbshipit-source-id: 9f648265bba867275edc752f7a56611a59401cba
2022-06-21 16:23:58 -07:00
Yanqin Jin d654888b8f Refactor wal filter processing during recovery (#10214)
Summary:
So that DBImpl::RecoverLogFiles do not have to deal with implementation
details of WalFilter.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10214

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D37299122

Pulled By: riversand963

fbshipit-source-id: acf1a80f1ef75da393d375f55968b2f3ac189816
2022-06-21 14:51:56 -07:00
Bo Wang f7605ec655 Update LZ4 library for platform009 (#10224)
Summary:
Update LZ4 library for platform009.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10224

Test Plan: Current unit tests should pass.

Reviewed By: anand1976

Differential Revision: D37321801

Pulled By: gitbw95

fbshipit-source-id: 8a3d3019d9f7478ac737176f2d2f443c0159829e
2022-06-21 13:22:58 -07:00
zczhu 30141461f9 Add basic kRoundRobin compaction policy (#10107)
Summary:
Add `kRoundRobin` as a compaction priority. The implementation is as follows.

- Define a cursor as the smallest Internal key in the successor of the selected file. Add `vector<InternalKey> compact_cursor_` into `VersionStorageInfo` where each element (`InternalKey`) in `compact_cursor_` represents a cursor. In round-robin compaction policy, we just need to select the first file (assuming files are sorted) and also has the smallest InternalKey larger than/equal to the cursor. After a file is chosen, we create a new `Fsize` vector which puts the selected file is placed at the first position in `temp`, the next cursor is then updated as the smallest InternalKey in successor of the selected file (the above logic is implemented in `SortFileByRoundRobin`).
- After a compaction succeeds, typically `InstallCompactionResults()`, we choose the next cursor for the input level and save it to `edit`. When calling `LogAndApply`, we save the next cursor with its level into some local variable and finally apply the change to `vstorage` in `SaveTo` function.
- Cursors are persist pair by pair (<level, InternalKey>) in `EncodeTo` so that they can be reconstructed when reopening. An empty cursor will not be encoded to MANIFEST

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10107

Test Plan: add unit test (`CompactionPriRoundRobin`) in `compaction_picker_test`, add `kRoundRobin` priority in `CompactionPriTest` from `db_compaction_test`, and add `PersistRoundRobinCompactCursor` in `db_compaction_test`

Reviewed By: ajkr

Differential Revision: D37316037

Pulled By: littlepig2013

fbshipit-source-id: 9f481748190ace416079139044e00df2968fb1ee
2022-06-21 11:56:53 -07:00
Yanqin Jin b012d23557 Destroy iniital db dir for a test in DBWALTest (#10221)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10221

Reviewed By: hx235

Differential Revision: D37316280

Pulled By: riversand963

fbshipit-source-id: 062781acec2f36beebc62003bcc8ec280488d572
2022-06-21 11:27:10 -07:00
Guido Tagliavini Ponce 3afed7408c Replace per-shard chained hash tables with open-addressing scheme (#10194)
Summary:
In FastLRUCache, we replace the current chained per-shard hash table by an open-addressing hash table. In particular, this allows us to preallocate all handles.

Because all handles are preallocated, this implementation doesn't support strict_capacity_limit = false (i.e., allowing insertions beyond the predefined capacity). This clashes with current assumptions of some tests, namely two tests in cache_test and the crash tests. We have disabled these for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10194

Test Plan: ``make -j24 check``

Reviewed By: pdillinger

Differential Revision: D37296770

Pulled By: guidotag

fbshipit-source-id: 232ff1b8260331d868ebf4e3e5d8ad709390b0ad
2022-06-21 08:45:04 -07:00
Gang Liao deff48bcef Add blob source to retrieve blobs in RocksDB (#10198)
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
In this task, we formally introduced the blob source to RocksDB.  BlobSource is a new abstraction layer that provides universal access to blobs, regardless of whether they are in the blob cache, secondary cache, or (remote) storage. Depending on user settings, it always fetch blobs from multi-tier cache and storage with minimal cost.

Note: The new `MultiGetBlob()` implementation is not included in the current PR. To go faster, we aim to create a separate PR for it in parallel!

This PR is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10198

Reviewed By: ltamasi

Differential Revision: D37294735

Pulled By: gangliao

fbshipit-source-id: 9cb50422d9dd1bc03798501c2778b6c7520c7a1e
2022-06-20 20:58:11 -07:00
sdong 4207872fc3 Reduce a duplicate consistency check when applying a new version (#10169)
Summary:
One consistency check in SaveTo() is dupilcated with the one within Apply(). Remove one of then in release mode to reduce time spent in DB mutex.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10169

Test Plan: Run existing tests and see nothing breaks.

Reviewed By: ltamasi

Differential Revision: D37157821

fbshipit-source-id: 73b89443a20b43362ff66d10b9212022034a8234
2022-06-20 19:15:59 -07:00
Levi Tamasi 8f59c41cc7 Add new value value type for wide-column entities (#10211)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10211

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D37294067

Pulled By: ltamasi

fbshipit-source-id: 3b26f1964746ba4e3654579cb07cd975a29c7319
2022-06-20 18:04:08 -07:00
Peter Dillinger 501543573a Fix bad include (#10213)
Summary:
include "include/rocksdb/blah.h" is messing up some internal
builds vs. include "rocksdb/blah." This fixes the bad case and adds a
check for future instances.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10213

Test Plan: back-port to 7.4 release candidate and watch internal build

Reviewed By: hx235

Differential Revision: D37296202

Pulled By: pdillinger

fbshipit-source-id: d7cc6b2c57d858dff0444f19320d83c8b4f9b185
2022-06-20 17:42:01 -07:00
Peter Dillinger ccb4f047ae Add 7.4 to format compatibility test (#10209)
Summary:
Forgotten in https://github.com/facebook/rocksdb/issues/10204

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10209

Test Plan: local run with SHORT_TEST=1

Reviewed By: hx235

Differential Revision: D37284028

Pulled By: pdillinger

fbshipit-source-id: 631c1969906d002acc930662dcd5eefc0c758429
2022-06-20 13:13:37 -07:00
Peter Dillinger 6358e1b967 Start release 7.5 development (#10204)
Summary:
Update HISTORY.md and version.h

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10204

Test Plan: version bump only

Reviewed By: ajkr

Differential Revision: D37271866

Pulled By: pdillinger

fbshipit-source-id: 0ccaa2af36648a5b6017c172a7826a244e1aec93
2022-06-20 07:12:39 -07:00
Peter Dillinger fac7a23685 Update HISTORY for 7.4.0 release freeze (#10196)
Summary:
Planned for Sunday 6/19

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10196

Test Plan: no code

Reviewed By: akankshamahajan15

Differential Revision: D37244857

Pulled By: pdillinger

fbshipit-source-id: afbf4aa201983b3c01c16b5f55c68f2325d17421
2022-06-19 16:31:16 -07:00
Changyu Bi 0e0a19832e Fix a bug in WriteBatchInternal::Append when write batch KV protection is turned on (#10201)
Summary:
This bug was discovered after write batch checksum verification before WAL is added (https://github.com/facebook/rocksdb/issues/10114) and stress test with write batch checksum protection is turned on (https://github.com/facebook/rocksdb/issues/10037). In this [line](d5d8920f2c/db/write_batch.cc (L2887)), the number of checksums may not be consistent with `batch->Count()`. This PR fixes this issue.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10201

Test Plan:
```
./db_stress --batch_protection_bytes_per_key=8 --destroy_db_initially=1 --max_key=100000 --use_txn=1
```

Reviewed By: ajkr

Differential Revision: D37260799

Pulled By: cbi42

fbshipit-source-id: ff8dce7dcce295d689333bc9d892d17a843bf0ea
2022-06-18 15:12:17 -07:00
Andrew Kryczka d5d8920f2c Fix race condition with WAL tracking and FlushWAL(true /* sync */) (#10185)
Summary:
`FlushWAL(true /* sync */)` is used internally and for manual WAL sync. It had a bug when used together with `track_and_verify_wals_in_manifest` where the synced size tracked in MANIFEST was larger than the number of bytes actually synced.

The bug could be repro'd almost immediately with the following crash test command: `python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --max_bytes_for_level_base=2097152 --target_file_size_base=524288 --duration=3600 --interval=10 --sync_fault_injection=1 --disable_wal=0 --checkpoint_one_in=1000 --max_key=10000 --value_size_mult=33`.

An example error message produced by the above command is shown below. The error sometimes arose from the checkpoint and other times arose from the main stress test DB.

```
Corruption: Size mismatch: WAL (log number: 119) in MANIFEST is 27938 bytes , but actually is 27859 bytes on disk.
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10185

Test Plan:
- repro unit test
- the above crash test command no longer finds the error. It does find a different error after a while longer such as "Corruption: WAL file 481 required by manifest but not in directory list"

Reviewed By: riversand963

Differential Revision: D37200993

Pulled By: ajkr

fbshipit-source-id: 98e0071c1a89f4d009888512ed89f9219779ae5f
2022-06-17 16:45:28 -07:00
Hui Xiao a5d773e077 Add rate-limiting support to batched MultiGet() (#10159)
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/9424 added rate-limiting support for user reads, which does not include batched `MultiGet()`s that call `RandomAccessFileReader::MultiRead()`. The reason is that it's harder (compared with RandomAccessFileReader::Read()) to implement the ideal rate-limiting where we first call `RateLimiter::RequestToken()` for allowed bytes to multi-read and then consume those bytes by satisfying as many requests in `MultiRead()` as possible. For example, it can be tricky to decide whether we want partially fulfilled requests within one `MultiRead()` or not.

However, due to a recent urgent user request, we decide to pursue an elementary (but a conditionally ineffective) solution where we accumulate enough rate limiter requests toward the total bytes needed by one `MultiRead()` before doing that `MultiRead()`. This is not ideal when the total bytes are huge as we will actually consume a huge bandwidth from rate-limiter causing a burst on disk. This is not what we ultimately want with rate limiter. Therefore a follow-up work is noted through TODO comments.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10159

Test Plan:
- Modified existing unit test `DBRateLimiterOnReadTest/DBRateLimiterOnReadTest.NewMultiGet`
- Traced the underlying system calls `io_uring_enter` and verified they are 10 seconds apart from each other correctly under the setting of  `strace -ftt -e trace=io_uring_enter ./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb2 -readonly -num=50 -threads=1 -multiread_batched=1 -batch_size=100 -duration=10 -rate_limiter_bytes_per_sec=200 -rate_limiter_refill_period_us=1000000 -rate_limit_bg_reads=1 -disable_auto_compactions=1 -rate_limit_user_ops=1` where each `MultiRead()` read about 2000 bytes (inspected by debugger) and the rate limiter grants 200 bytes per seconds.
- Stress test:
   - Verified `./db_stress (-test_cf_consistency=1/test_batches_snapshots=1) -use_multiget=1 -cache_size=1048576 -rate_limiter_bytes_per_sec=10241024 -rate_limit_bg_reads=1 -rate_limit_user_ops=1` work

Reviewed By: ajkr, anand1976

Differential Revision: D37135172

Pulled By: hx235

fbshipit-source-id: 73b8e8f14761e5d4b77235dfe5d41f4eea968bcd
2022-06-17 16:40:47 -07:00
Gang Liao c965c9ef65 Read blob from blob cache if exists when GetBlob() (#10178)
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
In this task, we added a new abstraction layer `BlobSource` to retrieve blobs from either blob cache or raw blob file. Note: For simplicity, the current PR only includes `GetBlob()`.  `MultiGetBlob()` will be included in the next PR.

This PR is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10178

Reviewed By: ltamasi

Differential Revision: D37250507

Pulled By: gangliao

fbshipit-source-id: 3fc4a55a0cea955a3147bdc7dba06430e377259b
2022-06-17 15:22:59 -07:00
Peter Dillinger 1aac814578 Use optimized folly DistributedMutex in LRUCache when available (#10179)
Summary:
folly DistributedMutex is faster than standard mutexes though
imposes some static obligations on usage. See
https://github.com/facebook/folly/blob/main/folly/synchronization/DistributedMutex.h
for details. Here we use this alternative for our Cache implementations
(especially LRUCache) for better locking performance, when RocksDB is
compiled with folly.

Also added information about which distributed mutex implementation is
being used to cache_bench output and to DB LOG.

Intended follow-up:
* Use DMutex in more places, perhaps improving API to support non-scoped
locking
* Fix linking with fbcode compiler (needs ROCKSDB_NO_FBCODE=1 currently)

Credit: Thanks Siying for reminding me about this line of work that was previously
left unfinished.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10179

Test Plan:
for correctness, existing tests. CircleCI config updated.
Also Meta-internal buck build updated.

For performance, ran simultaneous before & after cache_bench. Out of three
comparison runs, the middle improvement to ops/sec was +21%:

Baseline: USE_CLANG=1 DEBUG_LEVEL=0 make -j24 cache_bench (fbcode
compiler)

```
Complete in 20.201 s; Rough parallel ops/sec = 1584062
Thread ops/sec = 107176

Operation latency (ns):
Count: 32000000 Average: 9257.9421  StdDev: 122412.04
Min: 134  Median: 3623.0493  Max: 56918500
Percentiles: P50: 3623.05 P75: 10288.02 P99: 30219.35 P99.9: 683522.04 P99.99: 7302791.63
```

New: (add USE_FOLLY=1)

```
Complete in 16.674 s; Rough parallel ops/sec = 1919135  (+21%)
Thread ops/sec = 135487

Operation latency (ns):
Count: 32000000 Average: 7304.9294  StdDev: 108530.28
Min: 132  Median: 3777.6012  Max: 91030902
Percentiles: P50: 3777.60 P75: 10169.89 P99: 24504.51 P99.9: 59721.59 P99.99: 1861151.83
```

Reviewed By: anand1976

Differential Revision: D37182983

Pulled By: pdillinger

fbshipit-source-id: a17eb05f25b832b6a2c1356f5c657e831a5af8d1
2022-06-17 13:08:45 -07:00
Peter Dillinger f87adcfb3f Fix overflow in ribbon_bench after #10184 (#10195)
Summary:
Ribbon micro-bench needs updating after re-numbering
`BloomLikeFilterPolicy::GetAllFixedImpls()` entries. (CircleCI nightly
failure.)

Also fixed memory leaks while using ASAN to validate my fix. (I assume
the leaks weren't intentional for some performance characteristic.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10195

Test Plan: run with ASAN

Reviewed By: jay-zhuang

Differential Revision: D37244459

Pulled By: pdillinger

fbshipit-source-id: 5a363e10de3c4c9c88099c937e3dc3b4cf24fd30
2022-06-17 12:53:57 -07:00
Andrew Kryczka 5d6005c780 Add WriteOptions::protection_bytes_per_key (#10037)
Summary:
Added an option, `WriteOptions::protection_bytes_per_key`, that controls how many bytes per key we use for integrity protection in `WriteBatch`. It takes effect when `WriteBatch::GetProtectionBytesPerKey() == 0`.

Currently the only supported value is eight. Invoking a user API with it set to any other nonzero value will result in `Status::NotSupported` returned to the user.

There is also a bug fix for integrity protection with `inplace_callback`, where we forgot to take into account the possible change in varint length when calculating KV checksum for the final encoded buffer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10037

Test Plan:
- Manual
  - Set default value of `WriteOptions::protection_bytes_per_key` to eight and ran `make check -j24`
  - Enabled in MyShadow for 1+ week
- Automated
  - Unit tests have a `WriteMode` that enables the integrity protection via `WriteOptions`
  - Crash test - in most cases, use `WriteOptions::protection_bytes_per_key` to enable integrity protection

Reviewed By: cbi42

Differential Revision: D36614569

Pulled By: ajkr

fbshipit-source-id: 8650087ceac9b61b560f1e5fafe5e1baf9c725fb
2022-06-16 23:10:07 -07:00
Peter Dillinger f62c1e1e56 Fix a false negative merge conflict (#10192)
Summary:
.. between https://github.com/facebook/rocksdb/issues/10184 and https://github.com/facebook/rocksdb/issues/10122 not detected by source control,
leading to non-compiling code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10192

Test Plan: updated test

Reviewed By: hx235

Differential Revision: D37231921

Pulled By: pdillinger

fbshipit-source-id: fa21488716f4c006b111b8c4127d71c757c935c3
2022-06-16 21:14:10 -07:00
Changyu Bi 8cf86258b8 Update HISTORY.md for #10114 (#10189)
Summary:
Update HISTORY.md for https://github.com/facebook/rocksdb/issues/10114: write batch checksum verification before writing to WAL.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10189

Reviewed By: ajkr

Differential Revision: D37226366

Pulled By: cbi42

fbshipit-source-id: cd2f076961abc35f35783e0f2cc3beda68cdb446
2022-06-16 19:59:26 -07:00
Peter Dillinger fff302d989 More testing w/prefix extractor, small refactor (#10122)
Summary:
There was an interesting code path not covered by testing that
is difficult to replicate in a unit test, which is now covered using a
sync point. Specifically, the case of table_prefix_extractor == null and
!need_upper_bound_check in `BlockBasedTable::PrefixMayMatch`, which
can happen if table reader is open before extractor is registered with global
object registry, but is later registered and re-set with SetOptions. (We
don't have sufficient testing control over object registry to set that up
repeatedly.)

Also, this function has been renamed to `PrefixRangeMayMatch` for clarity
vs. other functions that are not the same.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10122

Test Plan: unit tests expanded

Reviewed By: siying

Differential Revision: D36944834

Pulled By: pdillinger

fbshipit-source-id: 9e52d9da1929a3e42bbc230fcdc3599949de7bdb
2022-06-16 16:41:25 -07:00
Peter Dillinger 126c223714 Remove deprecated block-based filter (#10184)
Summary:
In https://github.com/facebook/rocksdb/issues/9535, release 7.0, we hid the old block-based filter from being created using
the public API, because of its inefficiency. Although we normally maintain read compatibility
on old DBs forever, filters are not required for reading a DB, only for optimizing read
performance. Thus, it should be acceptable to remove this code and the substantial
maintenance burden it carries as useful features are developed and validated (such
as user timestamp).

This change completely removes the code for reading and writing the old block-based
filters, net removing about 1370 lines of code no longer needed. Options removed from
testing / benchmarking tools. The prior existence is only evident in a couple of places:
* `CacheEntryRole::kDeprecatedFilterBlock` - We can update this public API enum in
a major release to minimize source code incompatibilities.
* A warning is logged when an old table file is opened that used the old block-based
filter. This is provided as a courtesy, and would be a pain to unit test, so manual testing
should suffice. Unfortunately, sst_dump does not tell you whether a file uses
block-based filter, and the structure of the code makes it very difficult to fix.
* To detect that case, `kObsoleteFilterBlockPrefix` (renamed from `kFilterBlockPrefix`)
for metaindex is maintained (for now).

Other notes:
* In some cases where numbers are associated with filter configurations, we have had to
update the assigned numbers so that they all correspond to something that exists.
* Fixed potential stat counting bug by assuming `filter_checked = false` for cases
like `filter == nullptr` rather than assuming `filter_checked = true`
* Removed obsolete `block_offset` and `prefix_extractor` parameters from several
functions.
* Removed some unnecessary checks `if (!table_prefix_extractor() && !prefix_extractor)`
because the caller guarantees the prefix extractor exists and is compatible

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10184

Test Plan:
tests updated, manually test new warning in LOG using base version to
generate a DB

Reviewed By: riversand963

Differential Revision: D37212647

Pulled By: pdillinger

fbshipit-source-id: 06ee020d8de3b81260ffc36ad0c1202cbf463a80
2022-06-16 15:51:33 -07:00
anand76 a6691d0f65 Update stats to help users estimate MultiGet async IO impact (#10182)
Summary:
Add a couple of stats to help users estimate the impact of potential MultiGet perf improvements -
1. NUM_LEVEL_READ_PER_MULTIGET - A histogram stat for number of levels that required MultiGet to read from a file
2. MULTIGET_COROUTINE_COUNT - A ticker stat to count the number of times the coroutine version of MultiGetFromSST was used

The NUM_DATA_BLOCKS_READ_PER_LEVEL stat is obsoleted as it doesn't provide useful information for MultiGet optimization.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10182

Reviewed By: akankshamahajan15

Differential Revision: D37213296

Pulled By: anand1976

fbshipit-source-id: 5d2b7708017c0e278578ae4bffac3926f6530efb
2022-06-16 12:12:43 -07:00
Yanqin Jin 4d31d3c2ed Abort in dbg mode after logging (#10183)
Summary:
In CompactionIterator code, there are multiple places where the process
will abort in dbg mode before logging the error message describing the
cause. This PR changes only the logging behavior for compaction iterator so
that error message is written to LOG before the process aborts in debug
mode.

Also updated the triggering condition for an assertion for single delete with
user-defined timestamp.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10183

Test Plan: make check

Reviewed By: akankshamahajan15

Differential Revision: D37190218

Pulled By: riversand963

fbshipit-source-id: 741bb007067be7cfbe94ac9e530ad4b2b339c009
2022-06-15 22:00:24 -07:00
Akanksha Mahajan 8353ae8b27 Add few optimizations in async_io for short scans (#10140)
Summary:
This PR adds few optimizations for async_io for shorter scans.
1.  If async_io is enabled, seek would create FilePrefetchBuffer object to fetch the data asynchronously. However `FilePrefetchbuffer::num_file_reads_` wasn't taken into consideration if it calls Next after Seek and would go for Prefetching.  This PR fixes that and Next will go for prefetching only if `FilePrefetchbuffer::num_file_reads_` is greater than 2 along with if blocks are sequential. This scenario is only for implicit auto readahead.
2. For seek, when it calls TryReadFromCacheAsync to poll it makes async call as well because TryReadFromCacheAsync flow wasn't changed. So I updated to return after poll instead of further prefetching any data.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10140

Test Plan:
1. Added a unit test
                  2. Ran crash_test with async_io = 1 to make sure nothing crashes.

Reviewed By: anand1976

Differential Revision: D37042242

Pulled By: akankshamahajan15

fbshipit-source-id: b8e6b7cb2ee0886f37a8f53951948b9084e8ffda
2022-06-15 20:17:35 -07:00
Peter Dillinger 3d358a7e25 Fix handling of accidental truncation of IDENTITY file (#10173)
Summary:
A consequence of https://github.com/facebook/rocksdb/issues/9990 was requiring a non-empty DB ID to generate
new SST files. But if the DB ID is not tracked in the manifest and the IDENTITY file
is somehow truncated to 0 bytes, then an empty DB ID would be assigned, leading
to crash. This change ensures a non-empty DB ID is assigned and set in the
IDENTITY file.

Also,
* Some light refactoring to clean up the logic
* (I/O efficiency) If the ID is tracked in the manifest and already matches the
IDENTITY file, don't needlessly overwrite the file.
* (Debugging) Log the DB ID to info log on open, because sometimes IDENTITY
can change if DB is moved around (though it would be unusual for info log to
be copied/moved without IDENTITY file)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10173

Test Plan: unit tests expanded/updated

Reviewed By: ajkr

Differential Revision: D37176545

Pulled By: pdillinger

fbshipit-source-id: a9b414cd35bfa33de48af322a36c24538d50bef1
2022-06-15 15:39:49 -07:00
Peter Dillinger 94329ae4ec Use only ASCII in source files (#10164)
Summary:
Fix existing usage of non-ASCII and add a check to prevent
future use. Added `-n` option to greps to provide line numbers.

Alternative to https://github.com/facebook/rocksdb/issues/10147

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10164

Test Plan:
used new checker to find & fix cases, manually check
db_bench output is preserved

Reviewed By: akankshamahajan15

Differential Revision: D37148792

Pulled By: pdillinger

fbshipit-source-id: 68c8b57e7ab829369540d532590bf756938855c7
2022-06-15 14:44:43 -07:00
Changyu Bi 9882652b0e Verify write batch checksum before WAL (#10114)
Summary:
Context: WriteBatch can have key-value checksums when it was created `with protection_bytes_per_key > 0`.
This PR added checksum verification for write batches before they are written to WAL.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10114

Test Plan:
- Added new unit tests to db_kv_checksum_test.cc: `make check -j32`
- benchmark on performance regression: `./db_bench --benchmarks=fillrandom[-X20] -db=/dev/shm/test_rocksdb -write_batch_protection_bytes_per_key=8`
  - Pre-PR:
`
fillrandom [AVG    20 runs] : 198875 (± 3006) ops/sec;   22.0 (± 0.3) MB/sec
`
  - Post-PR:
`
fillrandom [AVG    20 runs] : 196487 (± 2279) ops/sec;   21.7 (± 0.3) MB/sec
`
  Mean regressed about 1% (198875 -> 196487 ops/sec).

Reviewed By: ajkr

Differential Revision: D36917464

Pulled By: cbi42

fbshipit-source-id: 29beb74edf65f04b1a890b4f650d873dc7ed790d
2022-06-15 13:43:58 -07:00
Ali Saidi 2e5a323dbd Change the instruction used for a pause on arm64 (#10118)
Summary:
While the yield instruction conseptually sounds correct on most platforms it is
a simple nop that doesn't delay the execution anywhere close to what an x86
pause instruction does. In other projects with spin-wait loops an isb has been
observed to be much closer to the x86 behavior.

On a Graviton3 system the following test improves on average by 2x with this
change averaged over 20 runs:

```
./db_bench  -benchmarks=fillrandom -threads=64 -batch_size=1
-memtablerep=skip_list -value_size=100 --num=100000
level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8
--disable_wal --write_buffer_size=160000000 --block_size=16384
--allow_concurrent_memtable_write -compression_type none
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10118

Reviewed By: jay-zhuang

Differential Revision: D37120578

fbshipit-source-id: c20bde4298222edfab7ff7cb6d42497e7012400d
2022-06-15 13:08:11 -07:00
sdong 69a32eecab Use madvise() for mmaped file advise (#10170)
Summary:
A recent PR https://github.com/facebook/rocksdb/pull/10142 enabled fadvise for mmaped file. However, we were told that it might not take effective and madvise() should be used.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10170

Test Plan:
Run existing tests
Run a benchmark using mmap with advise random and see I/O size is indeed small.

Reviewed By: anand1976

Differential Revision: D37158582

fbshipit-source-id: 8b3a74f0e89d2e16aac78ee4124c05841d4135c3
2022-06-15 13:05:58 -07:00
Yanqin Jin ce419c0f10 Allow db_bench and db_stress to set allow_data_in_errors (#10171)
Summary:
There is `Options::allow_data_in_errors` that controls whether RocksDB
is allowed to log data, e.g. key, value, etc in LOG files. It is false
by default. However, in db_bench and db_stress, it is often ok to log
data because there is no concern about privacy.

This PR allows db_stress and db_bench to set this option on the command
line, while it remains false by default. Furthermore, make
crash/recovery test driven by db_crashtest.py to opt-in.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10171

Test Plan: Stress test and db_bench

Reviewed By: hx235

Differential Revision: D37163787

Pulled By: riversand963

fbshipit-source-id: 0242f24d292ba15b6faf8ff903963b85d3e011f8
2022-06-15 12:38:04 -07:00
Akanksha Mahajan 19345de60d fix cancel argument for latest liburing (#10168)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10168

the arg changed to u64

Reviewed By: ajkr

Differential Revision: D37155407

fbshipit-source-id: 464eab2806675f148fce075a6fea369fa3d7a9bb
2022-06-15 09:10:19 -07:00
iseki 40dfa26049 Fix C4702 on windows (#10146)
Summary:
This code is unreachable when `ROCKSDB_LITE` not defined. And it cause build fail on my environment VS2019 16.11.15.
```
-- Selecting Windows SDK version 10.0.19041.0 to target Windows 10.0.19044.
-- The CXX compiler identification is MSVC 19.29.30145.0
-- The C compiler identification is MSVC 19.29.30145.0
-- The ASM compiler identification is MSVC
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10146

Reviewed By: akankshamahajan15

Differential Revision: D37112916

Pulled By: ajkr

fbshipit-source-id: e0b2bf3055d6fac1b3fb40b9f02c4cbae3f82757
2022-06-14 21:32:10 -07:00
mpoeter 77f4799515 Fix potential leak when reusing PinnableSlice instances. (#10166)
Summary:
`PinnableSlice` may hold a handle to a cache value which must be released to correctly decrement the ref-counter. However, when `PinnableSlice` variables are reused, e.g. like this:
```
PinnableSlice pin_slice;
db.Get("foo", &pin_slice);
db.Get("foo", &pin_slice);
```
then the second `Get` simply overwrites the old value in `pin_slice` and the handle returned by the first `Get` is _not_ released.

This PR adds `Reset` calls to the `Get`/`MultiGet` calls that accept `PinnableSlice` arguments to ensure proper cleanup of old values.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10166

Reviewed By: hx235

Differential Revision: D37151632

Pulled By: ajkr

fbshipit-source-id: 9dd3c3288300f560531b843f67db11aeb569a9ff
2022-06-14 21:29:52 -07:00
Ali Saidi b550fc0b09 Modify the instructions emited for PREFETCH on arm64 (#10117)
Summary:
__builtin_prefetch(...., 1) prefetches into the L2 cache on x86 while the same
emits a pldl3keep instruction on arm64 which doesn't seem to be close enough.

Testing on a Graviton3, and M1 system with memtablerep_bench fillrandom and
skiplist througpuh increased as follows adjusting the 1 to 2 or 3:
```
           1 -> 2     1 -> 3
----------------------------
Graviton3   +10%        +15%
M1          +10%        +10%
```

Given that prefetching into the L1 cache seems to help, I chose that conversion

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10117

Reviewed By: pdillinger

Differential Revision: D37120475

fbshipit-source-id: db1ef43f941445019c68316500a2250acc643d5e
2022-06-14 17:58:44 -07:00
James Tucker 751d1a3e48 mingw: remove no-asynchronous-unwind-tables (#9963)
Summary:
This default is generally incompatible with other parts of mingw, and
can be applied by outside users as-needed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9963

Reviewed By: akankshamahajan15

Differential Revision: D36302813

Pulled By: ajkr

fbshipit-source-id: 9456b41a96bde302bacbc39e092ccecfcb42f34f
2022-06-14 17:42:55 -07:00
Gang Liao cba398df8a Add blob cache option in the column family options (#10155)
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
This PR is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10155

Reviewed By: ltamasi

Differential Revision: D37150819

Pulled By: gangliao

fbshipit-source-id: b807c7916ea5d411588128f8e22a49f171388fe2
2022-06-14 14:19:26 -07:00
tabokie 1d2950b8dd fix a false positive case of parsing table factory from options file (#10094)
Summary:
During options file parsing, reset table factory before attempting to parse it
from string. This avoids mistakenly treating the default table factory as a
newly created one.

Signed-off-by: tabokie <xy.tao@outlook.com>

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10094

Reviewed By: akankshamahajan15

Differential Revision: D36945378

Pulled By: ajkr

fbshipit-source-id: 94b2604e5e87682063b4b78f6370f3e8f101dc44
2022-06-14 13:20:54 -07:00
Hui Xiao d665afdbf3 Account memory of FileMetaData in global memory limit (#9924)
Summary:
**Context/Summary:**
As revealed by heap profiling, allocation of `FileMetaData` for [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR is to account that toward our global memory limit based on block cache capacity.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924

Test Plan:
- Previous `make check` verified there are only 2 places where the memory of  the allocated `FileMetaData` can be released
- New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)`
- db bench (CPU cost of `charge_file_metadata` in write and compact)
   - **write micros/op: -0.24%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'`
   - **compact micros/op -0.87%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 | egrep 'compact'`

table 1 - write

#-run | (pre-PR) avg micros/op | std micros/op | (post-PR)  micros/op | std micros/op | change (%)
-- | -- | -- | -- | -- | --
10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721
20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | -0.3633711465
40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | 0.5289363078
80 | 3.87828 | 0.119007 | 3.86791 | 0.115674 | **-0.2673865734**
160 | 3.87677 | 0.162231 | 3.86739 | 0.16663 | **-0.2419539978**

table 2 - compact

#-run | (pre-PR) avg micros/op | std micros/op | (post-PR)  micros/op | std micros/op | change (%)
-- | -- | -- | -- | -- | --
10 | 2,399,650.00 | 96,375.80 | 2,359,537.00 | 53,243.60 | -1.67
20 | 2,410,480.00 | 89,988.00 | 2,433,580.00 | 91,121.20 | 0.96
40 | 2.41E+06 | 121811 | 2.39E+06 | 131525 | **-0.96**
80 | 2.40E+06 | 134503 | 2.39E+06 | 108799 | **-0.78**

- stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1  --cache_size=1` killed as normal

Reviewed By: ajkr

Differential Revision: D36055583

Pulled By: hx235

fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625
2022-06-14 13:06:40 -07:00
Akanksha Mahajan 40d19bc12c Fix the failure related to io_uring_prep_cancel (#10165)
Summary:
Fix for Internal jobs are failing with
```
 error: no matching function for call to 'io_uring_prep_cancel'
      io_uring_prep_cancel(sqe, posix_handle, 0);
      ^~~~~~~~~~~~~~~~~~~~
note: candidate function not viable: no known conversion from 'rocksdb::Posix_IOHandle *' to '__u64' (aka 'unsigned long long') for 2nd argument
static inline void io_uring_prep_cancel(struct io_uring_sqe *sqe,
```

User data is set using `io_uring_set_data` API so no need to pass posix_handle here.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10165

Test Plan: CircleCI jobs

Reviewed By: jay-zhuang

Differential Revision: D37145233

Pulled By: akankshamahajan15

fbshipit-source-id: 05da650e1240e9c6fcc8aed5f0067308dccb164a
2022-06-14 12:35:11 -07:00