Commit graph

325 commits

Author SHA1 Message Date
anand76 d8fb849b7e Basic RocksDB follower implementation (#12540)
Summary:
A basic implementation of RocksDB follower mode, which opens a remote database (referred to as leader) on a distributed file system by tailing its MANIFEST. It leverages the secondary instance mode, but is different in some key ways -
1. It has its own directory with links to the leader's database
2. Periodically refreshes itself
3. (Future) Snapshot support
4. (Future) Garbage collection of obsolete links
5. (Long term) Memtable replication

There are two main classes implementing this functionality - `DBImplFollower` and `OnDemandFileSystem`. The former is derived from `DBImplSecondary`. Similar to `DBImplSecondary`, it implements recovery and catch up through MANIFEST tailing using the `ReactiveVersionSet`, but does not consider logs. In a future PR, we will implement memtable replication, which will eliminate the need to catch up using logs. In addition, the recovery and catch-up tries to avoid directory listing as repeated metadata operations are expensive.

The second main piece is the `OnDemandFileSystem`, which plugs in as an `Env` for the follower instance and creates the illusion of the follower directory as a clone of the leader directory. It creates links to SSTs on first reference. When the follower tails the MANIFEST and attempts to create a new `Version`, it calls `VerifyFileMetadata` to verify the size of the file, and optionally the unique ID of the file. During this process, links are created which prevent the underlying files from getting deallocated even if the leader deletes the files.

TODOs: Deletion of obsolete links, snapshots, robust checking against misconfigurations, better observability etc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12540

Reviewed By: jowlyzhang

Differential Revision: D56315718

Pulled By: anand1976

fbshipit-source-id: d19e1aca43a6af4000cb8622a718031b69ebd97b
2024-04-19 19:13:31 -07:00
Akanksha Mahajan 956f1dfde3 Change ReadAsync callback API to remove const from FSReadRequest (#11649)
Summary:
Modify ReadAsync callback API to remove const from FSReadRequest as const doesn't let to fs_scratch to move the ownership.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11649

Test Plan: CircleCI jobs

Reviewed By: anand1976

Differential Revision: D53585309

Pulled By: akankshamahajan15

fbshipit-source-id: 3bff9035db0e6fbbe34721a5963443355807420d
2024-02-16 09:14:55 -08:00
Peter Dillinger 54cb9c77d9 Prefer static_cast in place of most reinterpret_cast (#12308)
Summary:
The following are risks associated with pointer-to-pointer reinterpret_cast:
* Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do.
* Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally.

I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement:
* Going through `void*` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have
`struct Derived: public Base1, public Base2`.  If we have `Derived*` -> `void*` -> `Base2*` -> `Derived*` through reinterpret casts, this could plausibly work (though technical UB) assuming the `Base2*` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic.
* Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived*` -> `Base2*` -> `Derived*` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance.

With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain.

A couple of related interventions included here:
* Previously Cache::Handle was not actually derived from in the implementations and just used as a `void*` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle.
* Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse).

Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work.

I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308

Test Plan: existing tests, CI

Reviewed By: ltamasi

Differential Revision: D53204947

Pulled By: pdillinger

fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2
2024-02-07 10:44:11 -08:00
Peter Dillinger 76c834e441 Remove 'virtual' when implied by 'override' (#12319)
Summary:
... to follow modern C++ style / idioms.

Used this hack:
```
for FILE in `cat my_list_of_files`; do perl -pi -e 'BEGIN{undef $/;} s/ virtual( [^;{]* override)/$1/smg' $FILE; done
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12319

Test Plan: existing tests, CI

Reviewed By: jaykorean

Differential Revision: D53275303

Pulled By: pdillinger

fbshipit-source-id: bc0881af270aa8ef4d0ae4f44c5a6614b6407377
2024-01-31 13:14:42 -08:00
Richard Barnes 1f3e3ead3f Remove extra semi colon from internal_repo_rocksdb/repo/env/env_encryption.cc (#12274)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12274

`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: jaykorean

Differential Revision: D52969133

fbshipit-source-id: f5a8452af25a5a51d5c7e4045baef12575022da9
2024-01-24 07:22:49 -08:00
Richard Barnes 51ecdd3e8f Remove extra semi colon from internal_repo_rocksdb/repo/env/env_encryption_ctr.h
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: jaykorean

Differential Revision: D52969018

fbshipit-source-id: 0b79c1599fef4eb902c9ef3fac827f1ed4ea94ed
2024-01-23 06:07:30 -08:00
Richard Barnes 59ba1d200d Remove unused variables in internal_repo_rocksdb/repo/env/env_posix.cc (#12243)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12243

LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, it's associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Reviewed By: jowlyzhang

Differential Revision: D52847993

fbshipit-source-id: 221da13c6ca9967e3b934f98f318a832a144df39
2024-01-17 14:08:07 -08:00
akankshamahajan 5cb2d09d47 Refactor FilePrefetchBuffer code (#12097)
Summary:
Summary - Refactor FilePrefetchBuffer code
- Implementation:
FilePrefetchBuffer maintains a deque of free buffers (free_bufs_) of size num_buffers_ and buffers (bufs_) which contains the prefetched data. Whenever a buffer is consumed or is outdated (w.r.t. to requested offset), that buffer is cleared and returned to free_bufs_.

 If a buffer is available in free_bufs_, it's moved to bufs_ and is sent for prefetching. num_buffers_ defines how many buffers are maintained that contains prefetched data.
If num_buffers_ == 1, it's a sequential read flow. Read API will be called on that one buffer whenever the data is requested and is not in the buffer.
If num_buffers_ > 1, then the data is prefetched asynchronosuly in the buffers whenever the data is consumed from the buffers and that buffer is freed.
If num_buffers > 1, then requested data can be overlapping between 2 buffers. To return the continuous buffer overlap_bufs_ is used. The requested data is copied from 2 buffers to the overlap_bufs_ and overlap_bufs_ is returned to
the caller.

- Merged Sync and Async code flow into one in FilePrefetchBuffer.

Test Plan -
- Crash test passed
- Unit tests
- Pending - Benchmarks

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12097

Reviewed By: ajkr

Differential Revision: D51759552

Pulled By: akankshamahajan15

fbshipit-source-id: 69a352945affac2ed22be96048d55863e0168ad5
2024-01-05 09:29:01 -08:00
Hui Xiao 06e593376c Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.

For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS

Some related code refactory to make implementation cleaner:
- Blob stats
   - Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS  include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
   - Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
   - TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
   - Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see  https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
   - Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority

## Test
### db bench

Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100

rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```

compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

Run:./db_bench --statistics=1 --benchmarks=compact  --db=../db_bench --use_existing_db=1

rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```

blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB

Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact  --db=../db_bench --use_existing_db=1

pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842

post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164

rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```

```
Stacked Blob DB

Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445

post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164

rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```

###  Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests

###  Performance

Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true

Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,

Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```

Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1  --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark

Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846

Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```

Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1  --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark

Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860

Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910

Reviewed By: ajkr

Differential Revision: D49788060

Pulled By: hx235

fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 15:29:23 -08:00
Yu Zhang ba8fa0f546 internal_repo_rocksdb (4372117296613874540) (#12117)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12117

Reviewed By: ajkr

Differential Revision: D51745846

Pulled By: jowlyzhang

fbshipit-source-id: 51c806a484b3b43d174b06d2cfe9499191d09914
2023-12-04 11:17:32 -08:00
Timo Riski 39d33475da Fix build on FreeBSD (#11218) (#12078)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/11218

Changes from https://github.com/facebook/rocksdb/issues/10881 broke FreeBSD builds with:

    env/io_posix.h:39:9: error: 'POSIX_MADV_NORMAL' macro redefined [-Werror,-Wmacro-redefined]

This commit fixes FreeBSD builds by ignoring MADV defines.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12078

Reviewed By: cbi42

Differential Revision: D51452802

Pulled By: ajkr

fbshipit-source-id: 0a1f5a90954e7d257a95794277a843ac77f3a709
2023-11-20 10:11:16 -08:00
Yingchun Lai c4c62c2304 Support to use environment variable to test customer encryption plugins (#12025)
Summary:
The CreateEnvTest.CreateEncryptedFileSystem unit test is to verify the creation functionality of EncryptedFileSystem, but now it just support the builtin CTREncryptionProvider class.
This patch make it flexible to use environment variable `TEST_FS_URI`, it is useful to test customer encryption plugins.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12025

Reviewed By: anand1976

Differential Revision: D50799656

Pulled By: ajkr

fbshipit-source-id: dbcacfefbf07de9c7803f7707b34c5193bec17bf
2023-11-09 10:45:13 -08:00
Yingchun Lai 76402c034e Fix incorrect parameters order in env_basic_test.cc (#11997)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11997

Reviewed By: jaykorean

Differential Revision: D50608182

Pulled By: ajkr

fbshipit-source-id: d33cfdb5adfea91175c8fa21e8b80e22f728f6c6
2023-10-30 10:47:04 -07:00
Peter Dillinger dc87847e65 Fix windows build errors (rdtsc and fnptr) (#12008)
Summary:
Combining best parts of https://github.com/facebook/rocksdb/issues/11794 and https://github.com/facebook/rocksdb/issues/11766, fixing the CircleCI config in the latter. I was going to amend the latter but wasn't granted access.

Ideally this would be ported back to 8.4 branch and crc32c part into 8.3 branch.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12008

Test Plan: CI

Reviewed By: hx235

Differential Revision: D50616172

Pulled By: pdillinger

fbshipit-source-id: fa7f778bc281e881a140522e774f480c6d1e5f48
2023-10-24 16:20:37 -07:00
Andrew Kryczka e373685dab Add SystemClock::TimedWait() function (#11753)
Summary:
Having a synthetic implementation of `TimedWait()` in `SystemClock` will allow us to add `SyncPoint`s while mutex is released, which was previously impossible since the lock was released and reacquired all within `pthread_cond_timedwait()`. Additionally, integrating `TimedWait()` with `MockSystemClock` allows us to cleanup some workarounds in the test code. In this PR I only cleaned up the `GenericRateLimiter` test workaround.

This is related to the intended follow-up mentioned in https://github.com/facebook/rocksdb/issues/7101's description. There are a couple differences:

(1) This PR does not include removing the particular workaround that initially motivated it. Actually, the `Timer` class uses `InstrumentedCondVar`, so the interface introduced here is inadequate to remove that workaround. On the bright side, the interface introduced in this PR can be changed as needed since it can neither be used nor extended externally, due to using forward-declared `port::CondVar*` in the interface.
(2) This PR only makes the change in `SystemClock` not `Env`. Older revisions of this PR included `Env::TimedWait()` and `SpecialEnv::TimedWait()`; however, since they were unused it probably makes sense to defer adding them until when they are needed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11753

Reviewed By: pdillinger

Differential Revision: D48654995

Pulled By: ajkr

fbshipit-source-id: 15e19f2454b64d4ec7f50e328691c66ca9911122
2023-08-29 18:39:10 -07:00
Peter Dillinger 7a1b0207e6 format_version=6 and context-aware block checksums (#9058)
Summary:
## Context checksum
All RocksDB checksums currently use 32 bits of checking
power, which should be 1 in 4 billion false negative (FN) probability (failing to
detect corruption). This is true for random corruptions, and in some cases
small corruptions are guaranteed to be detected. But some possible
corruptions, such as in storage metadata rather than storage payload data,
would have a much higher FN rate. For example:
* Data larger than one SST block is replaced by data from elsewhere in
the same or another SST file. Especially with block_align=true, the
probability of exact block size match is probably around 1 in 100, making
the FN probability around that same. Without `block_align=true` the
probability of same block start location is probably around 1 in 10,000,
for FN probability around 1 in a million.

To solve this problem in new format_version=6, we add "context awareness"
to block checksum checks. The stored and expected checksum value is
modified based on the block's position in the file and which file it is in. The
modifications are cleverly chosen so that, for example
* blocks within about 4GB of each other are guaranteed to use different context
* blocks that are offset by exactly some multiple of 4GiB are guaranteed to use
different context
* files generated by the same process are guaranteed to use different context
for the same offsets, until wrap-around after 2^32 - 1 files

Thus, with format_version=6, if a valid SST block and checksum is misplaced,
its checksum FN probability should be essentially ideal, 1 in 4B.

## Footer checksum
This change also adds checksum protection to the SST footer (with
format_version=6), for the first time without relying on whole file checksum.
To prevent a corruption of the format_version in the footer (e.g. 6 -> 5) to
defeat the footer checksum, we change much of the footer data format
including an "extended magic number" in format_version 6 that would be
interpreted as empty index and metaindex block handles in older footer
versions. We also change the encoding of handles to free up space for
other new data in footer.

## More detail: making space in footer
In order to keep footer the same size in format_version=6 (avoid change to IO
patterns), we have to free up some space for new data. We do this two ways:
* Metaindex block handle is encoded down to 4 bytes (from 10) by assuming
it immediately precedes the footer, and by assuming it is < 4GB.
* Index block handle is moved into metaindex. (I don't know why it was
in footer to begin with.)

## Performance
In case of small performance penalty, I've made a "pay as you go" optimization
to compensate: replace `MutableCFOptions` in BlockBasedTableBuilder::Rep
with the only field used in that structure after construction: `prefix_extractor`.
This makes the PR an overall performance improvement (results below).

Nevertheless I'm seeing essentially no difference going from fv=5 to fv=6,
even including that improvement for both. That's based on extreme case table
write performance testing, many files with many blocks. This is relatively
checksum intensive (small blocks) and salt generation intensive (small files).

```
(for I in `seq 1 100`; do TEST_TMPDIR=/dev/shm/dbbench2 ./db_bench -benchmarks=fillseq -memtablerep=vector -disable_wal=1 -allow_concurrent_memtable_write=false -num=3000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -write_buffer_size=100000 -compression_type=none -block_size=1000; done) 2>&1 | grep micros/op | tee out
awk '{ tot += $5; n += 1; } END { print int(1.0 * tot / n) }' < out
```

Each value below is ops/s averaged over 100 runs, run simultaneously with competing
configuration for load fairness

Before -> after (both fv=5): 483530 -> 483673 (negligible)
Re-run 1: 480733 -> 485427 (1.0% faster)
Re-run 2: 483821 -> 484541 (0.1% faster)
Before (fv=5) -> after (fv=6): 482006 -> 485100 (0.6% faster)
Re-run 1: 482212 -> 485075 (0.6% faster)
Re-run 2: 483590 -> 484073 (0.1% faster)
After fv=5 -> after fv=6: 483878 -> 485542 (0.3% faster)
Re-run 1: 485331 -> 483385 (0.4% slower)
Re-run 2: 485283 -> 483435 (0.4% slower)
Re-run 3: 483647 -> 486109 (0.5% faster)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9058

Test Plan:
unit tests included (table_test, db_properties_test, salt in env_test). General DB tests
and crash test updated to test new format_version.

Also temporarily updated the default format version to 6 and saw some test failures. Almost all
were due to an inadvertent additional read in VerifyChecksum to verify the index block checksum,
though it's arguably a bug that VerifyChecksum does not appear to (re-)verify the index block
checksum, just assuming it was verified in opening the index reader (probably *usually* true but
probably not always true). Some other concerns about VerifyChecksum are left in FIXME
comments. The only remaining test failure on change of default (in block_fetcher_test) now
has a comment about how to upgrade the test.

The format compatibility test does not need updating because we have not updated the default
format_version.

Reviewed By: ajkr, mrambacher

Differential Revision: D33100915

Pulled By: pdillinger

fbshipit-source-id: 8679e3e572fa580181a737fd6d113ed53c5422ee
2023-07-30 16:40:01 -07:00
Yingchun Lai 44524cf5da remove duplicate comments in EncryptedEnv (#11549)
Summary:
There are some comments on subclasses in EncryptedEnv module which are duplicate to their parent classes, it would be nice to remove the duplication and keep the consistency if the comments on parent classes updated in someday.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11549

Reviewed By: akankshamahajan15

Differential Revision: D47007061

Pulled By: ajkr

fbshipit-source-id: 8bfdaf9f2418a24ca951c30bb88e90ac861d9016
2023-06-27 11:55:37 -07:00
akankshamahajan fbd2f563bb Add an interface to provide support for underlying FS to pass their own buffer during reads (#11324)
Summary:
1. Public API change: Replace `use_async_io`  API in file_system with `SupportedOps` API which is used by underlying FileSystem to indicate to upper layers whether the FileSystem supports different operations introduced in `enum FSSupportedOps `. Right now operations are `async_io` and whether FS will provide its own buffer during reads or not. The api is changed to extend it to various FileSystem operations in one API rather than creating a separate API for each operation.

2. Provide support for underlying FS to pass their own buffer during Reads (async and sync read) instead of using RocksDB provided `scratch` (buffer) in `FSReadRequest`. Currently only MultiRead supports it and later will be extended to other reads as well (point lookup, scan etc). More details in how to enable in file_system.h

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11324

Test Plan: Tested locally

Reviewed By: anand1976

Differential Revision: D44465322

Pulled By: akankshamahajan15

fbshipit-source-id: 9ec9e08f839b5cc815e75d5dade6cd549998d0ec
2023-06-23 11:48:49 -07:00
Peter Dillinger 98c6d7fd80 Internal API for generating semi-random salt (#11331)
Summary:
... so that a non-cryptographic whole file checksum would be highly resistant
to manipulation by a user able to manipulate key-value data (e.g. a user whose data is
stored in RocksDB) and able to predict SST metadata such as DB session id and file
number based on read access to logs or DB files. The adversary would also need to predict
the salt in order to influence the checksum result toward collision with another file's
checksum.

This change is just internal code to support such a future feature. I think this should be a
passive feature, not option-controlled, because you probably won't think about needing it
until you discover you do need it, and it should be low cost, in space (16 bytes per SST
file) and CPU.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11331

Test Plan: Unit tests added to verify at least pseudorandom behavior. (Actually caught a bug in first draft!) The new "stress" style tests run in ~3ms each on my system.

Reviewed By: ajkr

Differential Revision: D46129415

Pulled By: pdillinger

fbshipit-source-id: 7972dc74487e062b29b1fd9c227425e922c98796
2023-06-21 11:32:49 -07:00
Yu Zhang 66499780b2 Fix error case memory bug in GetHostName() (#11544)
Summary:
Fix the error handling in `GetHostName` for non EFAULT, non EINVAL error. Current handling will cause stack overflow when non null-terminated c style string is in `name`, e.g. ENAMETOOLONG, when the `name` buffer is not big enough and the host name is truncated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11544

Test Plan:
```
COMPILE_WITH_ASAN=1 make all check
```

Reviewed By: pdillinger

Differential Revision: D46775799

Pulled By: jowlyzhang

fbshipit-source-id: e0fc9400c50fe38bc1fd888b4fea5fe8706165bf
2023-06-16 11:47:19 -07:00
Peter Dillinger 70bf5ef093 Avoid destroying default PosixEnv, safely (#11538)
Summary:
Use another static object to join threads instead.

This change is motivated by a case in which some code using NewLRUCache() -> ShardedCacheBase -> SemiStructuredUniqueIdGen -> GenerateRawUniqueId() -> Env::Default() was happening
during static destruction.

I didn't see anything else in PosixEnv or base classes that would cause a problem by not
destroying. (WinEnv is already not destroyed; see env_default.cc)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11538UndefinedBehaviorSanitizer: undefined-behavior env/env_test.cc:3561:23 in
$
```

Test Plan:
test added, which would previously fail with UBSAN:

```
$ ./env_test --gtest_filter=*Destruct*
Note: Google Test filter = *Destruct*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from EnvTestMisc
[ RUN      ] EnvTestMisc.StaticDestruction
[       OK ] EnvTestMisc.StaticDestruction (0 ms)
[----------] 1 test from EnvTestMisc (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (0 ms total)
[  PASSED  ] 1 test.
env/env_test.cc:3561:23: runtime error: member call on address 0x7f7b96671ca8 which does not point to an object of type 'rocksdb::Env'
0x7f7b96671ca8: note: object is of type 'N7rocksdb12ConfigurableE'
 00 00 00 00  90 a7 f7 95 7b 7f 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^~~~~~~~~~~~~~~~~~~~~~~
              vptr for 'N7rocksdb12ConfigurableE'

Reviewed By: jowlyzhang

Differential Revision: D46737389

Pulled By: pdillinger

fbshipit-source-id: 0f80a443bf799ffc5641e898cf3a75f7d10a987b
2023-06-14 16:18:08 -07:00
Peter Dillinger 206fdea3d9 Change internal headers with duplicate names (#11408)
Summary:
In IDE navigation I find it annoying that there are two statistics.h files (etc.) and often land on the wrong one. Here I migrate several headers to use the blah.h <- blah_impl.h <- blah.cc idiom. Although clang-format wants "blah.h" to be the top include for "blah.cc", I think overall this is an improvement.

No public API changes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11408

Test Plan: existing tests

Reviewed By: ltamasi

Differential Revision: D45456696

Pulled By: pdillinger

fbshipit-source-id: 809d931253f3272c908cf5facf7e1d32fc507373
2023-05-17 11:27:09 -07:00
Peter Dillinger f4a02f2c52 Add hash_seed to Caches (#11391)
Summary:
See motivation and description in new ShardedCacheOptions::hash_seed option.

Updated db_bench so that its seed param is used for the cache hash seed.
Made its code more safe to ensure seed is set before use.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11391

Test Plan:
unit tests added / updated

**Performance** - no discernible difference seen running cache_bench repeatedly before & after. With lru_cache and hyper_clock_cache.

Reviewed By: hx235

Differential Revision: D45557797

Pulled By: pdillinger

fbshipit-source-id: 40bf4da6d66f9d41a8a0eb8e5cf4246a4aa07934
2023-05-09 22:24:26 -07:00
anand76 eac6b6d0cd Ignore async_io ReadOption if FileSystem doesn't support it (#11296)
Summary:
In PosixFileSystem, IO uring support is opt-in. If the support is not enabled by the user, then ignore the async_io ReadOption in MultiGet and iteration at the top, rather than follow the async_io codepath and transparently switch to sync IO at the FileSystem layer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11296

Test Plan: Add new unit tests

Reviewed By: akankshamahajan15

Differential Revision: D44045776

Pulled By: anand1976

fbshipit-source-id: a0881bf763ca2fde50b84063d0068bb521edd8b9
2023-03-17 14:57:09 -07:00
mrambacher b6640c3117 Remove FactoryFunc from LoadXXXObject (#11203)
Summary:
The primary purpose of the FactoryFunc was to support LITE mode where the ObjectRegistry was not available.  With the removal of LITE mode, the function was no longer required.

Note that the MergeOperator had some private classes defined in header files.  To gain access to their constructors (and name methods), the class definitions were moved into header files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11203

Reviewed By: cbi42

Differential Revision: D43160255

Pulled By: pdillinger

fbshipit-source-id: f3a465fd5d1a7049b73ecf31e4b8c3762f6dae6c
2023-02-17 12:54:07 -08:00
sdong 4720ba4391 Remove RocksDB LITE (#11147)
Summary:
We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support.

Most of changes were done through following comments:

unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`

by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147

Test Plan: See CI

Reviewed By: pdillinger

Differential Revision: D42796341

fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
2023-01-27 13:14:19 -08:00
Changyu Bi c94c8fcbd4 Remove deprecated FileSystem::Load() (#11122)
Summary:
user should use FileSystem::CreateFromString() instead.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11122

Reviewed By: ajkr

Differential Revision: D42727580

Pulled By: cbi42

fbshipit-source-id: c68b17bb82ba9dee46ba23b677d87ecf0a1e06c8
2023-01-26 20:20:58 -08:00
Peter Dillinger 9afa0f05ad Remove deprecated Env::LoadEnv() (#11121)
Summary:
Can use Env::CreateFromString() instead

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11121

Test Plan: unit tests updated

Reviewed By: cbi42

Differential Revision: D42723813

Pulled By: pdillinger

fbshipit-source-id: 5d4b5b10225dfdaf662f5f8049ee965a05d3edc9
2023-01-25 12:08:49 -08:00
Brord van Wierst d80baa1396 Added placeholders for MADV defines (#10881)
Summary:
Cross compiling rocksdb with rust bindings to android leads to an error since 7.4.0 (Incusion of madvise)
This is due to missing placeholders for non-linux platforms.

This PR adds the missing placeholders.

See https://github.com/rust-rocksdb/rust-rocksdb/issues/697 for the specific error thrown.

I have just completed the CLA :)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10881

Reviewed By: akankshamahajan15

Differential Revision: D40726103

Pulled By: ajkr

fbshipit-source-id: 6b391636a74ef7e20d0daf47d332ddf0c14d5c34
2022-11-02 14:42:42 -07:00
akankshamahajan ee3dbdc083 Run clang-format on env/ folder (#10859)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10859

Test Plan: CircleCI jobs

Reviewed By: anand1976

Differential Revision: D40653839

Pulled By: akankshamahajan15

fbshipit-source-id: ce75205ee34ee3896a77a807d5c556886de78b01
2022-10-24 17:54:14 -07:00
Peter Dillinger e466173d5c Print stack traces on frozen tests in CI (#10828)
Summary:
Instead of existing calls to ps from gnu_parallel, call a new wrapper that does ps, looks for unit test like processes, and uses pstack or gdb to print thread stack traces. Also, using `ps -wwf` instead of `ps -wf` ensures output is not cut off.

For security, CircleCI runs with security restrictions on ptrace (/proc/sys/kernel/yama/ptrace_scope = 1), and this change adds a work-around to `InstallStackTraceHandler()` (only used by testing tools) to allow any process from the same user to debug it. (I've also touched >100 files to ensure all the unit tests call this function.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10828

Test Plan: local manual + temporary infinite loop in a unit test to observe in CircleCI

Reviewed By: hx235

Differential Revision: D40447634

Pulled By: pdillinger

fbshipit-source-id: 718a4c4a5b54fa0f9af2d01a446162b45e5e84e1
2022-10-18 00:35:35 -07:00
akankshamahajan ae0f9c3339 Add new property in IOOptions to skip recursing through directories and list only files during GetChildren. (#10668)
Summary:
Add new property "do_not_recurse" in  IOOptions for underlying file system to skip iteration of directories during DB::Open if there are no sub directories and list only files.
By default this property is set to false. This property is set true currently in the code where RocksDB is sure only files are needed during DB::Open.

Provided support in PosixFileSystem to use "do_not_recurse".

TestPlan:
- Existing tests

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10668

Reviewed By: anand1976

Differential Revision: D39471683

Pulled By: akankshamahajan15

fbshipit-source-id: 90e32f0b86d5346d53bc2714d3a0e7002590527f
2022-10-03 10:59:45 -07:00
Akanksha Mahajan 0f4978d34f Fix sqe->addr passed in cancel request in io_uring (#10644)
Summary:
Update io_uring_prep_cancel as it is now backward compatible.
Also, io_uring_prep_cancel expects sqe->addr to match with read
request submitted. It's being set wrong which is now fixed in this PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10644

Test Plan:
- Ran internally with lastest liburing package and on RocksDB
github repo with older version.
- Ran seekrandom regression to confirm there is no regression.

Reviewed By: anand1976

Differential Revision: D39284229

Pulled By: akankshamahajan15

fbshipit-source-id: fd52cdf23d676da114896163626b75c8ae09c980
2022-09-21 14:21:59 -07:00
Jay Zhuang 5d3aefb682 Migrate to docker for CI run (#10496)
Summary:
Moved linux builds to using docker to avoid CI instability caused by dependency installation site down.
Added the `Dockerfile` which is used to build the image.
The build time is also significantly reduced, because no dependencies installation and with using 2xlarge+ instance for slow build (like tsan test).
Also fixed a few issues detected while building this:
* `DestoryDB()` Status not checked for a few tests
* nullptr might be used in `inlineskiplist.cc`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10496

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D38554200

Pulled By: jay-zhuang

fbshipit-source-id: 16e8fb2bf07b9c84bb27fb18421c4d54f2f248fd
2022-08-10 17:34:38 -07:00
Vladimir Kikhtenko 8d664ccb07 increase buffer size in PosixFileSystem::GetAbsolutePath to PATH_MAX (#10413)
Summary:
RocksDB fails to open database with relative path when length of cwd
is longer than 256 bytes. This happens due to ERANGE in getcwd call.
Here we simply increase buffer size to the most common PATH_MAX value.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10413

Reviewed By: riversand963

Differential Revision: D38189254

Pulled By: ajkr

fbshipit-source-id: 8a0d3a78bbe87645499fbf29fb12bd3d04cd4657
2022-08-02 17:18:18 -07:00
Peter Dillinger 27f3af5966 Fix serious FSDirectory use-after-Close bug (missing fsync) (#10460)
Summary:
TL;DR: due to a recent change, if you drop a column family,
often that DB will no longer fsync after writing new SST files
to remaining or new column families, which could lead to data
loss on power loss.

More bug detail:
The intent of https://github.com/facebook/rocksdb/issues/10049 was to Close FSDirectory objects at
DB::Close time rather than waiting for DB object destruction.
Unfortunately, it also closes shared FSDirectory objects on
DropColumnFamily (& destroy remaining handles), which can lead
to use-after-Close on FSDirectory shared with remaining column
families. Those "uses" are only Fsyncs (or redundant Closes). In
the default Posix filesystem, an Fsync on a closed FSDirectory is a
quiet no-op. Consequently (under most configurations), if you drop
a column family, that DB will no longer fsync after writing new SST
files to column families sharing the same directory (true under most
configurations).

More fix detail:
Basically, this removes unnecessary Close ops on destroying
ColumnFamilyData. We let `shared_ptr` take care of calling the
destructor at the right time. If the intent was to require Close be
called before destroying FSDirectory, that was not made clear by the
author of FileSystem and was not at all enforced by https://github.com/facebook/rocksdb/issues/10049, which
could have added `assert(fd_ == -1)` to `~PosixDirectory()` but did
not. To keep this fix simple, we relax the unit test for https://github.com/facebook/rocksdb/issues/10049 to allow
timely destruction of FSDirectory to suffice as Close (in
CountedFileSystem). Added a TODO to revisit that.

Also in this PR:
* Added a TODO to share FSDirectory instances between DB and its column
families. (Already shared among column families.)
* Made DB::Close attempt to close all its open FSDirectory objects even
if there is a failure in closing one. Also code clean-up around this
logic.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10460

Test Plan:
add an assert to check for use-after-Close. With that
existing tests can detect the misuse. With fix, tests pass (except noted
relaxing of unit test for https://github.com/facebook/rocksdb/issues/10049)

Reviewed By: ajkr

Differential Revision: D38357922

Pulled By: pdillinger

fbshipit-source-id: d42079cadbedf0a969f03389bf586b3b4e1f9137
2022-08-02 10:54:32 -07:00
sdong cc2099803a Use EnvLogger instead of PosixLogger (#10436)
Summary:
EnvLogger was built to replace PosixLogger that supports multiple Envs. Make FileSystem use EnvLogger by default, remove Posix FS specific implementation and remove PosixLogger code,
Some hacky changes are made to make sure iostats are not polluted by logging, in order to pass existing unit tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10436

Test Plan: Run db_bench and watch info log files.

Reviewed By: anand1976

Differential Revision: D38259855

fbshipit-source-id: 67d65874bfba7a33535b6d0dd0ed92cbbc9888b8
2022-08-01 14:37:18 -07:00
Jay Zhuang fcccc412d7 Remove Travis CI (#10407)
Summary:
Travis CI is depreciated and haven't been maintained for some time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10407

Reviewed By: ajkr

Differential Revision: D38078382

Pulled By: jay-zhuang

fbshipit-source-id: f42057f2f41f722bdce56bf195f67a94835191fb
2022-07-22 20:16:45 -07:00
Akanksha Mahajan f6c4d7a576 Fix hang in MultiRead with O_DIRECT and io_uring (#10368)
Summary:
Fix bug in O_DIRECT and io_uring when its EOF and bytes_read =
0 because of wrong check, it got added into incomplete list and gets stuck in an infinite loop as it will always return bytes_read = 0. The bug was introduced by PR https://github.com/facebook/rocksdb/pull/10197 and that PR is not released yet in any release branch.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10368

Test Plan: Added new unit test

Reviewed By: siying

Differential Revision: D37885184

Pulled By: akankshamahajan15

fbshipit-source-id: 35b36a44b696d29b2f6f25301aa1b19547b4e03b
2022-07-18 15:37:29 -07:00
zczhu 96206531bc Support reservation in thread pool (#10278)
Summary:
Add `ReserveThreads` and `ReleaseThreads` functions in thread pool to support reservation in for a specific thread pool.  With this feature, a thread will be blocked if the number of waiting threads (noted by `num_waiting_threads_`) equals the number of reserved threads (noted by `reserved_threads_`), normally `reserved_threads_` is upper bounded by `num_waiting_threads_`; in rare cases (e.g. `SetBackgroundThreadsInternal` is called when some threads are already reserved), `num_waiting_threads_` can be less than `reserved_threads`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10278

Test Plan: Add `ReserveThreads` unit test in `env_test`. Update the unit test `SimpleColumnFamilyInfoTest` in `thread_list_test` with adding `ReserveThreads` related assertions.

Reviewed By: hx235

Differential Revision: D37640946

Pulled By: littlepig2013

fbshipit-source-id: 4d691f6b9a433569f96ab52d52c3defe5b065367
2022-07-08 19:48:09 -07:00
Akanksha Mahajan fc51b7f33a Fix clang error implicit conversion loses integer precision (#10323)
Summary:
Fix  error: implicit conversion loses integer precision:
'size_t' (aka 'unsigned long') to 'uint32_t' (aka 'unsigned int')
[-Werror,-Wshorten-64-to-32]

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10323

Test Plan: USE_CLANG=1 make -j32

Reviewed By: gitbw95

Differential Revision: D37688250

Pulled By: akankshamahajan15

fbshipit-source-id: 443873e41279ee8bdbe8452818549792047532fb
2022-07-07 11:35:15 -07:00
Akanksha Mahajan 2acbf386a3 Provide support for direct_reads with async_io (#10197)
Summary:
Provide support for use_direct_reads with async_io.

TestPlan:
-  Updated unit tests
-  db_bench: Results in https://github.com/facebook/rocksdb/pull/10197#issuecomment-1159239420
- db_stress
```
export CRASH_TEST_EXT_ARGS=" --async_io=1 --use_direct_reads=1"
make crash_test -j
```
- Ran db_bench on previous RocksDB version before any async_io implementation (as there have many changes in different PRs in this area) https://github.com/facebook/rocksdb/pull/10197#issuecomment-1160781563.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10197

Reviewed By: anand1976

Differential Revision: D37255646

Pulled By: akankshamahajan15

fbshipit-source-id: fec61ae15bf4d625f79dea56e4f86e0e307ba920
2022-07-06 11:42:59 -07:00
Bo Wang c073ed7601 Fix typo in comments and code (#10233)
Summary:
Fix typo in comments and code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10233

Test Plan: Existing unit tests should pass.

Reviewed By: jay-zhuang, anand1976

Differential Revision: D37356702

Pulled By: gitbw95

fbshipit-source-id: 32c019adcc6dcc95a9882b38147a310091368e51
2022-06-22 15:45:21 -07:00
Yanqin Jin 36fefd7e22 Continue to deflake BackupEngineTest.Concurrency (#10228)
Summary:
Even after https://github.com/facebook/rocksdb/issues/10069, `BackupEngineTest.Concurrency` is still flaky with decreased probability of failure.

Repro steps as follows
```bash
make backup_engine_test
gtest-parallel -r 1000 -w 64 ./backup_engine_test --gtest_filter=BackupEngineTest.Concurrency
```

The first two commits of this PR demonstrate how the test is flaky. https://github.com/facebook/rocksdb/issues/10069 handles the case in which
`Rename()` file returns `IOError` with subcode `PathNotFound`, and `CreateLoggerFromOptions()`
allows the operation to succeed, as expected by the test. However, `BackupEngineTest` uses
`RemapFileSystem` on top of `ChrootFileSystem` which can return `NotFound` instead of `IOError`.

This behavior is different from `Env::Default()` which returns PathNotFound if the src of `rename()`
does not exist. We should make the behaviors of the test Env/FS match a real Env/FS.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10228

Test Plan:
```bash
make check
gtest-parallel -r 1000 -w 64 ./backup_engine_test --gtest_filter=BackupEngineTest.Concurrency
```

Reviewed By: pdillinger

Differential Revision: D37337241

Pulled By: riversand963

fbshipit-source-id: 07a53115e424467b55a731866e571f0ad4c6635d
2022-06-22 08:50:05 -07:00
sdong 69a32eecab Use madvise() for mmaped file advise (#10170)
Summary:
A recent PR https://github.com/facebook/rocksdb/pull/10142 enabled fadvise for mmaped file. However, we were told that it might not take effective and madvise() should be used.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10170

Test Plan:
Run existing tests
Run a benchmark using mmap with advise random and see I/O size is indeed small.

Reviewed By: anand1976

Differential Revision: D37158582

fbshipit-source-id: 8b3a74f0e89d2e16aac78ee4124c05841d4135c3
2022-06-15 13:05:58 -07:00
Akanksha Mahajan 19345de60d fix cancel argument for latest liburing (#10168)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10168

the arg changed to u64

Reviewed By: ajkr

Differential Revision: D37155407

fbshipit-source-id: 464eab2806675f148fce075a6fea369fa3d7a9bb
2022-06-15 09:10:19 -07:00
Akanksha Mahajan 40d19bc12c Fix the failure related to io_uring_prep_cancel (#10165)
Summary:
Fix for Internal jobs are failing with
```
 error: no matching function for call to 'io_uring_prep_cancel'
      io_uring_prep_cancel(sqe, posix_handle, 0);
      ^~~~~~~~~~~~~~~~~~~~
note: candidate function not viable: no known conversion from 'rocksdb::Posix_IOHandle *' to '__u64' (aka 'unsigned long long') for 2nd argument
static inline void io_uring_prep_cancel(struct io_uring_sqe *sqe,
```

User data is set using `io_uring_set_data` API so no need to pass posix_handle here.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10165

Test Plan: CircleCI jobs

Reviewed By: jay-zhuang

Differential Revision: D37145233

Pulled By: akankshamahajan15

fbshipit-source-id: 05da650e1240e9c6fcc8aed5f0067308dccb164a
2022-06-14 12:35:11 -07:00
Akanksha Mahajan c0e0f30667 Implement AbortIO using io_uring (#10125)
Summary:
Implement AbortIO in posix using io_uring to cancel any pending read requests submitted. Its cancelled using io_uring_prep_cancel which sets the IORING_OP_ASYNC_CANCEL flag.

To cancel a request, the sqe must have ->addr set to the user_data of the request it wishes to cancel. If the request is cancelled successfully, the original request is completed with -ECANCELED and the cancel request is completed with a result of 0. If the request was already running, the original may or may not complete in error. The cancel request will complete with -EALREADY for that case. And finally, if the request to cancel wasn't found, the cancel request is completed with -ENOENT.

Reference: https://kernel.dk/io_uring-whatsnew.pdf,
https://lore.kernel.org/io-uring/d9a8d76d23690842f666c326631ecc2d85b6c1bc.1615566409.git.asml.silence@gmail.com/

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10125

Test Plan: Existing Posix tests.

Reviewed By: anand1976

Differential Revision: D36946970

Pulled By: akankshamahajan15

fbshipit-source-id: 3bc1f1521b3151d01a348fc6431eb3fc85db3a14
2022-06-13 18:07:24 -07:00
sdong 80afa77660 Run fadvise with mmap file (#10142)
Summary:
Right now with mmap file, we don't run fadvise following users' requests. There is no reason for that so this diff does that.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10142

Test Plan:
A simple readrandom against files with page cache dropped shows latency improvement from 7.8 us to 2.8:

./db_bench -use_existing_db --benchmarks=readrandom --num=100

Reviewed By: anand1976

Differential Revision: D37074975

fbshipit-source-id: ccc72bcac1b5fd634eb8fa2b6a5d9afe332e0bf6
2022-06-10 16:34:01 -07:00
zczhu e88d8935ae Add comments/permit unchecked error to close_db_dir pull requests (#10093)
Summary:
In [close_db_dir](https://github.com/facebook/rocksdb/pull/10049) pull request, some merging conflicts occurred (some comments and one line `s.PermitUncheckedError()` are missing). This pull request aims to put them back.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10093

Reviewed By: ajkr

Differential Revision: D36884117

Pulled By: littlepig2013

fbshipit-source-id: 8c0e2a8793fc52804067c511843bd1ff4912c1c3
2022-06-02 21:52:35 -07:00