Summary:
Modify ReadAsync callback API to remove const from FSReadRequest as const doesn't let to fs_scratch to move the ownership.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11649
Test Plan: CircleCI jobs
Reviewed By: anand1976
Differential Revision: D53585309
Pulled By: akankshamahajan15
fbshipit-source-id: 3bff9035db0e6fbbe34721a5963443355807420d
Summary:
The following are risks associated with pointer-to-pointer reinterpret_cast:
* Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do.
* Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally.
I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement:
* Going through `void*` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have
`struct Derived: public Base1, public Base2`. If we have `Derived*` -> `void*` -> `Base2*` -> `Derived*` through reinterpret casts, this could plausibly work (though technical UB) assuming the `Base2*` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic.
* Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived*` -> `Base2*` -> `Derived*` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance.
With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain.
A couple of related interventions included here:
* Previously Cache::Handle was not actually derived from in the implementations and just used as a `void*` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle.
* Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse).
Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work.
I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308
Test Plan: existing tests, CI
Reviewed By: ltamasi
Differential Revision: D53204947
Pulled By: pdillinger
fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2
Summary:
Update io_uring_prep_cancel as it is now backward compatible.
Also, io_uring_prep_cancel expects sqe->addr to match with read
request submitted. It's being set wrong which is now fixed in this PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10644
Test Plan:
- Ran internally with lastest liburing package and on RocksDB
github repo with older version.
- Ran seekrandom regression to confirm there is no regression.
Reviewed By: anand1976
Differential Revision: D39284229
Pulled By: akankshamahajan15
fbshipit-source-id: fd52cdf23d676da114896163626b75c8ae09c980
Summary:
TL;DR: due to a recent change, if you drop a column family,
often that DB will no longer fsync after writing new SST files
to remaining or new column families, which could lead to data
loss on power loss.
More bug detail:
The intent of https://github.com/facebook/rocksdb/issues/10049 was to Close FSDirectory objects at
DB::Close time rather than waiting for DB object destruction.
Unfortunately, it also closes shared FSDirectory objects on
DropColumnFamily (& destroy remaining handles), which can lead
to use-after-Close on FSDirectory shared with remaining column
families. Those "uses" are only Fsyncs (or redundant Closes). In
the default Posix filesystem, an Fsync on a closed FSDirectory is a
quiet no-op. Consequently (under most configurations), if you drop
a column family, that DB will no longer fsync after writing new SST
files to column families sharing the same directory (true under most
configurations).
More fix detail:
Basically, this removes unnecessary Close ops on destroying
ColumnFamilyData. We let `shared_ptr` take care of calling the
destructor at the right time. If the intent was to require Close be
called before destroying FSDirectory, that was not made clear by the
author of FileSystem and was not at all enforced by https://github.com/facebook/rocksdb/issues/10049, which
could have added `assert(fd_ == -1)` to `~PosixDirectory()` but did
not. To keep this fix simple, we relax the unit test for https://github.com/facebook/rocksdb/issues/10049 to allow
timely destruction of FSDirectory to suffice as Close (in
CountedFileSystem). Added a TODO to revisit that.
Also in this PR:
* Added a TODO to share FSDirectory instances between DB and its column
families. (Already shared among column families.)
* Made DB::Close attempt to close all its open FSDirectory objects even
if there is a failure in closing one. Also code clean-up around this
logic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10460
Test Plan:
add an assert to check for use-after-Close. With that
existing tests can detect the misuse. With fix, tests pass (except noted
relaxing of unit test for https://github.com/facebook/rocksdb/issues/10049)
Reviewed By: ajkr
Differential Revision: D38357922
Pulled By: pdillinger
fbshipit-source-id: d42079cadbedf0a969f03389bf586b3b4e1f9137
Summary:
Travis CI is depreciated and haven't been maintained for some time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10407
Reviewed By: ajkr
Differential Revision: D38078382
Pulled By: jay-zhuang
fbshipit-source-id: f42057f2f41f722bdce56bf195f67a94835191fb
Summary:
Fix bug in O_DIRECT and io_uring when its EOF and bytes_read =
0 because of wrong check, it got added into incomplete list and gets stuck in an infinite loop as it will always return bytes_read = 0. The bug was introduced by PR https://github.com/facebook/rocksdb/pull/10197 and that PR is not released yet in any release branch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10368
Test Plan: Added new unit test
Reviewed By: siying
Differential Revision: D37885184
Pulled By: akankshamahajan15
fbshipit-source-id: 35b36a44b696d29b2f6f25301aa1b19547b4e03b
Summary:
A recent PR https://github.com/facebook/rocksdb/pull/10142 enabled fadvise for mmaped file. However, we were told that it might not take effective and madvise() should be used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10170
Test Plan:
Run existing tests
Run a benchmark using mmap with advise random and see I/O size is indeed small.
Reviewed By: anand1976
Differential Revision: D37158582
fbshipit-source-id: 8b3a74f0e89d2e16aac78ee4124c05841d4135c3
Summary:
Right now with mmap file, we don't run fadvise following users' requests. There is no reason for that so this diff does that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10142
Test Plan:
A simple readrandom against files with page cache dropped shows latency improvement from 7.8 us to 2.8:
./db_bench -use_existing_db --benchmarks=readrandom --num=100
Reviewed By: anand1976
Differential Revision: D37074975
fbshipit-source-id: ccc72bcac1b5fd634eb8fa2b6a5d9afe332e0bf6
Summary:
In [close_db_dir](https://github.com/facebook/rocksdb/pull/10049) pull request, some merging conflicts occurred (some comments and one line `s.PermitUncheckedError()` are missing). This pull request aims to put them back.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10093
Reviewed By: ajkr
Differential Revision: D36884117
Pulled By: littlepig2013
fbshipit-source-id: 8c0e2a8793fc52804067c511843bd1ff4912c1c3
Summary:
Currently, the DB directory file descriptor is left open until the deconstruction process (`DB::Close()` does not close the file descriptor). To verify this, comment out the lines between `db_ = nullptr` and `db_->Close()` (line 512, 513, 514, 515 in ldb_cmd.cc) to leak the ``db_'' object, build `ldb` tool and run
```
strace --trace=open,openat,close ./ldb --db=$TEST_TMPDIR --ignore_unknown_options put K1 V1 --create_if_missing
```
There is one directory file descriptor that is not closed in the strace log.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10049
Test Plan: Add a new unit test DBBasicTest.DBCloseAllDirectoryFDs: Open a database with different WAL directory and three different data directories, and all directory file descriptors should be closed after calling Close(). Explicitly call Close() after a directory file descriptor is not used so that the counter of directory open and close should be equivalent.
Reviewed By: ajkr, hx235
Differential Revision: D36722135
Pulled By: littlepig2013
fbshipit-source-id: 07bdc2abc417c6b30997b9bbef1f79aa757b21ff
Summary:
ToString() is created as some platform doesn't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just remove ToString().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955
Test Plan: Watch CI tests
Reviewed By: riversand963
Differential Revision: D36176799
fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471
Summary:
If FilePrefetchBuffer object is destroyed and then later Poll() calls callback on object which has been destroyed, it gives segfault on accessing destroyed object. It was caught after adding unit tests that tests Posix implementation of ReadAsync and Poll APIs.
This PR also updates and fixes existing IOURing tests which were not running locally because RocksDbIOUringEnable function wasn't defined and IOUring was disabled for those tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9777
Test Plan: Added new unit test
Reviewed By: anand1976
Differential Revision: D35254002
Pulled By: akankshamahajan15
fbshipit-source-id: 68e80054ffb14ae25c255920ebc6548ca5f130a1
Summary:
Provide support for Async Read and Poll in Posix file system using IOUring.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9578
Test Plan: In progress
Reviewed By: anand1976
Differential Revision: D34690256
Pulled By: akankshamahajan15
fbshipit-source-id: 291cbd1380a3cb904b726c34c0560d1b2ce44a2e
Summary:
Loose ends relate to mmap on 32-bit systems. (Testing is more
complicated when the feature was completely disabled on 32-bit.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9386
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D33590715
Pulled By: pdillinger
fbshipit-source-id: f2637036a538a552200adee65b6765fce8cae27b
Summary:
Closing https://github.com/facebook/rocksdb/issues/5954
fsync/fdatasync on Linux:
```
(fsync/fdatasync) includes writing through or flushing a disk cache if present.
```
However, on OS X and iOS:
```
(fsync) will flush all data from the host to the drive (i.e. the "permanent storage device"),
the drive itself may not physically write the data to the platters for quite some time and it
may be written in an out-of-order sequence.
```
Solution is to use `fcntl(F_FULLFSYNC)` on OS X so that we get the same
persistence guarantee.
According to OSX man page,
```
The F_FULLFSYNC fcntl asks the drive to flush **all** buffered data to permanent storage.
```
This suggests that it will be no faster than `fsync` on Linux, since Linux, according to its man page,
```
writing through or flushing a disk cache if present
```
It means Linux may not flush **all** data from disk cache.
This is similar to bug reports/fixes in:
- golang: https://github.com/golang/go/issues/26650
- leveldb: 296de8d5b8.
Not sure if we should fallback to fsync since we break persistence contract.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9356
Reviewed By: jay-zhuang
Differential Revision: D33417416
Pulled By: riversand963
fbshipit-source-id: 475548ff9c5eaccde325e0f6842694271cbc8cb7
Summary:
Directory fsync might be expensive on btrfs and it may not be needed.
Here are 4 directory fsync cases:
1. creating a new file: dir-fsync is not needed on btrfs, as long as the
new file itself is synced.
2. renaming a file: dir-fsync is not needed if the renamed file is
synced. So an API `FsyncAfterFileRename(filename, ...)` is provided
to sync the file on btrfs. By default, it just calls dir-fsync.
3. deleting files: dir-fsync is forced by set
`IOOptions.force_dir_fsync = true`
4. renaming multiple files (like backup and checkpoint): dir-fsync is
forced, the same as above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8903
Test Plan: run tests on btrfs and non btrfs
Reviewed By: ajkr
Differential Revision: D30885059
Pulled By: jay-zhuang
fbshipit-source-id: dd2730b31580b0bcaedffc318a762d7dbf25de4a
Summary:
Potential bugs in the IO uring implementation can cause bad data to be returned in the completion queue. Add some checks in the PosixRandomAccessFile::MultiRead completion handling code to catch such errors and fail the entire MultiRead. Also log some diagnostic messages and stack trace.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8894
Reviewed By: siying, pdillinger
Differential Revision: D30826982
Pulled By: anand1976
fbshipit-source-id: af91815ac760e095d6cc0466cf8bd5c10167fd15
Summary:
Right now return codes by io_uring_submit_and_wait() and io_uring_wait_cqe() are not handled. It is not the good practice. Although these two functions are not supposed to return non-0 values in normal exeuction, people suspect that they might return non-0 value when an interruption happens, and the code might cause hanging.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8311
Test Plan: Make sure at least normal test cases still pass.
Reviewed By: anand1976
Differential Revision: D28500828
fbshipit-source-id: 8a76cea9cafbd041102e0b6a8eef9d0bfed7c211
Summary:
Refactor kill point to one single class, rather than several extern variables. The intention was to drop unflushed data before killing to simulate some job, and I tried to a pointer to fault ingestion fs to the killing class, but it ended up with harder than I thought. Perhaps we'll need to do this in another way. But I thought the refactoring itself is good so I send it out.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8241
Test Plan: make release and run crash test for a while.
Reviewed By: anand1976
Differential Revision: D28078486
fbshipit-source-id: f9182c1455f52e6851c13f88a21bade63bcec45f
Summary:
Currently, we only truncate the latest alive WAL files when the DB is opened. If the latest WAL file is empty or was flushed during Open, its not truncated since the file will be deleted later on in the Open path. However, before deletion, a new WAL file is created, and if the process crash loops between the new WAL file creation and deletion of the old WAL file, the preallocated space will keep accumulating and eventually use up all disk space. To prevent this, always truncate the latest WAL file, even if its empty or the data was flushed.
Tests:
Add unit tests to db_wal_test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8122
Reviewed By: riversand963
Differential Revision: D27366132
Pulled By: anand1976
fbshipit-source-id: f923cc03ef033ccb32b140d36c6a63a8152f0e8e
Summary:
`strerror()` is not thread-safe, using `strerror_r()` instead. The API could be different on the different platforms, used the code from 0deef031cb/folly/String.cpp (L457)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8087
Reviewed By: mrambacher
Differential Revision: D27267151
Pulled By: jay-zhuang
fbshipit-source-id: 4b8856d1ec069d5f239b764750682c56e5be9ddb
Summary:
This is a PR generated **semi-automatically** by an internal tool to remove unused includes and `using` statements.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7604
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D24579392
Pulled By: riversand963
fbshipit-source-id: c4bfa6c6b08da1de186690d37eb73d8fff45aecd
Summary:
Make (most of) the env*_test pass when ASSERT_STATUS_CHECKED is enabled.
One test that opens a database is currently disabled in this mode, as there are many errors that need revisited for DB tests and status checks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7176
Reviewed By: cheng-chang
Differential Revision: D22799278
Pulled By: ajkr
fbshipit-source-id: 16d8a02eaeecd6df1060249b6a5811292801f2ed
Summary:
If both direct IO and IO uring are enabled, when IO uring returns partial result, we'll try to read the remaining part of the request, but the starting address/offset of the remaining part might not be aligned to the block size, in direct IO mode, the unaligned offset causes bug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6853
Test Plan: run make check with both direct IO and IO uring enabled, this is covered by one of the continuous tests.
Reviewed By: anand1976
Differential Revision: D21603023
Pulled By: cheng-chang
fbshipit-source-id: 942f6a11ff21e1892af6c4464e02bab4c707787c
Summary:
In release mode, asserts are not compiled, so `r` is not used, causing compiler warnings.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6750
Test Plan: make check under release mode
Reviewed By: anand1976
Differential Revision: D21220365
Pulled By: cheng-chang
fbshipit-source-id: fd4afa9843d54af68c4da8660ec61549803e1167
Summary:
This change updates PosixSequentialFile::Read to call clearerr()
before fread()ing again after an EINTR is returned on a previous
fread.
The original fix is from bd8f1ebb91.
Fixing https://github.com/facebook/rocksdb/issues/6509
Signed-off-by: phantomape <cxucheng@outlook.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6609
Reviewed By: zhichao-cao
Differential Revision: D20731482
Pulled By: riversand963
fbshipit-source-id: 7f1f3a1449077d5560f45c465a78d08633740ba0
Summary:
In Linux, when reopening DB with many SST files, profiling shows that 100% system cpu time spent for a couple of seconds for `GetLogicalBufferSize`. This slows down MyRocks' recovery time when site is down.
This PR introduces two new APIs:
1. `Env::RegisterDbPaths` and `Env::UnregisterDbPaths` lets `DB` tell the env when it starts or stops using its database directories . The `PosixFileSystem` takes this opportunity to set up a cache from database directories to the corresponding logical block sizes.
2. `LogicalBlockSizeCache` is defined only for OS_LINUX to cache the logical block sizes.
Other modifications:
1. rename `logical buffer size` to `logical block size` to be consistent with Linux terms.
2. declare `GetLogicalBlockSize` in `PosixHelper` to expose it to `PosixFileSystem`.
3. change the functions `IOError` and `IOStatus` in `env/io_posix.h` to have external linkage since they are used in other translation units too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6457
Test Plan:
1. A new unit test is added for `LogicalBlockSizeCache` in `env/io_posix_test.cc`.
2. A new integration test is added for `DB` operations related to the cache in `db/db_logical_block_size_cache_test.cc`.
`make check`
Differential Revision: D20131243
Pulled By: cheng-chang
fbshipit-source-id: 3077c50f8065c0bffb544d8f49fb10bba9408d04
Summary:
The logic that handles io_uring partial results was wrong. Fix the logic by putting it into a queue and continue reading.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6441
Test Plan: Make sure this patch fixes the application test case where the bug was discovered; in env_test, add a unit test that simulates partial results and make sure the results are still correct.
Differential Revision: D20018616
fbshipit-source-id: 5398a7e34d74c26d52aa69dfd604e93e95d99c62
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
Differential Revision: D19977691
fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
Summary:
ASAN reports:
internal_repo_rocksdb/repo:db_test - MultiThreaded/MultiThreadedDBTest.MultiThreaded/43: fatal
==2692739==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6130000500ca at pc 0x0000006be780 bp 0x7efef85ccd20 sp 0x7efef85cc4d0
[CONTEXT] === How to use this, how to get the raw stack trace, and more: fburl.com/ASAN ===
[CONTEXT] READ of size 331 at 0x6130000500ca thread T195
[CONTEXT] #0 db_test_bin+0x6be77f __interceptor_strlen.part.35
[CONTEXT] https://github.com/facebook/rocksdb/issues/1 internal_repo_rocksdb/repo/include/rocksdb/slice.h:55 rocksdb::Slice::Slice(char const*)
[CONTEXT] https://github.com/facebook/rocksdb/issues/2 internal_repo_rocksdb/repo/env/io_posix.cc:522 rocksdb::PosixRandomAccessFile::MultiRead(rocksdb::ReadRequest*, unsigned long)
I looked at env/io_posix.cc:522 but don't see a reason why the line needs to be there at all, because it is not used before overwritten. So it must be a line that is put there as a bug. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6135
Test Plan: Rerun the same test which passes after the fix. Run all the tests and make sure they all pass.
Differential Revision: D18880251
fbshipit-source-id: 3b84ac6a05b67b529c4202e0ceb4c047460f44f2
Summary:
Right now, PosixRandomAccessFile::MultiRead() executes read requests in parallel. In this PR, it leverages I/O Uring library to run it in parallel, even when page cache is enabled. This function will fall back if the kernel version doesn't support it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5881
Test Plan: Run the unit test on a kernel version supporting it and make sure all tests pass, and run a unit test on kernel version supporting it and see it pass. Before merging, will also run stress test and see it passes.
Differential Revision: D17742266
fbshipit-source-id: e05699c925ac04fdb42379456a4e23e4ebcb803a
Summary:
Current PosixLogger performs IO operations using posix calls. Thus the
current implementation will not work for non-posix env. Created a new
logger class EnvLogger that uses env specific WritableFileWriter for IO operations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5491
Test Plan: make check
Differential Revision: D15909002
Pulled By: ggaurav28
fbshipit-source-id: 13a8105176e8e42db0c59798d48cb6a0dbccc965
Summary:
`sync_file_range` returns `ENOSYS` on Windows Subsystem for Linux even
when using a supposedly supported filesystem like ext4. To handle this
case we can do a dynamic check that a no-op `sync_file_range`
invocation, which is accomplished by passing zero for the `flags`
argument, succeeds.
Also I rearranged the function and comments to hopefully make it more
easily understandable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5416
Differential Revision: D15807061
fbshipit-source-id: d31d94e1f228b7850ea500e6199f8b5daf8cfbd3
Summary:
Many logging related source files are under util/. It will be more structured if they are together.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387
Differential Revision: D15579036
Pulled By: siying
fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f
Summary:
There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo
ve them to a new directory test_util/, so that util/ is cleaner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377
Differential Revision: D15551366
Pulled By: siying
fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c
Summary:
This is a workaround for the issue described in #5169.
It has been tested on a database with very large values, but not dedicated test has been added to the code base.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5213
Differential Revision: D15243116
Pulled By: siying
fbshipit-source-id: e0c226a6cd71a60924dcd7ce7af74abcb4054484
Summary:
The existing implementation does not guarantee bytes reach disk every `bytes_per_sync` when writing SST files, or every `wal_bytes_per_sync` when writing WALs. This can cause confusing behavior for users who enable this feature to avoid large syncs during flush and compaction, but then end up hitting them anyways.
My understanding of the existing behavior is we used `sync_file_range` with `SYNC_FILE_RANGE_WRITE` to submit ranges for async writeback, such that we could continue processing the next range of bytes while that I/O is happening. I believe we can preserve that benefit while also limiting how far the processing can get ahead of the I/O, which prevents huge syncs from happening when the file finishes.
Consider this `sync_file_range` usage: `sync_file_range(fd_, 0, static_cast<off_t>(offset + nbytes), SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE)`. Expanding the range to start at 0 and adding the `SYNC_FILE_RANGE_WAIT_BEFORE` flag causes any pending writeback (like from a previous call to `sync_file_range`) to finish before it proceeds to submit the latest `nbytes` for writeback. The latest `nbytes` are still written back asynchronously, unless processing exceeds I/O speed, in which case the following `sync_file_range` will need to wait on it.
There is a second change in this PR to use `fdatasync` when `sync_file_range` is unavailable (determined statically) or has some known problem with the underlying filesystem (determined dynamically).
The above two changes only apply when the user enables a new option, `strict_bytes_per_sync`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5183
Differential Revision: D14953553
Pulled By: siying
fbshipit-source-id: 445c3862e019fb7b470f9c7f314fc231b62706e9
Summary:
User report has shown that sometimes `BlockBasedTable::SetupCacheKeyPrefix` would assert when trying to generate an id from the file. The actual cause seems to be hardware related but we might be better off without the incorrect assertion
See T42178927 for more information
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5102
Differential Revision: D14604677
Pulled By: miasantreble
fbshipit-source-id: fcb09207ebdc4fa66e941afbc0523d84797e7ad7