Summary:
Fixes https://github.com/facebook/rocksdb/issues/5734. By reading the code the assert don't quite make sense to me, since `dataSize` and `fileOffset` has no correlation. But my knowledge about `EncryptedEnv` is very limited.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5735
Test Plan:
run `ENCRYPTED_ENV=1 ./db_encryption_test`
Signed-off-by: Yi Wu <yiwu@pingcap.com>
Differential Revision: D17133849
fbshipit-source-id: bb7262d308e5b2503c400b180edc252668df0ef0
Summary:
The ObjectRegistry class replaces the Registrar and NewCustomObjects. Objects are registered with the registry by Type (the class must implement the static const char *Type() method).
This change is necessary for a few reasons:
- By having a class (rather than static template instances), the class can be passed between compilation units, meaning that objects could be registered and shared from a dynamic library with an executable.
- By having a class with instances, different units could have different objects registered. This could be useful if, for example, one Option allowed for a dynamic library and one did not.
When combined with some other PRs (being able to load shared libraries, a Configurable interface to configure objects to/from string), this code will allow objects in external shared libraries to be added to a RocksDB image at run-time, rather than requiring every new extension to be built into the main library and called explicitly by every program.
Test plan (on riversand963's devserver)
```
$COMPILE_WITH_ASAN=1 make -j32 all && sleep 1 && make check
```
All tests pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5293
Differential Revision: D16363396
Pulled By: riversand963
fbshipit-source-id: fbe4acb615bfc11103eef40a0b288845791c0180
Summary:
Current PosixLogger performs IO operations using posix calls. Thus the
current implementation will not work for non-posix env. Created a new
logger class EnvLogger that uses env specific WritableFileWriter for IO operations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5491
Test Plan: make check
Differential Revision: D15909002
Pulled By: ggaurav28
fbshipit-source-id: 13a8105176e8e42db0c59798d48cb6a0dbccc965
Summary:
The usage of `AlignedBuffer` in env_encryption.cc writes and reads to/from the AlignedBuffer's internal buffer directly without going through AlignedBuffer's APIs (like `Append` and `Read`), causing encapsulation to break in some cases. The writes are especially problematic as after the data is written to the buffer (directly using either memmove or memcpy), the size of the buffer is not updated ... causing the AlignedBuffer to lose track of the encapsulated buffer's current size.
Fixed this by updating the buffer size after every write.
Todo for later:
Add an overloaded method to AlignedBuffer to support a memmove in addition to a memcopy. Encryption env does a memmove, and hence I couldn't switch to using `AlignedBuffer.Append()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5396
Test Plan: `make check`
Differential Revision: D15764756
Pulled By: sagar0
fbshipit-source-id: 2e24b52bd3b4b5056c5c1da157f91ddf89370183
Summary:
The changes in 8272a6de57 were untested with `USE_HDFS=1`. There were a couple compiler errors. This PR fixes them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5444
Test Plan:
```
$ EXTRA_LDFLAGS="-L/tmp/hadoop-3.1.2/lib/native/" EXTRA_CXXFLAGS="-I/tmp/hadoop-3.1.2/include" USE_HDFS=1 make -j12 check
```
Differential Revision: D15885009
fbshipit-source-id: 2a0a63739e0b9a2819b461ad63ce1292c4833fe2
Summary:
`sync_file_range` returns `ENOSYS` on Windows Subsystem for Linux even
when using a supposedly supported filesystem like ext4. To handle this
case we can do a dynamic check that a no-op `sync_file_range`
invocation, which is accomplished by passing zero for the `flags`
argument, succeeds.
Also I rearranged the function and comments to hopefully make it more
easily understandable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5416
Differential Revision: D15807061
fbshipit-source-id: d31d94e1f228b7850ea500e6199f8b5daf8cfbd3
Summary:
This affects some TSAN builds:
env/env_test.cc: In member function ‘virtual void rocksdb::EnvPosixTestWithParam_MultiRead_Test::TestBody()’:
env/env_test.cc:1126:76: error: type qualifiers ignored on cast result type [-Werror=ignored-qualifiers]
auto data = NewAligned(kSectorSize * 8, static_cast<const char>(i + 1));
^
env/env_test.cc:1154:77: error: type qualifiers ignored on cast result type [-Werror=ignored-qualifiers]
auto buf = NewAligned(kSectorSize * 8, static_cast<const char>(i*2 + 1));
^
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5432
Differential Revision: D15727277
Pulled By: ltamasi
fbshipit-source-id: dc0e687b123e7c4d703ccc0c16b7167e07d1c9b0
Summary:
Previous code has a warning when compile with tsan, leading to an error since we have -Werror.
Compilation result
```
In file included from ./env/env_chroot.h:12,
from env/env_test.cc:40:
./include/rocksdb/env.h: In instantiation of ‘rocksdb::Status rocksdb::DynamicLibrary::LoadFunction(const string&, std::function<T>*) [with T = void*(void*, const char*); std::__cxx11::string = std::__cxx11::basic_string<char>]’:
env/env_test.cc:260:5: required from here
./include/rocksdb/env.h:1010:17: error: cast between incompatible function types from ‘rocksdb::DynamicLibrary::FunctionPtr’ {aka ‘void* (*)()’} to ‘void* (*)(void*, const char*)’ [-Werror=cast-function-type]
*function = reinterpret_cast<T*>(ptr);
^~~~~~~~~~~~~~~~~~~~~~~~~
cc1plus: all warnings being treated as errors
make: *** [env/env_test.o] Error 1
```
It also has another error reported by clang
```
env/env_posix.cc:141:11: warning: Value stored to 'err' during its initialization is never read
char* err = dlerror(); // Clear any old error
^~~ ~~~~~~~~~
1 warning generated.
```
Test plan (on my devserver).
```
$make clean
$OPT=-g ROCKSDB_FBCODE_BUILD_WITH_PLATFORM007=1 COMPILE_WITH_TSAN=1 make -j32
$
$make clean
$USE_CLANG=1 TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make -j1 analyze
```
Both should pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5414
Differential Revision: D15637315
Pulled By: riversand963
fbshipit-source-id: 8e307483761019a4d5998cab92d49516d7edffbf
Summary:
Define the Env:: MultiRead() method to allow callers to request multiple block reads in one shot. The underlying Env implementation can parallelize it if it chooses to in order to reduce the overall IO latency.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5311
Differential Revision: D15502172
Pulled By: anand1976
fbshipit-source-id: 2b228269c2e11b5f54694d6b2bb3119c8a8ce2b9
Summary:
This change adds a Dynamic Library class to the RocksDB Env. Dynamic libraries are populated via the Env::LoadLibrary method.
The addition of dynamic library support allows for a few different features to be developed:
1. The compression code can be changed to use dynamic library support. This would allow RocksDB to determine at run-time what compression packages were installed. This change would eliminate the need to make sure the build-time and run-time environment had the same library set. It would also simplify some of the Java build issues (where it attempts to build and include various packages inside the RocksDB jars).
2. Along with other features (to be provided in a subsequent PR), this change would allow code/configurations to be added to RocksDB at run-time. For example, the build system includes code for building an "rados" environment and adding "Cassandra" features. Instead of these extensions being built into the base RocksDB code, these extensions could be loaded at run-time as required/appropriate, either by configuration or explicitly.
We intend to push out other changes in support of the extending RocksDB at run-time via configurations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5281
Differential Revision: D15447613
Pulled By: riversand963
fbshipit-source-id: 452cd4f54511c0bceee18f6d9d919aae9fd25fef
Summary:
Many logging related source files are under util/. It will be more structured if they are together.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387
Differential Revision: D15579036
Pulled By: siying
fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f
Summary:
There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo
ve them to a new directory test_util/, so that util/ is cleaner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377
Differential Revision: D15551366
Pulled By: siying
fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c
Summary:
This is a workaround for the issue described in #5169.
It has been tested on a database with very large values, but not dedicated test has been added to the code base.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5213
Differential Revision: D15243116
Pulled By: siying
fbshipit-source-id: e0c226a6cd71a60924dcd7ce7af74abcb4054484
Summary:
The existing implementation does not guarantee bytes reach disk every `bytes_per_sync` when writing SST files, or every `wal_bytes_per_sync` when writing WALs. This can cause confusing behavior for users who enable this feature to avoid large syncs during flush and compaction, but then end up hitting them anyways.
My understanding of the existing behavior is we used `sync_file_range` with `SYNC_FILE_RANGE_WRITE` to submit ranges for async writeback, such that we could continue processing the next range of bytes while that I/O is happening. I believe we can preserve that benefit while also limiting how far the processing can get ahead of the I/O, which prevents huge syncs from happening when the file finishes.
Consider this `sync_file_range` usage: `sync_file_range(fd_, 0, static_cast<off_t>(offset + nbytes), SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE)`. Expanding the range to start at 0 and adding the `SYNC_FILE_RANGE_WAIT_BEFORE` flag causes any pending writeback (like from a previous call to `sync_file_range`) to finish before it proceeds to submit the latest `nbytes` for writeback. The latest `nbytes` are still written back asynchronously, unless processing exceeds I/O speed, in which case the following `sync_file_range` will need to wait on it.
There is a second change in this PR to use `fdatasync` when `sync_file_range` is unavailable (determined statically) or has some known problem with the underlying filesystem (determined dynamically).
The above two changes only apply when the user enables a new option, `strict_bytes_per_sync`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5183
Differential Revision: D14953553
Pulled By: siying
fbshipit-source-id: 445c3862e019fb7b470f9c7f314fc231b62706e9
Summary:
Change the behavior of OptimizeForSmallDb() so that it is less likely to go out of memory.
Change the behavior of OptimizeForPointLookup() to take advantage of the new memtable whole key filter, and move away from prefix extractor as well as hash-based indexing, as they are prone to misuse.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5165
Differential Revision: D14880709
Pulled By: siying
fbshipit-source-id: 9af30e3c9e151eceea6d6b38701a58f1f9fb692d
Summary:
This fix should help reading from encrypted files if the file-to-be-read
is smaller than expected. For example, when using the encrypted env and
making it read a journal file of exactly 0 bytes size, the encrypted env
code crashes with SIGSEGV in its Decrypt function, as there is no check
if the read attempts to read over the file's boundaries (as specified
originally by the `dataSize` parameter).
The most important problem this patch addresses is however that there is
no size underlow check in `CTREncryptionProvider::CreateCipherStream`:
The stream to be read will be initialized to a size of always
`prefix.size() - (2 * blockSize)`. If the prefix however is smaller than
twice the block size, this will obviously assume a _very_ large stream
and read over the bounds. The patch adds a check here as follows:
// If the prefix is smaller than twice the block size, we would below read a
// very large chunk of the file (and very likely read over the bounds)
assert(prefix.size() >= 2 * blockSize);
if (prefix.size() < 2 * blockSize) {
return Status::Corruption("Unable to read from file " + fname + ": read attempt would read beyond file bounds");
}
so embedders can catch the error in their release builds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5160
Differential Revision: D14834633
Pulled By: sagar0
fbshipit-source-id: 47aa39a6db8977252cede054c7eb9a663b9a3484
Summary:
Fix some hdfs-related code so that it can compile and run 'db_stress'
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5122
Differential Revision: D14675495
Pulled By: riversand963
fbshipit-source-id: cac280479efcf5451982558947eac1732e8bc45a
Summary:
This PR allows RocksDB to run in single-primary, multi-secondary process mode.
The writer is a regular RocksDB (e.g. an `DBImpl`) instance playing the role of a primary.
Multiple `DBImplSecondary` processes (secondaries) share the same set of SST files, MANIFEST, WAL files with the primary. Secondaries tail the MANIFEST of the primary and apply updates to their own in-memory state of the file system, e.g. `VersionStorageInfo`.
This PR has several components:
1. (Originally in #4745). Add a `PathNotFound` subcode to `IOError` to denote the failure when a secondary tries to open a file which has been deleted by the primary.
2. (Similar to #4602). Add `FragmentBufferedReader` to handle partially-read, trailing record at the end of a log from where future read can continue.
3. (Originally in #4710 and #4820). Add implementation of the secondary, i.e. `DBImplSecondary`.
3.1 Tail the primary's MANIFEST during recovery.
3.2 Tail the primary's MANIFEST during normal processing by calling `ReadAndApply`.
3.3 Tailing WAL will be in a future PR.
4. Add an example in 'examples/multi_processes_example.cc' to demonstrate the usage of secondary RocksDB instance in a multi-process setting. Instructions to run the example can be found at the beginning of the source code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4899
Differential Revision: D14510945
Pulled By: riversand963
fbshipit-source-id: 4ac1c5693e6012ad23f7b4b42d3c374fecbe8886
Summary:
User report has shown that sometimes `BlockBasedTable::SetupCacheKeyPrefix` would assert when trying to generate an id from the file. The actual cause seems to be hardware related but we might be better off without the incorrect assertion
See T42178927 for more information
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5102
Differential Revision: D14604677
Pulled By: miasantreble
fbshipit-source-id: fcb09207ebdc4fa66e941afbc0523d84797e7ad7
Summary:
[RocksDB] Make it easier for users to load options from option file and set shared block cache.
Right now, it requires several dynamic casting for users to set the shared block cache to their option struct cast from the option file.
If people don't do that, every CF of every DB will generate its own 8MB block cache. It's not a usable setting. So we are dragging every user who loads options from the file into such a mess.
Instead, we should allow them to pass their cache object to LoadLatestOptions() and LoadOptionsFromFile(), so that those loaded option structs will have the shared block cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5063
Differential Revision: D14518584
Pulled By: rashmishrm
fbshipit-source-id: c91430ff9425a0e67d76fc67931d755f491ca5aa
Summary:
The compiler flag `-DROCKSDB_FALLOCATE_PRESENT` was only set when
`fallocate`, `FALLOC_FL_KEEP_SIZE`, and `FALLOC_FL_PUNCH_HOLE` were all
present. However, the last of the three is not really necessary for the
primary `fallocate` use case; furthermore, it was introduced only in later
Linux kernel versions (2.6.38+).
This PR changes the flag `-DROCKSDB_FALLOCATE_PRESENT` to only require
`fallocate` and `FALLOC_FL_KEEP_SIZE` to be present. There is a separate
check for `FALLOC_FL_PUNCH_HOLE` only in the place where it is used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5023
Differential Revision: D14248487
Pulled By: siying
fbshipit-source-id: a10ed0b902fa755988e957bd2dcec9081ec0502e
Summary:
The info log header feature never worked well, because log level Header was not
translated to Logger::LogHeader() call. Fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4980
Differential Revision: D14087283
Pulled By: siying
fbshipit-source-id: 7e7d03ce35fa8d13d4ee549f46f7326f7bc0006d
Summary:
nvme device path doesn't have "block" as like "nvme/nvme0/nvme0n1"
or "nvme/nvme0/nvme0n1/nvme0n1p1". the last directory such as
"nvme0n1p1" should be removed if nvme drive is partitioned.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4866
Differential Revision: D13627824
Pulled By: riversand963
fbshipit-source-id: 09ab968f349f3dbb890beea20193f1359b17d317
Summary:
Measure CPU time consumed for a compaction and report it in the stats report
Enable NowCPUNanos() to work for MacOS
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889
Differential Revision: D13701276
Pulled By: zinoale
fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055
Summary:
Introduce the first CPU timing counter, perf_context.get_cpu_nanos. This opens a door to more CPU counters in the future.
Only Posix Env has it implemented using clock_gettime() with CLOCK_THREAD_CPUTIME_ID. How accurate the counter is depends on the platform.
Make PerfStepTimer to take an Env as an argument, and sometimes pass it in. The direct reason is to make the unit tests to use SpecialEnv where we can ingest logic there. But in long term, this is a good change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4741
Differential Revision: D13287798
Pulled By: siying
fbshipit-source-id: 090361049d9d5095d1d1a369fe1338d2e2e1c73f
Summary:
Ran the following commands to recursively change all the files under RocksDB:
```
find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
```
Running `make format` updated some formatting on the files touched.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638
Differential Revision: D12934992
Pulled By: sagar0
fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
Summary:
`WritableFileWrapper` was missing some newer methods that were added to `WritableFile`. Without these functions, the missing wrapper methods would fallback to using the default implementations in WritableFile instead of using the corresponding implementations in, say, `PosixWritableFile` or `WinWritableFile`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4584
Differential Revision: D10559199
Pulled By: sagar0
fbshipit-source-id: 0d0f18a486aee727d5b8eebd3110a41988e27391
Summary:
Currently statistics are supposed to be dumped to info log at intervals of `options.stats_dump_period_sec`. However the implementation choice was to bind it with compaction thread, meaning if the database has been serving very light traffic, the stats may not get dumped at all.
We decided to separate stats dumping into a new timed thread using `TimerQueue`, which is already used in blob_db. This will allow us schedule new timed tasks with more deterministic behavior.
Tested with db_bench using `--stats_dump_period_sec=20` in command line:
> LOG:2018/09/17-14:07:45.575025 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
LOG:2018/09/17-14:08:05.643286 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
LOG:2018/09/17-14:08:25.691325 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
LOG:2018/09/17-14:08:45.740989 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
LOG content:
> 2018/09/17-14:07:45.575025 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
2018/09/17-14:07:45.575080 7fe99fbfe700 [WARN] [db/db_impl.cc:606]
** DB Stats **
Uptime(secs): 20.0 total, 20.0 interval
Cumulative writes: 4447K writes, 4447K keys, 4447K commit groups, 1.0 writes per commit group, ingest: 5.57 GB, 285.01 MB/s
Cumulative WAL: 4447K writes, 0 syncs, 4447638.00 writes per sync, written: 5.57 GB, 285.01 MB/s
Cumulative stall: 00:00:0.012 H:M:S, 0.1 percent
Interval writes: 4447K writes, 4447K keys, 4447K commit groups, 1.0 writes per commit group, ingest: 5700.71 MB, 285.01 MB/s
Interval WAL: 4447K writes, 0 syncs, 4447638.00 writes per sync, written: 5.57 MB, 285.01 MB/s
Interval stall: 00:00:0.012 H:M:S, 0.1 percent
** Compaction Stats [default] **
Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4382
Differential Revision: D9933051
Pulled By: miasantreble
fbshipit-source-id: 6d12bb1e4977674eea4bf2d2ac6d486b814bb2fa
Summary:
The assert in PosixEnv::FileExists is currently based on the return value of `access` syscall. Instead it should be based on errno.
Initially I wanted to remove this assert as [`access`](https://linux.die.net/man/2/access) can error out in a few other cases (like EROFS). But on thinking more it feels like the assert is doing the right thing ... its good to crash on EROFS, EFAULT, EINVAL, and other major filesystem related problems so that the user is immediately aware of the problems while testing.
(I think it might be ok to crash on EIO as well, but there might be a specific reason why it was decided not to crash for EIO, and I don't have that context. So letting the letting the assert checks remain as is for now).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4427
Differential Revision: D10037200
Pulled By: sagar0
fbshipit-source-id: 5cc96116a2e53cef701f444a8b5290576f311e51
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
Summary:
As you know, almost all compilers support "pragma once" keyword instead of using include guards. To be keep consistency between header files, all header files are edited.
Besides this, try to fix some warnings about loss of data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4339
Differential Revision: D9654990
Pulled By: ajkr
fbshipit-source-id: c2cf3d2d03a599847684bed81378c401920ca848
Summary:
In our application we spawn helper child processes concurrently with
opening rocksdb. In one situation I observed that the child process had inherited
the rocksdb lock file as well as directory handles to the rocksdb storage location.
The code in env_posix takes care to set CLOEXEC but doesn't use `O_CLOEXEC` at the
time that the files are opened which means that there is a window of opportunity
to leak the descriptors across a fork/exec boundary.
This diff introduces a helper that can conditionally set the `O_CLOEXEC` bit for
the open call using the same logic as that in the existing helper for setting
that flag post-open.
I've preserved the post-open logic for systems that don't have `O_CLOEXEC`.
I've introduced setting `O_CLOEXEC` for what appears to be a number of temporary
or transient files and directory handles; I suspect that none of the files
opened by Rocks are intended to be inherited by a forked child process.
In one case, `fopen` is used to open a file. I've added the use of the glibc-specific `e`
mode to turn on `O_CLOEXEC` for this case. While this doesn't cover all posix systems,
it is an improvement for our common deployment system.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4328
Reviewed By: ajkr
Differential Revision: D9553046
Pulled By: wez
fbshipit-source-id: acdb89f7a85ca649b22fe3c3bd76f82142bec2bf
Summary:
sysmacros.h should be included in OS_ANDROID build as well otherwise the compile would complain: error: use of undeclared identifier 'major'.
Fixes https://github.com/facebook/rocksdb/issues/4231
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4232
Differential Revision: D9217350
Pulled By: maysamyabandeh
fbshipit-source-id: 21f4b62dbbda3163120ac0b38b95d95d35d67dce
Summary:
The patch makes sure that two parallel test threads will operate on different db paths. This enables using open source tools such as gtest-parallel to run the tests of a file in parallel.
Example: ``` ~/gtest-parallel/gtest-parallel ./table_test```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4135
Differential Revision: D8846653
Pulled By: maysamyabandeh
fbshipit-source-id: 799bad1abb260e3d346bcb680d2ae207a852ba84
Summary:
The original `EnvPosixTest.RunImmediately` assumes that after scheduling
a background thread, the thread is guaranteed to complete after 0.1 second.
I do not know about any non-real-time OS/runtime providing this guarantee. Nor
does C++11 standard say anything about this in the documentation of `std::thread`.
In fact, we have observed this test failure multiple times on appveyor, and we
haven't been able to reproduce the failure deterministically. Therefore,
I disable this test for now until we know for sure how it used to fail.
Instead, I add another test `EnvPosixTest.RunEventually` that checks that
a thread will be scheduled eventually.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4126
Differential Revision: D8827086
Pulled By: riversand963
fbshipit-source-id: abc5cb655f90d50b791493da5eeb3716885dfe93
Summary:
Right now slow deletion with ftruncate doesn't work well with checkpoints because it ruin hard linked files in checkpoints. To fix it, check the file has no other hard link before ftruncate it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4093
Differential Revision: D8730360
Pulled By: siying
fbshipit-source-id: 756eea5bce8a87b9a2ea3a5bfa190b2cab6f75df
Summary:
Moved the direct-IO assertion to the top in `PosixSequentialFile::PositionedRead`, as it doesn't make sense to check for sector alignments before checking for direct IO.
Closes https://github.com/facebook/rocksdb/pull/3891
Differential Revision: D8267972
Pulled By: sagar0
fbshipit-source-id: 0ecf77c0fb5c35747a4ddbc15e278918c0849af7
Summary:
Here are some fixes for build on Solaris Sparc.
It is also fixing CRC test on BigEndian platforms.
Closes https://github.com/facebook/rocksdb/pull/4000
Differential Revision: D8455394
Pulled By: ajkr
fbshipit-source-id: c9289a7b541a5628139c6b77e84368e14dc3d174
Summary:
Rebased and resubmitting #1831 on behalf of stevelittle.
The problem is when a single process attempts to open the same DB twice, the second attempt fails due to LOCK file held. If the second attempt had opened the LOCK file, it'll now need to close it, and closing causes the file to be unlocked. Then, any subsequent attempt to open the DB will succeed, which is the wrong behavior.
The solution was to track which files a process has locked in PosixEnv, and check those before opening a LOCK file.
Fixes#1780.
Closes https://github.com/facebook/rocksdb/pull/3993
Differential Revision: D8398984
Pulled By: ajkr
fbshipit-source-id: 2755fe66950a0c9de63075f932f9e15768041918
Summary:
PR https://github.com/facebook/rocksdb/pull/3838 made some changes that triggers lint warnings.
Run `make format` to fix formatting as suggested by siying .
Also piggyback two changes:
1) fix singleton destruction order for windows and posix env
2) fix two clang warnings
Closes https://github.com/facebook/rocksdb/pull/3954
Differential Revision: D8272041
Pulled By: miasantreble
fbshipit-source-id: 7c4fd12bd17aac13534520de0c733328aa3c6c9f
Summary:
Ensure the PosixEnv singleton is destroyed first since its destructor waits for background threads to all complete. This ensures background threads cannot hit sync points after the SyncPoint singleton is destroyed, which was previously possible.
Closes https://github.com/facebook/rocksdb/pull/3951
Differential Revision: D8265295
Pulled By: ajkr
fbshipit-source-id: 7738dd458c5d993a78377dd0420e82badada81ab
Summary:
```PosixMmapReadableFile::fd_``` is closed after created, but needs to remain open for the lifetime of `PosixMmapReadableFile` since it is used whenever `InvalidateCache` is called.
Closes https://github.com/facebook/rocksdb/pull/2764
Differential Revision: D8152515
Pulled By: ajkr
fbshipit-source-id: b738a6a55ba4e392f9b0f374ff396a1e61c64f65
Summary:
Catch up with Posix features
NewWritableRWFile must fail when file does not exists
Implement Env::Truncate()
Adjust Env options optimization functions
Implement MemoryMappedBuffer on Windows.
Closes https://github.com/facebook/rocksdb/pull/3857
Differential Revision: D8053610
Pulled By: ajkr
fbshipit-source-id: ccd0d46c29648a9f6f496873bc1c9d6c5547487e
Summary:
Previously `DBOptions::use_direct_io_for_flush_and_compaction=true` combined with `DBOptions::use_direct_reads=false` could cause RocksDB to simultaneously read from two file descriptors for the same file, where background reads used direct I/O and foreground reads used buffered I/O. Our measurements found this mixed-mode I/O negatively impacted foreground read perf, compared to when only buffered I/O was used.
This PR makes the mixed-mode I/O situation impossible by repurposing `DBOptions::use_direct_io_for_flush_and_compaction` to only apply to background writes, and `DBOptions::use_direct_reads` to apply to all reads. There is no risk of direct background direct writes happening simultaneously with buffered reads since we never read from and write to the same file simultaneously.
Closes https://github.com/facebook/rocksdb/pull/3829
Differential Revision: D7915443
Pulled By: ajkr
fbshipit-source-id: 78bcbf276449b7e7766ab6b0db246f789fb1b279
Summary:
The only use of RandomRW is to change seqno when bulkloading, and in this use case, the file should exist. We should fail the file opening in this case.
Closes https://github.com/facebook/rocksdb/pull/3827
Differential Revision: D7913719
Pulled By: siying
fbshipit-source-id: 62cf6734f1a6acb9e14f715b927da388131c3492
Summary:
this is a repeat commit of a8a28da215, which got reverted together with 6afe22db2e, but forgotten about when that commit was un-reverted in 46152d53bf.
Closes https://github.com/facebook/rocksdb/pull/3796
Differential Revision: D7826077
Pulled By: ajkr
fbshipit-source-id: edb22375da56e2feda50c5b35f942f4d2d52b19c
Summary:
- Original commit: a4fb1f8c04
- Revert commit (we reverted as a quick fix to get crash tests passing): 6afe22db2e
This PR includes the contents of the original commit plus two bug fixes, which are:
- In whitebox crash test, only set `--expected_values_path` for `db_stress` runs in the first half of the crash test's duration. In the second half, a fresh DB is created for each `db_stress` run, so we cannot maintain expected state across `db_stress` runs.
- Made `Exists()` return true for `UNKNOWN_SENTINEL` values. I previously had an assert in `Exists()` that value was not `UNKNOWN_SENTINEL`. But it is possible for post-crash-recovery expected values to be `UNKNOWN_SENTINEL` (i.e., if the crash happens in the middle of an update), in which case this assertion would be tripped. The effect of returning true in this case is there may be cases where a `SingleDelete` deletes no data. But if we had returned false, the effect would be calling `SingleDelete` on a key with multiple older versions, which is not supported.
Closes https://github.com/facebook/rocksdb/pull/3793
Differential Revision: D7811671
Pulled By: ajkr
fbshipit-source-id: 67e0295bfb1695ff9674837f2e05bb29c50efc30
Summary:
crash-recovery verification is failing in the whitebox testing, which may or may not be a valid correctness issue -- need more time to investigate. In the meantime, reverting so we don't mask other failures.
Closes https://github.com/facebook/rocksdb/pull/3786
Differential Revision: D7794516
Pulled By: ajkr
fbshipit-source-id: 28ccdfdb9ec9b3b0fb08c15cbf9d2e282201ff33
Summary:
This change adds a virtual `Truncate` method to `Env`, which truncates
the named file to the specified size. At the moment, this is only
supported for `MockEnv`, but other `Env's` could be extended to override
the method too. This is the same approach that methods like `LinkFile` and
`AreSameFile` have taken.
This is useful for any user of the in-memory `Env`. The implementation's
header is not exported, so before this change, it was impossible to
access it's already existing `Truncate` method.
Closes https://github.com/facebook/rocksdb/pull/3779
Differential Revision: D7785789
Pulled By: ajkr
fbshipit-source-id: 3bcdaeea7b7180529f7d9b496dc67b791a00bbf0
Summary:
It seems clear to me that the variable is initialized before line 492, but it wasn't clear to UBSAN. The failure was:
```
In file included from ./env/io_posix.h:14:0,
from env/env_posix.cc:44:
./include/rocksdb/env.h: In member function ‘virtual rocksdb::Status rocksdb::{anonymous}::PosixEnv::NewMemoryMappedFileBuffer(const string&, std::unique_ptr<rocksdb::MemoryMappedFileBuffer>*)’:
./include/rocksdb/env.h:822:36: error: ‘base’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
: base(_base), length(_length) {}
^
env/env_posix.cc:482:11: note: ‘base’ was declared here
void* base;
```
We can just initialize to nullptr to keep UBSAN happy.
Closes https://github.com/facebook/rocksdb/pull/3770
Differential Revision: D7756287
Pulled By: ajkr
fbshipit-source-id: 0f2efb9594e2d3a30706a4ca7e1d4a6328031bf2
Summary:
Previously, our `db_stress` tool held the expected state of the DB in-memory, so after crash-recovery, there was no way to verify data correctness. This PR adds an option, `--expected_values_file`, which specifies a file holding the expected values.
In black-box testing, the `db_stress` process can be killed arbitrarily, so updates to the `--expected_values_file` must be atomic. We achieve this by `mmap`ing the file and relying on `std::atomic<uint32_t>` for atomicity. Actually this doesn't provide a total guarantee on what we want as `std::atomic<uint32_t>` could, in theory, be translated into multiple stores surrounded by a mutex. We can verify our assumption by looking at `std::atomic::is_always_lock_free`.
For the `mmap`'d file, we didn't have an existing way to expose its contents as a raw memory buffer. This PR adds it in the `Env::NewMemoryMappedFileBuffer` function, and `MemoryMappedFileBuffer` class.
`db_crashtest.py` is updated to use an expected values file for black-box testing. On the first iteration (when the DB is created), an empty file is provided as `db_stress` will populate it when it runs. On subsequent iterations, that same filename is provided so `db_stress` can check the data is as expected on startup.
Closes https://github.com/facebook/rocksdb/pull/3629
Differential Revision: D7463144
Pulled By: ajkr
fbshipit-source-id: c8f3e82c93e045a90055e2468316be155633bd8b
Summary:
Background activities like compaction can negatively affect
latency of higher-priority tasks like request processing. To avoid this,
rocksdb already lowers the IO priority of background threads on Linux
systems. While this takes care of typical IO-bound systems, it does not
help much when CPU (temporarily) becomes the bottleneck. This is
especially likely when using more expensive compression settings.
This patch adds an API to allow for lowering the CPU priority of
background threads, modeled on the IO priority API. Benchmarks (see
below) show significant latency and throughput improvements when CPU
bound. As a result, workloads with some CPU usage bursts should benefit
from lower latencies at a given utilization, or should be able to push
utilization higher at a given request latency target.
A useful side effect is that compaction CPU usage is now easily visible
in common tools, allowing for an easier estimation of the contribution
of compaction vs. request processing threads.
As with IO priority, the implementation is limited to Linux, degrading
to a no-op on other systems.
Closes https://github.com/facebook/rocksdb/pull/3763
Differential Revision: D7740096
Pulled By: gwicke
fbshipit-source-id: e5d32373e8dc403a7b0c2227023f9ce4f22b413c
Summary:
The test is flaky in our CI but could not be reproduce manually on the same CI host. Disabling it.
Closes https://github.com/facebook/rocksdb/pull/3753
Differential Revision: D7716320
Pulled By: yiwu-arbug
fbshipit-source-id: 6bed3b05880c1d24e8dc86bc970e5181bc98fb45
Summary:
Previously threads were named "rocksdb:bg\<index in thread pool\>", so the first thread in all thread pools would be named "rocksdb:bg0". Users want to be able to distinguish threads used for flush (high-pri) vs regular compaction (low-pri) vs compaction to bottom-level (bottom-pri). So I changed the thread naming convention to include the thread-pool priority.
Closes https://github.com/facebook/rocksdb/pull/3702
Differential Revision: D7581415
Pulled By: ajkr
fbshipit-source-id: ce04482b6acd956a401ef22dc168b84f76f7d7c1
Summary:
this PR fixes a few failed contbuild:
1. ASAN memory leak in Block::NewIterator (table/block.cc:429). the proper destruction of first_level_iter_ and second_level_iter_ of two_level_iterator.cc is missing from the code after the refactoring in https://github.com/facebook/rocksdb/pull/3406
2. various unused param errors introduced by https://github.com/facebook/rocksdb/pull/3662
3. updated comment for `ForceReleaseCachedEntry` to emphasize the use of `force_erase` flag.
Closes https://github.com/facebook/rocksdb/pull/3718
Reviewed By: maysamyabandeh
Differential Revision: D7621192
Pulled By: miasantreble
fbshipit-source-id: 476c94264083a0730ded957c29de7807e4f5b146
Summary:
Calling rocksdb::Log, rocksdb::Info, etc with a `shared_ptr<Logger>` should behave the same as calling those functions with a `Logger *`. This PR achieves it by making the `shared_ptr<Logger>` versions delegate to the `Logger *` versions.
Closes#3689
Closes https://github.com/facebook/rocksdb/pull/3710
Differential Revision: D7595557
Pulled By: ajkr
fbshipit-source-id: 64dd7f20fd42dc821bac7b8032705c35b483e00d
Summary:
This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557.
Closes https://github.com/facebook/rocksdb/pull/3662
Differential Revision: D7426121
Pulled By: Dayvedde
fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352
Summary:
This diff handles cases where compaction causes an ENOSPC error.
This does not handle corner cases where another background job is started while compaction is running, and the other background job triggers ENOSPC, although we do allow the user to provision for these background jobs with SstFileManager::SetCompactionBufferSize.
It also does not handle the case where compaction has finished and some other background job independently triggers ENOSPC.
Usage: Functionality is inside SstFileManager. In particular, users should set SstFileManager::SetMaxAllowedSpaceUsage, which is the reference highwatermark for determining whether to cancel compactions.
Closes https://github.com/facebook/rocksdb/pull/3449
Differential Revision: D7016941
Pulled By: amytai
fbshipit-source-id: 8965ab8dd8b00972e771637a41b4e6c645450445
Summary:
In attempting to build a static lib for use in iOS, I ran in to lots of type errors between uint64_t and size_t. This PR contains the changes I made to get `TARGET_OS=IOS make static_lib` to succeed while also getting Xcode to build successfully with the resulting `librocksdb.a` library imported.
This also compiles for me on macOS and tests fine, but I'm really not sure if I made the correct decisions about where to `static_cast` and where to change types.
Also up for discussion: is iOS worth supporting? Getting the static lib is just part one, we aren't providing any bridging headers or wrappers like the ObjectiveRocks project, it won't be a great experience.
Closes https://github.com/facebook/rocksdb/pull/3503
Differential Revision: D7106457
Pulled By: gfosco
fbshipit-source-id: 82ac2073de7e1f09b91f6b4faea91d18bd311f8e
Summary:
This patch addressed several issues.
Portability including db_test std::thread -> port::Thread Cc: @
and %z to ROCKSDB portable macro. Cc: maysamyabandeh
Implement Env::AreFilesSame
Make the implementation of file unique number more robust
Get rid of C-runtime and go directly to Windows API when dealing
with file primitives.
Implement GetSectorSize() and aling unbuffered read on the value if
available.
Adjust Windows Logger for the new interface, implement CloseImpl() Cc: anand1976
Fix test running script issue where $status var was of incorrect scope
so the failures were swallowed and not reported.
DestroyDB() creates a logger and opens a LOG file in the directory
being cleaned up. This holds a lock on the folder and the cleanup is
prevented. This fails one of the checkpoin tests. We observe the same in production.
We close the log file in this change.
Fix DBTest2.ReadAmpBitmapLiveInCacheAfterDBClose failure where the test
attempts to open a directory with NewRandomAccessFile which does not
work on Windows.
Fix DBTest.SoftLimit as it is dependent on thread timing. CC: yiwu-arbug
Closes https://github.com/facebook/rocksdb/pull/3552
Differential Revision: D7156304
Pulled By: siying
fbshipit-source-id: 43db0a757f1dfceffeb2b7988043156639173f5b
Summary:
Currently FreeBSD build is broken in master and possibly some previous releases due to unrecognized symbol `O_DIRECT`.
This PR will fix the build on FreeBSD
Closes https://github.com/facebook/rocksdb/pull/3560
Differential Revision: D7148646
Pulled By: miasantreble
fbshipit-source-id: 95b6c3d310fa531267c086b2cd40a5ab1c042b5a
Summary:
The recent Logger::Close() and DBImpl::Close() implementation rely on
calling the CloseImpl() virtual function from the destructor, which will
not work. Refactor the implementation to have a private close helper
function in derived classes that can be called by both CloseImpl() and
the destructor.
Closes https://github.com/facebook/rocksdb/pull/3528
Reviewed By: gfosco
Differential Revision: D7049303
Pulled By: anand1976
fbshipit-source-id: 76a64cbf403209216dfe4864ecf96b5d7f3db9f4
Summary:
- removed a few unneeded variables
- fused some variable declarations and their assignments
- fixed right-trimming code in string_util.cc to not underflow
- simplifed an assertion
- move non-nullptr check assertion before dereferencing of that pointer
- pass an std::string function parameter by const reference instead of by value (avoiding potential copy)
Closes https://github.com/facebook/rocksdb/pull/3507
Differential Revision: D7004679
Pulled By: sagar0
fbshipit-source-id: 52944952d9b56dfcac3bea3cd7878e315bb563c4
Summary:
Currently, the only way to close an open DB is to destroy the DB
object. There is no way for the caller to know the status. In one
instance, the destructor encountered an error due to failure to
close a log file on HDFS. In order to prevent silent failures, we add
DB::Close() that calls CloseImpl() which must be implemented by its
descendants.
The main failure point in the destructor is closing the log file. This
patch also adds a Close() entry point to Logger in order to get status.
When DBOptions::info_log is allocated and owned by the DBImpl, it is
explicitly closed by DBImpl::CloseImpl().
Closes https://github.com/facebook/rocksdb/pull/3348
Differential Revision: D6698158
Pulled By: anand1976
fbshipit-source-id: 9468e2892553eb09c4c41b8723f590c0dbd8ab7d
Summary:
Add ROCKSDB_VALGRIND_RUN macro and suppress false-positive "unimplemented functionality" throw by valgrind for steam hints.
Another approach would be add a valgrind suppress file. Valgrind is suppose to print the suppression when given "--gen-suppressions=all" param, which is suppose to be the content for the suppression file. But it doesn't print.
Closes https://github.com/facebook/rocksdb/pull/3174
Differential Revision: D6338786
Pulled By: yiwu-arbug
fbshipit-source-id: 3559efa5f3b92d40d09ad6ac82bc7b59f86c75aa
Summary:
Add per-exe execution capability
Add fix parsing of groups/tests
Add timer test exclusion
Fix unit tests
Ifdef threadpool specific tests that do not pass on Vista threadpool.
Remove spurious outout from prefix_test so test case listing works
properly.
Fix not using standard test directories results in file creation errors
in sst_dump_test.
BlobDb fixes:
In C++ end() iterators can not be dereferenced. They are not valid.
When deleting blob_db_ set it to nullptr before any other code executes.
Not fixed:. On Windows you can not delete a file while it is open.
[ RUN ] BlobDBTest.ReadWhileGC
d:\dev\rocksdb\rocksdb\utilities\blob_db\blob_db_test.cc(75): error: DestroyBlobDB(dbname_, options, bdb_options)
IO error: Failed to delete: d:/mnt/db\testrocksdb-17444/blob_db_test/blob_dir/000001.blob: Permission denied
d:\dev\rocksdb\rocksdb\utilities\blob_db\blob_db_test.cc(75): error: DestroyBlobDB(dbname_, options, bdb_options)
IO error: Failed to delete: d:/mnt/db\testrocksdb-17444/blob_db_test/blob_dir/000001.blob: Permission denied
write_batch
Should not call front() if there is a chance the container is empty
Closes https://github.com/facebook/rocksdb/pull/3152
Differential Revision: D6293274
Pulled By: sagar0
fbshipit-source-id: 318c3717c22087fae13b18715dffb24565dbd956
Summary:
Add a simple policy for NVMe write time life hint
Closes https://github.com/facebook/rocksdb/pull/3095
Differential Revision: D6298030
Pulled By: shligit
fbshipit-source-id: 9a72a42e32e92193af11599eb71f0cf77448e24d
Summary:
There seems to be a typo mistake in env ReuseWritableFile func
where status is being returned twice.
Closes https://github.com/facebook/rocksdb/pull/3099
Differential Revision: D6196204
Pulled By: ajkr
fbshipit-source-id: abb6e3e1c1e772dd485fc39e7f1b9d502fa188fe
Summary:
A few simple changes to allow RocksDB to be built on OpenBSD. Let me know if any further changes are needed.
Closes https://github.com/facebook/rocksdb/pull/3061
Differential Revision: D6138800
Pulled By: ajkr
fbshipit-source-id: a13a17b5dc051e6518bd56a8c5efd1d24dd81b0c
Summary:
MSVC does not support unused attribute at this time. A separate assignment line fixes the issue probably by being counted as usage for MSVC and it no longer complains about unused var.
Closes https://github.com/facebook/rocksdb/pull/3048
Differential Revision: D6126272
Pulled By: maysamyabandeh
fbshipit-source-id: 4907865db45fd75a39a15725c0695aaa17509c1f
Summary:
Problem:
- `DB::SanitizeOptions` strips trailing slash from `wal_dir` but not `dbname`
- We check whether `wal_dir` and `dbname` refer to the same directory using string equality: https://github.com/facebook/rocksdb/blob/master/db/repair.cc#L258
- Providing `dbname` with trailing slash causes default `wal_dir` to be misidentified as a separate directory.
- Then the repair tries to add all SST files to the `VersionEdit` twice (once for `dbname` dir, once for `wal_dir`) and fails with coredump.
Solution:
- Add a new `Env` function, `AreFilesSame`, which uses device and inode number to check whether files are the same. It's currently only implemented in `PosixEnv`.
- Migrate repair to use `AreFilesSame` to check whether `dbname` and `wal_dir` are same. If unsupported, falls back to string comparison.
Closes https://github.com/facebook/rocksdb/pull/2827
Differential Revision: D5761349
Pulled By: ajkr
fbshipit-source-id: c839d548678b742af1166d60b09abd94e5476238
Summary:
When we had a single thread pool for compactions, a thread could be busy for a long time (minutes) executing a compaction involving the bottom level. In multi-instance setups, the entire thread pool could be consumed by such bottom-level compactions. Then, top-level compactions (e.g., a few L0 files) would be blocked for a long time ("head-of-line blocking"). Such top-level compactions are critical to prevent compaction stalls as they can quickly reduce number of L0 files / sorted runs.
This diff introduces a bottom-priority queue for universal compactions including the bottom level. This alleviates the head-of-line blocking situation for fast, top-level compactions.
- Added `Env::Priority::BOTTOM` thread pool. This feature is only enabled if user explicitly configures it to have a positive number of threads.
- Changed `ThreadPoolImpl`'s default thread limit from one to zero. This change is invisible to users as we call `IncBackgroundThreadsIfNeeded` on the low-pri/high-pri pools during `DB::Open` with values of at least one. It is necessary, though, for bottom-pri to start with zero threads so the feature is disabled by default.
- Separated `ManualCompaction` into two parts in `PrepickedCompaction`. `PrepickedCompaction` is used for any compaction that's picked outside of its execution thread, either manual or automatic.
- Forward universal compactions involving last level to the bottom pool (worker thread's entry point is `BGWorkBottomCompaction`).
- Track `bg_bottom_compaction_scheduled_` so we can wait for bottom-level compactions to finish. We don't count them against the background jobs limits. So users of this feature will get an extra compaction for free.
Closes https://github.com/facebook/rocksdb/pull/2580
Differential Revision: D5422916
Pulled By: ajkr
fbshipit-source-id: a74bd11f1ea4933df3739b16808bb21fcd512333
Summary:
Replace dynamic_cast<> so that users can choose to build with RTTI off, so that they can save several bytes per object, and get tiny more memory available.
Some nontrivial changes:
1. Add Comparator::GetRootComparator() to get around the internal comparator hack
2. Add the two experiemental functions to DB
3. Add TableFactory::GetOptionString() to avoid unnecessary casting to get the option string
4. Since 3 is done, move the parsing option functions for table factory to table factory files too, to be symmetric.
Closes https://github.com/facebook/rocksdb/pull/2645
Differential Revision: D5502723
Pulled By: siying
fbshipit-source-id: fd13cec5601cf68a554d87bfcf056f2ffa5fbf7c
Summary:
This reverts the previous commit 1d7048c598, which broke the build.
Did a `git revert 1d7048c`.
Closes https://github.com/facebook/rocksdb/pull/2627
Differential Revision: D5476473
Pulled By: sagar0
fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.
Reviewed By: igorsugak
Differential Revision: D5454343
fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
Summary:
We has to remove this line because previously it is only called when use_os_buffer = false. But now we have direct io to replace it.
Closes https://github.com/facebook/rocksdb/pull/2573
Differential Revision: D5412824
Pulled By: yiwu-arbug
fbshipit-source-id: 81f3f0cdf94566bfc09ef2ff123e40cddbe36b36
Summary:
Valgrind had false positive complaints about the initialization pattern for `GetCurrentTime()`'s argument in #2480. We can instead have the client initialize the time variable before calling `GetCurrentTime()`, and have `GetCurrentTime()` promise to only overwrite it in success case.
Closes https://github.com/facebook/rocksdb/pull/2526
Differential Revision: D5358689
Pulled By: ajkr
fbshipit-source-id: 857b189f24c19196f6bb299216f3e23e7bc4be42
Summary:
This PR adds support for encrypting data stored by RocksDB when written to disk.
It adds an `EncryptedEnv` override of the `Env` class with matching overrides for sequential&random access files.
The encryption itself is done through a configurable `EncryptionProvider`. This class creates is asked to create `BlockAccessCipherStream` for a file. This is where the actual encryption/decryption is being done.
Currently there is a Counter mode implementation of `BlockAccessCipherStream` with a `ROT13` block cipher (NOTE the `ROT13` is for demo purposes only!!).
The Counter operation mode uses an initial counter & random initialization vector (IV).
Both are created randomly for each file and stored in a 4K (default size) block that is prefixed to that file. The `EncryptedEnv` implementation is such that clients of the `Env` class do not see this prefix (nor data, nor in filesize).
The largest part of the prefix block is also encrypted, and there is room left for implementation specific settings/values/keys in there.
To test the encryption, the `DBTestBase` class has been extended to consider a new environment variable called `ENCRYPTED_ENV`. If set, the test will setup a encrypted instance of the `Env` class to use for all tests.
Typically you would run it like this:
```
ENCRYPTED_ENV=1 make check_some
```
There is also an added test that checks that some data inserted into the database is or is not "visible" on disk. With `ENCRYPTED_ENV` active it must not find plain text strings, with `ENCRYPTED_ENV` unset, it must find the plain text strings.
Closes https://github.com/facebook/rocksdb/pull/2424
Differential Revision: D5322178
Pulled By: sdwilsh
fbshipit-source-id: 253b0a9c2c498cc98f580df7f2623cbf7678a27f
Summary:
Force people to write something other than file name while returning status for IOError.
Closes https://github.com/facebook/rocksdb/pull/2493
Differential Revision: D5321309
Pulled By: siying
fbshipit-source-id: 38bcf6c19e80831cd3e300a047e975cbb131d822
Summary:
initialize 2 additional fields tm_gmtoff and tm_zone,
otherwise under strict warnings for initialization, we get errors
in myrocks.
Closes https://github.com/facebook/rocksdb/pull/2439
Differential Revision: D5229013
Pulled By: yiwu-arbug
fbshipit-source-id: 9fc1615a1919656f36064791706ed41e10e9db84
Summary:
We had a crash in this code: `fstat()` failed; `file_stats` contained garbage, in particular `file_stats.st_blksize == 6`; the expression `file_stats.st_blocks / (file_stats.st_blksize / 512)` divided by zero.
Closes https://github.com/facebook/rocksdb/pull/2420
Differential Revision: D5216110
Pulled By: al13n321
fbshipit-source-id: 6d8fc5e7c4f98c1139e68c7829ebdbac68b0fce0
Summary:
/home/travis/build/facebook/rocksdb/env/mock_env.cc: In member function ‘virtual void rocksdb::{anonymous}::TestMemLogger::Logv(const char*, va_list)’:
/home/travis/build/facebook/rocksdb/env/mock_env.cc:391:53: error: ‘t.tm::tm_year’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
static_cast<int>(now_tv.tv_usec));
Closes https://github.com/facebook/rocksdb/pull/2418
Differential Revision: D5193597
Pulled By: maysamyabandeh
fbshipit-source-id: 8801a3ef27f33eb419d534f7de747702cdf504a0
Summary:
USE_CLANG=1 make -j32 analyze
The two errors would disappear after the assertion.
Closes https://github.com/facebook/rocksdb/pull/2416
Differential Revision: D5193526
Pulled By: maysamyabandeh
fbshipit-source-id: 16a21f18f68023f862764dd3ab9e00ca60b0eefa
Summary:
This is a manual commit of this PR:
Retire InMemoryEnv in favor of MockEnv #2082
With MockEnv doing the same yet being more mature, InMemoryEnv is redundant.
Reviewed By: IslamAbdelRahman
Differential Revision: D5162323
fbshipit-source-id: 59fd0082a891dc99cc531e4da9d68bf891eae3f5
Summary:
Previously users could set `max_background_flushes=0` to force rocksdb to use a single thread pool for both background flushes and compactions. That'll no longer be possible since I'm going to deprecate `max_background_flushes` and `max_background_compactions` in favor of a single option. This diff introduces a new way to force a single thread pool: when high-pri pool has zero threads, all background jobs will be submitted to low-pri pool.
Note the majority of the code change is adding `Env::GetBackgroundThreads()`, which is necessary to check whether the user has provided a zero-sized thread pool.
Closes https://github.com/facebook/rocksdb/pull/2204
Differential Revision: D4936256
Pulled By: ajkr
fbshipit-source-id: 929a07a0c0705f7766f5339cd013ff74e90d6e01
Summary:
Disable direct reads for log and manifest. Direct reads should not affect sequential_file
Also add kDirectIO for option_config_ in db_test_util
Closes https://github.com/facebook/rocksdb/pull/2337
Differential Revision: D5100261
Pulled By: lightmark
fbshipit-source-id: 0ebfd13b93fa1b8f9acae514ac44f8125a05868b
Summary:
we align the buffer with logical sector size and should not test it with page size, which is usually 4k.
Closes https://github.com/facebook/rocksdb/pull/2245
Differential Revision: D5001842
Pulled By: lightmark
fbshipit-source-id: a7135fcf6351c6db363e8908956b1e193a4a6291
Summary:
TSAN sometimes complaints data race of PosixLogger::flush_pending_. Make it atomic.
Closes https://github.com/facebook/rocksdb/pull/2231
Differential Revision: D4973397
Pulled By: siying
fbshipit-source-id: 571e886e3eca3231705919d573e250c1c1ec3764
Summary:
Every time after a compaction/flush finish, we issue user reads to put the table into block cache which includes a couple of IO that read footer, index blocks, meta block, etc. So we implement Prefetch here to reduce IO.
Closes https://github.com/facebook/rocksdb/pull/2196
Differential Revision: D4931782
Pulled By: lightmark
fbshipit-source-id: 5a13d58dcab209964352322217193bbf7ff78149
Summary:
Replacement of #2147
The change was squashed due to a lot of conflicts.
Closes https://github.com/facebook/rocksdb/pull/2194
Differential Revision: D4929799
Pulled By: siying
fbshipit-source-id: 5cd49c254737a1d5ac13f3c035f128e86524c581
Summary:
st_blocks shows 16 though the right value is 8. This happens occasionally which seems a bug.
Closes https://github.com/facebook/rocksdb/pull/2160
Differential Revision: D4893542
Pulled By: lightmark
fbshipit-source-id: 68e832586b58bbc6162efbe83ce273f1570d5be3
Summary:
Replace Options::use_direct_writes with Options::use_direct_io_for_flush_and_compaction
Now if Options::use_direct_io_for_flush_and_compaction = true, we will enable direct io for both reads and writes for flush and compaction job. Whereas Options::use_direct_reads controls user reads like iterator and Get().
Closes https://github.com/facebook/rocksdb/pull/2117
Differential Revision: D4860912
Pulled By: lightmark
fbshipit-source-id: d93575a8a5e780cf7e40797287edc425ee648c19
Summary:
filter the warning out and only print it once.
Closes https://github.com/facebook/rocksdb/pull/2137
Differential Revision: D4870925
Pulled By: lightmark
fbshipit-source-id: 91b363ce7f70bce88b0780337f408fc4649139b8
Summary:
OpenBSD doesn't have `O_DIRECT`, so avoid it. (RocksDB compiles successfully on
OpenBSD with this patch.)
Closes https://github.com/facebook/rocksdb/pull/2106
Differential Revision: D4847833
Pulled By: siying
fbshipit-source-id: 214b785
Summary:
In RocksDB, we sometimes preallocate the estimated space for a file to have better perf with fallocate (if supported). Usually it is a little bit bigger than the real resulting file size. At this time, we have to let the Filesystem reclaim the space not used.
Ideally, calling ftruncate to truncate the file to its real size should be enough. HOWEVER, it isn't on tmpfs, which we witness in our case, with some buggy kernel version. ftruncate a file with preallocated space doesn't change number of the blocks used by the file, which means the space not used by the file is not returned to the filesystems. So in this case we need fallocate with FALLOC_FL_PUNCH_HOLE to explicitly reclaim the used blocks. It is a hack to cope with the kernel bug and usually we should not need it.
Closes https://github.com/facebook/rocksdb/pull/2102
Differential Revision: D4848934
Pulled By: lightmark
fbshipit-source-id: f1b40b5
Summary:
Move some files under util/ to new directories env/, monitoring/ options/ and cache/
Closes https://github.com/facebook/rocksdb/pull/2090
Differential Revision: D4833681
Pulled By: siying
fbshipit-source-id: 2fd8bef