Commit Graph

1177 Commits

Author SHA1 Message Date
Levi Tamasi 30fb9dd50f Introduce a helper method UncompressData (#7434)
Summary:
The patch introduces a helper method in `util/compression.h` called `UncompressData`
that dispatches calls to the correct uncompression method based on type, and changes
`UncompressBlockContentsForCompressionType` and `Benchmark::Uncompress` in
`db_bench` so they are implemented in terms of the new method. This eliminates
some code duplication. (`Benchmark::Compress` is also updated to use the previously
introduced `CompressData` helper.)

In addition, the patch brings the implementation of `Snappy_Uncompress` into sync with
the other uncompression methods by making the method compute the buffer size and allocate
the buffer itself. Finally, the patch eliminates some potentially risky back-and-forth conversions
between various unsigned and signed integer types by exposing the size of the allocated buffer
as a `size_t` instead of an `int`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7434

Test Plan:
`make check`
`./db_bench -benchmarks=compress,uncompress --compression_type ...`

Reviewed By: riversand963

Differential Revision: D23900011

Pulled By: ltamasi

fbshipit-source-id: b25df63ceec4639889be94acb22eb53e530c54e0
2020-09-25 09:01:45 -07:00
Andrew Kryczka 0698598f3a Fix scope of `ReadOptions` in `SstFileReader` (#7432)
Summary:
a4a4a2dabd changed the contract of `TableReader::NewIterator()` to require
`ReadOptions` outlive the returned iterator. But I didn't notice that
`SstFileReader` violates the new contract and needs to be adapted. The unit test
provided here exposes the problem when run under ASAN.

```
$ ./sst_file_reader_test --gtest_filter=SstFileReaderTest.ReadOptionsOutOfScope
Note: Google Test filter = SstFileReaderTest.ReadOptionsOutOfScope
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from SstFileReaderTest
[ RUN      ] SstFileReaderTest.ReadOptionsOutOfScope
=================================================================
==3238048==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7ffd6189e158 at pc 0x000001298350 bp 0x7ffd6189c280 sp 0x7ffd6189c278
READ of size 8 at 0x7ffd6189e158 thread T0
    #0 0x129834f in rocksdb::BlockBasedTableIterator::InitDataBlock() table/block_based/block_based_table_iterator.cc:236
    https://github.com/facebook/rocksdb/issues/1 0x12b01f7 in rocksdb::BlockBasedTableIterator::SeekImpl(rocksdb::Slice const*) table/block_based/block_based_table_iterator.cc:77
    https://github.com/facebook/rocksdb/issues/2 0x844d28 in rocksdb::IteratorWrapperBase<rocksdb::Slice>::SeekToFirst() table/iterator_wrapper.h:116
    https://github.com/facebook/rocksdb/issues/3 0x844d28 in rocksdb::DBIter::SeekToFirst() db/db_iter.cc:1352
    https://github.com/facebook/rocksdb/issues/4 0x52482b in rocksdb::SstFileReaderTest_ReadOptionsOutOfScope_Test::TestBody() table/sst_file_reader_test.cc:150
    https://github.com/facebook/rocksdb/issues/5 0x5f433c in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3899
    https://github.com/facebook/rocksdb/issues/6 0x5f433c in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3935
    https://github.com/facebook/rocksdb/issues/7 0x5cc2de in testing::Test::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3973
    https://github.com/facebook/rocksdb/issues/8 0x5cc988 in testing::Test::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3965
    https://github.com/facebook/rocksdb/issues/9 0x5cc988 in testing::TestInfo::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4149
    https://github.com/facebook/rocksdb/issues/10 0x5cce9a in testing::TestInfo::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4124
    https://github.com/facebook/rocksdb/issues/11 0x5cce9a in testing::TestCase::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4267
    https://github.com/facebook/rocksdb/issues/12 0x5ce696 in testing::TestCase::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:4253
    https://github.com/facebook/rocksdb/issues/13 0x5ce696 in testing::internal::UnitTestImpl::RunAllTests() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:6633
    https://github.com/facebook/rocksdb/issues/14 0x5f541c in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3899
    https://github.com/facebook/rocksdb/issues/15 0x5f541c in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:3935
    https://github.com/facebook/rocksdb/issues/16 0x5cee74 in testing::UnitTest::Run() third-party/gtest-1.8.1/fused-src/gtest/gtest-all.cc:6242
    https://github.com/facebook/rocksdb/issues/17 0x4c0332 in RUN_ALL_TESTS() third-party/gtest-1.8.1/fused-src/gtest/gtest.h:22104
    https://github.com/facebook/rocksdb/issues/18 0x4c0332 in main table/sst_file_reader_test.cc:213
    https://github.com/facebook/rocksdb/issues/19 0x7fb0263281a5 in __libc_start_main (/usr/local/fbcode/platform007/lib/libc.so.6+0x211a5)
    https://github.com/facebook/rocksdb/issues/20 0x523e56  (/data/users/andrewkr/rocksdb/sst_file_reader_test+0x523e56)

Address 0x7ffd6189e158 is located in stack of thread T0 at offset 568 in frame
    #0 0x52428f in rocksdb::SstFileReaderTest_ReadOptionsOutOfScope_Test::TestBody() table/sst_file_reader_test.cc:131

  This frame has 9 object(s):
    [32, 40) 'reader'
    [96, 104) '<unknown>'
    [160, 168) '<unknown>'
    [224, 232) 'iter'
    [288, 304) 'gtest_ar'
    [352, 368) '<unknown>'
    [416, 440) 'keys'
    [480, 512) '<unknown>'
    [544, 680) 'ropts' <== Memory access at offset 568 is inside this variable
HINT: this may be a false positive if your program uses some custom stack unwind mechanism or swapcontext
      (longjmp and C++ exceptions *are* supported)
 AddressSanitizer: stack-use-after-scope table/block_based/block_based_table_iterator.cc:236 in rocksdb::BlockBasedTableIterator::InitDataBlock()
...
```

The fix is to use `ArenaWrappedDBIter` which has support for holding a
`ReadOptions` in an `Arena` whose lifetime is tied to the iterator.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7432

Test Plan: verified the provided unit test no longer fails

Reviewed By: pdillinger

Differential Revision: D23880043

Pulled By: ajkr

fbshipit-source-id: 9464c37408f7bd7c9c4a90ceffb04d9f0ca7a494
2020-09-23 15:24:03 -07:00
yaphet 323a834d1d Fix typo (#7353)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7353

Reviewed By: ajkr

Differential Revision: D23585525

Pulled By: zhichao-cao

fbshipit-source-id: 3f72d9663ec207b9dfe6b287d455a13d3c86879f
2020-09-19 18:11:16 -07:00
Peter Dillinger 93719fc953 Restore file size in backup table file names (and other cleanup) (#7400)
Summary:
Prior to 6.12, backup files using share_files_with_checksum had
the file size encoded in the file name, after the last '\_' and before
the last '.'. We considered this an implementation detail subject to
change, and indeed removed this information from the file name (with an
option to use old behavior) because it was considered
ineffective/inefficient for file name uniqueness. However, some
downstream RocksDB users were relying on this information since the file
size is not explicitly in the backup manifest file.

This primary purpose of this change is "retrofitting" the 6.12 release
(not yet a public release) to simultaneously support the benefits of the
new naming scheme (I/O performance and data correctness at scale) and
preserve the file size information, both as default behaviors. With this
change, we are essentially making the file size information encoded in
the file name an official, though obscure, extension of the backup meta
file format.

We preserve an option (kLegacyCrc32cAndFileSize) to use the original
"legacy" naming scheme, with its caveats, and make it easy to omit the
file size information (no kFlagIncludeFileSize), for more compact file
names. But note that changing the naming scheme used on an existing db
and backup directory can lead to transient space amplification, as some
files will be stored under two names in the shared_checksum directory.
Because some backups were saved using the original 6.12 naming scheme,
we offer two ways of dealing with those files: SST files generated by
older 6.12 versions can either use the default naming scheme in effect
when the SST files were generated (kFlagMatchInterimNaming, default, no
transient space amplification) or can use a new naming scheme (no
kFlagMatchInterimNaming, potential space amplification because some
already stored files getting a new name).

We don't have a natural way to detect which files were generated by
previous 6.12 versions, but this change hacks one in by changing DB
session ids to now use a more concise encoding, reducing file name
length, saving ~dozen bytes from SST files, and making them visually
distinct from DB ids so that they are less likely to be mixed up.

Two final auxiliary notes:
Recognizing that the backup file names have become a de facto part of
the backup meta schema, this change makes them easier to parse and
extend by putting a distinct marker, 's', before DB session ids embedded
in the name. When we extend this to allow custom checksums in the name,
they can get their own marker to ensure safe parsing. For backward
compatibility, file size does not get a marker but is assumed for
`_[0-9]+[.]`

Another change from initial 6.12 default behavior is never including
file custom checksum in the file name. Looking ahead to 6.13, we do not
want the default behavior to cause backup space amplification for
someone turning on file custom checksum checking in BackupEngine; we
want that to be an easy decision. When implemented, including file
custom checksums in backup file names will be a non-default option.

Actual file name patterns and priorities, as regexes:

    kLegacyCrc32cAndFileSize OR pre-6.12 SST file ->
      [0-9]+_[0-9]+_[0-9]+[.]sst
    kFlagMatchInterimNaming set (default) AND early 6.12 SST file ->
      [0-9]+_[0-9a-fA-F-]+[.]sst
    kUseDbSessionId AND NOT kFlagIncludeFileSize ->
      [0-9]+_s[0-9A-Z]{20}[.]sst
    kUseDbSessionId AND kFlagIncludeFileSize (default) ->
      [0-9]+_s[0-9A-Z]{20}_[0-9]+[.]sst

We might add opt-in options for more '\_' separated data in the name,
but embedded file size, if present, will always be after last '\_' and
before '.sst'.

This change was originally applied to version 6.12. (See https://github.com/facebook/rocksdb/issues/7390)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7400

Test Plan:
unit tests included. Sync point callbacks are used to mimic
previous version SST files.

Reviewed By: ajkr

Differential Revision: D23759587

Pulled By: pdillinger

fbshipit-source-id: f62d8af4e0978de0a34f26288cfbe66049b70025
2020-09-17 10:24:22 -07:00
mrambacher a08d6f18f0 Add more tests to ASSERT_STATUS_CHECKED (#7367)
Summary:
db_options_test
options_file_test
auto_roll_logger_test
options_util_test
persistent_cache_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7367

Reviewed By: jay-zhuang

Differential Revision: D23712520

Pulled By: zhichao-cao

fbshipit-source-id: 99b331e357f5d6a6aabee89d1bd933002cbb3908
2020-09-16 15:48:07 -07:00
mrambacher 7d472accdc Bring the Configurable options together (#5753)
Summary:
This PR merges the functionality of making the ColumnFamilyOptions, TableFactory, and DBOptions into Configurable into a single PR, resolving any merge conflicts

Pull Request resolved: https://github.com/facebook/rocksdb/pull/5753

Reviewed By: ajkr

Differential Revision: D23385030

Pulled By: zhichao-cao

fbshipit-source-id: 8b977a7731556230b9b8c5a081b98e49ee4f160a
2020-09-14 17:01:01 -07:00
Jay Zhuang 8a8a01c642 Fix compile error for old gcc-4.8 (#7358)
Summary:
gcc-4.8 returns error when using the constructor. Not sure if it's a compiler bug/limitation or code issue:
```
table/block_based/block_based_table_reader.cc:3183:67: error: use of deleted function ‘rocksdb::WritableFileStringStreamAdapter::WritableFileStringStreamAdapter(rocksdb::WritableFileStringStreamAdapter&&)’
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7358

Reviewed By: pdillinger

Differential Revision: D23577651

Pulled By: jay-zhuang

fbshipit-source-id: b0197e3d3538da61a6f3866410d88d2047fb9695
2020-09-08 12:09:34 -07:00
Akanksha Mahajan b175eceb09 Store FSWritableFilePtr object in WritableFileWriter (#7193)
Summary:
Replace FSWritableFile pointer with FSWritableFilePtr
    object in WritableFileWriter.
    This new object wraps FSWritableFile pointer.

    Objective: If tracing is enabled, FSWritableFile Ptr returns
    FSWritableFileTracingWrapper pointer that includes all necessary
    information in IORecord and calls underlying FileSystem and invokes
    IOTracer to dump that record in a binary file. If tracing is disabled
    then, underlying FileSystem pointer is returned directly.
    FSWritableFilePtr wrapper class is added to bypass the
    FSWritableFileWrapper when
    tracing is disabled.

    Test Plan: make check -j64

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7193

Reviewed By: anand1976

Differential Revision: D23355915

Pulled By: akankshamahajan15

fbshipit-source-id: e62a27a13c1fd77e36a6dbafc7006d969bed25cf
2020-09-08 10:56:08 -07:00
Jay Zhuang 27aa443a15 Add sst_file_dumper status check (#7315)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7315

Test Plan:
`ASSERT_STATUS_CHECKED=1 make sst_dump_test && ./sst_dump_test`
And manually run `./sst_dump --file=*.sst` before and after the change.

Reviewed By: pdillinger

Differential Revision: D23361669

Pulled By: jay-zhuang

fbshipit-source-id: 5bf51a2a90ee35c8c679e5f604732ec2aef5949a
2020-09-04 19:26:42 -07:00
Andrew Kryczka af54c4092a fix SstFileWriter with dictionary compression (#7323)
Summary:
In block-based table builder, the cut-over from buffered to unbuffered
mode involves sampling the buffered blocks and generating a dictionary.
There was a bug where `SstFileWriter` passed zero as the `target_file_size`
causing the cutover to happen immediately, so there were no samples
available for generating the dictionary.

This PR changes the meaning of `target_file_size == 0` to mean buffer
the whole file before cutting over. It also adds dictionary compression
support to `sst_dump --command=recompress` for easy evaluation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7323

Reviewed By: cheng-chang

Differential Revision: D23412158

Pulled By: ajkr

fbshipit-source-id: 3b232050e70ef3c2ee85a4b5f6fadb139c569873
2020-09-03 15:49:57 -07:00
Bingyi Sun 61d5a132c9 Fix typo: rename "bounary" to "boundary" in block.cc (#7328)
Summary:
Fix typo in comment for SeekForGetImpl().
Rename "bounary" to "boundary"

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7328

Reviewed By: riversand963

Differential Revision: D23439748

Pulled By: zhichao-cao

fbshipit-source-id: 83a34c417c71a3210ce54a090d76c4d5571313f3
2020-09-02 20:47:18 -07:00
Jay Zhuang c2485f2d81 Add buffer prefetch support for non directIO usecase (#7312)
Summary:
A new file interface `SupportPrefetch()` is added. When the user overrides it to `false`, an internal prefetch buffer will be used for readahead. Useful for non-directIO but FS doesn't have readahead support.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7312

Reviewed By: anand1976

Differential Revision: D23329847

Pulled By: jay-zhuang

fbshipit-source-id: 71cd4ce6f4a820840294e4e6aec111ab76175527
2020-08-27 18:16:53 -07:00
sdong 722814e357 Get() to fail with underlying failures in PartitionIndexReader::CacheDependencies() (#7297)
Summary:
Right now all I/O failures under PartitionIndexReader::CacheDependencies() is swallowed. This doesn't impact correctness but we've made a decision that any I/O error in read path now should be returned to users for awareness. Return errors in those cases instead.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7297

Test Plan: Add a new unit test that ingest errors in this code path and see Get() fails. Only one I/O path is hit in PartitionIndexReader::CacheDependencies(). Several option changes are attempt but not able to got other pread paths triggered. Not sure whether other failure cases would be even possible. Would rely on continuous stress test to validate it.

Reviewed By: anand1976

Differential Revision: D23257950

fbshipit-source-id: 859dbc92fa239996e1bb378329344d3d54168c03
2020-08-25 19:01:05 -07:00
mrambacher b7e1c5213f Add some simulator cache and block tracer tests to ASSERT_STATUS_CHECKED (#7305)
Summary:
More tests now pass.  When in doubt, I added a TODO comment to check what should happen with an ignored error.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7305

Reviewed By: akankshamahajan15

Differential Revision: D23301262

Pulled By: ajkr

fbshipit-source-id: 5f120edc7393560aefc0633250277bbc7e8de9e6
2020-08-24 16:43:31 -07:00
mrambacher e9befdebbf Add EnvTestWithParam::OptionsTest to the ASSERT_STATUS_CHECKED passes (#7283)
Summary:
This test uses database functionality and required more extensive work to get it to pass than the other tests.  The DB functionality required for this test now passes the check.

When it was unclear what the proper behavior was for unchecked status codes, a TODO was added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7283

Reviewed By: akankshamahajan15

Differential Revision: D23251497

Pulled By: ajkr

fbshipit-source-id: 52b79629bdafa0a58de8ead1d1d66f141b331523
2020-08-20 19:18:35 -07:00
Levi Tamasi 9d6f48ec1d Clean up CompressBlock/CompressBlockInternal a bit (#7249)
Summary:
The patch cleans up and refactors `CompressBlock` and `CompressBlockInternal` a bit.
In particular, it does the following:
* It renames `CompressBlockInternal` to `CompressData` and moves it to `util/compression.h`,
where other general compression-related utilities are located. This will facilitate reuse in the
BlobDB write path.
* The signature of the method is changed so it now takes `compression_format_version`
(similarly to the compression library specific methods) instead of `format_version` (which is
specific to the block based table).
* `GetCompressionFormatForVersion` no longer takes `compression_type` as a parameter.
This parameter was only used in a (not entirely up-to-date) assertion; also, removing it
eliminates the need to ensure this precondition holds at all call sites.
* Does some minor cleanup in `CompressBlock`, for instance, it is now possible to pass
only one of `sampled_output_fast` and `sampled_output_slow`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7249

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D23087278

Pulled By: ltamasi

fbshipit-source-id: e6316e45baed8b4e7de7c1780c90501c2a3439b3
2020-08-12 18:25:48 -07:00
Yuhong Guo 5444942f15 Fix cmake build on MacOS (#7205)
Summary:
1. `std::random_shuffle` is deprecated and now we can use `std::shuffle`
```
/rocksdb/db/prefix_test.cc:590:12: error: 'random_shuffle<std::__1::__wrap_iter<unsigned long long *> >'
      is deprecated [-Werror,-Wdeprecated-declarations]
      std::random_shuffle(prefixes.begin(), prefixes.end());
           ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/algorithm:2982:1: note:
      'random_shuffle<std::__1::__wrap_iter<unsigned long long *> >' has been explicitly marked deprecated here
_LIBCPP_DEPRECATED_IN_CXX14 void
^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/__config:1107:39: note: expanded from macro
      '_LIBCPP_DEPRECATED_IN_CXX14'
#  define _LIBCPP_DEPRECATED_IN_CXX14 _LIBCPP_DEPRECATED
                                      ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/__config:1090:48: note: expanded from macro
      '_LIBCPP_DEPRECATED'
#    define _LIBCPP_DEPRECATED __attribute__ ((deprecated))
```
2. `c_test` link error with `-DROCKSDB_BUILD_SHARED=OFF`:
```
[  7%] Linking CXX executable c_test
ld: library not found for -lrocksdb-shared
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[5]: *** [c_test] Error 1
make[4]: *** [CMakeFiles/c_test.dir/all] Error 2
make[4]: *** Waiting for unfinished jobs....
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7205

Reviewed By: ajkr

Differential Revision: D23030641

Pulled By: pdillinger

fbshipit-source-id: f270e50fc0b824ca1a0876ec5c65d33f55a72dd0
2020-08-10 10:48:05 -07:00
anand76 832b056a30 Enable IO timeouts for iterators (#7161)
Summary:
Introduce io_timeout in ReadOptions and enabled deadline/io_timeout for
Iterators.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7161

Test Plan: New unit tests in db_basic_test

Reviewed By: riversand963

Differential Revision: D22687352

Pulled By: anand1976

fbshipit-source-id: 67bbb0e6d7ae80b256589244468494292538c6ec
2020-08-07 12:01:08 -07:00
sdong 5c1a544122 Clean up InternalIterator upper bound logic a little bit (#7200)
Summary:
IteratorIterator::IsOutOfBound() and IteratorIterator::MayBeOutOfUpperBound() are two functions that related to upper bound check. It is hard for users to reason about this complexity. Consolidate the two functions into one and assign an enum as results to improve readability.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7200

Test Plan: Run all existing test. Would run crash test with atomic for a while.

Reviewed By: anand1976

Differential Revision: D22833181

fbshipit-source-id: a0c724267056adbd0476bde74650e6c7226077e6
2020-08-05 10:44:57 -07:00
sdong 41c328fe57 Fix a perf regression that caused every key to go through upper bound check (#7209)
Summary:
https://github.com/facebook/rocksdb/pull/5289 introduces a performance regression that caused an upper bound check within every BlockBasedTableIterator::Next(). This is unnecessary if we've checked the boundary key for current block and it is within upper bound.

Fix the bug. Also rename the boolean to a enum so that the code is slightly better readable. The original regression was probably to fix a bug that the block upper bound check status is not reset after a new block is created. Fix it bug so that the regression can be avoided without hitting the bug.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7209

Test Plan: Run all existing tests. Will run atomic black box crash test for a while.

Reviewed By: anand1976

Differential Revision: D22859246

fbshipit-source-id: cbdad1f5e656c55fd8b71726d5a4f6cb53ff9140
2020-08-04 11:30:09 -07:00
Andrew Kryczka a4a4a2dabd dedup ReadOptions in iterator hierarchy (#7210)
Summary:
Previously, a `ReadOptions` object was stored in every `BlockBasedTableIterator`
and every `LevelIterator`. This redundancy consumes extra memory,
resulting in the `Arena` making more allocations, and iteration
observing worse cache performance.

This PR migrates callers of `NewInternalIterator()` and
`MakeInputIterator()` to provide a `ReadOptions` object guaranteed to
outlive the returned iterator. When the iterator's lifetime will be managed by the
user, this lifetime guarantee is achieved by storing the `ReadOptions`
value in `ArenaWrappedDBIter`. Then, sub-iterators of `NewInternalIterator()` and
`MakeInputIterator()` can hold a reference-to-const `ReadOptions`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7210

Test Plan:
- `make check` under ASAN and valgrind
- benchmark: on a DB with 2 L0 files and 3 L1+ levels, this PR reduced `Arena` allocation 4792 -> 4160 bytes.

Reviewed By: anand1976

Differential Revision: D22861323

Pulled By: ajkr

fbshipit-source-id: 54aebb3e89c872eeab0f5793b4b6e42878d093ce
2020-08-03 15:23:04 -07:00
sdong 692f6a3138 Implement NextAndGetResult() in memtable and level iterator (#7179)
Summary:
NextAndGetResult() is not implemented in memtable and is very simply implemented in level iterator. The result is that for a normal leveled iterator, performance regression will be observed for calling PrepareValue() for most iterator Next(). Mitigate the problem by implementing the function for both iterators. In level iterator, the implementation cannot be perfect as when calling file iterator's SeekToFirst() we don't have information about whether the value is prepared. Fortunately, the first key should not cause a big portion of the CPu.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7179

Test Plan: Run normal crash test for a while.

Reviewed By: anand1976

Differential Revision: D22783840

fbshipit-source-id: c19f45cdf21b756190adef97a3b66ccde3936e05
2020-07-29 09:45:21 -07:00
mrambacher d44cbc5314 Add hash of key/value checks when paranoid_file_checks=true (#7134)
Summary:
When paraoid_files_checks=true, a rolling key-value hash is generated and compared to what is written to the file.  If the values do not match, the SST file is rejected.

Code put in place for the check for both flush and compaction jobs.  Corresponding test added to corruption_test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7134

Reviewed By: cheng-chang

Differential Revision: D22646149

fbshipit-source-id: 8fde1984a1a11edd3bd82a413acffc5ea7aa683f
2020-07-22 11:04:40 -07:00
Andrew Kryczka 643c863b72 minimize BlockIter comparator scope (#7149)
Summary:
PR https://github.com/facebook/rocksdb/issues/6944 transitioned `BlockIter` from using `Comparator*` to using
concrete `UserComparatorWrapper` and `InternalKeyComparator`. However,
adding them as instance variables to `BlockIter` was not optimal.
Bloating `BlockIter` caused the `ArenaWrappedDBIter`'s arena allocator to do more heap
allocations (in certain cases) which harmed performance of `DB::NewIterator()`. This PR
pushes down the concrete comparator objects to the point of usage, which
forces them to be on the stack. As a result, the `BlockIter` is back to
its original size prior to https://github.com/facebook/rocksdb/issues/6944 (actually a bit smaller since there
were two `Comparator*` before).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7149

Test Plan:
verified our internal `DB::NewIterator()`-heavy regression
test no longer reports regression.

Reviewed By: riversand963

Differential Revision: D22623189

Pulled By: ajkr

fbshipit-source-id: f6d69accfe5de51e0bd9874a480b32b29909bab6
2020-07-20 14:07:04 -07:00
mrambacher c7c7b07f06 More Makefile Cleanup (#7097)
Summary:
Cleans up some of the dependencies on test code in the Makefile while building tools:
- Moves the test::RandomString, DBBaseTest::RandomString into Random
- Moves the test::RandomHumanReadableString into Random
- Moves the DestroyDir method into file_utils
- Moves the SetupSyncPointsToMockDirectIO into sync_point.
- Moves the FaultInjection Env and FS classes under env

These changes allow all of the tools to build without dependencies on test_util, thereby simplifying the build dependencies.  By moving the FaultInjection code, the dependency in db_stress on different libraries for debug vs release was eliminated.

Tested both release and debug builds via Make and CMake for both static and shared libraries.

More work remains to clean up how the tools are built and remove some unnecessary dependencies.  There is also more work that should be done to get the Makefile and CMake to align in their builds -- what is in the libraries and the sizes of the executables are different.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7097

Reviewed By: riversand963

Differential Revision: D22463160

Pulled By: pdillinger

fbshipit-source-id: e19462b53324ab3f0b7c72459dbc73165cc382b2
2020-07-09 14:35:17 -07:00
Andrew Kryczka 82611ee25a save key comparisons in BlockIter::BinarySeek (#7068)
Summary:
This is a followup to https://github.com/facebook/rocksdb/issues/6646. In that PR, for simplicity I just appended a comparison against the 0th restart key in case `BinarySeek()`'s binary search landed at index 0. As a result there were `2/(N+1) + log_2(N)` key comparisons. This PR does it differently. Now we expand the binary search range by one so it also covers the case where target is at or before the restart key at index 0. As a result, it involves `log_2(N+1)` key comparisons.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7068

Test Plan:
ran readrandom with mostly default settings and counted key comparisons
using `PerfContext`.

before: `user_key_comparison_count = 28881965`
after: `user_key_comparison_count = 27823245`

setup command:

```
$ TEST_TMPDIR=/dev/shm/dbbench ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -level_compaction_dynamic_level_bytes=true -num=10000000
```

benchmark command:

```
$ TEST_TMPDIR=/dev/shm/dbbench/ ./db_bench -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=10000000 -compression_type=none -reads=1000000 -perf_level=3
```

Reviewed By: anand1976

Differential Revision: D22357032

Pulled By: ajkr

fbshipit-source-id: 8b01e9c1c2a4e9d02fc9dfe16c1cc0327f8bdf24
2020-07-09 12:27:20 -07:00
Zitan Chen b35a2f9146 Fix GetFileDbIdentities (#7104)
Summary:
Although PR https://github.com/facebook/rocksdb/issues/7032 fixes the construction of the `SstFileDumper` in `GetFileDbIdentities` by setting a proper `Env` of the `Options` passed in the constructor, the file path was not corrected accordingly. This actually disables backup engine to use db session ids in the file names since the `db_session_id` is always empty.

Now it is fixed by setting the correct path in the construction of `SstFileDumper`. Furthermore, to preserve the Direct IO property that backup engine already has, parameter `EnvOptions` is added to `GetFileDbIdentities` and `SstFileDumper`.

The `BackupUsingDirectIO` test is updated accordingly.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7104

Test Plan: backupable_db_test and some manual tests.

Reviewed By: ajkr

Differential Revision: D22443245

Pulled By: gg814

fbshipit-source-id: 056a9bb8b82947c5e73d7c3fbb62bfe23af5e562
2020-07-09 08:37:59 -07:00
Akanksha Mahajan 54f171fe90 Update Flush policy in PartitionedIndexBuilder on switching from user-key to internal-key mode (#7096)
Summary:
When format_version is high enough to support user-key and
there are index entries for same user key that spans multiple data
blocks then it changes from user-key mode to internal-key mode. But the
flush policy is not reset to point to Block Builder of internal-keys.
After this switch, no entries are added to user key index partition
result, thus it never triggers flushing the block.

Fix: 1. After adding the entry in sub_builder_index_, if there is a switch
from user-key to internal-key, then flush policy is updated to point to
Block Builder of internal-keys index partition.
2. Set sub_builder_index_->seperator_is_key_plus_seq_ = true if
seperator_is_key_plus_seq_  is set to true so that subsequent partitions
can also use internal key mode.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7096

Test Plan: make check -j64

Reviewed By: ajkr

Differential Revision: D22416598

Pulled By: akankshamahajan15

fbshipit-source-id: 01fc2dc07ea1b32f8fb803995ebe6e9a3fbe67ac
2020-07-08 21:03:04 -07:00
rockeet b649d8cb97 Fixed Factory construct just for calling .Name() (#7080)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7080

Reviewed By: riversand963

Differential Revision: D22412352

Pulled By: ajkr

fbshipit-source-id: 1d7f4c1621040a0130245139b52c3f4d3deac865
2020-07-08 11:54:00 -07:00
Andrew Kryczka dd29ad4223 Separate internal and user key comparators in `BlockIter` (#6944)
Summary:
Replace `BlockIter::comparator_` and `IndexBlockIter::user_comparator_wrapper_` with a concrete `UserComparatorWrapper` and `InternalKeyComparator`. The motivation for this change was the inconvenience of not knowing the concrete type of `BlockIter::comparator_`, which prevented calling specialized internal key comparison functions to optimize comparison of keys with global seqno applied.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6944

Test Plan:
benchmark setup -- single file DBs, in-memory, no compression. "normal_db"
created by regular flush; "ingestion_db" created by ingesting a file. Both
DBs have same contents.

```
$ TEST_TMPDIR=/dev/shm/normal_db/ ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=10485760000 -disable_auto_compactions=true -compression_type=none -num=1000000
$ ./ldb write_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/ --compression_type=no --hex --create_if_missing < <(./sst_dump --command=scan --output_hex --file=/dev/shm/normal_db/dbbench/000007.sst | awk 'began {print "0x" substr($1, 2, length($1) - 2), "==>", "0x" $5} ; /^Sst file format: block-based/ {began=1}')
$ ./ldb ingest_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/
```

benchmark run command:

```
$ TEST_TMPDIR=/dev/shm/$DB/ ./db_bench -benchmarks=seekrandom -seek_nexts=$SEEK_NEXT -use_existing_db=true -cache_index_and_filter_blocks=false -num=1000000 -cache_size=0 -threads=1 -reads=200000000 -mmap_read=1 -verify_checksum=false
```

results: perf improved marginally for ingestion_db and did not change significantly for normal_db:

SEEK_NEXT | DB | code | ops/sec | % change
-- | -- | -- | -- | --
0 | normal_db | master | 350880 |  
0 | normal_db | PR6944 | 351040 | 0.0
0 | ingestion_db | master | 343255 |  
0 | ingestion_db | PR6944 | 349424 | 1.8
10 | normal_db | master | 218711 |  
10 | normal_db | PR6944 | 217892 | -0.4
10 | ingestion_db | master | 220334 |  
10 | ingestion_db | PR6944 | 226437 | 2.8

Reviewed By: pdillinger

Differential Revision: D21924676

Pulled By: ajkr

fbshipit-source-id: ea4288a2eefa8112eb6c651a671c1de18c12e538
2020-07-07 17:26:16 -07:00
Peter Dillinger a680a7ea37 Un-revert #7049, revert #7022 (#7071)
Summary:
Even though local bisection gave me a clear signal (and still does) that reverting https://github.com/facebook/rocksdb/issues/7049 would fix the failures in MultiThreadedDBTest, https://github.com/facebook/rocksdb/issues/7022 seems to be the root cause. Reverting https://github.com/facebook/rocksdb/issues/7022 and keeping https://github.com/facebook/rocksdb/issues/7049 seems to fix the issue in local reproducer also. (Had these landed in opposite order, bisection would have found the root cause.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7071

Reviewed By: akankshamahajan15

Differential Revision: D22362857

Pulled By: pdillinger

fbshipit-source-id: ed63df3d74e9d4ce1604de8fe43b216166c7a3f0
2020-07-02 13:30:41 -07:00
Akanksha Mahajan 5edfe3a3d8 Update Flush policy in PartitionedIndexBuilder on switching from user-key to internal-key mode (#7022)
Summary:
When format_version is high enough to support user-key and there are index entries for same user key that spans multiple data blocks then it changes from user-key mode to internal-key mode. But the flush policy is not reset to point to Block Builder of internal-keys. After this switch, no entries are added to user key index partition result, thus it never triggers flushing the block.

Fix: After adding the entry in sub_builder_index_, if there is a switch from user-key to internal-key, then flush policy is updated to point to Block Builder of internal-keys index partition.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7022

Test Plan:
1. make check -j64
           2. Added one unit test case

Reviewed By: ajkr

Differential Revision: D22197734

Pulled By: akankshamahajan15

fbshipit-source-id: d87e9e46bccab8e896ee6979d6b79c51f73d479e
2020-07-01 14:58:08 -07:00
Andrew Kryczka 8458532d58 Skip unnecessary allocation for mmap reads under 5000 bytes (#7043)
Summary:
With mmap enabled on an uncompressed file, we were previously always doing a heap allocation to obtain the scratch buffer for `RandomAccessFileReader::Read()`. However, that allocation was unnecessary as the underlying file reader returned a pointer into its mapped memory, not the provided scratch buffer. This PR makes passes the `BlockFetcher`'s inline buffer as the scratch buffer if the data block is small enough (less than `kDefaultStackBufferSize` bytes, currently 5000). Ideally we would not pass a scratch buffer at all for an mmap read; however, the `RandomAccessFile::Read()` API guarantees such a buffer is provided, and non-standard implementations may be relying on it even when `Options::allow_mmap_reads == true`. In that case, this PR still works but introduces an extra copy from the inline buffer to a heap buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7043

Reviewed By: cheng-chang

Differential Revision: D22320606

Pulled By: ajkr

fbshipit-source-id: ad964dd23df34e07d979c6032c2dfe5454c98b52
2020-06-30 15:40:40 -07:00
Anand Ananthabhotla 9a5886bd8c Extend Get/MultiGet deadline support to table open (#6982)
Summary:
Current implementation of the ```read_options.deadline``` option only checks the deadline for random file reads during point lookups. This PR extends the checks to file opens, prefetches and preloads as part of table open.

The main changes are in the ```BlockBasedTable```, partitioned index and filter readers, and ```TableCache``` to take ReadOptions as an additional parameter. In ```BlockBasedTable::Open```, in order to retain existing behavior w.r.t checksum verification and block cache usage, we filter out most of the options in ```ReadOptions``` except ```deadline```. However, having the ```ReadOptions``` gives us more flexibility to honor other options like verify_checksums, fill_cache etc. in the future.

Additional changes in callsites due to function signature changes in ```NewTableReader()``` and ```FilePrefetchBuffer```.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6982

Test Plan: Add new unit tests in db_basic_test

Reviewed By: riversand963

Differential Revision: D22219515

Pulled By: anand1976

fbshipit-source-id: 8a3b92f4a889808013838603aa3ca35229cd501b
2020-06-29 14:53:17 -07:00
sdong f9817201af Add unity build to CircleCI (#7026)
Summary:
We are still keeping unity build working. So it's a good idea to add to a pre-commit CI.
A latest GCC docker image just to get a little bit more coverage. Fix three small issues to make it pass.
Also make unity_test to run db_basic_test rather than db_test to cut the test time. There is no point to run expensive tests here. It was set to run db_test before db_basic_test was separated out.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7026

Test Plan: watch tests to pass.

Reviewed By: zhichao-cao

Differential Revision: D22223197

fbshipit-source-id: baa3b6cbb623bf359829b63ce35715c75bcb0ed4
2020-06-26 11:14:08 -07:00
Zitan Chen be41c61f22 Add a new option for BackupEngine to store table files under shared_checksum using DB session id in the backup filenames (#6997)
Summary:
`BackupableDBOptions::new_naming_for_backup_files` is added. This option is false by default. When it is true, backup table filenames under directory shared_checksum are of the form `<file_number>_<crc32c>_<db_session_id>.sst`.

Note that when this option is true, it comes into effect only when both `share_files_with_checksum` and `share_table_files` are true.

Three new test cases are added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6997

Test Plan: Passed make check.

Reviewed By: ajkr

Differential Revision: D22098895

Pulled By: gg814

fbshipit-source-id: a1d9145e7fe562d71cde7ac995e17cb24fd42e76
2020-06-24 19:31:25 -07:00
sdong 9cc25190e1 Test CircleCI with CLANG-10 (#7025)
Summary:
It's useful to build RocksDB using a more recent clang version in CI. Add a CircleCI build and fix some issues with it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7025

Test Plan: See all tests pass.

Reviewed By: pdillinger

Differential Revision: D22215700

fbshipit-source-id: 914a729c2cd3f3ac4a627cc0ac58d4691dca2168
2020-06-24 16:22:49 -07:00
Peter Dillinger 5b2bbacb6f Minimize memory internal fragmentation for Bloom filters (#6427)
Summary:
New experimental option BBTO::optimize_filters_for_memory builds
filters that maximize their use of "usable size" from malloc_usable_size,
which is also used to compute block cache charges.

Rather than always "rounding up," we track state in the
BloomFilterPolicy object to mix essentially "rounding down" and
"rounding up" so that the average FP rate of all generated filters is
the same as without the option. (YMMV as heavily accessed filters might
be unluckily lower accuracy.)

Thus, the option near-minimizes what the block cache considers as
"memory used" for a given target Bloom filter false positive rate and
Bloom filter implementation. There are no forward or backward
compatibility issues with this change, though it only works on the
format_version=5 Bloom filter.

With Jemalloc, we see about 10% reduction in memory footprint (and block
cache charge) for Bloom filters, but 1-2% increase in storage footprint,
due to encoding efficiency losses (FP rate is non-linear with bits/key).

Why not weighted random round up/down rather than state tracking? By
only requiring malloc_usable_size, we don't actually know what the next
larger and next smaller usable sizes for the allocator are. We pick a
requested size, accept and use whatever usable size it has, and use the
difference to inform our next choice. This allows us to narrow in on the
right balance without tracking/predicting usable sizes.

Why not weight history of generated filter false positive rates by
number of keys? This could lead to excess skew in small filters after
generating a large filter.

Results from filter_bench with jemalloc (irrelevant details omitted):

    (normal keys/filter, but high variance)
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9
    Build avg ns/key: 29.6278
    Number of filters: 5516
    Total size (MB): 200.046
    Reported total allocated memory (MB): 220.597
    Reported internal fragmentation: 10.2732%
    Bits/key stored: 10.0097
    Average FP rate %: 0.965228
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
    Build avg ns/key: 30.5104
    Number of filters: 5464
    Total size (MB): 200.015
    Reported total allocated memory (MB): 200.322
    Reported internal fragmentation: 0.153709%
    Bits/key stored: 10.1011
    Average FP rate %: 0.966313

    (very few keys / filter, optimization not as effective due to ~59 byte
     internal fragmentation in blocked Bloom filter representation)
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9
    Build avg ns/key: 29.5649
    Number of filters: 162950
    Total size (MB): 200.001
    Reported total allocated memory (MB): 224.624
    Reported internal fragmentation: 12.3117%
    Bits/key stored: 10.2951
    Average FP rate %: 0.821534
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
    Build avg ns/key: 31.8057
    Number of filters: 159849
    Total size (MB): 200
    Reported total allocated memory (MB): 208.846
    Reported internal fragmentation: 4.42297%
    Bits/key stored: 10.4948
    Average FP rate %: 0.811006

    (high keys/filter)
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9
    Build avg ns/key: 29.7017
    Number of filters: 164
    Total size (MB): 200.352
    Reported total allocated memory (MB): 221.5
    Reported internal fragmentation: 10.5552%
    Bits/key stored: 10.0003
    Average FP rate %: 0.969358
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
    Build avg ns/key: 30.7131
    Number of filters: 160
    Total size (MB): 200.928
    Reported total allocated memory (MB): 200.938
    Reported internal fragmentation: 0.00448054%
    Bits/key stored: 10.1852
    Average FP rate %: 0.963387

And from db_bench (block cache) with jemalloc:

    $ ./db_bench -db=/dev/shm/dbbench.no_optimize -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false
    $ ./db_bench -db=/dev/shm/dbbench -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -optimize_filters_for_memory -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false
    $ (for FILE in /dev/shm/dbbench.no_optimize/*.sst; do ./sst_dump --file=$FILE --show_properties | grep 'filter block' ; done) | awk '{ t += $4; } END { print t; }'
    17063835
    $ (for FILE in /dev/shm/dbbench/*.sst; do ./sst_dump --file=$FILE --show_properties | grep 'filter block' ; done) | awk '{ t += $4; } END { print t; }'
    17430747
    $ #^ 2.1% additional filter storage
    $ ./db_bench -db=/dev/shm/dbbench.no_optimize -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000
    rocksdb.block.cache.index.add COUNT : 33
    rocksdb.block.cache.index.bytes.insert COUNT : 8440400
    rocksdb.block.cache.filter.add COUNT : 33
    rocksdb.block.cache.filter.bytes.insert COUNT : 21087528
    rocksdb.bloom.filter.useful COUNT : 4963889
    rocksdb.bloom.filter.full.positive COUNT : 1214081
    rocksdb.bloom.filter.full.true.positive COUNT : 1161999
    $ #^ 1.04 % observed FP rate
    $ ./db_bench -db=/dev/shm/dbbench -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -optimize_filters_for_memory -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000
    rocksdb.block.cache.index.add COUNT : 33
    rocksdb.block.cache.index.bytes.insert COUNT : 8448592
    rocksdb.block.cache.filter.add COUNT : 33
    rocksdb.block.cache.filter.bytes.insert COUNT : 18220328
    rocksdb.bloom.filter.useful COUNT : 5360933
    rocksdb.bloom.filter.full.positive COUNT : 1321315
    rocksdb.bloom.filter.full.true.positive COUNT : 1262999
    $ #^ 1.08 % observed FP rate, 13.6% less memory usage for filters

(Due to specific key density, this example tends to generate filters that are "worse than average" for internal fragmentation. "Better than average" cases can show little or no improvement.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6427

Test Plan: unit test added, 'make check' with gcc, clang and valgrind

Reviewed By: siying

Differential Revision: D22124374

Pulled By: pdillinger

fbshipit-source-id: f3e3aa152f9043ddf4fae25799e76341d0d8714e
2020-06-22 13:32:07 -07:00
Peter Dillinger 25a0d0ca30 Fix block checksum for >=4GB, refactor (#6978)
Summary:
Although RocksDB falls over in various other ways with KVs
around 4GB or more, this change fixes how XXH32 and XXH64 were being
called by the block checksum code to support >= 4GB in case that should
ever happen, or the code copied for other uses.

This change is not a schema compatibility issue because the checksum
verification code would checksum the first (block_size + 1) mod 2^32
bytes while the checksum construction code would checksum the first
block_size mod 2^32 plus the compression type byte, meaning the
XXH32/64 checksums for >=4GB block would not match about 255/256 times.

While touching this code, I refactored to consolidate redundant
implementations, improving diagnostics and performance tracking in some
cases. Also used less confusing language in those diagnostics.

Makes https://github.com/facebook/rocksdb/issues/6875 obsolete.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6978

Test Plan:
I was able to write a test for this using an SST file writer
and VerifyChecksum in a reader. The test fails before the fix, though
I'm leaving the test disabled because I don't think it's worth the
expense of running regularly.

Reviewed By: gg814

Differential Revision: D22143260

Pulled By: pdillinger

fbshipit-source-id: 982993d16134e8c50bea2269047f901c1783726e
2020-06-19 16:18:24 -07:00
sdong 223b57eeb8 Fix the bug that compressed cache is disabled in read-only DBs (#6990)
Summary:
Compressed block cache is disabled in https://github.com/facebook/rocksdb/pull/4650 for no good reason. Re-enable it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6990

Test Plan: Add a unit test to make sure a general function works with read-only DB + compressed block cache.

Reviewed By: ltamasi

Differential Revision: D22072755

fbshipit-source-id: 2a55df6363de23a78979cf6c747526359e5dc7a1
2020-06-17 14:30:54 -07:00
Zitan Chen 94d04529de Store DB identity and DB session ID in SST files (#6983)
Summary:
`db_id` and `db_session_id` are now part of the table properties for all formats and stored in SST files. This adds about 99 bytes to each new SST file.

The `TablePropertiesNames` for these two identifiers are `rocksdb.creating.db.identity` and `rocksdb.creating.session.identity`.

In addition, SST files generated from SstFileWriter and Repairer have DB identity “SST Writer” and “DB Repairer”, respectively. Their DB session IDs are generated in the same way as `DB::GetDbSessionId`.

A table property test is added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6983

Test Plan: make check and some manual tests.

Reviewed By: zhichao-cao

Differential Revision: D22048826

Pulled By: gg814

fbshipit-source-id: afdf8c11424a6f509b5c0b06dafad584a80103c9
2020-06-17 10:57:40 -07:00
Levi Tamasi aa8f1331af Fix uninitialized memory read in table_test (#6980)
Summary:
When using parameterized tests, `gtest` sometimes prints the test
parameters. If no other printing method is available, it essentially
produces a hex dump of the object. This can cause issues with valgrind
with types like `TestArgs` in `table_test`, where the object layout has
gaps (with uninitialized contents) due to the members' alignment
requirements. The patch fixes the uninitialized reads by providing an
`operator<<` for `TestArgs` and also makes sure all members are
initialized (in a consistent order) on all code paths.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6980

Test Plan: `valgrind --leak-check=full ./table_test`

Reviewed By: siying

Differential Revision: D22045536

Pulled By: ltamasi

fbshipit-source-id: 6f5920ac28c712d0aa88162fffb80172ed769c32
2020-06-15 14:08:12 -07:00
Zhen Li 9c24a5cb4d Fix persistent cache on windows (#6932)
Summary:
Persistent cache feature caused rocks db crash on windows. I posted a issue for it, https://github.com/facebook/rocksdb/issues/6919. I found this is because no "persistent_cache_key_prefix" is generated for persistent cache. Looking repo history, "GetUniqueIdFromFile" is not implemented on Windows. So my fix is adding "NewId()" function in "persistent_cache" and using it to generate prefix for persistent cache. In this PR, i also re-enable related test cases defined in "db_test2" and "persistent_cache_test" for windows.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6932

Test Plan:
1. run related test cases in "db_test2" and "persistent_cache_test" on windows and see it passed.
2. manually run db_bench.exe with "read_cache_path" and verified.

Reviewed By: riversand963

Differential Revision: D21911608

Pulled By: cheng-chang

fbshipit-source-id: cdfd938d54a385edbb2836b13aaa1d39b0a6f1c2
2020-06-13 13:28:31 -07:00
Levi Tamasi bacd6edcbe Turn HarnessTest in table_test into a parameterized test (#6974)
Summary:
`HarnessTest` in `table_test.cc` currently tests many parameter
combinations sequentially in a loop. This is problematic from
a testing perspective, since if the test fails, we have no way of
knowing how many/which combinations have failed. It can also cause timeouts on
our test system due to the sheer number of combinations tested.
(Specifically, the parallel compression threads parameter added by
https://github.com/facebook/rocksdb/pull/6262 seems to have been the last straw.)
There is some DIY code there that splits the load among eight test cases
but that does not appear to be sufficient anymore.

Instead, the patch turns `HarnessTest` into a parameterized test, so all the
parameter combinations can be tested separately and potentially
concurrently. It also cleans up the tests a little, fixes
`RandomizedLongDB`, which did not get updated when the parallel
compression threads parameter was added, and turns `FooterTests` into a
standalone test case (since it does not actually need a fixture class).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6974

Test Plan: `make check`

Reviewed By: siying

Differential Revision: D22029572

Pulled By: ltamasi

fbshipit-source-id: 51baea670771c33928f2eb3902bd69dcf540aa41
2020-06-12 17:31:06 -07:00
Andrew Kryczka e6be168aa5 save a key comparison in block seeks (#6646)
Summary:
This saves up to two key comparisons in block seeks. The first key
comparison saved is a redundant key comparison against the restart key
where the linear scan starts. This comparison is saved in all cases
except when the found key is in the first restart interval. The
second key comparison saved is a redundant key comparison against the
restart key where the linear scan ends. This is only saved in cases
where all keys in the restart interval are less than the target
(probability roughly `1/restart_interval`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6646

Test Plan:
ran a benchmark with mostly default settings and counted key comparisons

before: `user_key_comparison_count = 19399529`
after: `user_key_comparison_count = 18431498`

setup command:

```
$ TEST_TMPDIR=/dev/shm/dbbench ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -level_compaction_dynamic_level_bytes=true -num=10000000
```

benchmark command:

```
$ TEST_TMPDIR=/dev/shm/dbbench/ ./db_bench -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=10000000 -compression_type=none -reads=1000000 -perf_level=3
```

Reviewed By: pdillinger

Differential Revision: D20849707

Pulled By: ajkr

fbshipit-source-id: 1f01c5cd99ea771fd27974046e37b194f1cdcfac
2020-06-10 13:58:39 -07:00
Andrew Kryczka 02db03af8d make L0 index/filter pinned memory usage predictable (#6911)
Summary:
Memory pinned by `pin_l0_filter_and_index_blocks_in_cache` needs to be predictable based on user config. This PR makes sure
we do not pin extra memory for large files generated by intra-L0 (see https://github.com/facebook/rocksdb/issues/6889).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6911

Test Plan: unit test

Reviewed By: siying

Differential Revision: D21835818

Pulled By: ajkr

fbshipit-source-id: a11a088549d06bed8aacc2548d266e5983f0ead4
2020-06-09 16:51:23 -07:00
Yanqin Jin 3020df9df5 Remove unnecessary inclusion of version_edit.h in env (#6952)
Summary:
In db_options.c, we should avoid including header files in the `db` directory to avoid introducing unnecessary dependency. The reason why `version_edit.h` has been included in `db_options.cc` is because we need two constants, `kUnknownChecksum` and `kUnknownChecksumFuncName`. We can put these two constants as `constexpr` in the public header `file_checksum.h`.

Test plan (devserver):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6952

Reviewed By: zhichao-cao

Differential Revision: D21925341

Pulled By: riversand963

fbshipit-source-id: 2902f3b74c97f0cf16c58ad24c095c787c3a40e2
2020-06-07 21:56:55 -07:00
Zhichao Cao f941adef88 Clean up the dead code (#6946)
Summary:
Remove the dead code in table test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6946

Test Plan: run table_test

Reviewed By: riversand963

Differential Revision: D21913563

Pulled By: zhichao-cao

fbshipit-source-id: c0aa9f3b95dfe87dd7fb2cd4823784f08cb3ddd3
2020-06-05 17:45:22 -07:00
anand76 98b0cbea88 Check iterator status BlockBasedTableReader::VerifyChecksumInBlocks() (#6909)
Summary:
The ```for``` loop in ```VerifyChecksumInBlocks``` only checks ```index_iter->Valid()``` which could be ```false``` either due to reaching the end of the index or, in case of partitioned index, it could be due to a checksum mismatch error when reading a 2nd level index block. Instead of throwing away the index iterator status, we need to return any errors back to the caller.

Tests:
Add a test in block_based_table_reader_test.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6909

Reviewed By: pdillinger

Differential Revision: D21833922

Pulled By: anand1976

fbshipit-source-id: bc778ebf1121dbbdd768689de5183f07a9f0beae
2020-06-05 11:08:25 -07:00
Peter Dillinger c7432cc3c0 Fix more defects reported by Coverity Scan (#6935)
Summary:
Mostly uninitialized values: some probably written before use, but some seem like bugs. Also, destructor needs to be virtual, and possible use-after-free in test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6935

Test Plan: make check

Reviewed By: siying

Differential Revision: D21885484

Pulled By: pdillinger

fbshipit-source-id: e2e7cb0a0cf196f2b55edd16f0634e81f6cc8e08
2020-06-04 15:35:08 -07:00
sdong afa3518839 Revert "Update googletest from 1.8.1 to 1.10.0 (#6808)" (#6923)
Summary:
This reverts commit 8d87e9cea1.

Based on offline discussions, it's too early to upgrade to gtest 1.10, as it prevents some developers from using an older version of gtest to integrate to some other systems. Revert it for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6923

Reviewed By: pdillinger

Differential Revision: D21864799

fbshipit-source-id: d0726b1ff649fc911b9378f1763316200bd363fc
2020-06-03 15:55:03 -07:00
Peter Dillinger 9360776cb9 Fix handling of too-small filter partition size (#6905)
Summary:
Because ARM and some other platforms have a larger cache line
size, they have a larger minimum filter size, which causes recently
added PartitionedMultiGet test in db_bloom_filter_test to fail on those
platforms. The code would actually end up using larger partitions,
because keys_per_partition_ would be 0 and never == number of keys
added.

The code now attempts to get as close as possible to the small target
size, while fully utilizing that filter size, if the target partition
size is smaller than the minimum filter size.

Also updated the test to break more uniformly across platforms
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6905

Test Plan: updated test, tested on ARM

Reviewed By: anand1976

Differential Revision: D21840639

Pulled By: pdillinger

fbshipit-source-id: 11684b6d35f43d2e98b85ddb2c8dcfd59d670817
2020-06-03 10:43:01 -07:00
Zhichao Cao 2adb7e3768 Fix potential overflow of unsigned type in for loop (#6902)
Summary:
x.size() -1 or y - 1 can overflow to an extremely large value when x.size() pr y is 0 when they are unsigned type. The end condition of i in the for loop will be extremely large, potentially causes segment fault. Fix them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6902

Test Plan: pass make asan_check

Reviewed By: ajkr

Differential Revision: D21843767

Pulled By: zhichao-cao

fbshipit-source-id: 5b8b88155ac5a93d86246d832e89905a783bb5a1
2020-06-02 15:05:07 -07:00
Peter Dillinger 14eca6bf04 For ApproximateSizes, pro-rate table metadata size over data blocks (#6784)
Summary:
The implementation of GetApproximateSizes was inconsistent in
its treatment of the size of non-data blocks of SST files, sometimes
including and sometimes now. This was at its worst with large portion
of table file used by filters and querying a small range that crossed
a table boundary: the size estimate would include large filter size.

It's conceivable that someone might want only to know the size in terms
of data blocks, but I believe that's unlikely enough to ignore for now.
Similarly, there's no evidence the internal function AppoximateOffsetOf
is used for anything other than a one-sided ApproximateSize, so I intend
to refactor to remove redundancy in a follow-up commit.

So to fix this, GetApproximateSizes (and implementation details
ApproximateSize and ApproximateOffsetOf) now consistently include in
their returned sizes a portion of table file metadata (incl filters
and indexes) based on the size portion of the data blocks in range. In
other words, if a key range covers data blocks that are X% by size of all
the table's data blocks, returned approximate size is X% of the total
file size. It would technically be more accurate to attribute metadata
based on number of keys, but that's not computationally efficient with
data available and rarely a meaningful difference.

Also includes miscellaneous comment improvements / clarifications.

Also included is a new approximatesizerandom benchmark for db_bench.
No significant performance difference seen with this change, whether ~700 ops/sec with cache_index_and_filter_blocks and small cache or ~150k ops/sec without cache_index_and_filter_blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6784

Test Plan:
Test added to DBTest.ApproximateSizesFilesWithErrorMargin.
Old code running new test...

    [ RUN      ] DBTest.ApproximateSizesFilesWithErrorMargin
    db/db_test.cc:1562: Failure
    Expected: (size) <= (11 * 100), actual: 9478 vs 1100

Other tests updated to reflect consistent accounting of metadata.

Reviewed By: siying

Differential Revision: D21334706

Pulled By: pdillinger

fbshipit-source-id: 6f86870e45213334fedbe9c73b4ebb1d8d611185
2020-06-02 12:30:23 -07:00
sdong 298b00a396 Reduce dependency on gtest dependency in release code (#6907)
Summary:
Release code now depends on gtest, indirectly through including "test_util/testharness.h". This creates multiple problems. One important reason is the definition of IGNORE_STATUS_IF_ERROR() in test_util/testharness.h. Move it to sync_point.h instead.
Note that utilities/cassandra/format.h still depends on "test_util/testharness.h". This will be resolved in a separate diff.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6907

Test Plan: Run all existing tests.

Reviewed By: ajkr

Differential Revision: D21829884

fbshipit-source-id: 9253c19ffde2936f3ae68998210f8e54f645a6e6
2020-06-02 12:11:24 -07:00
Adam Retter 8d87e9cea1 Update googletest from 1.8.1 to 1.10.0 (#6808)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6808

Reviewed By: anand1976

Differential Revision: D21483984

Pulled By: pdillinger

fbshipit-source-id: 70c5eff2bd54ddba469761d95e4cd4611fb8e598
2020-06-01 20:33:42 -07:00
anand76 66942e8158 Avoid unnecessary reads of uncompression dictionary in MultiGet (#6906)
Summary:
We may sometimes read the uncompression dictionary when its not
necessary, when we lookup a key in an SST file but the index indicates
the key is not present. This can happen with index_type 3.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6906

Test Plan: make check

Reviewed By: cheng-chang

Differential Revision: D21828944

Pulled By: anand1976

fbshipit-source-id: 7aef4f0a39548d0874eafefd2687006d2652f9bb
2020-06-01 19:43:37 -07:00
Cheng Chang bcb9e41080 Explicitly free allocated buffer when status is not ok (#6903)
Summary:
Currently we rely on `BlockContents` to implicitly free the allocated scratch buffer, but when IO error happens, it doesn't make sense to construct the `BlockContents` which might be corrupted. In the stress test, we find that `assert(req.result.size() == block_size(handle));` fails because of potential IO errors.

In this PR, we explicitly free the scratch buffer on error without constructing `BlockContents`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6903

Test Plan: watch stress test

Reviewed By: anand1976

Differential Revision: D21823869

Pulled By: cheng-chang

fbshipit-source-id: 5603fc80e9bf3f44a9d7250ddebd871afe1eb89f
2020-06-01 15:19:40 -07:00
Andrew Kryczka c5abf78bca avoid `IterKey::UpdateInternalKey()` in `BlockIter` (#6843)
Summary:
`IterKey::UpdateInternalKey()` is an error-prone API as it's
incompatible with `IterKey::TrimAppend()`, which is used for
decoding delta-encoded internal keys. This PR stops using it in
`BlockIter`. Instead, it assigns global seqno in a separate `IterKey`'s
buffer when needed. The logic for safely getting a Slice with global
seqno properly assigned is encapsulated in `GlobalSeqnoAppliedKey`.
`BinarySeek()` is also migrated to use this API (previously it ignored
global seqno entirely).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6843

Test Plan:
benchmark setup -- single file DBs, in-memory, no compression. "normal_db"
created by regular flush; "ingestion_db" created by ingesting a file. Both
DBs have same contents.

```
$ TEST_TMPDIR=/dev/shm/normal_db/ ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=10485760000 -disable_auto_compactions=true -compression_type=none -num=1000000
$ ./ldb write_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/ --compression_type=no --hex --create_if_missing < <(./sst_dump --command=scan --output_hex --file=/dev/shm/normal_db/dbbench/000007.sst | awk 'began {print "0x" substr($1, 2, length($1) - 2), "==>", "0x" $5} ; /^Sst file format: block-based/ {began=1}')
$ ./ldb ingest_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/
```

benchmark run command:
```
TEST_TMPDIR=/dev/shm/$DB/ ./db_bench -benchmarks=seekrandom -seek_nexts=10 -use_existing_db=true -cache_index_and_filter_blocks=false -num=1000000 -cache_size=1048576000 -threads=1 -reads=40000000
```

results:

| DB | code | throughput |
|---|---|---|
| normal_db | master |  267.9 |
| normal_db   |    PR6843 | 254.2 (-5.1%) |
| ingestion_db |   master |  259.6 |
| ingestion_db |   PR6843 | 250.5 (-3.5%) |

Reviewed By: pdillinger

Differential Revision: D21562604

Pulled By: ajkr

fbshipit-source-id: 937596f836930515da8084d11755e1f247dcb264
2020-05-28 10:51:30 -07:00
Yanqin Jin 961c7590d6 Add timestamp to delete (#6253)
Summary:
Preliminary user-timestamp support for delete.

If ["a", ts=100] exists, you can delete it by calling `DB::Delete(write_options, key)` in which `write_options.timestamp` points to a `ts` higher than 100.

Implementation
A new ValueType, i.e. `kTypeDeletionWithTimestamp` is added for deletion marker with timestamp.
The reason for a separate `kTypeDeletionWithTimestamp`: RocksDB may drop tombstones (keys with kTypeDeletion) when compacting them to the bottom level. This is OK and useful if timestamp is disabled. When timestamp is enabled, should we still reuse `kTypeDeletion`, we may drop the tombstone with a more recent timestamp, causing deleted keys to re-appear.

Test plan (dev server)
```
make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6253

Reviewed By: ltamasi

Differential Revision: D20995328

Pulled By: riversand963

fbshipit-source-id: a9e5c22968ad76f98e3dc6ee0151265a3f0df619
2020-05-28 10:40:03 -07:00
Akanksha Mahajan bcefc59e9f Allow MultiGet users to limit cumulative value size (#6826)
Summary:
1. Add a value_size in read options which limits the cumulative value size of keys read in batches. Once the size exceeds read_options.value_size, all the remaining keys are returned with status Abort without further fetching any key.
2. Add a unit test case MultiGetBatchedValueSizeSimple the reads keys from memory and sst files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6826

Test Plan:
1. make check -j64
	   2. Add a new unit test case

Reviewed By: anand1976

Differential Revision: D21471483

Pulled By: akankshamahajan15

fbshipit-source-id: dea51b8e76d5d1df38ece8cdb29933b1d798b900
2020-05-27 13:07:14 -07:00
Cheng Chang 82a82c76e7 Fix potential memory leak of scratch buffer (#6879)
Summary:
If `req.scratch` is an internally allocated buffer, but `raw_block_contents` is not constructed to own `req.scratch`, then `req.scratch` will be leaked.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6879

Test Plan: make asan_check

Reviewed By: anand1976

Differential Revision: D21728498

Pulled By: cheng-chang

fbshipit-source-id: 8fc6a4f2543918c565ddc16ecfad1807eb9a42cf
2020-05-26 15:29:04 -07:00
Andrew Kryczka 292bcf6227 skip direct I/O tests in rocksdb lite (#6867)
Summary:
Fix a couple places where direct I/O was used even though it is
unsupported in lite builds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6867

Test Plan: `LITE=1 make check -j48`

Reviewed By: pdillinger

Differential Revision: D21689185

Pulled By: ajkr

fbshipit-source-id: 3eaa3abf69cd7d0bcaabbcad3bb5a26fb8dd7301
2020-05-21 13:57:17 -07:00
mrambacher 38be686160 Add Struct Type to OptionsTypeInfo (#6425)
Summary:
Added code for generically handing structs to OptionTypeInfo.  A struct is a collection of variables handled by their own map of OptionTypeInfos.  Examples of structs include Compaction and Cache options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6425

Reviewed By: siying

Differential Revision: D21668789

Pulled By: zhichao-cao

fbshipit-source-id: 064b110de39dadf82361ed4663f7ac1a535b0b07
2020-05-21 10:58:39 -07:00
Peter Dillinger c7aedf1b48 Clean up some code related to file checksums (#6861)
Summary:
* Add missing unit test for schema stability of FileChecksumGenCrc32c
  (previously was only comparing to itself)
* A lot of clarifying comments
* Add some assertions for preconditions
* Rename WritableFileWriter::CalculateFileChecksum -> UpdateFileChecksum
* Simplify FileChecksumGenCrc32c with shared functions
* Implement EndianSwapValue to replace unused EndianTransform

And incidentally since I had trouble with 'make check-format' GitHub action disagreeing with local run,
* Output full diagnostic information when 'make check-format' fails in CI
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6861

Test Plan: new unit test passes before & after other changes

Reviewed By: zhichao-cao

Differential Revision: D21667115

Pulled By: pdillinger

fbshipit-source-id: 6a99970f87605aa024fa540c78cd519ff322c3e6
2020-05-21 08:12:51 -07:00
Zhichao Cao 545e14b53b Generate file checksum in SstFileWriter (#6859)
Summary:
If Option.file_checksum_gen_factory is set, rocksdb generates the file checksum during flush and compaction based on the checksum generator created by the factory and store the checksum and function name in vstorage and Manifest.

This PR enable file checksum generation in SstFileWrite and store the checksum and checksum function name in the  ExternalSstFileInfo, such that application can use them for other purpose, for example, ingest the file checksum with files in IngestExternalFile().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6859

Test Plan: add unit test and pass make asan_check.

Reviewed By: ajkr

Differential Revision: D21656247

Pulled By: zhichao-cao

fbshipit-source-id: 78a3570c76031d8832e3d2de3d6c79cdf2b675d0
2020-05-20 11:55:31 -07:00
Cheng Chang 91b7553293 Enable IO Uring in MultiGet in direct IO mode (#6815)
Summary:
Currently, in direct IO mode, `MultiGet` retrieves the data blocks one by one instead of in parallel, see `BlockBasedTable::RetrieveMultipleBlocks`.

Since direct IO is supported in `RandomAccessFileReader::MultiRead` in https://github.com/facebook/rocksdb/pull/6446, this PR applies `MultiRead` to `MultiGet` so that the data blocks can be retrieved in parallel.

Also, in direct IO mode and when data blocks are compressed and need to uncompressed, this PR only allocates one continuous aligned buffer to hold the data blocks, and then directly uncompress the blocks to insert into block cache, there is no longer intermediate copies to scratch buffers.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6815

Test Plan:
1. added a new unit test `BlockBasedTableReaderTest::MultiGet`.
2. existing unit tests and stress tests  contain tests against `MultiGet` in direct IO mode.

Reviewed By: anand1976

Differential Revision: D21426347

Pulled By: cheng-chang

fbshipit-source-id: b8446ae0e74152444ef9111e97f8e402ac31b24f
2020-05-14 23:26:26 -07:00
sdong 4a4b8a1344 sst_dump to reduce number of file reads (#6836)
Summary:
sst_dump can issue many file reads from the file system. This doesn't work well with file systems without a OS cache, especially remote file systems. In order to mitigate this problem, several improvements are done:
1. --readahead_size is added, so that users can specify readahead size when scanning the data.
2. Force a 512KB tail readahead, which prevents three I/Os for footer, meta index and property blocks and hopefully index and filter blocks too.
3. Consoldiate SSTDump's I/Os before opening the file for read. Use the same file prefetch buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6836

Test Plan: Add a test that covers this new feature.

Reviewed By: pdillinger

Differential Revision: D21516607

fbshipit-source-id: 3ae43526286f67b2f4a5bdedfbc92719d579b87e
2020-05-12 18:23:33 -07:00
Ziyue Yang c384c08a4f Add tests for compression failure in BlockBasedTableBuilder (#6709)
Summary:
Currently there is no check for whether BlockBasedTableBuilder will expose
compression error status if compression fails during the table building.
This commit adds fake faulting compressors and a unit test to test such
cases.

This check finds 5 bugs, and this commit also fixes them:

1. Not handling compression failure well in
   BlockBasedTableBuilder::BGWorkWriteRawBlock.
2. verify_compression failing in BlockBasedTableBuilder when used with ZSTD.
3. Wrongly passing the same reference of block contents to
   BlockBasedTableBuilder::CompressAndVerifyBlock in parallel compression.
4. Wrongly setting block_rep->first_key_in_next_block to nullptr in
   BlockBasedTableBuilder::EnterUnbuffered when there are still incoming data
   blocks.
5. Not maintaining variables for compression ratio estimation and first_block
   in BlockBasedTableBuilder::EnterUnbuffered.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6709

Reviewed By: ajkr

Differential Revision: D21236254

fbshipit-source-id: 101f6e62b2bac2b7be72be198adf93cd32a1ff46
2020-05-12 09:27:35 -07:00
Peter Dillinger b27a1448b6 Fix false NotFound from batched MultiGet with kHashSearch (#6821)
Summary:
The error is assigning KeyContext::s to NotFound status in a
table reader for a "not found in this table" case, which skips searching
in later tables, like only a delete should. (The hash search index iterator
is the only one that can return status NotFound even if Valid() == false.)

This was detected by intermittent failure in
MultiThreadedDBTest.MultiThreaded/5, a kHashSearch configuration.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6821

Test Plan: modified existing unit test to reproduce problem

Reviewed By: anand1976

Differential Revision: D21450469

Pulled By: pdillinger

fbshipit-source-id: 7478003684d637dbd491cdac81468041a791be2c
2020-05-07 15:41:37 -07:00
mrambacher 394f2bbd13 Add OptionTypeInfo::Enum and related methods (#6423)
Summary:
Add methods and constructors for handling enums to the OptionTypeInfo.  This change allows enums to be converted/compared without adding a special "type" to the OptionType.

This change addresses a couple of issues:
- It allows new enumerated types to be added to the options without editing the OptionType base class (and related methods)
- It standardizes the procedure for adding enumerated types to the options, reducing potential mistakes
- It moves the enum maps to the location where they are used, allowing them to be static file members rather than global values
- It reduces the number of types and cases that need to be handled in the various OptionType methods
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6423

Reviewed By: siying

Differential Revision: D21408713

Pulled By: zhichao-cao

fbshipit-source-id: fc492af285d011822578b95d186a0fce25d35626
2020-05-05 15:04:04 -07:00
sdong 079e50d2ba Disallow BlockBasedTableBuilder to set status from non-OK (#6776)
Summary:
There is no systematic mechanism to prevent BlockBasedTableBuilder's status to be set from non-OK to OK. Adding a mechanism to force this will help us prevent failures in the future.

The solution is to only make it possible to set the status code if the status code to set is not OK.

Since the status code passed to CompressAndVerifyBlock() is changed, a mini refactoring is done too so that the output arguments are changed from reference to pointers, based on Google C++ Style.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6776

Test Plan: Run all existing test.

Reviewed By: pdillinger

Differential Revision: D21314382

fbshipit-source-id: 27000c10f1e4c121661e026548d6882066409375
2020-04-30 15:37:03 -07:00
anand76 ab13d43e1d Pass a timeout to FileSystem for random reads (#6751)
Summary:
Calculate ```IOOptions::timeout``` using ```ReadOptions::deadline``` and pass it to ```FileSystem::Read/FileSystem::MultiRead```. This allows us to impose a tighter bound on the time taken by Get/MultiGet on FileSystem/Envs that support IO timeouts. Even on those that don't support, check in ```RandomAccessFileReader::Read``` and ```MultiRead``` and return ```Status::TimedOut()``` if the deadline is exceeded.

For now, TableReader creation, which might do file opens and reads, are not covered. It will be implemented in another PR.

Tests:
Update existing unit tests to verify the correct timeout value is being passed
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6751

Reviewed By: riversand963

Differential Revision: D21285631

Pulled By: anand1976

fbshipit-source-id: d89af843e5a91ece866e87aa29438b52a65a8567
2020-04-30 14:50:39 -07:00
Peter Dillinger eecd8fba46 Fix assertion that can fail on sst corruption (#6780)
Summary:
An assertion that a char == a CompressionType (unsigned char)
originally cast from a char can fail if the original value is negative,
due to numeric promotion.  The assertion should pass even if the value
is invalid CompressionType, because the callee
UncompressBlockContentsForCompressionType checks for that and reports
status appropriately.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6780

Test Plan:
Temporarily change kZSTD = 0x88 and see tests fail. Make this
change (in addition), and tests pass.

Reviewed By: siying

Differential Revision: D21328498

Pulled By: pdillinger

fbshipit-source-id: 61caf8d815581ce49261ecb7ab0f396e9ac4bb92
2020-04-30 12:11:00 -07:00
mrambacher 618bf638aa Add Functions to OptionTypeInfo (#6422)
Summary:
Added functions for parsing, serializing, and comparing elements to OptionTypeInfo.  These functions allow all of the special cases that could not be handled directly in the map of OptionTypeInfo to be moved into the map.  Using these functions, every type can be handled via the map rather than special cased.

By adding these functions, the code for handling options can become more standardized (fewer special cases) and (eventually) handled completely by common classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6422

Test Plan: pass make check

Reviewed By: siying

Differential Revision: D21269005

Pulled By: zhichao-cao

fbshipit-source-id: 9ba71c721a38ebf9ee88259d60bd81b3282b9077
2020-04-28 18:04:26 -07:00
Peter Dillinger bae6f58696 Basic MultiGet support for partitioned filters (#6757)
Summary:
In MultiGet, access each applicable filter partition only once
per batch, rather than for each applicable key. Also,

* Fix Bloom stats for MultiGet
* Fix/refactor MultiGetContext::Range::KeysLeft, including
* Add efficient BitsSetToOne implementation
* Assert that MultiGetContext::Range does not go beyond shift range

Performance test: Generate db:

    $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10 -partition_index_and_filters=true
    ...

Before (middle performing run of three; note some missing Bloom stats):

    $ ./db_bench --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget'
    multireadrandom :      26.403 micros/op 597517 ops/sec; (548427 of 671968 found)
    rocksdb.block.cache.filter.hit COUNT : 83443275
    rocksdb.bloom.filter.useful COUNT : 0
    rocksdb.bloom.filter.full.positive COUNT : 0
    rocksdb.bloom.filter.full.true.positive COUNT : 7931450
    rocksdb.number.multiget.get COUNT : 385984
    rocksdb.number.multiget.keys.read COUNT : 12351488
    rocksdb.number.multiget.bytes.read COUNT : 793145000
    rocksdb.number.multiget.keys.found COUNT : 7931450

After (middle performing run of three):

    $ ./db_bench_new --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget'
    multireadrandom :      21.024 micros/op 752963 ops/sec; (705188 of 863968 found)
    rocksdb.block.cache.filter.hit COUNT : 49856682
    rocksdb.bloom.filter.useful COUNT : 45684579
    rocksdb.bloom.filter.full.positive COUNT : 10395458
    rocksdb.bloom.filter.full.true.positive COUNT : 9908456
    rocksdb.number.multiget.get COUNT : 481984
    rocksdb.number.multiget.keys.read COUNT : 15423488
    rocksdb.number.multiget.bytes.read COUNT : 990845600
    rocksdb.number.multiget.keys.found COUNT : 9908456

So that's about 25% higher throughput even for random keys
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6757

Test Plan: unit test included

Reviewed By: anand1976

Differential Revision: D21243256

Pulled By: pdillinger

fbshipit-source-id: 5644a1468d9e8c8575be02f4e04bc5d62dbbb57f
2020-04-28 14:49:34 -07:00
Yanqin Jin d4398e08fc Fix timestamp support for MultiGet (#6748)
Summary:
1. Avoid nullptr dereference when passing timestamp to KeyContext creation.
2. Construct LookupKey correctly with timestamp when creating MultiGetContext.
3. Compare without timestamp when sorting KeyContexts.

Fixes https://github.com/facebook/rocksdb/issues/6745

Test plan (dev server):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6748

Reviewed By: pdillinger

Differential Revision: D21258691

Pulled By: riversand963

fbshipit-source-id: 44e65b759c18b9986947783edf03be4f890bb004
2020-04-27 22:49:56 -07:00
Peter Dillinger 249eff0f30 Stats for redundant insertions into block cache (#6681)
Summary:
Since read threads do not coordinate on loading data into block
cache, two threads between Lookup and Insert can end up loading and
inserting the same data. This is particularly concerning with
cache_index_and_filter_blocks since those are hot and more likely to
be race targets if ejected from (or not pre-populated in) the cache.

Particularly with moves toward disaggregated / network storage, the cost
of redundant retrieval might be high, and we should at least have some
hard statistics from which we can estimate impact.

Example with full filter thrashing "cliff":

    $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10
    ...
    $ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((130 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 | egrep '^rocksdb.block.cache.(.*add|.*redundant)' | grep -v compress | sort
    rocksdb.block.cache.add COUNT : 14181
    rocksdb.block.cache.add.failures COUNT : 0
    rocksdb.block.cache.add.redundant COUNT : 476
    rocksdb.block.cache.data.add COUNT : 12749
    rocksdb.block.cache.data.add.redundant COUNT : 18
    rocksdb.block.cache.filter.add COUNT : 1003
    rocksdb.block.cache.filter.add.redundant COUNT : 217
    rocksdb.block.cache.index.add COUNT : 429
    rocksdb.block.cache.index.add.redundant COUNT : 241
    $ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((120 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 | egrep '^rocksdb.block.cache.(.*add|.*redundant)' | grep -v compress | sort
    rocksdb.block.cache.add COUNT : 1182223
    rocksdb.block.cache.add.failures COUNT : 0
    rocksdb.block.cache.add.redundant COUNT : 302728
    rocksdb.block.cache.data.add COUNT : 31425
    rocksdb.block.cache.data.add.redundant COUNT : 12
    rocksdb.block.cache.filter.add COUNT : 795455
    rocksdb.block.cache.filter.add.redundant COUNT : 130238
    rocksdb.block.cache.index.add COUNT : 355343
    rocksdb.block.cache.index.add.redundant COUNT : 172478
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6681

Test Plan: Some manual testing (above) and unit test covering key metrics is included

Reviewed By: ltamasi

Differential Revision: D21134113

Pulled By: pdillinger

fbshipit-source-id: c11497b5f00f4ffdfe919823904e52d0a1a91d87
2020-04-27 13:20:27 -07:00
Cheng Chang 40497a875a Reduce memory copies when fetching and uncompressing blocks from SST files (#6689)
Summary:
In https://github.com/facebook/rocksdb/pull/6455, we modified the interface of `RandomAccessFileReader::Read` to be able to get rid of memcpy in direct IO mode.
This PR applies the new interface to `BlockFetcher` when reading blocks from SST files in direct IO mode.

Without this PR, in direct IO mode, when fetching and uncompressing compressed blocks, `BlockFetcher` will first copy the raw compressed block into `BlockFetcher::compressed_buf_` or `BlockFetcher::stack_buf_` inside `RandomAccessFileReader::Read` depending on the block size. then during uncompressing, it will copy the uncompressed block into `BlockFetcher::heap_buf_`.

In this PR, we get rid of the first memcpy and directly uncompress the block from `direct_io_buf_` to `heap_buf_`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6689

Test Plan: A new unit test `block_fetcher_test` is added.

Reviewed By: anand1976

Differential Revision: D21006729

Pulled By: cheng-chang

fbshipit-source-id: 2370b92c24075692423b81277415feb2aed5d980
2020-04-24 15:32:56 -07:00
anand76 9e7b7e2c08 Silence false alarms in db_stress fault injection (#6741)
Summary:
False alarms are caused by codepaths that intentionally swallow IO
errors.

Tests:
make crash_test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6741

Reviewed By: ltamasi

Differential Revision: D21181138

Pulled By: anand1976

fbshipit-source-id: 5ccfbc68eb192033488de6269e59c00f2c65ce00
2020-04-24 13:06:12 -07:00
Ibrahim Jarif ae77880223 Fix some typos in code comments (#6733)
Summary:
This PR fixes some typos in code comments.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6733

Reviewed By: siying

Differential Revision: D21209037

Pulled By: zhichao-cao

fbshipit-source-id: d9274611fab1f5e992998c8c4117b8078c4cbc69
2020-04-23 12:28:49 -07:00
mrambacher 4cbc19d2a1 Add a ConfigOptions for use in comparing objects and converting to/from strings (#6389)
Summary:
The methods in convenience.h are used to compare/convert objects to/from strings.  There is a mishmash of parameters in use here with more needed in the future.  This PR replaces those parameters with a single structure.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6389

Reviewed By: siying

Differential Revision: D21163707

Pulled By: zhichao-cao

fbshipit-source-id: f807b4cc7e2b0af3871536b69546b2604dfa81bd
2020-04-21 17:38:17 -07:00
Andrew Kryczka f9155a3404 Prevent uninitialized load in `IndexBlockIter` (#6736)
Summary:
When index block is empty or an error happens while reading it,
`Invalidate()` is called rather than `Initialize()`. So `Seek()` must
not refer to member variables that are only initialized in
`Initialize()` until it is sure `Initialize()` has been called.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6736

Reviewed By: siying

Differential Revision: D21139641

Pulled By: ajkr

fbshipit-source-id: 71c58cc1adbd795dc3729dd5023bf7df1515ff32
2020-04-20 16:32:43 -07:00
Peter Dillinger 31da5e34c1 C++20 compatibility (#6697)
Summary:
Based on https://github.com/facebook/rocksdb/issues/6648 (CLA Signed), but heavily modified / extended:

* Implicit capture of this via [=] deprecated in C++20, and [=,this] not standard before C++20 -> now using explicit capture lists
* Implicit copy operator deprecated in gcc 9 -> add explicit '= default' definition
* std::random_shuffle deprecated in C++17 and removed in C++20 -> migrated to a replacement in RocksDB random.h API
* Add the ability to build with different std version though -DCMAKE_CXX_STANDARD=11/14/17/20 on the cmake command line
* Minimal rebuild flag of MSVC is deprecated and is forbidden with /std:c++latest (C++20)
* Added MSVC 2019 C++11 & MSVC 2019 C++20 in AppVeyor
* Added GCC 9 C++11 & GCC9 C++20 in Travis
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6697

Test Plan: make check and CI

Reviewed By: cheng-chang

Differential Revision: D21020318

Pulled By: pdillinger

fbshipit-source-id: 12311be5dbd8675a0e2c817f7ec50fa11c18ab91
2020-04-20 13:24:25 -07:00
Peter Dillinger 45d2b4efca Fix tabs and lint-ignores (#6734)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6734

Reviewed By: cheng-chang

Differential Revision: D21134556

Pulled By: pdillinger

fbshipit-source-id: 3636cc1d1333137b70031f8277458781c21631fb
2020-04-20 11:39:31 -07:00
Mike Kolupaev e45673dece Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.

Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.

It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.

Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621

Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.

Reviewed By: siying

Differential Revision: D20786930

Pulled By: al13n321

fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
2020-04-15 17:40:44 -07:00
Ziyue Yang 41563b61db Fix data racing of BlockBasedTableBuilder::ParallelCompressionRep::first_block (#6640)
Summary:
BlockBasedTableBuilder::ParallelCompressionRep::first_block can be read in
Flush() and written in BGWorkWriteRawBlock() concurrently. This commit fixes
the issue by reading first_block out before pushing the block to compression
and write.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6640

Test Plan: Run all tests concurrently with TSAN.

Reviewed By: cheng-chang

Differential Revision: D20851370

fbshipit-source-id: 6f039222e8319d31e15f1b45e05c106527253f72
2020-04-13 16:24:57 -07:00
Andrew Kryczka 9eca6d651d fix comparison count for format_version=3 indexes (#6650)
Summary:
In index blocks since `format_version=3`, user keys are written
rather than internal keys. When reading such blocks, the comparator is
obtained via `InternalKeyComparator::user_comparator()`. That function
must not return an unwrapped result as the wrapper class provides
accounting logic to populate `PerfContext::user_key_comparison_count`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6650

Test Plan:
ran db_bench and verified
`PerfContext::user_key_comparison_count` became larger.

Reviewed By: cheng-chang

Differential Revision: D20866325

Pulled By: ajkr

fbshipit-source-id: ad755d46bda31157dacc5b66e532279f19ad538c
2020-04-13 11:18:37 -07:00
anand76 79c838eb0f Fix a few bugs in db_stress fault injection (#6693)
Summary:
Fix the following issues -
1. Output parsing error in db_crashtest.py
2. Memory leak on exit
3. False alarm on filter block read error
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6693

Test Plan: asan_crash

Reviewed By: cheng-chang

Differential Revision: D20990399

Pulled By: anand1976

fbshipit-source-id: 178ee0dd7c69a4bc5db698379db0dedb29281699
2020-04-13 11:01:03 -07:00
anand76 5c19a441c4 Fault injection in db_stress (#6538)
Summary:
This PR implements a fault injection mechanism for injecting errors in reads in db_stress. The FaultInjectionTestFS is used for this purpose. A thread local structure is used to track the errors, so that each db_stress thread can independently enable/disable error injection and verify observed errors against expected errors. This is initially enabled only for Get and MultiGet, but can be extended to iterator as well once its proven stable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6538

Test Plan:
crash_test
make check

Reviewed By: riversand963

Differential Revision: D20714347

Pulled By: anand1976

fbshipit-source-id: d7598321d4a2d72bda0ced57411a337a91d87dc7
2020-04-10 17:21:26 -07:00
anand76 d600e5b0eb Fix a Centos build failure reported in #6651 (#6656)
Summary:
Fixes issue https://github.com/facebook/rocksdb/issues/6651

Tests:
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6656

Reviewed By: cheng-chang

Differential Revision: D20879084

Pulled By: anand1976

fbshipit-source-id: c2cc508ca2716fcf80dcf9d2ba31c32d211f941e
2020-04-10 11:47:46 -07:00
Yi Wu eb287c72d7 Fix wrong key being read on ingested file with global seqno and delta encoding (#6669)
Summary:
On reading an ingested SST file, `DataBlockIter` will replace seqno encoded in a key with global seqno. However, if the original seqno was part of the prefix used for the next key, the global seqno is by mistake used as part of the prefix to construct the next key, causing wrong result being returned. Although at this point it is only software error while data in the file is not corrupted, the issue can further cause compaction output out of order and corrupted result when the ingested SST participated in compaction. Fixing the issue by save the actual seqno and restore it before the key being used as prefix to construct next key.

The unit test is by Little-Wallace from https://github.com/facebook/rocksdb/issues/6666. Fixing https://github.com/facebook/rocksdb/issues/6666.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6669

Test Plan:
New unit test

Signed-off-by: Yi Wu <yiwu@pingcap.com>

Reviewed By: cheng-chang

Differential Revision: D20931808

Pulled By: ajkr

fbshipit-source-id: f01959c35d6a493954dca981663766c7a5a9e8ab
2020-04-08 21:22:15 -07:00
anand76 fcd7bee925 Properly account block_decompress_time (#6658)
Summary:
It was incorrectly counting time even for blocks that didn't need decompression.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6658

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D20883522

Pulled By: anand1976

fbshipit-source-id: 33c9c4683f54cad150ab260a69e3ef8aa9aff76a
2020-04-07 12:53:59 -07:00
sdong 00f8016b36 Fix clang anaylze warning caused by #6262 (#6641)
Summary:
https://github.com/facebook/rocksdb/pull/6262 causes CLANG analyze to complain. Add assertion to suppress the warning.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6641

Test Plan: Run "clang analyze" and make sure it passes.

Reviewed By: anand1976

Differential Revision: D20841722

fbshipit-source-id: 5fa6e0c5cfe7a822214c9b898a408df59d4fd2cd
2020-04-03 15:47:51 -07:00
mrambacher 259b6ec8da Move the OptionTypeMap code closer to home (#6198)
Summary:
This is a predecessor to the Configurable PR.  This change moves the OptionTypeInfo maps closer to where they will be used.

When the Configurable changes are adopted, these values will become static and not associated with the OptionsHelper.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6198

Reviewed By: siying

Differential Revision: D20778108

Pulled By: zhichao-cao

fbshipit-source-id: a9f85fc73bc53503656e1958ecc1e764052fd1aa
2020-04-03 10:52:38 -07:00
Burton Li df62cd5b35 Fix msvc debug test failures (#6579)
Summary:
1. stats_history_test: one slice of stats history is 12526 Bytes, which is greater than original assumption.
![image](https://user-images.githubusercontent.com/17753898/77381970-5a611a80-6d3c-11ea-9d64-59d2e3c04f79.png)
2. table_test: in VerifyBlockAccessTrace function, release trace reader before delete trace file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6579

Reviewed By: siying

Differential Revision: D20767373

Pulled By: pdillinger

fbshipit-source-id: e8647d665cbe83a3f5429639c6219b50c0912124
2020-04-03 09:54:25 -07:00
sdong d0f3894cf1 In block based table builder, make variables for estimating file size atomic (#6636)
Summary:
With https://github.com/facebook/rocksdb/issues/6262, TSAN complains about data race of some variables. Those variables are used to estimate file size and are accessed in writer and background threads. Since file size estimation doesn't have to be 100% accurate, we make some variables atomic and use relaxed memory order.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6636

Test Plan: Run all tests with TSAN.

Reviewed By: anand1976

Differential Revision: D20820635

fbshipit-source-id: 1ea45ff38be15e33674ffe06b7d42fc9fe161ea5
2020-04-02 16:16:24 -07:00
Ziyue Yang 8088482dd6 Fix a division by zero after #6262 (#6633)
Summary:
With https://github.com/facebook/rocksdb/issues/6262, UBSAN fails with "division by zero":

[ RUN      ] Timestamp/DBBasicTestWithTimestampCompressionSettings.PutAndGetWithCompaction/3
internal_repo_rocksdb/repo/table/block_based/block_based_table_builder.cc:1066:39: runtime error: division by zero
    #0 0x7ffb3117b071 in rocksdb::BlockBasedTableBuilder::WriteRawBlock(rocksdb::Slice const&, rocksdb::CompressionType, rocksdb::BlockHandle*, bool) internal_repo_rocksdb/repo/table/block_based/block_based_table_builder.cc:1066
    https://github.com/facebook/rocksdb/issues/1 0x7ffb311775e1 in rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::Slice const&, rocksdb::BlockHandle*, bool) internal_repo_rocksdb/repo/table/block_based/block_based_table_builder.cc:848
    https://github.com/facebook/rocksdb/issues/2 0x7ffb311771a2 in rocksdb::BlockBasedTableBuilder::WriteBlock(rocksdb::BlockBuilder*, rocksdb::BlockHandle*, bool) internal_repo_rocksdb/repo/table/block_based/block_based_table_builder.cc:832

This is caused by not returning immediately after CompressAndVerifyBlock call
in WriteBlock when rep_->status == kBuffered.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6633

Test Plan: Run all existing test.

Reviewed By: anand1976

Differential Revision: D20808366

fbshipit-source-id: 09f24b7c0fbaf4c7a8fc48cac61fa6fcb9b85811
2020-04-02 11:57:05 -07:00
Ziyue Yang 03a781a90c Add pipelined & parallel compression optimization (#6262)
Summary:
This PR adds support for pipelined & parallel compression optimization for `BlockBasedTableBuilder`. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set `CompressionOptions::parallel_threads` greater than 1 to enable compression parallelism.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6262

Reviewed By: ajkr

Differential Revision: D20651306

fbshipit-source-id: 62125590a9c15b6d9071def9dc72589c1696a4cb
2020-04-01 16:40:18 -07:00
Levi Tamasi e6f86cfb36 Revert the recent cache deleter change (#6620)
Summary:
Revert "Use function objects as deleters in the block cache (https://github.com/facebook/rocksdb/issues/6545)"

    This reverts commit 6301dbe7a7.

    Revert "Call out the cache deleter related interface change in HISTORY.md (https://github.com/facebook/rocksdb/issues/6606)"

    This reverts commit 3a35542f86.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6620

Test Plan: `make check`

Reviewed By: zhichao-cao

Differential Revision: D20773311

Pulled By: ltamasi

fbshipit-source-id: 7637a761f718f323ef0e7da959462e8fb06e7a2b
2020-03-31 16:11:06 -07:00