Commit Graph

730 Commits

Author SHA1 Message Date
Cheng Chang 80159f6e0b Carry over min_log_number_to_keep_2pc in new MANIFEST (#7747)
Summary:
When two phase commit is enabled, `VersionSet::min_log_number_to_keep_2pc` is set during flush.
But when a new MANIFEST is created, the `min_log_number_to_keep_2pc` is not carried over to the new MANIFEST. So if a new MANIFEST is created and then DB is reopened, the `min_log_number_to_keep_2pc` will be lost.  This may cause DB recovery errors.
The bug is reproduced in a new unit test in `version_set_test.cc`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7747

Test Plan: The new unit test in `version_set_test.cc` should pass.

Reviewed By: jay-zhuang

Differential Revision: D25350661

Pulled By: cheng-chang

fbshipit-source-id: eee890d5b19f15769069670692e270ae31044ece
2020-12-09 19:07:25 -08:00
Yanqin Jin 11c4be2222 Refactor ProcessManifestWrites a little bit (#7751)
Summary:
This PR removes a nested loop inside ProcessManifestWrites. The new
implementation has the same behavior as the old code with simpler logic
and lower complexity.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7751

Test Plan:
make check
Run make crash_test on devserver and succeeds 3 times.

Reviewed By: ltamasi

Differential Revision: D25363526

Pulled By: riversand963

fbshipit-source-id: 27e681949dacd7501a752e5e517b9e85b54ccb2e
2020-12-08 02:37:38 -08:00
Yanqin Jin eee0af9af1 Add full_history_ts_low to column family (#7740)
Summary:
Following https://github.com/facebook/rocksdb/issues/7655 and https://github.com/facebook/rocksdb/issues/7657, this PR adds `full_history_ts_low_` to `ColumnFamilyData`.
`ColumnFamilyData::full_history_ts_low_` will be used to create `FlushJob` and `CompactionJob`.

`ColumnFamilyData::full_history_ts_low` is persisted to the MANIFEST file. An application can only
increase its value. Consider the following case:

>
> The database has a key at ts=950. `full_history_ts_low` is first set to 1000, and then a GC is triggered
> and cleans up all data older than 1000. If the application sets `full_history_ts_low` to 900 afterwards,
> and tries to read at ts=960, the key at 950 is not seen. From the perspective of the read, the result
> is hard to reason. For simplicity, we just do now allow decreasing full_history_ts_low for now.
>

During recovery, the value of `full_history_ts_low` is restored for each column family if applicable. Note that
version edits in the MANIFEST file for the same column family may have `full_history_ts_low` unsorted due
to the potential interleaving of `LogAndApply` calls. Only the max will be used to restore the state of the
column family.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7740

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D25296217

Pulled By: riversand963

fbshipit-source-id: 24acda1df8262cd7cfdc6ce7b0ec56438abe242a
2020-12-05 14:18:22 -08:00
Levi Tamasi 61932cdf1d Add blob support to DBIter (#7731)
Summary:
The patch adds iterator support to the integrated BlobDB implementation.
Whenever a blob reference is encountered during iteration, the corresponding
blob is retrieved by calling `Version::GetBlob`, assuming the `expose_blob_index`
(formerly `allow_blob`) flag is *not* set. (Note: the flag is set by the old stacked
BlobDB implementation, which has its own blob file handling/blob retrieval logic.)

In addition, `DBIter` now uniformly returns `Status::NotSupported` with the error
message `"BlobDB does not support merge operator."` when encountering a
blob reference while performing a merge (instead of potentially returning a
message that implies the database should be opened using the stacked BlobDB's
`Open`.)

TODO: We can implement support for lazily retrieving the blob value (or in other
words, bypassing the retrieval of blob values based on key) by extending the `Iterator`
API with a new `PrepareValue` method (similarly to `InternalIterator`, which already
supports lazy values).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7731

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D25256293

Pulled By: ltamasi

fbshipit-source-id: c39cd782011495a526cdff99c16f5fca400c4811
2020-12-04 21:29:38 -08:00
Levi Tamasi 51a8dc6d14 Integrated blob garbage collection: relocate blobs (#7694)
Summary:
The patch adds basic garbage collection support to the integrated BlobDB
implementation. Valid blobs residing in the oldest blob files are relocated
as they are encountered during compaction. The threshold that determines
which blob files qualify is computed based on the configuration option
`blob_garbage_collection_age_cutoff`, which was introduced in https://github.com/facebook/rocksdb/issues/7661 .
Once a blob is retrieved for the purposes of relocation, it passes through the
same logic that extracts large values to blob files in general. This means that
if, for instance, the size threshold for key-value separation (`min_blob_size`)
got changed or writing blob files got disabled altogether, it is possible for the
value to be moved back into the LSM tree. In particular, one way to re-inline
all blob values if needed would be to perform a full manual compaction with
`enable_blob_files` set to `false`, `enable_blob_garbage_collection` set to
`true`, and `blob_file_garbage_collection_age_cutoff` set to `1.0`.

Some TODOs that I plan to address in separate PRs:

1) We'll have to measure the amount of new garbage in each blob file and log
`BlobFileGarbage` entries as part of the compaction job's `VersionEdit`.
(For the time being, blob files are cleaned up solely based on the
`oldest_blob_file_number` relationships.)
2) When compression is used for blobs, the compression type hasn't changed,
and the blob still qualifies for being written to a blob file, we can simply copy
the compressed blob to the new file instead of going through decompression
and compression.
3) We need to update the formula for computing write amplification to account
for the amount of data read from blob files as part of GC.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7694

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D25069663

Pulled By: ltamasi

fbshipit-source-id: bdfa8feb09afcf5bca3b4eba2ba72ce2f15cd06a
2020-11-23 21:08:22 -08:00
Yanqin Jin 8b6b6aeb1a Refactor with VersionEditHandler (#6581)
Summary:
Added a few classes in the same class hierarchy to remove code duplication and
refactor the logic of reading and processing MANIFEST files.

New classes are as follows.
```
class VersionEditHandlerBase;
class ListColumnFamiliesHandler : VersionEditHandlerBase;
class FileChecksumRetriever : VersionEditHandlerBase;
class DumpManifestHandler : VersionEditHandler;
```
Classes that already existed before this PR are as follows.
```
class VersionEditHandler : VersionEditHandlerBase;
```

With these classes, refactored functions: `VersionSet::Recover()`,
`VersionSet::ListColumnFamilies()`, `VersionSet::DumpManifest()`,
`GetFileChecksumFromManifest()`.

Test Plan (devserver):
```
make check
COMPILE_WITH_ASAN=1 make check
```
These refactored code, especially recovery-related logic, will be tested intensively by
all existing unit tests and stress tests. For example, run
```
make crash_test
```
Verified 3 successful runs on devserver.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6581

Reviewed By: ajkr

Differential Revision: D20616217

Pulled By: riversand963

fbshipit-source-id: 048c7743aa4be2623ccd0cc3e61c0027e604e78b
2020-11-11 08:00:14 -08:00
Jay Zhuang 18aee7db7e Fix a seek issue with prefix extractor and timestamp (#7644)
Summary:
During seek, prefix compare should not include timestamp.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7644

Test Plan: added unittest

Reviewed By: riversand963

Differential Revision: D24772066

Pulled By: jay-zhuang

fbshipit-source-id: 3982655a8bf8da256a738e8497b73b3d9bdac92e
2020-11-10 14:53:13 -08:00
Cheng Chang 1e40696dd1 Track WAL in MANIFEST: LogAndApply WAL events to MANIFEST (#7601)
Summary:
When a WAL is synced, an edit is written to MANIFEST.
After flushing memtables, the obsoleted WALs are piggybacked to MANIFEST while writing the new L0 files to MANIFEST.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7601

Test Plan:
`track_and_verify_wals_in_manifest` is enabled by default for all tests extending `DBBasicTest`, and in db_stress_test.
Unit test `wal_edit_test`, `version_edit_test`, and `version_set_test` are also updated.
Watch all tests to pass.

Reviewed By: ltamasi

Differential Revision: D24553957

Pulled By: cheng-chang

fbshipit-source-id: 66a569ff1bdced38e22900bd240b73113906e040
2020-11-06 17:22:36 -08:00
Jay Zhuang 881e0dcc09 Fix MultiGet unable to query timestamp data issue (#7589)
Summary:
The filter query key should not contain timestamp. The timestamp is
stripped for Get(), but not MultiGet().

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7589

Reviewed By: riversand963

Differential Revision: D24494661

Pulled By: jay-zhuang

fbshipit-source-id: fc5ff40f9d683a89a760c6ff0ab3aed05a70c317
2020-11-03 09:45:41 -08:00
Ramkumar Vadivelu 9a690a74e1 In ParseInternalKey(), include corrupt key info in Status (#7515)
Summary:
Fixes Issue https://github.com/facebook/rocksdb/issues/7497

When allow_data_in_errors db_options is set, log error key details in `ParseInternalKey()`

Have fixed most of the calls. Have few TODOs still pending - because have to make more deeper changes to pass in the allow_data_in_errors flag. Will do those in a separate PR later.

Tests:
- make check
- some of the existing tests that exercise the "internal key too small" condition are: dbformat_test, cuckoo_table_builder_test
- some of the existing tests that exercise the corrupted key path are: corruption_test, merge_helper_test, compaction_iterator_test

Example of new status returns:
- Key too small - `Corrupted Key: Internal Key too small. Size=5`
- Corrupt key with allow_data_in_errors option set to false: `Corrupted Key: '<redacted>' seq:3, type:3`
- Corrupt key with allow_data_in_errors option set to true: `Corrupted Key: '61' seq:3, type:3`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7515

Reviewed By: ajkr

Differential Revision: D24240264

Pulled By: ramvadiv

fbshipit-source-id: bc48f5d4475ac19d7713e16df37505b31aac42e7
2020-10-28 10:12:58 -07:00
Yanqin Jin 6134ce6444 Perform post-flush updates of memtable list in a callback (#6069)
Summary:
Currently, the following interleaving of events can lead to SuperVersion containing both immutable memtables as well as the resulting L0. This can cause Get to return incorrect result if there are merge operands. This may also affect other operations such as single deletes.

```
  time  main_thr  bg_flush_thr  bg_compact_thr  compact_thr  set_opts_thr
0  |                                                         WriteManifest:0
1  |                                           issue compact
2  |                                 wait
3  |   Merge(counter)
4  |   issue flush
5  |                   wait
6  |                                                         WriteManifest:1
7  |                                 wake up
8  |                                 write manifest
9  |                  wake up
10 |  Get(counter)
11 |                  remove imm
   V
```

The reason behind is that: one bg flush thread's installing new `Version` can be batched and performed by another thread that is the "leader" MANIFEST writer. This bg thread removes the memtables from current super version only after `LogAndApply` returns. After the leader MANIFEST writer signals (releasing mutex) this bg flush thread, it is possible that another thread sees this cf with both memtables (whose data have been flushed to the newest L0) and the L0 before this bg flush thread removes the memtables.

To address this issue, each bg flush thread can pass a callback function to `LogAndApply`. The callback is responsible for removing the memtables. Therefore, the leader MANIFEST writer can call this callback and remove the memtables before releasing the mutex.

Test plan (devserver)
```
$make merge_test
$./merge_test --gtest_filter=MergeTest.MergeWithCompactionAndFlush
$make check
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6069

Reviewed By: cheng-chang

Differential Revision: D18790894

Pulled By: riversand963

fbshipit-source-id: e41bd600c0448b4f4b2deb3f7677f95e3076b4ed
2020-10-26 18:23:01 -07:00
Cheng Chang 1b224324b5 Track WAL in MANIFEST: persist WALs to and recover WALs from MANIFEST (#7256)
Summary:
This PR makes it able to `LogAndApply` `VersionEdit`s related to WALs, and also be able to `Recover` from MANIFEST with WAL related `VersionEdit`s.

The `VersionEdit`s related to WAL are treated similarly as those related to column family operations, they are not applied to versions, but can be in a commit group. Mixing WAL related `VersionEdit`s with other types of edits will make logic in `ProcessManifestWrite` more complicated, so `VersionEdit`s related to WAL can either be WAL additions or deletions, like column family add and drop.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7256

Test Plan: a set of unit tests are added in `version_set_test.cc`

Reviewed By: riversand963

Differential Revision: D23123238

Pulled By: cheng-chang

fbshipit-source-id: 246be2ed4744fd03fa2738aba408aaa611d0379c
2020-10-23 22:49:51 -07:00
Levi Tamasi e8cb32ed67 Introduce BlobFileCache and add support for blob files to Get() (#7540)
Summary:
The patch adds blob file support to the `Get` API by extending `Version` so that
whenever a blob reference is read from a file, the blob is retrieved from the corresponding
blob file and passed back to the caller. (This is assuming the blob reference is valid
and the blob file is actually part of the given `Version`.) It also introduces a cache
of `BlobFileReader`s called `BlobFileCache` that enables sharing `BlobFileReader`s
between callers. `BlobFileCache` uses the same backing cache as `TableCache`, so
`max_open_files` (if specified) limits the total number of open (table + blob) files.

TODO: proactively open/cache blob files and pin the cache handles of the readers in the
metadata objects similarly to what `VersionBuilder::LoadTableHandlers` does for
table files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7540

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D24260219

Pulled By: ltamasi

fbshipit-source-id: a8a2a4f11d3d04d6082201b52184bc4d7b0857ba
2020-10-15 13:04:47 -07:00
Zhichao Cao 4146276885 Add ldb_cmd_test to ASSERT_STATUS_CHECKED list (#7499)
Summary:
Add ldb_cmd_test to ASSERT_STATUS_CHECKED list

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7499

Test Plan: pass ASSERT_STATUS_CHECKED=1 make -j48 ldb_cmd_test

Reviewed By: cheng-chang

Differential Revision: D24086203

Pulled By: zhichao-cao

fbshipit-source-id: 29592202b1d4335e566de15e7937269d98d57841
2020-10-08 00:00:48 -07:00
Akanksha Mahajan 38d0a365e3 Add Stats for MultiGet (#7366)
Summary:
Add following stats for MultiGet in Histogram to get more insight on MultiGet.
    1. Number of index and filter blocks read from file as part of MultiGet
    request per level.
    2. Number of data blocks read from file per level.
    3. Number of SST files loaded from file system per level.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7366

Reviewed By: anand1976

Differential Revision: D24127040

Pulled By: akankshamahajan15

fbshipit-source-id: e63a003056b833729b277edc0639c08fb432756b
2020-10-07 13:28:48 -07:00
Yanqin Jin 758ead5df7 Enforce status check for corruption_test (#7453)
Summary:
Enforce status check for corruption_test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7453

Test Plan:
```
ASSERT_STATUS_CHECKED=1 make corruption_test
./corruption_test
```

Reviewed By: jay-zhuang

Differential Revision: D24006862

Pulled By: riversand963

fbshipit-source-id: 664677caf4c3007a25cf565cec3d677f2dcea130
2020-10-02 22:11:00 -07:00
Zhichao Cao b7062f0b2c Status check enforcement for error_handler_fs_test (#7342)
Summary:
Added status check enforcement for error_test_fs_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7342

Test Plan: ASSERT_STATUS_CHECKED=1 make -j48 error_test_fs_test

Reviewed By: akankshamahajan15

Differential Revision: D23972231

Pulled By: zhichao-cao

fbshipit-source-id: fa41bfe440012e0c55f2c9507c1d0104e5e93f84
2020-10-02 16:41:13 -07:00
Ramkumar Vadivelu e04a50923d Change ParseInternalKey() to return Status instead of bool (#7457)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/7430

Change ParseInternalKey() to return Status instead of bool.

db_bench (seekrandom) based before/after results with value size of 100 bytes and 16 bytes can be found at (tests ran on an udb server):
https://www.dropbox.com/s/47bwamdy5ozngph/PIK_ret_Status_results.xlsx?dl=0

![db_bench_results](https://user-images.githubusercontent.com/62277872/94642825-2a21a800-029a-11eb-88f2-124136c83fd3.png)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7457

Reviewed By: ajkr

Differential Revision: D24002433

Pulled By: ramvadiv

fbshipit-source-id: ac253ecf577a29044c47c3fe254a01e71404c44c
2020-09-30 19:16:47 -07:00
sdong 5f33436285 Revert an uncessary status code check skipping (#7458)
Summary:
https://github.com/facebook/rocksdb/pull/7452 added an uncessary skip for status code checking. Revert it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7458

Test Plan: Watch CI to finish

Reviewed By: jay-zhuang

Differential Revision: D23994390

fbshipit-source-id: a2b50a6326d8073db3386bff3d32acc5a6666e9b
2020-09-30 11:38:39 -07:00
sdong d08a9005b7 Make db_basic_test pass assert status checked (#7452)
Summary:
Add db_basic_test status check list. Some of the warnings are suppressed. It is possible that some of them are due to real bugs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7452

Test Plan: See CI tests pass.

Reviewed By: zhichao-cao

Differential Revision: D23979764

fbshipit-source-id: 6151570c2a9b931b0fbb3fe939a94b2bd1583cbe
2020-09-29 09:49:04 -07:00
Akanksha Mahajan b175eceb09 Store FSWritableFilePtr object in WritableFileWriter (#7193)
Summary:
Replace FSWritableFile pointer with FSWritableFilePtr
    object in WritableFileWriter.
    This new object wraps FSWritableFile pointer.

    Objective: If tracing is enabled, FSWritableFile Ptr returns
    FSWritableFileTracingWrapper pointer that includes all necessary
    information in IORecord and calls underlying FileSystem and invokes
    IOTracer to dump that record in a binary file. If tracing is disabled
    then, underlying FileSystem pointer is returned directly.
    FSWritableFilePtr wrapper class is added to bypass the
    FSWritableFileWrapper when
    tracing is disabled.

    Test Plan: make check -j64

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7193

Reviewed By: anand1976

Differential Revision: D23355915

Pulled By: akankshamahajan15

fbshipit-source-id: e62a27a13c1fd77e36a6dbafc7006d969bed25cf
2020-09-08 10:56:08 -07:00
Andrew Kryczka 177f8bd063 Bound L0->Lbase fanout in dynamic leveled compaction (#7325)
Summary:
L0 score is based on size target and number of files. The size target
used is `max_bytes_for_level_base`. However, the base level's size can
dynamically expand in write burst mode. In fact, it can expand so much
that L0->Lbase becomes the highest fanout in target sizes. This doesn't
make sense from an efficiency perspective, so this PR bounds the
L0->Lbase fanout to the smoothed level multiplier. The L0 scoring based
on file count remains unchanged.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7325

Test Plan:
contrived benchmark that exhibits the problem:

```
$ TEST_TMPDIR=/data/users/andrewkr/ ./db_bench -benchmarks=filluniquerandom,readrandom -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -level0_file_num_compaction_trigger=4 -level_compaction_dynamic_level_bytes=true -compression_type=none -max_background_jobs=12 -rate_limiter_bytes_per_sec=104857600 -benchmark_write_rate_limit=10485760 -num=100000000
```

Results:

- "Burst W-Amp" is the write-amp near the end of the fillrandom benchmark
- "Total W-Amp" is the write-amp after readrandom has run a while and all levels no longer need compaction

Branch | Burst W-Amp | Total W-Amp | fillrandom (MB/s)
-- | -- | -- | --
master | 20.2 | 21.5 | 4.7
dynamic-l0-score | 12.6 | 14.1 | 7.2

Reviewed By: siying

Differential Revision: D23412935

Pulled By: ajkr

fbshipit-source-id: f91f2067188e432dd39deab02f1c56f195057a0e
2020-09-01 19:34:01 -07:00
Akanksha Mahajan 8e0df9050c Store FSRandomAccessPtr object in RandomAccessFileReader (#7192)
Summary:
Replace FSRandomAccessFile pointer with FSRandomAccessFilePtr
    object in RandomAccessFileReader.
    This new object wraps FSRandomAccessFile pointer.

    Objective: If tracing is enabled, FSRandomAccessFile Ptr returns
    FSRandomAccessFileTracingWrapper pointer that includes all necessary
    information in IORecord and calls underlying FileSystem and invokes
    IOTracer to dump that record in a binary file. If tracing is disabled
    then, underlying FileSystem pointer is returned directly.
    FSRandomAccessFilePtr wrapper class is added to bypass the FSRandomAccessFileWrapper when
    tracing is disabled.

    Test Plan: make check -j64

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7192

Reviewed By: anand1976

Differential Revision: D23356867

Pulled By: akankshamahajan15

fbshipit-source-id: 48f31168166a17a7444b40be44a9a9d4a5c7182c
2020-08-27 11:21:52 -07:00
mrambacher b7e1c5213f Add some simulator cache and block tracer tests to ASSERT_STATUS_CHECKED (#7305)
Summary:
More tests now pass.  When in doubt, I added a TODO comment to check what should happen with an ignored error.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7305

Reviewed By: akankshamahajan15

Differential Revision: D23301262

Pulled By: ajkr

fbshipit-source-id: 5f120edc7393560aefc0633250277bbc7e8de9e6
2020-08-24 16:43:31 -07:00
mrambacher e9befdebbf Add EnvTestWithParam::OptionsTest to the ASSERT_STATUS_CHECKED passes (#7283)
Summary:
This test uses database functionality and required more extensive work to get it to pass than the other tests.  The DB functionality required for this test now passes the check.

When it was unclear what the proper behavior was for unchecked status codes, a TODO was added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7283

Reviewed By: akankshamahajan15

Differential Revision: D23251497

Pulled By: ajkr

fbshipit-source-id: 52b79629bdafa0a58de8ead1d1d66f141b331523
2020-08-20 19:18:35 -07:00
Akanksha Mahajan cc24ac14eb Store FSSequentialFilePtr object in SequenceFileReader (#7190)
Summary:
This diff contains following changes:
    1. Replace `FSSequentialFile` pointer with `FSSequentialFilePtr` object that wraps `FSSequentialFile` pointer in `SequenceFileReader`.

Objective: If tracing is enabled, `FSSequentialFilePtr` returns `FSSequentialFileTracingWrapper` pointer that includes all necessary information in `IORecord` and calls underlying FileSystem and invokes `IOTracer` to dump that record in a binary file. If tracing is disabled then, underlying `FileSystem` pointer is returned directly. `FSSequentialFilePtr` wrapper class is added to bypass the `FSSequentialFileTracingWrapper` when tracing is disabled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7190

Test Plan:
make check -j64
          COMPILE_WITH_TSAN=1 make check -j64

Reviewed By: anand1976

Differential Revision: D23059616

Pulled By: akankshamahajan15

fbshipit-source-id: 1564b94dd1297cd0fbfe2ed5c9cc3e20f7395301
2020-08-18 16:20:54 -07:00
Akanksha Mahajan 1f9f630b27 Store FileSystemPtr object that contains FileSystem ptr (#7180)
Summary:
As part of the IOTracing project, this PR
    1. Caches "FileSystemPtr" object(wrapper class that returns file system pointer based on tracing enabled) instead of "FileSystem" pointer.
    2. FileSystemPtr object is created using FileSystem pointer and IOTracer
    pointer.
    3. IOTracer shared_ptr is created in DBImpl and it is passed to different classes through constructor.
    4. When tracing is enabled through DB::StartIOTrace, FileSystemPtr
    returns FileSystemTracingWrapper pointer for tracing purpose and when
    it is disabled underlying FileSystem pointer is returned.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7180

Test Plan:
make check -j64
                COMPILE_WITH_TSAN=1 make check -j64

Reviewed By: anand1976

Differential Revision: D22987117

Pulled By: akankshamahajan15

fbshipit-source-id: 6073617e4c2d5bc363914f3a1f55ae3b0a58fbf1
2020-08-12 17:31:23 -07:00
Levi Tamasi a99fb67233 Remove redundant consistency check from VersionStorageInfo::AddFile (#7237)
Summary:
`VersionStorageInfo::AddFile` currently has a debug-mode consistency
check to make sure the newly added file does not overlap with the
previous one (for levels below L0). Considering that
`VersionBuilder::CheckConsistency` also performs similar checks (in
fact, those checks are more comprehensive and cover L0 as well), this
check is redundant. The patch removes it and also cleans up `AddFile` a
little.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7237

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D23041937

Pulled By: ltamasi

fbshipit-source-id: e00665f3b83bfd17f86c54c238800f3d77d739bd
2020-08-11 09:23:17 -07:00
Cheng Chang cd48ecaa1a Define WAL related classes to be used in VersionEdit and VersionSet (#7164)
Summary:
`WalAddition`, `WalDeletion` are defined in `wal_version.h` and used in `VersionEdit`.
`WalAddition` is used to represent events of creating a new WAL (no size, just log number), or closing a WAL (with size).
`WalDeletion` is used to represent events of deleting or archiving a WAL, it means the WAL is no longer alive (won't be replayed during recovery).

`WalSet` is the set of alive WALs kept in `VersionSet`.

1. Why use `WalDeletion` instead of relying on `MinLogNumber` to identify outdated WALs

On recovery, we can compute `MinLogNumber()` based on the log numbers kept in MANIFEST, any log with number < MinLogNumber can be ignored. So it seems that we don't need to persist `WalDeletion` to MANIFEST, since we can ignore the WALs based on MinLogNumber.

But the `MinLogNumber()` is actually a lower bound, it does not exactly mean that logs starting from MinLogNumber must exist. This is because in a corner case, when a column family is empty and never flushed, its log number is set to the largest log number, but not persisted in MANIFEST. So let's say there are 2 column families, when creating the DB, the first WAL has log number 1, so it's persisted to MANIFEST for both column families. Then CF 0 is empty and never flushed, CF 1 is updated and flushed, so a new WAL with log number 2 is created and persisted to MANIFEST for CF 1. But CF 0's log number in MANIFEST is still 1. So on recovery, MinLogNumber is 1, but since log 1 only contains data for CF 1, and CF 1 is flushed, log 1 might have already been deleted from disk.

We can make `MinLogNumber()` be the exactly minimum log number that must exist, by persisting the most recent log number for empty column families that are not flushed. But if there are N such column families, then every time a new WAL is created, we need to add N records to MANIFEST.

In current design, a record is persisted to MANIFEST only when WAL is created, closed, or deleted/archived, so the number of WAL related records are bounded to 3x number of WALs.

2. Why keep `WalSet` in `VersionSet` instead of applying the `VersionEdit`s to `VersionStorageInfo`

`VersionEdit`s are originally designed to track the addition and deletion of SST files. The SST files are related to column families, each column family has a list of `Version`s, and each `Version` keeps the set of active SST files in `VersionStorageInfo`.

But WALs are a concept of DB, they are not bounded to specific column families. So logically it does not make sense to store WALs in a column family's `Version`s.
Also, `Version`'s purpose is to keep reference to SST / blob files, so that they are not deleted until there is no version referencing them. But a WAL is deleted regardless of version references.
So we keep the WALs in `VersionSet`  for the purpose of writing out the DB state's snapshot when creating new MANIFESTs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7164

Test Plan:
make version_edit_test && ./version_edit_test
make wal_edit_test && ./wal_edit_test

Reviewed By: ltamasi

Differential Revision: D22677936

Pulled By: cheng-chang

fbshipit-source-id: 5a3b6890140e572ffd79eb37e6e4c3c32361a859
2020-08-05 16:34:38 -07:00
sdong 5c1a544122 Clean up InternalIterator upper bound logic a little bit (#7200)
Summary:
IteratorIterator::IsOutOfBound() and IteratorIterator::MayBeOutOfUpperBound() are two functions that related to upper bound check. It is hard for users to reason about this complexity. Consolidate the two functions into one and assign an enum as results to improve readability.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7200

Test Plan: Run all existing test. Would run crash test with atomic for a while.

Reviewed By: anand1976

Differential Revision: D22833181

fbshipit-source-id: a0c724267056adbd0476bde74650e6c7226077e6
2020-08-05 10:44:57 -07:00
Andrew Kryczka a4a4a2dabd dedup ReadOptions in iterator hierarchy (#7210)
Summary:
Previously, a `ReadOptions` object was stored in every `BlockBasedTableIterator`
and every `LevelIterator`. This redundancy consumes extra memory,
resulting in the `Arena` making more allocations, and iteration
observing worse cache performance.

This PR migrates callers of `NewInternalIterator()` and
`MakeInputIterator()` to provide a `ReadOptions` object guaranteed to
outlive the returned iterator. When the iterator's lifetime will be managed by the
user, this lifetime guarantee is achieved by storing the `ReadOptions`
value in `ArenaWrappedDBIter`. Then, sub-iterators of `NewInternalIterator()` and
`MakeInputIterator()` can hold a reference-to-const `ReadOptions`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7210

Test Plan:
- `make check` under ASAN and valgrind
- benchmark: on a DB with 2 L0 files and 3 L1+ levels, this PR reduced `Arena` allocation 4792 -> 4160 bytes.

Reviewed By: anand1976

Differential Revision: D22861323

Pulled By: ajkr

fbshipit-source-id: 54aebb3e89c872eeab0f5793b4b6e42878d093ce
2020-08-03 15:23:04 -07:00
sdong 692f6a3138 Implement NextAndGetResult() in memtable and level iterator (#7179)
Summary:
NextAndGetResult() is not implemented in memtable and is very simply implemented in level iterator. The result is that for a normal leveled iterator, performance regression will be observed for calling PrepareValue() for most iterator Next(). Mitigate the problem by implementing the function for both iterators. In level iterator, the implementation cannot be perfect as when calling file iterator's SeekToFirst() we don't have information about whether the value is prepared. Fortunately, the first key should not cause a big portion of the CPu.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7179

Test Plan: Run normal crash test for a while.

Reviewed By: anand1976

Differential Revision: D22783840

fbshipit-source-id: c19f45cdf21b756190adef97a3b66ccde3936e05
2020-07-29 09:45:21 -07:00
wenh 4924a506b9 Reduce `env_->GetChildren()` calls in DBImpl::Recover() (#7044)
Summary:
There currently exist multiple `GetChildren()` calls in `DBImpl::Recover()`, which can be expensive in cases of distributed file systems.
This pull request try to call `DBImpl::Recover()` of each necessary directory only _once_ and reuse the results in the places of repeated calls in current code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7044

Test Plan:
Run `make check` and use the default test suite. The modified code should be semantically identical to the current code. As a proof of this solution, we may optionally deploy the system onto a (real or simulated) distributed system and expect reduced latency caused by manifest fetching.

(WIP)

Reviewed By: riversand963

Differential Revision: D22419925

Pulled By: roghnin

fbshipit-source-id: d3774fbfbc246c5527101bc16747eb5c90919886
2020-07-10 13:41:08 -07:00
Peter Dillinger 4b107ceb7e Improve code comments in EstimateLiveDataSize (#7072)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7072

Reviewed By: ajkr

Differential Revision: D22391641

Pulled By: pdillinger

fbshipit-source-id: 0ef355576454514263ab684eb1a5c06787f3242a
2020-07-06 16:17:02 -07:00
Yanqin Jin d47c871190 Fix data race to VersionSet::io_status_ (#7034)
Summary:
After https://github.com/facebook/rocksdb/issues/6949 , VersionSet::io_status_ can be concurrently accessed by multiple
threads without lock, causing tsan test to fail. For example, a bg flush thread
resets io_status_ before calling LogAndApply(), while another thread already in
the process of LogAndApply() reads io_status_. This is a bug.

We do not have to reset io_status_ each time we call LogAndApply(). io_status_
is part of the state of VersionSet, and it indicates the outcome of preceding
MANIFEST/CURRENT files IO operations. Its value should be updated only when:

1. MANIFEST/CURRENT files IO fail for the first time.
2. MANIFEST/CURRENT files IO succeed as part of recovering from a prior
   failure without process restart, e.g. calling Resume().

Test Plan (devserver):
COMPILE_WITH_TSAN=1 make check
COMPILE_WITH_TSAN=1 make db_test2
./db_test2 --gtest_filter=DBTest2.CompactionStall
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7034

Reviewed By: zhichao-cao

Differential Revision: D22247137

Pulled By: riversand963

fbshipit-source-id: 77b83e05390f3ee3cd2d96d3fdd6fe4f225e3216
2020-06-27 08:57:31 -07:00
Yanqin Jin e66199d848 First step towards handling MANIFEST write error (#6949)
Summary:
This PR provides preliminary support for handling IO error during MANIFEST write.
File write/sync is not guaranteed to be atomic. If we encounter an IOError while writing/syncing to the MANIFEST file, we cannot be sure about the state of the MANIFEST file. The version edits may or may not have reached the file. During cleanup, if we delete the newly-generated SST files referenced by the pending version edit(s), but the version edit(s) actually are persistent in the MANIFEST, then next recovery attempt will process the version edits(s) and then fail since the SST files have already been deleted.
One approach is to truncate the MANIFEST after write/sync error, so that it is safe to delete the SST files. However, file truncation may not be supported on certain file systems. Therefore, we take the following approach.
If an IOError is detected during MANIFEST write/sync, we disable file deletions for the faulty database. Depending on whether the IOError is retryable (set by underlying file system), either RocksDB or application can call `DB::Resume()`, or simply shutdown and restart. During `Resume()`, RocksDB will try to switch to a new MANIFEST and write all existing in-memory version storage in the new file. If this succeeds, then RocksDB may proceed. If all recovery is completed, then file deletions will be re-enabled.
Note that multiple threads can call `LogAndApply()` at the same time, though only one of them will be going through the process MANIFEST write, possibly batching the version edits of other threads. When the leading MANIFEST writer finishes, all of the MANIFEST writing threads in this batch will have the same IOError. They will all call `ErrorHandler::SetBGError()` in which file deletion will be disabled.

Possible future directions:
- Add an `ErrorContext` structure so that it is easier to pass more info to `ErrorHandler`. Currently, as in this example, a new `BackgroundErrorReason` has to be added.

Test plan (dev server):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6949

Reviewed By: anand1976

Differential Revision: D22026020

Pulled By: riversand963

fbshipit-source-id: f3c68a2ef45d9b505d0d625c7c5e0c88495b91c8
2020-06-24 19:07:08 -07:00
Yanqin Jin 569b87e8c7 Fail recovery when MANIFEST record checksum mismatch (#6996)
Summary:
https://github.com/facebook/rocksdb/issues/5411 refactored `VersionSet::Recover` but introduced a bug, explained as follows.
Before, once a checksum mismatch happens, `reporter` will set `s` to be non-ok. Therefore, Recover will stop processing the MANIFEST any further.
```
// Correct
// Inside Recover
LogReporter reporter;
reporter.status = &s;
log::Reader reader(..., reporter);
while (reader.ReadRecord() && s.ok()) {
...
}
```
The bug is that, the local variable `s` in `ReadAndRecover` won't be updated by `reporter` while reading the MANIFEST. It is possible that the reader sees a checksum mismatch in a record, but `ReadRecord` retries internally read and finds the next valid record. The mismatched record will be ignored and no error is reported.
```
// Incorrect
// Inside Recover
LogReporter reporter;
reporter.status = &s;
log::Reader reader(..., reporter);
s = ReadAndRecover(reader, ...);

// Inside ReadAndRecover
  Status s;  // Shadows the s in Recover.
  while (reader.ReadRecord() && s.ok()) {
   ...
  }
```
`LogReporter` can use a separate `log_read_status` to track the errors while reading the MANIFEST. RocksDB can process more MANIFEST entries only if `log_read_status.ok()`.
Test plan (devserver):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6996

Reviewed By: ajkr

Differential Revision: D22105746

Pulled By: riversand963

fbshipit-source-id: b22f717a423457a41ca152a242abbb64cf91fc38
2020-06-18 10:09:12 -07:00
Andrew Kryczka 02db03af8d make L0 index/filter pinned memory usage predictable (#6911)
Summary:
Memory pinned by `pin_l0_filter_and_index_blocks_in_cache` needs to be predictable based on user config. This PR makes sure
we do not pin extra memory for large files generated by intra-L0 (see https://github.com/facebook/rocksdb/issues/6889).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6911

Test Plan: unit test

Reviewed By: siying

Differential Revision: D21835818

Pulled By: ajkr

fbshipit-source-id: a11a088549d06bed8aacc2548d266e5983f0ead4
2020-06-09 16:51:23 -07:00
anand76 1fb3593f25 Fix a bug in looking up duplicate keys with MultiGet (#6953)
Summary:
When MultiGet is called with duplicate keys, and the key matches the
largest key in an SST file and the value type is merge, only the first
instance of the duplicate key is returned with correct results. This is
due to the incorrect assumption that if a key in a batch is equal to the
largest key in the file, the next key cannot be present in that file.

Tests:
Add a new unit test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6953

Reviewed By: cheng-chang

Differential Revision: D21935898

Pulled By: anand1976

fbshipit-source-id: a2cc327a15150e23fd997546ca64d1c33021cb4c
2020-06-08 16:11:21 -07:00
Hao Chen ffe08ffcc2 correct level information in version_set.cc (#6920)
Summary:
fix these two issues https://github.com/facebook/rocksdb/issues/6912  and https://github.com/facebook/rocksdb/issues/6667
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6920

Reviewed By: cheng-chang

Differential Revision: D21864885

Pulled By: ajkr

fbshipit-source-id: 10e21fc1851b67a59d44358f59c64fa5523bd263
2020-06-03 12:27:13 -07:00
Peter Dillinger 14eca6bf04 For ApproximateSizes, pro-rate table metadata size over data blocks (#6784)
Summary:
The implementation of GetApproximateSizes was inconsistent in
its treatment of the size of non-data blocks of SST files, sometimes
including and sometimes now. This was at its worst with large portion
of table file used by filters and querying a small range that crossed
a table boundary: the size estimate would include large filter size.

It's conceivable that someone might want only to know the size in terms
of data blocks, but I believe that's unlikely enough to ignore for now.
Similarly, there's no evidence the internal function AppoximateOffsetOf
is used for anything other than a one-sided ApproximateSize, so I intend
to refactor to remove redundancy in a follow-up commit.

So to fix this, GetApproximateSizes (and implementation details
ApproximateSize and ApproximateOffsetOf) now consistently include in
their returned sizes a portion of table file metadata (incl filters
and indexes) based on the size portion of the data blocks in range. In
other words, if a key range covers data blocks that are X% by size of all
the table's data blocks, returned approximate size is X% of the total
file size. It would technically be more accurate to attribute metadata
based on number of keys, but that's not computationally efficient with
data available and rarely a meaningful difference.

Also includes miscellaneous comment improvements / clarifications.

Also included is a new approximatesizerandom benchmark for db_bench.
No significant performance difference seen with this change, whether ~700 ops/sec with cache_index_and_filter_blocks and small cache or ~150k ops/sec without cache_index_and_filter_blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6784

Test Plan:
Test added to DBTest.ApproximateSizesFilesWithErrorMargin.
Old code running new test...

    [ RUN      ] DBTest.ApproximateSizesFilesWithErrorMargin
    db/db_test.cc:1562: Failure
    Expected: (size) <= (11 * 100), actual: 9478 vs 1100

Other tests updated to reflect consistent accounting of metadata.

Reviewed By: siying

Differential Revision: D21334706

Pulled By: pdillinger

fbshipit-source-id: 6f86870e45213334fedbe9c73b4ebb1d8d611185
2020-06-02 12:30:23 -07:00
Levi Tamasi e3f953a863 Make it possible to look up files by number in VersionStorageInfo (#6862)
Summary:
Does what it says on the can: the patch adds a hash map to `VersionStorageInfo`
that maps file numbers to file locations, i.e. (level, position in level) pairs. This
will enable stricter consistency checks in `VersionBuilder`. The patch also fixes
all the unit tests that used duplicate file numbers in a version (which would trigger
an assertion with the new code).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6862

Test Plan:
`make check`
`make whitebox_crash_test`

Reviewed By: riversand963

Differential Revision: D21670446

Pulled By: ltamasi

fbshipit-source-id: 2eac249945cf33d8fb8597b26bfff5221e1a861a
2020-05-28 10:03:06 -07:00
Akanksha Mahajan bcefc59e9f Allow MultiGet users to limit cumulative value size (#6826)
Summary:
1. Add a value_size in read options which limits the cumulative value size of keys read in batches. Once the size exceeds read_options.value_size, all the remaining keys are returned with status Abort without further fetching any key.
2. Add a unit test case MultiGetBatchedValueSizeSimple the reads keys from memory and sst files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6826

Test Plan:
1. make check -j64
	   2. Add a new unit test case

Reviewed By: anand1976

Differential Revision: D21471483

Pulled By: akankshamahajan15

fbshipit-source-id: dea51b8e76d5d1df38ece8cdb29933b1d798b900
2020-05-27 13:07:14 -07:00
sdong a50ea71c00 Improve ldb consistency checks (#6802)
Summary:
When using ldb, users cannot turn on force consistency check in most commands, while they cannot use checksonsistnecy with --try_load_options. The change fixes both by:
1. checkconsistency now calls OpenDB() so that it gets all the options loading and sanitized options logic
2. use options.check_consistency_checks = true by default, and add a --disable_consistency_checks to turn it off.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6802

Test Plan: Add a new unit test. Some manual tests with corrupted DBs.

Reviewed By: pdillinger

Differential Revision: D21388051

fbshipit-source-id: 8d122732d391b426e3982a1c3232a8e3763ffad0
2020-05-08 14:17:47 -07:00
anand76 94265234de Fix race due to delete triggered compaction in Universal compaction mode (#6799)
Summary:
Delete triggered compaction in universal compaction mode was causing a corruption when scheduled in parallel with other compactions.
1. When num_levels = 1, a file marked for compaction may be picked along with all older files in L0, without checking if any of them are already being compaction. This can cause unpredictable results like resurrection of older versions of keys or deleted keys.
2. When num_levels > 1, a delete triggered compaction would not get scheduled if it overlaps with a running regular compaction. However, the reverse is not true. This is due to the fact that in ```UniversalCompactionBuilder::CalculateSortedRuns```, it assumes that entire sorted runs are picked for compaction and only checks the first file in a sorted run to determine conflicts. This is violated by a delete triggered compaction as it works on a subset of a sorted run.

Fix the bug for num_levels > 1, and disable the feature for now when num_levels = 1. After disabling this feature, files would still get marked for compaction, but no compaction would get scheduled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6799

Reviewed By: pdillinger

Differential Revision: D21431286

Pulled By: anand1976

fbshipit-source-id: ae9f0bdb1d6ae2f10284847db731c23f43af164a
2020-05-07 17:32:17 -07:00
Yanqin Jin 5584595f80 Do not swallow error returned from SaveTo() (#6801)
Summary:
With consistency check enabled, VersionBuilder::SaveTo() may return error once
corruption is detected while building versions. We should handle these errors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6801

Test Plan: make check

Reviewed By: siying

Differential Revision: D21385045

Pulled By: riversand963

fbshipit-source-id: 98f6424e2a4699b62befa21e9fe00e70a771118e
2020-05-05 10:46:20 -07:00
Levi Tamasi a00ddf1574 Expose the set of live blob files from Version/VersionSet (#6785)
Summary:
The patch adds logic that returns the set of live blob files from
`Version::AddLiveFiles` and `VersionSet::AddLiveFiles` (in addition to
live table files), and also cleans up the code a bit, for example, by
exposing only the numbers of table files as opposed to the earlier
`FileDescriptor`s that no clients used. Moreover, the patch extends
the `GetLiveFiles` API so that it also exposes blob files in the current version.
Similarly to https://github.com/facebook/rocksdb/pull/6755,
this is a building block for identifying and purging obsolete blob files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6785

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D21336210

Pulled By: ltamasi

fbshipit-source-id: fc1aede8a49eacd03caafbc5f6f9ce43b6270821
2020-05-04 15:08:13 -07:00
sdong 680c416348 Avoid Swallowing Some File Consistency Checking Bugs (#6793)
Summary:
We are swallowing some file consistency checking failures. This is not expected. We are fixing two cases: DB reopen and manifest dump.
More places are not fixed and need follow-up.

Error from CheckConsistencyForDeletes() is also swallowed, which is not fixed in this PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6793

Test Plan: Add a unit test to cover the reopen case.

Reviewed By: riversand963

Differential Revision: D21366525

fbshipit-source-id: eb438a322237814e8d5125f916a3c6de97f39ded
2020-05-04 14:18:11 -07:00
Levi Tamasi fe238e5438 Keep track of obsolete blob files in VersionSet (#6755)
Summary:
The patch adds logic to keep track of obsolete blob files. A blob file becomes
obsolete when the last `shared_ptr` that points to the corresponding
`SharedBlobFileMetaData` object goes away, which, in turn, happens when the
last `Version` that contains the blob file is destroyed. No longer needed blob
files are added to the obsolete list in `VersionSet` using a custom deleter to
avoid unnecessary coupling between `SharedBlobFileMetaData` and `VersionSet`.
Obsolete blob files are returned by `VersionSet::GetObsoleteFiles` and stored
in `JobContext`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6755

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D21233155

Pulled By: ltamasi

fbshipit-source-id: 47757e06fdc0127f27ed57f51abd27893d9a7b7a
2020-04-30 11:25:51 -07:00
Mike Kolupaev e45673dece Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.

Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.

It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.

Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621

Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.

Reviewed By: siying

Differential Revision: D20786930

Pulled By: al13n321

fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
2020-04-15 17:40:44 -07:00