Commit graph

806 commits

Author SHA1 Message Date
Peter Dillinger ecc8ffe17b Update master to version 6.13 (#7378)
Summary:
for release fork

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7378

Test Plan: make check + CI

Reviewed By: jay-zhuang

Differential Revision: D23669163

Pulled By: pdillinger

fbshipit-source-id: 14cbf95b32717c28418c71cc8e10f06733bbc49f
2020-09-12 13:18:09 -07:00
Yanqin Jin 205e577694 Cancel tombstone skipping during bottommost compaction (#7356)
Summary:
During bottommost compaction, RocksDB cannot simply drop a tombstone if
this tombstone is not in the earliest snapshot. The current behavior is: RocksDB
skips other internal keys (of the same user key) in the same snapshot range. In
the meantime, RocksDB should check for the `shutting_down` flag. Otherwise, it
is possible for a bottommost compaction that has already started running to take
a long time to finish, even if the application has tried to cancel all background jobs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7356

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D23663241

Pulled By: riversand963

fbshipit-source-id: 25f8e9b51bc3bfa3353cdf87557800f9d90ee0b5
2020-09-11 17:45:43 -07:00
Peter Dillinger 92639b93a6 Fix checkpoint file deletion race with avoid_unnecessary_blocking_io (#7369)
Summary:
https://github.com/facebook/rocksdb/issues/3341 guaranteed that upon return of `GetSortedWalFiles` after
`DisableFileDeletions`, all pending purges of previously obsolete WAL
files will have finished. However, the addition of
avoid_unnecessary_blocking_io in https://github.com/facebook/rocksdb/issues/5043 opened a hole in the code making
that assurance, which can lead to files to be copied for checkpoint or
backup going missing before being copied, with that option enabled.

This change patches the hole.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7369

Test Plan:
apparent fix to backups in crash test observed. Will work
on a unit test for another commit

Reviewed By: ajkr

Differential Revision: D23620258

Pulled By: pdillinger

fbshipit-source-id: bea36b461a5b719c3e3ef802f967bc3e8ae71614
2020-09-10 22:35:25 -07:00
Yanqin Jin 8307d4400c Update HISTORY.md for PR7329 (#7355)
Summary:
As title.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7355

Reviewed By: pdillinger

Differential Revision: D23566635

Pulled By: riversand963

fbshipit-source-id: f8d846bcff637e7617b764b7bfb9a948ea18d195
2020-09-08 11:10:25 -07:00
Andrew Kryczka 5746767387 add ldb unsafe_remove_sst_file subcommand (#7335)
Summary:
This is adapted from https://github.com/facebook/rocksdb/issues/6678 but takes a different approach, avoiding opening a read-write DB and avoiding the `DeleteFile()` API.

First, this PR refactors how options variables are initialized in `ldb` so it can be reused in a subcommand that doesn't open a DB:

- Separated remaining option initialization logic out of `OpenDB()`. The new `PrepareOptions()` function initializes the full options state.
- Fixed an old TODO about applying the subcommand CF option overrides to the proper `ColumnFamilyOptions` object.

Second, this PR adds the `ldb unsafe_remove_sst_file` subcommand. It uses the `VersionSet`-level APIs to remove the file with the specified number.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7335

Test Plan: played with interactive python and this file removal command. Verified openability/correct results in case of multiple column families, multiple levels, etc.

Reviewed By: pdillinger

Differential Revision: D23454575

Pulled By: ajkr

fbshipit-source-id: 039b7a8cbfc42fd123dcb25821eef51d61148afe
2020-09-03 16:54:51 -07:00
Andrew Kryczka 40e97b02be add warning on DeleteFile() API (#7337)
Summary:
Since we can't land https://github.com/facebook/rocksdb/issues/7336 until the next major release, added a strong warning against the `DeleteFile()` API in the meantime.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7337

Reviewed By: pdillinger

Differential Revision: D23459728

Pulled By: ajkr

fbshipit-source-id: 326cb9b18190386080c35c761a8736d8a877dafb
2020-09-03 16:42:01 -07:00
Andrew Kryczka af54c4092a fix SstFileWriter with dictionary compression (#7323)
Summary:
In block-based table builder, the cut-over from buffered to unbuffered
mode involves sampling the buffered blocks and generating a dictionary.
There was a bug where `SstFileWriter` passed zero as the `target_file_size`
causing the cutover to happen immediately, so there were no samples
available for generating the dictionary.

This PR changes the meaning of `target_file_size == 0` to mean buffer
the whole file before cutting over. It also adds dictionary compression
support to `sst_dump --command=recompress` for easy evaluation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7323

Reviewed By: cheng-chang

Differential Revision: D23412158

Pulled By: ajkr

fbshipit-source-id: 3b232050e70ef3c2ee85a4b5f6fadb139c569873
2020-09-03 15:49:57 -07:00
Hiep d0c1a01c1b Avoid converting MERGES to PUTS when allow_ingest_behind is true (#7166)
Summary:
- Closes https://github.com/facebook/rocksdb/issues/6490
- Currently MERGEs are converted to PUTs at bottom or compaction has reached the beginning of the key, this can wrongly cover a PUT future base case.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7166

Test Plan:
- Automated: `make all check`
- Manual: With `allow_ingest_behind = true`, add Merge operations to a key then run compaction. Then run ingesting external files to make sure the base case is probably compacted with existing Merges.

Reviewed By: cheng-chang

Differential Revision: D23325425

Pulled By: ajkr

fbshipit-source-id: 3eb415eb7b381b5453e45245393566153b1abb68
2020-09-03 14:39:58 -07:00
Andrew Kryczka 177f8bd063 Bound L0->Lbase fanout in dynamic leveled compaction (#7325)
Summary:
L0 score is based on size target and number of files. The size target
used is `max_bytes_for_level_base`. However, the base level's size can
dynamically expand in write burst mode. In fact, it can expand so much
that L0->Lbase becomes the highest fanout in target sizes. This doesn't
make sense from an efficiency perspective, so this PR bounds the
L0->Lbase fanout to the smoothed level multiplier. The L0 scoring based
on file count remains unchanged.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7325

Test Plan:
contrived benchmark that exhibits the problem:

```
$ TEST_TMPDIR=/data/users/andrewkr/ ./db_bench -benchmarks=filluniquerandom,readrandom -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -level0_file_num_compaction_trigger=4 -level_compaction_dynamic_level_bytes=true -compression_type=none -max_background_jobs=12 -rate_limiter_bytes_per_sec=104857600 -benchmark_write_rate_limit=10485760 -num=100000000
```

Results:

- "Burst W-Amp" is the write-amp near the end of the fillrandom benchmark
- "Total W-Amp" is the write-amp after readrandom has run a while and all levels no longer need compaction

Branch | Burst W-Amp | Total W-Amp | fillrandom (MB/s)
-- | -- | -- | --
master | 20.2 | 21.5 | 4.7
dynamic-l0-score | 12.6 | 14.1 | 7.2

Reviewed By: siying

Differential Revision: D23412935

Pulled By: ajkr

fbshipit-source-id: f91f2067188e432dd39deab02f1c56f195057a0e
2020-09-01 19:34:01 -07:00
Akanksha Mahajan 963314ffd6 Add unit test for max_write_buffer_size_to_maintain (#7311)
Summary:
Add a unit test case to check memory usage when
max_write_buffer_size_to_maintain is set if flushed immutable memtables are
trimmed timely or not.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7311

Test Plan: Compared the results with before bug fix.

Reviewed By: ltamasi

Differential Revision: D23321702

Pulled By: akankshamahajan15

fbshipit-source-id: da04ee21137d641a07fd499a9e2749eb036fcb1e
2020-08-28 17:38:05 -07:00
Jay Zhuang c2485f2d81 Add buffer prefetch support for non directIO usecase (#7312)
Summary:
A new file interface `SupportPrefetch()` is added. When the user overrides it to `false`, an internal prefetch buffer will be used for readahead. Useful for non-directIO but FS doesn't have readahead support.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7312

Reviewed By: anand1976

Differential Revision: D23329847

Pulled By: jay-zhuang

fbshipit-source-id: 71cd4ce6f4a820840294e4e6aec111ab76175527
2020-08-27 18:16:53 -07:00
Peter Dillinger 9aad24da55 Real fix for race in backup custom checksum checking (#7309)
Summary:
This is a "real" fix for the issue worked around in https://github.com/facebook/rocksdb/issues/7294.
To get DB checksum info for live files, we now read the manifest file
that will become part of the checkpoint/backup. This requires a little
extra handling in taking a custom checkpoint, including only reading the
manifest file up to the size prescribed by the checkpoint.

This moves GetFileChecksumsFromManifest from backup code to
file_checksum_helper.{h,cc} and removes apparently unnecessary checking
related to column families.

Updated HISTORY.md and warned potential future users of
DB::GetLiveFilesChecksumInfo()

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7309

Test Plan: updated unit test, before and after

Reviewed By: ajkr

Differential Revision: D23311994

Pulled By: pdillinger

fbshipit-source-id: 741e30a2dc1830e8208f7648fcc8c5f000d4e2d5
2020-08-26 10:39:20 -07:00
sdong 722814e357 Get() to fail with underlying failures in PartitionIndexReader::CacheDependencies() (#7297)
Summary:
Right now all I/O failures under PartitionIndexReader::CacheDependencies() is swallowed. This doesn't impact correctness but we've made a decision that any I/O error in read path now should be returned to users for awareness. Return errors in those cases instead.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7297

Test Plan: Add a new unit test that ingest errors in this code path and see Get() fails. Only one I/O path is hit in PartitionIndexReader::CacheDependencies(). Several option changes are attempt but not able to got other pread paths triggered. Not sure whether other failure cases would be even possible. Would rely on continuous stress test to validate it.

Reviewed By: anand1976

Differential Revision: D23257950

fbshipit-source-id: 859dbc92fa239996e1bb378329344d3d54168c03
2020-08-25 19:01:05 -07:00
Zhichao Cao d51f88c9e4 Pass SST file checksum information through OnTableFileCreated (#7108)
Summary:
When SST file is created, application is able to know the file information through OnTableFileCreated callback in LogAndNotifyTableFileCreationFinished. Since file checksum information can be useful for application when the SST file is created, we add file_checksum and file_checksum_func_name information to TableFileCreationInfo, which will be passed through OnTableFileCreated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7108

Test Plan: make check, listener_test.

Reviewed By: ajkr

Differential Revision: D22470240

Pulled By: zhichao-cao

fbshipit-source-id: 92c20344d9b986eadfe3480f3769bf4add0dbaae
2020-08-25 10:46:11 -07:00
Connor1996 416943bf28 Eliminates a no-op compaction upon snapshot release when disabling auto compactions (#7267)
Summary:
After releasing a snapshot, it checks whether it is suitable to trigger bottom compactions.
When disabling auto compactions, it may still schedule compaction when releasing a snapshot. Whereas no compaction job will be actually handled, so the state of LSM is not changed and compaction will be triggered again and again every time releasing a snapshot.

Too frequent compactions lead to high CPU usage and high db_mutex lock contention which affects foreground write duration finally.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7267

Test Plan:
- make check
- manual test

Reviewed By: akankshamahajan15

Differential Revision: D23252880

Pulled By: ajkr

fbshipit-source-id: 4431e071a35d9912a2a3592875db27bae521434b
2020-08-24 22:06:45 -07:00
Akanksha Mahajan 3844612625 Bug Fix for memtables not trimmed down. (#7296)
Summary:
When a memtable is trimmed in MemTableListVersion, the memtable
is only added to delete list if it is
the last reference. However it is not the last reference as it is held
by the super version. But the super version would not be switched if the
delete list is empty. So the memtable is never destroyed and memory
usage increases beyond write_buffer_size +
max_write_buffer_size_to_maintain.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7296

Test Plan:
1.  ./db_bench -benchmarks=randomtransaction
-optimistic_transaction_db=1 -statistics -stats_interval_seconds=1
-duration=90 -num=500000 --max_write_buffer_size_to_maintain=16000000
--transaction_set_snapshot

Reviewed By: ltamasi

Differential Revision: D23267395

Pulled By: akankshamahajan15

fbshipit-source-id: 3a8d437fe9f4015f851ff84c0e29528aa946b650
2020-08-21 13:29:05 -07:00
Peter Dillinger a1b5484811 Work around a backup bug with DB custom checksums (#7294)
Summary:
On a read-write DB configured with
DBOptions::file_checksum_gen_factory, BackupEngine::CreateNewBackup can
fail intermittently, with non-OK status. This is due to a race between
GetLiveFiles and GetLiveFilesChecksumInfo in creating backups.

For patching 6.12 release (as this commit is intended for, except this is a
forward-merged version), we can simply treat files for which we falsely failed
to get checksum info as legacy files lacking checksum info.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7294

Test Plan: unit test reproducer included

Reviewed By: ajkr

Differential Revision: D23253489

Pulled By: pdillinger

fbshipit-source-id: 9e4945dad120b776ad3e753be10b962f61f28e14
2020-08-21 08:16:04 -07:00
Andrew Kryczka 5d5ff82408 Disable recycle_log_file_num with kTolerateCorruptedTailRecords (#7271)
Summary:
The two features are naturally incompatible. WAL recycling expects the recovery to succeed upon encountering a corrupt record at the point where new data ends and recycled data remains at the tail. However, `WALRecoveryMode::kTolerateCorruptedTailRecords` must fail upon encountering any such corrupt record, as it cannot differentiate between this and a real corruption, which would cause committed updates to be truncated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7271

Reviewed By: riversand963

Differential Revision: D23169923

Pulled By: ajkr

fbshipit-source-id: 2cf8a3bcd2c9a0ecb0055a84725047a10fd4db50
2020-08-17 18:21:10 -07:00
Yanqin Jin 92593d511a Add a new EntryType for deletion with timestamp (#7195)
Summary:
Add `kEntryDeleteWithTimestamp` to `EntryType` which is a public API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7195

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D22914704

Pulled By: riversand963

fbshipit-source-id: 886f73c6b70c527cad1c8fc9fc8d3afe60e1ea39
2020-08-17 16:26:06 -07:00
sdong 1760637539 CompactRange() refit level should confirm destination level is not empty (#7261)
Summary:
There is potential data race related CompactRange() with level refitting. After the compaction step and refitting step, some automatic compaction could put data to the destination level and cause the DB to be corrupted. Fix the bug by checking the target level to be empty.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7261

Test Plan: Add a unit test, which would fail with "Corruption: L1 have overlapping ranges '666F6F' seq:6, type:1 vs. '626172' seq:2, type:1", and now it succeeds.

Reviewed By: ajkr

Differential Revision: D23142269

fbshipit-source-id: 28bc14d5ac934c192260b23a4ce3f10a95e3ee91
2020-08-17 14:21:53 -07:00
Jay Zhuang 69760b4d05 Introduce a global StatsDumpScheduler for stats dumping (#7223)
Summary:
Have a global StatsDumpScheduler for all DB instance stats dumping, including `DumpStats()` and `PersistStats()`. Before this, there're 2 dedicate threads for every DB instance, one for DumpStats() one for PersistStats(), which could create lots of threads if there're hundreds DB instances.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7223

Reviewed By: riversand963

Differential Revision: D23056737

Pulled By: jay-zhuang

fbshipit-source-id: 0faa2311142a73433ebb3317361db7cbf43faeba
2020-08-14 20:12:44 -07:00
Andrew Kryczka a1aa3f8385 Disable manual compaction during ReFitLevel() (#7250)
Summary:
Manual compaction with `CompactRangeOptions::change_levels` set could
refit to a level targeted by another manual compaction. If
force_consistency_checks were disabled, it could be possible for
overlapping files to be written at that target level.

This PR prevents the possibility by calling `DisableManualCompaction()`
prior to `ReFitLevel()`. It also improves the manual compaction disabling
mechanism to wait for pending manual compactions to complete before
returning, and support disabling from multiple threads.

Fixes https://github.com/facebook/rocksdb/issues/6432.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7250

Test Plan:
crash test command that repro'd the bug reliably:

```
$ TEST_TMPDIR=/dev/shm python tools/db_crashtest.py blackbox --simple -target_file_size_base=524288 -write_buffer_size=1048576 -clear_column_family_one_in=0 -reopen=0 -max_key=10000000 -column_families=1 -max_background_compactions=8 -compact_range_one_in=100000 -compression_type=none -compaction_style=1 -num_levels=5 -universal_min_merge_width=4 -universal_max_merge_width=8 -level0_file_num_compaction_trigger=12 -rate_limiter_bytes_per_sec=1048576000 -universal_max_size_amplification_percent=100 --duration=3600 --interval=60 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --enable_compaction_filter=0
```

Reviewed By: ltamasi

Differential Revision: D23090800

Pulled By: ajkr

fbshipit-source-id: afcbcd51b42ce76789fdb907d8b9ada790709c13
2020-08-14 11:29:52 -07:00
Zitan Chen b578ca2e4d BackupEngine supports custom file checksums (#7085)
Summary:
A new option `std::shared_ptr<FileChecksumGenFactory> backup_checksum_gen_factory` is added to `BackupableDBOptions`. This allows custom checksum functions to be used for creating, verifying, or restoring backups.

Tests are added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7085

Test Plan: Passed make check

Reviewed By: pdillinger

Differential Revision: D22390756

Pulled By: gg814

fbshipit-source-id: 3b7756ca444c2129844536b91c3ca09f53b6248f
2020-08-12 13:31:09 -07:00
sdong 41c328fe57 Fix a perf regression that caused every key to go through upper bound check (#7209)
Summary:
https://github.com/facebook/rocksdb/pull/5289 introduces a performance regression that caused an upper bound check within every BlockBasedTableIterator::Next(). This is unnecessary if we've checked the boundary key for current block and it is within upper bound.

Fix the bug. Also rename the boolean to a enum so that the code is slightly better readable. The original regression was probably to fix a bug that the block upper bound check status is not reset after a new block is created. Fix it bug so that the regression can be avoided without hitting the bug.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7209

Test Plan: Run all existing tests. Will run atomic black box crash test for a while.

Reviewed By: anand1976

Differential Revision: D22859246

fbshipit-source-id: cbdad1f5e656c55fd8b71726d5a4f6cb53ff9140
2020-08-04 11:30:09 -07:00
Yanqin Jin a38f04ac26 Update HISTORY and version for 6.12 release (#7194)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7194

Reviewed By: gg814

Differential Revision: D22810654

Pulled By: riversand963

fbshipit-source-id: 01f13089fa2b7e31b827da3e30c90e5c62c41380
2020-07-29 10:13:21 -07:00
Tomas Kolda cd4592c220 SST Partitioner interface that allows to split SST files (#6957)
Summary:
SST Partitioner interface that allows to split SST files during compactions.

It basically instruct compaction to create a new file when needed. When one is using well defined prefixes and prefixed way of defining tables it is good to define also partitioning so that promotion of some SST file does not cover huge key space on next level (worst case complete space).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6957

Reviewed By: ajkr

Differential Revision: D22461239

fbshipit-source-id: 9ce07bba08b3ba89c2d45630520368f704d1316e
2020-07-24 13:44:49 -07:00
Cheng Chang 7af1fab443 Update HISTORY (#7158)
Summary:
Mention the MultiRead bug in HISTORY.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7158

Test Plan: N/A

Reviewed By: siying

Differential Revision: D22670565

Pulled By: cheng-chang

fbshipit-source-id: 16abf0192957be66511f6a08e00157bfd37b189f
2020-07-23 08:47:13 -07:00
Jay Zhuang b0c5ecd6b3 Make max_subcompactions dynamically changeable (#7159)
Summary:
Make `max-subcompactions` dynamically changeable by passing the `DBOption` to Compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7159

Reviewed By: siying

Differential Revision: D22671238

Pulled By: jay-zhuang

fbshipit-source-id: 311ca9f6bb606965544d8708616d358cfed5be42
2020-07-22 18:32:52 -07:00
mrambacher d44cbc5314 Add hash of key/value checks when paranoid_file_checks=true (#7134)
Summary:
When paraoid_files_checks=true, a rolling key-value hash is generated and compared to what is written to the file.  If the values do not match, the SST file is rejected.

Code put in place for the check for both flush and compaction jobs.  Corresponding test added to corruption_test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7134

Reviewed By: cheng-chang

Differential Revision: D22646149

fbshipit-source-id: 8fde1984a1a11edd3bd82a413acffc5ea7aa683f
2020-07-22 11:04:40 -07:00
Haosen Wen dbc51adbac Use steady_clock instead of system_clock in FileOperationInfo::TimePoint (#7153)
Summary:
Issue https://github.com/facebook/rocksdb/issues/7133 reported that using `system_clock` in `FileOperationInfo::TimePoint` causes the duration of file flush operation (which can be a noop on MacOS in some scenarios) appears to be 0 and fail an assertion in listener_test. Using `steady_clock` supposedly fixed the problem.
`steady_clock` actually fits better into the use cases of `FileOperationInfo::TimePoint` as all usages care about durations but not wall clock time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7153

Test Plan: make check.

Reviewed By: riversand963

Differential Revision: D22654136

Pulled By: roghnin

fbshipit-source-id: 5980b1080734bdae496a18071a2c2b5887c67d85
2020-07-22 08:55:02 -07:00
Zitan Chen b923dc720b BackupEngine computes table checksums only once if db session ids are available (#7110)
Summary:
BackupEngine requires computing table checksums twice when backing up table files to the `shared_checksum` directory.

The repeated computation can be avoided by utilizing the db session id stored as a part of the table properties.

Filenames of table files in the `shared_checksum` directory depend on the following conditions:
1. the naming scheme is `kOptionalChecksumAndDbSessionId`,
2. `db_session_id` is not empty,
3. checksum is available in the DB manifest.

If 1,2,3 are satisfied, then the filenames will be of the form `<file_number>_<checksum>_<db_session_id>.sst`.
If 1,2 are satisfied, then the filenames will be of the form `<file_number>_<db_session_id>.sst`.
In all other cases, the filenames are of the form `<file_number>_<checksum>_<size>.sst`.

Additionally, if `kOptionalChecksumAndDbSessionId` is used (and not falling back to `kChecksumAndFileSize`), the `<checksum>` appeared in the filenames is hexadecimally encoded, instead of being plain `uint32_t` value.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7110

Test Plan: backupable_db_test and manual tests.

Reviewed By: ajkr

Differential Revision: D22508992

Pulled By: gg814

fbshipit-source-id: 5669f0ea9ad5a097f69f6d87aca4abba15032389
2020-07-21 10:35:40 -07:00
Andrew Kryczka 9a83fd21e6 stagger first DumpMallocStats after opening DB (#7145)
Summary:
Previously when running `db_bench` with large value for `num_multi_dbs` and enabled `Options::dump_malloc_stats`, we would see most CPU spent in jemalloc locking. After this PR that no longer shows up at the top of the profile.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7145

Reviewed By: riversand963

Differential Revision: D22593031

Pulled By: ajkr

fbshipit-source-id: 3b3fc91f93249c6afee53f59f34c487c3fc5add6
2020-07-17 16:13:26 -07:00
Zhichao Cao a10f12eda1 Auto resume the DB from Retryable IO Error (#6765)
Summary:
In current codebase, in write path, if Retryable IO Error happens, SetBGError is called. The retryable IO Error is converted to hard error and DB is in read only mode. User or application needs to resume it. In this PR, if Retryable IO Error happens in one DB, SetBGError will create a new thread to call Resume (auto resume). otpions.max_bgerror_resume_count controls if auto resume is enabled or not (if max_bgerror_resume_count<=0, auto resume will not be enabled). options.bgerror_resume_retry_interval controls the time interval to call Resume again if the previous resume fails due to the Retryable IO Error. If non-retryable error happens during resume, auto resume will terminate.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6765

Test Plan: Added the unit test cases in error_handler_fs_test and pass make asan_check

Reviewed By: anand1976

Differential Revision: D21916789

Pulled By: zhichao-cao

fbshipit-source-id: acb8b5e5dc3167adfa9425a5b7fc104f6b95cb0b
2020-07-15 11:03:58 -07:00
Yanqin Jin 27735dea9a Report corrupted keys during compaction (#7124)
Summary:
Currently, RocksDB lets compaction to go through even in case of
corrupted keys, the number of which is reported in CompactionJobStats.
However, RocksDB does not check this value. We should let compaction run
in a stricter mode.

Temporarily disable two tests that allow corrupted keys in compaction.
With this PR, the two tests will assert(false) and terminate. Still need
to investigate what is the recommended google-test way of doing it.
Death test (EXPECT_DEATH) in gtest has warnings now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7124

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D22530722

Pulled By: riversand963

fbshipit-source-id: 6a5a6a992028c6d4f92cb74693c92db462ae4ad6
2020-07-14 17:18:17 -07:00
Andrew Kryczka 82611ee25a save key comparisons in BlockIter::BinarySeek (#7068)
Summary:
This is a followup to https://github.com/facebook/rocksdb/issues/6646. In that PR, for simplicity I just appended a comparison against the 0th restart key in case `BinarySeek()`'s binary search landed at index 0. As a result there were `2/(N+1) + log_2(N)` key comparisons. This PR does it differently. Now we expand the binary search range by one so it also covers the case where target is at or before the restart key at index 0. As a result, it involves `log_2(N+1)` key comparisons.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7068

Test Plan:
ran readrandom with mostly default settings and counted key comparisons
using `PerfContext`.

before: `user_key_comparison_count = 28881965`
after: `user_key_comparison_count = 27823245`

setup command:

```
$ TEST_TMPDIR=/dev/shm/dbbench ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -level_compaction_dynamic_level_bytes=true -num=10000000
```

benchmark command:

```
$ TEST_TMPDIR=/dev/shm/dbbench/ ./db_bench -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=10000000 -compression_type=none -reads=1000000 -perf_level=3
```

Reviewed By: anand1976

Differential Revision: D22357032

Pulled By: ajkr

fbshipit-source-id: 8b01e9c1c2a4e9d02fc9dfe16c1cc0327f8bdf24
2020-07-09 12:27:20 -07:00
Akanksha Mahajan 54f171fe90 Update Flush policy in PartitionedIndexBuilder on switching from user-key to internal-key mode (#7096)
Summary:
When format_version is high enough to support user-key and
there are index entries for same user key that spans multiple data
blocks then it changes from user-key mode to internal-key mode. But the
flush policy is not reset to point to Block Builder of internal-keys.
After this switch, no entries are added to user key index partition
result, thus it never triggers flushing the block.

Fix: 1. After adding the entry in sub_builder_index_, if there is a switch
from user-key to internal-key, then flush policy is updated to point to
Block Builder of internal-keys index partition.
2. Set sub_builder_index_->seperator_is_key_plus_seq_ = true if
seperator_is_key_plus_seq_  is set to true so that subsequent partitions
can also use internal key mode.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7096

Test Plan: make check -j64

Reviewed By: ajkr

Differential Revision: D22416598

Pulled By: akankshamahajan15

fbshipit-source-id: 01fc2dc07ea1b32f8fb803995ebe6e9a3fbe67ac
2020-07-08 21:03:04 -07:00
wenh 226d1f9c73 extend listener callback functions to more file I/O operations (#7055)
Summary:
Currently, `EventListener` in listner.h only have callback functions for file read and write. One may favor extended callback functions for more file I/O operations like flush, sync and close. This PR tries to add those interface and have them called when appropriate throughout the code base.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7055

Test Plan:
Write an experimental listener with those new callback functions with log output in them; run experiments and check logs to see those functions are actually called.
Default test suits `make check` should also be included.

Reviewed By: riversand963

Differential Revision: D22380624

Pulled By: roghnin

fbshipit-source-id: 4121491d45c2c2aae8c255e7998090559a241c6a
2020-07-07 18:21:18 -07:00
Andrew Kryczka dd29ad4223 Separate internal and user key comparators in BlockIter (#6944)
Summary:
Replace `BlockIter::comparator_` and `IndexBlockIter::user_comparator_wrapper_` with a concrete `UserComparatorWrapper` and `InternalKeyComparator`. The motivation for this change was the inconvenience of not knowing the concrete type of `BlockIter::comparator_`, which prevented calling specialized internal key comparison functions to optimize comparison of keys with global seqno applied.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6944

Test Plan:
benchmark setup -- single file DBs, in-memory, no compression. "normal_db"
created by regular flush; "ingestion_db" created by ingesting a file. Both
DBs have same contents.

```
$ TEST_TMPDIR=/dev/shm/normal_db/ ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=10485760000 -disable_auto_compactions=true -compression_type=none -num=1000000
$ ./ldb write_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/ --compression_type=no --hex --create_if_missing < <(./sst_dump --command=scan --output_hex --file=/dev/shm/normal_db/dbbench/000007.sst | awk 'began {print "0x" substr($1, 2, length($1) - 2), "==>", "0x" $5} ; /^Sst file format: block-based/ {began=1}')
$ ./ldb ingest_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/
```

benchmark run command:

```
$ TEST_TMPDIR=/dev/shm/$DB/ ./db_bench -benchmarks=seekrandom -seek_nexts=$SEEK_NEXT -use_existing_db=true -cache_index_and_filter_blocks=false -num=1000000 -cache_size=0 -threads=1 -reads=200000000 -mmap_read=1 -verify_checksum=false
```

results: perf improved marginally for ingestion_db and did not change significantly for normal_db:

SEEK_NEXT | DB | code | ops/sec | % change
-- | -- | -- | -- | --
0 | normal_db | master | 350880 |  
0 | normal_db | PR6944 | 351040 | 0.0
0 | ingestion_db | master | 343255 |  
0 | ingestion_db | PR6944 | 349424 | 1.8
10 | normal_db | master | 218711 |  
10 | normal_db | PR6944 | 217892 | -0.4
10 | ingestion_db | master | 220334 |  
10 | ingestion_db | PR6944 | 226437 | 2.8

Reviewed By: pdillinger

Differential Revision: D21924676

Pulled By: ajkr

fbshipit-source-id: ea4288a2eefa8112eb6c651a671c1de18c12e538
2020-07-07 17:26:16 -07:00
Zitan Chen 373d5ac485 BackupEngine verifies table file checksums on creating new backups (#7015)
Summary:
When table file checksums are enabled and stored in the DB manifest by using the RocksDB default crc32c checksum function, BackupEngine will calculate the crc32c checksum of the file to be copied and compare the calculated result with the one stored in the DB manifest before copying the file to the backup directory.

After copying to the backup directory, BackupEngine will verify the checksum of the copied file with the one calculated before copying. This helps detect some rare corruption events such as bit-flips during the copying process.

No verification with checksums in DB manifest will be performed if the table file checksum function is not the RocksDB default crc32c checksum function.

In addition, If `share_table_files` and `share_files_with_checksum` are true, BackupEngine will compare the checksums computed before and after copying of the table files.

Corresponding tests are added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7015

Test Plan: Passed make check

Reviewed By: pdillinger

Differential Revision: D22165732

Pulled By: gg814

fbshipit-source-id: ee0e8cc397c455eba64545c29380b9d9853588ec
2020-07-02 18:15:12 -07:00
Peter Dillinger a680a7ea37 Un-revert #7049, revert #7022 (#7071)
Summary:
Even though local bisection gave me a clear signal (and still does) that reverting https://github.com/facebook/rocksdb/issues/7049 would fix the failures in MultiThreadedDBTest, https://github.com/facebook/rocksdb/issues/7022 seems to be the root cause. Reverting https://github.com/facebook/rocksdb/issues/7022 and keeping https://github.com/facebook/rocksdb/issues/7049 seems to fix the issue in local reproducer also. (Had these landed in opposite order, bisection would have found the root cause.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7071

Reviewed By: akankshamahajan15

Differential Revision: D22362857

Pulled By: pdillinger

fbshipit-source-id: ed63df3d74e9d4ce1604de8fe43b216166c7a3f0
2020-07-02 13:30:41 -07:00
Akanksha Mahajan 5edfe3a3d8 Update Flush policy in PartitionedIndexBuilder on switching from user-key to internal-key mode (#7022)
Summary:
When format_version is high enough to support user-key and there are index entries for same user key that spans multiple data blocks then it changes from user-key mode to internal-key mode. But the flush policy is not reset to point to Block Builder of internal-keys. After this switch, no entries are added to user key index partition result, thus it never triggers flushing the block.

Fix: After adding the entry in sub_builder_index_, if there is a switch from user-key to internal-key, then flush policy is updated to point to Block Builder of internal-keys index partition.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7022

Test Plan:
1. make check -j64
           2. Added one unit test case

Reviewed By: ajkr

Differential Revision: D22197734

Pulled By: akankshamahajan15

fbshipit-source-id: d87e9e46bccab8e896ee6979d6b79c51f73d479e
2020-07-01 14:58:08 -07:00
Zitan Chen 6a243b3ade Generalize BackupEngine naming option for share_files_with_checksum SSTs and revert BackupEngine::VerifyBackup to check only file sizes by default (#7032)
Summary:
`bool BackupableDBOptions::new_naming_for_backup_files` is updated to `BackupTableNameOption BackupableDBOptions::share_files_with_checksum_naming`, where `BackupTableNameOption` is an `enum` type with two enumerators `kChecksumAndFileSize` and `kChecksumAndFileSize`. This opens up possibilities of extenting the current naming scheme for backup table files. By default, `BackupTableNameOption BackupableDBOptions::share_files_with_checksum_naming` is set to `kChecksumAndDbSessionId`.

Revert `BackupEngine::VerifyBackup` to only check file sizes by default.

Also fix the construction of the `SstFileDumper` in `GetFileDbIdentities` by setting a proper `Env` of the `Options` passed in the constructor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7032

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D22237763

Pulled By: gg814

fbshipit-source-id: 466902a4e731babd64e30f0e82ca1aa82962e52e
2020-06-30 18:47:16 -07:00
Burton Li 5be2cb6948 Compaction filter support for BlobDB (#6850)
Summary:
Added compaction filter support for BlobDB non-TTL values. Same as vanilla RocksDB, user compaction filter applies to all k/v pairs of the compaction for non-TTL values. It honors `min_blob_size`, which potentially results value transitions between inlined data and stored-in-blob data when size of value is changed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6850

Reviewed By: siying

Differential Revision: D22263487

Pulled By: ltamasi

fbshipit-source-id: 8fc03f8cde2a5c831e63b436b3dbf1b7f90939e8
2020-06-29 17:32:14 -07:00
Zitan Chen 1569dc48f5 BackupEngine::VerifyBackup verifies checksum by default (#7014)
Summary:
A parameter `verify_with_checksum` is added to `BackupEngine::VerifyBackup`, which is true by default. So now `BackupEngine::VerifyBackup` verifies backup files with checksum AND file size by default. When `verify_with_checksum` is false, `BackupEngine::VerifyBackup` only compares file sizes to verify backup files.

Also add a test for the case when corruption does not change the file size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7014

Test Plan: Passed backupable_db_test

Reviewed By: zhichao-cao

Differential Revision: D22165590

Pulled By: gg814

fbshipit-source-id: 606a7450714e868bceb38598c89fd356c6004f4f
2020-06-26 11:42:12 -07:00
Zitan Chen 95fbb62c44 Update HISTORY.md to include the Public API Change for DB::OpenForReadonly introduced earlier (#7023)
Summary:
`DB::OpenForReadOnly()` now returns `Status::NotFound` when the specified DB directory does not exist. Previously the error returned depended on the underlying `Env`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7023

Reviewed By: ajkr

Differential Revision: D22207845

Pulled By: gg814

fbshipit-source-id: f35830811a0e67efb0ee82eda3a9739bc526baba
2020-06-25 06:14:29 -07:00
Zitan Chen be41c61f22 Add a new option for BackupEngine to store table files under shared_checksum using DB session id in the backup filenames (#6997)
Summary:
`BackupableDBOptions::new_naming_for_backup_files` is added. This option is false by default. When it is true, backup table filenames under directory shared_checksum are of the form `<file_number>_<crc32c>_<db_session_id>.sst`.

Note that when this option is true, it comes into effect only when both `share_files_with_checksum` and `share_table_files` are true.

Three new test cases are added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6997

Test Plan: Passed make check.

Reviewed By: ajkr

Differential Revision: D22098895

Pulled By: gg814

fbshipit-source-id: a1d9145e7fe562d71cde7ac995e17cb24fd42e76
2020-06-24 19:31:25 -07:00
Yanqin Jin e66199d848 First step towards handling MANIFEST write error (#6949)
Summary:
This PR provides preliminary support for handling IO error during MANIFEST write.
File write/sync is not guaranteed to be atomic. If we encounter an IOError while writing/syncing to the MANIFEST file, we cannot be sure about the state of the MANIFEST file. The version edits may or may not have reached the file. During cleanup, if we delete the newly-generated SST files referenced by the pending version edit(s), but the version edit(s) actually are persistent in the MANIFEST, then next recovery attempt will process the version edits(s) and then fail since the SST files have already been deleted.
One approach is to truncate the MANIFEST after write/sync error, so that it is safe to delete the SST files. However, file truncation may not be supported on certain file systems. Therefore, we take the following approach.
If an IOError is detected during MANIFEST write/sync, we disable file deletions for the faulty database. Depending on whether the IOError is retryable (set by underlying file system), either RocksDB or application can call `DB::Resume()`, or simply shutdown and restart. During `Resume()`, RocksDB will try to switch to a new MANIFEST and write all existing in-memory version storage in the new file. If this succeeds, then RocksDB may proceed. If all recovery is completed, then file deletions will be re-enabled.
Note that multiple threads can call `LogAndApply()` at the same time, though only one of them will be going through the process MANIFEST write, possibly batching the version edits of other threads. When the leading MANIFEST writer finishes, all of the MANIFEST writing threads in this batch will have the same IOError. They will all call `ErrorHandler::SetBGError()` in which file deletion will be disabled.

Possible future directions:
- Add an `ErrorContext` structure so that it is easier to pass more info to `ErrorHandler`. Currently, as in this example, a new `BackgroundErrorReason` has to be added.

Test plan (dev server):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6949

Reviewed By: anand1976

Differential Revision: D22026020

Pulled By: riversand963

fbshipit-source-id: f3c68a2ef45d9b505d0d625c7c5e0c88495b91c8
2020-06-24 19:07:08 -07:00
Peter Dillinger 5b2bbacb6f Minimize memory internal fragmentation for Bloom filters (#6427)
Summary:
New experimental option BBTO::optimize_filters_for_memory builds
filters that maximize their use of "usable size" from malloc_usable_size,
which is also used to compute block cache charges.

Rather than always "rounding up," we track state in the
BloomFilterPolicy object to mix essentially "rounding down" and
"rounding up" so that the average FP rate of all generated filters is
the same as without the option. (YMMV as heavily accessed filters might
be unluckily lower accuracy.)

Thus, the option near-minimizes what the block cache considers as
"memory used" for a given target Bloom filter false positive rate and
Bloom filter implementation. There are no forward or backward
compatibility issues with this change, though it only works on the
format_version=5 Bloom filter.

With Jemalloc, we see about 10% reduction in memory footprint (and block
cache charge) for Bloom filters, but 1-2% increase in storage footprint,
due to encoding efficiency losses (FP rate is non-linear with bits/key).

Why not weighted random round up/down rather than state tracking? By
only requiring malloc_usable_size, we don't actually know what the next
larger and next smaller usable sizes for the allocator are. We pick a
requested size, accept and use whatever usable size it has, and use the
difference to inform our next choice. This allows us to narrow in on the
right balance without tracking/predicting usable sizes.

Why not weight history of generated filter false positive rates by
number of keys? This could lead to excess skew in small filters after
generating a large filter.

Results from filter_bench with jemalloc (irrelevant details omitted):

    (normal keys/filter, but high variance)
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9
    Build avg ns/key: 29.6278
    Number of filters: 5516
    Total size (MB): 200.046
    Reported total allocated memory (MB): 220.597
    Reported internal fragmentation: 10.2732%
    Bits/key stored: 10.0097
    Average FP rate %: 0.965228
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
    Build avg ns/key: 30.5104
    Number of filters: 5464
    Total size (MB): 200.015
    Reported total allocated memory (MB): 200.322
    Reported internal fragmentation: 0.153709%
    Bits/key stored: 10.1011
    Average FP rate %: 0.966313

    (very few keys / filter, optimization not as effective due to ~59 byte
     internal fragmentation in blocked Bloom filter representation)
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9
    Build avg ns/key: 29.5649
    Number of filters: 162950
    Total size (MB): 200.001
    Reported total allocated memory (MB): 224.624
    Reported internal fragmentation: 12.3117%
    Bits/key stored: 10.2951
    Average FP rate %: 0.821534
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
    Build avg ns/key: 31.8057
    Number of filters: 159849
    Total size (MB): 200
    Reported total allocated memory (MB): 208.846
    Reported internal fragmentation: 4.42297%
    Bits/key stored: 10.4948
    Average FP rate %: 0.811006

    (high keys/filter)
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9
    Build avg ns/key: 29.7017
    Number of filters: 164
    Total size (MB): 200.352
    Reported total allocated memory (MB): 221.5
    Reported internal fragmentation: 10.5552%
    Bits/key stored: 10.0003
    Average FP rate %: 0.969358
    $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
    Build avg ns/key: 30.7131
    Number of filters: 160
    Total size (MB): 200.928
    Reported total allocated memory (MB): 200.938
    Reported internal fragmentation: 0.00448054%
    Bits/key stored: 10.1852
    Average FP rate %: 0.963387

And from db_bench (block cache) with jemalloc:

    $ ./db_bench -db=/dev/shm/dbbench.no_optimize -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false
    $ ./db_bench -db=/dev/shm/dbbench -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -optimize_filters_for_memory -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false
    $ (for FILE in /dev/shm/dbbench.no_optimize/*.sst; do ./sst_dump --file=$FILE --show_properties | grep 'filter block' ; done) | awk '{ t += $4; } END { print t; }'
    17063835
    $ (for FILE in /dev/shm/dbbench/*.sst; do ./sst_dump --file=$FILE --show_properties | grep 'filter block' ; done) | awk '{ t += $4; } END { print t; }'
    17430747
    $ #^ 2.1% additional filter storage
    $ ./db_bench -db=/dev/shm/dbbench.no_optimize -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000
    rocksdb.block.cache.index.add COUNT : 33
    rocksdb.block.cache.index.bytes.insert COUNT : 8440400
    rocksdb.block.cache.filter.add COUNT : 33
    rocksdb.block.cache.filter.bytes.insert COUNT : 21087528
    rocksdb.bloom.filter.useful COUNT : 4963889
    rocksdb.bloom.filter.full.positive COUNT : 1214081
    rocksdb.bloom.filter.full.true.positive COUNT : 1161999
    $ #^ 1.04 % observed FP rate
    $ ./db_bench -db=/dev/shm/dbbench -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -optimize_filters_for_memory -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000
    rocksdb.block.cache.index.add COUNT : 33
    rocksdb.block.cache.index.bytes.insert COUNT : 8448592
    rocksdb.block.cache.filter.add COUNT : 33
    rocksdb.block.cache.filter.bytes.insert COUNT : 18220328
    rocksdb.bloom.filter.useful COUNT : 5360933
    rocksdb.bloom.filter.full.positive COUNT : 1321315
    rocksdb.bloom.filter.full.true.positive COUNT : 1262999
    $ #^ 1.08 % observed FP rate, 13.6% less memory usage for filters

(Due to specific key density, this example tends to generate filters that are "worse than average" for internal fragmentation. "Better than average" cases can show little or no improvement.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6427

Test Plan: unit test added, 'make check' with gcc, clang and valgrind

Reviewed By: siying

Differential Revision: D22124374

Pulled By: pdillinger

fbshipit-source-id: f3e3aa152f9043ddf4fae25799e76341d0d8714e
2020-06-22 13:32:07 -07:00
Matthew Von-Maszewski 1092f19d95 Make EncryptEnv inheritable (#6830)
Summary:
EncryptEnv class is both declared and defined within env_encryption.cc.  This makes it really tough to derive new classes from that base.

This branch moves declaration of the class to rocksdb/env_encryption.h.  The change facilitates making new encryption modules (such as an upcoming openssl AES CTR pull request) possible / easy.

The only coding change was to add the EncryptEnv object to env_basic_test.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6830

Reviewed By: riversand963

Differential Revision: D21706593

Pulled By: ajkr

fbshipit-source-id: 64d2da95a1569ceeb9b1549c3bec5404cf4c89f0
2020-06-22 13:27:16 -07:00
sdong d6b7b7712f Fix a bug that causes iterator to return wrong result in a rare data race (#6973)
Summary:
The bug fixed in https://github.com/facebook/rocksdb/pull/1816/ is now applicable to iterator too. This was not an issue but https://github.com/facebook/rocksdb/pull/2886 caused the regression. If a put and DB flush happens just between iterator to get latest sequence number and getting super version, empty result for the key or an older value can be returned, which is wrong.
Fix it in the same way as the fix in https://github.com/facebook/rocksdb/issues/1816, that is to get the sequence number after referencing the super version.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6973

Test Plan: Will run stress tests for a while to make sure there is no general regression.

Reviewed By: ajkr

Differential Revision: D22029348

fbshipit-source-id: 94390f93630906796d6e2fec321f44a920953fd1
2020-06-18 10:16:38 -07:00