Commit Graph

741 Commits

Author SHA1 Message Date
Yanqin Jin fb09ef05dc Attempt to recover from db with missing table files (#6334)
Summary:
There are situations when RocksDB tries to recover, but the db is in an inconsistent state due to SST files referenced in the MANIFEST being missing. In this case, previous RocksDB will just fail the recovery and return a non-ok status.
This PR enables another possibility. During recovery, RocksDB checks possible MANIFEST files, and try to recover to the most recent state without missing table file. `VersionSet::Recover()` applies version edits incrementally and "materializes" a version only when this version does not reference any missing table file. After processing the entire MANIFEST, the version created last will be the latest version.
`DBImpl::Recover()` calls `VersionSet::Recover()`. Afterwards, WAL replay will *not* be performed.
To use this capability, set `options.best_efforts_recovery = true` when opening the db. Best-efforts recovery is currently incompatible with atomic flush.

Test plan (on devserver):
```
$make check
$COMPILE_WITH_ASAN=1 make all && make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6334

Reviewed By: anand1976

Differential Revision: D19778960

Pulled By: riversand963

fbshipit-source-id: c27ea80f29bc952e7d3311ecf5ee9c54393b40a8
2020-03-20 19:30:48 -07:00
sdong 331e6199df Include more information in file lock failure (#6507)
Summary:
When users fail to open a DB with file lock failure, it is sometimes hard for users to debug. We now include the time the lock is acquired and the thread ID that acquired the lock, to help users debug problems like this. Default Env's thread ID is used.

Since type of lockedFiles is changed, rename it to follow naming convention too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6507

Test Plan: Add a unit test and improve an existing test to validate the case.

Differential Revision: D20378333

fbshipit-source-id: 312fe0e9733fd1d1e9969c321b90ce523cf4708a
2020-03-11 16:23:08 -07:00
Yanqin Jin d93812c9ae Iterator with timestamp (#6255)
Summary:
Preliminary support for iterator with user timestamp. Current implementation does not consider merge operator and reverse iterator. Auto compaction is also disabled in unit tests.

Create an iterator with timestamp.
```
...
read_opts.timestamp = &ts;
auto* iter = db->NewIterator(read_opts);
// target is key without timestamp.
for (iter->Seek(target); iter->Valid(); iter->Next()) {}
for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {}
delete iter;
read_opts.timestamp = &ts1;
// lower_bound and upper_bound are without timestamp.
read_opts.iterate_lower_bound = &lower_bound;
read_opts.iterate_upper_bound = &upper_bound;
auto* iter1 = db->NewIterator(read_opts);
// Do Seek or SeekToFirst()
delete iter1;
```

Test plan (dev server)
```
$make check
```

Simple benchmarking (dev server)
1. The overhead introduced by this PR even when timestamp is disabled.
key size: 16 bytes
value size: 100 bytes
Entries: 1000000
Data reside in main memory, and try to stress iterator.
Repeated three times on master and this PR.
- Seek without next
```
./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3
```
master: 159047.0 ops/sec
this PR: 158922.3 ops/sec (2% drop in throughput)
- Seek and next 10 times
```
./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 -seek_nexts=10
```
master: 109539.3 ops/sec
this PR: 107519.7 ops/sec (2% drop in throughput)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6255

Differential Revision: D19438227

Pulled By: riversand963

fbshipit-source-id: b66b4979486f8474619f4aa6bdd88598870b0746
2020-03-06 16:24:27 -08:00
Otto Kekäläinen f6c2777d95 Fix spelling: commited -> committed (#6481)
Summary:
In most places in the code the variable names are spelled correctly as
COMMITTED but in a couple places not. This fixes them and ensures the
variable is always called COMMITTED everywhere.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6481

Differential Revision: D20306776

Pulled By: pdillinger

fbshipit-source-id: b6c1bfe41db559b4bc6955c530934460c07f7022
2020-03-06 12:45:20 -08:00
Cheng Chang afb97094ae Skip high levels with no key falling in the range in CompactRange (#6482)
Summary:
In CompactRange, if there is no key in memtable falling in the specified range, then flush is skipped.
This PR extends this skipping logic to SST file levels: it starts compaction from the highest level (starting from L0) that has files with key falling in the specified range, instead of always starts from L0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6482

Test Plan:
A new test ManualCompactionTest::SkipLevel is added.

Also updated a test related to statistics of index block cache hit in db_test2, the index cache hit is increased by 1 in this PR because when checking overlap for the key range in L0, OverlapWithLevelIterator will do a seek in the table cache iterator, which will read from the cached index.

Also updated db_compaction_test and db_test to use correct range for full compaction.

Differential Revision: D20251149

Pulled By: cheng-chang

fbshipit-source-id: f822157cf4796972bd5035d9d7178d8dfb7af08b
2020-03-04 20:15:25 -08:00
sdong 17bef7d3a8 Fix data race of GetCreationTimeOfOldestFile() (#6473)
Summary:
When DBImpl::GetCreationTimeOfOldestFile() calls Version::GetCreationTimeOfOldestFile(), the version is not directly or indirectly referenced, so an event like compaction can race with the operation and cause DBImpl::GetCreationTimeOfOldestFile() to access delocated data. This was caught by an ASAN run:

==268==ERROR: AddressSanitizer: heap-use-after-free on address 0x612000b7d198 at pc 0x000018332913 bp 0x7f391510d310 sp 0x7f391510d308
READ of size 8 at 0x612000b7d198 thread T845 (store_load-33)
SCARINESS: 51 (8-byte-read-heap-use-after-free)
    #0 0x18332912 in rocksdb::Version::GetCreationTimeOfOldestFile(unsigned long*) rocksdb/src/db/version_set.cc:1488
    https://github.com/facebook/rocksdb/issues/1 0x1803ddaa in rocksdb::DBImpl::GetCreationTimeOfOldestFile(unsigned long*) rocksdb/src/db/db_impl/db_impl.cc:4499
    https://github.com/facebook/rocksdb/issues/2 0xe24ca09 in rocksdb::StackableDB::GetCreationTimeOfOldestFile(unsigned long*) rocksdb/utilities/stackable_db.h:392
    ......
0x612000b7d198 is located 216 bytes inside of 296-byte region [0x612000b7d0c0,0x612000b7d1e8)
freed by thread T28 here:
    ......
    https://github.com/facebook/rocksdb/issues/5 0x1832c73f in std::vector<rocksdb::FileMetaData*, std::allocator<rocksdb::FileMetaData*> >::~vector() third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/stl_vector.h:435
    https://github.com/facebook/rocksdb/issues/6 0x1832c73f in rocksdb::VersionStorageInfo::~VersionStorageInfo() rocksdb/src/db/version_set.cc:734
    https://github.com/facebook/rocksdb/issues/7 0x1832cf42 in rocksdb::Version::~Version() rocksdb/src/db/version_set.cc:758
    https://github.com/facebook/rocksdb/issues/8 0x9d1bb5 in rocksdb::Version::Unref() rocksdb/src/db/version_set.cc:2869
    https://github.com/facebook/rocksdb/issues/9 0x183e7631 in rocksdb::Compaction::~Compaction() rocksdb/src/db/compaction/compaction.cc:275
    https://github.com/facebook/rocksdb/issues/10 0x9e6de6 in std::default_delete<rocksdb::Compaction>::operator()(rocksdb::Compaction*) const third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/unique_ptr.h:78
    https://github.com/facebook/rocksdb/issues/11 0x9e6de6 in std::unique_ptr<rocksdb::Compaction, std::default_delete<rocksdb::Compaction> >::reset(rocksdb::Compaction*) third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/unique_ptr.h:376
    https://github.com/facebook/rocksdb/issues/12 0x9e6de6 in rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2826
    https://github.com/facebook/rocksdb/issues/13 0x9ac3b8 in rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2320
    https://github.com/facebook/rocksdb/issues/14 0x9abff7 in rocksdb::DBImpl::BGWorkCompaction(void*) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2096
    ......

Fix the issue by reference the super version and use the referenced version from it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6473

Test Plan: Run ASAN for all existing tests.

Differential Revision: D20196416

fbshipit-source-id: 5f4a7918110fc7b8dd7841932d376bc9d1e59d6f
2020-03-02 16:37:01 -08:00
Andrew Kryczka 69679e7375 Fix range deletion tombstone ingestion with global seqno (#6429)
Summary:
Original author: jeffrey-xiao

If we are writing a global seqno for an ingested file, the range
tombstone metablock gets accessed and put into the cache during
ingestion preparation. At the time, the global seqno of the ingested
file has not yet been determined, so the cached block will not have a
global seqno. When the file is ingested and we read its range tombstone
metablock, it will be returned from the cache with no global seqno. In
that case, we use the actual seqnos stored in the range tombstones,
which are all zero, so the tombstones cover nothing.

This commit removes global_seqno_ variable from Block. When iterating
over a block, the global seqno for the block is determined by the
iterator instead of storing this mutable attribute in Block.
Additionally, this commit adds a regression test to check that keys are
deleted when ingesting a file with a global seqno and range deletion
tombstones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6429

Differential Revision: D19961563

Pulled By: ajkr

fbshipit-source-id: 5cf777397fa3e452401f0bf0364b0750492487b7
2020-02-25 15:31:48 -08:00
Cheng Chang b47a714051 Update release version to 6.8 (#6450)
Summary:
Update release version to 6.8
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6450

Test Plan: no code change

Differential Revision: D20071889

Pulled By: cheng-chang

fbshipit-source-id: 91450aae09b201926469ff32f59ed436366f3b74
2020-02-24 11:44:47 -08:00
sdong d75ce0a8ae Mention rocksdb_namespace.h in HISTORY.md (#6439)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6439

Differential Revision: D20012442

fbshipit-source-id: a7c9569826eac39cd7ea69c90f08a21dd4caa335
2020-02-20 14:55:19 -08:00
Yanqin Jin 362b8d4393 Fix MANIFEST name assignment (#6426)
Summary:
Currently, a new MANIFEST file is assigned a new file number when 1) no
MANIFEST is open, or 2) current MANIFEST file size exceeds a threshold. This is
not sufficient. There are cases when the caller explicitly specifies that a new
MANIFEST be created. For example, if user sets options.write_dbid_to_manifest = true,
and there are WAL files, then RocksDB will run into an issue during recovery.
`DBImpl::Recover()` will call `LogAndApply()` to write dbid. At this point, the db being
recovered creates a new MANIFEST, say, MANIFEST-000003. Since there are WALs,
`DBImpl::RecoverLogFiles` will be called. Towards the end of this function, we call
`LogAndApply(new_descriptor_log=true)`, which explicitly creates a new MANIFEST.
However, the manifest_file_number is wrong before this fix. Consequently, RocksDB
opens an existing, non-empty file for append, effectively truncating the file to zero.
If a crash occurs, then there will be data loss.

Test Plan (devserver):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6426

Test Plan: make check

Differential Revision: D19951866

Pulled By: riversand963

fbshipit-source-id: 4b1b9fc28d4fe2ac12764b388ef9e61f05e766da
2020-02-20 14:30:58 -08:00
sdong fdf882ded2 Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433)
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433

Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.

Differential Revision: D19977691

fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
2020-02-20 12:09:57 -08:00
Andrew Kryczka 0f9dcb88b2 Return NotSupported from WriteBatchWithIndex::DeleteRange (#5393)
Summary:
As discovered in https://github.com/facebook/rocksdb/issues/5260 and https://github.com/facebook/rocksdb/issues/5392, reads on the indexed batch do not account for range tombstones. So, return `Status::NotSupported` from `WriteBatchWithIndex::DeleteRange` until we properly support it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5393

Test Plan: added unit test

Differential Revision: D19912360

Pulled By: ajkr

fbshipit-source-id: 0bbfc978ea015d64516ca708fce2429abba524cb
2020-02-18 11:18:25 -08:00
sdong 6e97d4de00 By default turn IO Uring off. (#6405)
Summary:
We realized bugs related to IO Uring. Turn it off by default.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6405

Test Plan: Manually run build_tools/build_detect_platform and observe outputs.

Differential Revision: D19862792

fbshipit-source-id: 5d5e8e2762997b72a145ae59389ef3d7e4ccd060
2020-02-12 18:01:49 -08:00
Burton Li e64508917b db_bench supports for generating random variable sized value. (#6386)
Summary:
1. `db_bench` now supports `value_size_distribution_type`, `value_size_min`, `value_size_max` options for generating random variable sized value.
2. Added `blob_db_compression_type` option for BlobDB to enable blob compression.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6386

Differential Revision: D19859406

Pulled By: zhichao-cao

fbshipit-source-id: ace52674090023fde15d832392110bf288a8e215
2020-02-12 14:47:03 -08:00
anand76 3e49249d30 Ensure all MultiGet IO errors are propagated to user (#6403)
Summary:
Unrevert the previous fix to propagate error status, and an additional fix to not treat a memtable lookup MergeInProgress status as an error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6403

Test Plan:
Unit tests
Tried running stress tests but couldn't repro the stress failure

Differential Revision: D19846721

Pulled By: anand1976

fbshipit-source-id: 7db10cccbdc863d9b559497f0a46b608d2488ca4
2020-02-11 17:27:22 -08:00
Tomas Kolda e412a426d6 JNI direct buffer support for basic operations (#2283)
Summary:
It is very useful to support direct ByteBuffers in Java. It allows to have zero memory copy and some serializers are using that directly so one do not need to create byte[] array for it.

This change also contains some fixes for Windows JNI build.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/2283

Differential Revision: D19834971

Pulled By: pdillinger

fbshipit-source-id: 44173aa02afc9836c5498c592fd1ea95b6086e8e
2020-02-11 14:48:30 -08:00
anand76 35ed530d2c Revert "Check KeyContext status in MultiGet (#6387)" (#6401)
Summary:
This reverts commit d70011bccc. The commit is causing some stress test failure due to unexpected Status::MergeInProgress() return for some keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6401

Differential Revision: D19826623

Pulled By: anand1976

fbshipit-source-id: edd634cede9cb7bdd2cb8f46e662ea709b16d2f1
2020-02-10 22:23:36 -08:00
Zhichao Cao 4369f2c7bb Checksum for each SST file and stores in MANIFEST (#6216)
Summary:
In the current code base, RocksDB generate the checksum for each block and verify the checksum at usage. Current PR enable SST file checksum. After a SST file is generated by Flush or Compaction, RocksDB generate the SST file checksum and store the checksum value and checksum method name in the vs_info and MANIFEST as part for the FileMetadata.

Added the enable_sst_file_checksum to Options to enable or disable file checksum. Added sst_file_checksum to Options such that user can plugin their own SST file checksum calculate method via overriding the SstFileChecksum class. The checksum information inlcuding uint32_t checksum value and a checksum name (string).  A new tool is added to LDB such that user can dump out a list of file checksum information from MANIFEST. If user enables the file checksum but does not provide the sst_file_checksum instance, RocksDB will use the default crc32checksum implemented in table/sst_file_checksum_crc32c.h
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6216

Test Plan: Added the testing case in table_test and ldb_cmd_test to verify checksum is correct in different level. Pass make asan_check.

Differential Revision: D19171461

Pulled By: zhichao-cao

fbshipit-source-id: b2e53479eefc5bb0437189eaa1941670e5ba8b87
2020-02-10 15:52:52 -08:00
anand76 d70011bccc Check KeyContext status in MultiGet (#6387)
Summary:
Currently, any IO errors and checksum mismatches while reading data
blocks, are being ignored by the batched MultiGet. Its only looking at
the GetContext state. Fix that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6387

Test Plan: Add unit tests

Differential Revision: D19799819

Pulled By: anand1976

fbshipit-source-id: 46133dccbb04e64067b9fe6cda73e282203db969
2020-02-07 16:48:16 -08:00
sdong 876c2dbff4 Allow readahead when reading option files. (#6372)
Summary:
Right, when reading from option files, no readahead is used and 8KB buffer is used. It might introduce high latency if the file system provide high latency and doesn't do readahead. Instead, introduce a readahead to the file. When calling inside DB, infer the value from options.log_readahead. Otherwise, a default 512KB readahead size is used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6372

Test Plan: Add --log_readahead_size in db_bench. Run it with several options and observe read size from option files using strace.

Differential Revision: D19727739

fbshipit-source-id: e6d8053b0a64259abc087f1f388b9cd66fa8a583
2020-02-07 15:18:26 -08:00
Levi Tamasi 1b4be4cac9 BlobDB: ignore trivially moved files when updating the SST<->blob file mapping (#6381)
Summary:
BlobDB keeps track of the mapping between SSTs and blob files using
the `OnFlushCompleted` and `OnCompactionCompleted` callbacks of
the `EventListener` interface: upon receiving a flush notification, a link
is added between the newly flushed SST and the corresponding blob file;
for compactions, links are removed for the inputs and added for the outputs.
The earlier code performed this link deletion and addition even for
trivially moved files; the new code walks through the two lists together
(in a fashion that's similar to merge sort) and skips such files.
This should mitigate https://github.com/facebook/rocksdb/issues/6338,
wherein an assertion is triggered with the earlier code when a compaction
notification for a trivial move precedes the flush notification for the
moved SST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6381

Test Plan: make check

Differential Revision: D19773729

Pulled By: ltamasi

fbshipit-source-id: ae0f273ded061110dd9334e8fb99b0d7786650b0
2020-02-07 12:50:57 -08:00
Mike Kolupaev 637e64b9ac Add an option to prevent DB::Open() from querying sizes of all sst files (#6353)
Summary:
When paranoid_checks is on, DBImpl::CheckConsistency() iterates over all sst files and calls Env::GetFileSize() for each of them. As far as I could understand, this is pretty arbitrary and doesn't affect correctness - if filesystem doesn't corrupt fsynced files, the file sizes will always match; if it does, it may as well corrupt contents as well as sizes, and rocksdb doesn't check contents on open.

If there are thousands of sst files, getting all their sizes takes a while. If, on top of that, Env is overridden to use some remote storage instead of local filesystem, it can be *really* slow and overload the remote storage service. This PR adds an option to not do GetFileSize(); instead it does GetChildren() for parent directory to check that all the expected sst files are at least present, but doesn't check their sizes.

We can't just disable paranoid_checks instead because paranoid_checks do a few other important things: make the DB read-only on write errors, print error messages on read errors, etc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6353

Test Plan: ran the added sanity check unit test. Will try it out in a LogDevice test cluster where the GetFileSize() calls are causing a lot of trouble.

Differential Revision: D19656425

Pulled By: al13n321

fbshipit-source-id: c2c421b367633033760d1f56747bad206d1fbf82
2020-02-04 01:27:26 -08:00
Adam Retter 7242dae7fe Improve RocksJava Comparator (#6252)
Summary:
This is a redesign of the API for RocksJava comparators with the aim of improving performance. It also simplifies the class hierarchy.

**NOTE**: This breaks backwards compatibility for existing 3rd party Comparators implemented in Java... so we need to consider carefully which release branches this goes into.

Previously when implementing a comparator in Java the developer had a choice of subclassing either `DirectComparator` or `Comparator` which would use direct and non-direct byte-buffers resepectively (via `DirectSlice` and `Slice`).

In this redesign there we have eliminated the overhead of using the Java Slice classes, and just use `ByteBuffer`s. The `ComparatorOptions` supplied when constructing a Comparator allow you to choose between direct and non-direct byte buffers by setting `useDirect`.

In addition, the `ComparatorOptions` now allow you to choose whether a ByteBuffer is reused over multiple comparator calls, by setting `maxReusedBufferSize > 0`. When buffers are reused, ComparatorOptions provides a choice of mutex type by setting `useAdaptiveMutex`.

 ---
[JMH benchmarks previously indicated](https://github.com/facebook/rocksdb/pull/6241#issue-356398306) that the difference between C++ and Java for implementing a comparator was ~7x slowdown in Java.

With these changes, when reusing buffers and guarding access to them via mutexes the slowdown is approximately the same. However, these changes offer a new facility to not reuse mutextes, which reduces the slowdown to ~5.5x in Java. We also offer a `thread_local` mechanism for reusing buffers, which reduces slowdown to ~5.2x in Java (closes https://github.com/facebook/rocksdb/pull/4425).

These changes also form a good base for further optimisation work such as further JNI lookup caching, and JNI critical.

 ---
These numbers were captured without jemalloc. With jemalloc, the performance improves for all tests, and the Java slowdown reduces to between 4.8x and 5.x.

```
ComparatorBenchmarks.put                                                native_bytewise  thrpt   25  124483.795 ± 2032.443  ops/s
ComparatorBenchmarks.put                                        native_reverse_bytewise  thrpt   25  114414.536 ± 3486.156  ops/s
ComparatorBenchmarks.put              java_bytewise_non-direct_reused-64_adaptive-mutex  thrpt   25   17228.250 ± 1288.546  ops/s
ComparatorBenchmarks.put          java_bytewise_non-direct_reused-64_non-adaptive-mutex  thrpt   25   16035.865 ± 1248.099  ops/s
ComparatorBenchmarks.put                java_bytewise_non-direct_reused-64_thread-local  thrpt   25   21571.500 ±  871.521  ops/s
ComparatorBenchmarks.put                  java_bytewise_direct_reused-64_adaptive-mutex  thrpt   25   23613.773 ± 8465.660  ops/s
ComparatorBenchmarks.put              java_bytewise_direct_reused-64_non-adaptive-mutex  thrpt   25   16768.172 ± 5618.489  ops/s
ComparatorBenchmarks.put                    java_bytewise_direct_reused-64_thread-local  thrpt   25   23921.164 ± 8734.742  ops/s
ComparatorBenchmarks.put                              java_bytewise_non-direct_no-reuse  thrpt   25   17899.684 ±  839.679  ops/s
ComparatorBenchmarks.put                                  java_bytewise_direct_no-reuse  thrpt   25   22148.316 ± 1215.527  ops/s
ComparatorBenchmarks.put      java_reverse_bytewise_non-direct_reused-64_adaptive-mutex  thrpt   25   11311.126 ±  820.602  ops/s
ComparatorBenchmarks.put  java_reverse_bytewise_non-direct_reused-64_non-adaptive-mutex  thrpt   25   11421.311 ±  807.210  ops/s
ComparatorBenchmarks.put        java_reverse_bytewise_non-direct_reused-64_thread-local  thrpt   25   11554.005 ±  960.556  ops/s
ComparatorBenchmarks.put          java_reverse_bytewise_direct_reused-64_adaptive-mutex  thrpt   25   22960.523 ± 1673.421  ops/s
ComparatorBenchmarks.put      java_reverse_bytewise_direct_reused-64_non-adaptive-mutex  thrpt   25   18293.317 ± 1434.601  ops/s
ComparatorBenchmarks.put            java_reverse_bytewise_direct_reused-64_thread-local  thrpt   25   24479.361 ± 2157.306  ops/s
ComparatorBenchmarks.put                      java_reverse_bytewise_non-direct_no-reuse  thrpt   25    7942.286 ±  626.170  ops/s
ComparatorBenchmarks.put                          java_reverse_bytewise_direct_no-reuse  thrpt   25   11781.955 ± 1019.843  ops/s
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6252

Differential Revision: D19331064

Pulled By: pdillinger

fbshipit-source-id: 1f3b794e6a14162b2c3ffb943e8c0e64a0c03738
2020-02-03 12:30:13 -08:00
Maysam Yabandeh 3316d29221 Disable recycle_log_file_num when it is incompatible with recovery mode (#6351)
Summary:
Non-zero recycle_log_file_num is incompatible with kPointInTimeRecovery and kAbsoluteConsistency recovery modes. Currently SanitizeOptions changes the recovery mode to kTolerateCorruptedTailRecords, while to resolve this option conflict it makes more sense to compromise recycle_log_file_num, which is a performance feature, instead of wal_recovery_mode, which is a safety feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6351

Differential Revision: D19648931

Pulled By: maysamyabandeh

fbshipit-source-id: dd0bf78349edc007518a00c4d63931fd69294ad7
2020-01-31 07:28:30 -08:00
anand76 fb05b5a652 Force a new manifest file if append to current one fails (#6331)
Summary:
Fix for issue https://github.com/facebook/rocksdb/issues/6316

When an append/sync of the manifest file fails due to an IO error such
as NoSpace, we don't always put the DB in read-only mode. This is true
for flush and compactions, as well as foreground operatons such as column family
add/drop, CompactFiles etc. Subsequent changes to the DB will be
recorded in the same manifest file, which would have a corrupted record
in the middle due to the previous failure. On next DB::Open(), it will
fail to process the full manifest and data will be lost.

To fix this, we reset VersionSet::descriptor_log_ on append/sync
failure, which will force a new manifest file to be written on the next
append.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6331

Test Plan: Add new unit tests in error_handler_test.cc

Differential Revision: D19632951

Pulled By: anand1976

fbshipit-source-id: 68d527cb6e59a94cbbbf9f5a17a7f464381d51e3
2020-01-30 10:56:29 -08:00
Levi Tamasi 9e3ace42a4 Add statistics for BlobDB GC (#6296)
Summary:
The patch adds statistics support to the new BlobDB garbage collection implementation;
namely, it adds support for the following (pre-existing) tickers:

`BLOB_DB_GC_NUM_FILES`: the number of blob files obsoleted by the GC logic.
`BLOB_DB_GC_NUM_NEW_FILES`: the number of new blob files generated by the GC logic.
`BLOB_DB_GC_FAILURES`: the number of failed GC passes (where a GC pass is
equivalent to a (sub)compaction).
`BLOB_DB_GC_NUM_KEYS_RELOCATED`: the number of blobs relocated to new blob
files by the GC logic.
`BLOB_DB_GC_BYTES_RELOCATED`: the total size of blobs relocated to new blob files.

The tickers `BLOB_DB_GC_NUM_KEYS_OVERWRITTEN`, `BLOB_DB_GC_NUM_KEYS_EXPIRED`,
`BLOB_DB_GC_BYTES_OVERWRITTEN`, `BLOB_DB_GC_BYTES_EXPIRED`, and
`BLOB_DB_GC_MICROS` are not relevant for the new GC logic, and are thus marked
deprecated.

The patch also adds a couple of log messages that log the number and total size of
blobs encountered and relocated during a GC pass, as well as the number of blob
files created and obsoleted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6296

Test Plan: Extended unit tests and used the BlobDB mode of `db_bench`.

Differential Revision: D19402513

Pulled By: ltamasi

fbshipit-source-id: d53d2bfbf4928a1db1e9346c67ebb9007b8932ec
2020-01-29 16:46:16 -08:00
Maysam Yabandeh 2f973ca96e Double Crash in kPointInTimeRecovery with TransactionDB (#6313)
Summary:
In WritePrepared there could be gap in sequence numbers. This breaks the trick we use in kPointInTimeRecovery which assume the first seq in the log right after the corrupted log is one larger than the last seq we read from the logs. To let this trick keep working, we add a dummy entry with the expected sequence to the first log right after recovery.
Also in WriteCommitted, if the log right after the corrupted log is empty, since it has no sequence number to let the sequential trick work, it is assumed as unexpected behavior. This is however expected to happen if we close the db after recovering from a corruption and before writing anything new to it. To remedy that, we apply the same technique by writing a dummy entry to the log that is created after the corrupted log.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6313

Differential Revision: D19458291

Pulled By: maysamyabandeh

fbshipit-source-id: 09bc49e574690085df45b034ca863ff315937e2d
2020-01-29 11:40:55 -08:00
sdong 8f2bee6747 Add ReadOptions.auto_prefix_mode (#6314)
Summary:
Add a new option ReadOptions.auto_prefix_mode. When set to true, iterator should return the same result as total order seek, but may choose to do prefix seek internally, based on iterator upper bounds. Also fix two previous bugs when handling prefix extrator changes: (1) reverse iterator should not rely on upper bound to determine prefix. Fix it with skipping prefix check. (2) block-based filter is not handled properly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6314

Test Plan: (1) add a unit test; (2) add the check to stress test and run see whether it can pass at least one run.

Differential Revision: D19458717

fbshipit-source-id: 51c1bcc5cdd826c2469af201979a39600e779bce
2020-01-28 14:44:05 -08:00
sdong 7aa66c704f Move HISTORY.md entry of hash index fix from 6.7 to unreleased (#6337)
Summary:
Commits related to hash index fix have been reverted in 6.7.fb branch. Update HISTORY.md to keep it in sync.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6337

Differential Revision: D19593717

fbshipit-source-id: 466178dc6205c9e41ccced41bf281a0952bdc2ca
2020-01-27 17:45:42 -08:00
sdong f10f135938 Fix regression bug of hash index with iterator total order seek (#6328)
Summary:
https://github.com/facebook/rocksdb/pull/6028 introduces a bug for hash index in SST files. If a table reader is created when total order seek is used, prefix_extractor might be passed into table reader as null. While later when prefix seek is used, the same table reader used, hash index is checked but prefix extractor is null and the program would crash.
Fix the issue by fixing http://github.com/facebook/rocksdb/pull/6028 in the way that prefix_extractor is preserved but ReadOptions.total_order_seek is checked

Also, a null pointer check is added so that a bug like this won't cause segfault in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6328

Test Plan: Add a unit test that would fail without the fix. Stress test that reproduces the crash would pass.

Differential Revision: D19586751

fbshipit-source-id: 8de77690167ddf5a77a01e167cf89430b1bfba42
2020-01-27 15:44:54 -08:00
Fosco Marotto bd698e4f55 Update version for next release, 6.7.0 (#6320)
Summary:
Adjusted history for 6.6.1 and 6.6.2, switched master version to 6.7.0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6320

Differential Revision: D19499272

Pulled By: gfosco

fbshipit-source-id: 2bafb2456951f231e411e9c03aaa4c044f497684
2020-01-24 15:36:32 -08:00
Levi Tamasi f34782a67d Fix the "records dropped" statistics (#6325)
Summary:
The earlier code used two conflicting definitions for the number of
input records going into a compaction, one based on the
`rocksdb.num.entries` table property and one based on
`CompactionIterationStats`. The first one is correct and in line
with how output records are counted, while the second one incorrectly
ignores input records in various cases when the `CompactionIterator`
advances or reseeks the input iterator (this can happen, amongst other
cases, when dealing with `SingleDelete`s, regular `Delete`s, `Merge`s,
and compaction filters). This can result in the code undercounting the
input records and computing an incorrect value for "records dropped"
during the compaction. The patch fixes this by switching over to the
correct (table property based) input record count for "records dropped".
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6325

Test Plan: Tested using `make check` and `db_bench`.

Differential Revision: D19525491

Pulled By: ltamasi

fbshipit-source-id: 4340b0b2f41546db8e356db70ca02199e48fa636
2020-01-23 15:27:22 -08:00
anand76 0672a6db64 Fix queue manipulation in WriteThread::BeginWriteStall() (#6322)
Summary:
When there is a write stall, the active write group leader calls ```BeginWriteStall()``` to walk the queue of writers and remove any with the ```no_slowdown``` option set. There was a bug in the code which updated the back pointer but not the forward pointer (```link_newer```), corrupting the list and causing some threads to wait forever. This PR fixes it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6322

Test Plan: Add a unit test in db_write_test

Differential Revision: D19538313

Pulled By: anand1976

fbshipit-source-id: 6fbed819e594913f435886606f5d36f74f235c3a
2020-01-23 14:01:28 -08:00
Levi Tamasi 73f65b457e Adjust thread pool sizes when setting max_background_jobs dynamically (#6300)
Summary:
https://github.com/facebook/rocksdb/pull/2205 introduced a new
configuration option called `max_background_jobs`, superseding the
earlier options `max_background_flushes` and
`max_background_compactions`. However, unlike
`max_background_compactions`, setting `max_background_jobs` dynamically
through the `SetDBOptions` interface does not adjust the size of the
thread pools (see https://github.com/facebook/rocksdb/issues/6298). The
patch fixes this.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6300

Test Plan: Extended unit test.

Differential Revision: D19430899

Pulled By: ltamasi

fbshipit-source-id: 704006605b3c13c3d1b997ccc0831ee369721074
2020-01-16 14:35:10 -08:00
sdong d2b4d42d4b Fix kHashSearch bug with SeekForPrev (#6297)
Summary:
When prefix is enabled the expected behavior when the prefix of the target does not exist is for Seek is to seek to any key larger than target and SeekToPrev to any key less than the target.
Currently. the prefix index (kHashSearch) returns OK status but sets Invalid() to indicate two cases: a prefix of the searched key does not exist, ii) the key is beyond the range of the keys in SST file. The SeekForPrev implementation in BlockBasedTable thus does not have enough information to know when it should set the index key to first (to return a key smaller than target). The patch fixes that by returning NotFound status for cases that the prefix does not exist. SeekForPrev in BlockBasedTable accordingly SeekToFirst instead of SeekToLast on the index iterator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6297

Test Plan: SeekForPrev of non-exsiting prefix is added to block_test.cc, and a test case is added in db_test2, which fails without the fix.

Differential Revision: D19404695

fbshipit-source-id: cafbbf95f8f60ff9ede9ccc99d25bfa1cf6fcdc3
2020-01-15 14:28:39 -08:00
Maysam Yabandeh d4b7fbf0d5 kHashSearch incompatible with index_block_restart_interval>1 (#6294)
Summary:
kHashSearch index type is incompatible with index_block_restart_interval larger than 1. The patch asserts that and also resets index_block_restart_interval value if it is incompatible with kHashSearch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6294

Differential Revision: D19394229

Pulled By: maysamyabandeh

fbshipit-source-id: 8a12712ab25e81094a7f71ecd43f773dd4fb6acd
2020-01-14 11:21:27 -08:00
sdong 894c6d21af Bug when multiple files at one level contains the same smallest key (#6285)
Summary:
The fractional cascading index is not correctly generated when two files at the same level contains the same smallest or largest user key.
The result would be that it would hit an assertion in debug mode and lower level files might be skipped.
This might cause wrong results when the same user keys are of merge operands and Get() is called using the exact user key. In that case, the lower files would need to further checked.
The fix is to fix the fractional cascading index.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6285

Test Plan: Add a unit test which would cause the assertion which would be fixed.

Differential Revision: D19358426

fbshipit-source-id: 39b2b1558075fd95e99491d462a67f9f2298c48e
2020-01-13 16:27:42 -08:00
Sagar Vemuri cfa585611d Consider all compaction input files to compute the oldest ancestor time (#6279)
Summary:
Look at all compaction input files to compute the oldest ancestor time.

In https://github.com/facebook/rocksdb/issues/5992 we changed how creation_time (aka oldest-ancestor-time) table property of compaction output files is computed from max(creation-time-of-all-compaction-inputs) to min(creation-time-of-all-inputs). This exposed a bug where, during compaction, the creation_time:s of only the L0 compaction inputs were being looked at, and all other input levels were being ignored. This PR fixes the issue.
Some TTL compactions when using Level-Style compactions might not have run due to this bug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6279

Test Plan: Enhanced the unit tests to validate that the correct time is propagated to the compaction outputs.

Differential Revision: D19337812

Pulled By: sagar0

fbshipit-source-id: edf8a72f11e405e93032ff5f45590816debe0bb4
2020-01-10 19:02:42 -08:00
wolfkdy 1ab1231acf parallel occ (#6240)
Summary:
This is a continuation of https://github.com/facebook/rocksdb/pull/5320/files
I open a new mr for these purposes, half a year has past since the old mr is posted so it's almost impossible to fulfill some points below on the old mr, especially 5)
1) add validation modes for optimistic txns
2) modify unittests to test both modes
3) make format
4) refine hash functor
5) push to master
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6240

Differential Revision: D19301296

fbshipit-source-id: 5b5b3cbd39558f43947f7d2dec6cd31a06386edb
2020-01-07 14:20:38 -08:00
Yanqin Jin 1aaa145877 Fix a data race for cfd->log_number_ (#6249)
Summary:
A thread calling LogAndApply may release db mutex when calling
WriteCurrentStateToManifest() which reads cfd->log_number_. Another thread can
call SwitchMemtable() and writes to cfd->log_number_.
Solution is to cache the cfd->log_number_ before releasing mutex in
LogAndApply.

Test Plan (on devserver):
```
$COMPILE_WITH_TSAN=1 make db_stress
$./db_stress --acquire_snapshot_one_in=10000 --avoid_unnecessary_blocking_io=1 --block_size=16384 --bloom_bits=16 --bottommost_compression_type=zstd --cache_index_and_filter_blocks=1 --cache_size=1048576 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_ttl=0 --compression_max_dict_bytes=16384 --compression_type=zstd --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_blackbox --db_write_buffer_size=1048576 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --enable_pipelined_write=0  --flush_one_in=1000000 --format_version=5 --get_live_files_and_wal_files_one_in=1000000 --index_block_restart_interval=5 --index_type=0 --log2_keys_per_lock=22 --long_running_snapshots=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=1000000 --max_manifest_file_size=16384 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --memtablerep=skip_list --mmap_read=0 --nooverwritepercent=1 --open_files=500000 --ops_per_thread=100000000 --partition_filters=0 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefixpercent=5 --progress_reports=0 --readpercent=45 --recycle_log_file_num=0 --reopen=20 --set_options_one_in=10000 --snapshot_hold_ops=100000 --subcompactions=2 --sync=1 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=4194304 --write_dbid_to_manifest=1 --writepercent=35
```
Then repeat the following multiple times, e.g. 100 after compiling with tsan.
```
$./db_test2 --gtest_filter=DBTest2.SwitchMemtableRaceWithNewManifest
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6249

Differential Revision: D19235077

Pulled By: riversand963

fbshipit-source-id: 79467b52f48739ce7c27e440caa2447a40653173
2020-01-06 20:09:51 -08:00
Adam Retter e697da0b18 RocksDB#keyMayExist should not assume database values are unicode strings (#6186)
Summary:
Closes https://github.com/facebook/rocksdb/issues/6183
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6186

Differential Revision: D19201281

Pulled By: pdillinger

fbshipit-source-id: 1c96b4ea09e826f91e44b0009eba3de0991d9053
2019-12-20 14:27:58 -08:00
Levi Tamasi 7a7ca8eb5b BlobDB: only compare CF IDs when checking whether an API call is for the default CF (#6226)
Summary:
BlobDB currently only supports using the default column family. The earlier
code enforces this by comparing the `ColumnFamilyHandle` passed to the
`Get`/`Put`/etc. call with the handle returned by `DefaultColumnFamily`
(which, at the end of the day, comes from `DBImpl::default_cf_handle_`).
Since other `ColumnFamilyHandle`s can also point to the default column
family, this can reject legitimate requests as well. (As an example,
with the earlier code, the handle returned by `BlobDB::Open` cannot
actually be used in API calls.) The patch fixes this by comparing only
the IDs of the column family handles instead of the pointers themselves.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6226

Test Plan: `make check`

Differential Revision: D19187461

Pulled By: ltamasi

fbshipit-source-id: 54ce2e12ebb1f07e6d1e70e3b1e0213dfa94bda2
2019-12-19 18:05:49 -08:00
Levi Tamasi 130e710056 Add BlobDB GC cutoff parameter to db_bench (#6211)
Summary:
The patch makes it possible to set the BlobDB configuration option
`garbage_collection_cutoff` on the command line. In addition, it changes
the `db_bench` code so that the default values of BlobDB related
parameters are taken from the defaults of the actual BlobDB
configuration options (note: this changes the the default of
`blob_db_bytes_per_sync`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6211

Test Plan: Ran `db_bench` with various values of the new parameter.

Differential Revision: D19166895

Pulled By: ltamasi

fbshipit-source-id: 305ccdf0123b9db032b744715810babdc3e3b7d5
2019-12-18 17:46:08 -08:00
Jermy Li f453bcb40d Add unit tests for concurrent CF iteration and drop (#6180)
Summary:
improve https://github.com/facebook/rocksdb/issues/6147
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6180

Differential Revision: D19148936

fbshipit-source-id: f691c9879fd51d54e96c1a99670cf85ca4485a89
2019-12-18 11:54:35 -08:00
Levi Tamasi cbd58af9c3 Update HISTORY.md with recent BlobDB related changes
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6202

Differential Revision: D19144158

Pulled By: ltamasi

fbshipit-source-id: 3e2522ced458568e3a2a045663704e30ab0ac223
2019-12-17 19:09:21 -08:00
anand1976 1be48cb895 Fix crash in Transaction::MultiGet() when num_keys > 32
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6192

Test Plan:
Add a unit test that fails without the fix and passes now
make check

Differential Revision: D19124781

Pulled By: anand1976

fbshipit-source-id: 8c8cb6fa16c3fc23ec011e168561a13f76bbd783
2019-12-16 20:39:35 -08:00
Levi Tamasi 2d095b4dbc Update HISTORY.md with the recent memtable trimming fixes
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6194

Differential Revision: D19125292

Pulled By: ltamasi

fbshipit-source-id: d41aca2755ec4bec07feedd6b561e8d18606a931
2019-12-16 15:19:52 -08:00
anand76 afa2420c2b Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.

This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.

The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.

This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.

The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761

Differential Revision: D18868376

Pulled By: anand1976

fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
2019-12-13 14:48:41 -08:00
奏之章 c4ce8e637f Fix RangeDeletion bug (#6062)
Summary:
Read keys from a snapshot that a range deletion were added after the snapshot  was created and this range deletion was inside an immutable memtable, we will get wrong key set.
More detail rest in codes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6062

Differential Revision: D18966785

Pulled By: pdillinger

fbshipit-source-id: 38a60bb1e2d0a1dbfc8ec641617200b6a02b86c3
2019-12-12 15:18:02 -08:00
ferhat elmas afdc58d478 Fix typos in history
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6116

Differential Revision: D18935622

fbshipit-source-id: 59f7a7bc9f0116ae6354ea217896622a34329d3c
2019-12-11 11:04:46 -08:00
Yanqin Jin 09fcf4fb6b Fix a potential bug scheduling unnecessary threads (#6104)
Summary:
RocksDB should decrement the counter `unscheduled_flushes_` as soon as the bg
thread is scheduled. Before this fix, the counter is decremented only when the
bg thread starts and picks an element from the flush queue. This may cause more
than necessary bg threads to be scheduled. Not a correctness issue, but may
affect flush thread count.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6104

Test Plan:
```
make check
```

Differential Revision: D18735584

Pulled By: riversand963

fbshipit-source-id: d36272d4a08a494aeeab6200a3cff7a3d1a2dc10
2019-11-27 14:48:49 -08:00
sdong aa1857e2df Support options.max_open_files = -1 with periodic_compaction_seconds (#6090)
Summary:
options.periodic_compaction_seconds isn't supported when options.max_open_files != -1. It's because that the information of file creation time is stored in table properties and are not guaranteed to be loaded unless options.max_open_files = -1. Relax this constraint by storing the information in manifest.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6090

Test Plan: Pass all existing tests; Modify an existing test to force the manifest value to take 0 to simulate backward compatibility case; manually open the DB generated with the change by release 4.2.

Differential Revision: D18702268

fbshipit-source-id: 13e0bd94f546498a04f3dc5fc0d9dff5125ec9eb
2019-11-26 21:39:56 -08:00
anand76 496a6ae895 Fix HISTORY.md for 6.6.0 (#6096)
Summary:
Some of the entries were incorrectly listed under 6.5.0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6096

Differential Revision: D18722801

Pulled By: gfosco

fbshipit-source-id: 18d1187deb6a9d69a8feb68b727d2f720a65f2bc
2019-11-26 19:04:49 -08:00
Peter Dillinger ca3b6c28c9 Expose and elaborate FilterBuildingContext (#6088)
Summary:
This change enables custom implementations of FilterPolicy to
wrap a variety of NewBloomFilterPolicy and select among them based on
contextual information such as table level and compaction style.

* Moves FilterBuildingContext to public API and elaborates it with more
useful data. (It would be nice to put more general options-like data,
but at the time this object is constructed, we are using internal APIs
ImmutableCFOptions and MutableCFOptions and don't have easy access to
ColumnFamilyOptions that I can tell.)

* Renames BloomFilterPolicy::GetFilterBitsBuilderInternal to
GetBuilderWithContext, because it's now public.

* Plumbs through the table's "level_at_creation" for filter building
context.

* Simplified some tests by adding GetBuilder() to
MockBlockBasedTableTester.

* Adds test as DBBloomFilterTest.ContextCustomFilterPolicy, including
sample wrapper class LevelAndStyleCustomFilterPolicy.

* Fixes a cross-test bug in DBBloomFilterTest.OptimizeFiltersForHits
where it does not reset perf context.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6088

Test Plan: make check, valgrind on db_bloom_filter_test

Differential Revision: D18697817

Pulled By: pdillinger

fbshipit-source-id: 5f987a2d7b07cc7a33670bc08ca6b4ca698c1cf4
2019-11-26 18:24:10 -08:00
Peter Dillinger 57f3032285 Allow fractional bits/key in BloomFilterPolicy (#6092)
Summary:
There's no technological impediment to allowing the Bloom
filter bits/key to be non-integer (fractional/decimal) values, and it
provides finer control over the memory vs. accuracy trade-off. This is
especially handy in using the format_version=5 Bloom filter in place
of the old one, because bits_per_key=9.55 provides the same accuracy as
the old bits_per_key=10.

This change not only requires refining the logic for choosing the best
num_probes for a given bits/key setting, it revealed a flaw in that logic.
As bits/key gets higher, the best num_probes for a cache-local Bloom
filter is closer to bpk / 2 than to bpk * 0.69, the best choice for a
standard Bloom filter. For example, at 16 bits per key, the best
num_probes is 9 (FP rate = 0.0843%) not 11 (FP rate = 0.0884%).
This change fixes and refines that logic (for the format_version=5
Bloom filter only, just in case) based on empirical tests to find
accuracy inflection points between each num_probes.

Although bits_per_key is now specified as a double, the new Bloom
filter converts/rounds this to "millibits / key" for predictable/precise
internal computations. Just in case of unforeseen compatibility
issues, we round to the nearest whole number bits / key for the
legacy Bloom filter, so as not to unlock new behaviors for it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6092

Test Plan: unit tests included

Differential Revision: D18711313

Pulled By: pdillinger

fbshipit-source-id: 1aa73295f152a995328cb846ef9157ae8a05522a
2019-11-26 15:59:34 -08:00
sdong 77eab5c85a Make default value of options.ttl to be 30 days when it is supported. (#6073)
Summary:
By default options.ttl is disabled. We believe a better default will be 30 days, which means deleted data the database will be removed from SST files slightly after 30 days, for most of the cases.

Make the default UINT64_MAX - 1 to indicate that it is not overridden by users.

Change periodic_compaction_seconds to be UINT64_MAX - 1 to UINT64_MAX  too to be consistent. Also fix a small bug in the previous periodic_compaction_seconds default code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6073

Test Plan: Add unit tests for it.

Differential Revision: D18669626

fbshipit-source-id: 957cd4374cafc1557d45a0ba002010552a378cc8
2019-11-26 10:00:32 -08:00
Sebastiano Peluso fcd7e03832 Ignore value of BackupableDBOptions::max_valid_backups_to_open when B… (#6072)
Summary:
This change ignores the value of BackupableDBOptions::max_valid_backups_to_open when a BackupEngine is not read-only.

Issue: https://github.com/facebook/rocksdb/issues/4997

Note on tests: I had to remove test case WriteOnlyEngine of BackupableDBTest because it was not consistent with the new semantic of BackupableDBOptions::max_valid_backups_to_open. Maybe, we should think about adding a new interface for append-only BackupEngines. On the other hand, I changed LimitBackupsOpened test case to use a read-only BackupEngine, and I added a new specific test case for the change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6072

Reviewed By: pdillinger

Differential Revision: D18687364

Pulled By: sebastianopeluso

fbshipit-source-id: 77bc1f927d623964d59137a93de123bbd719da4e
2019-11-26 10:00:31 -08:00
sdong 0bc87442ae Update HISTORY.md for forward compatibility (#6085)
Summary:
https://github.com/facebook/rocksdb/pull/6060 broke forward compatiblity for releases from 3.10 to 4.2. Update HISTORY.md to mention it. Also remove it from the compatibility tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6085

Differential Revision: D18691694

fbshipit-source-id: 4ef903783dc722b8a4d3e8229abbf0f021a114c9
2019-11-26 10:00:31 -08:00
Sagar Vemuri 669ea77d9f Support ttl in Universal Compaction (#6071)
Summary:
`options.ttl` is now supported in universal compaction, similar to how periodic compactions are implemented in PR https://github.com/facebook/rocksdb/issues/5970 .
Setting `options.ttl` will simply set `options.periodic_compaction_seconds` to execute the periodic compactions code path.
Discarded PR https://github.com/facebook/rocksdb/issues/4749 in lieu of this.

This is a short term work-around/hack of falling back to periodic compactions when ttl is set.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6071

Test Plan: Added a unit test.

Differential Revision: D18668336

Pulled By: sagar0

fbshipit-source-id: e75f5b81ba949f77ef9eff05e44bb1c757f58612
2019-11-22 22:13:35 -08:00
sdong d8c28e692a Support options.ttl with options.max_open_files = -1 (#6060)
Summary:
Previously, options.ttl cannot be set with options.max_open_files = -1, because it makes use of creation_time field in table properties, which is not available unless max_open_files = -1. With this commit, the information will be stored in manifest and when it is available, will be used instead.

Note that, this change will break forward compatibility for release 5.1 and older.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6060

Test Plan: Extend existing test case to options.max_open_files != -1, and simulate backward compatility in one test case by forcing the value to be 0.

Differential Revision: D18631623

fbshipit-source-id: 30c232a8672de5432ce9608bb2488ecc19138830
2019-11-22 21:23:00 -08:00
Yanqin Jin 0ce0edbe12 Fix a data race between GetColumnFamilyMetaData and MarkFilesBeingCompacted (#6056)
Summary:
Use db mutex to protect the execution of Version::GetColumnFamilyMetaData()
called in DBImpl::GetColumnFamilyMetaData().
Without mutex, GetColumnFamilyMetaData() races with MarkFilesBeingCompacted()
for access to FileMetaData::being_compacted.
Other than mutex, there are several more alternatives.

- Make FileMetaData::being_compacted an atomic variable. This will make
  FileMetaData non-copy-able.

- Separate being_compacted from FileMetaData. This requires re-organizing data
  structures that are already used in many places.

Test Plan (dev server):
```
make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6056

Differential Revision: D18620488

Pulled By: riversand963

fbshipit-source-id: 87f89660b5d5e2ab4ef7962b7b2a7d00e346aa3b
2019-11-20 16:36:29 -08:00
Little-Wallace ec3e3c3e02 Fix corruption with intra-L0 on ingested files (#5958)
Summary:
## Problem Description

Our process was abort when it call `CheckConsistency`. And the information in  `stderr` show that "`L0 files seqno 3001491972 3004797440 vs. 3002875611 3004524421` ".  Here are the causes of the accident I investigated.

* RocksDB will call `CheckConsistency` whenever `MANIFEST` file is update. It will check sequence number interval of every file, except files which were ingested.
* When one file is ingested into RocksDB, it will be assigned the value of global sequence number, and the minimum and maximum seqno of this file are equal, which are both equal to global sequence number.
* `CheckConsistency`  determines whether the file is ingested by whether the smallest and largest seqno of an sstable file are equal.
* If IntraL0Compaction picks one sst which was ingested just now and compacted it into another sst,  the `smallest_seqno` of this new file will be smaller than his `largest_seqno`.
    * If more than one ingested file was ingested before memtable schedule flush,  and they all compact into one new sstable file by `IntraL0Compaction`. The sequence interval of this new file will be included in the interval of the memtable.  So `CheckConsistency` will return a `Corruption`.
    * If a sstable was ingested after the memtable was schedule to flush, which would assign a larger seqno to it than memtable. Then the file was compacted with other files (these files were all flushed before the memtable) in L0 into one file. This compaction start before the flush job of memtable start,  but completed after the flush job finish. So this new file produced by the compaction (we call it s1) would have a larger interval of sequence number than the file produced by flush (we call it s2).  **But there was still some data in s1  written into RocksDB before the s2, so it's possible that some data in s2 was cover by old data in s1.** Of course, it would also make a `Corruption` because of overlap of seqno. There is the relationship of the files:
    > s1.smallest_seqno < s2.smallest_seqno < s2.largest_seqno  < s1.largest_seqno

So I skip pick sst file which was ingested in function `FindIntraL0Compaction `

## Reason

Here is my bug report: https://github.com/facebook/rocksdb/issues/5913

There are two situations that can cause the check to fail.

### First situation:
- First we ingest five external sst into Rocksdb, and they happened to be ingested in L0. and there had been some data in memtable, which make the smallest sequence number of memtable is less than which of sst that we ingest.

- If there had been one compaction job which compacted sst from L0 to L1, `LevelCompactionPicker` would trigger a `IntraL0Compaction` which would compact this five sst from L0 to L0. We call this sst A, which was merged from five ingested sst.

- Then some data was put into memtable, and memtable was flushed to L0. We called this sst B.
- RocksDB check consistency , and find the `smallest_seqno` of B is  less than that of A and crash. Because A was merged from five sst, the smallest sequence number of it was less than the biggest sequece number of itself, so RocksDB could not tell if A was produce by ingested.

### Secondary situaion

- First we have flushed many sst in L0,  we call them [s1, s2, s3].

- There is an immutable memtable request to be flushed, but because flush thread is busy, so it has not been picked. we call it m1.  And at the moment, one sst is ingested into L0. We call it s4. Because s4 is ingested after m1 became immutable memtable, so it has a larger log sequence number than m1.

- m1 is flushed in L0. because it is small, this flush job finish quickly. we call it s5.

- [s1, s2, s3, s4] are compacted into one sst to L0, by IntraL0Compaction.  We call it s6.
  - compacted 4@0 files to L0
- When s6 is added into manifest,  the corruption happened. because the largest sequence number of s6 is equal to s4, and they are both larger than that of s5.  But because s1 is older than m1, so the smallest sequence number of s6 is smaller than that of s5.
   - s6.smallest_seqno < s5.smallest_seqno < s5.largest_seqno < s6.largest_seqno
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5958

Differential Revision: D18601316

fbshipit-source-id: 5fe54b3c9af52a2e1400728f565e895cde1c7267
2019-11-19 15:09:11 -08:00
tabokie 20b48c6478 Fix blob context when db_iter uses seek (#6051)
Summary:
Fix: when `db_iter` falls back to using seek by `FindValueForCurrentKeyUsingSeek`, `is_blob_` flag is not properly set on encountering BlobIndex.
Also patch existing test for the mentioned code path.
Signed-off-by: tabokie <xy.tao@outlook.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6051

Differential Revision: D18596274

Pulled By: ltamasi

fbshipit-source-id: 8e4714af263b99dc2c379707d50db88fe6799278
2019-11-19 11:39:02 -08:00
Peter Dillinger f059c7d9b9 New Bloom filter implementation for full and partitioned filters (#6007)
Summary:
Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter.

Speed

The improved speed, at least on recent x86_64, comes from
* Using fastrange instead of modulo (%)
* Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row.
* Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc.
* Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes.

Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed):

$ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter
Build avg ns/key: 47.7135
Mixed inside/outside queries...
  Single filter net ns/op: 26.2825
  Random filter net ns/op: 150.459
    Average FP rate %: 0.954651
$ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter
Build avg ns/key: 47.2245
Mixed inside/outside queries...
  Single filter net ns/op: 63.2978
  Random filter net ns/op: 188.038
    Average FP rate %: 1.13823

Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected.

The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome.

Accuracy

The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices
within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments.

Accuracy data (generalizes, except old impl gets worse with millions of keys):
Memory bits per key: FP rate percent old impl -> FP rate percent new impl
6: 5.70953 -> 5.69888
8: 2.45766 -> 2.29709
10: 1.13977 -> 0.959254
12: 0.662498 -> 0.411593
16: 0.353023 -> 0.0873754
24: 0.261552 -> 0.0060971
50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP)

Fixes https://github.com/facebook/rocksdb/issues/5857
Fixes https://github.com/facebook/rocksdb/issues/4120

Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized.

Compatibility

Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007

Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version).

Differential Revision: D18294749

Pulled By: pdillinger

fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
2019-11-13 16:44:01 -08:00
anand76 6c7b1a0cc7 Batched MultiGet API for multiple column families (#5816)
Summary:
Add a new API that allows a user to call MultiGet specifying multiple keys belonging to different column families. This is mainly useful for users who want to do a consistent read of keys across column families, with the added performance benefits of batching and returning values using PinnableSlice.

As part of this change, the code in the original multi-column family MultiGet for acquiring the super versions has been refactored into a separate function that can be used by both, the batching and the non-batching versions of MultiGet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5816

Test Plan:
make check
make asan_check
asan_crash_test

Differential Revision: D18408676

Pulled By: anand1976

fbshipit-source-id: 933e7bec91dd70e7b633be4ff623a1116cc28c8d
2019-11-12 13:52:55 -08:00
anand76 03ce7fb292 Fix a buffer overrun problem in BlockBasedTable::MultiGet (#6014)
Summary:
The calculation in BlockBasedTable::MultiGet for the required buffer length for reading in compressed blocks is incorrect. It needs to take the 5-byte block trailer into account.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6014

Test Plan: Add a unit test DBBasicTest.MultiGetBufferOverrun that fails in asan_check before the fix, and passes after.

Differential Revision: D18412753

Pulled By: anand1976

fbshipit-source-id: 754dfb66be1d5f161a7efdf87be872198c7e3b72
2019-11-11 16:59:15 -08:00
Sagar Vemuri c17384fea4 Cascade TTL Compactions to move expired key ranges to bottom levels faster (#5992)
Summary:
When users use Level-Compaction-with-TTL by setting `cf_options.ttl`, the ttl-expired data could take n*ttl time to reach the bottom level (where n is the number of levels) due to how the `creation_time` table property was calculated for the newly created files during compaction. The creation time of new files was set to a max of all compaction-input-files-creation-times which essentially resulted in resetting the ttl as the key range moves across levels. This behavior is now fixed by changing the `creation_time` to be based on minimum of all compaction-input-files-creation-times; this will cause cascading compactions across levels for the ttl-expired data to move to the bottom level, resulting in getting rid of tombstones/deleted-data faster.

This will help start cascading compactions to move the expired key range to the bottom-most level faster.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5992

Test Plan: `make check`

Differential Revision: D18257883

Pulled By: sagar0

fbshipit-source-id: 00df0bb8d0b7e14d9fc239df2cba8559f3e54cbc
2019-11-11 14:09:01 -08:00
Peter Dillinger aa63abf698 Auto-GarbageCollect on PurgeOldBackups and DeleteBackup (#6015)
Summary:
Only if there is a crash, power failure, or I/O error in
DeleteBackup, shared or private files from the backup might be left
behind that are not cleaned up by PurgeOldBackups or DeleteBackup-- only
by GarbageCollect. This makes the BackupEngine API "leaky by default."
Even if it means a modest performance hit, I think we should make
Delete and Purge do as they say, with ongoing best effort: i.e. future
calls will attempt to finish any incomplete work from earlier calls.

This change does that by having DeleteBackup and PurgeOldBackups do a
GarbageCollect, unless (to minimize performance hit) this BackupEngine
has already done a GarbageCollect and there have been no
deletion-related I/O errors in that GarbageCollect or since then.

Rejected alternative 1: remove meta file last instead of first. This would in theory turn partially deleted backups into corrupted backups, but code changes would be needed to allow the missing files and consider it acceptably corrupt, rather than failing to open the BackupEngine. This might be a reasonable choice, but I mostly rejected it because it doesn't solve the legacy problem of cleaning up existing lingering files.

Rejected alternative 2: use a deletion marker file. If deletion started with creating a file that marks a backup as flagged for deletion, then we could reliably detect partially deleted backups and efficiently finish removing them. In addition to not solving the legacy problem, this could be precarious if there's a disk full situation, and we try to create a new file in order to delete some files. Ugh.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6015

Test Plan: Updated unit tests

Differential Revision: D18401333

Pulled By: pdillinger

fbshipit-source-id: 12944e372ce6809f3f5a4c416c3b321a8927d925
2019-11-08 19:15:35 -08:00
Levi Tamasi f80050fa8f Add file number/oldest referenced blob file number to {Sst,Live}FileMetaData (#6011)
Summary:
The patch exposes the file numbers of the SSTs as well as the oldest blob
files they contain a reference to through the GetColumnFamilyMetaData/
GetLiveFilesMetaData interface.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6011

Test Plan:
Fixed and extended the existing unit tests. (The earlier ColumnFamilyMetaDataTest
wasn't really testing anything because the generated memtables were never
flushed, so the metadata structure was essentially empty.)

Differential Revision: D18361697

Pulled By: ltamasi

fbshipit-source-id: d5ed1d94ac70858b84393c48711441ddfe1251e9
2019-11-07 14:04:16 -08:00
anand76 9836a1fa33 Fix MultiGet crash when no_block_cache is set (#5991)
Summary:
This PR fixes https://github.com/facebook/rocksdb/issues/5975. In ```BlockBasedTable::RetrieveMultipleBlocks()```, we were calling ```MaybeReadBlocksAndLoadToCache()```, which is a no-op if neither uncompressed nor compressed block cache are configured.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5991

Test Plan:
1. Add unit tests that fail with the old code and pass with the new
2. make check and asan_check

Cc spetrunia

Differential Revision: D18272744

Pulled By: anand1976

fbshipit-source-id: e62fa6090d1a6adf84fcd51dfd6859b03c6aebfe
2019-11-07 12:02:21 -08:00
Peter Dillinger 18f57f5ef8 Add new persistent 64-bit hash (#5984)
Summary:
For upcoming new SST filter implementations, we will use a new
64-bit hash function (XXH3 preview, slightly modified). This change
updates hash.{h,cc} for that change, adds unit tests, and out-of-lines
the implementations to keep hash.h as clean/small as possible.

In developing the unit tests, I discovered that the XXH3 preview always
returns zero for the empty string. Zero is problematic for some
algorithms (including an upcoming SST filter implementation) if it
occurs more often than at the "natural" rate, so it should not be
returned from trivial values using trivial seeds. I modified our fork
of XXH3 to return a modest hash of the seed for the empty string.

With hash function details out-of-lines in hash.h, it makes sense to
enable XXH_INLINE_ALL, so that direct calls to XXH64/XXH32/XXH3p
are inlined. To fix array-bounds warnings on some inline calls, I
injected some casts to uintptr_t in xxhash.cc. (Issue reported to Yann.)
Revised: Reverted using XXH_INLINE_ALL for now.  Some Facebook
checks are unhappy about #include on xxhash.cc file. I would
fix that by rename to xxhash_cc.h, but to best preserve history I want
to do that in a separate commit (PR) from the uintptr casts.

Also updated filter_bench for this change, improving the performance
predictability of dry run hashing and adding support for 64-bit hash
(for upcoming new SST filter implementations, minor dead code in the
tool for now).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5984

Differential Revision: D18246567

Pulled By: pdillinger

fbshipit-source-id: 6162fbf6381d63c8cc611dd7ec70e1ddc883fbb8
2019-10-31 16:36:35 -07:00
sdong aa6f7d0995 Support periodic compaction in universal compaction (#5970)
Summary:
Previously, periodic compaction is not supported in universal compaction. Add the support using following approach: if any file is marked as qualified for periodid compaction, trigger a full compaction. If a full compaction is prevented by files being compacted, try to compact the higher levels than files currently being compacted. If in this way we can only compact the last sorted run and none of the file to be compacted qualifies for periodic compaction, skip the compact. This is to prevent the same single level compaction from being executed again and again.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5970

Test Plan: Add several test cases.

Differential Revision: D18147097

fbshipit-source-id: 8ecc308154d9aca96fb192c51fbceba3947550c1
2019-10-31 11:31:37 -07:00
sdong 2a9e5caffe Make FIFO compaction take default 30 days TTL by default (#5987)
Summary:
Right now, by default FIFO compaction has no TTL. We believe that a default TTL of 30 days will be better. With this patch, the default will be changed to 30 days. Default of Options.periodic_compaction_seconds will mean the same as options.ttl. If Options.ttl and Options.periodic_compaction_seconds left default, a default 30 days TTL will be used. If both options are set, the stricter value of the two will be used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5987

Test Plan: Add an option sanitize test to cover the case.

Differential Revision: D18237935

fbshipit-source-id: a6dcea1f36c3849e13c0a69e413d73ad8eab58c9
2019-10-31 11:13:12 -07:00
Sagar Vemuri 4c9aa30a62 Auto enable Periodic Compactions if a Compaction Filter is used (#5865)
Summary:
- Periodic compactions are auto-enabled if a compaction filter or a compaction filter factory is set, in Level Compaction.
- The default value of `periodic_compaction_seconds` is changed to UINT64_MAX, which lets RocksDB auto-tune periodic compactions as needed. An explicit value of 0 will still work as before ie. to disable periodic compactions completely. For now, on seeing a compaction filter along with a UINT64_MAX value for `periodic_compaction_seconds`, RocksDB will make SST files older than 30 days to go through periodic copmactions.

Some RocksDB users make use of compaction filters to control when their data can be deleted, usually with a custom TTL logic. But it is occasionally possible that the compactions get delayed by considerable time due to factors like low writes to a key range, data reaching bottom level, etc before the TTL expiry. Periodic Compactions feature was originally built to help such cases. Now periodic compactions are auto enabled by default when compaction filters or compaction filter factories are used, as it is generally helpful to all cases to collect garbage.

`periodic_compaction_seconds` is set to a large value, 30 days, in `SanitizeOptions` when RocksDB sees that a `compaction_filter` or `compaction_filter_factory` is used.

This is done only for Level Compaction style.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5865

Test Plan:
- Added a new test `DBCompactionTest.LevelPeriodicCompactionWithCompactionFilters` to make sure that `periodic_compaction_seconds` is set if either `compaction_filter` or `compaction_filter_factory` options are set.
- `COMPILE_WITH_ASAN=1 make check`

Differential Revision: D17659180

Pulled By: sagar0

fbshipit-source-id: 4887b9cf2e53cf2dc93a7b658c6b15e1181217ee
2019-10-29 15:05:51 -07:00
Vijay Nadimpalli 79018ba51b Upgrading version to 6.6.0 on Master (#5965)
Summary:
Upgrading version to 6.6.0 on Master.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5965

Differential Revision: D18119839

Pulled By: vjnadimpalli

fbshipit-source-id: 4adbcbb82b108d2f626e88c786453baad8455f4e
2019-10-28 13:14:45 -07:00
Yanqin Jin 2309fd63bf Update column families' log number altogether after flushing during recovery (#5856)
Summary:
A bug occasionally shows up in crash test, and https://github.com/facebook/rocksdb/issues/5851 reproduces it.
The bug can surface in the following way.
1. Database has multiple column families.
2. Between one DB restart, the last log file is corrupted in the middle (not the tail)
3. During restart, DB crashes between flushing between two column families.

Then DB will fail to be opened again with error "SST file is ahead of WALs".
Solution is to update the log number associated with each column family altogether after flushing all column families' memtables. The version edits should be written to a new MANIFEST. Only after writing to all these version edits succeed does RocksDB (atomically) points the CURRENT file to the new MANIFEST.

Test plan (on devserver):
```
$make all && make check
```
Specifically
```
$make db_test2
$./db_test2 --gtest_filter=DBTest2.CrashInRecoveryMultipleCF
```
Also checked for compatibility as follows.
Use this branch, run DBTest2.CrashInRecoveryMultipleCF and preserve the db directory.
Then checkout 5.4, build ldb, and dump the MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5856

Differential Revision: D17620818

Pulled By: riversand963

fbshipit-source-id: b52ce5969c9a8052cacec2bd805fcfb373589039
2019-10-24 18:29:30 -07:00
Levi Tamasi f7e7b34ebe Propagate SST and blob file numbers through the EventListener interface (#5962)
Summary:
This patch adds a number of new information elements to the FlushJobInfo and
CompactionJobInfo structures that are passed to EventListeners via the
OnFlush{Begin, Completed} and OnCompaction{Begin, Completed} callbacks.
Namely, for flushes, the file numbers of the new SST and the oldest blob file it
references are propagated. For compactions, the new pieces of information are
the file number, level, and the oldest blob file referenced by each compaction
input and output file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5962

Test Plan:
Extended the EventListener unit tests with logic that checks that these information
elements are correctly propagated from the corresponding FileMetaData.

Differential Revision: D18095568

Pulled By: ltamasi

fbshipit-source-id: 6874359a6aadb53366b5fe87adcb2f9bd27a0a56
2019-10-24 14:44:15 -07:00
Peter Dillinger 27a124571f Fix memory leak on error opening PlainTable (#5951)
Summary:
Several error paths in opening of a plain table would leak memory. PR https://github.com/facebook/rocksdb/issues/5940 opened the leak to one more error path, which happens to have been (mistakenly) exercised by CuckooTableDBTest.AdaptiveTable. That test has been fixed, and the exercising of
plain table error cases (more than before) has been added as BadOptions1 and BadOptions2
to PlainTableDBTest. This effectively moved the memory leak to plain_table_db_test.

Also here is a cheap fix for the memory leak, without (yet?) changing the signature of
ReadTableProperties. This fixes ASAN on unit tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5951

Test Plan: make COMPILE_WITH_ASAN=1 check

Differential Revision: D18051940

Pulled By: pdillinger

fbshipit-source-id: e2952930c09a2b46c4f1ff09818c5090426929de
2019-10-21 16:53:06 -07:00
sdong a0cd920026 LevelIterator to avoid gap after prefix bloom filters out a file (#5861)
Summary:
Right now, when LevelIterator::Seek() is called, when a file is filtered out by prefix bloom filter, the position is put to the beginning of the next file. This is a confusing internal interface because many keys in the levels are skipped. Avoid this behavior by checking the key of the next file against the seek key, and invalidate the whole iterator if the prefix doesn't match.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5861

Test Plan: Add a new unit test to validate the behavior; run all exsiting tests; run crash_test

Differential Revision: D17918213

fbshipit-source-id: f06b47d937c7cc8919001f18dcc3af5b28c9cdac
2019-10-21 11:40:57 -07:00
Levi Tamasi 69bd8a2859 Update HISTORY.md with recent BlobDB adjacent changes
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5939

Differential Revision: D18009096

Pulled By: ltamasi

fbshipit-source-id: 032a48a302f9da38aecf4055b5a8d4e1dffd9dc7
2019-10-18 10:24:23 -07:00
Yi Wu 1f9d7c0f54 Fix OnFlushCompleted fired before flush result write to MANIFEST (#5908)
Summary:
When there are concurrent flush job on the same CF, `OnFlushCompleted` can be called before the flush result being install to LSM. Fixing the issue by passing `FlushJobInfo` through `MemTable`, and the thread who commit the flush result can fetch the `FlushJobInfo` and fire `OnFlushCompleted` on behave of the thread actually writing the SST.

Fix https://github.com/facebook/rocksdb/issues/5892
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5908

Test Plan: Add new test. The test will fail without the fix.

Differential Revision: D17916144

Pulled By: riversand963

fbshipit-source-id: e18df67d9533b5baee52ae3605026cdeb05cbe10
2019-10-16 10:40:23 -07:00
Maysam Yabandeh 2c9e9f2a59 Update HISTORY for SeekForPrev bug fix (#5925)
Summary:
Update history for the bug fix in https://github.com/facebook/rocksdb/pull/5907
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5925

Differential Revision: D17952605

Pulled By: maysamyabandeh

fbshipit-source-id: 609afcbb2e4087f9153822c4d11193a75a7b0e7a
2019-10-16 07:59:26 -07:00
Yanqin Jin 6febfd8451 OnTableFileCreationCompleted use "(nil)" for empty file during flush (#5905)
Summary:
Compaction can call OnTableFileCreationCompleted(). If file is empty, "(nil)"
is used as the file name.
Do the same for flush.

Test plan (dev server):
```
make all
make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5905

Differential Revision: D17883285

Pulled By: riversand963

fbshipit-source-id: 6565884adbb00e8023d88b17dfb3b6eb92220b59
2019-10-14 09:54:10 -07:00
Yanqin Jin 167cdc9f17 Support custom env in sst_dump (#5845)
Summary:
This PR allows for the creation of custom env when using sst_dump. If
the user does not set options.env or set options.env to nullptr, then sst_dump
will automatically try to create a custom env depending on the path to the sst
file or db directory. In order to use this feature, the user must call
ObjectRegistry::Register() beforehand.

Test Plan (on devserver):
```
$make all && make check
```
All tests must pass to ensure this change does not break anything.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5845

Differential Revision: D17678038

Pulled By: riversand963

fbshipit-source-id: 58ecb4b3f75246d52b07c4c924a63ee61c1ee626
2019-10-08 19:19:12 -07:00
Tomas Kolda e3a93c9ee1 Fix crash when background task fails (#5879)
Summary:
Fixing crash. Full story in issue: https://github.com/facebook/rocksdb/issues/5878
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5879

Differential Revision: D17812299

Pulled By: anand1976

fbshipit-source-id: 14e5a4fc502ade974583da9692d0ed6e5014613a
2019-10-08 14:20:01 -07:00
sdong 846e05005d Revert "Merging iterator to avoid child iterator reseek for some cases (#5286)" (#5871)
Summary:
This reverts commit 9fad3e21eb.

Iterator verification in stress tests sometimes fail for assertion
table/block_based/block_based_table_reader.cc:2973: void rocksdb::BlockBasedTableIterator<TBlockIter, TValue>::FindBlockForward() [with TBlockIter = rocksdb::DataBlockIter; TValue = rocksdb::Slice]: Assertion `!next_block_is_out_of_bound || user_comparator_.Compare(*read_options_.iterate_upper_bound, index_iter_->user_key()) <= 0' failed.

It is likely to be linked to https://github.com/facebook/rocksdb/pull/5286 together with https://github.com/facebook/rocksdb/pull/5468 as the former PR makes some child iterator's seek being avoided, so that upper bound condition fails to be updated there. Strictly speaking, the former PR was merged before the latter one, but the latter one feels a more important improvement so I choose to revert the former one for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5871

Differential Revision: D17689196

fbshipit-source-id: 4ded5be68f67bee2782d31a29cb72ea68f59dd8c
2019-10-01 11:22:41 -07:00
Yanqin Jin ae45835703 Add TryCatchUpWithPrimary to StackableDB (#5855)
Summary:
as title.

Test Plan (on devserver):
```
$make all && make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5855

Differential Revision: D17615125

Pulled By: riversand963

fbshipit-source-id: bd6ed8cf59eafff41f0d1fc044f39e8f3573172a
2019-09-26 17:00:55 -07:00
Maysam Yabandeh 6652c94f59 Fix a bug in format_version 3 + partition filters + prefix search (#5835)
Summary:
Partitioned filters make use of a top-level index to find the partition in which the filter resides. The top-level index has a key per partition. The key is guaranteed to be larger or equal than any key in that partition. When used with format_version 3, which excludes the sequence number form index keys, the separator key in the index could be equal to the prefix of the keys in the next partition. In this way, when searching for the key, the top-level index will lead us to the previous partition, which has no key with that prefix. The prefix bloom test thus returns false, although the prefix exists in the bloom of the next partition.
The patch fixes that by a hack: It always adds the prefix of the first key of the next partition to the bloom of the current partition. In this way, in the corner cases that the index will lead us to the previous partition, we still can find the bloom filter there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5835

Differential Revision: D17513585

Pulled By: maysamyabandeh

fbshipit-source-id: e2d1ff26c759e6e03875c4d57f4228316ecf50e9
2019-09-24 14:00:11 -07:00
WangQingping 02554b3c38 Update HISTORY.md for stop manual compaction
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5837

Differential Revision: D17529753

fbshipit-source-id: 98bbf22c690384b2f440286151dffdaaa744e97c
2019-09-23 15:03:39 -07:00
Maysam Yabandeh 6ec6a4a9a4 Remove snap_refresh_nanos option (#5826)
Summary:
The snap_refresh_nanos option didn't bring much benefit. Remove the feature to simplify the code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5826

Differential Revision: D17467147

Pulled By: maysamyabandeh

fbshipit-source-id: 4f950b046990d0d1292d7fc04c2ccafaf751c7f0
2019-09-18 20:26:04 -07:00
Maysam Yabandeh 638d239507 Charge block cache for cache internal usage (#5797)
Summary:
For our default block cache, each additional entry has extra memory overhead. It include LRUHandle (72 bytes currently) and the cache key (two varint64, file id and offset). The usage is not negligible. For example for block_size=4k, the overhead accounts for an extra 2% memory usage for the cache. The patch charging the cache for the extra usage, reducing untracked memory usage outside block cache. The feature is enabled by default and can be disabled by passing kDontChargeCacheMetadata to the cache constructor.
This PR builds up on https://github.com/facebook/rocksdb/issues/4258
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5797

Test Plan:
- Existing tests are updated to either disable the feature when the test has too much dependency on the old way of accounting the usage or increasing the cache capacity to account for the additional charge of metadata.
- The Usage tests in cache_test.cc are augmented to test the cache usage under kFullChargeCacheMetadata.

Differential Revision: D17396833

Pulled By: maysamyabandeh

fbshipit-source-id: 7684ccb9f8a40ca595e4f5efcdb03623afea0c6f
2019-09-16 15:26:21 -07:00
Peter (Stig) Edwards 54fc617891 Mention sst_dump cmd=recompress changes (#5807)
Summary:
As requested by siying in https://github.com/facebook/rocksdb/pull/5791#issuecomment-531417468
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5807

Differential Revision: D17399349

fbshipit-source-id: 5986c3894f356becd393fee0f1aeadcd9affc798
2019-09-16 10:45:03 -07:00
Lingjing You 9ba88a1e5d Update history.md for option memtable_insert_hint_per_batch (#5799)
Summary:
Update history.md for option memtable_insert_hint_per_batch
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5799

Differential Revision: D17369186

fbshipit-source-id: 71d82f9d99d9a52d1475d1b0153670957b6111e9
2019-09-13 10:51:32 -07:00
Ronak Sisodia 27f516acc8 Update HISTORY.md for option to make write group size configurable (#5798)
Summary:
Update HISTORY.md for option to make write group size configurable .
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5798

Differential Revision: D17369062

fbshipit-source-id: 390a3fa0b01675e91879486a729cf2cc7624d106
2019-09-13 10:43:09 -07:00
Peter Dillinger b55b2f45d0 Faster new DynamicBloom implementation (for memtable) (#5762)
Summary:
Since DynamicBloom is now only used in-memory, we're free to
change it without schema compatibility issues. The new implementation
is drawn from (with manifest permission)
303542a767/bloom_simulation_tests/foo.cc (L613)

This has several speed advantages over the prior implementation:
* Uses fastrange instead of %
* Minimum logic to determine first (and all) probed memory addresses
* (Major) Two probes per 64-bit memory fetch/write.
* Very fast and effective (murmur-like) hash expansion/re-mixing. (At
least on recent CPUs, integer multiplication is very cheap.)

While a Bloom filter with 512-bit cache locality has about a 1.15x FP
rate penalty (e.g. 0.84% to 0.97%), further restricting to two probes
per 64 bits incurs an additional 1.12x FP rate penalty (e.g. 0.97% to
1.09%). Nevertheless, the unit tests show no "mediocre" FP rate samples,
unlike the old implementation with more erratic FP rates.

Especially for the memtable, we expect speed to outweigh somewhat higher
FP rates. For example, a negative table query would have to be 1000x
slower than a BF query to justify doubling BF query time to shave 10% off
FP rate (working assumption around 1% FP rate). While that seems likely
for SSTs, my data suggests a speed factor of roughly 50x for the memtable
(vs. BF; ~1.5% lower write throughput when enabling memtable Bloom
filter, after this change).  Thus, it's probably not worth even 5% more
time in the Bloom filter to shave off 1/10th of the Bloom FP rate, or 0.1%
in absolute terms, and it's probably at least 20% slower to recoup that
much FP rate from this new implementation. Because of this, we do not see
a need for a 'locality' option that affects the MemTable Bloom filter
and have decoupled the MemTable Bloom filter from Options::bloom_locality.

Note that just 3% more memory to the Bloom filter (10.3 bits per key vs.
just 10) is able to make up for the ~12% FP rate drop in the new
implementation:

[] # Nearly "ideal" FP-wise but reasonably fast cache-local implementation
[~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_WORM64_FROM32_any.out 10000000 6 10 $RANDOM 100000000
./foo_gcc_IMPL_CACHE_WORM64_FROM32_any.out time: 3.29372 sampled_fp_rate: 0.00985956 ...

[] # Close match to this new implementation
[~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out 10000000 6 10.3 $RANDOM 100000000
./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out time: 2.10072 sampled_fp_rate: 0.00985655 ...

[] # Old locality=1 implementation
[~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_ROCKSDB_DYNAMIC_any.out 10000000 6 10 $RANDOM 100000000
./foo_gcc_IMPL_CACHE_ROCKSDB_DYNAMIC_any.out time: 3.95472 sampled_fp_rate: 0.00988943 ...

Also note the dramatic speed improvement vs. alternatives.

--

Performance unit test: DynamicBloomTest.concurrent_with_perf is updated
to report more precise timing data. (Measure running time of each
thread, not just longest running thread, etc.) Results averaged over
various sizes enabled with --enable_perf and 20 runs each; old dynamic
bloom refers to locality=1, the faster of the old:

old dynamic bloom, avg add latency = 65.6468
new dynamic bloom, avg add latency = 44.3809
old dynamic bloom, avg query latency = 50.6485
new dynamic bloom, avg query latency = 43.2186
old avg parallel add latency = 41.678
new avg parallel add latency = 24.5238
old avg parallel hit latency = 14.6322
new avg parallel hit latency = 12.3939
old avg parallel miss latency = 16.7289
new avg parallel miss latency = 12.2134

Tested on a dedicated 64-bit production machine at Facebook. Significant
improvement all around.

Despite now using std::atomic<uint64_t>, quick before-and-after test on
a 32-bit machine (Intel Atom N270, released 2008) shows no regression in
performance, in some cases modest improvement.

--

Performance integration test (synthetic): with DEBUG_LEVEL=0, used
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillrandom,readmissing,readrandom,stats --num=2000000
and optionally with -memtable_whole_key_filtering -memtable_bloom_size_ratio=0.01
300 runs each configuration.

Write throughput change by enabling memtable bloom:
Old locality=0: -3.06%
Old locality=1: -2.37%
New:            -1.50%
conclusion -> seems to substantially close the gap

Readmissing throughput change by enabling memtable bloom:
Old locality=0: +34.47%
Old locality=1: +34.80%
New:            +33.25%
conclusion -> maybe a small new penalty from FP rate

Readrandom throughput change by enabling memtable bloom:
Old locality=0: +31.54%
Old locality=1: +31.13%
New:            +30.60%
conclusion -> maybe also from FP rate (after memtable flush)

--

Another conclusion we can draw from this new implementation is that the
existing 32-bit hash function is not inherently crippling the Bloom
filter speed or accuracy, below about 5 million keys. For speed, the
implementation is essentially the same whether starting with 32-bits or
64-bits of hash; it just determines whether the first multiplication
after fastrange is a pseudorandom expansion or needed re-mix. Note that
this multiplication can occur while memory is fetching.

For accuracy, in a standard configuration, you need about 5 million
keys before you have about a 1.1x FP penalty due to using a
32-bit hash vs. 64-bit:

[~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out $((5 * 1000 * 1000 * 10)) 6 10 $RANDOM 100000000
./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out time: 2.52069 sampled_fp_rate: 0.0118267 ...
[~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_any.out $((5 * 1000 * 1000 * 10)) 6 10 $RANDOM 100000000
./foo_gcc_IMPL_CACHE_MUL64_BLOCK_any.out time: 2.43871 sampled_fp_rate: 0.0109059
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5762

Differential Revision: D17214194

Pulled By: pdillinger

fbshipit-source-id: ad9da031772e985fd6b62a0e1db8e81892520595
2019-09-05 14:59:25 -07:00
Affan Dar 229e6fbe0e Adding DB::GetCurrentWalFile() API as a repliction/backup helper (#5765)
Summary:
Adding a light weight API to get last live WAL file name and size. Meant to be used as a helper for backup/restore tooling in a larger ecosystem such as MySQL with a MyRocks storage engine.

Specifically within MySQL's backup/restore mechanism, this call can be made with a write lock on the mysql db to get a transactionally consistent snapshot of the current WAL file position along with other non-rocksdb log/data files.

Without this, the alternative would be to take the aforementioned lock, scan the WAL dir for all files, find the last file and note its exact size as the rocksdb 'checkpoint'.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5765

Differential Revision: D17172717

Pulled By: affandar

fbshipit-source-id: f2fabafd4c0e6fc45f126670c8c88a9f84cb8a37
2019-09-04 12:10:17 -07:00
Yanqin Jin 44eca41add Fix a bug in file ingestion (#5760)
Summary:
Before this PR, when the number of column families involved in a file ingestion exceeds 2, a bug in the looping logic prevents correct file number being assigned to each ingestion job.
Also skip deleting non-existing hard links during cleanup-after-failure.

Test plan (devserver)
```
$COMPILE_WITH_ASAN=1 make all
$./external_sst_file_test --gtest_filter=ExternalSSTFileTest/ExternalSSTFileTest.IngestFilesIntoMultipleColumnFamilies_*/*
$makke check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5760

Differential Revision: D17142982

Pulled By: riversand963

fbshipit-source-id: 06c1847a4e7a402647bcf28d124e70f2a0f9daf6
2019-08-30 18:29:07 -07:00
Pratik Dhandharia a281822331 Lower the risk for users to run options.force_consistency_checks = true (#5744)
Summary:
Open-source users recently reported two occurrences of LSM-tree corruption (https://github.com/facebook/rocksdb/issues/5558 is one), which would be caught by options.force_consistency_checks = true. options.force_consistency_checks has a usability limitation because it crashes the service once inconsistency is detected. This makes the feature hard to use. Most users serve from multiple RocksDB shards per server and the impacts of crashing the service is higher than it should be.

Instead, we just pass the error back to users without killing the service, and ask them to deal with the problem accordingly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5744

Differential Revision: D17096940

Pulled By: pdhandharia

fbshipit-source-id: b6780039044e265f26ed2ad03c51f4abbe8b603c
2019-08-29 14:07:37 -07:00
Zhongyi Xie 2f41ecfe75 Refactor trimming logic for immutable memtables (#5022)
Summary:
MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory.
We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one.
The semantics of the old option actually works both as an upper bound and lower bound. History trimming will start if number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming.
In order the mimic the behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022

Differential Revision: D14394062

Pulled By: miasantreble

fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5
2019-08-23 13:55:34 -07:00
anand76 9046bdc5d3 Fix MultiGet() bug when whole_key_filtering is disabled (#5665)
Summary:
The batched MultiGet() implementation was not correctly handling bloom filter lookups when whole_key_filtering is disabled. It was incorrectly skipping keys not in the prefix_extractor domain, and not calling transform for keys in domain. This PR fixes both problems by moving the domain check and transformation to the FilterBlockReader.

Tests:
Unit test (confirmed failed before the fix)
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5665

Differential Revision: D16902380

Pulled By: anand1976

fbshipit-source-id: a6be81ad68a6e37134a65246aec7a2c590eccf00
2019-08-21 10:23:23 -07:00