Commit Graph

176 Commits

Author SHA1 Message Date
mrambacher 12f1137355 Add a SystemClock class to capture the time functions of an Env (#7858)
Summary:
Introduces and uses a SystemClock class to RocksDB.  This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock.

Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead.  There are likely more places that can be changed, but this is a start to show what can/should be done.  Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock.

There are several Env classes that implement these functions.  Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR.  It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc).

Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858

Reviewed By: pdillinger

Differential Revision: D26006406

Pulled By: mrambacher

fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90
2021-01-25 22:09:11 -08:00
Adam Retter 6e0f62f2b6 Add more tests to ASSERT_STATUS_CHECKED (3), API change (#7715)
Summary:
Third batch of adding more tests to ASSERT_STATUS_CHECKED.

* db_compaction_filter_test
* db_compaction_test
* db_dynamic_level_test
* db_inplace_update_test
* db_sst_test
* db_tailing_iter_test
* db_io_failure_test

Also update GetApproximateSizes APIs to all return Status.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7715

Reviewed By: jay-zhuang

Differential Revision: D25806896

Pulled By: pdillinger

fbshipit-source-id: 6cb9d62ba5a756c645812754c596ad3995d7c262
2021-01-06 14:15:02 -08:00
mrambacher f35f7f2704 Fix many tests to run with MEM_ENV and ENCRYPTED_ENV; Introduce a MemoryFileSystem class (#7566)
Summary:
This PR does a few things:

1.  The MockFileSystem class was split out from the MockEnv.  This change would theoretically allow a MockFileSystem to be used by other Environments as well (if we created a means of constructing one).  The MockFileSystem implements a FileSystem in its entirety and does not rely on any Wrapper implementation.

2.  Make the RocksDB test suite work when MOCK_ENV=1 and ENCRYPTED_ENV=1 are set.  To accomplish this, a few things were needed:
- The tests that tried to use the "wrong" environment (Env::Default() instead of env_) were updated
- The MockFileSystem was changed to support the features it was missing or mishandled (such as recursively deleting files in a directory or supporting renaming of a directory).

3.  Updated the test framework to have a ROCKSDB_GTEST_SKIP macro.  This can be used to flag tests that are skipped.  Currently, this defaults to doing nothing (marks the test as SUCCESS) but will mark the tests as SKIPPED when RocksDB is upgraded to a version of gtest that supports this (gtest-1.10).

I have run a full "make check" with MEM_ENV, ENCRYPTED_ENV,  both, and neither under both MacOS and RedHat.  A few tests were disabled/skipped for the MEM/ENCRYPTED cases.  The error_handler_fs_test fails/hangs for MEM_ENV (presumably a timing problem) and I will introduce another PR/issue to track that problem.  (I will also push a change to disable those tests soon).  There is one more test in DBTest2 that also fails which I need to investigate or skip before this PR is merged.

Theoretically, this PR should also allow the test suite to run against an Env loaded from the registry, though I do not have one to try it with currently.

Finally, once this is accepted, it would be nice if there was a CircleCI job to run these tests on a checkin so this effort does not become stale.  I do not know how to do that, so if someone could write that job, it would be appreciated :)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7566

Reviewed By: zhichao-cao

Differential Revision: D24408980

Pulled By: jay-zhuang

fbshipit-source-id: 911b1554a4d0da06fd51feca0c090a4abdcb4a5f
2020-10-27 10:33:09 -07:00
Andrew Kryczka 8115eb520d add Status check assertions for repair_test (#7455)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7455

Reviewed By: riversand963

Differential Revision: D23985283

Pulled By: ajkr

fbshipit-source-id: 5dd2be62350f6e31d13a1e7821cb848a37699c93
2020-09-29 16:30:08 -07:00
mrambacher 7d472accdc Bring the Configurable options together (#5753)
Summary:
This PR merges the functionality of making the ColumnFamilyOptions, TableFactory, and DBOptions into Configurable into a single PR, resolving any merge conflicts

Pull Request resolved: https://github.com/facebook/rocksdb/pull/5753

Reviewed By: ajkr

Differential Revision: D23385030

Pulled By: zhichao-cao

fbshipit-source-id: 8b977a7731556230b9b8c5a081b98e49ee4f160a
2020-09-14 17:01:01 -07:00
sdong 722814e357 Get() to fail with underlying failures in PartitionIndexReader::CacheDependencies() (#7297)
Summary:
Right now all I/O failures under PartitionIndexReader::CacheDependencies() is swallowed. This doesn't impact correctness but we've made a decision that any I/O error in read path now should be returned to users for awareness. Return errors in those cases instead.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7297

Test Plan: Add a new unit test that ingest errors in this code path and see Get() fails. Only one I/O path is hit in PartitionIndexReader::CacheDependencies(). Several option changes are attempt but not able to got other pread paths triggered. Not sure whether other failure cases would be even possible. Would rely on continuous stress test to validate it.

Reviewed By: anand1976

Differential Revision: D23257950

fbshipit-source-id: 859dbc92fa239996e1bb378329344d3d54168c03
2020-08-25 19:01:05 -07:00
Jay Zhuang 3e422ce0ca Fix a timer_test deadlock (#7277)
Summary:
There's a potential deadlock caused by MockTimeEnv time value get to a large number, which causes TimedWait() wait forever. The test misuses the microseconds as seconds, making it more likely to happen.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7277

Reviewed By: pdillinger

Differential Revision: D23183873

Pulled By: jay-zhuang

fbshipit-source-id: 6fc38ebd40b4125a99551204b271f91a27e70086
2020-08-20 08:43:13 -07:00
sdong b194c21bba Whole DBTest to skip fsync (#7274)
Summary:
After https://github.com/facebook/rocksdb/pull/7036, we still see extra DBTest that can timeout when running 10 or 20 in parallel. Expand skip-fsync mode in whole DBTest. Still preserve other tests from doing this mode to be conservative.

This commit reinstates https://github.com/facebook/rocksdb/issues/7049, whose un-revert was lost in an automatic
infrastructure mis-merge.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7274

Test Plan: Run all existing files.

Reviewed By: pdillinger

Differential Revision: D23177444

fbshipit-source-id: 1f61690b2ac6333c3b2c87176fef6b2cba086b33
2020-08-17 18:42:25 -07:00
Jay Zhuang 69760b4d05 Introduce a global StatsDumpScheduler for stats dumping (#7223)
Summary:
Have a global StatsDumpScheduler for all DB instance stats dumping, including `DumpStats()` and `PersistStats()`. Before this, there're 2 dedicate threads for every DB instance, one for DumpStats() one for PersistStats(), which could create lots of threads if there're hundreds DB instances.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7223

Reviewed By: riversand963

Differential Revision: D23056737

Pulled By: jay-zhuang

fbshipit-source-id: 0faa2311142a73433ebb3317361db7cbf43faeba
2020-08-14 20:12:44 -07:00
Peter Dillinger 6ac1d25fd0 Fix+clean up handling of mock sleeps (#7101)
Summary:
We have a number of tests hanging on MacOS and windows due to
mishandling of code for mock sleeps. In addition, the code was in
terrible shape because the same variable (addon_time_) would sometimes
refer to microseconds and sometimes to seconds. One test even assumed it
was nanoseconds but was written to pass anyway.

This has been cleaned up so that DB tests generally use a SpecialEnv
function to mock sleep, for either some number of microseconds or seconds
depending on the function called. But to call one of these, the test must first
call SetMockSleep (precondition enforced with assertion), which also turns
sleeps in RocksDB into mock sleeps. To also removes accounting for actual
clock time, call SetTimeElapseOnlySleepOnReopen, which implies
SetMockSleep (on DB re-open). This latter setting only works by applying
on DB re-open, otherwise havoc can ensue if Env goes back in time with
DB open.

More specifics:

Removed some unused test classes, and updated comments on the general
problem.

Fixed DBSSTTest.GetTotalSstFilesSize using a sync point callback instead
of mock time. For this we have the only modification to production code,
inserting a sync point callback in flush_job.cc, which is not a change to
production behavior.

Removed unnecessary resetting of mock times to 0 in many tests. RocksDB
deals in relative time. Any behaviors relying on absolute date/time are likely
a bug. (The above test DBSSTTest.GetTotalSstFilesSize was the only one
clearly injecting a specific absolute time for actual testing convenience.) Just
in case I misunderstood some test, I put this note in each replacement:
// NOTE: Presumed unnecessary and removed: resetting mock time in env

Strengthened some tests like MergeTestTime, MergeCompactionTimeTest, and
FilterCompactionTimeTest in db_test.cc

stats_history_test and blob_db_test are each their own beast, rather deeply
dependent on MockTimeEnv. Each gets its own variant of a work-around for
TimedWait in a mock time environment. (Reduces redundancy and
inconsistency in stats_history_test.)

Intended follow-up:

Remove TimedWait from the public API of InstrumentedCondVar, and only
make that accessible through Env by passing in an InstrumentedCondVar and
a deadline. Then the Env implementations mocking time can fix this problem
without using sync points. (Test infrastructure using sync points interferes
with individual tests' control over sync points.)

With that change, we can simplify/consolidate the scattered work-arounds.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7101

Test Plan: make check on Linux and MacOS

Reviewed By: zhichao-cao

Differential Revision: D23032815

Pulled By: pdillinger

fbshipit-source-id: 7f33967ada8b83011fb54e8279365c008bd6610b
2020-08-11 12:41:30 -07:00
mrambacher c7c7b07f06 More Makefile Cleanup (#7097)
Summary:
Cleans up some of the dependencies on test code in the Makefile while building tools:
- Moves the test::RandomString, DBBaseTest::RandomString into Random
- Moves the test::RandomHumanReadableString into Random
- Moves the DestroyDir method into file_utils
- Moves the SetupSyncPointsToMockDirectIO into sync_point.
- Moves the FaultInjection Env and FS classes under env

These changes allow all of the tools to build without dependencies on test_util, thereby simplifying the build dependencies.  By moving the FaultInjection code, the dependency in db_stress on different libraries for debug vs release was eliminated.

Tested both release and debug builds via Make and CMake for both static and shared libraries.

More work remains to clean up how the tools are built and remove some unnecessary dependencies.  There is also more work that should be done to get the Makefile and CMake to align in their builds -- what is in the libraries and the sizes of the executables are different.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7097

Reviewed By: riversand963

Differential Revision: D22463160

Pulled By: pdillinger

fbshipit-source-id: e19462b53324ab3f0b7c72459dbc73165cc382b2
2020-07-09 14:35:17 -07:00
Jay Zhuang 00de699096 Replace reinterpret_cast with static_cast_with_check (#7067)
Summary:
Replace `reinterpret_cast` with `static_cast_with_check` for `DBImpl` and `ColumnFamilyHandleImpl`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7067

Reviewed By: siying

Differential Revision: D22361587

Pulled By: jay-zhuang

fbshipit-source-id: dfe9e8f3af39c3d27cc372c55ab9ad905eb0a5a1
2020-07-02 19:25:41 -07:00
Peter Dillinger 52d59e0c93 Revert "Whole DBTest to skip fsync (#7049)" (#7070)
Summary:
This reverts commit 4f1534bdb0.

This commit caused failures and deadlocks in
MultiThreadedDBTest.MultiThreaded/69 and others.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7070

Reviewed By: riversand963

Differential Revision: D22358778

Pulled By: pdillinger

fbshipit-source-id: faf8f2cb469a7063a113921c8e9c64a9f7610dac
2020-07-02 10:22:43 -07:00
sdong 4f1534bdb0 Whole DBTest to skip fsync (#7049)
Summary:
After https://github.com/facebook/rocksdb/pull/7036, we still see extra DBTest that can timeout when running 10 or 20 in parallel. Expand skip-fsync mode in whole DBTest. Still preserve other tests from doing this mode to be conservative.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7049

Test Plan: Run all existing files.

Reviewed By: pdillinger

Differential Revision: D22301700

fbshipit-source-id: f9a9e3b3b26ce640665a47cb8bff33ba0c89b565
2020-07-01 19:37:56 -07:00
sdong 58547e533b Disable fsync in some tests to speed them up (#7036)
Summary:
Fsyncing files is not providing more test coverage in many tests. Provide an option in SpecialEnv to turn it off to speed it up and enable this option in some tests with relatively long run time.
Most of those tests can be divided as parameterized gtest too. This two speed up approaches are orthogonal and we can do both if needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7036

Test Plan: Run all tests and make sure they pass.

Reviewed By: ltamasi

Differential Revision: D22268084

fbshipit-source-id: 6d4a838a1b7328c13931a2a5d93de57aa02afaab
2020-06-29 16:56:59 -07:00
Peter Dillinger 2f1700c8c5 Fix failure to write output in SpecialEnv::GetCurrentTime (#6803)
Summary:
This very old test code bug was causing a new valgrind failure
in MultiGetDeadlineExceeded

Also fix hang in MultiGetDeadlineExceeded by unifying with some logic from another test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6803

Test Plan: run that unit test under valgrind, make check

Reviewed By: siying

Differential Revision: D21388470

Pulled By: pdillinger

fbshipit-source-id: 0ce99d6d5eb8cd3195b17406892c8c5cff5fa5dd
2020-05-05 13:11:29 -07:00
Peter Dillinger 249eff0f30 Stats for redundant insertions into block cache (#6681)
Summary:
Since read threads do not coordinate on loading data into block
cache, two threads between Lookup and Insert can end up loading and
inserting the same data. This is particularly concerning with
cache_index_and_filter_blocks since those are hot and more likely to
be race targets if ejected from (or not pre-populated in) the cache.

Particularly with moves toward disaggregated / network storage, the cost
of redundant retrieval might be high, and we should at least have some
hard statistics from which we can estimate impact.

Example with full filter thrashing "cliff":

    $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10
    ...
    $ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((130 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 | egrep '^rocksdb.block.cache.(.*add|.*redundant)' | grep -v compress | sort
    rocksdb.block.cache.add COUNT : 14181
    rocksdb.block.cache.add.failures COUNT : 0
    rocksdb.block.cache.add.redundant COUNT : 476
    rocksdb.block.cache.data.add COUNT : 12749
    rocksdb.block.cache.data.add.redundant COUNT : 18
    rocksdb.block.cache.filter.add COUNT : 1003
    rocksdb.block.cache.filter.add.redundant COUNT : 217
    rocksdb.block.cache.index.add COUNT : 429
    rocksdb.block.cache.index.add.redundant COUNT : 241
    $ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((120 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 | egrep '^rocksdb.block.cache.(.*add|.*redundant)' | grep -v compress | sort
    rocksdb.block.cache.add COUNT : 1182223
    rocksdb.block.cache.add.failures COUNT : 0
    rocksdb.block.cache.add.redundant COUNT : 302728
    rocksdb.block.cache.data.add COUNT : 31425
    rocksdb.block.cache.data.add.redundant COUNT : 12
    rocksdb.block.cache.filter.add COUNT : 795455
    rocksdb.block.cache.filter.add.redundant COUNT : 130238
    rocksdb.block.cache.index.add COUNT : 355343
    rocksdb.block.cache.index.add.redundant COUNT : 172478
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6681

Test Plan: Some manual testing (above) and unit test covering key metrics is included

Reviewed By: ltamasi

Differential Revision: D21134113

Pulled By: pdillinger

fbshipit-source-id: c11497b5f00f4ffdfe919823904e52d0a1a91d87
2020-04-27 13:20:27 -07:00
sdong fdf882ded2 Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433)
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433

Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.

Differential Revision: D19977691

fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
2020-02-20 12:09:57 -08:00
Connor1996 3e26a94ba1 Add oldest snapshot sequence property (#6228)
Summary:
Add oldest snapshot sequence property, so we can use `db.GetProperty("rocksdb.oldest-snapshot-sequence")` to get the sequence number of the oldest snapshot.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6228

Differential Revision: D19264145

Pulled By: maysamyabandeh

fbshipit-source-id: 67fbe5304d89cbc475bd404e30d1299f7b11c010
2020-01-07 08:36:44 -08:00
Peter Dillinger ca3b6c28c9 Expose and elaborate FilterBuildingContext (#6088)
Summary:
This change enables custom implementations of FilterPolicy to
wrap a variety of NewBloomFilterPolicy and select among them based on
contextual information such as table level and compaction style.

* Moves FilterBuildingContext to public API and elaborates it with more
useful data. (It would be nice to put more general options-like data,
but at the time this object is constructed, we are using internal APIs
ImmutableCFOptions and MutableCFOptions and don't have easy access to
ColumnFamilyOptions that I can tell.)

* Renames BloomFilterPolicy::GetFilterBitsBuilderInternal to
GetBuilderWithContext, because it's now public.

* Plumbs through the table's "level_at_creation" for filter building
context.

* Simplified some tests by adding GetBuilder() to
MockBlockBasedTableTester.

* Adds test as DBBloomFilterTest.ContextCustomFilterPolicy, including
sample wrapper class LevelAndStyleCustomFilterPolicy.

* Fixes a cross-test bug in DBBloomFilterTest.OptimizeFiltersForHits
where it does not reset perf context.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6088

Test Plan: make check, valgrind on db_bloom_filter_test

Differential Revision: D18697817

Pulled By: pdillinger

fbshipit-source-id: 5f987a2d7b07cc7a33670bc08ca6b4ca698c1cf4
2019-11-26 18:24:10 -08:00
anand76 6c7b1a0cc7 Batched MultiGet API for multiple column families (#5816)
Summary:
Add a new API that allows a user to call MultiGet specifying multiple keys belonging to different column families. This is mainly useful for users who want to do a consistent read of keys across column families, with the added performance benefits of batching and returning values using PinnableSlice.

As part of this change, the code in the original multi-column family MultiGet for acquiring the super versions has been refactored into a separate function that can be used by both, the batching and the non-batching versions of MultiGet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5816

Test Plan:
make check
make asan_check
asan_crash_test

Differential Revision: D18408676

Pulled By: anand1976

fbshipit-source-id: 933e7bec91dd70e7b633be4ff623a1116cc28c8d
2019-11-12 13:52:55 -08:00
Yanqin Jin 5d9a67e718 Support loading custom objects in unit tests (#5676)
Summary:
Most existing RocksDB unit tests run on `Env::Default()`. It will be useful to port the unit tests to non-default environments, e.g. `HdfsEnv`, etc.
This pull request is one step towards this goal. If RocksDB unit tests are built with a static library exposing a function `RegisterCustomObjects()`, then it is possible to implement custom object registrar logic in the library. RocksDB unit test can call `RegisterCustomObjects()` at the beginning.
By default, `ROCKSDB_UNITTESTS_WITH_CUSTOM_OBJECTS_FROM_STATIC_LIBS` is not defined, thus this PR has no impact on existing RocksDB because `RegisterCustomObjects()` is a noop.
Test plan (on devserver):
```
$make clean && COMPILE_WITH_ASAN=1 make -j32 all
$make check
```
All unit tests must pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5676

Differential Revision: D16679157

Pulled By: riversand963

fbshipit-source-id: aca571af3fd0525277cdc674248d0fe06e060f9d
2019-08-09 15:12:08 -07:00
Zhongyi Xie d68f9f4580 simplify include directive involving inttypes (#5402)
Summary:
When using `PRIu64` type of printf specifier, current code base does the following:
```
#ifndef __STDC_FORMAT_MACROS
#define __STDC_FORMAT_MACROS
#endif
#include <inttypes.h>
```
However, this can be simplified to
```
#include <cinttypes>
```
as long as flag `-std=c++11` is used.
This should solve issues like https://github.com/facebook/rocksdb/issues/5159
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5402

Differential Revision: D15701195

Pulled By: miasantreble

fbshipit-source-id: 6dac0a05f52aadb55e9728038599d3d2e4b59d03
2019-06-06 13:56:07 -07:00
Vijay Nadimpalli 49c5a12dbe Organizing rocksdb/db directory
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5390

Differential Revision: D15579388

Pulled By: vjnadimpalli

fbshipit-source-id: 5bfc95e31554b8ff05b97b76d6534113f527f366
2019-05-31 11:57:01 -07:00
Siying Dong 8843129ece Move some memory related files from util/ to memory/ (#5382)
Summary:
Move arena, allocator, and memory tools under util to a separate memory/ directory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382

Differential Revision: D15564655

Pulled By: siying

fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5
2019-05-30 17:44:09 -07:00
Vijay Nadimpalli 50e470791d Organizing rocksdb/table directory by format
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5373

Differential Revision: D15559425

Pulled By: vjnadimpalli

fbshipit-source-id: 5d6d6d615582bedd96a4b879bb25d429a6de8b55
2019-05-30 14:51:11 -07:00
Siying Dong e9e0101ca4 Move test related files under util/ to test_util/ (#5377)
Summary:
There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo
ve them to a new directory test_util/, so that util/ is cleaner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377

Differential Revision: D15551366

Pulled By: siying

fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c
2019-05-30 11:25:51 -07:00
Siying Dong 545d206040 Move some file related files outside util/ (#5375)
Summary:
util/ means for lower level libraries, so it's a good idea to move the files which requires knowledge to DB out. Create a file/ and move some files there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5375

Differential Revision: D15550935

Pulled By: siying

fbshipit-source-id: 61a9715dcde5386eebfb43e93f847bba1ae0d3f2
2019-05-29 20:47:06 -07:00
Maysam Yabandeh f383641a1d Unordered Writes (#5218)
Summary:
Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees.
The patch is prepared based on an original by siying
Existing unit tests are extended to include unordered_write option.

Benchmark Results:
```
TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions  --unordered_write=1
```
With WAL
- Vanilla RocksDB: 78.6 MB/s
- WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x)
- unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees)

Without WAL
- Vanilla RocksDB: 111.3 MB/s
- WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x)
- unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees)

- WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x)

Limitations:
- The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218

Differential Revision: D15219029

Pulled By: maysamyabandeh

fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759
2019-05-13 17:47:21 -07:00
Sagar Vemuri 3548e4220d Improve explicit user readahead performance (#5246)
Summary:
Improve the iterators performance when the user explicitly sets the readahead size via `ReadOptions.readahead_size`.

1. Stop creating new table readers when the user explicitly sets readahead size.
2. Make use of an internal buffer based on `FilePrefetchBuffer` instead of using `ReadaheadRandomAccessFileReader`, to handle the user readahead requests (for both buffered and direct io cases).
3. Add `readahead_size` to db_bench.

**Benchmarks:**
https://gist.github.com/sagar0/53693edc320a18abeaeca94ca32f5737

For 1 MB readahead, Buffered IO performance improves by 28% and Direct IO performance improves by 50%.
For 512KB readahead, Buffered IO performance improves by 30% and Direct IO performance improves by 67%.

**Test Plan:**
Updated `DBIteratorTest.ReadAhead` test to make sure that:
- no new table readers are created for iterators on setting ReadOptions.readahead_size
- At least "readahead" number of bytes are actually getting read on each iterator read.

TODO later:
- Use similar logic for compactions as well.
- This ties in nicely with #4052 and paves the way for removing ReadaheadRandomAcessFile later.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5246

Differential Revision: D15107946

Pulled By: sagar0

fbshipit-source-id: 2c1149729ca7d779e4e8b7710ba6f4e8cbfd3bea
2019-04-26 21:24:10 -07:00
anand76 fefd4b98c5 Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.

Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency

The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.

Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).

Batch   Sizes

1        | 2        | 4         | 8      | 16  | 32

Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074        - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14        - MultiGet (w/ batching)

Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135

Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62

Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891

dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10  ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011

Differential Revision: D14348703

Pulled By: anand1976

fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
2019-04-11 14:28:26 -07:00
Siying Dong cf3a671733 Remove cuckoo hash memtable (#4953)
Summary:
Cuckoo Hash is less useful than we initially expected. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4953

Differential Revision: D13979264

Pulled By: siying

fbshipit-source-id: 2a60afdaa989f045357398b43a1cc5d46f4492ed
2019-02-07 16:15:27 -08:00
Anand Ananthabhotla b9d6eccac1 Lock free MultiGet (#4754)
Summary:
Avoid locking the DB mutex in order to reference SuperVersions. Instead, we get the thread local cached SuperVersion for each column family in the list. It depends on finding a sequence number that overlaps with all the open memtables. We start with the latest published sequence number, and if any of the memtables is sealed before we can get all the SuperVersions, the process is repeated. After a few times, give up and lock the DB mutex.

Tests:
1. Unit tests
2. make check
3. db_bench -

TEST_TMPDIR=/dev/shm ./db_bench -use_existing_db=true -benchmarks=readrandom -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=5000000 -reads=1000000 -threads=32 -compression_type=none -cache_size=1048576000 -batch_size=1 -bloom_bits=1
readrandom   :       0.167 micros/op 5983920 ops/sec;  426.2 MB/s (1000000 of 1000000 found)

Multireadrandom with batch size 1:
multireadrandom :       0.176 micros/op 5684033 ops/sec; (1000000 of 1000000 found)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4754

Differential Revision: D13363550

Pulled By: anand1976

fbshipit-source-id: 6243e8de7dbd9c8bb490a8eca385da0c855b1dd4
2019-01-02 11:42:54 -08:00
Siying Dong da1c64b6e7 Introduce a CPU time counter in perf_context (#4741)
Summary:
Introduce the first CPU timing counter, perf_context.get_cpu_nanos. This opens a door to more CPU counters in the future.
Only Posix Env has it implemented using clock_gettime() with CLOCK_THREAD_CPUTIME_ID. How accurate the counter is depends on the platform.
Make PerfStepTimer to take an Env as an argument, and sometimes pass it in. The direct reason is to make the unit tests to use SpecialEnv where we can ingest logic there. But in long term, this is a good change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4741

Differential Revision: D13287798

Pulled By: siying

fbshipit-source-id: 090361049d9d5095d1d1a369fe1338d2e2e1c73f
2018-12-20 12:03:44 -08:00
Sagar Vemuri dc3528077a Update all unique/shared_ptr instances to be qualified with namespace std (#4638)
Summary:
Ran the following commands to recursively change all the files under RocksDB:
```
find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
```
Running `make format` updated some formatting on the files touched.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638

Differential Revision: D12934992

Pulled By: sagar0

fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
2018-11-09 11:19:58 -08:00
Yanqin Jin de18a2d82e Update test to cover a new case in file ingestion (#4614)
Summary:
The new case is directIO = true, write_global_seqno = false in which we no longer write global_seqno to the external SST file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4614

Differential Revision: D12885001

Pulled By: riversand963

fbshipit-source-id: 7541bdc608b3a0c93d3c3c435da1b162b36673d4
2018-11-01 16:23:49 -07:00
Bo Hou cd9404bb77 xxhash 64 support
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4607

Reviewed By: siying

Differential Revision: D12836696

Pulled By: jsjhoubo

fbshipit-source-id: 7122ccb712d0b0f1cd998aa4477e0da1401bd870
2018-11-01 15:44:06 -07:00
Yanqin Jin 5b4c709fad Enable atomic flush (#4023)
Summary:
Adds a DB option `atomic_flush` to control whether to enable this feature. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4023

Differential Revision: D8518381

Pulled By: riversand963

fbshipit-source-id: 1e3bb33e99bb102876a31b378d93b0138ff6634f
2018-10-26 15:08:43 -07:00
Yi Wu 933250e355 Fix RepeatableThreadTest::MockEnvTest hang (#4560)
Summary:
When `MockTimeEnv` is used in test to mock time methods, we cannot use `CondVar::TimedWait` because it is using real time, not the mocked time for wait timeout. On Mac the method can return immediately without awaking other waiting threads, if the real time is larger than `wait_until` (which is a mocked time). When that happen, the `wait()` method will fall into an infinite loop.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4560

Differential Revision: D10472851

Pulled By: yiwu-arbug

fbshipit-source-id: 898902546ace7db7ac509337dd8677a527209d19
2018-10-21 20:17:18 -07:00
Zhongyi Xie cac87fcf57 move dump stats to a separate thread (#4382)
Summary:
Currently statistics are supposed to be dumped to info log at intervals of `options.stats_dump_period_sec`. However the implementation choice was to bind it with compaction thread, meaning if the database has been serving very light traffic, the stats may not get dumped at all.
We decided to separate stats dumping into a new timed thread using `TimerQueue`, which is already used in blob_db. This will allow us schedule new timed tasks with more deterministic behavior.

Tested with db_bench using `--stats_dump_period_sec=20` in command line:
> LOG:2018/09/17-14:07:45.575025 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
LOG:2018/09/17-14:08:05.643286 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
LOG:2018/09/17-14:08:25.691325 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
LOG:2018/09/17-14:08:45.740989 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------

LOG content:
> 2018/09/17-14:07:45.575025 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
2018/09/17-14:07:45.575080 7fe99fbfe700 [WARN] [db/db_impl.cc:606]
** DB Stats **
Uptime(secs): 20.0 total, 20.0 interval
Cumulative writes: 4447K writes, 4447K keys, 4447K commit groups, 1.0 writes per commit group, ingest: 5.57 GB, 285.01 MB/s
Cumulative WAL: 4447K writes, 0 syncs, 4447638.00 writes per sync, written: 5.57 GB, 285.01 MB/s
Cumulative stall: 00:00:0.012 H:M:S, 0.1 percent
Interval writes: 4447K writes, 4447K keys, 4447K commit groups, 1.0 writes per commit group, ingest: 5700.71 MB, 285.01 MB/s
Interval WAL: 4447K writes, 0 syncs, 4447638.00 writes per sync, written: 5.57 MB, 285.01 MB/s
Interval stall: 00:00:0.012 H:M:S, 0.1 percent
** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4382

Differential Revision: D9933051

Pulled By: miasantreble

fbshipit-source-id: 6d12bb1e4977674eea4bf2d2ac6d486b814bb2fa
2018-10-08 22:54:43 -07:00
Yi Wu dc813e4b85 Improve log handling when recover without flush (#4405)
Summary:
Improve log handling when avoid_flush_during_recovery=true.
1. restore total_log_size_ after recovery, by summing up existing log sizes. Fixes #4253.
2. truncate the last existing log, since this log can contain preallocated space and it will be a waste to keep the space. It avoids a crash loop of user application cause a lot of log with non-trivial size being created and ultimately take up all disk space.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4405

Differential Revision: D9953933

Pulled By: yiwu-arbug

fbshipit-source-id: 967780fee8acec7f358b6eb65190fb4684f82e56
2018-09-26 10:37:48 -07:00
Maysam Yabandeh 65ac72edd9 Fix bug in partition filters with format_version=4 (#4381)
Summary:
Value delta encoding in format_version 4 requires the differences between the size of two consecutive handles to be sent to BlockBuilder::Add. This applies not only to indexes on blocks but also the indexes on indexes and filters in partitioned indexes and filters respectively. The patch fixes a bug where the partitioned filters would encode the entire size of the handle rather than the difference of the size with the last size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4381

Differential Revision: D9879505

Pulled By: maysamyabandeh

fbshipit-source-id: 27a22e49b482b927fbd5629dc310c46d63d4b6d1
2018-09-17 17:28:15 -07:00
Dan Melnic ca92fc71a4 Initialize uninitialized std::atomic variables
Summary: Initialize uninitialized std::atomic variables

Reviewed By: yfeldblum

Differential Revision: D9758050

fbshipit-source-id: 865d89eddafc81f3cab6f11e2ebb669f7ff70d04
2018-09-12 08:58:05 -07:00
Maysam Yabandeh caf0f53a74 Index value delta encoding (#3983)
Summary:
Given that index value is a BlockHandle, which is basically an <offset, size> pair we can apply delta encoding on the values. The first value at each index restart interval encoded the full BlockHandle but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the detail of the encoding. This reduces the index size which helps using the  block cache more efficiently. The feature is enabled with using format_version 4.

The feature comes with a bit of cpu overhead which should be paid back by the higher cache hits due to smaller index block size.
Results with sysbench read-only using 4k blocks and using 16 index restart interval:
Format 2:
19585   rocksdb read-only range=100
Format 3:
19569   rocksdb read-only range=100
Format 4:
19352   rocksdb read-only range=100
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983

Differential Revision: D8361343

Pulled By: maysamyabandeh

fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651
2018-08-09 16:58:40 -07:00
Nathan VanBenschoten ef7815b803 Support range deletion tombstones in IngestExternalFile SSTs (#3778)
Summary:
Fixes #3391.

This change adds a `DeleteRange` method to `SstFileWriter` and adds
support for ingesting SSTs with range deletion tombstones. This is
important for applications that need to atomically ingest SSTs while
clearing out any existing keys in a given key range.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3778

Differential Revision: D8821836

Pulled By: anand1976

fbshipit-source-id: ca7786c1947ff129afa703dab011d524c7883844
2018-07-13 22:43:09 -07:00
Anand Ananthabhotla e3eba52a5d Re-enable kUniversalSubcompactions option_config (#4125)
Summary:
1. Move kUniversalSubcompactions up before kEnd in db_test_util.h, so
tests that cycle through all the option_configs include this
2. Skip kUniversalSubcompactions wherever kUniversalCompaction and
kUniversalCompactionMultilevel are skipped

Related to #3935
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4125

Differential Revision: D8828637

Pulled By: anand1976

fbshipit-source-id: 650dee15fd27d85281cf9bb4ca8ab460e04cac6f
2018-07-13 11:13:01 -07:00
Maysam Yabandeh d0c38c0c8c Extend some tests to format_version=3 (#3942)
Summary:
format_version=3 changes the format of SST index. This is however not being tested currently since tests only work with the default format_version which is currently 2. The patch extends the most related tests to also test for format_version=3.
Closes https://github.com/facebook/rocksdb/pull/3942

Differential Revision: D8238413

Pulled By: maysamyabandeh

fbshipit-source-id: 915725f55753dd8e9188e802bf471c23645ad035
2018-06-04 20:13:00 -07:00
Maysam Yabandeh 44cf84932f Fix the bug of some test scenarios being put after kEnd
Summary:
DBTestBase::OptionConfig includes the scenarios that unit tests could iterate over them by calling ChangeOptions(). Some of the options have  been mistakenly put after kEnd which makes them essentially invisible to ChangeOptions() caller. This patch fixes it except for kUniversalSubcompactions which is left as TODO since it would break some unit tests.
Closes https://github.com/facebook/rocksdb/pull/3935

Differential Revision: D8230748

Pulled By: maysamyabandeh

fbshipit-source-id: edddb8fffcd161af1809fef24798ce118f8593db
2018-05-31 19:28:00 -07:00
Siying Dong d59549298f Skip deleted WALs during recovery
Summary:
This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic.

Before the commit, for 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), record this information to the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we loss with this new approach is earlier log deletion between two flushes, which does not guarantee to happen anyway because the obsolete file clean-up function is only executed after flush or compaction)

This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2.
Closes https://github.com/facebook/rocksdb/pull/3765

Differential Revision: D7747618

Pulled By: siying

fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
2018-05-03 15:43:09 -07:00
Siying Dong d5afa73789 Revert "Skip deleted WALs during recovery"
Summary:
This reverts commit 73f21a7b21.

It breaks compatibility. When created a DB using a build with this new change, opening the DB and reading the data will fail with this error:

"Corruption: Can't access /000000.sst: IO error: while stat a file for size: /tmp/xxxx/000000.sst: No such file or directory"

This is because the dummy AddFile4 entry generated by the new code will be treated as a real entry by an older build. The older build will think there is a real file with number 0, but there isn't such a file.
Closes https://github.com/facebook/rocksdb/pull/3762

Differential Revision: D7730035

Pulled By: siying

fbshipit-source-id: f2051859eff20ef1837575ecb1e1bb96b3751e77
2018-04-23 12:01:26 -07:00
Zhongyi Xie 2770a94c42 make MockTimeEnv::current_time_ atomic to fix data race
Summary:
fix a new TSAN failure
https://gist.github.com/miasantreble/7599c33f4e17da1024c67d4540dbe397
Closes https://github.com/facebook/rocksdb/pull/3694

Differential Revision: D7565310

Pulled By: miasantreble

fbshipit-source-id: f672c96e925797b34dec6e20b59527e8eebaa825
2018-04-10 14:13:18 -07:00
Phani Shekhar Mantripragada 446b32cfc3 Support for Column family specific paths.
Summary:
In this change, an option to set different paths for different column families is added.
This option is set via cf_paths setting of ColumnFamilyOptions. This option will work in a similar fashion to db_paths setting. Cf_paths is a vector of Dbpath values which contains a pair of the absolute path and target size. Multiple levels in a Column family can go to different paths if cf_paths has more than one path.
To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path.

Changes :
1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions.  This member is used to identify the path information whenever files are accessed.
2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting.
3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths.
4) Unit tests are added appropriately.
Closes https://github.com/facebook/rocksdb/pull/3102

Differential Revision: D6951697

Pulled By: ajkr

fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d
2018-04-05 19:58:20 -07:00
Maysam Yabandeh 73f21a7b21 Skip deleted WALs during recovery
Summary:
This patch record the deleted WAL numbers in the manifest to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic.
Closes https://github.com/facebook/rocksdb/pull/3488

Differential Revision: D6967893

Pulled By: maysamyabandeh

fbshipit-source-id: 13119feb155a08ab6d4909f437c7a750480dc8a1
2018-03-30 11:28:05 -07:00
Andrew Kryczka 5d68243e61 Comment out unused variables
Summary:
Submitting on behalf of another employee.
Closes https://github.com/facebook/rocksdb/pull/3557

Differential Revision: D7146025

Pulled By: ajkr

fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e
2018-03-05 13:13:41 -08:00
Igor Sugak aba3409740 Back out "[codemod] - comment out unused parameters"
Reviewed By: igorsugak

fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37
2018-02-22 12:43:17 -08:00
David Lai f4a030ce81 - comment out unused parameters
Reviewed By: everiq, igorsugak

Differential Revision: D7046710

fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461
2018-02-22 09:44:23 -08:00
Maysam Yabandeh 8eb1d445c3 Unbreak MemTableRep API change
Summary:
The MemTableRep API was broken by this commit: 813719e952
This patch reverts the changes and instead adds InsertKey (and etc.) overloads to extend the MemTableRep API without breaking the existing classes that inherit from it.
Closes https://github.com/facebook/rocksdb/pull/3513

Differential Revision: D7004134

Pulled By: maysamyabandeh

fbshipit-source-id: e568d91fe1e17dd76c0c1f6c7dd51a18633b1c4f
2018-02-15 17:27:24 -08:00
Maysam Yabandeh 813719e952 WritePrepared Txn: Duplicate Keys, Memtable part
Summary:
Currently DB does not accept duplicate keys (keys with the same user key and the same sequence number). If Memtable returns false when receiving such keys, we can benefit from this signal to properly increase the sequence number in the rare cases when we have a duplicate key in the write batch written to DB under WritePrepared transactions.
Closes https://github.com/facebook/rocksdb/pull/3418

Differential Revision: D6822412

Pulled By: maysamyabandeh

fbshipit-source-id: adea3ce5073131cd38ed52b16bea0673b1a19e77
2018-01-31 18:57:07 -08:00
Yi Wu dc360df81e Fix multiple build failures
Summary:
* Fix DBTest.CompactRangeWithEmptyBottomLevel lite build failure
* Fix DBTest.AutomaticConflictsWithManualCompaction failure introduce by #3366
* Fix BlockBasedTableTest::IndexUncompressed should be disabled if snappy is disabled
* Fix ASAN failure with DBBasicTest::DBClose test
Closes https://github.com/facebook/rocksdb/pull/3373

Differential Revision: D6732313

Pulled By: yiwu-arbug

fbshipit-source-id: 1eb9b9d9a8d795f56188fa9770db9353f6fdedc5
2018-01-16 17:30:39 -08:00
Zhongyi Xie 5fac4729cc make compaction_readahead_size_ thread safe
Summary:
this should fix the failing tsan_check
Closes https://github.com/facebook/rocksdb/pull/3192

Differential Revision: D6390004

Pulled By: miasantreble

fbshipit-source-id: 6cadfc6f68febb1a77b0abcdb5416570dad926a5
2017-11-21 20:11:38 -08:00
Zhongyi Xie 32e31d49d1 Make DBOption compaction_readahead_size dynamic
Summary: Closes https://github.com/facebook/rocksdb/pull/3004

Differential Revision: D6056141

Pulled By: miasantreble

fbshipit-source-id: 56df1630f464fd56b07d25d38161f699e0528b7f
2017-11-16 17:57:25 -08:00
Mikhail Antonov 7fe3b32896 Added support for differential snapshots
Summary:
The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2).

This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages.

From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff".

This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR.

For now, what's done here according to initial discussions:

Preserving deletes:
 - We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion.
 - I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum.
 - Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum.

Iterator changes:
 - couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum.

TableCache changes:
 - I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span.

What's left:

 - Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type.
Closes https://github.com/facebook/rocksdb/pull/2999

Differential Revision: D6175602

Pulled By: mikhail-antonov

fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900
2017-11-01 18:56:43 -07:00
Yi Wu 66a2c44ef4 Add DB::Properties::kEstimateOldestKeyTime
Summary:
With FIFO compaction we would like to get the oldest data time for monitoring. The problem is we don't have timestamp for each key in the DB. As an approximation, we expose the earliest of sst file "creation_time" property.

My plan is to override the property with a more accurate value with blob db, where we actually have timestamp.
Closes https://github.com/facebook/rocksdb/pull/2842

Differential Revision: D5770600

Pulled By: yiwu-arbug

fbshipit-source-id: 03833c8f10bbfbee62f8ea5c0d03c0cafb5d853a
2017-10-23 15:27:27 -07:00
Sagar Vemuri 72502cf227 Revert "comment out unused parameters"
Summary:
This reverts the previous commit 1d7048c598, which broke the build.

Did a `git revert 1d7048c`.
Closes https://github.com/facebook/rocksdb/pull/2627

Differential Revision: D5476473

Pulled By: sagar0

fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06
2017-07-21 18:26:26 -07:00
Victor Gao 1d7048c598 comment out unused parameters
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.

Reviewed By: igorsugak

Differential Revision: D5454343

fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
2017-07-21 14:57:44 -07:00
Sushma Devendrappa 0655b58582 enable PinnableSlice for RowCache
Summary:
This patch enables using PinnableSlice for RowCache, changes include
not releasing the cache handle immediately after lookup in TableCache::Get, instead pass a Cleanble function which does Cache::RleaseHandle.
Closes https://github.com/facebook/rocksdb/pull/2492

Differential Revision: D5316216

Pulled By: maysamyabandeh

fbshipit-source-id: d2a684bd7e4ba73772f762e58a82b5f4fbd5d362
2017-07-17 15:08:30 -07:00
Siying Dong 3c327ac2d0 Change RocksDB License
Summary: Closes https://github.com/facebook/rocksdb/pull/2589

Differential Revision: D5431502

Pulled By: siying

fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75
2017-07-15 16:11:23 -07:00
Ewout Prangsma 51778612c9 Encryption at rest support
Summary:
This PR adds support for encrypting data stored by RocksDB when written to disk.

It adds an `EncryptedEnv` override of the `Env` class with matching overrides for sequential&random access files.
The encryption itself is done through a configurable `EncryptionProvider`. This class creates is asked to create `BlockAccessCipherStream` for a file. This is where the actual encryption/decryption is being done.
Currently there is a Counter mode implementation of `BlockAccessCipherStream` with a `ROT13` block cipher (NOTE the `ROT13` is for demo purposes only!!).

The Counter operation mode uses an initial counter & random initialization vector (IV).
Both are created randomly for each file and stored in a 4K (default size) block that is prefixed to that file. The `EncryptedEnv` implementation is such that clients of the `Env` class do not see this prefix (nor data, nor in filesize).
The largest part of the prefix block is also encrypted, and there is room left for implementation specific settings/values/keys in there.

To test the encryption, the `DBTestBase` class has been extended to consider a new environment variable called `ENCRYPTED_ENV`. If set, the test will setup a encrypted instance of the `Env` class to use for all tests.
Typically you would run it like this:

```
ENCRYPTED_ENV=1 make check_some
```

There is also an added test that checks that some data inserted into the database is or is not "visible" on disk. With `ENCRYPTED_ENV` active it must not find plain text strings, with `ENCRYPTED_ENV` unset, it must find the plain text strings.
Closes https://github.com/facebook/rocksdb/pull/2424

Differential Revision: D5322178

Pulled By: sdwilsh

fbshipit-source-id: 253b0a9c2c498cc98f580df7f2623cbf7678a27f
2017-06-26 16:56:24 -07:00
Siying Dong 5c97a7c066 Unit Tests for sync, range sync and file close failures
Summary: Closes https://github.com/facebook/rocksdb/pull/2454

Differential Revision: D5255320

Pulled By: siying

fbshipit-source-id: 0080830fa8eb5da6de25e17ba68aee91018c7913
2017-06-26 13:27:58 -07:00
Maysam Yabandeh 499ebb3ab5 Optimize for serial commits in 2PC
Summary:
Throughput: 46k tps in our sysbench settings (filling the details later)

The idea is to have the simplest change that gives us a reasonable boost
in 2PC throughput.

Major design changes:
1. The WAL file internal buffer is not flushed after each write. Instead
it is flushed before critical operations (WAL copy via fs) or when
FlushWAL is called by MySQL. Flushing the WAL buffer is also protected
via mutex_.
2. Use two sequence numbers: last seq, and last seq for write. Last seq
is the last visible sequence number for reads. Last seq for write is the
next sequence number that should be used to write to WAL/memtable. This
allows to have a memtable write be in parallel to WAL writes.
3. BatchGroup is not used for writes. This means that we can have
parallel writers which changes a major assumption in the code base. To
accommodate for that i) allow only 1 WriteImpl that intends to write to
memtable via mem_mutex_--which is fine since in 2PC almost all of the memtable writes
come via group commit phase which is serial anyway, ii) make all the
parts in the code base that assumed to be the only writer (via
EnterUnbatched) to also acquire mem_mutex_, iii) stat updates are
protected via a stat_mutex_.

Note: the first commit has the approach figured out but is not clean.
Submitting the PR anyway to get the early feedback on the approach. If
we are ok with the approach I will go ahead with this updates:
0) Rebase with Yi's pipelining changes
1) Currently batching is disabled by default to make sure that it will be
consistent with all unit tests. Will make this optional via a config.
2) A couple of unit tests are disabled. They need to be updated with the
serial commit of 2PC taken into account.
3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires
releasing mutex_ beforehand (the same way EnterUnbatched does). This
needs to be cleaned up.
Closes https://github.com/facebook/rocksdb/pull/2345

Differential Revision: D5210732

Pulled By: maysamyabandeh

fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4
2017-06-24 14:11:29 -07:00
Siying Dong 95b0e89b5d Improve write buffer manager (and allow the size to be tracked in block cache)
Summary:
Improve write buffer manager in several ways:
1. Size is tracked when arena block is allocated, rather than every allocation, so that it can better track actual memory usage and the tracking overhead is slightly lower.
2. We start to trigger memtable flush when 7/8 of the memory cap hits, instead of 100%, and make 100% much harder to hit.
3. Allow a cache object to be passed into buffer manager and the size allocated by memtable can be costed there. This can help users have one single memory cap across block cache and memtable.
Closes https://github.com/facebook/rocksdb/pull/2350

Differential Revision: D5110648

Pulled By: siying

fbshipit-source-id: b4238113094bf22574001e446b5d88523ba00017
2017-06-02 14:26:56 -07:00
Andrew Kryczka a4d9c02511 Pass CF ID to MemTableRepFactory
Summary:
Some users want to monitor column family activity in their custom memtable implementations. Previously there was no way to figure out with which column family a memtable is associated. This diff:

- adds an overload to MemTableRepFactory::CreateMemTableRep() that provides the CF ID. For compatibility, its default implementation calls the old overload.
- updates MemTable to create MemTableRep's using the new overload.
Closes https://github.com/facebook/rocksdb/pull/2346

Differential Revision: D5108061

Pulled By: ajkr

fbshipit-source-id: 3a1921214a348dd8ea0f54e1cab3b71c3d46d616
2017-06-02 12:12:06 -07:00
Yi Wu ad19eb8686 Fixing blob db sequence number handling
Summary:
Blob db rely on base db returning sequence number through write batch after DB::Write(). However after recent changes to the write path, DB::Writ()e no longer return sequence number in some cases. Fixing it by have WriteBatchInternal::InsertInto() always encode sequence number into write batch.

Stacking on #2375.
Closes https://github.com/facebook/rocksdb/pull/2385

Differential Revision: D5148358

Pulled By: yiwu-arbug

fbshipit-source-id: 8bda0aa07b9334ed03ed381548b39d167dc20c33
2017-05-31 10:56:45 -07:00
Tamir Duberstein 103d0692ea Avoid unsupported attributes when not building with UBSAN
Summary:
yiwu-arbug see individual commits.
Closes https://github.com/facebook/rocksdb/pull/2318

Differential Revision: D5141520

Pulled By: yiwu-arbug

fbshipit-source-id: 7987c92ab4461eef36afce5a133d3a0ee0c96300
2017-05-30 11:13:01 -07:00
Aaron Gao f7bb1a0060 support merge and delete in file ingestion
Summary:
Previously sst_file_writer only supports kTypeValue, we need kTypeMerge and kTypeDeletion also as user requested.
Closes https://github.com/facebook/rocksdb/pull/2361

Differential Revision: D5139402

Pulled By: lightmark

fbshipit-source-id: 092a60756d01692539d817a3765ebfd58a8d7f88
2017-05-26 12:11:21 -07:00
Aaron Gao 3e86c0f07c disable direct reads for log and manifest and add direct io to tests
Summary:
Disable direct reads for log and manifest. Direct reads should not affect sequential_file
Also add kDirectIO for option_config_ in db_test_util
Closes https://github.com/facebook/rocksdb/pull/2337

Differential Revision: D5100261

Pulled By: lightmark

fbshipit-source-id: 0ebfd13b93fa1b8f9acae514ac44f8125a05868b
2017-05-22 18:41:28 -07:00
Andrew Kryczka 7c1c8ce5ac Avoid calling fallocate with UINT64_MAX
Summary:
When user doesn't set a limit on compaction output file size, let's use the sum of the input files' sizes. This will avoid passing UINT64_MAX as fallocate()'s length. Reported in #2249.

Test setup:
- command: `TEST_TMPDIR=/data/rocksdb-test/ strace -e fallocate ./db_compaction_test --gtest_filter=DBCompactionTest.ManualCompactionUnknownOutputSize`
- filesystem: xfs

before this diff:
`fallocate(10, 01, 0, 1844674407370955160) = -1 ENOSPC (No space left on device)`

after this diff:
`fallocate(10, 01, 0, 1977)              = 0`
Closes https://github.com/facebook/rocksdb/pull/2252

Differential Revision: D5007275

Pulled By: ajkr

fbshipit-source-id: 4491404a6ae8a41328aede2e2d6f4d9ac3e38880
2017-05-04 17:43:22 -07:00
Siying Dong d2dce5611a Move some files under util/ to separate dirs
Summary:
Move some files under util/ to new directories env/, monitoring/ options/ and cache/
Closes https://github.com/facebook/rocksdb/pull/2090

Differential Revision: D4833681

Pulled By: siying

fbshipit-source-id: 2fd8bef
2017-04-05 19:09:16 -07:00
Siying Dong 6ef8c620d3 Move auto_roll_logger and filename out of db/
Summary:
It is confusing to have auto_roll_logger to stay under db/, which has nothing to do with database. Move filename together as it is a dependency.
Closes https://github.com/facebook/rocksdb/pull/2080

Differential Revision: D4821141

Pulled By: siying

fbshipit-source-id: ca7d768
2017-04-03 18:39:14 -07:00
Maysam Yabandeh e7731d119a Configure index partition size
Summary:
Allow the users to specify the target index partition size.

With this patch an index partition is cut before its estimated in-memory size goes above the configured value for metadata_block_size. The filter partitions are still cut right after an index partition is cut.
Closes https://github.com/facebook/rocksdb/pull/2041

Differential Revision: D4780216

Pulled By: maysamyabandeh

fbshipit-source-id: 95a0831
2017-03-28 12:09:12 -07:00
Maysam Yabandeh 8b0097b49b Readers for partition filter
Summary:
This is the last split of this pull request: https://github.com/facebook/rocksdb/pull/1891 which includes the reader part as well as the tests.
Closes https://github.com/facebook/rocksdb/pull/1961

Differential Revision: D4672216

Pulled By: maysamyabandeh

fbshipit-source-id: 6a2b829
2017-03-22 09:24:15 -07:00
Maysam Yabandeh e6725e8c8d Fix some bugs in MockEnv
Summary:
Fixing some bugs in MockEnv so it be actually used.
Closes https://github.com/facebook/rocksdb/pull/1914

Differential Revision: D4609923

Pulled By: maysamyabandeh

fbshipit-source-id: ca25735
2017-03-13 09:54:11 -07:00
Siying Dong d5b607a43f Make db_wal_test slightly faster
Summary:
Avoid to run db_wal_test in all the DB test options, and some small changes.
Closes https://github.com/facebook/rocksdb/pull/1921

Differential Revision: D4622054

Pulled By: siying

fbshipit-source-id: 890fd64
2017-02-28 17:39:10 -08:00
Siying Dong 1ba2804b7f Remove XFunc tests
Summary:
Xfunc is hardly used. Remove it to keep the code simple.
Closes https://github.com/facebook/rocksdb/pull/1905

Differential Revision: D4603220

Pulled By: siying

fbshipit-source-id: 731f96d
2017-02-23 12:09:11 -08:00
Aaron Gao 0824934423 truncate patch
Summary:
omit the override for the previous commit
Closes https://github.com/facebook/rocksdb/pull/1898

Differential Revision: D4598743

Pulled By: lightmark

fbshipit-source-id: f98a378
2017-02-22 10:39:11 -08:00
Aaron Gao 286a36db7f posix writablefile truncate
Summary:
we occasionally missing this call so the file size will be wrong
Closes https://github.com/facebook/rocksdb/pull/1894

Differential Revision: D4598446

Pulled By: lightmark

fbshipit-source-id: 42b6ef5
2017-02-22 10:09:14 -08:00
Maysam Yabandeh 69d5262c81 Two-level Indexes
Summary:
Partition Index blocks and use a Partition-index as a 2nd level index.

The two-level index can be used by setting
BlockBasedTableOptions::kTwoLevelIndexSearch as the index type and
configuring BlockBasedTableOptions::index_per_partition

t15539501
Closes https://github.com/facebook/rocksdb/pull/1814

Differential Revision: D4473535

Pulled By: maysamyabandeh

fbshipit-source-id: bffb87e
2017-02-06 16:39:12 -08:00
Islam AbdelRahman c963460dbc Fix tests under GCC_481
Summary:
This fix the issue with tests failing under GCC 481, I am not sure what is the exact reason
Closes https://github.com/facebook/rocksdb/pull/1735

Differential Revision: D4374094

Pulled By: IslamAbdelRahman

fbshipit-source-id: b3625bc
2017-01-03 17:54:12 -08:00
Aaron Gao 972f96b3fb direct io write support
Summary:
rocksdb direct io support

```
[gzh@dev11575.prn2 ~/rocksdb] ./db_bench -benchmarks=fillseq --num=1000000
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 5.0
Date:       Wed Nov 23 13:17:43 2016
CPU:        40 * Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
CPUCache:   25600 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Compression: Snappy
Memtablerep: skip_list
Perf Level: 1
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
DB path: [/tmp/rocksdbtest-112628/dbbench]
fillseq      :       4.393 micros/op 227639 ops/sec;   25.2 MB/s

[gzh@dev11575.prn2 ~/roc
Closes https://github.com/facebook/rocksdb/pull/1564

Differential Revision: D4241093

Pulled By: lightmark

fbshipit-source-id: 98c29e3
2016-12-22 13:09:19 -08:00
Yi Wu c270735861 Iterator should be in corrupted status if merge operator return false
Summary:
Iterator should be in corrupted status if merge operator return false.
Also add test to make sure if max_successive_merges is hit during write,
data will not be lost.
Closes https://github.com/facebook/rocksdb/pull/1665

Differential Revision: D4322695

Pulled By: yiwu-arbug

fbshipit-source-id: b327b05
2016-12-16 11:09:16 -08:00
Maysam Yabandeh 182b940e70 Add WriteOptions.no_slowdown
Summary:
If the WriteOptions.no_slowdown flag is set AND we need to wait or sleep for
the write request, then fail immediately with Status::Incomplete().
Closes https://github.com/facebook/rocksdb/pull/1527

Differential Revision: D4191405

Pulled By: maysamyabandeh

fbshipit-source-id: 7f3ce3f
2016-11-21 18:09:13 -08:00
Siying Dong 972e3ff295 Enable allow_concurrent_memtable_write and enable_write_thread_adaptive_yield by default
Summary: Closes https://github.com/facebook/rocksdb/pull/1496

Differential Revision: D4168080

Pulled By: siying

fbshipit-source-id: 056ae62
2016-11-16 09:39:09 -08:00
Islam AbdelRahman 869ae5d786 Support IngestExternalFile (remove AddFile restrictions)
Summary:
Changes in the diff

API changes:
- Introduce IngestExternalFile to replace AddFile (I think this make the API more clear)
- Introduce IngestExternalFileOptions (This struct will encapsulate the options for ingesting the external file)
- Deprecate AddFile() API

Logic changes:
- If our file overlap with the memtable we will flush the memtable
- We will find the first level in the LSM tree that our file key range overlap with the keys in it
- We will find the lowest level in the LSM tree above the the level we found in step 2 that our file can fit in and ingest our file in it
- We will assign a global sequence number to our new file
- Remove AddFile restrictions by using global sequence numbers

Other changes:
- Refactor all AddFile logic to be encapsulated in ExternalSstFileIngestionJob

Test Plan:
unit tests (still need to add more)
addfile_stress (https://reviews.facebook.net/D65037)

Reviewers: yiwu, andrewkr, lightmark, yhchiang, sdong

Reviewed By: sdong

Subscribers: jkedgar, hcz, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65061
2016-10-20 17:05:32 -07:00
Islam AbdelRahman b88f8e87c5 Support SST files with Global sequence numbers [reland]
Summary:
reland https://reviews.facebook.net/D62523

- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file

Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks

Test Plan: unit tests

Reviewers: sdong, yhchiang

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65211
2016-10-18 16:59:37 -07:00
sdong b666f85445 Consider more factors when determining preallocation size of WAL files
Summary: Currently the WAL file preallocation size is 1.1 * write_buffer_size. This, however, will be over-estimated if options.db_write_buffer_size or options.max_total_wal_size is set and is much smaller.

Test Plan: Add a unit test.

Reviewers: andrewkr, yiwu

Reviewed By: yiwu

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D63957
2016-09-19 12:04:35 -07:00
Edouard A 66a91e2607 Add NoSpace subcode to IOError (#1320)
Add a sub code to distinguish "out of space" errors from regular I/O errors
2016-09-07 12:37:45 -07:00
Islam AbdelRahman d11c09d9e2 Eliminate memcpy from ForwardIterator
Summary:
This diff update ForwardIterator to support pinning keys and values, which will allow DBIter to take advantage of that and eliminate memcpy when executing merge operators
This diff is stacked on D61305

Test Plan:
existing tests (updated them to test tailing iterator)
new test

Reviewers: andrewkr, yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60009
2016-08-11 19:10:16 -07:00
sdong 56dd034115 read_options.background_purge_on_iterator_cleanup to cover forward iterator and log file closing too.
Summary: With read_options.background_purge_on_iterator_cleanup=true, File deletion and closing can still happen in forward iterator, or WAL file closing. Cover those cases too.

Test Plan: I am adding unit tests.

Reviewers: andrewkr, IslamAbdelRahman, yiwu

Reviewed By: yiwu

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61503
2016-08-10 13:16:41 -07:00
Islam AbdelRahman 68a8e6b8fa Introduce FullMergeV2 (eliminate memcpy from merge operators)
Summary:
This diff update the code to pin the merge operator operands while the merge operation is done, so that we can eliminate the memcpy cost, to do that we need a new public API for FullMerge that replace the std::deque<std::string> with std::vector<Slice>

This diff is stacked on top of D56493 and D56511

In this diff we
- Update FullMergeV2 arguments to be encapsulated in MergeOperationInput and MergeOperationOutput which will make it easier to add new arguments in the future
- Replace std::deque<std::string> with std::vector<Slice> to pass operands
- Replace MergeContext std::deque with std::vector (based on a simple benchmark I ran https://gist.github.com/IslamAbdelRahman/78fc86c9ab9f52b1df791e58943fb187)
- Allow FullMergeV2 output to be an existing operand

```
[Everything in Memtable | 10K operands | 10 KB each | 1 operand per key]

DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=10000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000

[FullMergeV2]
readseq      :       0.607 micros/op 1648235 ops/sec; 16121.2 MB/s
readseq      :       0.478 micros/op 2091546 ops/sec; 20457.2 MB/s
readseq      :       0.252 micros/op 3972081 ops/sec; 38850.5 MB/s
readseq      :       0.237 micros/op 4218328 ops/sec; 41259.0 MB/s
readseq      :       0.247 micros/op 4043927 ops/sec; 39553.2 MB/s

[master]
readseq      :       3.935 micros/op 254140 ops/sec; 2485.7 MB/s
readseq      :       3.722 micros/op 268657 ops/sec; 2627.7 MB/s
readseq      :       3.149 micros/op 317605 ops/sec; 3106.5 MB/s
readseq      :       3.125 micros/op 320024 ops/sec; 3130.1 MB/s
readseq      :       4.075 micros/op 245374 ops/sec; 2400.0 MB/s
```

```
[Everything in Memtable | 10K operands | 10 KB each | 10 operand per key]

DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=1000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000

[FullMergeV2]
readseq      :       3.472 micros/op 288018 ops/sec; 2817.1 MB/s
readseq      :       2.304 micros/op 434027 ops/sec; 4245.2 MB/s
readseq      :       1.163 micros/op 859845 ops/sec; 8410.0 MB/s
readseq      :       1.192 micros/op 838926 ops/sec; 8205.4 MB/s
readseq      :       1.250 micros/op 800000 ops/sec; 7824.7 MB/s

[master]
readseq      :      24.025 micros/op 41623 ops/sec;  407.1 MB/s
readseq      :      18.489 micros/op 54086 ops/sec;  529.0 MB/s
readseq      :      18.693 micros/op 53495 ops/sec;  523.2 MB/s
readseq      :      23.621 micros/op 42335 ops/sec;  414.1 MB/s
readseq      :      18.775 micros/op 53262 ops/sec;  521.0 MB/s

```

```
[Everything in Block cache | 10K operands | 10 KB each | 1 operand per key]

[FullMergeV2]
$ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions
readseq      :      14.741 micros/op 67837 ops/sec;  663.5 MB/s
readseq      :       1.029 micros/op 971446 ops/sec; 9501.6 MB/s
readseq      :       0.974 micros/op 1026229 ops/sec; 10037.4 MB/s
readseq      :       0.965 micros/op 1036080 ops/sec; 10133.8 MB/s
readseq      :       0.943 micros/op 1060657 ops/sec; 10374.2 MB/s

[master]
readseq      :      16.735 micros/op 59755 ops/sec;  584.5 MB/s
readseq      :       3.029 micros/op 330151 ops/sec; 3229.2 MB/s
readseq      :       3.136 micros/op 318883 ops/sec; 3119.0 MB/s
readseq      :       3.065 micros/op 326245 ops/sec; 3191.0 MB/s
readseq      :       3.014 micros/op 331813 ops/sec; 3245.4 MB/s
```

```
[Everything in Block cache | 10K operands | 10 KB each | 10 operand per key]

DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10-operands-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions

[FullMergeV2]
readseq      :      24.325 micros/op 41109 ops/sec;  402.1 MB/s
readseq      :       1.470 micros/op 680272 ops/sec; 6653.7 MB/s
readseq      :       1.231 micros/op 812347 ops/sec; 7945.5 MB/s
readseq      :       1.091 micros/op 916590 ops/sec; 8965.1 MB/s
readseq      :       1.109 micros/op 901713 ops/sec; 8819.6 MB/s

[master]
readseq      :      27.257 micros/op 36687 ops/sec;  358.8 MB/s
readseq      :       4.443 micros/op 225073 ops/sec; 2201.4 MB/s
readseq      :       5.830 micros/op 171526 ops/sec; 1677.7 MB/s
readseq      :       4.173 micros/op 239635 ops/sec; 2343.8 MB/s
readseq      :       4.150 micros/op 240963 ops/sec; 2356.8 MB/s
```

Test Plan: COMPILE_WITH_ASAN=1 make check -j64

Reviewers: yhchiang, andrewkr, sdong

Reviewed By: sdong

Subscribers: lovro, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57075
2016-07-20 09:49:03 -07:00
Islam AbdelRahman fa813f7478 Update DB::AddFile() to ingest the file to the lowest possible level
Summary:
DB::AddFile() right now always add the ingested file to L0
update the logic to add the file to the lowest possible level

Test Plan: unit tests

Reviewers: jkedgar, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D59637
2016-06-21 17:57:59 -07:00