Commit Graph

260 Commits

Author SHA1 Message Date
Peter Dillinger a680a7ea37 Un-revert #7049, revert #7022 (#7071)
Summary:
Even though local bisection gave me a clear signal (and still does) that reverting https://github.com/facebook/rocksdb/issues/7049 would fix the failures in MultiThreadedDBTest, https://github.com/facebook/rocksdb/issues/7022 seems to be the root cause. Reverting https://github.com/facebook/rocksdb/issues/7022 and keeping https://github.com/facebook/rocksdb/issues/7049 seems to fix the issue in local reproducer also. (Had these landed in opposite order, bisection would have found the root cause.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7071

Reviewed By: akankshamahajan15

Differential Revision: D22362857

Pulled By: pdillinger

fbshipit-source-id: ed63df3d74e9d4ce1604de8fe43b216166c7a3f0
2020-07-02 13:30:41 -07:00
Peter Dillinger 52d59e0c93 Revert "Whole DBTest to skip fsync (#7049)" (#7070)
Summary:
This reverts commit 4f1534bdb0.

This commit caused failures and deadlocks in
MultiThreadedDBTest.MultiThreaded/69 and others.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7070

Reviewed By: riversand963

Differential Revision: D22358778

Pulled By: pdillinger

fbshipit-source-id: faf8f2cb469a7063a113921c8e9c64a9f7610dac
2020-07-02 10:22:43 -07:00
sdong 4f1534bdb0 Whole DBTest to skip fsync (#7049)
Summary:
After https://github.com/facebook/rocksdb/pull/7036, we still see extra DBTest that can timeout when running 10 or 20 in parallel. Expand skip-fsync mode in whole DBTest. Still preserve other tests from doing this mode to be conservative.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7049

Test Plan: Run all existing files.

Reviewed By: pdillinger

Differential Revision: D22301700

fbshipit-source-id: f9a9e3b3b26ce640665a47cb8bff33ba0c89b565
2020-07-01 19:37:56 -07:00
Akanksha Mahajan 5edfe3a3d8 Update Flush policy in PartitionedIndexBuilder on switching from user-key to internal-key mode (#7022)
Summary:
When format_version is high enough to support user-key and there are index entries for same user key that spans multiple data blocks then it changes from user-key mode to internal-key mode. But the flush policy is not reset to point to Block Builder of internal-keys. After this switch, no entries are added to user key index partition result, thus it never triggers flushing the block.

Fix: After adding the entry in sub_builder_index_, if there is a switch from user-key to internal-key, then flush policy is updated to point to Block Builder of internal-keys index partition.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7022

Test Plan:
1. make check -j64
           2. Added one unit test case

Reviewed By: ajkr

Differential Revision: D22197734

Pulled By: akankshamahajan15

fbshipit-source-id: d87e9e46bccab8e896ee6979d6b79c51f73d479e
2020-07-01 14:58:08 -07:00
sdong d64cf0e4ee Move away from direct TmpDir() call in some tests (#7030)
Summary:
Some tests directly uses TmpDir() as temporary directory without adding any randomize factor. This would cause failures when tests run in parallel. Fix it by moving some of them to test::PerThreadDBPath()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7030

Test Plan: Watch existing tests pass

Reviewed By: zhichao-cao

Differential Revision: D22224710

fbshipit-source-id: 28c9932fede0a4a64670e5b5fdb08f4fb5dccdd0
2020-06-25 12:09:57 -07:00
sdong d6b7b7712f Fix a bug that causes iterator to return wrong result in a rare data race (#6973)
Summary:
The bug fixed in https://github.com/facebook/rocksdb/pull/1816/ is now applicable to iterator too. This was not an issue but https://github.com/facebook/rocksdb/pull/2886 caused the regression. If a put and DB flush happens just between iterator to get latest sequence number and getting super version, empty result for the key or an older value can be returned, which is wrong.
Fix it in the same way as the fix in https://github.com/facebook/rocksdb/issues/1816, that is to get the sequence number after referencing the super version.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6973

Test Plan: Will run stress tests for a while to make sure there is no general regression.

Reviewed By: ajkr

Differential Revision: D22029348

fbshipit-source-id: 94390f93630906796d6e2fec321f44a920953fd1
2020-06-18 10:16:38 -07:00
sdong 223b57eeb8 Fix the bug that compressed cache is disabled in read-only DBs (#6990)
Summary:
Compressed block cache is disabled in https://github.com/facebook/rocksdb/pull/4650 for no good reason. Re-enable it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6990

Test Plan: Add a unit test to make sure a general function works with read-only DB + compressed block cache.

Reviewed By: ltamasi

Differential Revision: D22072755

fbshipit-source-id: 2a55df6363de23a78979cf6c747526359e5dc7a1
2020-06-17 14:30:54 -07:00
Zhen Li 9c24a5cb4d Fix persistent cache on windows (#6932)
Summary:
Persistent cache feature caused rocks db crash on windows. I posted a issue for it, https://github.com/facebook/rocksdb/issues/6919. I found this is because no "persistent_cache_key_prefix" is generated for persistent cache. Looking repo history, "GetUniqueIdFromFile" is not implemented on Windows. So my fix is adding "NewId()" function in "persistent_cache" and using it to generate prefix for persistent cache. In this PR, i also re-enable related test cases defined in "db_test2" and "persistent_cache_test" for windows.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6932

Test Plan:
1. run related test cases in "db_test2" and "persistent_cache_test" on windows and see it passed.
2. manually run db_bench.exe with "read_cache_path" and verified.

Reviewed By: riversand963

Differential Revision: D21911608

Pulled By: cheng-chang

fbshipit-source-id: cdfd938d54a385edbb2836b13aaa1d39b0a6f1c2
2020-06-13 13:28:31 -07:00
Yanqin Jin 717749f4c0 Fail point-in-time WAL recovery upon IOError reading WAL (#6963)
Summary:
If `options.wal_recovery_mode == WALRecoveryMode::kPointInTimeRecovery`, RocksDB stops replaying WAL once hitting an error and discards the rest of the WAL. This can lead to data loss if the error occurs at an offset smaller than the last sync'ed offset.
Ideally, RocksDB point-in-time recovery should permit recovery if the error occurs after last synced offset while fail recovery if error occurs before the last synced offset. However, RocksDB does not track the synced offset of WALs. Consequently, RocksDB does not know whether an error occurs before or after the last synced offset. An error can be one of the following.
- WAL record checksum mismatch. This can result from both corruption of synced data and dropping of unsynced data during shutdown. We cannot be sure which one. In order not to defeat the original motivation to permit the latter case, we keep the original behavior of point-in-time WAL recovery.
- IOError. This means the WAL can be bad, an indicator of whole file becoming unavailable, not to mention synced part of the WAL. Therefore, we choose to modify the behavior of point-in-time recovery and fail the database recovery.

Test plan (devserver):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6963

Reviewed By: ajkr

Differential Revision: D22011083

Pulled By: riversand963

fbshipit-source-id: f9cbf29a37dc5cc40d3fa62f89eed1ad67ca1536
2020-06-11 18:42:10 -07:00
Levi Tamasi 722ebba834 Turn DBTest2.CompressionFailures into a parameterized test (#6968)
Summary:
`DBTest2.CompressionFailures` currently tests many configurations
sequentially using nested loops, which often leads to timeouts
in our test system. The patch turns it into a parameterized test
instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6968

Test Plan: `make check`

Reviewed By: siying

Differential Revision: D22006954

Pulled By: ltamasi

fbshipit-source-id: f71f2f7108086b7651ecfce3d79a7fab24620b2c
2020-06-11 16:35:39 -07:00
Zitan Chen 23e446a157 Disable OpenForReadOnly tests in the LITE mode (#6947)
Summary:
Disable two OpenForReadOnly tests in the LITE mode
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6947

Test Plan: passed db_test2

Reviewed By: cheng-chang

Differential Revision: D21914345

Pulled By: gg814

fbshipit-source-id: 58e81baf5d8cf8adcedaef3966aa3a427bbdf7c2
2020-06-05 17:28:24 -07:00
Zitan Chen 02df00d97b API change: DB::OpenForReadOnly will not write to the file system unless create_if_missing is true (#6900)
Summary:
DB::OpenForReadOnly will not write anything to the file system (i.e., create directories or files for the DB) unless create_if_missing is true.

This change also fixes some subcommands of ldb, which write to the file system even if the purpose is for readonly.

Two tests for this updated behavior of DB::OpenForReadOnly are also added.

Other minor changes:
1. Updated HISTORY.md to include this API change of DB::OpenForReadOnly;
2. Updated the help information for the put and batchput subcommands of ldb with the option [--create_if_missing];
3. Updated the comment of Env::DeleteDir to emphasize that it returns OK only if the directory to be deleted is empty.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6900

Test Plan: passed make check; also manually tested a few ldb subcommands

Reviewed By: pdillinger

Differential Revision: D21822188

Pulled By: gg814

fbshipit-source-id: 604cc0f0d0326a937ee25a32cdc2b512f9a3be6e
2020-06-03 18:57:49 -07:00
sdong afa3518839 Revert "Update googletest from 1.8.1 to 1.10.0 (#6808)" (#6923)
Summary:
This reverts commit 8d87e9cea1.

Based on offline discussions, it's too early to upgrade to gtest 1.10, as it prevents some developers from using an older version of gtest to integrate to some other systems. Revert it for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6923

Reviewed By: pdillinger

Differential Revision: D21864799

fbshipit-source-id: d0726b1ff649fc911b9378f1763316200bd363fc
2020-06-03 15:55:03 -07:00
Adam Retter 8d87e9cea1 Update googletest from 1.8.1 to 1.10.0 (#6808)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6808

Reviewed By: anand1976

Differential Revision: D21483984

Pulled By: pdillinger

fbshipit-source-id: 70c5eff2bd54ddba469761d95e4cd4611fb8e598
2020-06-01 20:33:42 -07:00
Yanqin Jin b11a8b1b9a Fix valgrind error by init memory region (#6842)
Summary:
As title. After allocating a memory buffer, initialize its content to 0s.
This fixes valgrind issue introduced in https://github.com/facebook/rocksdb/issues/6709.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6842

Test Plan:
```
$make valgrind_test
$valgrind --tool=memcheck --track-origins=yes ./db_test2 --gtest_filter=DBTest2.CompressionFailures
```

Reviewed By: pdillinger

Differential Revision: D21551848

Pulled By: riversand963

fbshipit-source-id: e87a6f413e3f3d92d8e23d8ecc4cf93479c6674c
2020-05-14 18:50:03 -07:00
Ziyue Yang c384c08a4f Add tests for compression failure in BlockBasedTableBuilder (#6709)
Summary:
Currently there is no check for whether BlockBasedTableBuilder will expose
compression error status if compression fails during the table building.
This commit adds fake faulting compressors and a unit test to test such
cases.

This check finds 5 bugs, and this commit also fixes them:

1. Not handling compression failure well in
   BlockBasedTableBuilder::BGWorkWriteRawBlock.
2. verify_compression failing in BlockBasedTableBuilder when used with ZSTD.
3. Wrongly passing the same reference of block contents to
   BlockBasedTableBuilder::CompressAndVerifyBlock in parallel compression.
4. Wrongly setting block_rep->first_key_in_next_block to nullptr in
   BlockBasedTableBuilder::EnterUnbuffered when there are still incoming data
   blocks.
5. Not maintaining variables for compression ratio estimation and first_block
   in BlockBasedTableBuilder::EnterUnbuffered.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6709

Reviewed By: ajkr

Differential Revision: D21236254

fbshipit-source-id: 101f6e62b2bac2b7be72be198adf93cd32a1ff46
2020-05-12 09:27:35 -07:00
sdong 680c416348 Avoid Swallowing Some File Consistency Checking Bugs (#6793)
Summary:
We are swallowing some file consistency checking failures. This is not expected. We are fixing two cases: DB reopen and manifest dump.
More places are not fixed and need follow-up.

Error from CheckConsistencyForDeletes() is also swallowed, which is not fixed in this PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6793

Test Plan: Add a unit test to cover the reopen case.

Reviewed By: riversand963

Differential Revision: D21366525

fbshipit-source-id: eb438a322237814e8d5125f916a3c6de97f39ded
2020-05-04 14:18:11 -07:00
Ziyue Yang 03a781a90c Add pipelined & parallel compression optimization (#6262)
Summary:
This PR adds support for pipelined & parallel compression optimization for `BlockBasedTableBuilder`. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set `CompressionOptions::parallel_threads` greater than 1 to enable compression parallelism.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6262

Reviewed By: ajkr

Differential Revision: D20651306

fbshipit-source-id: 62125590a9c15b6d9071def9dc72589c1696a4cb
2020-04-01 16:40:18 -07:00
Cheng Chang afb97094ae Skip high levels with no key falling in the range in CompactRange (#6482)
Summary:
In CompactRange, if there is no key in memtable falling in the specified range, then flush is skipped.
This PR extends this skipping logic to SST file levels: it starts compaction from the highest level (starting from L0) that has files with key falling in the specified range, instead of always starts from L0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6482

Test Plan:
A new test ManualCompactionTest::SkipLevel is added.

Also updated a test related to statistics of index block cache hit in db_test2, the index cache hit is increased by 1 in this PR because when checking overlap for the key range in L0, OverlapWithLevelIterator will do a seek in the table cache iterator, which will read from the cached index.

Also updated db_compaction_test and db_test to use correct range for full compaction.

Differential Revision: D20251149

Pulled By: cheng-chang

fbshipit-source-id: f822157cf4796972bd5035d9d7178d8dfb7af08b
2020-03-04 20:15:25 -08:00
sdong fdf882ded2 Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433)
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433

Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.

Differential Revision: D19977691

fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
2020-02-20 12:09:57 -08:00
sdong 800d24ddc5 Fix DBTest2.ChangePrefixExtractor LITE build (#6356)
Summary:
DBTest2.ChangePrefixExtractor fails in LITE build because LITE build doesn't support adaptive build. Fix it by removing the stats check but only check correctness.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6356

Test Plan: Run the test with both of LITE and non-LITE build.

Differential Revision: D19669537

fbshipit-source-id: 6d7dd6c8a79f18e80ca1636864b9c71922030d8e
2020-01-31 15:44:14 -08:00
sdong ec496347bc Add a unit test for prefix extractor changes (#6323)
Summary:
Add a unit test for prefix extractor change, including a check that fails due to a bug.
Also comment out the partitioned filter case which will fail the test too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6323

Test Plan: Run the test and it passes (and fails if the SeekForPrev() part is uncommented)

Differential Revision: D19509744

fbshipit-source-id: 678202ca97b5503e9de73b54b90de9e5ba822b72
2020-01-31 11:02:03 -08:00
sdong 71874c5aaf Fix LITE build with DBTest2.AutoPrefixMode1 (#6346)
Summary:
DBTest2.AutoPrefixMode1 doesn't pass because auto prefix mode is not supported there.
Fix it by disabling the test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6346

Test Plan: Run DBTest2.AutoPrefixMode1 in lite mode

Differential Revision: D19627486

fbshipit-source-id: fbde75260aeecb7e6fc406e09c19a71a95aa5f08
2020-01-29 16:43:42 -08:00
sdong 8f2bee6747 Add ReadOptions.auto_prefix_mode (#6314)
Summary:
Add a new option ReadOptions.auto_prefix_mode. When set to true, iterator should return the same result as total order seek, but may choose to do prefix seek internally, based on iterator upper bounds. Also fix two previous bugs when handling prefix extrator changes: (1) reverse iterator should not rely on upper bound to determine prefix. Fix it with skipping prefix check. (2) block-based filter is not handled properly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6314

Test Plan: (1) add a unit test; (2) add the check to stress test and run see whether it can pass at least one run.

Differential Revision: D19458717

fbshipit-source-id: 51c1bcc5cdd826c2469af201979a39600e779bce
2020-01-28 14:44:05 -08:00
sdong d87cffaea4 Fix another bug caused by recent hash index fix (#6305)
Summary:
Recent bug fix related to hash index introduced a new bug: hash index can return NotFound but it is not handled by BlockBasedTable::Get(). The end result is that Get() stops being executed too early. Fix it by ignoring NotFound code in Get().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6305

Test Plan: A problematic DB used to return NotFound incorrectly, and now able to return correct result. Will try to construct a unit test too.0

Differential Revision: D19438925

fbshipit-source-id: e751afa8c13728d56511cfeb1bc811ecb99f3217
2020-01-17 01:41:04 -08:00
sdong f8b5ef85ec Fix a bug caused by recent fix of Prefix Hash (#6302)
Summary:
Recent fix to Prefix Hash https://github.com/facebook/rocksdb/pull/6292 caused a bug that the newly created NotFound status in hash index is never reset. This causes reseek or implict reseek to return wrong results sometimes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6302

Test Plan:
Add a unit test that would fail. Not fix.
crash test with hash test would fail in several seconds. With the fix, it will run about several minutes before failing with another failure.

Differential Revision: D19424572

fbshipit-source-id: c5276f36a95fd0e2837e30190476d2fe21ed8566
2020-01-16 10:47:20 -08:00
sdong d2b4d42d4b Fix kHashSearch bug with SeekForPrev (#6297)
Summary:
When prefix is enabled the expected behavior when the prefix of the target does not exist is for Seek is to seek to any key larger than target and SeekToPrev to any key less than the target.
Currently. the prefix index (kHashSearch) returns OK status but sets Invalid() to indicate two cases: a prefix of the searched key does not exist, ii) the key is beyond the range of the keys in SST file. The SeekForPrev implementation in BlockBasedTable thus does not have enough information to know when it should set the index key to first (to return a key smaller than target). The patch fixes that by returning NotFound status for cases that the prefix does not exist. SeekForPrev in BlockBasedTable accordingly SeekToFirst instead of SeekToLast on the index iterator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6297

Test Plan: SeekForPrev of non-exsiting prefix is added to block_test.cc, and a test case is added in db_test2, which fails without the fix.

Differential Revision: D19404695

fbshipit-source-id: cafbbf95f8f60ff9ede9ccc99d25bfa1cf6fcdc3
2020-01-15 14:28:39 -08:00
sdong 76c117b24b Fix LITE test build broken by recent commit (#6295)
Summary:
A recent commit adds a unit test that uses a function not available in LITE build. Fix it by avoiding the call
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6295

Test Plan: Run the test in LITE build and see it passes.

Differential Revision: D19395678

fbshipit-source-id: 37b42835bae02511630d80f7cafb1179401bc033
2020-01-14 13:17:04 -08:00
sdong 894c6d21af Bug when multiple files at one level contains the same smallest key (#6285)
Summary:
The fractional cascading index is not correctly generated when two files at the same level contains the same smallest or largest user key.
The result would be that it would hit an assertion in debug mode and lower level files might be skipped.
This might cause wrong results when the same user keys are of merge operands and Get() is called using the exact user key. In that case, the lower files would need to further checked.
The fix is to fix the fractional cascading index.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6285

Test Plan: Add a unit test which would cause the assertion which would be fixed.

Differential Revision: D19358426

fbshipit-source-id: 39b2b1558075fd95e99491d462a67f9f2298c48e
2020-01-13 16:27:42 -08:00
Maysam Yabandeh eff5e076f5 unordered_write incompatible with max_successive_merges (#6284)
Summary:
unordered_write is incompatible with non-zero max_successive_merges. Although we check this at runtime, we currently don't prevent the user from setting this combination in options. This has led to stress tests to fail with this combination is tried in ::SetOptions.
The patch fixes that and also reverts the changes performed by https://github.com/facebook/rocksdb/pull/6254, in which max_successive_merges was mistakenly declared incompatible with unordered_write.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6284

Differential Revision: D19356115

Pulled By: maysamyabandeh

fbshipit-source-id: f06dadec777622bd75f267361c022735cf8cecb6
2020-01-10 16:53:19 -08:00
Yanqin Jin a8b1085ae2 Fix test in LITE mode (#6267)
Summary:
Currently, the recently-added test DBTest2.SwitchMemtableRaceWithNewManifest
fails in LITE mode since SetOptions() returns "Not supported". I do not want to
put `#ifndef ROCKSDB_LITE` because it reduces test coverage. Instead, just
trigger compaction on a different column family. The bg compaction thread
calling LogAndApply() may race with thread calling SwitchMemtable().

Test Plan (dev server):
make check
OPT=-DROCKSDB_LITE make check

or run DBTest2.SwitchMemtableRaceWithNewManifest 100 times.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6267

Differential Revision: D19301309

Pulled By: riversand963

fbshipit-source-id: 88cedcca2f985968ed3bb234d324ffa2aa04ca50
2020-01-07 13:47:03 -08:00
Yanqin Jin 1aaa145877 Fix a data race for cfd->log_number_ (#6249)
Summary:
A thread calling LogAndApply may release db mutex when calling
WriteCurrentStateToManifest() which reads cfd->log_number_. Another thread can
call SwitchMemtable() and writes to cfd->log_number_.
Solution is to cache the cfd->log_number_ before releasing mutex in
LogAndApply.

Test Plan (on devserver):
```
$COMPILE_WITH_TSAN=1 make db_stress
$./db_stress --acquire_snapshot_one_in=10000 --avoid_unnecessary_blocking_io=1 --block_size=16384 --bloom_bits=16 --bottommost_compression_type=zstd --cache_index_and_filter_blocks=1 --cache_size=1048576 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_ttl=0 --compression_max_dict_bytes=16384 --compression_type=zstd --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_blackbox --db_write_buffer_size=1048576 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --enable_pipelined_write=0  --flush_one_in=1000000 --format_version=5 --get_live_files_and_wal_files_one_in=1000000 --index_block_restart_interval=5 --index_type=0 --log2_keys_per_lock=22 --long_running_snapshots=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=1000000 --max_manifest_file_size=16384 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --memtablerep=skip_list --mmap_read=0 --nooverwritepercent=1 --open_files=500000 --ops_per_thread=100000000 --partition_filters=0 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefixpercent=5 --progress_reports=0 --readpercent=45 --recycle_log_file_num=0 --reopen=20 --set_options_one_in=10000 --snapshot_hold_ops=100000 --subcompactions=2 --sync=1 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=4194304 --write_dbid_to_manifest=1 --writepercent=35
```
Then repeat the following multiple times, e.g. 100 after compiling with tsan.
```
$./db_test2 --gtest_filter=DBTest2.SwitchMemtableRaceWithNewManifest
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6249

Differential Revision: D19235077

Pulled By: riversand963

fbshipit-source-id: 79467b52f48739ce7c27e440caa2447a40653173
2020-01-06 20:09:51 -08:00
Maysam Yabandeh 48a678b7c9 Prevent an incompatible combination of options (#6254)
Summary:
allow_concurrent_memtable_write is incompatible with non-zero max_successive_merges. Although we check this at runtime, we currently don't prevent the user from setting this combination in options. This has led to stress tests to fail with this combination is tried in ::SetOptions. The patch fixes that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6254

Differential Revision: D19265819

Pulled By: maysamyabandeh

fbshipit-source-id: 47f2e2dc26fe0972c7152f4da15dadb9703f1179
2020-01-02 16:15:06 -08:00
解轶伦 39fcaf8246 delete superversions in BackgroundCallPurge (#6146)
Summary:
I found that CleanupSuperVersion() may block Get() for 30ms+ (per MemTable is 256MB).

Then I found "delete sv" in ~SuperVersion() takes the time.

The backtrace looks like this

DBImpl::GetImpl() -> DBImpl::ReturnAndCleanupSuperVersion() ->
DBImpl::CleanupSuperVersion() : delete sv; -> ~SuperVersion()

I think it's better to delete in a background thread,  please review it。
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6146

Differential Revision: D18972066

fbshipit-source-id: 0f7b0b70b9bb1e27ad6fc1c8a408fbbf237ae08c
2019-12-17 13:22:57 -08:00
sdong bb23bfe63c Fix a regression bug on total order seek with prefix enabled and range delete (#6028)
Summary:
Recent change https://github.com/facebook/rocksdb/pull/5861 mistakely use "prefix_extractor_ != nullptr" as the condition to determine whehter prefix bloom filter isused. It fails to consider read_options.total_order_seek, so it is wrong. The result is that an optimization for non-total-order seek is mistakely applied to total order seek, and introduces a bug in following corner case:
Because of RangeDelete(), a file's largest key is extended. Seek key falls into the range deleted file, so level iterator seeks into the previous file without getting any key. The correct behavior is to place the iterator to the first key of the next file. However, an optimization is triggered and invalidates the iterator because it is out of the prefix range, causing wrong results. This behavior is reproduced in the unit test added.
Fix the bug by setting prefix_extractor to be null if total order seek is used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6028

Test Plan: Add a unit test which fails without the fix.

Differential Revision: D18479063

fbshipit-source-id: ac075f013029fcf69eb3a598f14c98cce3e810b3
2019-11-13 10:11:34 -08:00
Yanqin Jin 2309fd63bf Update column families' log number altogether after flushing during recovery (#5856)
Summary:
A bug occasionally shows up in crash test, and https://github.com/facebook/rocksdb/issues/5851 reproduces it.
The bug can surface in the following way.
1. Database has multiple column families.
2. Between one DB restart, the last log file is corrupted in the middle (not the tail)
3. During restart, DB crashes between flushing between two column families.

Then DB will fail to be opened again with error "SST file is ahead of WALs".
Solution is to update the log number associated with each column family altogether after flushing all column families' memtables. The version edits should be written to a new MANIFEST. Only after writing to all these version edits succeed does RocksDB (atomically) points the CURRENT file to the new MANIFEST.

Test plan (on devserver):
```
$make all && make check
```
Specifically
```
$make db_test2
$./db_test2 --gtest_filter=DBTest2.CrashInRecoveryMultipleCF
```
Also checked for compatibility as follows.
Use this branch, run DBTest2.CrashInRecoveryMultipleCF and preserve the db directory.
Then checkout 5.4, build ldb, and dump the MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5856

Differential Revision: D17620818

Pulled By: riversand963

fbshipit-source-id: b52ce5969c9a8052cacec2bd805fcfb373589039
2019-10-24 18:29:30 -07:00
sdong a0cd920026 LevelIterator to avoid gap after prefix bloom filters out a file (#5861)
Summary:
Right now, when LevelIterator::Seek() is called, when a file is filtered out by prefix bloom filter, the position is put to the beginning of the next file. This is a confusing internal interface because many keys in the levels are skipped. Avoid this behavior by checking the key of the next file against the seek key, and invalidate the whole iterator if the prefix doesn't match.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5861

Test Plan: Add a new unit test to validate the behavior; run all exsiting tests; run crash_test

Differential Revision: D17918213

fbshipit-source-id: f06b47d937c7cc8919001f18dcc3af5b28c9cdac
2019-10-21 11:40:57 -07:00
Andrew Kryczka b00761eea6 Fix block cache ID uniqueness for Windows builds (#5844)
Summary:
Since we do not evict a file's blocks from block cache before that file
is deleted, we require a file's cache ID prefix is both unique and
non-reusable. However, the Windows functionality we were relying on only
guaranteed uniqueness. That meant a newly created file could be assigned
the same cache ID prefix as a deleted file. If the newly created file
had block offsets matching the deleted file, full cache keys could be
exactly the same, resulting in obsolete data blocks returned from cache
when trying to read from the new file.

We noticed this when running on FAT32 where compaction was writing out
of order keys due to reading obsolete blocks from its input files. The
functionality is documented as behaving the same on NTFS, although I
wasn't able to repro it there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5844

Test Plan:
we had a reliable repro of out-of-order keys on FAT32 that
was fixed by this change

Differential Revision: D17752442

fbshipit-source-id: 95d983f9196cf415f269e19293b97341edbf7e00
2019-10-11 18:19:31 -07:00
sdong 76e951dbb1 Add a unit test to reproduce a corruption bug (#5851)
Summary:
This is a bug occaionally shows up in crash test, and this unit test is to reproduce it. The bug is following:
1. Database has multiple CFs.
2. Between one DB restart, the last log file is corrupted in the middle (not the tail)
3. During restart, DB crashes between flushes between two CFs.
The DB will fail to be opened again with error "SST file is ahead of WALs"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5851

Test Plan: Run the test itself.

Differential Revision: D17614721

fbshipit-source-id: 1b0abce49b203a76a039e38e76bc940429975f20
2019-09-26 16:18:42 -07:00
sdong c06b54d0c6 Apply formatter on recent 45 commits. (#5827)
Summary:
Some recent commits might not have passed through the formatter. I formatted recent 45 commits. The script hangs for more commits so I stopped there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5827

Test Plan: Run all existing tests.

Differential Revision: D17483727

fbshipit-source-id: af23113ee63015d8a43d89a3bc2c1056189afe8f
2019-09-19 12:34:17 -07:00
sdong 43a340bebe Merging iterator to disble reseek optimization in prefix seek (#5815)
Summary:
We are seeing a bug of wrong results with merging iterator's reseek avoidence feature and prefix extractor. Disable this optimization for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5815

Test Plan: Validated the same MyRocks case was fixed; run all existing tests.

Differential Revision: D17430776

fbshipit-source-id: aef664277ba0ab8a2e68331ff0db6ae682535371
2019-09-17 17:10:29 -07:00
andrew 622683000c Allow users to stop manual compactions (#3971)
Summary:
Manual compaction may bring in very high load because sometime the amount of data involved in a compaction could be large, which may affect online service. So it would be good if the running compaction making the server busy can be stopped immediately. In this implementation, stopping manual compaction condition is only checked in slow process. We let deletion compaction and trivial move go through.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3971

Test Plan: add tests at more spots.

Differential Revision: D17369043

fbshipit-source-id: 575a624fb992ce0bb07d9443eb209e547740043c
2019-09-16 21:01:47 -07:00
Maysam Yabandeh 638d239507 Charge block cache for cache internal usage (#5797)
Summary:
For our default block cache, each additional entry has extra memory overhead. It include LRUHandle (72 bytes currently) and the cache key (two varint64, file id and offset). The usage is not negligible. For example for block_size=4k, the overhead accounts for an extra 2% memory usage for the cache. The patch charging the cache for the extra usage, reducing untracked memory usage outside block cache. The feature is enabled by default and can be disabled by passing kDontChargeCacheMetadata to the cache constructor.
This PR builds up on https://github.com/facebook/rocksdb/issues/4258
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5797

Test Plan:
- Existing tests are updated to either disable the feature when the test has too much dependency on the old way of accounting the usage or increasing the cache capacity to account for the additional charge of metadata.
- The Usage tests in cache_test.cc are augmented to test the cache usage under kFullChargeCacheMetadata.

Differential Revision: D17396833

Pulled By: maysamyabandeh

fbshipit-source-id: 7684ccb9f8a40ca595e4f5efcdb03623afea0c6f
2019-09-16 15:26:21 -07:00
Yanqin Jin 5d9a67e718 Support loading custom objects in unit tests (#5676)
Summary:
Most existing RocksDB unit tests run on `Env::Default()`. It will be useful to port the unit tests to non-default environments, e.g. `HdfsEnv`, etc.
This pull request is one step towards this goal. If RocksDB unit tests are built with a static library exposing a function `RegisterCustomObjects()`, then it is possible to implement custom object registrar logic in the library. RocksDB unit test can call `RegisterCustomObjects()` at the beginning.
By default, `ROCKSDB_UNITTESTS_WITH_CUSTOM_OBJECTS_FROM_STATIC_LIBS` is not defined, thus this PR has no impact on existing RocksDB because `RegisterCustomObjects()` is a noop.
Test plan (on devserver):
```
$make clean && COMPILE_WITH_ASAN=1 make -j32 all
$make check
```
All unit tests must pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5676

Differential Revision: D16679157

Pulled By: riversand963

fbshipit-source-id: aca571af3fd0525277cdc674248d0fe06e060f9d
2019-08-09 15:12:08 -07:00
Vijay Nadimpalli d150e01474 New API to get all merge operands for a Key (#5604)
Summary:
This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases:
1. Update subset of columns and read subset of columns -
Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU.
2. Updating very few attributes in a value which is a JSON-like document -
Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge.
----------------------------------------------------------------------------------------------------
API :
Status GetMergeOperands(
      const ReadOptions& options, ColumnFamilyHandle* column_family,
      const Slice& key, PinnableSlice* merge_operands,
      GetMergeOperandsOptions* get_merge_operands_options,
      int* number_of_operands)

Example usage :
int size = 100;
int number_of_operands = 0;
std::vector<PinnableSlice> values(size);
GetMergeOperandsOptions merge_operands_info;
db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), merge_operands_info, &number_of_operands);

Description :
Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than merge_operands_options.expected_max_number_of_operands no merge operands are returned and status is Incomplete. Merge operands returned are in the order of insertion.
merge_operands-> Points to an array of at-least merge_operands_options.expected_max_number_of_operands and the caller is responsible for allocating it. If the status returned is Incomplete then number_of_operands will contain the total number of merge operands found in DB for key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604

Test Plan:
Added unit test and perf test in db_bench that can be run using the command:
./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist

Differential Revision: D16657366

Pulled By: vjnadimpalli

fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf
2019-08-06 14:26:44 -07:00
sdong 66b5613d0c row_cache to share entry for recent snapshots (#5600)
Summary:
Right now, users cannot take advantage of row cache, unless no snapshot is used, or Get() is repeated for the same snapshots. This limits the usage of row cache.
This change eliminate this restriction in some cases. If the snapshot used is newer than the largest sequence number in the file, and write callback function is not registered, the same row cache key is used as no snapshot is given. We still need the callback function restriction for now because the callback function may filter out different keys for different snapshots even if the snapshots are new.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5600

Test Plan: Add a unit test.

Differential Revision: D16386616

fbshipit-source-id: 6b7d214bd215d191b03ccf55926ad4b703ec2e53
2019-07-22 18:56:19 -07:00
Sagar Vemuri 1b59a490ef Fix flaky DBTest2.PresetCompressionDict test (#5378)
Summary:
Fix flaky DBTest2.PresetCompressionDict test.

This PR fixes two issues with the test:
1. Replaces `GetSstFiles` with `TotalSize`, which is based on `DB::GetColumnFamilyMetaData` so that only the size of the live SST files is taken into consideration when computing the total size of all sst files. Earlier, with `GetSstFiles`, even obsolete files were getting picked up.
1. In ZSTD compression, it is sometimes possible that using a trained dictionary is not better than using an untrained one. Using a trained dictionary performs well in 99% of the cases, but still in the remaining ~1% of the cases (out of 10000 runs) using an untrained dictionary gets better compression results.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5378

Differential Revision: D15559100

Pulled By: sagar0

fbshipit-source-id: c35adbf13871f520a2cec48f8bad9ff27ff7a0b4
2019-05-30 16:11:27 -07:00
Siying Dong 4e0f2aadb0 DB::Close() to fail when there are unreleased snapshots (#5272)
Summary:
Sometimes, users might make mistake of not releasing snapshots before closing the DB. This is undocumented use of RocksDB and the behavior is unknown. We return DB::Close() to provide a way to check it for the users. Aborted() will be returned to users when they call DB::Close().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5272

Differential Revision: D15159713

Pulled By: siying

fbshipit-source-id: 39369def612398d9f239d83d396b5a28e5af65cd
2019-05-01 10:17:30 -07:00
Zhongyi Xie baa5302447 Avoid double-compacting data in bottom level in manual compactions (#5138)
Summary:
Depending on the config, manual compaction (leveled compaction style) does following compactions:
L0->L1
L1->L2
...
Ln-1 -> Ln
Ln -> Ln
The final Ln -> Ln compaction is partly unnecessary as it recompacts all the files that were just generated by the Ln-1 -> Ln. We should avoid recompacting such files. This rule should be applied to Lmax only.
Resolves issue https://github.com/facebook/rocksdb/issues/4995
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5138

Differential Revision: D14940106

Pulled By: miasantreble

fbshipit-source-id: 8d3cf5507a17e76f3333cfd4bac5256d005636e5
2019-04-16 23:32:20 -07:00
Siying Dong beb44ec3eb WriteBufferManager's dummy entry size to block cache 1MB -> 256KB (#5175)
Summary:
Dummy cache size of 1MB is too large for small block sizes. Our GetDefaultCacheShardBits() use min_shard_size = 512L * 1024L to determine number of shards, so 1MB will excceeds the size of the whole shard and make the cache excceeds the budget.
Change it to 256KB accordingly.
There shouldn't be obvious performance impact, since inserting a cache entry every 256KB of memtable inserts is still infrequently enough.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5175

Differential Revision: D14954289

Pulled By: siying

fbshipit-source-id: 2c275255c1ac3992174e06529e44c55538325c94
2019-04-16 12:03:07 -07:00
Siying Dong 85b2bde3dd Still implement StatisticsImpl::measureTime() (#5181)
Summary:
Since Statistics::measureTime() is deprecated, StatisticsImpl::measureTime() is not implemented. We realized that users might have a wrapped Statistics implementation in which measureTime() is implemented as forwarded to StatisticsImpl, and causes assert failure. In order to make the change less intrusive, we implement StatisticsImpl::measureTime(). We will revisit whether we need to remove it after several releases.

Also, add a test to make sure that a Statistics implementation using the old interface still works.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5181

Differential Revision: D14907089

Pulled By: siying

fbshipit-source-id: 29b6202fd04e30ed6f6adcaeb1000e87f10d1e1a
2019-04-12 11:00:35 -07:00
Siying Dong ed9f5e21aa Change OptimizeForPointLookup() and OptimizeForSmallDb() (#5165)
Summary:
Change the behavior of OptimizeForSmallDb() so that it is less likely to go out of memory.
Change the behavior of OptimizeForPointLookup() to take advantage of the new memtable whole key filter, and move away from prefix extractor as well as hash-based indexing, as they are prone to misuse.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5165

Differential Revision: D14880709

Pulled By: siying

fbshipit-source-id: 9af30e3c9e151eceea6d6b38701a58f1f9fb692d
2019-04-11 10:45:36 -07:00
Zhichao Cao ebb9b2ed16 Fix the potential DB crash caused by call EndTrace before StartTrace (#5130)
Summary:
Although user should first call StartTrace to begin the RocksDB tracing function and call EndTrace to stop the tracing process, user can accidentally call EndTrace first. It will cause segment fault and crash the DB instance. The issue is fixed by checking the pointer first.

Test case added in db_test2.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5130

Differential Revision: D14691420

Pulled By: zhichao-cao

fbshipit-source-id: 3be13d2f944bc453728ef8eef67b68d7ad0939c8
2019-04-03 13:26:34 -07:00
Maysam Yabandeh 14b3f683a1 WriteUnPrepared: less virtual in iterator callback (#5049)
Summary:
WriteUnPrepared adds a virtual function, MaxUnpreparedSequenceNumber, to ReadCallback, which returns 0 unless WriteUnPrepared is enabled and the transaction has uncommitted data written to the DB. Together with snapshot sequence number, this determines the last sequence that is visible to reads.
The patch clarifies the guarantees of the GetIterator API in WriteUnPrepared transactions and make use of that to statically initialize the read callback and thus avoid the virtual call.
Furthermore it increases the minimum value for min_uncommitted from 0 to 1 as seq 0 is used only for last level keys that are committed in all snapshots.

The following benchmark shows +0.26% higher throughput in seekrandom benchmark.

Benchmark:
./db_bench --benchmarks=fillrandom --use_existing_db=0 --num=1000000 --db=/dev/shm/dbbench

./db_bench --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
seekrandom [AVG    10 runs] : 20355 ops/sec;  225.2 MB/sec
seekrandom [MEDIAN 10 runs] : 20425 ops/sec;  225.9 MB/sec

./db_bench_lessvirtual3 --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
seekrandom [AVG    10 runs] : 20409 ops/sec;  225.8 MB/sec
seekrandom [MEDIAN 10 runs] : 20487 ops/sec;  226.6 MB/sec
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5049

Differential Revision: D14366459

Pulled By: maysamyabandeh

fbshipit-source-id: ebaff8908332a5ae9af7defeadabcb624be660ef
2019-04-02 14:47:16 -07:00
Shi Feng 01e6badbb6 Introduce CPU timers for iterator seek and next (#5076)
Summary:
Introduce CPU timers for iterator seek and next operations. Seek
counter includes SeekToFirst, SeekToLast and SeekForPrev, w/ the
caveat that SeekToLast timer doesn't include some post processing
time if upper bound is defined.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5076

Differential Revision: D14525218

Pulled By: fredfsh

fbshipit-source-id: 03ba25df3b22b06c072621e4de0eacfa1445f0d9
2019-03-26 16:32:13 -07:00
Wenjie Yang 36c2a7cfb1 Add an option to filter traces (#5082)
Summary:
Add an option to filter out READ or WRITE operations while tracing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5082

Differential Revision: D14515083

Pulled By: mrmiywj

fbshipit-source-id: 2504c89a9abf1dd629cad44b4104092702d77610
2019-03-19 14:36:51 -07:00
Maysam Yabandeh a661c0d208 WritePrepared: optimize read path by avoiding virtual (#5018)
Summary:
The read path includes a callback function, ReadCallback, which would eventually calls IsInSnapshot to figure if a particular seq is in the reading snapshot or not. This callback is virtual, which adds the cost of multiple virtual function call to each read. The first few checks in IsInSnapshot, however, are quite trivial and take care of majority of the cases. The patch moves those to a non-virtual function in the the parent class, ReadCallback, to lower the virtual callback cost.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5018

Differential Revision: D14226562

Pulled By: maysamyabandeh

fbshipit-source-id: 6feed5b34f3b082e52092c5ef143e29b49c46b44
2019-02-26 16:56:19 -08:00
Siying Dong 93f7e7a450 Temporarily Disable DBTest2.PresetCompressionDict (#5003)
Summary:
DBTest2.PresetCompressionDict is flaky. Temparily disable it for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5003

Differential Revision: D14139505

Pulled By: siying

fbshipit-source-id: ebf1872d364b76b2cb021b489ea2f17ee997116a
2019-02-19 14:44:12 -08:00
Michael Liu ca89ac2ba9 Apply modernize-use-override (2nd iteration)
Summary:
Use C++11’s override and remove virtual where applicable.
Change are automatically generated.

Reviewed By: Orvid

Differential Revision: D14090024

fbshipit-source-id: 1e9432e87d2657e1ff0028e15370a85d1739ba2a
2019-02-14 14:41:36 -08:00
Andrew Kryczka 62f70f6d14 Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.

So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:

- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952

Differential Revision: D13967980

Pulled By: ajkr

fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
2019-02-11 19:47:32 -08:00
tang-jianfeng 08809f5e6c Implement trace sampling (#4963)
Summary:
Implement trace sampling to allow user to specify the sampling frequency, i.e. save one per how many requests, so that a user does not need to log all if he/she is interested in only a sampled set.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4963

Differential Revision: D14011190

Pulled By: tang-jianfeng

fbshipit-source-id: 078b631d9319b67cb089dd2c30e21d0df8dc406a
2019-02-08 18:08:18 -08:00
Alexander Zinoviev 32a6dd9a41 Add a new CPU time counter to compaction report (#4889)
Summary:
Measure CPU time consumed for a compaction and report it in the stats report
Enable NowCPUNanos() to work for MacOS
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889

Differential Revision: D13701276

Pulled By: zinoale

fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055
2019-01-29 17:24:00 -08:00
Andrew Kryczka 01013ae766 Digest ZSTD compression dictionary once when writing SST file (#4849)
Summary:
This is essentially a re-submission of #4251 with a few improvements:

- Split `CompressionDict` into two separate classes: `CompressionDict` and `UncompressionDict`
- Eliminated `Init` functions. Instead do all initialization work in constructors.
- Added test case for parallel DB open, which is the scenario where #4251 failed under TSAN.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4849

Differential Revision: D13606039

Pulled By: ajkr

fbshipit-source-id: 08c236059798c710db9cbf545fce0f371232d447
2019-01-18 19:12:57 -08:00
Maysam Yabandeh d56ac22b44 Remove duplicates from SnapshotList::GetAll (#4860)
Summary:
The vector returned by SnapshotList::GetAll could have duplicate entries if two separate snapshots have the same sequence number. However, when this vector is used in compaction the duplicate entires are of no use and could be safely ignored. Moreover not having duplicate entires simplifies reasoning in the compaction_iterator.cc code. For example when searching for the previous_snap we currently use the snapshot before the current one but the way the code uses that it expects it to be also less than the current snapshot, which would be simpler to read if there is no duplicate entry in the snapshot list.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4860

Differential Revision: D13615502

Pulled By: maysamyabandeh

fbshipit-source-id: d45bf01213ead5f39db811f951802da6fcc3332b
2019-01-09 16:25:42 -08:00
Siying Dong 8641e9adf7 Non-initial file preloading should always prefetch index and filter (#4852)
Summary:
https://github.com/facebook/rocksdb/pull/3340 introduces preloading when max_open_files != -1.
It doesn't preload index and filter in non-initial file loading case. This is a little bit too
complicated to understand. We observed in one MyRocks use case where the filter is expected to be
preloaded but is not. To simplify the use case, we simply always prefetch the index and filter.
They anyway is expected to be loaded in the file verification phase anyway.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4852

Differential Revision: D13595402

Pulled By: siying

fbshipit-source-id: d4d8624eb3e849e20aeb990df2100502d85aff31
2019-01-08 12:47:34 -08:00
Siying Dong f0dda35d7d Preload some files even if options.max_open_files (#3340)
Summary:
Choose to preload some files if options.max_open_files != -1. This can slightly narrow the gap of performance between options.max_open_files is -1 and a large number. To avoid a significant regression to DB reopen speed if options.max_open_files != -1. Limit the files to preload in DB open time to 16.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3340

Differential Revision: D6686945

Pulled By: siying

fbshipit-source-id: 8ec11bbdb46e3d0cdee7b6ad5897a09c5a07869f
2018-12-28 18:02:28 -08:00
Siying Dong da1c64b6e7 Introduce a CPU time counter in perf_context (#4741)
Summary:
Introduce the first CPU timing counter, perf_context.get_cpu_nanos. This opens a door to more CPU counters in the future.
Only Posix Env has it implemented using clock_gettime() with CLOCK_THREAD_CPUTIME_ID. How accurate the counter is depends on the platform.
Make PerfStepTimer to take an Env as an argument, and sometimes pass it in. The direct reason is to make the unit tests to use SpecialEnv where we can ingest logic there. But in long term, this is a good change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4741

Differential Revision: D13287798

Pulled By: siying

fbshipit-source-id: 090361049d9d5095d1d1a369fe1338d2e2e1c73f
2018-12-20 12:03:44 -08:00
Zhichao Cao 7125e24619 Add the max trace file size limitation option to Tracing (#4610)
Summary:
If user do not end the trace manually, the tracing will continue which can potential use up all the storage space and cause problem. In this PR, the max trace file size is added to the TraceOptions and user can set the value if they need or the default is 64GB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4610

Differential Revision: D12893400

Pulled By: zhichao-cao

fbshipit-source-id: acf4b5a6076bb691778bdfbac4864e1006758953
2018-11-27 14:27:05 -08:00
Siying Dong 13579e8c5a WriteBufferManger doens't cost to cache if no limit is set (#4695)
Summary:
WriteBufferManger is not invoked when allocating memory for memtable if the limit is not set even if a cache is passed. It is inconsistent from the comment syas. Fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4695

Differential Revision: D13112722

Pulled By: siying

fbshipit-source-id: 0b27eef63867f679cd06033ea56907c0569597f4
2018-11-18 16:55:43 -08:00
Siying Dong abb1a8fc23 Add a unit test to assert number of preads (#4657)
Summary:
We used to have a bug, which caused every block to be read twice, and none of our tests caught it. Add a very simply unit test to make sure that when reading a data block, we only issue one pread against the SST file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4657

Differential Revision: D13005260

Pulled By: siying

fbshipit-source-id: 03167b554ad2451192b1707415536d7d05e9026c
2018-11-13 12:52:19 -08:00
Sagar Vemuri 2993cd2002 Fix RocksDB Lite build (#4675)
Summary:
Our internal CI test caught RocksDB Lite build failures. The failures are due to a new test introduced in #4665 using `SSTFileWriter` and `IngestExternalFile`, but these is not exposed under lite mode. Fixed by #ifdef'ing out the test.

```
db/db_test2.cc: In member function ‘virtual void rocksdb::DBTest2_TestCompactFiles_Test::TestBody()’:
db/db_test2.cc:2907:3: error: ‘SstFileWriter’ is not a member of ‘rocksdb’
   rocksdb::SstFileWriter sst_file_writer{rocksdb::EnvOptions(), options};
   ^
In file included from ./util/testharness.h:15:0,
                 from ./table/mock_table.h:23,
                 from ./db/db_test_util.h:44,
                 from db/db_test2.cc:13:
db/db_test2.cc:2912:13: error: ‘sst_file_writer’ was not declared in this scope
   ASSERT_OK(sst_file_writer.Open(external_file1));
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4675

Differential Revision: D13035984

Pulled By: sagar0

fbshipit-source-id: c1ceac550dfac1a85eeea436693dc7dd467519a6
2018-11-12 19:01:37 -08:00
DorianZheng 0f88160f67 Fix `CompactFiles` bug (#4665)
Summary:
`CompactFiles` gets `SuperVersion` before `WaitForIngestFile`, while `IngestExternalFile` may add files that overlap with `input_file_names`

The timeline of execution flow is as follow:

Let's say that level N has two file [1,2] and [5,6]
```
timeline              user_thread1                             user_thread2
t0   |      CompactFiles([1, 2], [5, 6]) begin
t1   |         GetReferencedSuperVersion()
t2   |                                              IngestExternalFile([3,4]) to level N begin
t3   |             CompactFiles resume
     V
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4665

Differential Revision: D13030674

Pulled By: ajkr

fbshipit-source-id: 8be19477fd6e505032267a979d32f3097cc3be51
2018-11-12 14:32:18 -08:00
Sagar Vemuri dc3528077a Update all unique/shared_ptr instances to be qualified with namespace std (#4638)
Summary:
Ran the following commands to recursively change all the files under RocksDB:
```
find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
```
Running `make format` updated some formatting on the files touched.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638

Differential Revision: D12934992

Pulled By: sagar0

fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
2018-11-09 11:19:58 -08:00
Peter Pei 09814f2cfc support OnCompactionBegin (#4431)
Summary:
fix #4288

Add `OnCompactionBegin` support to `rocksdb::EventListener`.

Currently, we only have these three callbacks:

- OnFlushBegin
- OnFlushCompleted
- OnCompactionCompleted

As paolococchi requested in #4288 , and ajkr agreed, we should also support `OnCompactionBegin`.

This PR is a try to implement the support of `OnCompactionBegin`.

Hope it is useful to you.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4431

Differential Revision: D10055515

Pulled By: yiwu-arbug

fbshipit-source-id: 39c0f95f8e9ff1c7ca3a10787502a17f258d2334
2018-10-10 17:32:27 -07:00
DorianZheng 27090ae8f6 Fix DBImpl::GetColumnFamilyHandleUnlocked race condition (#4391)
Summary:
- Fix DBImpl API race condition

The timeline of execution flow is as follow:
```
timeline              user_thread1                      user_thread2
t1   |     cfh = GetColumnFamilyHandleUnlocked(0)
t2   |     id1 = cfh->GetID()
t3   |                                                GetColumnFamilyHandleUnlocked(1)
t4   |     id2 = cfh->GetID()
     V
```
The original implementation return a pointer to a stateful variable, so that the return `ColumnFamilyHandle` will be changed when another thread calls `GetColumnFamilyHandleUnlocked` with different `column family id`

- Expose ColumnFamily ID to compaction event listener

- Fix the return status of `DBImpl::GetLatestSequenceForKey`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4391

Differential Revision: D10221243

Pulled By: yiwu-arbug

fbshipit-source-id: dec60ee9ff0c8261a2f2413a8506ec1063991993
2018-10-08 14:24:16 -07:00
Sagar Vemuri 3db584059c Remove sync point from Block destructor (#4370)
Summary:
AddressSanitizer: heap-use-after-free in std::__atomic_base<bool>::load(std::memory_order) const
==1798517==ABORTING
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4370

Differential Revision: D9844146

Pulled By: sagar0

fbshipit-source-id: 18a2970b1d504b4f6c8fb04857f26e0f32124dd1
2018-09-15 00:12:57 -07:00
Siying Dong f3d91a0b57 Add a unit test to verify iterators release data blocks after using them (#4170)
Summary:
Add a unit test to check that iterators release data blocks after it has moved away from it. Verify the same for compaction input iterators.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4170

Differential Revision: D8962513

Pulled By: siying

fbshipit-source-id: 05a5b604d7d29887fb488f2cda7286f554a14407
2018-08-13 17:43:14 -07:00
Zhichao Cao 6d75319d95 Add tracing function of Seek() and SeekForPrev() to trace_replay (#4228)
Summary:
In the current trace_and replay, Get an WriteBatch are traced. This pull request track down the Seek() and SeekForPrev() to the trace file. <target_key, timestamp, column_family_id> are write to the file.

Replay of Iterator is not supported in the current implementation.

Tested with trace_analyzer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4228

Differential Revision: D9201381

Pulled By: zhichao-cao

fbshipit-source-id: 6f9cc9cb3c20260af741bee065ec35c5c96354ab
2018-08-10 17:57:40 -07:00
Sagar Vemuri 12b6cdeed3 Trace and Replay for RocksDB (#3837)
Summary:
A framework for tracing and replaying RocksDB operations.

A binary trace file is created by capturing the DB operations, and it can be replayed back at the same rate using db_bench.

- Column-families are supported
- Multi-threaded tracing is supported.
- TraceReader and TraceWriter are exposed to the user, so that tracing to various destinations can be enabled (say, to other messaging/logging services). By default, a FileTraceReader and FileTraceWriter are implemented to capture to a file and replay from it.
- This is not yet ideal to be enabled in production due to large performance overhead, but it can be safely tried out in a shadow setup, say, for analyzing RocksDB operations.

Currently supported DB operations:
- Writes:
-- Put
-- Merge
-- Delete
-- SingleDelete
-- DeleteRange
-- Write
- Reads:
-- Get (point lookups)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3837

Differential Revision: D7974837

Pulled By: sagar0

fbshipit-source-id: 8ec65aaf336504bc1f6ed0feae67f6ed5ef97a72
2018-08-01 00:27:08 -07:00
Siying Dong 8425c8bd4d BlockBasedTableReader: automatically adjust tail prefetch size (#4156)
Summary:
Right now we use one hard-coded prefetch size to prefetch data from the tail of the SST files. However, this may introduce a waste for some use cases, while not efficient for others.
Introduce a way to adjust this prefetch size by tracking 32 recent times, and pick a value with which the wasted read is less than 10%
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4156

Differential Revision: D8916847

Pulled By: siying

fbshipit-source-id: 8413f9eb3987e0033ed0bd910f83fc2eeaaf5758
2018-07-20 14:43:37 -07:00
Maysam Yabandeh 8581a93a6b Per-thread unique test db names (#4135)
Summary:
The patch makes sure that two parallel test threads will operate on different db paths. This enables using open source tools such as gtest-parallel to run the tests of a file in parallel.
Example: ``` ~/gtest-parallel/gtest-parallel ./table_test```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4135

Differential Revision: D8846653

Pulled By: maysamyabandeh

fbshipit-source-id: 799bad1abb260e3d346bcb680d2ae207a852ba84
2018-07-13 17:27:39 -07:00
Maysam Yabandeh 0a5b5d88b2 Remove ReadOnly part of PinnableSliceAndMmapReads from Lite (#4070)
Summary:
Lite does not support readonly DBs.
Closes https://github.com/facebook/rocksdb/pull/4070

Differential Revision: D8677858

Pulled By: maysamyabandeh

fbshipit-source-id: 536887d2363ee2f5d8e1ea9f1a511e643a1707fa
2018-06-28 08:42:17 -07:00
Maysam Yabandeh 235ab9dd32 Pin mmap files in ReadOnlyDB (#4053)
Summary:
https://github.com/facebook/rocksdb/pull/3881 fixed a bug where PinnableSlice pin mmap files which could be deleted with background compaction. This is however a non-issue for ReadOnlyDB when there is no compaction running and max_open_files is -1. This patch reenables the pinning feature for that case.
Closes https://github.com/facebook/rocksdb/pull/4053

Differential Revision: D8662546

Pulled By: maysamyabandeh

fbshipit-source-id: 402962602eb0f644e17822748332999c3af029fd
2018-06-27 17:13:34 -07:00
Manuel Ung a16e00b7b9 WriteUnPrepared Txn: Disable seek to snapshot optimization (#3955)
Summary:
This is implemented by extending ReadCallback with another function `MaxUnpreparedSequenceNumber` which returns the largest visible sequence number for the current transaction, if there is uncommitted data written to DB. Otherwise, it returns zero, indicating no uncommitted data.

There are the places where reads had to be modified.
- Get and Seek/Next was just updated to seek to max(snapshot_seq, MaxUnpreparedSequenceNumber()) instead, and iterate until a key was visible.
- Prev did not need need updates since it did not use the Seek to sequence number optimization. Assuming that locks were held when writing unprepared keys, and ValidateSnapshot runs, there should only be committed keys and unprepared keys of the current transaction, all of which are visible. Prev will simply iterate to get the last visible key.
- Reseeking to skip keys optimization was also disabled for write unprepared, since it's possible to hit the max_skip condition even while reseeking. There needs to be some way to resolve infinite looping in this case.
Closes https://github.com/facebook/rocksdb/pull/3955

Differential Revision: D8286688

Pulled By: lth

fbshipit-source-id: 25e42f47fdeb5f7accea0f4fd350ef35198caafe
2018-06-27 12:23:07 -07:00
Andrew Kryczka fea2b1dfb2 Copy Get() result when file reads use mmap
Summary:
For iterator reads, a `SuperVersion` is pinned to preserve a snapshot of SST files, and `Block`s are pinned to allow `key()` and `value()` to return pointers directly into a RocksDB memory region. This works for both non-mmap reads, where the block owns the memory region, and mmap reads, where the file owns the memory region.

For point reads with `PinnableSlice`, only the `Block` object is pinned. This works for non-mmap reads because the block owns the memory region, so even if the file is deleted after compaction, the memory region survives. However, for mmap reads, file deletion causes the memory region to which the `PinnableSlice` refers to be unmapped.   The result is usually a segfault upon accessing the `PinnableSlice`, although sometimes it returned wrong results (I repro'd this a bunch of times with `db_stress`).

This PR copies the value into the `PinnableSlice` when it comes from mmap'd memory. We can tell whether the `Block` owns its memory using `Block::cachable()`, which is unset when reads do not use the provided buffer as is the case with mmap file reads. When that is false we ensure the result of `Get()` is copied.

This feels like a short-term solution as ideally we'd have the `PinnableSlice` pin the mmap'd memory so we can do zero-copy reads. It seemed hard so I chose this approach to fix correctness in the meantime.
Closes https://github.com/facebook/rocksdb/pull/3881

Differential Revision: D8076288

Pulled By: ajkr

fbshipit-source-id: 31d78ec010198723522323dbc6ea325122a46b08
2018-06-01 16:57:58 -07:00
Andrew Kryczka 072ae671a7 Apply use_direct_io_for_flush_and_compaction to writes only
Summary:
Previously `DBOptions::use_direct_io_for_flush_and_compaction=true` combined with `DBOptions::use_direct_reads=false` could cause RocksDB to simultaneously read from two file descriptors for the same file, where background reads used direct I/O and foreground reads used buffered I/O. Our measurements found this mixed-mode I/O negatively impacted foreground read perf, compared to when only buffered I/O was used.

This PR makes the mixed-mode I/O situation impossible by repurposing `DBOptions::use_direct_io_for_flush_and_compaction` to only apply to background writes, and `DBOptions::use_direct_reads` to apply to all reads. There is no risk of direct background direct writes happening simultaneously with buffered reads since we never read from and write to the same file simultaneously.
Closes https://github.com/facebook/rocksdb/pull/3829

Differential Revision: D7915443

Pulled By: ajkr

fbshipit-source-id: 78bcbf276449b7e7766ab6b0db246f789fb1b279
2018-05-09 19:42:58 -07:00
David Lai 3be9b36453 comment unused parameters to turn on -Wunused-parameter flag
Summary:
This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557.
Closes https://github.com/facebook/rocksdb/pull/3662

Differential Revision: D7426121

Pulled By: Dayvedde

fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352
2018-04-12 17:59:16 -07:00
Phani Shekhar Mantripragada 446b32cfc3 Support for Column family specific paths.
Summary:
In this change, an option to set different paths for different column families is added.
This option is set via cf_paths setting of ColumnFamilyOptions. This option will work in a similar fashion to db_paths setting. Cf_paths is a vector of Dbpath values which contains a pair of the absolute path and target size. Multiple levels in a Column family can go to different paths if cf_paths has more than one path.
To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path.

Changes :
1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions.  This member is used to identify the path information whenever files are accessed.
2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting.
3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths.
4) Unit tests are added appropriately.
Closes https://github.com/facebook/rocksdb/pull/3102

Differential Revision: D6951697

Pulled By: ajkr

fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d
2018-04-05 19:58:20 -07:00
Dmitri Smirnov c364eb42b5 Windows cumulative patch
Summary:
This patch addressed several issues.
  Portability including db_test std::thread -> port::Thread Cc: @
  and %z to ROCKSDB portable macro. Cc: maysamyabandeh

  Implement Env::AreFilesSame

  Make the implementation of file unique number more robust

  Get rid of C-runtime and go directly to Windows API when dealing
  with file primitives.

  Implement GetSectorSize() and aling unbuffered read on the value if
  available.

  Adjust Windows Logger for the new interface, implement CloseImpl() Cc: anand1976

  Fix test running script issue where $status var was of incorrect scope
  so the failures were swallowed and not reported.

  DestroyDB() creates a logger and opens a LOG file in the directory
  being cleaned up. This holds a lock on the folder and the cleanup is
  prevented. This fails one of the checkpoin tests. We observe the same in production.
  We close the log file in this change.

 Fix DBTest2.ReadAmpBitmapLiveInCacheAfterDBClose failure where the test
 attempts to open a directory with NewRandomAccessFile which does not
 work on Windows.
  Fix DBTest.SoftLimit as it is dependent on thread timing. CC: yiwu-arbug
Closes https://github.com/facebook/rocksdb/pull/3552

Differential Revision: D7156304

Pulled By: siying

fbshipit-source-id: 43db0a757f1dfceffeb2b7988043156639173f5b
2018-03-06 11:57:43 -08:00
Andrew Kryczka 5d68243e61 Comment out unused variables
Summary:
Submitting on behalf of another employee.
Closes https://github.com/facebook/rocksdb/pull/3557

Differential Revision: D7146025

Pulled By: ajkr

fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e
2018-03-05 13:13:41 -08:00
Igor Sugak aba3409740 Back out "[codemod] - comment out unused parameters"
Reviewed By: igorsugak

fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37
2018-02-22 12:43:17 -08:00
David Lai f4a030ce81 - comment out unused parameters
Reviewed By: everiq, igorsugak

Differential Revision: D7046710

fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461
2018-02-22 09:44:23 -08:00
Andrew Kryczka ab5ab36ac2 fix DBTest2.ReadAmpBitmapLiveInCacheAfterDBClose file ID support check
Summary:
Updated the test case to handle tmpfs mounted at directories different from "/dev/shm/".
Closes https://github.com/facebook/rocksdb/pull/3440

Differential Revision: D6848213

Pulled By: ajkr

fbshipit-source-id: 465e9dbf0921d0930161f732db6b3766bb030589
2018-01-30 16:50:42 -08:00
Andrew Kryczka 46e599fc6b fix live WALs purged while file deletions disabled
Summary:
When calling `DisableFileDeletions` followed by `GetSortedWalFiles`, we guarantee the files returned by the latter call won't be deleted until after file deletions are re-enabled. However, `GetSortedWalFiles` didn't omit files already planned for deletion via `PurgeObsoleteFiles`, so the guarantee could be broken.

We fix it by making `GetSortedWalFiles` wait for the number of pending purges to hit zero if file deletions are disabled. This condition is eventually met since `PurgeObsoleteFiles` is guaranteed to be called for the existing pending purges, and new purges cannot be scheduled while file deletions are disabled. Once the condition is met, `GetSortedWalFiles` simply returns the content of DB and archive directories, which nobody can delete (except for deletion scheduler, for which I plan to fix this bug later) until deletions are re-enabled.
Closes https://github.com/facebook/rocksdb/pull/3341

Differential Revision: D6681131

Pulled By: ajkr

fbshipit-source-id: 90b1e2f2362ea9ef715623841c0826611a817634
2018-01-17 17:42:04 -08:00
Andrew Kryczka 93f69cb93a use bottommost compression when base level is bottommost
Summary:
The previous compression type selection caused unexpected behavior when the base level was also the bottommost level. The following sequence of events could happen:

- full compaction generates files with `bottommost_compression` type
- now base level is bottommost level since all files are in the same level
- any compaction causes files to be rewritten `compression_per_level` type since bottommost compression didn't apply to base level

I changed the code to make bottommost compression apply to base level.
Closes https://github.com/facebook/rocksdb/pull/3141

Differential Revision: D6264614

Pulled By: ajkr

fbshipit-source-id: d7aaa8675126896684154a1f2c9034d6214fde82
2017-11-09 17:42:00 -08:00
Andrew Kryczka 24ad430600 pass key/value samples through zstd compression dictionary generator
Summary:
Instead of using samples directly, we now support passing the samples through zstd's dictionary generator when `CompressionOptions::zstd_max_train_bytes` is set to nonzero. If set to zero, we will use the samples directly as the dictionary -- same as before.

Note this is the first step of #2987, extracted into a separate PR per reviewer request.
Closes https://github.com/facebook/rocksdb/pull/3057

Differential Revision: D6116891

Pulled By: ajkr

fbshipit-source-id: 70ab13cc4c734fa02e554180eed0618b75255497
2017-11-02 22:56:36 -07:00
Yi Wu 8e63cad078 fix lite build
Summary:
* make `checksum_type_string_map` available for lite
* comment out `FilesPerLevel` in lite mode.
* travis and legocastle lite build also build `all` target and run tests
Closes https://github.com/facebook/rocksdb/pull/3015

Differential Revision: D6069822

Pulled By: yiwu-arbug

fbshipit-source-id: 9fe92ac220e711e9e6ed4e921bd25ef4314796a0
2017-10-17 08:57:09 -07:00
Maysam Yabandeh f46464d383 write-prepared txn: call IsInSnapshot
Summary:
This patch instruments the read path to verify each read value against an optional ReadCallback class. If the value is rejected, the reader moves on to the next value. The WritePreparedTxn makes use of this feature to skip sequence numbers that are not in the read snapshot.
Closes https://github.com/facebook/rocksdb/pull/2850

Differential Revision: D5787375

Pulled By: maysamyabandeh

fbshipit-source-id: 49d808b3062ab35e7ae98ad388f659757794184c
2017-09-11 09:14:48 -07:00
Artem Danilov 8a6708f5f2 Extend property map with compaction stats
Summary:
This branch extends existing property map which keeps values in doubles to keep values in strings so that it can be used to provide wider range of properties. The immediate need for that is to provide IO stall stats in an easy parseable way to MyRocks which is also part of this branch.
Closes https://github.com/facebook/rocksdb/pull/2794

Differential Revision: D5717676

Pulled By: Tema

fbshipit-source-id: e34ba5b79ba774697f7b97ce1138d8fd55471b8a
2017-08-30 15:26:55 -07:00
Yi Wu 3c840d1a6d Allow DB reopen with reduced options.num_levels
Summary:
Allow user to reduce number of levels in LSM by issue a full CompactRange() and put the result in a lower level, and then reopen DB with reduced options.num_levels. Previous this will fail on reopen on when recovery replaying the previous MANIFEST and found a historical file was on a higher level than the new options.num_levels. The workaround was after CompactRange(), reopen the DB with old num_levels, which will create a new MANIFEST, and then reopen the DB again with new num_levels.

This patch relax the check of levels during recovery. It allows DB to open if there was a historical file on level > options.num_levels, but was also deleted.
Closes https://github.com/facebook/rocksdb/pull/2740

Differential Revision: D5629354

Pulled By: yiwu-arbug

fbshipit-source-id: 545903f6b36b6083e8cbaf777176aef2f488021d
2017-08-24 16:10:54 -07:00
Siying Dong 666a005f9b Support prefetch last 512KB with direct I/O in block based file reader
Summary:
Right now, if direct I/O is enabled, prefetching the last 512KB cannot be applied, except compaction inputs or readahead is enabled for iterators. This can create a lot of I/O for HDD cases. To solve the problem, the 512KB is prefetched in block based table if direct I/O is enabled. The prefetched buffer is passed in totegher with random access file reader, so that we try to read from the buffer before reading from the file. This can be extended in the future to support flexible user iterator readahead too.
Closes https://github.com/facebook/rocksdb/pull/2708

Differential Revision: D5593091

Pulled By: siying

fbshipit-source-id: ee36ff6d8af11c312a2622272b21957a7b5c81e7
2017-08-11 12:16:45 -07:00
Siying Dong e7697b8ce8 Fix LITE unit tests
Summary: Closes https://github.com/facebook/rocksdb/pull/2649

Differential Revision: D5505778

Pulled By: siying

fbshipit-source-id: 7e935603ede3d958ea087ed6b8cfc4121e8797bc
2017-07-26 21:11:47 -07:00
Sagar Vemuri 72502cf227 Revert "comment out unused parameters"
Summary:
This reverts the previous commit 1d7048c598, which broke the build.

Did a `git revert 1d7048c`.
Closes https://github.com/facebook/rocksdb/pull/2627

Differential Revision: D5476473

Pulled By: sagar0

fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06
2017-07-21 18:26:26 -07:00
Victor Gao 1d7048c598 comment out unused parameters
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.

Reviewed By: igorsugak

Differential Revision: D5454343

fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
2017-07-21 14:57:44 -07:00
Siying Dong 3c327ac2d0 Change RocksDB License
Summary: Closes https://github.com/facebook/rocksdb/pull/2589

Differential Revision: D5431502

Pulled By: siying

fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75
2017-07-15 16:11:23 -07:00
Ewout Prangsma 51778612c9 Encryption at rest support
Summary:
This PR adds support for encrypting data stored by RocksDB when written to disk.

It adds an `EncryptedEnv` override of the `Env` class with matching overrides for sequential&random access files.
The encryption itself is done through a configurable `EncryptionProvider`. This class creates is asked to create `BlockAccessCipherStream` for a file. This is where the actual encryption/decryption is being done.
Currently there is a Counter mode implementation of `BlockAccessCipherStream` with a `ROT13` block cipher (NOTE the `ROT13` is for demo purposes only!!).

The Counter operation mode uses an initial counter & random initialization vector (IV).
Both are created randomly for each file and stored in a 4K (default size) block that is prefixed to that file. The `EncryptedEnv` implementation is such that clients of the `Env` class do not see this prefix (nor data, nor in filesize).
The largest part of the prefix block is also encrypted, and there is room left for implementation specific settings/values/keys in there.

To test the encryption, the `DBTestBase` class has been extended to consider a new environment variable called `ENCRYPTED_ENV`. If set, the test will setup a encrypted instance of the `Env` class to use for all tests.
Typically you would run it like this:

```
ENCRYPTED_ENV=1 make check_some
```

There is also an added test that checks that some data inserted into the database is or is not "visible" on disk. With `ENCRYPTED_ENV` active it must not find plain text strings, with `ENCRYPTED_ENV` unset, it must find the plain text strings.
Closes https://github.com/facebook/rocksdb/pull/2424

Differential Revision: D5322178

Pulled By: sdwilsh

fbshipit-source-id: 253b0a9c2c498cc98f580df7f2623cbf7678a27f
2017-06-26 16:56:24 -07:00
Andrew Kryczka c217e0b9c7 Call RateLimiter for compaction reads
Summary:
Allow users to rate limit background work based on read bytes, written bytes, or sum of read and written bytes. Support these by changing the RateLimiter API, so no additional options were needed.
Closes https://github.com/facebook/rocksdb/pull/2433

Differential Revision: D5216946

Pulled By: ajkr

fbshipit-source-id: aec57a8357dbb4bfde2003261094d786d94f724e
2017-06-13 14:56:46 -07:00
Siying Dong 52a7f38b19 WriteOptions.low_pri which can throttle low pri writes if needed
Summary:
If ReadOptions.low_pri=true and compaction is behind, the write will either return immediate or be slowed down based on ReadOptions.no_slowdown.
Closes https://github.com/facebook/rocksdb/pull/2369

Differential Revision: D5127619

Pulled By: siying

fbshipit-source-id: d30e1cff515890af0eff32dfb869d2e4c9545eb0
2017-06-05 15:02:35 -07:00
Siying Dong 95b0e89b5d Improve write buffer manager (and allow the size to be tracked in block cache)
Summary:
Improve write buffer manager in several ways:
1. Size is tracked when arena block is allocated, rather than every allocation, so that it can better track actual memory usage and the tracking overhead is slightly lower.
2. We start to trigger memtable flush when 7/8 of the memory cap hits, instead of 100%, and make 100% much harder to hit.
3. Allow a cache object to be passed into buffer manager and the size allocated by memtable can be costed there. This can help users have one single memory cap across block cache and memtable.
Closes https://github.com/facebook/rocksdb/pull/2350

Differential Revision: D5110648

Pulled By: siying

fbshipit-source-id: b4238113094bf22574001e446b5d88523ba00017
2017-06-02 14:26:56 -07:00
Andrew Kryczka bb01c1880c Introduce max_background_jobs mutable option
Summary:
- `max_background_flushes` and `max_background_compactions` are still supported for backwards compatibility
- `base_background_compactions` is completely deprecated. Now we just throttle to one background compaction when there's no pressure.
- `max_background_jobs` is added to automatically partition the concurrent background jobs into flushes vs compactions. Currently it's very simple as we just allocate one-fourth of the jobs to flushes, and the remaining can be used for compactions.
- The test cases that set `base_background_compactions > 1` needed to be updated. I just grab the pressure token such that the desired number of compactions can be scheduled.
Closes https://github.com/facebook/rocksdb/pull/2205

Differential Revision: D4937461

Pulled By: ajkr

fbshipit-source-id: df52cbbd497e13bbc9a60560a5ac2a2526b3f1f9
2017-05-24 11:29:08 -07:00
Aaron Gao 492fc49a86 fix readampbitmap tests
Summary:
fix test failure of ReadAmpBitmap and ReadAmpBitmapLiveInCacheAfterDBClose.
test ReadAmpBitmapLiveInCacheAfterDBClose individually and make check
Closes https://github.com/facebook/rocksdb/pull/2271

Differential Revision: D5038133

Pulled By: lightmark

fbshipit-source-id: 803cd6f45ccfdd14a9d9473c8af311033e164be8
2017-05-10 12:29:23 -07:00
Andrew Kryczka be421b0b16 portable sched_getcpu calls
Summary:
- added a feature test in build_detect_platform to check whether sched_getcpu() is available. glibc offers it only on some platforms (e.g., linux but not mac); this way should be easier than maintaining a list of platforms on which it's available.
- refactored PhysicalCoreID() to be simpler / less repetitive. ordered the conditional compilation clauses from most-to-least preferred
Closes https://github.com/facebook/rocksdb/pull/2272

Differential Revision: D5038093

Pulled By: ajkr

fbshipit-source-id: 81d7db3cc620250de220bdeb3194b2b3d7673de7
2017-05-10 12:29:23 -07:00
Aaron Gao 259a00eaca unbiase readamp bitmap
Summary:
Consider BlockReadAmpBitmap with bytes_per_bit = 32. Suppose bytes [a, b) were used, while bytes [a-32, a)
 and [b+1, b+33) weren't used; more formally, the union of ranges passed to BlockReadAmpBitmap::Mark() contains [a, b) and doesn't intersect with [a-32, a) and [b+1, b+33). Then bits [floor(a/32), ceil(b/32)] will be set, and so the number of useful bytes will be estimated as (ceil(b/32) - floor(a/32)) * 32, which is on average equal to b-a+31.

An extreme example: if we use 1 byte from each block, it'll be counted as 32 bytes from each block.

It's easy to remove this bias by slightly changing the semantics of the bitmap. Currently each bit represents a byte range [i*32, (i+1)*32).

This diff makes each bit represent a single byte: i*32 + X, where X is a random number in [0, 31] generated when bitmap is created. So, e.g., if you read a single byte at random, with probability 31/32 it won't be counted at all, and with probability 1/32 it will be counted as 32 bytes; so, on average it's counted as 1 byte.

*But there is one exception: the last bit will always set with the old way.*

(*) - assuming read_amp_bytes_per_bit = 32.
Closes https://github.com/facebook/rocksdb/pull/2259

Differential Revision: D5035652

Pulled By: lightmark

fbshipit-source-id: bd98b1b9b49fbe61f9e3781d07f624e3cbd92356
2017-05-10 01:49:52 -07:00
Siying Dong d616ebea23 Add GPLv2 as an alternative license.
Summary: Closes https://github.com/facebook/rocksdb/pull/2226

Differential Revision: D4967547

Pulled By: siying

fbshipit-source-id: dd3b58ae1e7a106ab6bb6f37ab5c88575b125ab4
2017-04-27 18:06:12 -07:00
Tomas Kolda 04d58970cb AIX and Solaris Sparc Support
Summary:
Replacement of #2147

The change was squashed due to a lot of conflicts.
Closes https://github.com/facebook/rocksdb/pull/2194

Differential Revision: D4929799

Pulled By: siying

fbshipit-source-id: 5cd49c254737a1d5ac13f3c035f128e86524c581
2017-04-21 20:48:04 -07:00
Aaron Gao 44fa8ece9b change use_direct_writes to use_direct_io_for_flush_and_compaction
Summary:
Replace Options::use_direct_writes with Options::use_direct_io_for_flush_and_compaction
Now if Options::use_direct_io_for_flush_and_compaction = true, we will enable direct io for both reads and writes for flush and compaction job. Whereas Options::use_direct_reads controls user reads like iterator and Get().
Closes https://github.com/facebook/rocksdb/pull/2117

Differential Revision: D4860912

Pulled By: lightmark

fbshipit-source-id: d93575a8a5e780cf7e40797287edc425ee648c19
2017-04-13 16:12:04 -07:00
Sagar Vemuri 6a6723ee1e Move MergeOperatorPinning tests to be with other merge operator tests
Summary:
Moved MergeOperatorPinning tests from db_test2.cc to db_merge_operator_test.cc.

[This is the same code as PR #2104 , which has already been reviewed,  but I am creating a new PR as I cannot import from #2104 onto phabricator anymore even after rebasing. I'll close and discard #2104.]
Closes https://github.com/facebook/rocksdb/pull/2125

Differential Revision: D4863312

Pulled By: sagar0

fbshipit-source-id: 0f71a7690aa09c1d03ee85ce2bc1d2d89e4f4399
2017-04-11 16:15:06 -07:00
Maysam Yabandeh 8b0097b49b Readers for partition filter
Summary:
This is the last split of this pull request: https://github.com/facebook/rocksdb/pull/1891 which includes the reader part as well as the tests.
Closes https://github.com/facebook/rocksdb/pull/1961

Differential Revision: D4672216

Pulled By: maysamyabandeh

fbshipit-source-id: 6a2b829
2017-03-22 09:24:15 -07:00
Siying Dong 8f5bf04468 Flush triggered by DB write buffer size picks the oldest unflushed CF
Summary:
Previously, when DB write buffer size triggers, we always pick the CF with most data in its memtable to flush. This approach can minimize total flush happens. Change the behavior to always pick the oldest unflushed CF, which makes it the same behavior when max_total_wal_size hits. This approach will minimize size used by max_total_wal_size.
Closes https://github.com/facebook/rocksdb/pull/1987

Differential Revision: D4703214

Pulled By: siying

fbshipit-source-id: 9ff8b09
2017-03-21 11:09:10 -07:00
Sagar Vemuri 97edc72d39 Add a memtable-only iterator
Summary:
This PR is to support a way to iterate over all the keys that are just in memtables.
Closes https://github.com/facebook/rocksdb/pull/1953

Differential Revision: D4663500

Pulled By: sagar0

fbshipit-source-id: 144e177
2017-03-07 11:54:10 -08:00
Aaron Gao 286a36db7f posix writablefile truncate
Summary:
we occasionally missing this call so the file size will be wrong
Closes https://github.com/facebook/rocksdb/pull/1894

Differential Revision: D4598446

Pulled By: lightmark

fbshipit-source-id: 42b6ef5
2017-02-22 10:09:14 -08:00
Dmitri Smirnov 0a4cdde50a Windows thread
Summary:
introduce new methods into a public threadpool interface,
- allow submission of std::functions as they allow greater flexibility.
- add Joining methods to the implementation to join scheduled and submitted jobs with
  an option to cancel jobs that did not start executing.
- Remove ugly `#ifdefs` between pthread and std implementation, make it uniform.
- introduce pimpl for a drop in replacement of the implementation
- Introduce rocksdb::port::Thread typedef which is a replacement for std::thread.  On Posix Thread defaults as before std::thread.
- Implement WindowsThread that allocates memory in a more controllable manner than windows std::thread with a replaceable implementation.
- should be no functionality changes.
Closes https://github.com/facebook/rocksdb/pull/1823

Differential Revision: D4492902

Pulled By: siying

fbshipit-source-id: c74cb11
2017-02-06 14:54:18 -08:00
Siying Dong 036d668b19 Fix wrong result in data race case related to Get()
Summary:
In theory, Get() can get a wrong result, if it races in a special with with flush. The bug can be reproduced in DBTest2.GetRaceFlush. Fix this bug by getting snapshot after referencing the super version.
Closes https://github.com/facebook/rocksdb/pull/1816

Differential Revision: D4475958

Pulled By: siying

fbshipit-source-id: bd9e67a
2017-02-03 11:39:15 -08:00
Siying Dong f25f1ec60b Add test DBTest2.GetRaceFlush which can expose a data race bug
Summary:
A current data race issue in Get() and Flush() can cause a Get() to return wrong results when a flush happened in the middle. Disable the test for now.
Closes https://github.com/facebook/rocksdb/pull/1813

Differential Revision: D4472310

Pulled By: siying

fbshipit-source-id: 5755ebd
2017-01-26 16:39:14 -08:00
Siying Dong 0e8dfd6062 Fix OptimizeForPointLookup()
Summary:
If users directly call OptimizeForPointLookup(), it is broken as the option isn't compatible with parallel memtable insert. Fix it by using memtable bloomo filter instead.
Closes https://github.com/facebook/rocksdb/pull/1791

Differential Revision: D4442836

Pulled By: siying

fbshipit-source-id: bf6c9cd
2017-01-20 10:54:12 -08:00
Yi Wu 5d1457dbbf Dump persistent cache options
Summary:
Dump persistent cache options
Closes https://github.com/facebook/rocksdb/pull/1679

Differential Revision: D4337019

Pulled By: yiwu-arbug

fbshipit-source-id: 3812f8a
2016-12-19 14:09:12 -08:00
Karthikeyan Radhakrishnan 4118e13330 Persistent Cache: Expose stats to user via public API
Summary:
Exposing persistent cache stats (counters) to the user via public API.
Closes https://github.com/facebook/rocksdb/pull/1485

Differential Revision: D4155274

Pulled By: siying

fbshipit-source-id: 30a9f50
2016-11-21 17:39:13 -08:00
Siying Dong 972e3ff295 Enable allow_concurrent_memtable_write and enable_write_thread_adaptive_yield by default
Summary: Closes https://github.com/facebook/rocksdb/pull/1496

Differential Revision: D4168080

Pulled By: siying

fbshipit-source-id: 056ae62
2016-11-16 09:39:09 -08:00
Maysam Yabandeh 361010d447 Exporting compaction stats in the form of a map
Summary:
Currently the compaction stats are printed to stdout. We want to export the compaction stats in a map format so that the upper layer apps (e.g., MySQL) could present
the stats in any format required by the them.
Closes https://github.com/facebook/rocksdb/pull/1477

Differential Revision: D4149836

Pulled By: maysamyabandeh

fbshipit-source-id: b3df19f
2016-11-11 20:54:14 -08:00
Islam AbdelRahman 5691a1d8a4 Fix compaction conflict with running compaction
Summary:
Issue scenario:
(1) We have 3 files in L1 and we issue a compaction that will compact them into 1 file in L2
(2) While compaction (1) is running, we flush a file into L0 and trigger another compaction that decide to move this file to L1 and then move it again to L2 (this file don't overlap with any other files)
(3) compaction (1) finishes and install the file it generated in L2, but this file overlap with the file we generated in (2) so we break the LSM consistency

Looks like this issue can be triggered by using non-exclusive manual compaction or AddFile()

Test Plan: unit tests

Reviewers: sdong

Reviewed By: sdong

Subscribers: hermanlee4, jkedgar, andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D64947
2016-10-13 10:49:06 -07:00
Aaron Gao 715256338a forbid merge during recovery
Summary:
Mitigate regression bug of options.max_successive_merges hit during DB Recovery
For https://reviews.facebook.net/D62625

Test Plan: make all check

Reviewers: horuff, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62655
2016-09-21 11:05:07 -07:00
sdong b666f85445 Consider more factors when determining preallocation size of WAL files
Summary: Currently the WAL file preallocation size is 1.1 * write_buffer_size. This, however, will be over-estimated if options.db_write_buffer_size or options.max_total_wal_size is set and is much smaller.

Test Plan: Add a unit test.

Reviewers: andrewkr, yiwu

Reviewed By: yiwu

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D63957
2016-09-19 12:04:35 -07:00
sdong 607628d349 Support ZSTD with finalized format
Summary:
ZSTD 1.0.0 is coming. We can finally add a support of ZSTD without worrying about compatibility.
Still keep ZSTDNotFinal for compatibility reason.

Test Plan: Run all tests. Run db_bench with ZSTD version with RocksDB built with ZSTD 1.0 and older.

Reviewers: andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: cyan, igor, IslamAbdelRahman, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D63141
2016-09-06 12:22:16 -07:00
sdong 32149059f9 Merge options source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes
Summary: To reduce number of options, merge source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes.

Test Plan: Add two new unit tests. Run all existing tests, including jtest.

Reviewers: yhchiang, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59829
2016-09-01 14:33:24 -07:00
Islam AbdelRahman b49b92cf28 Introduce Read amplification bitmap (read amp statistics)
Summary:
Add ReadOptions::read_amp_bytes_per_bit option which allow us to create a bitmap for every data block we read
the bitmap will contain (block_size / read_amp_bytes_per_bit) bits.

We will use this bitmap to mark which bytes have been used of the block so we can calculate the read amplification

Test Plan: added new tests

Reviewers: andrewkr, yhchiang, sdong

Reviewed By: sdong

Subscribers: yiwu, leveldb, march, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58707
2016-08-26 18:55:58 -07:00
sdong dade61ac26 Mitigate regression bug of options.max_successive_merges hit during DB Recovery
Summary:
After 1b8a2e8fdd, DB Pointer is passed to WriteBatchInternal::InsertInto() while DB recovery. This can cause deadlock if options.max_successive_merges hits. In that case DB::Get() will be called. Get() will try to acquire the DB mutex, which is already held by the DB::Open(), causing a deadlock condition.

This commit mitigates the problem by not passing the DB pointer unless 2PC is allowed.

Test Plan: Add a new test and run it.

Reviewers: IslamAbdelRahman, andrewkr, kradhakrishnan, horuff

Reviewed By: kradhakrishnan

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62625
2016-08-25 17:30:34 -07:00
Islam AbdelRahman d11c09d9e2 Eliminate memcpy from ForwardIterator
Summary:
This diff update ForwardIterator to support pinning keys and values, which will allow DBIter to take advantage of that and eliminate memcpy when executing merge operators
This diff is stacked on D61305

Test Plan:
existing tests (updated them to test tailing iterator)
new test

Reviewers: andrewkr, yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60009
2016-08-11 19:10:16 -07:00
omegaga e70020e4f6 Only cache level 0 indexes and filter when opening table reader
Summary: In T8216281 we decided to disable prefetching the index and filter during opening table handlers during startup (max_open_files = -1).

Test Plan: Rely on `IndexAndFilterBlocksOfNewTableAddedToCache` to guarantee L0 indexes and filters are still cached and change `PinL0IndexAndFilterBlocksTest` to make sure other levels are not cached (maybe add one more test to test we don't cache other levels?)

Reviewers: sdong, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59913
2016-07-20 11:23:31 -07:00
Islam AbdelRahman 68a8e6b8fa Introduce FullMergeV2 (eliminate memcpy from merge operators)
Summary:
This diff update the code to pin the merge operator operands while the merge operation is done, so that we can eliminate the memcpy cost, to do that we need a new public API for FullMerge that replace the std::deque<std::string> with std::vector<Slice>

This diff is stacked on top of D56493 and D56511

In this diff we
- Update FullMergeV2 arguments to be encapsulated in MergeOperationInput and MergeOperationOutput which will make it easier to add new arguments in the future
- Replace std::deque<std::string> with std::vector<Slice> to pass operands
- Replace MergeContext std::deque with std::vector (based on a simple benchmark I ran https://gist.github.com/IslamAbdelRahman/78fc86c9ab9f52b1df791e58943fb187)
- Allow FullMergeV2 output to be an existing operand

```
[Everything in Memtable | 10K operands | 10 KB each | 1 operand per key]

DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=10000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000

[FullMergeV2]
readseq      :       0.607 micros/op 1648235 ops/sec; 16121.2 MB/s
readseq      :       0.478 micros/op 2091546 ops/sec; 20457.2 MB/s
readseq      :       0.252 micros/op 3972081 ops/sec; 38850.5 MB/s
readseq      :       0.237 micros/op 4218328 ops/sec; 41259.0 MB/s
readseq      :       0.247 micros/op 4043927 ops/sec; 39553.2 MB/s

[master]
readseq      :       3.935 micros/op 254140 ops/sec; 2485.7 MB/s
readseq      :       3.722 micros/op 268657 ops/sec; 2627.7 MB/s
readseq      :       3.149 micros/op 317605 ops/sec; 3106.5 MB/s
readseq      :       3.125 micros/op 320024 ops/sec; 3130.1 MB/s
readseq      :       4.075 micros/op 245374 ops/sec; 2400.0 MB/s
```

```
[Everything in Memtable | 10K operands | 10 KB each | 10 operand per key]

DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=1000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000

[FullMergeV2]
readseq      :       3.472 micros/op 288018 ops/sec; 2817.1 MB/s
readseq      :       2.304 micros/op 434027 ops/sec; 4245.2 MB/s
readseq      :       1.163 micros/op 859845 ops/sec; 8410.0 MB/s
readseq      :       1.192 micros/op 838926 ops/sec; 8205.4 MB/s
readseq      :       1.250 micros/op 800000 ops/sec; 7824.7 MB/s

[master]
readseq      :      24.025 micros/op 41623 ops/sec;  407.1 MB/s
readseq      :      18.489 micros/op 54086 ops/sec;  529.0 MB/s
readseq      :      18.693 micros/op 53495 ops/sec;  523.2 MB/s
readseq      :      23.621 micros/op 42335 ops/sec;  414.1 MB/s
readseq      :      18.775 micros/op 53262 ops/sec;  521.0 MB/s

```

```
[Everything in Block cache | 10K operands | 10 KB each | 1 operand per key]

[FullMergeV2]
$ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions
readseq      :      14.741 micros/op 67837 ops/sec;  663.5 MB/s
readseq      :       1.029 micros/op 971446 ops/sec; 9501.6 MB/s
readseq      :       0.974 micros/op 1026229 ops/sec; 10037.4 MB/s
readseq      :       0.965 micros/op 1036080 ops/sec; 10133.8 MB/s
readseq      :       0.943 micros/op 1060657 ops/sec; 10374.2 MB/s

[master]
readseq      :      16.735 micros/op 59755 ops/sec;  584.5 MB/s
readseq      :       3.029 micros/op 330151 ops/sec; 3229.2 MB/s
readseq      :       3.136 micros/op 318883 ops/sec; 3119.0 MB/s
readseq      :       3.065 micros/op 326245 ops/sec; 3191.0 MB/s
readseq      :       3.014 micros/op 331813 ops/sec; 3245.4 MB/s
```

```
[Everything in Block cache | 10K operands | 10 KB each | 10 operand per key]

DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10-operands-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions

[FullMergeV2]
readseq      :      24.325 micros/op 41109 ops/sec;  402.1 MB/s
readseq      :       1.470 micros/op 680272 ops/sec; 6653.7 MB/s
readseq      :       1.231 micros/op 812347 ops/sec; 7945.5 MB/s
readseq      :       1.091 micros/op 916590 ops/sec; 8965.1 MB/s
readseq      :       1.109 micros/op 901713 ops/sec; 8819.6 MB/s

[master]
readseq      :      27.257 micros/op 36687 ops/sec;  358.8 MB/s
readseq      :       4.443 micros/op 225073 ops/sec; 2201.4 MB/s
readseq      :       5.830 micros/op 171526 ops/sec; 1677.7 MB/s
readseq      :       4.173 micros/op 239635 ops/sec; 2343.8 MB/s
readseq      :       4.150 micros/op 240963 ops/sec; 2356.8 MB/s
```

Test Plan: COMPILE_WITH_ASAN=1 make check -j64

Reviewers: yhchiang, andrewkr, sdong

Reviewed By: sdong

Subscribers: lovro, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57075
2016-07-20 09:49:03 -07:00
omegaga cd4178a015 Add a new feature to enforce a sync point only active on a thread
Summary: Add markers to sync points. A marked sync point will only be active when it is on the same thread as the marker sync point.

Test Plan: Write a unit test to validate.

Reviewers: sdong, IslamAbdelRahman, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60375
2016-07-07 11:29:14 -07:00
sdong 32df9733d1 Add options.write_buffer_manager: control total memtable size across DB instances
Summary: Add option write_buffer_manager to help users control total memory spent on memtables across multiple DB instances.

Test Plan: Add a new unit test.

Reviewers: yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: adela, benj, sumeet, muthu, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59925
2016-07-05 18:11:25 -07:00
Islam AbdelRahman c70a9335de Fix mutex unlock issue between scheduled compaction and ReleaseCompactionFiles()
Summary:
NotifyOnCompactionCompleted can unlock the mutex.
That mean that we can schedule a background compaction that will start before we ReleaseCompactionFiles().

Test Plan:
added unittest
existing unittest

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: yoshinorim, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58065
2016-05-18 14:56:30 -07:00
krad a08c8c851a Added PersistentCache abstraction
Summary:
Added a new abstraction to cache page to RocksDB designed for the read
cache use.

RocksDB current block cache is more of an object cache. For the persistent read cache
project, what we need is a page cache equivalent. This changes adds a cache
abstraction to RocksDB to cache pages called PersistentCache. PersistentCache can cache
uncompressed pages or raw pages (content as in filesystem). The user can
choose to operate PersistentCache either in  COMPRESSED or UNCOMPRESSED mode.

Blame Rev:

Test Plan: Run unit tests

Reviewers: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55707
2016-05-15 22:17:18 -07:00
Islam AbdelRahman d86f9b9c3f Fix lite build
Summary: Fix lite build

Test Plan: run under lite

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57945
2016-05-09 16:08:30 -07:00
Islam AbdelRahman 4b31723433 Add bottommost_compression option
Summary:
Add a new option that can be used to set a specific compression algorithm for bottommost level.
This option will only affect levels larger than base level.

I have also updated CompactionJobInfo to include the compression algorithm used in compaction

Test Plan:
added new unittest
existing unittests

Reviewers: andrewkr, yhchiang, sdong

Reviewed By: sdong

Subscribers: lightmark, andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D57669
2016-05-09 15:57:19 -07:00
Andrew Kryczka a06faa6327 Skip PresetCompressionDict test for lite
Summary:
This test relies on "rocksdb.num-files-at-levelN" property that isn't
implemented in rocksdb lite. So we will compile it only for non-lite builds.

Test Plan:
  $ make -j40 check 'OPT=-g -DROCKSDB_LITE'

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57387
2016-04-28 15:11:28 -07:00
Andrew Kryczka 0f428c5619 Fix compression dictionary clang osx error
Summary:
There was one narrowing conversion in D52287 that only showed up with
clang on osx.

Test Plan:
  $ make clean && USE_CLANG=1 DISABLE_JEMALLOC=1 TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make -j32 check

Reviewers: sdong, lightmark, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57357
2016-04-28 10:42:10 -07:00
Andrew Kryczka 843d2e3137 Shared dictionary compression using reference block
Summary:
This adds a new metablock containing a shared dictionary that is used
to compress all data blocks in the SST file. The size of the shared dictionary
is configurable in CompressionOptions and defaults to 0. It's currently only
used for zlib/lz4/lz4hc, but the block will be stored in the SST regardless of
the compression type if the user chooses a nonzero dictionary size.

During compaction, computes the dictionary by randomly sampling the first
output file in each subcompaction. It pre-computes the intervals to sample
by assuming the output file will have the maximum allowable length. In case
the file is smaller, some of the pre-computed sampling intervals can be beyond
end-of-file, in which case we skip over those samples and the dictionary will
be a bit smaller. After the dictionary is generated using the first file in a
subcompaction, it is loaded into the compression library before writing each
block in each subsequent file of that subcompaction.

On the read path, gets the dictionary from the metablock, if it exists. Then,
loads that dictionary into the compression library before reading each block.

Test Plan: new unit test

Reviewers: yhchiang, IslamAbdelRahman, cyan, sdong

Reviewed By: sdong

Subscribers: andrewkr, yoshinorim, kradhakrishnan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D52287
2016-04-27 17:36:03 -07:00
Yi Wu 792762c42c Split db_test.cc
Summary: Split db_test.cc into several files. Moving several helper functions into DBTestBase.

Test Plan: make check

Reviewers: sdong, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, andrewkr, kradhakrishnan, yhchiang, leveldb, sdong

Differential Revision: https://reviews.facebook.net/D56715
2016-04-18 09:42:50 -07:00
Victor Tyutyunov e5c614e1df Fixing snapshot 0 assertion
Summary:
Solution is not to change db sequence number to start from 1 because 0 value is used in multiple other places.
Fix covers only compact_iterator::findEarliestVisibleSnapshot with updated logic to support snapshot's numbering starting from 0.

Test Plan:
run:
  make all check

it should pass all tests

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: lgalanis, mgalushka, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56601
2016-04-16 01:47:15 -07:00