Commit graph

2688 commits

Author SHA1 Message Date
Yi Wu c2247dc1c7 Make DBImpl::has_unpersisted_data_ atomic
Summary:
Seems to me `has_unpersisted_data_` is read from read thread and write
from write thread concurrently without synchronization. Making it an
atomic.

I update the logic not because seeing any problem with it, but it just
feel confusing.
Closes https://github.com/facebook/rocksdb/pull/1869

Differential Revision: D4555837

Pulled By: yiwu-arbug

fbshipit-source-id: eff2ab8
2017-02-13 18:54:13 -08:00
Sagar Vemuri eb912a927e Remove disableDataSync option
Summary:
Remove disableDataSync, and another similarly named disable_data_sync options.
This is being done to simplify options, and also because the performance gains of this feature can be achieved by other methods.
Closes https://github.com/facebook/rocksdb/pull/1859

Differential Revision: D4541292

Pulled By: sagar0

fbshipit-source-id: 5b3a6ca
2017-02-13 11:09:13 -08:00
Dmitri Smirnov a5adda0642 Fix repair issues
Summary:
Record the first parsed sequence number as the minimum
  so we can find the true minimum otherwise everything is larger than zero.
  Fix the comparator name comparision.
Closes https://github.com/facebook/rocksdb/pull/1858

Differential Revision: D4544365

Pulled By: ajkr

fbshipit-source-id: 439cbc2
2017-02-10 10:54:12 -08:00
Andrew Kryczka b48e4778be Consolidate file cutting logic in compaction loop
Summary:
It was really annoying to have two places (top and bottom of compaction loop) where we cut output files. I had bugs in both DeleteRange and dictionary compression due to updating only one of the two. This diff consolidates the file-cutting logic to the bottom of the compaction loop.

Keep in mind that my goal with input_status is to be consistent with the past behavior, even though I'm not sure it's ideal.
Closes https://github.com/facebook/rocksdb/pull/1832

Differential Revision: D4503038

Pulled By: ajkr

fbshipit-source-id: 7da5213
2017-02-08 16:24:17 -08:00
Maysam Yabandeh c4a37dcb44 Print the missed last layer in cfstats
Summary:
Printing compaction stats used to operate on two variable:
number_levels_: for printing the layer
num_levels_to_check: for updating the compaction score

After this commit: 361010d447 these two are mixed up and as a result the last layer might not be printed out: https://fb.facebook.com/groups/rocksdb.internal/permalink/1315716625143616/

number_levels_ was used to decide which layers to print: 672300f47f/db/internal_stats.cc (L753) but after the patch it is based on the return value of DumpCFMapStats 361010d447/db/internal_stats.cc (L929) which returns num_levels_to_check: 361010d447/db/internal_stats.cc (L917)
Closes https://github.com/facebook/rocksdb/pull/1853

Differential Revision: D4529280

Pulled By: maysamyabandeh

fbshipit-source-id: 3fd9448
2017-02-08 10:39:15 -08:00
Maysam Yabandeh 69d5262c81 Two-level Indexes
Summary:
Partition Index blocks and use a Partition-index as a 2nd level index.

The two-level index can be used by setting
BlockBasedTableOptions::kTwoLevelIndexSearch as the index type and
configuring BlockBasedTableOptions::index_per_partition

t15539501
Closes https://github.com/facebook/rocksdb/pull/1814

Differential Revision: D4473535

Pulled By: maysamyabandeh

fbshipit-source-id: bffb87e
2017-02-06 16:39:12 -08:00
Dmitri Smirnov 0a4cdde50a Windows thread
Summary:
introduce new methods into a public threadpool interface,
- allow submission of std::functions as they allow greater flexibility.
- add Joining methods to the implementation to join scheduled and submitted jobs with
  an option to cancel jobs that did not start executing.
- Remove ugly `#ifdefs` between pthread and std implementation, make it uniform.
- introduce pimpl for a drop in replacement of the implementation
- Introduce rocksdb::port::Thread typedef which is a replacement for std::thread.  On Posix Thread defaults as before std::thread.
- Implement WindowsThread that allocates memory in a more controllable manner than windows std::thread with a replaceable implementation.
- should be no functionality changes.
Closes https://github.com/facebook/rocksdb/pull/1823

Differential Revision: D4492902

Pulled By: siying

fbshipit-source-id: c74cb11
2017-02-06 14:54:18 -08:00
Vitaliy Liptchinsky 1aaa898cf1 Adding GetApproximateMemTableStats method
Summary:
Added method that returns approx num of entries as well as size for memtables.
Closes https://github.com/facebook/rocksdb/pull/1841

Differential Revision: D4511990

Pulled By: VitaliyLi

fbshipit-source-id: 9a4576e
2017-02-06 14:54:16 -08:00
Siying Dong 036d668b19 Fix wrong result in data race case related to Get()
Summary:
In theory, Get() can get a wrong result, if it races in a special with with flush. The bug can be reproduced in DBTest2.GetRaceFlush. Fix this bug by getting snapshot after referencing the super version.
Closes https://github.com/facebook/rocksdb/pull/1816

Differential Revision: D4475958

Pulled By: siying

fbshipit-source-id: bd9e67a
2017-02-03 11:39:15 -08:00
Islam AbdelRahman 574b543f80 Rename merger.h -> merging_iterator.h
Summary:
merger.h was always a confusing name for me, simply give the file a better name
Closes https://github.com/facebook/rocksdb/pull/1836

Differential Revision: D4505357

Pulled By: IslamAbdelRahman

fbshipit-source-id: 07b28d8
2017-02-02 16:54:19 -08:00
oranagra b96372dead improving the C wrapper
Summary:
- rocksdb_property_int (so that we don't have to parse strings)
- and rocksdb_set_options (to allow controlling options via strings)
- a few other missing options exposed
- a documentation comment fix
Closes https://github.com/facebook/rocksdb/pull/1793

Differential Revision: D4456569

Pulled By: yiwu-arbug

fbshipit-source-id: 9f1fac1
2017-01-27 17:39:16 -08:00
Siying Dong 04c4ec41d1 Change corruption_test to use 4 bits.
Summary:
In the patch which LRU cache was made use dynamic shard bits, I changed to 2 shard bits to make the test happy. Look like it is occasionally still unhappy. Change it to 4 shard bits.
Closes https://github.com/facebook/rocksdb/pull/1815

Differential Revision: D4475849

Pulled By: siying

fbshipit-source-id: 575ff00
2017-01-27 11:24:16 -08:00
Siying Dong 2d75cd40d3 NewLRUCache() to pick number of shard bits based on capacity if not given
Summary:
If the users use the NewLRUCache() without passing in the number of shard bits, instead of using hard-coded 6, we'll determine it based on capacity.
Closes https://github.com/facebook/rocksdb/pull/1584

Differential Revision: D4242517

Pulled By: siying

fbshipit-source-id: 86b0f18
2017-01-27 06:39:12 -08:00
Siying Dong f25f1ec60b Add test DBTest2.GetRaceFlush which can expose a data race bug
Summary:
A current data race issue in Get() and Flush() can cause a Get() to return wrong results when a flush happened in the middle. Disable the test for now.
Closes https://github.com/facebook/rocksdb/pull/1813

Differential Revision: D4472310

Pulled By: siying

fbshipit-source-id: 5755ebd
2017-01-26 16:39:14 -08:00
sdong 5dad9d6d28 Avoid logs_ operation out of DB mutex
Summary:
logs_.back() is called out of DB mutex, which can cause data race. We move the access into the DB mutex protection area.
Closes https://github.com/facebook/rocksdb/pull/1774

Reviewed By: AsyncDBConnMarkedDownDBException

Differential Revision: D4417472

Pulled By: AsyncDBConnMarkedDownDBException

fbshipit-source-id: 2da1f1e
2017-01-25 15:54:13 -08:00
Islam AbdelRahman a7b13919bf Fix CompactFiles() bug when used with CompactionFilter using SuperVersion
Summary:
GetAndRefSuperVersion() should not be called again in the same thread before ReturnAndCleanupSuperVersion() is called.

If we have a compaction filter that is using DB::Get, This will happen
```
CompactFiles() {
  GetAndRefSuperVersion() // -- first call
    ..
    CompactionFilter() {
      GetAndRefSuperVersion() // -- second call
      ReturnAndCleanupSuperVersion()
    }
    ..
  ReturnAndCleanupSuperVersion()
}
```

We solve this issue in the same way Iterator is solving it, but using GetReferencedSuperVersion()

This was discovered in https://github.com/facebook/mysql-5.6/issues/427 by alxyang
Closes https://github.com/facebook/rocksdb/pull/1803

Differential Revision: D4460155

Pulled By: IslamAbdelRahman

fbshipit-source-id: 5e54322
2017-01-25 14:09:13 -08:00
Andrew Kryczka 616a1464ea Fix DeleteRange including sentinels in output files
Summary:
when writing RangeDelAggregator::AddToBuilder, I forgot that there are sentinel tombstones in the middle of the interval map since gaps between real tombstones are represented with sentinels.

blame: #1614
Closes https://github.com/facebook/rocksdb/pull/1804

Differential Revision: D4460426

Pulled By: ajkr

fbshipit-source-id: 69444b5
2017-01-25 11:09:12 -08:00
Islam AbdelRahman 03ca2ac8a9 Remove function from DBImpl that are not used anywhere
Summary:
GetAndRefSuperVersionUnlocked
ReturnAndCleanupSuperVersionUnlocked
GetColumnFamilyHandleUnlocked

Are dead code that are not used any where
Closes https://github.com/facebook/rocksdb/pull/1802

Differential Revision: D4459948

Pulled By: IslamAbdelRahman

fbshipit-source-id: 30fa89d
2017-01-24 19:24:13 -08:00
Andrew Kryczka b0029bc7fa Test merge op covered by range deletion in memtable
Summary:
It's a test case for #1797. Also got rid of kTypeDeletion in the conditional since we treat it the same as kTypeRangeDeletion.
Closes https://github.com/facebook/rocksdb/pull/1800

Differential Revision: D4451300

Pulled By: ajkr

fbshipit-source-id: b39dda1
2017-01-24 13:39:11 -08:00
Andrew Kryczka d438e1ec17 Test range deletion block outlives table reader
Summary:
This test ensures RangeDelAggregator can still access blocks even if it outlives the table readers that created them (detailed description in comments).

I plan to optimize away the extra cache lookup we currently do in BlockBasedTable::NewRangeTombstoneIterator(), as it is ~5% CPU in my random read benchmark in a database with 1k tombstones. This test will help make sure nothing breaks in the process.
Closes https://github.com/facebook/rocksdb/pull/1739

Differential Revision: D4375954

Pulled By: ajkr

fbshipit-source-id: aef9357
2017-01-24 13:24:14 -08:00
Andrew Kryczka 9da4d542fe Range deletions unsupported in tailing iterator
Summary:
change the iterator status to NotSupported as soon as a range tombstone
is encountered by a ForwardIterator.
Closes https://github.com/facebook/rocksdb/pull/1593

Differential Revision: D4246294

Pulled By: ajkr

fbshipit-source-id: aef9f49
2017-01-23 13:39:12 -08:00
Hyeonseok Oh f2b4939da4 fixed typo
Summary:
I fixed exisit -> exist
Closes https://github.com/facebook/rocksdb/pull/1799

Differential Revision: D4451466

Pulled By: yiwu-arbug

fbshipit-source-id: b447c3a
2017-01-23 12:54:13 -08:00
yinqiwen 973f1b78fd memtable: delete merge value for range deleteion
Summary: Closes https://github.com/facebook/rocksdb/pull/1797

Differential Revision: D4448004

Pulled By: ajkr

fbshipit-source-id: 3ffc27c
2017-01-23 12:24:14 -08:00
Vitaliy Liptchinsky 753ff84a3d Fix get approx size
Summary:
Fixing GetApproximateSize bug for the case of computing stats for mem tables only.
Closes https://github.com/facebook/rocksdb/pull/1795

Differential Revision: D4445507

Pulled By: IslamAbdelRahman

fbshipit-source-id: 3905846
2017-01-20 15:54:12 -08:00
Jay Lee 537da370da c: allow set savepoint to writebatch
Summary:
Allow set SavePoint to WriteBatch in C ABI.
Closes https://github.com/facebook/rocksdb/pull/1698

Differential Revision: D4378556

Pulled By: yiwu-arbug

fbshipit-source-id: afca746
2017-01-20 13:24:13 -08:00
Changli Gao 5ac97314e7 Fix std::out_of_range when DBOptions::keep_log_file_num is zero
Summary:
We should validate this option, otherwise we may see
std::out_of_range thrown at: db/db_impl.cc:1124

1123     for (unsigned int i = 0; i <= end; i++) {
1124       std::string& to_delete = old_info_log_files.at(i);
1125       std::string full_path_to_delete =
1126           (immutable_db_options_.db_log_dir.empty()
Closes https://github.com/facebook/rocksdb/pull/1722

Differential Revision: D4379495

Pulled By: yiwu-arbug

fbshipit-source-id: e136552
2017-01-20 13:24:12 -08:00
Shu Zhang 3c0852d1da Make ingest external file backward compatible
Summary: Closes https://github.com/facebook/rocksdb/pull/1783

Differential Revision: D4443463

Pulled By: IslamAbdelRahman

fbshipit-source-id: 39d21d6
2017-01-20 12:09:19 -08:00
Siying Dong 0e8dfd6062 Fix OptimizeForPointLookup()
Summary:
If users directly call OptimizeForPointLookup(), it is broken as the option isn't compatible with parallel memtable insert. Fix it by using memtable bloomo filter instead.
Closes https://github.com/facebook/rocksdb/pull/1791

Differential Revision: D4442836

Pulled By: siying

fbshipit-source-id: bf6c9cd
2017-01-20 10:54:12 -08:00
Vitaliy Liptchinsky e840213d6e Change DB::GetApproximateSizes for more flexibility needed for MyRocks
Summary:
Added an option to GetApproximateSizes to exclude file stats, as MyRocks has those counted exactly and we need only stats from memtables.
Closes https://github.com/facebook/rocksdb/pull/1787

Differential Revision: D4441111

Pulled By: IslamAbdelRahman

fbshipit-source-id: c11f4c3
2017-01-20 09:39:11 -08:00
Yi Wu 9239103cd4 Flush job should release reference current version if sync log failed
Summary:
Fix the bug when sync log fail, FlushJob::Run() will not be execute and
reference to cfd->current() will not be release.
Closes https://github.com/facebook/rocksdb/pull/1792

Differential Revision: D4441316

Pulled By: yiwu-arbug

fbshipit-source-id: 5523e28
2017-01-19 23:09:15 -08:00
Islam AbdelRahman da54d36a96 Disable IngestExternalFile in ReadOnly mode
Summary:
Disable IngestExternalFile() in read only mode
Closes https://github.com/facebook/rocksdb/pull/1781

Differential Revision: D4439179

Pulled By: IslamAbdelRahman

fbshipit-source-id: b7e46e7
2017-01-19 15:54:19 -08:00
Reid Horuff 5cf176ca15 Fix for 2PC causing WAL to grow too large
Summary:
Consider the following single column family scenario:
prepare in log A
commit in log B
*WAL is too large, flush all CFs to releast log A*
*CFA is on log B so we do not see CFA is depending on log A so no flush is requested*

To fix this we must also consider the log containing the prepare section when determining what log a CF is dependent on.
Closes https://github.com/facebook/rocksdb/pull/1768

Differential Revision: D4403265

Pulled By: reidHoruff

fbshipit-source-id: ce800ff
2017-01-19 15:39:12 -08:00
Andrew Kryczka f9d18e22d2 Fix DeleteRange file boundary correctness issue with max_compaction_bytes
Summary:
Cockroachdb exposed this bug in #1778. The bug happens when a compaction's output files are ended due to exceeding max_compaction_bytes. In that case we weren't taking into account the next file's start key when deciding how far to extend the current file's max_key. This caused the non-overlapping key-range invariant to be violated.

Note this was correctly handled for the usual case of cutting compaction output, which is file size exceeding max_output_file_size. I am not sure why these are two separate code paths, but we can consider refactoring it to prevent such errors in the future.
Closes https://github.com/facebook/rocksdb/pull/1784

Differential Revision: D4430235

Pulled By: ajkr

fbshipit-source-id: 80af748
2017-01-18 11:54:22 -08:00
Islam AbdelRahman 3ce091fd73 Add KEEP_DB env var option
Summary:
When debugging tests, it's useful to preserve the DB to investigate it and check the logs
This will allow us to set KEEP_DB=1 to preserve the DB
Closes https://github.com/facebook/rocksdb/pull/1759

Differential Revision: D4393826

Pulled By: IslamAbdelRahman

fbshipit-source-id: 1bff689
2017-01-17 13:54:20 -08:00
Siying Dong 77b4806625 Fix 2PC with concurrent memtable insert
Summary:
If concurrent memtable insert is enabled, and one prepare command and a normal command are grouped into a commit group, the sequence ID will be calculated incorrectly.
Closes https://github.com/facebook/rocksdb/pull/1730

Differential Revision: D4371081

Pulled By: siying

fbshipit-source-id: cd40c6d
2017-01-17 11:24:28 -08:00
Mike Kolupaev d18dd2c41f Abort compactions more reliably when closing DB
Summary:
DB shutdown aborts running compactions by setting an atomic shutting_down=true that CompactionJob periodically checks. Without this PR it checks it before processing every _output_ value. If compaction filter filters everything out, the compaction is uninterruptible. This PR adds checks for shutting_down on every _input_ value (in CompactionIterator and MergeHelper).

There's also some minor code cleanup along the way.
Closes https://github.com/facebook/rocksdb/pull/1639

Differential Revision: D4306571

Pulled By: yiwu-arbug

fbshipit-source-id: f050890
2017-01-11 15:09:21 -08:00
Changli Gao 9f246298e2 Performance: Iterate vector by reference
Summary: Closes https://github.com/facebook/rocksdb/pull/1763

Differential Revision: D4398796

Pulled By: yiwu-arbug

fbshipit-source-id: b82636d
2017-01-11 10:54:37 -08:00
Dmitri Smirnov 3c233ca4ea Fix Windows environment issues
Summary:
Enable directIO on WritableFileImpl::Append
     with offset being current length of the file.
     Enable UniqueID tests on Windows, disable others but
     leeting them to compile. Unique tests are valuable to
     detect failures on different filesystems and upcoming
     ReFS.
     Clear output in WinEnv Getchildren.This is different from
     previous strategy, do not touch output on failure.
     Make sure DBTest.OpenWhenOpen works with windows error message
Closes https://github.com/facebook/rocksdb/pull/1746

Differential Revision: D4385681

Pulled By: IslamAbdelRahman

fbshipit-source-id: c07b702
2017-01-09 15:54:12 -08:00
Maysam Yabandeh d0ba8ec8f9 Revert "PinnableSlice"
Summary:
This reverts commit 54d94e9c2c.

The pull request was landed by mistake.
Closes https://github.com/facebook/rocksdb/pull/1755

Differential Revision: D4391678

Pulled By: maysamyabandeh

fbshipit-source-id: 36d5149
2017-01-08 14:24:12 -08:00
Maysam Yabandeh 54d94e9c2c PinnableSlice
Summary:
Currently the point lookup values are copied to a string provided by the user.
This incures an extra memcpy cost. This patch allows doing point lookup
via a PinnableSlice which pins the source memory location (instead of
copying their content) and releases them after the content is consumed
by the user. The old API of Get(string) is translated to the new API
underneath.

 Here is the summary for improvements:
 1. value 100 byte: 1.8%  regular, 1.2% merge values
 2. value 1k   byte: 11.5% regular, 7.5% merge values
 3. value 10k byte: 26% regular,    29.9% merge values

 The improvement for merge could be more if we extend this approach to
 pin the merge output and delay the full merge operation until the user
 actually needs it. We have put that for future work.

PS:
Sometimes we observe a small decrease in performance when switching from
t5452014 to this patch but with the old Get(string) API. The difference
is a little and could be noise. More importantly it is safely
cancelled
Closes https://github.com/facebook/rocksdb/pull/1732

Differential Revision: D4374613

Pulled By: maysamyabandeh

fbshipit-source-id: a077f1a
2017-01-08 13:54:13 -08:00
Andrew Kryczka b104b87814 Maintain position in range deletions map
Summary:
When deletion-collapsing mode is enabled (i.e., for DBIter/CompactionIterator), we maintain position in the tombstone maps across calls to ShouldDelete(). Since iterators often access keys sequentially (or reverse-sequentially), scanning forward/backward from the last position can be faster than binary-searching the map for every key.

- When Next() is invoked on an iterator, we use kForwardTraversal to scan forwards, if needed, until arriving at the range deletion containing the next key.
- Similarly for Prev(), we use kBackwardTraversal to scan backwards in the range deletion map.
- When the iterator seeks, we use kBinarySearch for repositioning
- After tombstones are added or before the first ShouldDelete() invocation, the current position is set to invalid, which forces kBinarySearch to be used.
- Non-iterator users (i.e., Get()) use kFullScan, which has the same behavior as before---scan the whole map for every key passed to ShouldDelete().
Closes https://github.com/facebook/rocksdb/pull/1701

Differential Revision: D4350318

Pulled By: ajkr

fbshipit-source-id: 5129b76
2017-01-05 10:39:12 -08:00
siddontang 653ac1f9c6 C API: support total_order_mode
Summary: Closes https://github.com/facebook/rocksdb/pull/1687

Differential Revision: D4349210

Pulled By: IslamAbdelRahman

fbshipit-source-id: 32d0fbd
2017-01-03 18:39:14 -08:00
Adam Retter 85ac1a320a Fix rocksdb::Status::getState
Summary:
This fixes the Java API for Status#getState use in Native code and also simplifies the implementation of rocksdb::Status::getState.
Closes https://github.com/facebook/rocksdb/issues/1688
Closes https://github.com/facebook/rocksdb/pull/1714

Differential Revision: D4364181

Pulled By: yiwu-arbug

fbshipit-source-id: 8e073b4
2017-01-03 18:39:14 -08:00
Islam AbdelRahman 76711b6e77 Make ExternalSSTFileTest::CompactionDeadlock more deterministic
Summary:
It's not always true that `ASSERT_EQ(running_threads.load(), 2);`
Closes https://github.com/facebook/rocksdb/pull/1736

Differential Revision: D4374091

Pulled By: IslamAbdelRahman

fbshipit-source-id: 4f70bbd
2017-01-03 18:09:20 -08:00
Islam AbdelRahman c963460dbc Fix tests under GCC_481
Summary:
This fix the issue with tests failing under GCC 481, I am not sure what is the exact reason
Closes https://github.com/facebook/rocksdb/pull/1735

Differential Revision: D4374094

Pulled By: IslamAbdelRahman

fbshipit-source-id: b3625bc
2017-01-03 17:54:12 -08:00
Vincent Lee e425ec1162 utilities/backupable: backup should limit the copy size of wal.
Summary:
Since the backup work as snapshot, we should only copy
 the bytes of the wal while we get the alive files.
Closes https://github.com/facebook/rocksdb/pull/1733

Differential Revision: D4373457

Pulled By: ajkr

fbshipit-source-id: 389318f
2016-12-31 10:54:20 -08:00
Maysam Yabandeh 0712d541d1 Delegate Cleanables
Summary:
Cleanable objects will perform the registered cleanups when
they are destructed. We however rather to delay this cleaning like when
we are gathering the merge operands. Current approach is to create the
Cleanable object on heap (instead of on stack) and delay deleting it.

By allowing Cleanables to delegate their cleanups to another cleanable
object we can delay the cleaning without however the need to craete the
cleanable object on heap and keeping it around. This patch applies this
technique for the cleanups of BlockIter and shows improved performance
for some in-memory benchmarks:
+1.8% for merge worklaod, +6.4% for non-merge workload when the merge
operator is specified.
https://our.intern.facebook.com/intern/tasks?t=15168163

Non-merge benchmark:
TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks=fillrandom
--num=1000000 -value_size=100 -compression_type=none

Reading random with no merge operator specified:
TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench
--benchmarks="read
Closes https://github.com/facebook/rocksdb/pull/1711

Differential Revision: D4361163

Pulled By: maysamyabandeh

fbshipit-source-id: 9801e07
2016-12-29 15:54:19 -08:00
Islam AbdelRahman d58ef52ba6 Allow SstFileWriter to Fadvise the file away from page cache
Summary:
Add `fadvise_trigger` option to `SstFileWriter`

If fadvise_trigger is passed with a non-zero value, SstFileWriter will invalidate the os page cache every `fadvise_trigger` bytes for the sst file
Closes https://github.com/facebook/rocksdb/pull/1731

Differential Revision: D4371246

Pulled By: IslamAbdelRahman

fbshipit-source-id: 91caff1
2016-12-29 15:09:19 -08:00
Siying Dong 17a4b75cc3 Always fsync the file after file copying
Summary:
File copying happens when creating checkpoints and bulkloading files from different FS partition. We should fsync the files when copying them to guarantee durability. A side effect will be that the dirty pages in file system buffers won't grow too large.
Closes https://github.com/facebook/rocksdb/pull/1728

Differential Revision: D4371083

Pulled By: siying

fbshipit-source-id: 579e14c
2016-12-28 19:09:16 -08:00
leipeng a738af8f84 db/pinned_iterators_manager.h: bugfix
Summary:
std::unique(beg, end) returns an iterator of unique_end, data behind unique_end should not be accessed.
Closes https://github.com/facebook/rocksdb/pull/1726

Differential Revision: D4371076

Pulled By: IslamAbdelRahman

fbshipit-source-id: 5564450
2016-12-28 18:54:57 -08:00
Siying Dong 438f22bc56 Fix bug of Checkpoint loses recent transactions with 2PC
Summary:
If 2PC is enabled, checkpoint may not copy previous log files that contain uncommitted prepare records. In this diff we keep those files.
Closes https://github.com/facebook/rocksdb/pull/1724

Differential Revision: D4368319

Pulled By: siying

fbshipit-source-id: cc2c746
2016-12-28 12:24:16 -08:00
Aaron Gao 972f96b3fb direct io write support
Summary:
rocksdb direct io support

```
[gzh@dev11575.prn2 ~/rocksdb] ./db_bench -benchmarks=fillseq --num=1000000
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 5.0
Date:       Wed Nov 23 13:17:43 2016
CPU:        40 * Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
CPUCache:   25600 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Compression: Snappy
Memtablerep: skip_list
Perf Level: 1
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
DB path: [/tmp/rocksdbtest-112628/dbbench]
fillseq      :       4.393 micros/op 227639 ops/sec;   25.2 MB/s

[gzh@dev11575.prn2 ~/roc
Closes https://github.com/facebook/rocksdb/pull/1564

Differential Revision: D4241093

Pulled By: lightmark

fbshipit-source-id: 98c29e3
2016-12-22 13:09:19 -08:00
Islam AbdelRahman 989e644ed8 Remove sst_file_manager option from LITE
Summary:
Remove sst_file_manager option from LITE
Closes https://github.com/facebook/rocksdb/pull/1690

Differential Revision: D4341331

Pulled By: IslamAbdelRahman

fbshipit-source-id: 9f9328d
2016-12-21 17:54:21 -08:00
Islam AbdelRahman 1beef6569a Fix c_test
Summary:
addfile phase in c_test could fail because in previous steps we did a DeleteRange.
Fix the test by simply moving the addfile phase before DeleteRange
Closes https://github.com/facebook/rocksdb/pull/1672

Differential Revision: D4328896

Pulled By: IslamAbdelRahman

fbshipit-source-id: 1d946df
2016-12-21 17:39:14 -08:00
Andrew Kryczka 50e305de98 Collapse range deletions
Summary:
Added a tombstone-collapsing mode to RangeDelAggregator, which eliminates overlap in the TombstoneMap. In this mode, we can check whether a tombstone covers a user key using upper_bound() (i.e., binary search). However, the tradeoff is the overhead to add tombstones is now higher, so at first I've only enabled it for range scans (compaction/flush/user iterators), where we expect a high number of calls to ShouldDelete() for the same tombstones. Point queries like Get() will still use the linear scan approach.

Also in this diff I changed RangeDelAggregator's TombstoneMap to use multimap with user keys instead of map with internal keys. Callers sometimes provided ParsedInternalKey directly, from which it would've required string copying to derive an internal key Slice with which we could search the map.
Closes https://github.com/facebook/rocksdb/pull/1614

Differential Revision: D4270397

Pulled By: ajkr

fbshipit-source-id: 93092c7
2016-12-19 16:54:12 -08:00
Yi Wu 5d1457dbbf Dump persistent cache options
Summary:
Dump persistent cache options
Closes https://github.com/facebook/rocksdb/pull/1679

Differential Revision: D4337019

Pulled By: yiwu-arbug

fbshipit-source-id: 3812f8a
2016-12-19 14:09:12 -08:00
Daniel Black 342370f1d3 Simplify MemTable::Update
Summary:
As suggested by testn in #1650

The Add is at the end of the function. Having a fallthough
will result in it being added twice.
Closes https://github.com/facebook/rocksdb/pull/1676

Differential Revision: D4331906

Pulled By: yiwu-arbug

fbshipit-source-id: 895c4a0
2016-12-17 00:09:13 -08:00
Ding Ma 1a136c1f13 Expose file size
Summary:
add a new function to SstFileWriter that will tell the user how big is there file right now.
Closes https://github.com/facebook/rocksdb/pull/1686

Differential Revision: D4338868

Pulled By: mdyuki1016

fbshipit-source-id: c1ee16a
2016-12-16 18:39:12 -08:00
Andrew Kryczka fbff4628a9 Reduce compaction iterator status checks
Summary:
seems it's expensive to check status since the underlying merge iterator checks status of all its children. so only do it when it's really necessary to get the status before invoking Next(), i.e., when we're advancing to get the first key in the next file.
Closes https://github.com/facebook/rocksdb/pull/1691

Differential Revision: D4343446

Pulled By: siying

fbshipit-source-id: 70ab315
2016-12-16 17:39:09 -08:00
Daniel Black 816c1e30ca gcc-7 requires include <functional> for std::function
Summary:
Fixes compile error:

In file included from ./util/statistics.h:17:0,
                 from ./util/stop_watch.h:8,
                 from ./util/perf_step_timer.h:9,
                 from ./util/iostats_context_imp.h:8,
                 from ./util/posix_logger.h:27,
                 from ./port/util_logger.h:18,
                 from ./db/auto_roll_logger.h:15,
                 from db/auto_roll_logger.cc:6:
./util/thread_local.h:65:16: error: 'function' in namespace 'std' does not name a template type
   typedef std::function<void(void*, void*)> FoldFunc;
Closes https://github.com/facebook/rocksdb/pull/1656

Differential Revision: D4318702

Pulled By: yiwu-arbug

fbshipit-source-id: 8c5d17a
2016-12-16 11:24:18 -08:00
Yi Wu c270735861 Iterator should be in corrupted status if merge operator return false
Summary:
Iterator should be in corrupted status if merge operator return false.
Also add test to make sure if max_successive_merges is hit during write,
data will not be lost.
Closes https://github.com/facebook/rocksdb/pull/1665

Differential Revision: D4322695

Pulled By: yiwu-arbug

fbshipit-source-id: b327b05
2016-12-16 11:09:16 -08:00
siddontang 8f5d24ae68 C API: support get usage and pinned_usage for cache
Summary: Closes https://github.com/facebook/rocksdb/pull/1671

Differential Revision: D4327453

Pulled By: yiwu-arbug

fbshipit-source-id: bcdbc65
2016-12-15 17:24:17 -08:00
Daniel Black cfc34d7c4e Missing break in case in DBTestBase::CurrentOptions
Summary:
Found by gcc-7 compile error.

This appeared to be a fault as these options seems too different.
Closes https://github.com/facebook/rocksdb/pull/1667

Differential Revision: D4324174

Pulled By: yiwu-arbug

fbshipit-source-id: 0f65383
2016-12-13 18:39:14 -08:00
Daniel Black bfbcec2339 Gcc 7 error expansion to defined
Summary:
sorry if these gcc-7/clang-4 cleanups are getting tedious.
Closes https://github.com/facebook/rocksdb/pull/1658

Differential Revision: D4318792

Pulled By: yiwu-arbug

fbshipit-source-id: 8e85891
2016-12-13 18:39:14 -08:00
Daniel Black 67adc937b6 intentional fallthough (prevents gcc-7/clang-4 error)
Summary:
db/memtable.cc: In member function 'void rocksdb::MemTable::Update(rocksdb::SequenceNumber, const rocksdb::Slice&, const rocksdb::Slice&)':
db/memtable.cc:736:11: error: this statement may fall through [-Werror=implicit-fallthrough=]
           }
           ^
db/memtable.cc:738:9: note: here
         default:
         ^~~~~~~
cc1plus: all warnings being treated as errors

closes #1650
Closes https://github.com/facebook/rocksdb/pull/1655

Differential Revision: D4318696

Pulled By: yiwu-arbug

fbshipit-source-id: 1a8981c
2016-12-13 14:39:17 -08:00
Islam AbdelRahman 1a146f89c7 break Flush wait for dropped CF
Summary:
In FlushJob we dont do the Flush if the CF is dropped
https://github.com/facebook/rocksdb/blob/master/db/flush_job.cc#L184-L188

but inside WaitForFlushMemTable we keep waiting forever even if the CF is dropped.
Closes https://github.com/facebook/rocksdb/pull/1664

Differential Revision: D4321032

Pulled By: IslamAbdelRahman

fbshipit-source-id: 6e2b25d
2016-12-13 14:09:12 -08:00
Yi Wu 36d42e65d0 Disable test to unblock travis build
Summary:
The two tests keep failing in travis. Disable them and will fix later.
Closes https://github.com/facebook/rocksdb/pull/1648

Differential Revision: D4316389

Pulled By: yiwu-arbug

fbshipit-source-id: 0a370e7
2016-12-13 11:54:14 -08:00
siddontang b57dd9262a C API: support writebatch delete range
Summary:
Seem that writebatch delete range can work now, so I add C API for later use.

Btw, can we use this feature in production now?
Closes https://github.com/facebook/rocksdb/pull/1647

Differential Revision: D4314534

Pulled By: ajkr

fbshipit-source-id: e835165
2016-12-13 11:24:18 -08:00
Islam AbdelRahman 2ba59b5a1e Disallow ingesting files into dropped CFs
Summary:
This PR update IngestExternalFile to return an error if we try to ingest a file into a dropped CF.

Right now if IngestExternalFile want to flush a memtable, and it's ingesting a file into a dropped CF, it will wait forever since flushing is not possible for the dropped CF
Closes https://github.com/facebook/rocksdb/pull/1657

Differential Revision: D4318657

Pulled By: IslamAbdelRahman

fbshipit-source-id: ed6ea2b
2016-12-13 00:54:14 -08:00
Jonathan Lee 2cabdb8f44 Increase buffer size
Summary:
When compiling with GCC>=7.0.0, "db/internal_stats.cc" fails to compile as the data being written to the buffer potentially exceeds its size.

This fix simply doubles the size of the buffer, thus accommodating the max possible data size.
Closes https://github.com/facebook/rocksdb/pull/1635

Differential Revision: D4302162

Pulled By: yiwu-arbug

fbshipit-source-id: c76ad59
2016-12-09 11:54:22 -08:00
Jonathan Lee 4a17b47bb5 Remove unnecessary header include
Summary:
Remove "util/testharness.h" from list of includes for "db/db_filesnapshot.cc", as it wasn't being used and thus caused an extraneous dependency on gtest.
Closes https://github.com/facebook/rocksdb/pull/1634

Differential Revision: D4302146

Pulled By: yiwu-arbug

fbshipit-source-id: e900c0b
2016-12-09 11:54:21 -08:00
Mike Kolupaev 8c2b921fdf Fixed a crash in debug build in flush_job.cc
Summary:
It was doing `&range_del_iters[0]` on an empty vector. Even though the resulting pointer is never dereferenced, it's still bad for two reasons:
* the practical reason: it crashes with `std::out_of_range` exception in our debug build,
* the "C++ standard lawyer" reason: it's undefined behavior because, in `std::vector` implementation, it probably "dereferences" a null pointer, which is invalid even though it doesn't actually read the pointed memory, just converts a pointer into a reference (and then flush_job.cc converts it back to pointer); nullptr references are undefined behavior.
Closes https://github.com/facebook/rocksdb/pull/1612

Differential Revision: D4265625

Pulled By: al13n321

fbshipit-source-id: db26fb9
2016-12-09 10:39:12 -08:00
Islam AbdelRahman 20ce081fae Fix issue where IngestExternalFile insert blocks in block cache with g_seqno=0
Summary:
When we Ingest an external file we open it to read some metadata and first/last key
during doing that we insert blocks into the block cache with global_seqno = 0

If we move the file (did not copy it) into the DB, we will use these blocks with the wrong seqno in the read path
Closes https://github.com/facebook/rocksdb/pull/1627

Differential Revision: D4293332

Pulled By: yiwu-arbug

fbshipit-source-id: 3ce5523
2016-12-08 13:39:18 -08:00
zhangjinpeng1987 45c7ce1377 CompactRangeOptions C API
Summary:
Add C API for CompactRangeOptions.
Closes https://github.com/facebook/rocksdb/pull/1596

Differential Revision: D4252339

Pulled By: yiwu-arbug

fbshipit-source-id: f768f93
2016-12-07 17:54:14 -08:00
Andrew Kryczka b821984d31 DeleteRange read path end-to-end tests
Summary: Closes https://github.com/facebook/rocksdb/pull/1592

Differential Revision: D4246260

Pulled By: ajkr

fbshipit-source-id: ce03fa2
2016-12-07 12:54:17 -08:00
Artemiy Kolesnikov 2f4fc539c6 Compaction::IsTrivialMove relaxing
Summary:
IsTrivialMove returns true if no input file overlaps with output_level+1 with more than max_compaction_bytes_ bytes.
Closes https://github.com/facebook/rocksdb/pull/1619

Differential Revision: D4278338

Pulled By: yiwu-arbug

fbshipit-source-id: 994c001
2016-12-07 11:54:11 -08:00
Islam AbdelRahman ed8fbdb560 Add EventListener::OnExternalFileIngested() event
Summary:
Add EventListener::OnExternalFileIngested() to allow user to subscribe to external file ingestion events
Closes https://github.com/facebook/rocksdb/pull/1623

Differential Revision: D4285844

Pulled By: IslamAbdelRahman

fbshipit-source-id: 0b95a88
2016-12-06 14:09:17 -08:00
Mike Kolupaev beb36d9c1e Fixed CompactionFilter::Decision::kRemoveAndSkipUntil
Summary:
Embarassingly enough, the first time I tried to use my new feature in logdevice it crashed with this assertion failure:

  db/pinned_iterators_manager.h:30: void rocksdb::PinnedIteratorsManager::StartPinning(): Assertion `pinning_enabled == false' failed

The issue was that `pinned_iters_mgr_.StartPinning()` was called but `pinned_iters_mgr_.ReleasePinnedData()` wasn't.
Closes https://github.com/facebook/rocksdb/pull/1611

Differential Revision: D4265622

Pulled By: al13n321

fbshipit-source-id: 747b10f
2016-12-05 15:24:11 -08:00
Islam AbdelRahman 67f37cf198 Allow user to specify a CF for SST files generated by SstFileWriter
Summary:
Allow user to explicitly specify that the generated file by SstFileWriter will be ingested in a specific CF.
This allow us to persist the CF id in the generated file
Closes https://github.com/facebook/rocksdb/pull/1615

Differential Revision: D4270422

Pulled By: IslamAbdelRahman

fbshipit-source-id: 7fb954e
2016-12-05 14:24:16 -08:00
Anton Safonov 9053fe2a5c Made delete_obsolete_files_period_micros option dynamic
Summary:
Made delete_obsolete_files_period_micros option dynamic. It can be updating using DB::SetDBOptions().
Closes https://github.com/facebook/rocksdb/pull/1595

Differential Revision: D4246569

Pulled By: tonek

fbshipit-source-id: d23f560
2016-12-05 14:24:16 -08:00
Islam AbdelRahman edde954e7b fix clang build
Summary:
override is missing for FilterV2
Closes https://github.com/facebook/rocksdb/pull/1606

Differential Revision: D4263832

Pulled By: IslamAbdelRahman

fbshipit-source-id: d8b337a
2016-12-01 18:39:10 -08:00
Islam AbdelRahman e39d080871 Fix travis (compile for clang < 3.9)
Summary:
Travis fail because it uses clang 3.6 which don't recognize
`__attribute__((__no_sanitize__("undefined")))`
Closes https://github.com/facebook/rocksdb/pull/1601

Differential Revision: D4257175

Pulled By: IslamAbdelRahman

fbshipit-source-id: fb4d1ab
2016-12-01 10:09:22 -08:00
fangchenliaohui b77007df8b Bug: paralle_group status updated in WriteThread::CompleteParallelWorker
Summary:
Multi-write thread may update the status of the parallel_group in
WriteThread::CompleteParallelWorker if the status of Writer is not ok!
When copy write status to the paralle_group, the write thread just hold the
mutex of the the writer processed by itself. it is useless. The thread
should held the the leader of the parallel_group instead.
Closes https://github.com/facebook/rocksdb/pull/1598

Differential Revision: D4252335

Pulled By: siying

fbshipit-source-id: 3864cf7
2016-12-01 09:54:11 -08:00
Mike Kolupaev 247d0979aa Support for range skips in compaction filter
Summary:
This adds the ability for compaction filter to say "drop this key-value, and also drop everything up to key x". This will cause the compaction to seek input iterator to x, without reading the data. This can make compaction much faster when large consecutive chunks of data are filtered out. See the changes in include/rocksdb/compaction_filter.h for the new API.

Along the way this diff also adds ability for compaction filter changing merge operands, similar to how it can change values; we're not going to use this feature, it just seemed easier and cleaner to implement it than to document that it's not implemented :)

The diff is not as big as it may seem, about half of the lines are a test.
Closes https://github.com/facebook/rocksdb/pull/1599

Differential Revision: D4252092

Pulled By: al13n321

fbshipit-source-id: 41e1e48
2016-12-01 07:09:15 -08:00
Panagiotis Ktistakis 96fcefbf1d c api: expose option for dynamic level size target
Summary: Closes https://github.com/facebook/rocksdb/pull/1587

Differential Revision: D4245923

Pulled By: yiwu-arbug

fbshipit-source-id: 6ee7291
2016-11-30 11:24:14 -08:00
zhangjinpeng1987 00197cff39 Add C API to set base_backgroud_compactions
Summary:
Add C API to set base_backgroud_compactions
Closes https://github.com/facebook/rocksdb/pull/1571

Differential Revision: D4245709

Pulled By: yiwu-arbug

fbshipit-source-id: 792c6b8
2016-11-30 11:09:13 -08:00
Andrew Kryczka 5b219eccb5 deleterange end-to-end test improvements for lite/robustness
Summary: Closes https://github.com/facebook/rocksdb/pull/1591

Differential Revision: D4246019

Pulled By: ajkr

fbshipit-source-id: 0c4aa37
2016-11-29 12:24:13 -08:00
Andrew Kryczka e333528991 DeleteRange write path end-to-end tests
Summary: Closes https://github.com/facebook/rocksdb/pull/1578

Differential Revision: D4241171

Pulled By: ajkr

fbshipit-source-id: ce5fd83
2016-11-29 11:09:22 -08:00
Siying Dong 7784980fcd Fix mis-reporting of compaction read bytes to the base level
Summary:
In dynamic leveled compaction, when calculating read bytes, output level bytes may be wronglyl calculated as input level inputs. Fix it.
Closes https://github.com/facebook/rocksdb/pull/1475

Differential Revision: D4148412

Pulled By: siying

fbshipit-source-id: f2f475a
2016-11-29 11:09:22 -08:00
Islam AbdelRahman 3c6b49ed66 Fix implicit conversion between int64_t to int
Summary:
Make conversion explicit, implicit conversion breaks the build
Closes https://github.com/facebook/rocksdb/pull/1589

Differential Revision: D4245158

Pulled By: IslamAbdelRahman

fbshipit-source-id: aaec00d
2016-11-29 10:54:15 -08:00
Siying Dong b3b875657f Remove unused assignment in db/db_iter.cc
Summary:
"make analyze" complains the assignment is not useful. Remove it.
Closes https://github.com/facebook/rocksdb/pull/1581

Differential Revision: D4241697

Pulled By: siying

fbshipit-source-id: 178f67a
2016-11-29 09:09:14 -08:00
Andrew Kryczka 4f6e89b1d0 Fix range deletion covering key in same SST file
Summary:
AddTombstones() needs to be before t->Get(), oops :'(
Closes https://github.com/facebook/rocksdb/pull/1576

Differential Revision: D4241041

Pulled By: ajkr

fbshipit-source-id: 781ceea
2016-11-28 22:54:13 -08:00
Islam AbdelRahman a2bf265a39 Avoid intentional overflow in GetL0ThresholdSpeedupCompaction
Summary:
99c052a34f fixes integer overflow in GetL0ThresholdSpeedupCompaction() by checking if int become -ve.
UBSAN will complain about that since this is still an overflow, we can fix the issue by simply using int64_t
Closes https://github.com/facebook/rocksdb/pull/1582

Differential Revision: D4241525

Pulled By: IslamAbdelRahman

fbshipit-source-id: b3ae21f
2016-11-28 18:39:13 -08:00
Islam AbdelRahman 52fd1ff2c2 disable UBSAN for functions with intentional -ve shift / overflow
Summary:
disable UBSAN for functions with intentional left shift on -ve number / overflow

These functions are
rocksdb:: Hash
FixedLengthColBufEncoder::Append
FaultInjectionTest:: Key
Closes https://github.com/facebook/rocksdb/pull/1577

Differential Revision: D4240801

Pulled By: IslamAbdelRahman

fbshipit-source-id: 3e1caf6
2016-11-28 17:54:12 -08:00
Islam AbdelRahman 1886c435b9 Fix CompactionJob::Install division by zero
Summary:
Fix CompactionJob::Install division by zero
Closes https://github.com/facebook/rocksdb/pull/1580

Differential Revision: D4240794

Pulled By: IslamAbdelRahman

fbshipit-source-id: 7286721
2016-11-28 16:54:16 -08:00
Islam AbdelRahman 13e66a8f51 Fix compaction_job.cc division by zero
Summary:
Fix division by zero in compaction_job.cc
Closes https://github.com/facebook/rocksdb/pull/1575

Differential Revision: D4240818

Pulled By: IslamAbdelRahman

fbshipit-source-id: a8bc757
2016-11-28 16:39:13 -08:00
Andrew Kryczka 01eabf7375 Fix double-counted deletion stat
Summary:
Both the single deletion and the value are included in compaction outputs, so no need to update the stat for the value's deletion yet, otherwise it'd be double-counted.
Closes https://github.com/facebook/rocksdb/pull/1574

Differential Revision: D4241181

Pulled By: ajkr

fbshipit-source-id: c9aaa15
2016-11-28 15:54:12 -08:00
Andrew Kryczka 7ffb10fc1a DeleteRange compaction statistics
Summary:
- "rocksdb.compaction.key.drop.range_del" - number of keys dropped during compaction due to a range tombstone covering them
- "rocksdb.compaction.range_del.drop.obsolete" - number of range tombstones dropped due to compaction to bottom level and no snapshot saving them
- s/CompactionIteratorStats/CompactionIterationStats/g since this class is no longer specific to CompactionIterator -- it's also updated for range tombstone iteration during compaction
- Move the above class into a separate .h file to avoid circular dependency.
Closes https://github.com/facebook/rocksdb/pull/1520

Differential Revision: D4187179

Pulled By: ajkr

fbshipit-source-id: 10c2103
2016-11-28 11:54:12 -08:00
Mike Kolupaev 236d4c67e9 Less linear search in DBIter::Seek() when keys are overwritten a lot
Summary:
In one deployment we saw high latencies (presumably from slow iterator operations) and a lot of CPU time reported by perf with this stack:

```
  rocksdb::MergingIterator::Next
  rocksdb::DBIter::FindNextUserEntryInternal
  rocksdb::DBIter::Seek
```

I think what's happening is:
1. we create a snapshot iterator,
2. we do lots of Put()s for the same key x; this creates lots of entries in memtable,
3. we seek the iterator to a key slightly smaller than x,
4. the seek walks over lots of entries in memtable for key x, skipping them because of high sequence numbers.

CC IslamAbdelRahman
Closes https://github.com/facebook/rocksdb/pull/1413

Differential Revision: D4083879

Pulled By: IslamAbdelRahman

fbshipit-source-id: a83ddae
2016-11-28 10:24:11 -08:00
Siying Dong cd7c4143d7 Improve Write Stalling System
Summary:
Current write stalling system has the problem of lacking of positive feedback if the restricted rate is already too low. Users sometimes stack in very low slowdown value. With the diff, we add a positive feedback (increasing the slowdown value) if we recover from slowdown state back to normal. To avoid the positive feedback to keep the slowdown value to be to high, we add issue a negative feedback every time we are close to the stop condition. Experiments show it is easier to reach a relative balance than before.

Also increase level0_stop_writes_trigger default from 24 to 32. Since level0_slowdown_writes_trigger default is 20, stop trigger 24 only gives four files as the buffer time to slowdown writes. In order to avoid stop in four files while 20 files have been accumulated, the slowdown value must be very low, which is amost the same as stop. It also doesn't give enough time for the slowdown value to converge. Increase it to 32 will smooth out the system.
Closes https://github.com/facebook/rocksdb/pull/1562

Differential Revision: D4218519

Pulled By: siying

fbshipit-source-id: 95e4088
2016-11-23 09:24:15 -08:00