2220 commits

Author SHA1 Message Date
sdong fcafac053f Fix memory leak in ColumnFamilyTest.WriteStall*
Summary: ColumnFamilyTest.WriteStallSingleColumnFamily and ColumnFamilyTest.WriteStallTwoColumnFamilies didn't clean up test state properly, causing a memory leak. Fix it.

Test Plan: Run the two tests in valgrind and make sure they now pass.

Reviewers: yhchiang, anthony, rven, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52347
2015-12-28 12:30:21 -08:00
sdong 11672df19a Fix CLANG errors introduced by 7d87f02799
Summary: Fix some CLANG errors introduced in 7d87f02799

Test Plan: Build with both CLANG and gcc

Reviewers: rven, yhchiang, kradhakrishnan, anthony, IslamAbdelRahman, ngbronson

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52329
2015-12-28 10:00:58 -08:00
Nathan Bronson 7d87f02799 support for concurrent adds to memtable
Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations.  Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention.  Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.

Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off).  This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex.  If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided.  This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).
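
A minimal sketch of the spin-then-block idea described above, not the code in this diff. The helper name `SpinThenGiveUp` and its signature are hypothetical, `std::this_thread::yield` stands in for `sched_yield`, and the adaptive heuristic that stops spinning when yields turn out to be slow is left out:

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper (not a RocksDB function): yield in a loop for at most
// `budget`, waiting for `ready` to become true.
// Returns true if `ready` was observed within the budget; false means the
// caller should fall back to sleeping on a mutex / condition variable.
bool SpinThenGiveUp(const std::atomic<bool>& ready,
                    std::chrono::microseconds budget) {
  assert(budget.count() >= 0);
  const auto deadline = std::chrono::steady_clock::now() + budget;
  do {
    if (ready.load(std::memory_order_acquire)) {
      return true;
    }
    std::this_thread::yield();  // give other runnable threads a chance
  } while (std::chrono::steady_clock::now() < deadline);
  return ready.load(std::memory_order_acquire);
}
```

A waiter would call this with the 100 usec default mentioned above and block only when it returns false.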

Parallel writes are not currently compatible with
inplace updates, update callbacks, or delete filtering.
Enable it with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield).  Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.

Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.

This diff was motivated and inspired by Yahoo's cLSM work.  It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.

My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2GHz with lots of memory and an SSD hard drive.  With 1
thread I get ~440Kops/sec.  Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads.  Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.

Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba

Differential Revision: https://reviews.facebook.net/D50589
2015-12-25 11:03:40 -08:00
sdong 5b2587b5cb DBTest.HardLimit use special memtable
Summary: DBTest.HardLimit fails in the AppVeyor build. Use a special memtable to make the test behavior depend less on the platform.

Test Plan: Run the test with JEMALLOC both on and off.

Reviewers: yhchiang, kradhakrishnan, rven, anthony, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52317
2015-12-25 10:25:34 -08:00
Siying Dong 298ba27ae2 Merge pull request #846 from yuslepukhin/enble_c4244_lossofdata
Enable MS compiler warning c4244.
2015-12-23 22:59:42 -08:00
Siying Dong 7810aa802a Merge pull request #899 from zhipeng-jia/fix_clang_warning
Fix clang warnings
2015-12-23 22:58:52 -08:00
Siying Dong 4c5560d70a Merge pull request #895 from zhipeng-jia/develop
Fix computation of size of last sub-compaction
2015-12-23 22:45:03 -08:00
sdong d43da8ae0d DBTest.DelayedWriteRate: fix assert of sign and unsign comparison
Summary: DBTest.DelayedWriteRate has signed/unsigned comparisons that break the Windows build. Fix it.

Test Plan: Build and run the test modified.

Reviewers: IslamAbdelRahman, rven, anthony, yhchiang, kradhakrishnan

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52311
2015-12-23 22:38:12 -08:00
sdong 15b8902264 Change default options.delayed_write_rate
Summary: We now have a mechanism to further slow down writes. Double the default options.delayed_write_rate to try to keep the default behavior closer to what it used to be.

Test Plan: Run all tests.

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: yhchiang, kradhakrishnan, rven, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52281
2015-12-23 14:51:55 -08:00
Zhipeng Jia 73b175a773 Fix clang warnings regarding unnecessary std::move 2015-12-24 04:10:00 +08:00
sdong b9f77ba12b When slowdown is triggered, reduce the write rate
Summary: It's usually hard for users to set a value of options.delayed_write_rate. With this diff, after slowdown condition triggers, we greedily reduce write rate if estimated pending compaction bytes increase. If estimated compaction pending bytes drop, we increase the write rate.
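
The feedback loop in the summary can be sketched roughly as follows. This is an illustration only: the function name, the 4/5 and 5/4 step factors, and the `kMinRate` floor are assumptions, not values taken from this diff.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the greedy adjustment: step size, floor, and names
// are assumptions for illustration, not RocksDB's actual constants.
uint64_t AdjustWriteRate(uint64_t current_rate, uint64_t max_rate,
                         uint64_t prev_pending_bytes, uint64_t pending_bytes) {
  assert(max_rate > 0);
  constexpr uint64_t kMinRate = 1024;  // assumed floor, in bytes per second
  if (pending_bytes > prev_pending_bytes) {
    // Compaction is falling further behind: slow writers down.
    return std::max(kMinRate, current_rate / 5 * 4);
  }
  if (pending_bytes < prev_pending_bytes) {
    // Compaction is catching up: speed writers up, capped at max_rate.
    return std::min(max_rate, current_rate / 4 * 5);
  }
  return current_rate;  // no change in the estimate, keep the rate
}
```

Here `max_rate` plays the role of the user-supplied options.delayed_write_rate, so the user only has to pick an upper bound rather than an exact value.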

Test Plan:
Add a unit test
Test with db_bench setting:
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -num=10000000 --soft_pending_compaction_bytes_limit=1000000000 --hard_pending_compaction_bytes_limit=3000000000 --delayed_write_rate=100000000

and make sure that without the commit a write stop happens, but with the commit it does not.

Reviewers: igor, anthony, rven, yhchiang, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52131
2015-12-23 11:33:15 -08:00
Andrew Kryczka 445d5b8c5c Fix clang build
Summary:
Missed this in https://reviews.facebook.net/D51633 because I didn't
wait for 'make commit-prereq' to finish

Test Plan: make clean && USE_CLANG=1 make -j32 all

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D52275
2015-12-23 10:49:42 -08:00
Andrew Kryczka e089db40f9 Skip bottom-level filter block caching when hit-optimized
Summary:
When Get() or NewIterator() trigger file loads, skip caching the filter block if
(1) optimize_filters_for_hits is set and (2) the file is on the bottommost
level. Also skip checking filters under the same conditions, which means that
for a preloaded file or a file that was trivially-moved to the bottom level, its
filter block will eventually expire from the cache.

- added parameters/instance variables in various places in order to propagate the config ("skip_filters") from version_set to block_based_table_reader
- in BlockBasedTable::Rep, this optimization prevents filter from being loaded when the file is opened simply by setting filter_policy = nullptr
- in BlockBasedTable::Get/BlockBasedTable::NewIterator, this optimization prevents filter from being used (even if it was loaded already) by setting filter = nullptr
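
The two conditions above reduce to a single predicate. A minimal sketch, assuming a hypothetical free function (the real change threads a "skip_filters" flag from version_set down to block_based_table_reader rather than calling a helper like this):

```cpp
#include <cassert>

// Hypothetical predicate, not a RocksDB function: decide whether the filter
// block should be neither cached nor consulted for a file at `level`.
bool ShouldSkipFilters(bool optimize_filters_for_hits, int level,
                       int bottommost_level) {
  assert(level >= 0 && level <= bottommost_level);
  // Skip only when both conditions hold: the option is set and the file
  // lives on the bottommost level.
  return optimize_filters_for_hits && level == bottommost_level;
}
```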

Test Plan:
updated unit test:

  $ ./db_test --gtest_filter=DBTest.OptimizeFiltersForHits

will also run 'make check'

Reviewers: sdong, igor, paultuckfield, anthony, rven, kradhakrishnan, IslamAbdelRahman, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D51633
2015-12-23 10:15:07 -08:00
Zhipeng Jia aa515823bc Fix clang warning 2015-12-23 19:23:58 +08:00
Islam AbdelRahman d005c66faf Report compaction reason in CompactionListener
Summary:
Add CompactionReason to CompactionJobInfo
This will allow users to understand why compaction started which will help options tuning

Test Plan:
added new tests
make check -j64

Reviewers: yhchiang, anthony, kradhakrishnan, sdong, rven

Reviewed By: rven

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D51975
2015-12-22 11:37:19 -08:00
Zhipeng Jia 728f944f0d Fix computation of size of last sub-compaction 2015-12-22 18:37:51 +08:00
Zhipeng Jia e0abec1580 Sorting std::vector instead of using std::set 2015-12-22 14:34:57 +08:00
Alex Yang 33e09c0e19 add call to install superversion and schedule work in enableautocompactions
Summary:
This patch fixes https://github.com/facebook/mysql-5.6/issues/121

There is a recent change in rocksdb to disable auto compactions on startup: https://reviews.facebook.net/D51147. However, there is a small timing window where a column family needs to be compacted and schedules a compaction, but the scheduled compaction fails when it checks the disable_auto_compactions setting. The expectation is that once the application is ready, it will call EnableAutoCompactions() to allow new compactions to go through. However, if the column family is stalled because L0 is full and no writes can go through, it is possible that the column family never gets a new compaction request scheduled. EnableAutoCompactions() should probably schedule a new flush and compaction event when it resets disable_auto_compactions.

Using InstallSuperVersionAndScheduleWork, we call SchedulePendingFlush,
SchedulePendingCompaction, as well as MaybeScheduleFlushOrCompaction on all the
column families to avoid the situation above.

This is still a first pass for feedback.
Could also just call SchedulePendingFlush and SchedulePendingCompaction directly.

Test Plan:
Run on Asan build
cd _build-5.6-ASan/ && ./mysql-test/mtr --mem --big --testcase-timeout=36000 --suite-timeout=12000 --parallel=16 --suite=rocksdb,rocksdb_rpl,rocksdb_sys_vars --mysqld=--default-storage-engine=rocksdb --mysqld=--skip-innodb --mysqld=--default-tmp-storage-engine=MyISAM --mysqld=--rocksdb rocksdb_rpl.rpl_rocksdb_stress_crash --repeat=1000

Ensure that it no longer hangs during the test.

Reviewers: hermanlee4, yhchiang, anthony

Reviewed By: anthony

Subscribers: leveldb, yhchiang, dhruba

Differential Revision: https://reviews.facebook.net/D51747
2015-12-21 10:06:49 -08:00
Zhipeng Jia 24c7dae130 Fix clang warning regarding implicit conversion 2015-12-21 23:57:55 +08:00
Reid Horuff 97ea8afaaf compaction assertion triggering test fix for sequence zeroing assertion trip 2015-12-18 16:08:31 -08:00
Nathan Bronson a48382399d Fix use-after free in db_bench
Test Plan: valgrind db_bench

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D52101
2015-12-18 06:42:57 -08:00
Zhipeng Jia 131f7ddf63 fix typo: sr to picking_sr 2015-12-18 17:02:36 +08:00
sdong c37729a6a6 db_bench: --soft_pending_compaction_bytes_limit should set options.soft_pending_compaction_bytes_limit
Summary: Fix a bug that options.soft_pending_compaction_bytes_limit is not actually set with --soft_pending_compaction_bytes_limit

Test Plan: Run db_bench with this parameter and make sure the parameter is set correctly.

Reviewers: anthony, kradhakrishnan, yhchiang, IslamAbdelRahman, igor, rven

Reviewed By: rven

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52125
2015-12-17 18:28:56 -08:00
Venkatesh Radhakrishnan 7b12ae97d4 Add signalall after removing item from manual_compaction deque
Summary:
When there are waiting manual compactions, we need to signal
them after removing the current manual compaction from the deque.

Test Plan: ColumnFamilyTest.SameCFManualManualCompaction

Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D52119
2015-12-17 16:59:00 -08:00
sdong d72b31774e Slowdown when writing to the last write buffer
Summary: Currently, if inserting into the memtable is much faster than writing to files, there is no mechanism users can rely on to avoid stopping when options.max_write_buffer_number is reached. With this commit, if there are more than four maximum write buffers configured, we slow down to the rate of options.delayed_write_rate when we reach the last one.

Test Plan:
1. Add a new unit test.
2. Run db_bench with

./db_bench --benchmarks=fillrandom --num=10000000 --max_background_flushes=6 --batch_size=32 -max_write_buffer_number=4 --delayed_write_rate=500000 --statistics

based on hard drive and see stopping is avoided with the commit.

Reviewers: yhchiang, IslamAbdelRahman, anthony, rven, kradhakrishnan, igor

Reviewed By: igor

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52047
2015-12-17 10:49:08 -08:00
Islam AbdelRahman aececc209e Introduce ReadOptions::pin_data (support zero copy for keys)
Summary:
This patch updates the Iterator API to introduce new functions that allow users to keep the Slices returned by key() valid as long as the Iterator is not deleted

ReadOptions::pin_data : If true, keep loaded blocks in memory as long as the iterator is not deleted
Iterator::IsKeyPinned() : If true, this means that the Slice returned by key() is valid as long as the iterator is not deleted

Also add a new option BlockBasedTableOptions::use_delta_encoding to allow users to disable delta_encoding if needed.

Benchmark results (using https://phabricator.fb.com/P20083553)

```
// $ du -h /home/tec/local/normal.4K.Snappy/db10077
// 6.1G    /home/tec/local/normal.4K.Snappy/db10077

// $ du -h /home/tec/local/zero.8K.LZ4/db10077
// 6.4G    /home/tec/local/zero.8K.LZ4/db10077

// Benchmarks for shard db10077
// _build/opt/rocks/benchmark/rocks_copy_benchmark \
//      --normal_db_path="/home/tec/local/normal.4K.Snappy/db10077" \
//      --zero_db_path="/home/tec/local/zero.8K.LZ4/db10077"

// First run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp          relative  time/iter  iters/s
// ============================================================================
// BM_StringCopy                                                 1.73s  576.97m
// BM_StringPiece                                   103.74%      1.67s  598.55m
// ============================================================================
// Match rate : 1000000 / 1000000

// Second run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp          relative  time/iter  iters/s
// ============================================================================
// BM_StringCopy                                              611.99ms     1.63
// BM_StringPiece                                   203.76%   300.35ms     3.33
// ============================================================================
// Match rate : 1000000 / 1000000
```

Test Plan: Unit tests

Reviewers: sdong, igor, anthony, yhchiang, rven

Reviewed By: rven

Subscribers: dhruba, lovro, adsharma

Differential Revision: https://reviews.facebook.net/D48999
2015-12-16 12:08:30 -08:00
Gunnar Kudrjavets 97265f5f14 Fix minor bugs in delete operator, snprintf, and size_t usage
Summary:
List of changes:

1) Fix the snprintf() usage in cases where wrong variable was used to determine the output buffer size.

2) Remove unnecessary checks before calling delete operator.

3) Increase code correctness by using size_t type when getting vector's size.

4) Unify the coding style by removing namespace::std usage at the top of the file to conform to the majority usage.

5) Fix various lint errors pointed out by 'arc lint'.

Test Plan:
Code review and build:

git diff
make clean
make -j 32 commit-prereq
arc lint

Reviewers: kradhakrishnan, sdong, rven, anthony, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51849
2015-12-15 15:26:20 -08:00
Zhipeng Jia 99ae549d37 Fix typo 2015-12-15 23:47:47 +08:00
Islam AbdelRahman 636cd3c714 Clean up listener_test (reuse db_test_util)
Summary: Reuse db_test_util in listener_test

Test Plan:
make listener_test -j64 && ./listener_test
USE_CLANG=1 make listener_test -j64 && ./listener_test

Reviewers: yhchiang, rven, kradhakrishnan, anthony

Reviewed By: anthony

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D51939
2015-12-14 13:36:32 -08:00
Venkatesh Radhakrishnan 030215bf01 Running manual compactions in parallel with other automatic or manual compactions in restricted cases
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.

Test Plan: Rocksdb regression + new tests + valgrind

Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong

Reviewed By: sdong

Subscribers: yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47973
2015-12-14 11:20:34 -08:00
Dmitri Smirnov aca403d2b5 Fix another rebase problems. 2015-12-11 17:33:40 -08:00
Dmitri Smirnov a6fbdd64e0 Fix rebase issues and new code warnings. 2015-12-11 16:56:24 -08:00
Dmitri Smirnov 3fa68af316 Enable MS compiler warning c4244.
Mostly due to the fact that there are differences in sizes of int,long
  on 64 bit systems vs GNU.
2015-12-11 16:52:41 -08:00
Dmitri Smirnov 236fe21c92 Enable MS compiler warning c4244.
Mostly due to the fact that there are differences in sizes of int,long
  on 64 bit systems vs GNU.
2015-12-11 16:47:34 -08:00
agiardullo 3bfd3d39a3 Use SST files for Transaction conflict detection
Summary:
Currently, transactions can fail even if there is no actual write conflict.  This is due to relying on only the memtables to check for write-conflicts.  Users have to tune memtable settings to try to avoid this, but it's hard to figure out exactly how to tune these settings.

With this diff, TransactionDB will use both memtables and SST files to determine if there are any write conflicts.  This relies on the fact that BlockBasedTable stores sequence numbers for all writes that happen after any open snapshot.  Also, D50295 is needed to prevent SingleDelete from disappearing writes (the TODOs in this test code will be fixed once the other diff is approved and merged).

Note that Optimistic transactions will still rely on tuning memtable settings as we do not want to read from SST while on the write thread.  Also, memtable settings can still be used to reduce how often TransactionDB needs to read SST files.
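
As a rough illustration of the lookup order described above (memtables first, SST files only when the memtables have no record of the key), here is a minimal sketch. Every name in it is hypothetical, and plain maps stand in for the memtables and SST files; it is not TransactionDB's implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Stand-in for "latest sequence number written for each key".
using SeqMap = std::map<std::string, uint64_t>;

// Hypothetical helper: look in the memtables first and fall back to the SST
// files only if the key is not found there. Returns false if the key is
// unknown to both.
bool LatestSeq(const SeqMap& memtables, const SeqMap& sst_files,
               const std::string& key, uint64_t* seq) {
  auto it = memtables.find(key);
  if (it != memtables.end()) {
    *seq = it->second;
    return true;
  }
  it = sst_files.find(key);
  if (it != sst_files.end()) {
    *seq = it->second;
    return true;
  }
  return false;
}

// A write conflicts if the key was written after the transaction's snapshot.
bool HasWriteConflict(const SeqMap& memtables, const SeqMap& sst_files,
                      const std::string& key, uint64_t snapshot_seq) {
  assert(snapshot_seq > 0);
  uint64_t latest = 0;
  return LatestSeq(memtables, sst_files, key, &latest) &&
         latest > snapshot_seq;
}
```

With a memtable-only check, a key that has already been flushed cannot be validated, which is the source of the spurious failures mentioned in the summary.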

Test Plan: unit tests, db bench

Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb, yoshinorim

Differential Revision: https://reviews.facebook.net/D50475
2015-12-11 12:34:11 -08:00
Yueh-Hsuan Chiang f0a8e5a2d8 Fixed the valgrind error in ColumnFamilyTest::CreateAndDropRace
Summary: Fixed the valgrind error in ColumnFamilyTest::CreateAndDropRace

Test Plan: valgrind --error-exitcode=2 --leak-check=full ./column_family_test

Reviewers: kradhakrishnan, rven, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51795
2015-12-10 11:53:53 -08:00
agiardullo 9e44629061 Change SingleDelete to support conflict checking
Summary: For Transactions, we want to start using the SST files to do write conflict checking.  To do this, we need to make sure that compaction never removes all writes if an earlier snapshot exists.  So I had to change the way we process SingleDeletes to sometimes leave a SingleDelete behind when we encounter a Put followed by a SingleDelete.  See the comments in this diff for a more detailed explanation.

Test Plan: added more unit tests

Reviewers: rven, igor, kradhakrishnan, IslamAbdelRahman, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50295
2015-12-10 11:35:38 -08:00
charsyam c30b499541 fix typos in comments 2015-12-11 01:54:48 +09:00
sdong 56e77f0967 Deprecate options.soft_rate_limit and add options.soft_pending_compaction_bytes_limit
Summary: Deprecate options.soft_rate_limit, which is hard to tune, in favor of options.soft_pending_compaction_bytes_limit, which triggers the slowdown if estimated pending compaction bytes exceed the threshold. The hope is to make it more straightforward to tune.

Test Plan: Modify DBTest.SoftLimit to cover options.soft_pending_compaction_bytes_limit instead; run all unit tests.

Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, igor, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51117
2015-12-09 18:22:45 -08:00
sdong d6e1035a1f A new compaction picking priority that optimizes for write amplification for random updates.
Summary: Introduce a compaction picking priority that picks files that contain the oldest rows to compact. This is a mode that slightly improves write amplification for random update cases.
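
One way to picture such a priority is ordering candidate files by the age of their data. The sketch below is an assumption-laden illustration: the `FileInfo` struct, the function name, and the use of the largest sequence number as the age proxy are all hypothetical, not taken from this diff.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-file metadata; RocksDB's real file metadata is richer.
struct FileInfo {
  uint64_t file_number;
  uint64_t largest_seqno;  // assumed proxy for how recent the file's rows are
};

// Returns file numbers ordered so that files holding the oldest rows
// (smallest sequence numbers) are picked for compaction first.
std::vector<uint64_t> OrderOldestFirst(std::vector<FileInfo> files) {
  assert(!files.empty());
  std::stable_sort(files.begin(), files.end(),
                   [](const FileInfo& a, const FileInfo& b) {
                     return a.largest_seqno < b.largest_seqno;
                   });
  std::vector<uint64_t> order;
  for (const FileInfo& f : files) {
    order.push_back(f.file_number);
  }
  return order;
}
```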

Test Plan: Add a unit test and run it in valgrind too.

Reviewers: yhchiang, anthony, IslamAbdelRahman, rven, kradhakrishnan, MarkCallaghan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51459
2015-12-09 18:13:03 -08:00
Yueh-Hsuan Chiang 0991cee6cd Merge pull request #815 from SherlockNoMad/CounterFix
Fix EstimateNumKeys Counter Inaccurate Issue
2015-12-09 14:10:49 -08:00
sdong ac8e56f050 db_bench: in uncompress benchmark, get Snappy size from compressed stream
Summary: Currently, in the "uncompress" benchmark in db_bench, we get the size from the compressed stream for all compression types except Snappy, where we allocate memory based on a parameter. Change it to match the behavior of the other compression types.

Test Plan: Run ./db_bench --benchmarks=uncompress with snappy and other compression types.

Reviewers: yhchiang, kradhakrishnan, anthony, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51681
2015-12-08 18:11:58 -08:00
Siying Dong fa3dbf203f Merge pull request #853 from Vaisman/enable_C4267_warning
Enable C4267 warning
2015-12-08 17:59:24 -08:00
Siying Dong ad6aaf4fab Merge pull request #848 from SherlockNoMad/db_bench
Split histogram per OperationType in db_bench
2015-12-08 17:58:40 -08:00
Yueh-Hsuan Chiang 774b80e99e Resubmit the fix for a race condition in persisting options
Summary:
This patch fixes a race condition in persisting options which will cause a crash when:

* Thread A obtains cf options and starts to persist options based on those cf options.
* Thread B kicks in, finishes DropColumnFamily, and deletes cf_handle.
* Thread A wakes up, tries to finish persisting the options, and crashes.

Test Plan: Add a test in column_family_test that can reproduce the crash

Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51717
2015-12-08 17:01:02 -08:00
agiardullo e5c5f23814 Support marking snapshots for write-conflict checking - Take 2
Summary:
D51183 was reverted due to breaking the LITE build.

This diff is the same as D51183 but with a fix for the LITE BUILD(D51693)

Test Plan: run all unit tests

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51711
2015-12-08 16:47:31 -08:00
Venkatesh Radhakrishnan 3d8bb2c890 Fix valgrind failure in IncreaseUniversalCompactionNumLevels
Summary:
Fixing a valgrind failure in DBTestUniversalCompaction
in the IncreaseUniversalCompactionNumLevels test. Using
SpecialSkipList with 10 rows per file.

Test Plan: Run valgrind and functional tests.

Reviewers: anthony, yhchiang, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51705
2015-12-08 11:45:29 -08:00
sdong 1d63c3d610 Revert "Support marking snapshots for write-conflict checking"
This reverts commit ec704aafdc for it broke RocksDB LITE build.
2015-12-08 09:27:17 -08:00
agiardullo ec704aafdc Support marking snapshots for write-conflict checking
Summary:
D50475 enables using SST files for transaction write-conflict checking.  In order for this to work, we need to make sure not to compact out SingleDeletes when there is an earlier transaction snapshot(D50295).  If there is a long-held snapshot, this could reduce the benefit of the SingleDelete optimization.

This diff allows Transactions to mark snapshots as being used for write-conflict checking.  Then, during compaction, we will be able to optimize SingleDeletes better in the future.

This diff adds a flag to SnapshotImpl which is used by Transactions.  This diff also passes the earliest write-conflict snapshot's sequence number to CompactionIterator.  This diff does not actually change Compaction (after this diff is pushed, D50295 will be able to use this information).

Test Plan: no behavior change, ran existing tests

Reviewers: rven, kradhakrishnan, yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51183
2015-12-07 19:40:51 -08:00
sdong 770dea9325 Fix occasional failure of DBTest.DynamicCompactionOptions
Summary: DBTest.DynamicCompactionOptions occasionally fails during valgrind runs. We send a sleeping task to block the compaction thread pool but we don't wait for it to run.

Test Plan: Run the test multiple times in an environment which can cause failure.

Reviewers: rven, kradhakrishnan, igor, IslamAbdelRahman, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51687
2015-12-07 18:38:39 -08:00
SherlockNoMad ebc2d490d1 Split histogram per OperationType in db_bench 2015-12-07 17:33:18 -08:00
sdong f307036bde Revert "Fix a race condition in persisting options"
This reverts commit 2fa3ed5180. It breaks RocksDB lite build
2015-12-07 17:09:12 -08:00
Yueh-Hsuan Chiang 2fa3ed5180 Fix a race condition in persisting options
Summary:
This patch fixes a race condition in persisting options which will cause a crash when:

* Thread A obtains cf options and starts to persist options based on those cf options.
* Thread B kicks in, finishes DropColumnFamily, and deletes cf_handle.
* Thread A wakes up, tries to finish persisting the options, and crashes.

Test Plan: Add a test in column_family_test that can reproduce the crash

Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51609
2015-12-07 15:25:12 -08:00
Venkatesh Radhakrishnan f276c3a821 Fix valgrind failures in 3 tests in db_compaction_test due to new skiplist changes
Summary:
Several tests in db_compaction_test are failing with aborts in
valgrind. These are LevelCompactionThirdPath, LevelCompactionPathUse and
CompressLevelCompaction. We now use the SpecialSkipListFactory to make
them more deterministic

Test Plan: valgrind

Reviewers: anthony, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51663
2015-12-07 11:57:00 -08:00
sdong 291088ae4e Fix undeterministic failure of ColumnFamilyTest.DifferentWriteBufferSizes
Summary: After the skip list optimization, ColumnFamilyTest.DifferentWriteBufferSizes can occasionally fail with flush triggering of column family 3. Insert more data to it to make sure flush will trigger.

Test Plan: Run it multiple times with jemalloc both on and off and see that it always passes. (Without the commit, the run with jemalloc fails with a chance of about one in two.)

Reviewers: rven, yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51645
2015-12-07 10:53:29 -08:00
SherlockNoMad 355fa94365 EstimatedNumKeys Counter Inaccurate 2015-12-07 10:51:08 -08:00
Islam AbdelRahman a9ca9107b9 Fix db_universal_compaction_test
Summary:
db_universal_compaction_test is still failing because of
UniversalCompactionNumLevels/DBTestUniversalCompaction.UniversalCompactionSecondPathRatio/0

https://travis-ci.org/facebook/rocksdb/jobs/94949919

Use same approach to fix other tests to fix this test

Test Plan: Run ./db_universal_compaction_test on mac and make sure all the tests pass

Reviewers: kradhakrishnan, yhchiang, rven, anthony, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D51591
2015-12-04 13:27:56 -08:00
krad d3bb572da6 Build break fix.
Summary: Skip list now cannot estimate memory across allocators
consistently and hence triggers flush at different time. This breaks certain
unit tests.

The fix is to adopt key count instead of size for flush.

Test Plan: Ran test on dev box and mac (where it used to fail)

Reviewers: sdong

CC: leveldb@

Task ID: #9273334

Blame Rev:
2015-12-04 11:45:51 -08:00
Alex Yang e8180f9901 added public api to schedule flush/compaction, code to prevent race with db::open
Summary:
Fixes T8781168.

Added a new function EnableAutoCompactions in db.h to be publicly
available.  This allows compaction to be re-enabled after disabling it via
SetOptions

Refactored code to set the dbptr earlier on in TransactionDB::Open and DB::Open
Temporarily disable auto_compaction in TransactionDB::Open until dbptr is set to
prevent race condition.

Test Plan:
Ran make all check

verified fix on myrocks side:
was able to reproduce the seg fault with
../tools/mysqltest.sh --mem --force rocksdb.drop_table

method was to manually sleep the thread after DB::Open but before TransactionDB ptr was
assigned in transaction_db_impl.cc:
  DB::Open(db_options, dbname, column_families_copy, handles, &db);
  clock_t goal = (60000 * 10) + clock();
  while (goal > clock());
  ...dbptr(aka rdb) gets assigned below

verified my changes fixed the issue.

Also added unit test 'ToggleAutoCompaction' in transaction_test.cc

Reviewers: hermanlee4, anthony

Reviewed By: anthony

Subscribers: alex, dhruba

Differential Revision: https://reviews.facebook.net/D51147
2015-12-03 22:59:44 -08:00
Islam AbdelRahman 19b1201b2b Merge pull request #865 from yuslepukhin/fix_db_table_properties_test
Avoid empty ranges vector with subsequent zero element access
2015-12-03 17:32:20 -08:00
yuslepukhin e0de7ef87b Avoid empty ranges vector with subsequent zero element access 2015-12-02 14:50:33 -08:00
Yueh-Hsuan Chiang a330f0b3bb Fix incorrect merge in db/db_compaction_test.cc
Summary: Fix incorrect merge in db/db_compaction_test.cc

Test Plan: db_compaction_test

Reviewers: igor, sdong, anthony, IslamAbdelRahman, rven, kradhakrishnan

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51531
2015-12-02 14:09:09 -08:00
Yueh-Hsuan Chiang bd7a49d448 Make DBCompactionTestWithParam::CompactionTrigger more deterministic
Summary: Make DBCompactionTestWithParam::CompactionTrigger more deterministic

Test Plan: ./db_compaction_test

Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51507
2015-12-02 14:06:33 -08:00
sdong bcd7bd1229 Relax verification condition of DBTest.SuggestCompactRangeTest
Summary: Verification condition of DBTest.SuggestCompactRangeTest is too strict. Based on key distribution, we might have more small files in the last level. Do not check the number of files in the last level.

Test Plan: Run DBTest.SuggestCompactRangeTest with both of jemalloc on and off.

Reviewers: rven, IslamAbdelRahman, yhchiang, kradhakrishnan, igor, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51501
2015-12-01 21:12:24 -08:00
sdong f9103d9a30 DBTest.DynamicCompactionOptions: More deterministic and readable
Summary: DBTest.DynamicCompactionOptions sometimes fails the assert but I can't repro it locally. Make it more deterministic and readable and see whether the problem is still there.

Test Plan: Run the test and make sure it passes

Reviewers: kradhakrishnan, yhchiang, igor, rven, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51309
2015-12-01 16:49:47 -08:00
sdong 0ad68518bb Fix DBCompactionTestWithParam.CompactionTrigger in non-jemalloc build.
Summary: DBCompactionTestWithParam.CompactionTrigger fails in non-jemalloc build, after the skip list memtable change. Fix it by making mem table flush trigger by number of entries.

Test Plan: Run the test using both of jemalloc and non-jemalloc build.

Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, igor, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51471
2015-12-01 12:25:22 -08:00
sdong 459c7fba36 Revert previous behavior of internal_key_skipped_count
Summary: With recent commit 33e0c93826, db iterator skips perf context counter internal_key_skipped_count when blindly issuing internal Next(). Now increment the counter by one when issuing this Next()

Test Plan: Run all existing tests

Reviewers: rven, yhchiang, IslamAbdelRahman, kradhakrishnan, igor, anthony

Reviewed By: anthony

Subscribers: yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51465
2015-11-30 21:55:05 -08:00
agiardullo 481f9edb15 Fix CLANG build
Summary: fix clang build

Test Plan: build

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51453
2015-11-30 20:02:13 -08:00
sdong ef8ed3681c Fix DBTest.SuggestCompactRangeTest for disable jemalloc case
Summary: DBTest.SuggestCompactRangeTest fails when jemalloc is disabled, including ASAN and valgrind builds. It is caused by the improvement of skip list, which allocates different sizes of nodes for new records. Fix it by using a special mem table that triggers a flush by number of entries. In that way the behavior will be consistent for all allocators.

Test Plan: Run the test with both of DISABLE_JEMALLOC=1 and 0

Reviewers: anthony, rven, yhchiang, kradhakrishnan, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51423
2015-11-30 16:40:47 -08:00
sdong db320b1b82 DB to only flush the column family with the largest memtable while option.db_write_buffer_size is hit
Summary: When option.db_write_buffer_size is hit, we currently flush all column families. Move to flush the column family with the largest active memtable instead. In this way, we can avoid too many small files in some cases.
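
The selection rule described in this summary can be sketched in isolation. This is a minimal standalone illustration, not RocksDB code: `ColumnFamilyInfo` and `PickColumnFamilyToFlush` are hypothetical names, and the sketch assumes each column family reports the size of its active memtable in bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-column-family bookkeeping (not a RocksDB type).
struct ColumnFamilyInfo {
  std::string name;
  std::size_t active_memtable_bytes;
};

// Returns the column family with the largest active memtable,
// or nullptr when there are no column families to choose from.
const ColumnFamilyInfo* PickColumnFamilyToFlush(
    const std::vector<ColumnFamilyInfo>& cfs) {
  if (cfs.empty()) {
    return nullptr;
  }
  auto it = std::max_element(
      cfs.begin(), cfs.end(),
      [](const ColumnFamilyInfo& a, const ColumnFamilyInfo& b) {
        return a.active_memtable_bytes < b.active_memtable_bytes;
      });
  assert(it != cfs.end());
  return &*it;
}
```

Flushing only the largest memtable frees the most memory per flush, which is why the summary expects fewer small files than flushing every column family.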

Test Plan: Modify test DBTest.SharedWriteBuffer to work with the updated behavior

Reviewers: kradhakrishnan, yhchiang, rven, anthony, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: march, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51291
2015-11-30 13:36:57 -08:00
sdong 33e0c93826 Reduce extra key comparison in DBIter::Next()
Summary: Now DBIter::Next() always compares the current key with itself first, which is unnecessary if the last key is not a merge key. I made the change and didn't see db_iter_test fail. Want to hear whether people have any idea what I miss.

Test Plan: Run all unit tests

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48279
2015-11-24 17:16:18 -08:00
Nathan Bronson 9a9d4759b2 InlineSkipList part 3/3 - new skiplist type that colocates key and node
Summary:
This diff completes the creation of InlineSkipList<Cmp>, which is like
SkipList<const char*, Cmp> but it always allocates the key contiguously
with the node.  This allows us to remove the pointer from the node
to the key.  As a result the memory usage of the skip list is reduced
(by 1 to sizeof(void*) bytes depending on the padding required to align
the key storage), cache locality is improved, and we halve the number
of calls to the allocator.

For skip lists whose keys are freshly-allocated const char*,
InlineSkipList is strictly preferable to SkipList.  This diff doesn't
replace SkipList, however, because some of the use cases of SkipList in
RocksDB are either character sequences that are not allocated at the
same time as the skip list node allocation (for example
hash_linklist_rep) or have different key types (for example
write_batch_with_index).  Taking advantage of inline allocation for
those cases is left to future work.

The perf win is biggest for small values.  For single-threaded CPU-bound
(32M fillrandom operations with no WAL log) with 16 byte keys and 0 byte
values, the db_bench perf goes from ~310k ops/sec to ~410k ops/sec.  For
large values the improvement is less pronounced, but seems to be between
5% and 10% on the same configuration.
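
The colocation idea can be illustrated with a simplified singly-linked node. This is a sketch under stated assumptions, not the InlineSkipList implementation: `InlineNode`, `AllocateNodeWithKey` and `FreeNode` are hypothetical names, and plain `malloc` stands in for whatever allocator the memtable uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>

// Hypothetical node: the key bytes live directly after the struct, so the
// node needs no pointer to its key and no second allocation for it.
struct InlineNode {
  InlineNode* next;

  const char* Key() const {
    return reinterpret_cast<const char*>(this) + sizeof(InlineNode);
  }
  char* MutableKey() {
    return reinterpret_cast<char*>(this) + sizeof(InlineNode);
  }
};

// One allocation covers both the node header and the key storage.
InlineNode* AllocateNodeWithKey(const char* key, std::size_t key_len) {
  assert(key != nullptr);
  void* raw = std::malloc(sizeof(InlineNode) + key_len);
  if (raw == nullptr) {
    throw std::bad_alloc();
  }
  InlineNode* node = new (raw) InlineNode{nullptr};
  std::memcpy(node->MutableKey(), key, key_len);
  return node;
}

void FreeNode(InlineNode* node) { std::free(node); }
```

Compared with a node that stores a `const char*` to a separately allocated key, this layout saves the pointer and one allocator call per insert, which matches the savings the summary describes.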

Test Plan: make check

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D51123
2015-11-24 15:16:02 -08:00
Nathan Bronson 5201729545 InlineSkipList - part 2/3
Summary:
This diff is 2/3 in a sequence that introduces a skip list optimized
for a key that is a freshly-allocated const char*.  The change is broken
into pieces to make it easier to review.  This piece removes the Key
template type, introduces the AllocateKey interface, and changes the
unit test from using uint64_t as the Key type to using pointers to an 8
byte blob.

Test Plan: unit test

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D51285
2015-11-24 14:30:56 -08:00
Nathan Bronson 78812ec6bf InlineSkipList - part 1/3
Summary:
This diff is 1/3 in a sequence that introduces a skip list optimized for
a key that is a freshly-allocated const char*.  The diff is broken into
pieces to make it easier to review.  This piece only introduces the new
type by copying the existing SkipList, with mechanical naming changes
and reformatting.

Test Plan: new unit test

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D51279
2015-11-24 14:30:22 -08:00
Vasili Svirski 41b32c6059 Enable C4267 warning
* conversion from 'size_t' to 'type', by add static_cast

Tested:
* by build solution on Windows, Linux locally,
* run tests
* build CI system successful
2015-11-24 16:33:09 +03:00
agiardullo c5b467306d Fix race condition that causes valgrind failures
Summary: DBTest.DynamicLevelCompressionPerLevel2 sometimes fails during valgrind runs.  This causes our valgrind tests to fail.  Not sure what the best fix is for this test, but hopefully this simple change is sufficient.

Test Plan: run test

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51111
2015-11-20 18:26:48 -08:00
Siying Dong efb01a055a Merge pull request #850 from yuslepukhin/enable_2015_build
Build on Visual Studio 2015 Update 1
2015-11-20 17:57:22 -08:00
Venkatesh Radhakrishnan 81be49c755 Have a way for compaction filter to ignore snapshots
Summary:
Provide an API for compaction filter to specify that it needs
to be applied even if there are snapshots.

Test Plan: DBTestCompactionFilter.CompactionFilterIgnoreSnapshot

Reviewers: yhchiang, IslamAbdelRahman, sdong, anthony

Reviewed By: anthony

Subscribers: yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51087
2015-11-20 15:57:26 -08:00
yuslepukhin 047bd22aae Build on Visual Studio 2015 Update 1 2015-11-20 15:31:47 -08:00
sdong 189b3e03df Fix uninitialized SpecialEnv::time_elapse_only_sleep_
Summary: SpecialEnv::time_elapse_only_sleep_ is not initialized, which might cause some test failures. Fix it.

Test Plan: Run some unit tests. Since tests are already broken, might want to commit it sooner.

Reviewers: IslamAbdelRahman, yhchiang, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D50937
2015-11-17 16:22:17 -08:00
sdong d5540e18e6 DBTest.MergeTestTime to only use fake time to be deterministic
Summary: DBTest.MergeTestTime is a test verifying timing counters. Depending on real time may cause non-deterministic results. Change to fake time to be deterministic.

Test Plan: Run the test and make sure it passes

Reviewers: yhchiang, anthony, rven, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D50883
2015-11-17 14:40:23 -08:00
Islam AbdelRahman 605a24d94e Block forward_iterator_bench under MAC and Windows
Summary:
Travis is now failing because we cannot compile forward_iterator_bench under MAC
https://travis-ci.org/facebook/rocksdb/jobs/91524025

In forward_iterator_bench.cc we are using multiple functions that are not available in MAC like
htobe64
be64toh

Blocking forward_iterator_bench under MAC

Test Plan: compile under mac

Reviewers: rven, yhchiang, anthony, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50889
2015-11-17 11:51:37 -08:00
Venkatesh Radhakrishnan 9b8c9be0b5 Fix forward_iterator allocation of vector.
Summary:
db_tailing_iter_test was failing on some platforms because of
an incorrect allocation and use. This diff fixes the issue.

Test Plan:
db_tailing_iter_test
Run valgrind for db_tailing_iter_test

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50835
2015-11-17 10:27:51 -08:00
sdong 5cbb7e43e0 DBTest.MergeTestTime: relax counter upper bound verification
Summary: Timing counters' upper bounds depend on platform. It frequently fails in valgrind runs. Relax the upper bound.

Test Plan: Run the same valgrind test and make sure it passes.

Reviewers: rven, anthony, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D50829
2015-11-16 19:47:07 -08:00
Reid Horuff 3381e2c3e7 Handle multiple calls to DBImpl::PauseBackgroundWork() and DBImpl::ContinueBackgroundWork()
Summary: Handle multiple calls to DBImpl::PauseBackgroundWork() and DBImpl::ContinueBackgroundWork()
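
One way to make repeated pause and continue calls compose is a pause counter, so background work resumes only after every pause has been matched. The sketch below is an assumption about the general technique, not a copy of DBImpl; `BackgroundWorkGate` and its methods are hypothetical names.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical gate: background work may run only while the count is zero.
class BackgroundWorkGate {
 public:
  // Each call adds one outstanding pause request.
  void Pause() {
    std::lock_guard<std::mutex> lock(mu_);
    ++pause_count_;
  }

  // Releases one pause request. Returns false if nothing was paused,
  // so an unmatched call cannot drive the counter negative.
  bool Continue() {
    std::lock_guard<std::mutex> lock(mu_);
    if (pause_count_ == 0) {
      return false;
    }
    --pause_count_;
    assert(pause_count_ >= 0);
    return true;
  }

  bool Paused() const {
    std::lock_guard<std::mutex> lock(mu_);
    return pause_count_ > 0;
  }

 private:
  mutable std::mutex mu_;
  int pause_count_ = 0;
};
```

With a plain boolean flag, two callers that both pause would be undone by the first continue; the counter keeps work paused until the second caller also continues.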

Test Plan: rocksdb.information_schema handles this case.

Reviewers: igor

Reviewed By: igor

Subscribers: hermanlee4, jkedgar, dhruba

Differential Revision: https://reviews.facebook.net/D50781
2015-11-16 14:20:18 -08:00
Islam AbdelRahman ca5566d209 Fix clang build
Summary: Fix clang

Test Plan:
USE_CLANG=1 make all -j64

Reviewers: sdong, yhchiang, anthony, rven

Reviewed By: rven

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50793
2015-11-16 14:14:39 -08:00
Dmitri Smirnov cb9459f85c Fix empty vector write in ForwardIterator 2015-11-16 13:58:10 -08:00
Islam AbdelRahman a163cc2d5a Lint everything
Summary:
```
arc2 lint --everything
```

run the linter on the whole code repo to fix existing lint issues

Test Plan: make check -j64

Reviewers: sdong, rven, anthony, kradhakrishnan, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50769
2015-11-16 12:56:21 -08:00
sdong dac5b248b1 UniversalCompactionPicker::PickCompaction(): avoid to form compactions if there is no file
Summary:
Currently RocksDB may break in lines like this:

for (size_t i = sorted_runs.size() - 1; i >= first_index_after; i--) {

if options.level0_file_num_compaction_trigger=0.

Fix it by not executing the logic of picking compactions if there is no file (sorted_runs.size() = 0). Also internally set options.level0_file_num_compaction_trigger=1 if users give a 0. 0 is a value that makes no sense in RocksDB.
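
The breakage comes from unsigned arithmetic: when `sorted_runs` is empty, `sorted_runs.size() - 1` wraps around to the largest `size_t` value instead of becoming -1. A minimal standalone sketch of a guarded backward loop follows; `CountFromBack` is a hypothetical helper, not the compaction picker code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: counts the elements at indices
// [first_index_after, size) while walking from the back, the same
// direction as the loop quoted above.
std::size_t CountFromBack(const std::vector<int>& sorted_runs,
                          std::size_t first_index_after) {
  // Guard: with no files there is nothing to pick, and size() - 1
  // would wrap around instead of going negative.
  if (sorted_runs.empty()) {
    return 0;
  }
  std::size_t visited = 0;
  // Loop on i > first_index_after and index with i - 1, so the counter
  // never has to drop below zero even when first_index_after == 0.
  for (std::size_t i = sorted_runs.size(); i > first_index_after; --i) {
    assert(i - 1 < sorted_runs.size());
    ++visited;
  }
  return visited;
}
```

Note that a condition of the form `i >= first_index_after` can never become false for an unsigned `i` when `first_index_after` is 0, which is a second reason to prefer the `i > first_index_after` form.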

Test Plan: Run all tests. Will add a unit test too.

Reviewers: yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, rven

Reviewed By: rven

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D50727
2015-11-16 10:32:45 -08:00
Venkatesh Radhakrishnan d06b63e99f Fix Rocksdb lite build failure in forward_iterator_bench
Summary:
Fixed Rocksdb lite build failure in forward_iterator_bench by
defining main for the ROCKSDB_LITE case

Test Plan: build ROCKSDB_LITE

Reviewers: anthony, yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50733
2015-11-16 09:57:08 -08:00
Venkatesh Radhakrishnan 7824444bfc Reuse file iterators in tailing iterator when memtable is flushed
Summary:
Under a tailing workload, there were increased block cache
misses when a memtable was flushed because we were rebuilding iterators
in that case since the version set changed. This was exacerbated in the
case of iterate_upper_bound, since file iterators which were over the
iterate_upper_bound would have been deleted and are now brought back as
part of the Rebuild, only to be deleted again. We now renew the iterators
and only build iterators for files which are added and delete file
iterators for files which are deleted.
Refer to https://reviews.facebook.net/D50463 for previous version

Test Plan: DBTestTailingIterator.TailingIteratorTrimSeekToNext

Reviewers: anthony, IslamAbdelRahman, igor, tnovak, yhchiang, sdong

Reviewed By: sdong

Subscribers: yhchiang, march, dhruba, leveldb, lovro

Differential Revision: https://reviews.facebook.net/D50679
2015-11-13 15:50:59 -08:00
Venkatesh Radhakrishnan 2ae4d7d708 Make sure that CompactFiles does not run two parallel Level 0 compactions
Summary:
Since level 0 files can overlap, two level 0 compactions cannot
run in parallel. Compact files needs to check this before running a
compaction.

Test Plan: CompactFilesTest.L0ConflictsFiles

Reviewers: igor, IslamAbdelRahman, anthony, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50079
2015-11-13 12:01:00 -08:00
Nathan Bronson 6ce42dd075 Don't merge WriteBatch-es if WAL is disabled
Summary:
There's no need for WriteImpl to flatten the write batch group
into a single WriteBatch if the WAL is disabled.  This diff moves the
flattening into the WAL step, and skips flattening entirely if it isn't
needed.  It's good for about 5% speedup on a multi-threaded workload
with no WAL.

This diff also adds clarifying comments about the chance for partial
failure of WriteBatchInternal::InsertInto, and always sets bg_error_ if
the memtable state diverges from the logged state or if a WriteBatch
succeeds only partially.

Benchmark for speedup:
  db_bench -benchmarks=fillrandom -threads=16 -batch_size=1 -memtablerep=skip_list -value_size=0 --num=200000 -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000

Test Plan: asserts + make check

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50583
2015-11-12 10:50:38 -08:00
Yueh-Hsuan Chiang 56245ddcf5 Fixed DBCompactionTest.SkipStatsUpdateTest
Summary:
DBCompactionTest.SkipStatsUpdateTest relies on the number
of files opened during the DB::Open process, but the persisting
options file support altered this number and thus makes
DBCompactionTest.SkipStatsUpdateTest fail in certain environments.

This patch fixed this test failure.

Test Plan: db_compaction_test

Reviewers: igor, sdong, anthony, IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50637
2015-11-12 07:45:53 -08:00
Yueh-Hsuan Chiang e78389b554 Fixed build failure of RocksDBLite test on options_file_test.cc
Summary: Fixed build failure of RocksDBLite test

Test Plan: options_file_test

Reviewers: igor, sdong, anthony, IslamAbdelRahman

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50595
2015-11-10 23:23:36 -08:00
Yueh-Hsuan Chiang e114f0abb8 Enable RocksDB to persist Options file.
Summary:
This patch allows rocksdb to persist options into a file on
DB::Open, SetOptions, and Create / Drop ColumnFamily.
Options files are created under the same directory as the rocksdb
instance.

In addition, this patch also adds a fail_if_missing_options_file in DBOptions
that makes any function call return non-ok status when it is not able to
persist options properly.

  // If true, then DB::Open / CreateColumnFamily / DropColumnFamily
  // / SetOptions will fail if options file is not detected or properly
  // persisted.
  //
  // DEFAULT: false
  bool fail_if_missing_options_file;

Options file names are formatted as OPTIONS-<number>, and RocksDB
will always keep the latest two options files.
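
The naming and retention rule can be sketched as two small helpers. Both function names are hypothetical, and the sketch makes no claim about how RocksDB pads or parses the number; it only follows the OPTIONS-<number> pattern and the "keep the latest two" rule stated above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical helper: builds a name following OPTIONS-<number>.
// Any zero padding used by the real file names is not modeled here.
std::string OptionsFileName(uint64_t number) {
  return "OPTIONS-" + std::to_string(number);
}

// Hypothetical helper: returns the file numbers that fall outside the
// newest `keep` files, newest first, so the caller can delete them.
std::vector<uint64_t> ObsoleteOptionsFiles(std::vector<uint64_t> numbers,
                                           std::size_t keep) {
  assert(keep > 0);
  std::sort(numbers.begin(), numbers.end(), std::greater<uint64_t>());
  if (numbers.size() <= keep) {
    return {};
  }
  return std::vector<uint64_t>(
      numbers.begin() + static_cast<std::ptrdiff_t>(keep), numbers.end());
}
```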

Test Plan:
Add options_file_test.

options_test
column_family_test

Reviewers: igor, IslamAbdelRahman, sdong, anthony

Reviewed By: anthony

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48285
2015-11-10 22:58:01 -08:00
Nathan Bronson 631863c63b track WriteBatch contents
Summary:
Parallel writes will only be possible for certain combinations of
flags and WriteBatch contents.  Traversing the WriteBatch at write time
to check these conditions would be expensive, but it is very cheap to
keep track of when building WriteBatch-es.  When loading WriteBatch-es
during recovery, a deferred computation state is used so that the flags
never need to be computed.
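
Tracking contents at build time amounts to OR-ing a flag into the batch on every append, so a later check is a bit test rather than a traversal. The sketch below illustrates that idea only; `TrackedBatch`, `ContentFlag` and the flag names are hypothetical and do not reflect the actual WriteBatch layout.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical content flags, one bit per kind of operation.
enum ContentFlag : uint32_t {
  kHasPut = 1u << 0,
  kHasDelete = 1u << 1,
  kHasMerge = 1u << 2,
};

// Hypothetical batch that records what it contains as it is built.
class TrackedBatch {
 public:
  void Put(std::string key, std::string value) {
    Append(kHasPut, std::move(key), std::move(value));
  }
  void Delete(std::string key) { Append(kHasDelete, std::move(key), ""); }
  void Merge(std::string key, std::string value) {
    Append(kHasMerge, std::move(key), std::move(value));
  }

  // Constant-time check; no need to walk the stored operations.
  bool Has(ContentFlag flag) const { return (flags_ & flag) != 0; }

 private:
  void Append(ContentFlag flag, std::string key, std::string value) {
    assert(flag != 0);
    flags_ |= flag;
    ops_.emplace_back(std::move(key), std::move(value));
  }

  uint32_t flags_ = 0;
  std::vector<std::pair<std::string, std::string>> ops_;
};
```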

Test Plan:
1. add asserts and EXPECT_EQ-s
2. make check

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50337
2015-11-10 16:56:06 -08:00
Nathan Bronson b81b430987 Switch to thread-local random for skiplist
Summary:
Using a TLS random instance for skiplist makes it smaller
(useful for hash_skiplist_rep) and prepares skiplist for concurrent
adds.  This diff also modifies the branching factor math to avoid an
unnecessary division.

This diff has the effect of changing the sequence of skip list node
height choices made by tests, so it has the potential to cause unit
test failures for tests that implicitly rely on the exact structure
of the skip list.  Tests that try to exactly trigger a compaction are
likely suspects for this problem (these tests have always been brittle to
changes in the skiplist details).  I've minimized this risk by reseeding
the main thread's Random at the beginning of each test, increasing the
universal compaction size_ratio limit from 101% to 105% for some tests,
and verifying that the tests pass many times.
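
A rough sketch of the two ideas in this summary: a per-thread generator, and a height choice that compares each draw against a precomputed threshold instead of dividing per draw. The constants and the use of `std::mt19937` are assumptions for illustration; RocksDB uses its own Random class and its own parameters.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

constexpr int kMaxHeight = 12;      // assumed value for the sketch
constexpr uint64_t kBranching = 4;  // assumed value for the sketch

// Draws are uniform over [0, 2^32). A draw below this threshold occurs
// with probability 1 / kBranching. The division happens once, at compile
// time, rather than once per draw.
constexpr uint64_t kScaledInverseBranching = (uint64_t{1} << 32) / kBranching;

// Picks a node height using a generator owned by the calling thread, so
// the skip list itself carries no random state and needs no lock for it.
int RandomHeight() {
  thread_local std::mt19937 rng(std::random_device{}());
  int height = 1;
  while (height < kMaxHeight && rng() < kScaledInverseBranching) {
    ++height;
  }
  assert(height >= 1 && height <= kMaxHeight);
  return height;
}
```

Because each thread seeds its own generator, the sequence of heights differs from a list-owned generator, which is the source of the test brittleness the summary warns about.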

Test Plan: for i in `seq 0 9`; do make check; done

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50439
2015-11-09 19:25:22 -08:00
Islam AbdelRahman 5b9ce1a323 Merge pull request #820 from yuslepukhin/enable_compiler_warnings
Enable Windows warnings C4307 C4309 C4512 C4701
2015-11-06 12:08:25 -08:00
Dmitri Smirnov 20f57b1715 Enable Windows warnings C4307 C4309 C4512 C4701
Enable C4307 'operator' : integral constant overflow
  Longs and ints on Windows are 32-bit hence the overflow
  Enable C4309 'conversion' : truncation of constant value
  Enable C4512 'class' : assignment operator could not be generated
  Enable C4701 Potentially uninitialized local variable 'name' used
2015-11-06 11:34:06 -08:00
Nathan Bronson 2b42000f43 incorrect batch group size computation for write throttling
Summary:
When a write batch can't join a batch group due to the total
size of the contained batches, the write controller's GetDelay is passed
a size value that includes the rejected batch.
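
The corrected accounting can be sketched as a running total that stops before the batch that does not fit. `AcceptedGroupBytes` is a hypothetical helper, and the sketch assumes the first batch (the group leader) is always part of the group.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the total size of the batches that actually
// joined the group. The batch that would exceed the limit is rejected and
// must not be counted in the value used for throttling.
std::size_t AcceptedGroupBytes(const std::vector<std::size_t>& batch_bytes,
                               std::size_t max_group_bytes) {
  if (batch_bytes.empty()) {
    return 0;
  }
  // Assumption: the leader's batch is always included.
  std::size_t total = batch_bytes[0];
  for (std::size_t i = 1; i < batch_bytes.size(); ++i) {
    if (total + batch_bytes[i] > max_group_bytes) {
      break;  // rejected: leave `total` untouched
    }
    total += batch_bytes[i];
  }
  assert(total >= batch_bytes[0]);
  return total;
}
```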

Test Plan: make check

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50343
2015-11-06 09:23:55 -08:00
Venkatesh Radhakrishnan ae7940b628 Fix regression failure in PrefixTest.PrefixValid
Summary: Use IterKey to store prefix_start_ so that it doesn't get freed

Test Plan: PrefixTest.PrefixValid

Reviewers: anthony, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50289
2015-11-05 16:43:54 -08:00
Venkatesh Radhakrishnan 9d50afc3b9 Prefix-based iterating only shows keys in prefix
Summary:
MyRocks testing found an issue that while iterating over keys
that are outside the prefix, sometimes wrong results were seen for keys
outside the prefix. We now tighten the range of keys seen with a new
read option called prefix_seen_at_start. This remembers the starting
prefix and then compares it on a Next for equality of prefix. If they
are from a different prefix, it sets valid to false.
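
The behavior described here can be sketched over a sorted vector of keys. This is an illustration of the rule, not the iterator implementation: `ScanSamePrefix` is a hypothetical helper, and a fixed-length prefix stands in for whatever prefix extractor is configured.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: seeks to `target`, remembers the target's prefix,
// and stops as soon as a key with a different prefix is reached (the
// point where the iterator would report itself as no longer valid).
std::vector<std::string> ScanSamePrefix(
    const std::vector<std::string>& sorted_keys, const std::string& target,
    std::size_t prefix_len) {
  assert(std::is_sorted(sorted_keys.begin(), sorted_keys.end()));
  const std::string start_prefix = target.substr(0, prefix_len);
  std::vector<std::string> seen;
  auto it = std::lower_bound(sorted_keys.begin(), sorted_keys.end(), target);
  for (; it != sorted_keys.end(); ++it) {
    if (it->compare(0, prefix_len, start_prefix) != 0) {
      break;  // different prefix: iteration ends here
    }
    seen.push_back(*it);
  }
  return seen;
}
```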

Test Plan: PrefixTest.PrefixValid

Reviewers: IslamAbdelRahman, sdong, yhchiang, anthony

Reviewed By: anthony

Subscribers: spetrunia, hermanlee4, yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50211
2015-11-05 13:24:05 -08:00
Yueh-Hsuan Chiang 7d7ee2b654 Add Memory Insight support to utilities
Summary:
This patch introduces utilities/memory, which currently includes
GetApproximateMemoryUsageByType that reports different types of
rocksdb memory usage given a list of input DBs.

The API also takes care of the case where Cache could be shared
across multiple column families / multiple db instances.

Currently, it reports memory usage of memtable, table-readers
and cache.

Test Plan: utilities/memory/memory_test.cc

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D49257
2015-11-03 17:52:17 -08:00
Yueh-Hsuan Chiang 3ecbab0040 Add GetAggregatedIntProperty(): returns the aggregated value from all CFs
Summary:
This patch adds GetAggregatedIntProperty() that returns the aggregated
value from all CFs

Test Plan: Added a test in db_test

Reviewers: igor, sdong, anthony, IslamAbdelRahman, rven

Reviewed By: rven

Subscribers: rven, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D49497
2015-11-03 15:54:18 -08:00
Islam AbdelRahman f31442fb5c Merge pull request #803 from SherlockNoMad/SkipFlush
Add Option to Skip Flushing in TableBuilder
2015-11-02 14:56:11 -08:00
Igor Canadi 279c8e0cd8 Merge pull request #811 from OverlordQ/unused-variable-warning
Fix introduced in 2ab7065 was reverted by 18285c1.
2015-11-02 12:44:27 -08:00
Brent Garber affd833690 Fix introduced in 2ab7065 was reverted by 18285c1.
Corrects:

db/memtablerep_bench.cc:135:22: error: ‘FLAGS_env’ defined but not used [-Werror=unused-variable]
 static rocksdb::Env* FLAGS_env = rocksdb::Env::Default();
                      ^
cc1plus: all warnings being treated as errors
Makefile:1147: recipe for target 'db/memtablerep_bench.o' failed
2015-11-02 15:35:45 -05:00
SherlockNoMad ccc8c10c0c Move skip_table_builder_flush to BlockBasedTableOption 2015-10-30 18:33:01 -07:00
Dmitri Smirnov eaaf081d16 Do not suppress C4018 'expression' : signed/unsigned mismatch
The code compiles cleanly for the most part. Fix db_test.
  Move debug file to testutil library.
2015-10-30 17:03:16 -07:00
Islam AbdelRahman ff4499e297 Update DB::AddFile() to have less restrictions
Summary:
Update DB::AddFile() restrictions to be
  - Key range in loaded table file doesn't overlap with existing keys or tombstones in DB.
  - No other writes happen during AddFile call.

The updated AddFile() will verify that the file key range doesn't overlap with any keys or tombstones in the DB, and then add the file to L0

Test Plan: unit tests

Reviewers: igor, rven, anthony, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: adsharma, ameyag, dhruba

Differential Revision: https://reviews.facebook.net/D49233
2015-10-30 16:38:10 -07:00
sdong 11c71a365a db_bench: --compaction_pri default should be rocksdb::Options().compaction_pri
Summary: Currently db_bench's --compaction_pri default is set to be rocksdb::Options().compaction_style. Change it to rocksdb::Options().compaction_pri. Although, for now both are 0.

Test Plan: Build db_bench

Reviewers: anthony, rven, IslamAbdelRahman, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D49773
2015-10-30 15:02:33 -07:00
SherlockNoMad a6dd0831d5 Add Option to Skip Flushing in TableBuilder 2015-10-29 22:10:25 -07:00
Islam AbdelRahman 2872e0c8c2 Clean and expose CreateLoggerFromOptions
Summary:
CreateLoggerFromOptions has some parameters like db_log_dir and env; these parameters are redundant since they already exist in DBOptions.

This patch removes the redundant parameters and exposes CreateLoggerFromOptions to users.

Test Plan: make check

Reviewers: igor, anthony, yhchiang, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, hermanlee4

Differential Revision: https://reviews.facebook.net/D49713
2015-10-29 18:07:37 -07:00
sdong 296c3a1f94 "make format" in some recent commits
Summary: Run "make format" for some recent commits.

Test Plan: Build and run tests

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D49707
2015-10-29 17:11:14 -07:00
Siying Dong 6388e7f4e2 Merge pull request #798 from yuslepukhin/readahead_buffermanagement
Implement smart buffer management in Windows Env.
2015-10-29 15:02:59 -07:00
Dmitri Smirnov 1277a48f1b Fix 80 character limit issue. 2015-10-29 11:34:34 -07:00
Herman Lee 0d720dfc17 Use the correct variable when fetching table properties.
Summary:
An uninitialized parameter was being passed into the call to fetch the table
properties during the compaction notification callbacks.

Test Plan:
Build it with myrocks and verify unit test passed.
Run unit tests.

Reviewers: rven, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D49635
2015-10-28 16:28:11 -07:00
Dmitri Smirnov 6fbc4f9f3e Implement smart buffer management.
introduce a new DBOption random_access_max_buffer_size to limit
  the size of the random access buffer used for unbuffered access.
  Implement read ahead buffering when enabled.
  To that effect propagate compaction_readahead_size and the new option
  to the env options to make it available for the implementation.
  Add Hint() override so SetupForCompaction() call would call Hint()
  readahead can now be setup from both Hint() and EnableReadAhead()
  Add new option random_access_max_buffer_size support
  db_bench, options_helper to make it string parsable
  and the unit test.
2015-10-27 14:44:16 -07:00
Praveen Rao 4ce117c4d5 Merge branch 'master' into wal_filter 2015-10-26 19:03:34 -07:00
Praveen Rao 32cdec634e Fail recovery if filter provides more records than original and corresponding unit-test, fix naming conventions 2015-10-26 18:11:18 -07:00
Siying Dong 138876a62c Merge pull request #746 from ceph/wip-recycle
Add Options.recycle_log_file_num for Recycling WAL Files
2015-10-26 15:01:28 -07:00
Dmitri Smirnov 3c750b59ae No need to #ifdef test only code on windows 2015-10-22 15:15:37 -07:00
sdong e3d4e14075 DBCompactionTestWithParam.ManualCompaction to verify block cache is not filled in manual compaction
Summary: Manual compaction should not fill block cache. Add the verification in unit test

Test Plan: Run the test

Reviewers: yhchiang, kradhakrishnan, rven, IslamAbdelRahman, anthony, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D49089
2015-10-20 10:36:49 -07:00
sdong 6d6776f6b8 Log more information for the add file with overlapping range failure
Summary: crash_test sometimes fails, hitting the add file overlapping assert. Add information to the info logs to help us find the bug.

Test Plan: Run all test suites. Do some manual tests to make sure printing is correct.

Reviewers: kradhakrishnan, yhchiang, anthony, IslamAbdelRahman, rven, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D49017
2015-10-19 17:31:13 -07:00
Praveen Rao 7951b9b079 make field order match initialization order 2015-10-19 17:03:01 -07:00
Praveen Rao 2938c5c137 merge upstream changes 2015-10-19 15:21:33 -07:00
Sage Weil a7b2bedfb0 log_{reader,writer}: recyclable record format
Introduce new tags for records that have a log_number.  This changes the
header size from 7 to 11 for these records, making this a
backward-incompatible change.

If we read a record that belongs to a different log_number (i.e., a
previous instantiation of this log file, before it was most recently
recycled), we return kOldRecord from ReadPhysicalRecord.  ReadRecord
will translate this into a kEof or kBadRecord depending on what the
WAL recovery mode is.

We make several adjustments to the log_test.cc tests to compensate for the
fact that the header size varies between the two modes.
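
The 7 versus 11 byte difference can be made concrete with a small encoder. The sketch assumes the header fields are a 4-byte checksum, a 2-byte length and a 1-byte type, with a 4-byte log number appended for recyclable records, all written little-endian; `EncodeHeader` and `PutFixed` are hypothetical helpers, and the type value used by callers is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kLegacyHeaderSize = 4 + 2 + 1;          // crc, length, type
constexpr std::size_t kRecyclableHeaderSize = 4 + 2 + 1 + 4;  // plus log number

// Appends the low `bytes` bytes of `value`, least significant byte first
// (byte order is an assumption of this sketch).
void PutFixed(std::vector<unsigned char>* out, uint32_t value, int bytes) {
  assert(bytes >= 1 && bytes <= 4);
  for (int i = 0; i < bytes; ++i) {
    out->push_back(static_cast<unsigned char>((value >> (8 * i)) & 0xff));
  }
}

// Hypothetical encoder: recyclable records carry the log number so a
// reader can tell whether a record belongs to the current use of the file.
std::vector<unsigned char> EncodeHeader(uint32_t crc, uint16_t length,
                                        uint8_t type, bool recyclable,
                                        uint32_t log_number) {
  std::vector<unsigned char> header;
  PutFixed(&header, crc, 4);
  PutFixed(&header, length, 2);
  PutFixed(&header, type, 1);
  if (recyclable) {
    PutFixed(&header, log_number, 4);
  }
  return header;
}
```

A reader that finds a log number different from the one it expects is looking at leftover data from before the file was recycled, which is the kOldRecord case described above.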

Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-19 17:24:05 -04:00
Igor Canadi 4e07c99a9a Fix iOS build
Summary: We don't yet have a CI build for iOS, so our iOS compile gets broken sometimes. Most of the errors are from assumption that size_t is 64-bit, while it's actually 32-bit on some (all?) iOS platforms. This diff fixes the compile.

Test Plan:
TARGET_OS=IOS make static_lib

Observe there are no warnings

Reviewers: sdong, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D49029
2015-10-19 13:40:44 -07:00
Praveen Rao 0c59691dde Handle multiple batches in single log record - allow app to return a new batch + allow app to return corrupted record status 2015-10-19 13:27:40 -07:00
Dmitri Smirnov 2f680ed094 Make index same type as auto deduced uint32_t 2015-10-19 12:29:11 -07:00
Dmitri Smirnov 09f853550c uint is not a datatype on Windows. 2015-10-19 11:28:22 -07:00
Alexey Maykov f18acd8875 Fixed the clang compilation failure
Summary: As above.

Test Plan: USE_CLANG=1 make check -j

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48981
2015-10-19 10:38:50 -07:00
Sage Weil 4104e9bb67 log_reader: introduce kBadHeader; drop wal mode from ReadPhysicalRecord
Move the WAL recovery mode logic out of ReadPhysicalRecord.  To do this we
introduce a new type indicating when we fail to read a valid header.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-18 21:24:32 -04:00
Sage Weil 9c33f64d19 log_reader: pass in WALRecoveryMode instead of bool report_eof_inconsistency
Soon our behavior will depend on more than just whether we are in
kAbsoluteConsistency or not.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-18 21:24:32 -04:00
Sage Weil 7188052107 db_test_util: add recycle_log_files to set of tested options
Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-18 21:24:32 -04:00
Sage Weil 3ac13c99d1 log_reader: pass log_number and optional info_log to ctor
We will need the log number to validate the recycle-style CRCs.  The log
is helpful for debugging, but optional, as not all callers have it.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-18 21:24:32 -04:00
Sage Weil 5830c699f2 log_writer: pass log number and whether recycling is enabled to ctor
When we recycle log files, we need to mix the log number into the CRC
for each record.  Note that for logs that don't get recycled (like the
manifest), we always pass a log_number of 0 and false.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-18 21:24:32 -04:00
Sage Weil 666376150c db_impl: recycle log files
If log recycling is enabled, put old WAL files on a recycle queue instead of
deleting them.  When we need a new log file, take a recycled file off the
list if one is available.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-18 21:24:32 -04:00
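
The recycle queue described in 666376150c can be sketched as follows. This is a simplified model under stated assumptions, not the db_impl code: the class name LogRecycler, its method names, and the idea that the queue length is capped by a configured limit (in the spirit of recycle_log_file_num) are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical model of a WAL recycle queue; all names are illustrative.
class LogRecycler {
 public:
  explicit LogRecycler(size_t max_recycled) : max_recycled_(max_recycled) {}

  // Called when a WAL file becomes obsolete. Returns true if the file was
  // queued for reuse, false if the caller should delete it instead.
  bool Retire(uint64_t log_number) {
    if (recycled_.size() >= max_recycled_) {
      return false;
    }
    recycled_.push_back(log_number);
    return true;
  }

  // Called when a new log file is needed. Returns true and stores the oldest
  // queued log number if a recycled file is available.
  bool TakeRecycled(uint64_t* log_number) {
    assert(log_number != nullptr);
    if (recycled_.empty()) {
      return false;
    }
    *log_number = recycled_.front();
    recycled_.pop_front();
    return true;
  }

 private:
  size_t max_recycled_;
  std::deque<uint64_t> recycled_;
};
```

With a limit of 1, retiring log 7 queues it, retiring log 8 falls back to deletion, and the next request for a new log file reuses log 7.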
Sage Weil d666225a0a db_impl: disable recycle_log_files if WAL archive is enabled
We can't recycle the files if they are being archived.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-18 21:21:24 -04:00
Sage Weil 543c12ab06 options: add recycle_log_file_num option
Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-18 21:21:24 -04:00
Alexey Maykov e1a09a7703 Implementation for GetPropertiesOfTablesInRange
Summary: In MyRocks, it is sometimes important to get properties only for a subset of the database. This diff implements the API in RocksDB.

Test Plan: ran the GetPropertiesOfTablesInRange

Reviewers: rven, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48651
2015-10-17 13:34:43 -07:00
Yueh-Hsuan Chiang ad471453e8 Allow GetProperty to report the number of currently running flushes / compactions.
Summary:
Add rocksdb.num-running-compactions and rocksdb.num-running-flushes
to GetIntProperty() that reports the number of currently running
compactions / flushes.

Test Plan: augmented existing tests in db_test

Reviewers: igor, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48693
2015-10-17 00:16:36 -07:00
sdong 277dea78f0 Add more kill points
Summary:
Add kill points in:
1. after creating a file
2. before writing a manifest record
3. before syncing manifest
4. before creating a new current file
5. after creating a new current file

Test Plan: Run all current tests.

Reviewers: yhchiang, igor, anthony, IslamAbdelRahman, rven, kradhakrishnan

Reviewed By: kradhakrishnan

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48855
2015-10-16 14:35:12 -07:00
Venkatesh Radhakrishnan a98fbacfa0 Moving memtable related files from util to a new directory memtable
Summary:
We are cleaning up dependencies.
This diff takes a first step at moving memtable files to their own
directory called memtable. In future diffs, we will move other memtable
files from db to memtable.

Test Plan: make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48915
2015-10-16 14:10:33 -07:00
Islam AbdelRahman 952ad994a9 Fix db_test under ROCKSDB_LITE
Summary:
This diff excludes a lot of tests in db_test that do not compile or that fail under ROCKSDB_LITE

Test Plan:
OPT=-DROCKSDB_LITE make check -j64
make check -j64

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48771
2015-10-15 10:59:31 -07:00
Islam AbdelRahman 6d730b4ae7 Block tests under ROCKSDB_LITE
Summary:
This patch will block all tests (not including db_test) that don't compile / fail under ROCKSDB_LITE

Test Plan:
OPT=-DROCKSDB_LITE make db_compaction_filter_test -j64 &&
OPT=-DROCKSDB_LITE make db_compaction_test -j64 &&
OPT=-DROCKSDB_LITE make db_dynamic_level_test -j64 &&
OPT=-DROCKSDB_LITE make db_log_iter_test -j64 &&
OPT=-DROCKSDB_LITE make db_tailing_iter_test -j64 &&
OPT=-DROCKSDB_LITE make db_universal_compaction_test -j64 &&
OPT=-DROCKSDB_LITE make ldb_cmd_test -j64

make clean

make db_compaction_filter_test -j64 &&
make db_compaction_test -j64 &&
make db_dynamic_level_test -j64 &&
make db_log_iter_test -j64 &&
make db_tailing_iter_test -j64 &&
make db_universal_compaction_test -j64 &&
make ldb_cmd_test -j64

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48723
2015-10-15 10:51:00 -07:00
sdong dae49e829e Make DBTest.ReadLatencyHistogramByLevel more robust
Summary:
Two fixes:
1. Wait for compaction after generating each L0 file so that we are sure there is one L0 file left.
2. https://reviews.facebook.net/D48423 increased the key count from 500 to 700, but the verification phase still queries only the first 500 keys. This diff fixes that bug.

Test Plan: Run the test in the same environment where it used to fail roughly once in tens of runs. It doesn't fail after 1000 runs.

Reviewers: yhchiang, IslamAbdelRahman, igor, rven, kradhakrishnan

Reviewed By: rven, kradhakrishnan

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48759
2015-10-14 16:08:55 -07:00
Islam AbdelRahman b81b2ec25d Fix benchmarks under ROCKSDB_LITE
Summary: Fix db_bench and memtablerep_bench under ROCKSDB_LITE

Test Plan:
OPT=-DROCKSDB_LITE make db_bench -j64
OPT=-DROCKSDB_LITE make memtablerep_bench -j64
make db_bench -j64
make memtablerep_bench -j64

Reviewers: yhchiang, anthony, rven, igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48717
2015-10-14 12:43:00 -07:00
Venkatesh Radhakrishnan e587dbe03a Move manual_compaction_test.cc from util to db
Summary: manual_compaction_test.cc was incorrectly placed in util. Moved to db.

Test Plan: make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48687
2015-10-14 11:06:27 -07:00