Commit graph

1150 commits

Author SHA1 Message Date
Islam AbdelRahman d6c838f1e1 Add SstFileManager (component tracking all SST file in DBs and control the deletion rate)
Summary:
Add a new class SstFileTracker that will be notified whenever a DB add/delete/move and sst file, it will also replace DeleteScheduler
SstFileTracker can be used later to abort writes when we exceed a specific size

Test Plan: unit tests

Reviewers: rven, anthony, yhchiang, sdong

Reviewed By: sdong

Subscribers: igor, lovro, march, dhruba

Differential Revision: https://reviews.facebook.net/D50469
2016-01-28 18:35:01 -08:00
sdong d20915d52a Disable stats about mutex duration by default
Summary: Measuring mutex duration will measure time inside DB mutex, which breaks our best practice. Add a stat level in Statistics class. By default, disable to measure the mutex operations.

Test Plan: Add a unit test to make sure it is off by default.

Reviewers: rven, anthony, IslamAbdelRahman, kradhakrishnan, andrewkr, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D53367
2016-01-26 14:56:55 -08:00
Andrew Kryczka 167bd8856d [directory includes cleanup] Finish removing util->db dependencies 2016-01-26 10:49:24 -08:00
Andrew Kryczka acd7d58695 [directory includes cleanup] Remove util->db dependency for ThreadStatusUtil
Summary:
We can avoid the dependency by forward-declaring ColumnFamilyData and
then treating it as a black box. That means callers of ThreadStatusUtil need to
explicitly provide more options, even if they can be derived from the
ColumnFamilyData, since ThreadStatusUtil doesn't include the definition.

This is part of a series of diffs to eliminate circular dependencies between
directories (e.g., db/* files depending on util/* files and vice-versa).

Test Plan:
  $ ./db_test --gtest_filter=DBTest.GetThreadStatus
  $ make -j32 commit-prereq

Reviewers: sdong, yhchiang, IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53361
2016-01-26 10:49:16 -08:00
Andrew Kryczka 46f9cd46af [directory includes cleanup] Move cross-function test points
Summary:
I split the db-specific test points out into a separate file under db/
directory. There were also a few bugs to fix in xfunc.{h,cc} that prevented it
from compiling previously; see https://reviews.facebook.net/D36825.

Test Plan:
compilation works now, below command works, will also run "make xfunc".

  $ make check ROCKSDB_XFUNC_TEST='managed_new' tests-regexp='DBTest' -j32

Reviewers: sdong, yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D53343
2016-01-26 10:49:05 -08:00
sdong fb9811ee9b Add a perf context level that doesn't measure time for mutex operations
Summary: Timing mutex operations can impact scalability of the system. Add a new perf context level that can measure time counters except for mutex.

Test Plan: Add a new unit test case to make sure it is not set.

Reviewers: IslamAbdelRahman, rven, kradhakrishnan, yhchiang, anthony

Reviewed By: anthony

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D53199
2016-01-25 14:45:03 -08:00
sdong f1ed170105 Add tests to make sure new DB or ColumnFamily options are settable through string
Summary: Add a test to fail if someone adds a DB options.

Test Plan: Run the test, run the test with valgrind. Add an option to DB option in the middle or in the end and make sure it fails.

Reviewers: yhchiang, anthony, IslamAbdelRahman, kradhakrishnan, rven, andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D53097
2016-01-22 16:18:44 -08:00
sdong 2c2b72218d Disable OptionsParserTest.BlockBasedTableOptionsAllFieldsSettable under CLANG
Summary: OptionsParserTest.BlockBasedTableOptionsAllFieldsSettable is failiong under CLANG. Disable the test to unblock the build.

Test Plan: Run it both of CLANG and GCC

Reviewers: kradhakrishnan, rven, andrewkr, anthony, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D53157
2016-01-21 10:56:34 -08:00
sdong 202be23e46 Add test that verifies all options in BlockBasedTableOptions is settable through GetBlockBasedTableOptionsFromString()
Summary: Add a test OptionsParserTest.BlockBasedTableOptionsAdded, which will fail if a new option is added to BlockBasedTableOptions but is not settable through GetBlockBasedTableOptionsFromString().

Test Plan: Run the test. Also manually remove and add options and make sure it fails.

Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, rven, yhchiang, andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52953
2016-01-20 21:59:28 -08:00
sdong fdbff42391 Crash test to make kill decision for every kill point
Summary:
In crash test, when coming to each kill point, we start a random class using seed as current second. With this approach, for every second, the random number used is the same. However, in each second, there are multiple kill points with different frequency. It makes it hard to reason about chance of kill point to trigger. With this commit, we use thread local random seed to generate the random number, so that it will take different values per second, hoping it makes chances of killing much easier to reason about.

Also significantly reduce the kill odd to make sure time before kiling is similar as before.

Test Plan: Run white box crash test and see the killing happens as expected and the run time time before killing reasonable.

Reviewers: kradhakrishnan, IslamAbdelRahman, rven, yhchiang, andrewkr, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52971
2016-01-19 18:11:24 -08:00
David Bernard df7c2f3b59 As per google coding standard replace "using" in option_builder.cc and
geodb_impl.cc
2016-01-20 00:27:29 +00:00
David Bernard d78c6b28c4 Changes for build on solaris
Makefile adjust paths for solaris build
Makefile enable _GLIBCXX_USE_C99 so that std::to_string is available
db_compaction_test.cc Initialise a variable to avoid a compilation error
db_impl.cc Include <alloca.h>
db_test.cc Include <alloca.h>
Environment.java recognise solaris envrionment
options_bulder.cc Make log unambiguous
geodb_impl.cc Make log and floor unambiguous
2016-01-19 04:45:21 +00:00
Gunnar Kudrjavets aec10f734b Guard falloc.h inclusion to avoid build breaks
Summary: Depending on the order of include paths and versions of various headers we may end up in a situation where we'll encounter a build break caused by redefinition of constants. gcc-4.9-glibc-2.20 header update to include/bits/fcntl-linux.h introduced the definitions of FALLOC_FL_* constants. However, linux/falloc.h from kernel-headers also has FALLOC_FL_* constants defined. Therefore during the compilation we'll get "previously defined" errors.

Test Plan:
Both in the environment where the build break manifests (to make sure that the change fixed the problem) and in the environment where everything builds fine (to make sure that there are no regressions):

make clean
make -j 32

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D52821
2016-01-14 22:47:15 -08:00
Dmitri Smirnov ac50fd3a71 Align statistics
Use Yield macro to make it a little more portable between platforms.
2016-01-13 14:53:23 -08:00
Igor Canadi 48a8667c30 Merge pull request #929 from warrenfalk/fix32
fix a compile error on 32-bit (fixes #634)
2016-01-12 11:02:36 -08:00
sdong 9a8e3f73ed plain table reader: non-mmap mode to keep two recent buffers
Summary: In plain table reader's non-mmap mode, we only keep the most recent read buffer. However, for binary search, it is likely we come back to a location to read. To avoid one pread in such a case, we keep two read buffers. It should cover most of the cases.

Test Plan:
1. run tests
2. check the optimization works through strace when running
./table_reader_bench -mmap_read=false --num_keys2=1 -num_keys1=5000 -table_factory=plain_table --iterator --through_db

Reviewers: anthony, rven, kradhakrishnan, igor, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51171
2016-01-08 10:53:57 -08:00
Warren Falk 94d9df2482 fix an unused function compiler warning in crc32c in 32-bit mode 2016-01-07 13:27:20 -05:00
Warren Falk 2f01e10fa9 use static_cast in crc32c instead of c-style cast 2016-01-07 13:22:09 -05:00
Warren Falk 601f1306a1 fix shorten-64-to-32 warning in crc32c 2016-01-07 13:12:15 -05:00
Warren Falk 55b37efa15 fix a compile error on 32-bit 2016-01-07 11:51:52 -05:00
sdong c9e2490bc6 Fix DynamicBloomTest.concurrent_with_perf to pass TSAN
Summary: TSAN fails on DynamicBloomTest.concurrent_with_perf. This change fixes it. Not sure why though.

Test Plan: Run the test with TSAN and make sure no warning shown.

Reviewers: yhchiang, IslamAbdelRahman, anthony, ngbronson, rven

Reviewed By: rven

Subscribers: rven, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52383
2015-12-29 16:28:45 -08:00
sdong edf1cd497f Not generating "__attribute__((__unused__))" for padding fields if it is not CLANG
Summary: Adding "__attribute__((__unused__))" after padding fields will pass CLANG build but will fail gcc 4.8.1. Fix it by not generating it under GCC 4.8.1.

Test Plan: Build under four combinations of USE_CLANG=0,1 and ROCKSDB_FBCODE_BUILD_WITH_481=0.1.

Reviewers: yhchiang, rven, ngbronson, anthony, IslamAbdelRahman

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52371
2015-12-28 18:37:23 -08:00
sdong 11672df19a Fix CLANG errors introduced by 7d87f02799
Summary: Fix some CLANG errors introduced in 7d87f02799

Test Plan: Build with both of CLANG and gcc

Reviewers: rven, yhchiang, kradhakrishnan, anthony, IslamAbdelRahman, ngbronson

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52329
2015-12-28 10:00:58 -08:00
Siying Dong 7fafd52dce Merge pull request #900 from shuzhang1989/hdfs_env_fix
add a factory method for creating hdfs env
2015-12-28 09:28:04 -08:00
Shu Zhang 2b7c810db8 more foramt 2015-12-26 19:52:35 -08:00
Shu Zhang b79ccbd573 indent 2015-12-26 19:50:28 -08:00
Nathan Bronson 7d87f02799 support for concurrent adds to memtable
Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations.  Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention.  Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.

Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off).  This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex.  If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided.  This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).

Parallel writes are not currently compatible with
inplace updates, update callbacks, or delete filtering.
Enable it with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield).  Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.

Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.

This diff was motivated and inspired by Yahoo's cLSM work.  It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.

My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive.  With 1
thread I get ~440Kops/sec.  Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads.  Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.

Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba

Differential Revision: https://reviews.facebook.net/D50589
2015-12-25 11:03:40 -08:00
Shu Zhang b4aa823661 format 2015-12-24 20:38:35 -08:00
Shu Zhang 4dfdd1d928 format 2015-12-24 20:32:29 -08:00
Siying Dong 298ba27ae2 Merge pull request #846 from yuslepukhin/enble_c4244_lossofdata
Enable MS compiler warning c4244.
2015-12-23 22:59:42 -08:00
Siying Dong 7810aa802a Merge pull request #899 from zhipeng-jia/fix_clang_warning
Fix clang warnings
2015-12-23 22:58:52 -08:00
Zhipeng Jia ec2664fefd Fix clang compile error under Linux 2015-12-24 12:41:40 +08:00
Shu Zhang 4fd23fb130 add a factory method for creating hdfs env 2015-12-23 17:26:50 -08:00
sdong 15b8902264 Change default options.delayed_write_rate
Summary: We now have a mechanism to further slowdown writes. Double default options.delayed_write_rate to try to keep the default behavior closer to it used to be.

Test Plan: Run all tests.

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: yhchiang, kradhakrishnan, rven, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52281
2015-12-23 14:51:55 -08:00
sdong b9f77ba12b When slowdown is triggered, reduce the write rate
Summary: It's usually hard for users to set a value of options.delayed_write_rate. With this diff, after slowdown condition triggers, we greedily reduce write rate if estimated pending compaction bytes increase. If estimated compaction pending bytes drop, we increase the write rate.

Test Plan:
Add a unit test
Test with db_bench setting:
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -num=10000000 --soft_pending_compaction_bytes_limit=1000000000 --hard_pending_compaction_bytes_limit=3000000000 --delayed_write_rate=100000000

and make sure without the commit, write stop will happen, but with the commit, it will not happen.

Reviewers: igor, anthony, rven, yhchiang, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52131
2015-12-23 11:33:15 -08:00
Igor Canadi 8ac7fb8377 Merge pull request #863 from zhangyybuaa/fix_hdfs_error
Fix build error with hdfs
2015-12-22 09:27:51 +01:00
sdong 167fb919a5 ZSTD to use CompressionOptions.level
Summary: Now ZSTD hard code level 1. Change it to use the compression level setting.

Test Plan: Run it with hacked codes of sst_dump and show ZSTD compression sizes with different levels.

Reviewers: rven, anthony, yhchiang, kradhakrishnan, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52041
2015-12-16 16:58:04 -08:00
Islam AbdelRahman aececc209e Introduce ReadOptions::pin_data (support zero copy for keys)
Summary:
This patch update the Iterator API to introduce new functions that allow users to keep the Slices returned by key() valid as long as the Iterator is not deleted

ReadOptions::pin_data : If true keep loaded blocks in memory as long as the iterator is not deleted
Iterator::IsKeyPinned() : If true, this mean that the Slice returned by key() is valid as long as the iterator is not deleted

Also add a new option BlockBasedTableOptions::use_delta_encoding to allow users to disable delta_encoding if needed.

Benchmark results (using https://phabricator.fb.com/P20083553)

```
// $ du -h /home/tec/local/normal.4K.Snappy/db10077
// 6.1G    /home/tec/local/normal.4K.Snappy/db10077

// $ du -h /home/tec/local/zero.8K.LZ4/db10077
// 6.4G    /home/tec/local/zero.8K.LZ4/db10077

// Benchmarks for shard db10077
// _build/opt/rocks/benchmark/rocks_copy_benchmark \
//      --normal_db_path="/home/tec/local/normal.4K.Snappy/db10077" \
//      --zero_db_path="/home/tec/local/zero.8K.LZ4/db10077"

// First run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp          relative  time/iter  iters/s
// ============================================================================
// BM_StringCopy                                                 1.73s  576.97m
// BM_StringPiece                                   103.74%      1.67s  598.55m
// ============================================================================
// Match rate : 1000000 / 1000000

// Second run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp          relative  time/iter  iters/s
// ============================================================================
// BM_StringCopy                                              611.99ms     1.63
// BM_StringPiece                                   203.76%   300.35ms     3.33
// ============================================================================
// Match rate : 1000000 / 1000000
```

Test Plan: Unit tests

Reviewers: sdong, igor, anthony, yhchiang, rven

Reviewed By: rven

Subscribers: dhruba, lovro, adsharma

Differential Revision: https://reviews.facebook.net/D48999
2015-12-16 12:08:30 -08:00
Venkatesh Radhakrishnan 030215bf01 Running manual compactions in parallel with other automatic or manual compactions in restricted cases
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.

Test Plan: Rocksdb regression + new tests + valgrind

Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong

Reviewed By: sdong

Subscribers: yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47973
2015-12-14 11:20:34 -08:00
Dmitri Smirnov aca403d2b5 Fix another rebase problems. 2015-12-11 17:33:40 -08:00
Dmitri Smirnov 236fe21c92 Enable MS compiler warning c4244.
Mostly due to the fact that there are differences in sizes of int,long
  on 64 bit systems vs GNU.
2015-12-11 16:47:34 -08:00
Yueh-Hsuan Chiang 00d6edf6a0 Ensure the destruction order of PosixEnv and ThreadLocalPtr
Summary:
By default, RocksDB initializes the singletons of ThreadLocalPtr first, then initializes PosixEnv
via static initializer.  Destructor terminates objects in reverse order, so terminating PosixEnv
(calling pthread_mutex_lock), then ThreadLocal (calling pthread_mutex_destroy).

However, in certain case, application might initialize PosixEnv first, then ThreadLocalPtr.
This will cause core dump at the end of the program (eg. https://github.com/facebook/mysql-5.6/issues/122)

This patch fix this issue by ensuring the destruction order by moving the global static singletons
to function static singletons.  Since function static singletons are initialized when the function is first
called, this property allows us invoke to enforce the construction of the static PosixEnv and the
singletons of ThreadLocalPtr by calling the function where the ThreadLocalPtr singletons belongs
right before we initialize the static PosixEnv.

Test Plan: Verified in the MyRocks.

Reviewers: yoshinorim, IslamAbdelRahman, rven, kradhakrishnan, anthony, sdong, MarkCallaghan

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51789
2015-12-11 00:21:58 -08:00
charsyam c30b499541 fix typos in comments 2015-12-11 01:54:48 +09:00
sdong 56e77f0967 Deprecate options.soft_rate_limit and add options.soft_pending_compaction_bytes_limit
Summary: Deprecate options.soft_rate_limit, which is hard to tune, with options.soft_pending_compaction_bytes_limit, which would trigger the slowdown if estimated pending compaction bytes exceeds the threshold. The hope is to make it more striaght-forward to tune.

Test Plan: Modify DBTest.SoftLimit to cover options.soft_pending_compaction_bytes_limit instead; run all unit tests.

Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, igor, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51117
2015-12-09 18:22:45 -08:00
sdong d6e1035a1f A new compaction picking priority that optimizes for write amplification for random updates.
Summary: Introduce a compaction picking priority that picks files who contains the oldest rows to compact. This is a mode that slightly improves write amplification for random update cases.

Test Plan: Add a unit test and run it in valgrind too.

Reviewers: yhchiang, anthony, IslamAbdelRahman, rven, kradhakrishnan, MarkCallaghan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51459
2015-12-09 18:13:03 -08:00
yuslepukhin 49957f9a98 Prefer integer arithmetics
The code had conversion to double then casting to size_t
  and then casting uint32_t which caused compiler warning (VS15).
2015-12-09 14:06:23 -08:00
Siying Dong 9c227923c6 Merge pull request #788 from OpenChannelSSD/to_fb_master2
Move posix threads into a library
2015-12-08 18:06:38 -08:00
Siying Dong fa3dbf203f Merge pull request #853 from Vaisman/enable_C4267_warning
Enable C4267 warning
2015-12-08 17:59:24 -08:00
Siying Dong 56bbecc316 Merge pull request #867 from SherlockNoMad/CacheFix
Replace malloc with new for LRU Cache Handle
2015-12-08 17:58:29 -08:00
Yueh-Hsuan Chiang 774b80e99e Resubmit the fix for a race condition in persisting options
Summary:
This patch fix a race condition in persisting options which will cause a crash when:

* Thread A obtain cf options and start to persist options based on that cf options.
* Thread B kicks in and finish DropColumnFamily and delete cf_handle.
* Thread A wakes up and tries to finish the persisting options and crashes.

Test Plan: Add a test in column_family_test that can reproduce the crash

Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51717
2015-12-08 17:01:02 -08:00