rocksdb

Commit Graph

Author	SHA1	Message	Date
David Bernard	d78c6b28c4	Changes for build on solaris. Makefile: adjust paths for solaris build. Makefile: enable _GLIBCXX_USE_C99 so that std::to_string is available. db_compaction_test.cc: initialise a variable to avoid a compilation error. db_impl.cc: include <alloca.h>. db_test.cc: include <alloca.h>. Environment.java: recognise solaris environment. options_builder.cc: make log unambiguous. geodb_impl.cc: make log and floor unambiguous.	2016-01-19 04:45:21 +00:00
Gunnar Kudrjavets	aec10f734b	Guard falloc.h inclusion to avoid build breaks Summary: Depending on the order of include paths and versions of various headers we may end up in a situation where we'll encounter a build break caused by redefinition of constants. gcc-4.9-glibc-2.20 header update to include/bits/fcntl-linux.h introduced the definitions of FALLOC_FL_* constants. However, linux/falloc.h from kernel-headers also has FALLOC_FL_* constants defined. Therefore during the compilation we'll get "previously defined" errors. Test Plan: Both in the environment where the build break manifests (to make sure that the change fixed the problem) and in the environment where everything builds fine (to make sure that there are no regressions): make clean make -j 32 Reviewers: sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D52821	2016-01-14 22:47:15 -08:00
Dmitri Smirnov	ac50fd3a71	Align statistics Use Yield macro to make it a little more portable between platforms.	2016-01-13 14:53:23 -08:00
Igor Canadi	48a8667c30	Merge pull request #929 from warrenfalk/fix32 fix a compile error on 32-bit (fixes #634)	2016-01-12 11:02:36 -08:00
sdong	9a8e3f73ed	plain table reader: non-mmap mode to keep two recent buffers Summary: In plain table reader's non-mmap mode, we only keep the most recent read buffer. However, for binary search, it is likely we come back to a location to read. To avoid one pread in such a case, we keep two read buffers. It should cover most of the cases. Test Plan: 1. run tests 2. check the optimization works through strace when running ./table_reader_bench -mmap_read=false --num_keys2=1 -num_keys1=5000 -table_factory=plain_table --iterator --through_db Reviewers: anthony, rven, kradhakrishnan, igor, yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D51171	2016-01-08 10:53:57 -08:00
Warren Falk	94d9df2482	fix an unused function compiler warning in crc32c in 32-bit mode	2016-01-07 13:27:20 -05:00
Warren Falk	2f01e10fa9	use static_cast in crc32c instead of c-style cast	2016-01-07 13:22:09 -05:00
Warren Falk	601f1306a1	fix shorten-64-to-32 warning in crc32c	2016-01-07 13:12:15 -05:00
Warren Falk	55b37efa15	fix a compile error on 32-bit	2016-01-07 11:51:52 -05:00
sdong	c9e2490bc6	Fix DynamicBloomTest.concurrent_with_perf to pass TSAN Summary: TSAN fails on DynamicBloomTest.concurrent_with_perf. This change fixes it. Not sure why though. Test Plan: Run the test with TSAN and make sure no warning shown. Reviewers: yhchiang, IslamAbdelRahman, anthony, ngbronson, rven Reviewed By: rven Subscribers: rven, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D52383	2015-12-29 16:28:45 -08:00
sdong	edf1cd497f	Not generating "__attribute__((__unused__))" for padding fields if it is not CLANG Summary: Adding "__attribute__((__unused__))" after padding fields will pass CLANG build but will fail gcc 4.8.1. Fix it by not generating it under GCC 4.8.1. Test Plan: Build under four combinations of USE_CLANG=0,1 and ROCKSDB_FBCODE_BUILD_WITH_481=0.1. Reviewers: yhchiang, rven, ngbronson, anthony, IslamAbdelRahman Reviewed By: anthony Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D52371	2015-12-28 18:37:23 -08:00
sdong	11672df19a	Fix CLANG errors introduced by `7d87f02799` Summary: Fix some CLANG errors introduced in `7d87f02799` Test Plan: Build with both of CLANG and gcc Reviewers: rven, yhchiang, kradhakrishnan, anthony, IslamAbdelRahman, ngbronson Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D52329	2015-12-28 10:00:58 -08:00
Siying Dong	7fafd52dce	Merge pull request #900 from shuzhang1989/hdfs_env_fix add a factory method for creating hdfs env	2015-12-28 09:28:04 -08:00
Shu Zhang	2b7c810db8	more format	2015-12-26 19:52:35 -08:00
Shu Zhang	b79ccbd573	indent	2015-12-26 19:50:28 -08:00
Nathan Bronson	7d87f02799	support for concurrent adds to memtable Summary: This diff adds support for concurrent adds to the skiplist memtable implementations. Memory allocation is made thread-safe by the addition of a spinlock, with small per-core buffers to avoid contention. Concurrent memtable writes are made via an additional method and don't impose a performance overhead on the non-concurrent case, so parallelism can be selected on a per-batch basis. Write thread synchronization is an increasing bottleneck for higher levels of concurrency, so this diff adds --enable_write_thread_adaptive_yield (default off). This feature causes threads joining a write batch group to spin for a short time (default 100 usec) using sched_yield, rather than going to sleep on a mutex. If the timing of the yield calls indicates that another thread has actually run during the yield then spinning is avoided. This option improves performance for concurrent situations even without parallel adds, although it has the potential to increase CPU usage (and the heuristic adaptation is not yet mature). Parallel writes are not currently compatible with inplace updates, update callbacks, or delete filtering. Enable it with --allow_concurrent_memtable_write (and --enable_write_thread_adaptive_yield). Parallel memtable writes are performance neutral when there is no actual parallelism, and in my experiments (SSD server-class Linux and varying contention and key sizes for fillrandom) they are always a performance win when there is more than one thread. Statistics are updated earlier in the write path, dropping the number of DB mutex acquisitions from 2 to 1 for almost all cases. This diff was motivated and inspired by Yahoo's cLSM work. 
It is more conservative than cLSM: RocksDB's write batch group leader role is preserved (along with all of the existing flush and write throttling logic) and concurrent writers are blocked until all memtable insertions have completed and the sequence number has been advanced, to preserve linearizability. My test config is "db_bench -benchmarks=fillrandom -threads=$T -batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000 --block_size=16384 --allow_concurrent_memtable_write" on a two-socket Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive. With 1 thread I get ~440Kops/sec. Peak performance for 1 socket (numactl -N1) is slightly more than 1Mops/sec, at 16 threads. Peak performance across both sockets happens at 30 threads, and is ~900Kops/sec, although with fewer threads there is less performance loss when the system has background work. Test Plan: 1. concurrent stress tests for InlineSkipList and DynamicBloom 2. make clean; make check 3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench 4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench 5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench 6. make clean; OPT=-DROCKSDB_LITE make check 7. verify no perf regressions when disabled Reviewers: igor, sdong Reviewed By: sdong Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba Differential Revision: https://reviews.facebook.net/D50589	2015-12-25 11:03:40 -08:00
Shu Zhang	b4aa823661	format	2015-12-24 20:38:35 -08:00
Shu Zhang	4dfdd1d928	format	2015-12-24 20:32:29 -08:00
Siying Dong	298ba27ae2	Merge pull request #846 from yuslepukhin/enble_c4244_lossofdata Enable MS compiler warning c4244.	2015-12-23 22:59:42 -08:00
Siying Dong	7810aa802a	Merge pull request #899 from zhipeng-jia/fix_clang_warning Fix clang warnings	2015-12-23 22:58:52 -08:00
Zhipeng Jia	ec2664fefd	Fix clang compile error under Linux	2015-12-24 12:41:40 +08:00
Shu Zhang	4fd23fb130	add a factory method for creating hdfs env	2015-12-23 17:26:50 -08:00
sdong	15b8902264	Change default options.delayed_write_rate Summary: We now have a mechanism to further slowdown writes. Double default options.delayed_write_rate to try to keep the default behavior closer to it used to be. Test Plan: Run all tests. Reviewers: IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: yhchiang, kradhakrishnan, rven, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D52281	2015-12-23 14:51:55 -08:00
sdong	b9f77ba12b	When slowdown is triggered, reduce the write rate Summary: It's usually hard for users to set a value of options.delayed_write_rate. With this diff, after slowdown condition triggers, we greedily reduce write rate if estimated pending compaction bytes increase. If estimated compaction pending bytes drop, we increase the write rate. Test Plan: Add a unit test Test with db_bench setting: TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -num=10000000 --soft_pending_compaction_bytes_limit=1000000000 --hard_pending_compaction_bytes_limit=3000000000 --delayed_write_rate=100000000 and make sure without the commit, write stop will happen, but with the commit, it will not happen. Reviewers: igor, anthony, rven, yhchiang, kradhakrishnan, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D52131	2015-12-23 11:33:15 -08:00
Igor Canadi	8ac7fb8377	Merge pull request #863 from zhangyybuaa/fix_hdfs_error Fix build error with hdfs	2015-12-22 09:27:51 +01:00
sdong	167fb919a5	ZSTD to use CompressionOptions.level Summary: Now ZSTD hard code level 1. Change it to use the compression level setting. Test Plan: Run it with hacked codes of sst_dump and show ZSTD compression sizes with different levels. Reviewers: rven, anthony, yhchiang, kradhakrishnan, igor, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: yoshinorim, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D52041	2015-12-16 16:58:04 -08:00
Islam AbdelRahman	aececc209e	Introduce ReadOptions::pin_data (support zero copy for keys) Summary: This patch update the Iterator API to introduce new functions that allow users to keep the Slices returned by key() valid as long as the Iterator is not deleted ReadOptions::pin_data : If true keep loaded blocks in memory as long as the iterator is not deleted Iterator::IsKeyPinned() : If true, this mean that the Slice returned by key() is valid as long as the iterator is not deleted Also add a new option BlockBasedTableOptions::use_delta_encoding to allow users to disable delta_encoding if needed. Benchmark results (using https://phabricator.fb.com/P20083553) ``` // $ du -h /home/tec/local/normal.4K.Snappy/db10077 // 6.1G /home/tec/local/normal.4K.Snappy/db10077 // $ du -h /home/tec/local/zero.8K.LZ4/db10077 // 6.4G /home/tec/local/zero.8K.LZ4/db10077 // Benchmarks for shard db10077 // _build/opt/rocks/benchmark/rocks_copy_benchmark \ // --normal_db_path="/home/tec/local/normal.4K.Snappy/db10077" \ // --zero_db_path="/home/tec/local/zero.8K.LZ4/db10077" // First run // ============================================================================ // rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s // ============================================================================ // BM_StringCopy 1.73s 576.97m // BM_StringPiece 103.74% 1.67s 598.55m // ============================================================================ // Match rate : 1000000 / 1000000 // Second run // ============================================================================ // rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s // ============================================================================ // BM_StringCopy 611.99ms 1.63 // BM_StringPiece 203.76% 300.35ms 3.33 // ============================================================================ // Match rate : 1000000 / 1000000 ``` Test Plan: Unit tests Reviewers: sdong, igor, anthony, yhchiang, rven 
Reviewed By: rven Subscribers: dhruba, lovro, adsharma Differential Revision: https://reviews.facebook.net/D48999	2015-12-16 12:08:30 -08:00
Venkatesh Radhakrishnan	030215bf01	Running manual compactions in parallel with other automatic or manual compactions in restricted cases Summary: This diff provides a framework for doing manual compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to BackgroundCompactions, so that RunManualCompactions can be reentrant. Parallelism is controlled by the two routines ConflictingManualCompaction to allow/disallow new parallel/manual compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs. I will be adding more tests later. Test Plan: Rocksdb regression + new tests + valgrind Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong Reviewed By: sdong Subscribers: yoshinorim, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D47973	2015-12-14 11:20:34 -08:00
Dmitri Smirnov	aca403d2b5	Fix another rebase problem.	2015-12-11 17:33:40 -08:00
Dmitri Smirnov	236fe21c92	Enable MS compiler warning c4244. Mostly due to the fact that there are differences in sizes of int,long on 64 bit systems vs GNU.	2015-12-11 16:47:34 -08:00
Yueh-Hsuan Chiang	00d6edf6a0	Ensure the destruction order of PosixEnv and ThreadLocalPtr Summary: By default, RocksDB initializes the singletons of ThreadLocalPtr first, then initializes PosixEnv via static initializer. Destructor terminates objects in reverse order, so terminating PosixEnv (calling pthread_mutex_lock), then ThreadLocal (calling pthread_mutex_destroy). However, in certain case, application might initialize PosixEnv first, then ThreadLocalPtr. This will cause core dump at the end of the program (eg. https://github.com/facebook/mysql-5.6/issues/122) This patch fix this issue by ensuring the destruction order by moving the global static singletons to function static singletons. Since function static singletons are initialized when the function is first called, this property allows us invoke to enforce the construction of the static PosixEnv and the singletons of ThreadLocalPtr by calling the function where the ThreadLocalPtr singletons belongs right before we initialize the static PosixEnv. Test Plan: Verified in the MyRocks. Reviewers: yoshinorim, IslamAbdelRahman, rven, kradhakrishnan, anthony, sdong, MarkCallaghan Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D51789	2015-12-11 00:21:58 -08:00
charsyam	c30b499541	fix typos in comments	2015-12-11 01:54:48 +09:00
sdong	56e77f0967	Deprecate options.soft_rate_limit and add options.soft_pending_compaction_bytes_limit Summary: Deprecate options.soft_rate_limit, which is hard to tune, in favor of options.soft_pending_compaction_bytes_limit, which would trigger the slowdown if estimated pending compaction bytes exceed the threshold. The hope is to make it more straightforward to tune. Test Plan: Modify DBTest.SoftLimit to cover options.soft_pending_compaction_bytes_limit instead; run all unit tests. Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, igor, anthony Reviewed By: anthony Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D51117	2015-12-09 18:22:45 -08:00
sdong	d6e1035a1f	A new compaction picking priority that optimizes for write amplification for random updates. Summary: Introduce a compaction picking priority that picks files who contains the oldest rows to compact. This is a mode that slightly improves write amplification for random update cases. Test Plan: Add a unit test and run it in valgrind too. Reviewers: yhchiang, anthony, IslamAbdelRahman, rven, kradhakrishnan, MarkCallaghan, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D51459	2015-12-09 18:13:03 -08:00
yuslepukhin	49957f9a98	Prefer integer arithmetics The code had conversion to double then casting to size_t and then casting uint32_t which caused compiler warning (VS15).	2015-12-09 14:06:23 -08:00
Siying Dong	9c227923c6	Merge pull request #788 from OpenChannelSSD/to_fb_master2 Move posix threads into a library	2015-12-08 18:06:38 -08:00
Siying Dong	fa3dbf203f	Merge pull request #853 from Vaisman/enable_C4267_warning Enable C4267 warning	2015-12-08 17:59:24 -08:00
Siying Dong	56bbecc316	Merge pull request #867 from SherlockNoMad/CacheFix Replace malloc with new for LRU Cache Handle	2015-12-08 17:58:29 -08:00
Yueh-Hsuan Chiang	774b80e99e	Resubmit the fix for a race condition in persisting options Summary: This patch fix a race condition in persisting options which will cause a crash when: * Thread A obtain cf options and start to persist options based on that cf options. * Thread B kicks in and finish DropColumnFamily and delete cf_handle. * Thread A wakes up and tries to finish the persisting options and crashes. Test Plan: Add a test in column_family_test that can reproduce the crash Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D51717	2015-12-08 17:01:02 -08:00
sdong	ea11923550	Upgrade to ZSTD 0.4.2 Summary: Change to call the new compression function. Test Plan: build and run db_bench with the compression to make sure it compresses. Reviewers: anthony, rven, kradhakrishnan, IslamAbdelRahman, igor, yhchiang Reviewed By: yhchiang Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D51603	2015-12-08 16:33:26 -08:00
sdong	770dea9325	Fix occasional failure of DBTest.DynamicCompactionOptions Summary: DBTest.DynamicCompactionOptions occasionally fails during valgrind run. We sent a sleeping task to block compaction thread pool but we don't wait for it to run. Test Plan: Run the test multiple times in an environment which can cause failure. Reviewers: rven, kradhakrishnan, igor, IslamAbdelRahman, anthony, yhchiang Reviewed By: yhchiang Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D51687	2015-12-07 18:38:39 -08:00
sdong	f307036bde	Revert "Fix a race condition in persisting options" This reverts commit `2fa3ed5180`. It breaks RocksDB lite build	2015-12-07 17:09:12 -08:00
Yueh-Hsuan Chiang	2fa3ed5180	Fix a race condition in persisting options Summary: This patch fix a race condition in persisting options which will cause a crash when: * Thread A obtain cf options and start to persist options based on that cf options. * Thread B kicks in and finish DropColumnFamily and delete cf_handle. * Thread A wakes up and tries to finish the persisting options and crashes. Test Plan: Add a test in column_family_test that can reproduce the crash Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D51609	2015-12-07 15:25:12 -08:00
Javier González	b2863017b1	Move posix threads into a library Summary: This patch moves all posix thread logic to a separate library. The motivation is to allow other environments to easily reuse posix threads. HDFS already wraps posix threads; this split would simplify this code. Test Plan: No new functionality is added to posix Env or the threading library, thus the current tests should suffice.	2015-12-07 12:03:38 +01:00
SherlockNoMad	3a98a7ae7f	Replace malloc with new for LRU Cache Handle	2015-12-04 15:12:07 -08:00
Zhang Yangyang	4687ced5db	fix ToString() not declared error	2015-12-02 21:45:28 +08:00
sdong	d27ea4c9e5	Initialize options.row_cache Summary: options.row_cache should already been initialized as null by default. Still try to set it following current convention, because one valgrind failure reports a failure related to it. Test Plan: Run all unit tests Reviewers: yhchiang, kradhakrishnan, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D51303	2015-11-30 10:30:35 -08:00
Nathan Bronson	9a9d4759b2	InlineSkipList part 3/3 - new skiplist type that colocates key and node Summary: This diff completes the creation of InlineSkipList<Cmp>, which is like SkipList<const char*, Cmp> but it always allocates the key contiguously with the node. This allows us to remove the pointer from the node to the key. As a result the memory usage of the skip list is reduced (by 1 to sizeof(void*) bytes depending on the padding required to align the key storage), cache locality is improved, and we halve the number of calls to the allocator. For skip lists whose keys are freshly-allocated const char*, InlineSkipList is strictly preferable to SkipList. This diff doesn't replace SkipList, however, because some of the use cases of SkipList in RocksDB are either character sequences that are not allocated at the same time as the skip list node allocation (for example hash_linklist_rep) or have different key types (for example write_batch_with_index). Taking advantage of inline allocation for those cases is left to future work. The perf win is biggest for small values. For single-threaded CPU-bound (32M fillrandom operations with no WAL log) with 16 byte keys and 0 byte values, the db_bench perf goes from ~310k ops/sec to ~410k ops/sec. For large values the improvement is less pronounced, but seems to be between 5% and 10% on the same configuration. Test Plan: make check Reviewers: igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D51123	2015-11-24 15:16:02 -08:00
Vasili Svirski	41b32c6059	Enable C4267 warning * conversion from 'size_t' to 'type', by add static_cast Tested: * by build solution on Windows, Linux locally, * run tests * build CI system successful	2015-11-24 16:33:09 +03:00
yuslepukhin	047bd22aae	Build on Visual Studio 2015 Update 1	2015-11-20 15:31:47 -08:00


1139 Commits