rocksdb

mirror of https://github.com/facebook/rocksdb.git synced 2024-11-26 07:30:54 +00:00

Author	SHA1	Message	Date
sdong	6ee38bb15c	Slowdown of writing to the last memtable should not override stopping Summary: Now slowing down for the last mem table takes priority against some stopping conditions. This is logically confusing. Fix it. Test Plan: Run all existing tests. Reviewers: yhchiang, IslamAbdelRahman, kradhakrishnan, andrewkr, anthony Reviewed By: anthony Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D53529	2016-01-28 21:52:29 -08:00
Yueh-Hsuan Chiang	6935eb24e0	Add ColumnFamilyHandle::GetDescriptor() Summary: This patch addes ColumnFamilyHandle::GetDescriptor(), which allows developers to obtain the CF options and names of the associated column family given its handle. // Returns the up-to-date descriptor used by the current handle. Since it // returns the up-to-date information, this call might internally locks // and releases DB mutex to access the up-to-date CF options. virtual ColumnFamilyDescriptor GetDescriptor() = 0; Test Plan: augment column_family_test Reviewers: sdong, yoshinorim, IslamAbdelRahman, rven, kradhakrishnan, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D51543	2016-01-06 18:14:01 -08:00
Siying Dong	22c0ed8a5f	Disable Visual Studio Warning C4351 Currently Windows build is broken because of Warning C4351. Disable the warning before figuring out the right way to fix it.	2015-12-28 15:06:34 -08:00
Nathan Bronson	7d87f02799	support for concurrent adds to memtable Summary: This diff adds support for concurrent adds to the skiplist memtable implementations. Memory allocation is made thread-safe by the addition of a spinlock, with small per-core buffers to avoid contention. Concurrent memtable writes are made via an additional method and don't impose a performance overhead on the non-concurrent case, so parallelism can be selected on a per-batch basis. Write thread synchronization is an increasing bottleneck for higher levels of concurrency, so this diff adds --enable_write_thread_adaptive_yield (default off). This feature causes threads joining a write batch group to spin for a short time (default 100 usec) using sched_yield, rather than going to sleep on a mutex. If the timing of the yield calls indicates that another thread has actually run during the yield then spinning is avoided. This option improves performance for concurrent situations even without parallel adds, although it has the potential to increase CPU usage (and the heuristic adaptation is not yet mature). Parallel writes are not currently compatible with inplace updates, update callbacks, or delete filtering. Enable it with --allow_concurrent_memtable_write (and --enable_write_thread_adaptive_yield). Parallel memtable writes are performance neutral when there is no actual parallelism, and in my experiments (SSD server-class Linux and varying contention and key sizes for fillrandom) they are always a performance win when there is more than one thread. Statistics are updated earlier in the write path, dropping the number of DB mutex acquisitions from 2 to 1 for almost all cases. This diff was motivated and inspired by Yahoo's cLSM work. It is more conservative than cLSM: RocksDB's write batch group leader role is preserved (along with all of the existing flush and write throttling logic) and concurrent writers are blocked until all memtable insertions have completed and the sequence number has been advanced, to preserve linearizability. My test config is "db_bench -benchmarks=fillrandom -threads=$T -batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000 --block_size=16384 --allow_concurrent_memtable_write" on a two-socket Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive. With 1 thread I get ~440Kops/sec. Peak performance for 1 socket (numactl -N1) is slightly more than 1Mops/sec, at 16 threads. Peak performance across both sockets happens at 30 threads, and is ~900Kops/sec, although with fewer threads there is less performance loss when the system has background work. Test Plan: 1. concurrent stress tests for InlineSkipList and DynamicBloom 2. make clean; make check 3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench 4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench 5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench 6. make clean; OPT=-DROCKSDB_LITE make check 7. verify no perf regressions when disabled Reviewers: igor, sdong Reviewed By: sdong Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba Differential Revision: https://reviews.facebook.net/D50589	2015-12-25 11:03:40 -08:00
sdong	b9f77ba12b	When slowdown is triggered, reduce the write rate Summary: It's usually hard for users to set a value of options.delayed_write_rate. With this diff, after slowdown condition triggers, we greedily reduce write rate if estimated pending compaction bytes increase. If estimated compaction pending bytes drop, we increase the write rate. Test Plan: Add a unit test Test with db_bench setting: TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -num=10000000 --soft_pending_compaction_bytes_limit=1000000000 --hard_pending_compaction_bytes_limit=3000000000 --delayed_write_rate=100000000 and make sure without the commit, write stop will happen, but with the commit, it will not happen. Reviewers: igor, anthony, rven, yhchiang, kradhakrishnan, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D52131	2015-12-23 11:33:15 -08:00
sdong	d72b31774e	Slowdown when writing to the last write buffer Summary: Now if inserting to mem table is much faster than writing to files, there is no mechanism users can rely on to avoid stopping for reaching options.max_write_buffer_number. With the commit, if there are more than four maximum write buffers configured, we slow down to the rate of options.delayed_write_rate while we reach the last one. Test Plan: 1. Add a new unit test. 2. Run db_bench with ./db_bench --benchmarks=fillrandom --num=10000000 --max_background_flushes=6 --batch_size=32 -max_write_buffer_number=4 --delayed_write_rate=500000 --statistics based on hard drive and see stopping is avoided with the commit. Reviewers: yhchiang, IslamAbdelRahman, anthony, rven, kradhakrishnan, igor Reviewed By: igor Subscribers: MarkCallaghan, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D52047	2015-12-17 10:49:08 -08:00
Venkatesh Radhakrishnan	030215bf01	Running manual compactions in parallel with other automatic or manual compactions in restricted cases Summary: This diff provides a framework for doing manual compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to BackgroundCompactions, so that RunManualCompactions can be reentrant. Parallelism is controlled by the two routines ConflictingManualCompaction to allow/disallow new parallel/manual compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs. I will be adding more tests later. Test Plan: Rocksdb regression + new tests + valgrind Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong Reviewed By: sdong Subscribers: yoshinorim, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D47973	2015-12-14 11:20:34 -08:00
sdong	56e77f0967	Deprecate options.soft_rate_limit and add options.soft_pending_compaction_bytes_limit Summary: Deprecate options.soft_rate_limit, which is hard to tune, with options.soft_pending_compaction_bytes_limit, which would trigger the slowdown if estimated pending compaction bytes exceeds the threshold. The hope is to make it more striaght-forward to tune. Test Plan: Modify DBTest.SoftLimit to cover options.soft_pending_compaction_bytes_limit instead; run all unit tests. Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, igor, anthony Reviewed By: anthony Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D51117	2015-12-09 18:22:45 -08:00
sdong	dac5b248b1	UniversalCompactionPicker::PickCompaction(): avoid to form compactions if there is no file Summary: Currently RocksDB may break in lines like this: for (size_t i = sorted_runs.size() - 1; i >= first_index_after; i--) { if options.level0_file_num_compaction_trigger=0. Fix it by not executing the logic of picking compactions if there is no file (sorted_runs.size() = 0). Also internally set options.level0_file_num_compaction_trigger=1 if users give a 0. 0 is a value makes no sense in RocksDB. Test Plan: Run all tests. Will add a unit test too. Reviewers: yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, rven Reviewed By: rven Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D50727	2015-11-16 10:32:45 -08:00
Venkatesh Radhakrishnan	a98fbacfa0	Moving memtable related files from util to a new directory memtable Summary: We are cleaning up dependencies. This diff takes a first step at moving memtable files to their own directory called memtable. In future diffs, we will move other memtable files from db to memtable. Test Plan: make check Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D48915	2015-10-16 14:10:33 -07:00
sdong	5de807ac16	Add options.hard_pending_compaction_bytes_limit to stop writes if compaction lagging behind Summary: Add an option to stop writes if compaction lefts behind. If estimated pending compaction bytes is more than threshold specified by options.hard_pending_compaction_bytes_liimt, writes will stop until compactions are cleared to under the threshold. Test Plan: Add unit test DBTest.HardLimit Reviewers: rven, kradhakrishnan, anthony, IslamAbdelRahman, yhchiang, igor Reviewed By: igor Subscribers: MarkCallaghan, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D45999	2015-09-14 12:51:16 -07:00
Ari Ekmekji	03ddce9a01	Add counters for L0 stall while L0-L1 compaction is taking place Summary: Although there are currently counters to keep track of the stall caused by having too many L0 files, there is no distinction as to whether when that stall occurs either (A) L0-L1 compaction is taking place to try and mitigate it, or (B) no L0-L1 compaction has been scheduled at the moment. This diff adds a counter for (A) so that the nature of L0 stalls can be better understood. Test Plan: make all && make check Reviewers: sdong, igor, anthony, noetzli, yhchiang Reviewed By: yhchiang Subscribers: MarkCallaghan, dhruba Differential Revision: https://reviews.facebook.net/D46749	2015-09-14 11:03:37 -07:00
agiardullo	b5b2b75e52	better tuning of arena block size Summary: Currently, if users didn't set options.arena_block_size, we set "result.arena_block_size = result.write_buffer_size / 10". It makes result.arena_block_size not a multiplier of 4KB, even if options.write_buffer_size is a multiplier of MBs. When calling malloc to arena_block_size, we may waste a small amount of memory for it. We now make the default to be /8 or /16 and align it to 4KB. Test Plan: unit tests Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D46467	2015-09-08 20:53:32 -07:00
Islam AbdelRahman	027ca5b2cd	Total SST files size DB Property Summary: Add a new DB property that calculate the total size of files used by all RocksDB Versions Test Plan: Unittests for the new property Reviewers: igor, yhchiang, anthony, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D44799	2015-08-20 11:47:19 -07:00
Yueh-Hsuan Chiang	df79eafcb3	Introduce GetIntProperty("rocksdb.size-all-mem-tables") Summary: Currently, GetIntProperty("rocksdb.cur-size-all-mem-tables") only returns the memory usage by those memtables which have not yet been flushed. This patch introduces GetIntProperty("rocksdb.size-all-mem-tables"), which includes the memory usage by all the memtables, includes those have been flushed but pinned by iterators. Test Plan: Added a test in db_test Reviewers: igor, anthony, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D44229	2015-08-19 13:32:09 -07:00
Igor Canadi	35ca59364c	Don't let flushes preempt compactions Summary: When we first started, max_background_flushes was 0 by default and compaction thread was executing flushes (since there was no flush thread). Then, we switched the default max_background_flushes to 1. However, we still support the case where there is no flush thread and flushes are done in compaction. This is making our code a bit more complicated. By not supporting this use-case we can make our code simpler. We have a special case that when you set max_background_flushes to 0, we schedule the flush to execute on the compaction thread. Test Plan: make check (there might be some unit tests that depend on this behavior) Reviewers: IslamAbdelRahman, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D41931	2015-07-17 12:02:52 -07:00
Yueh-Hsuan Chiang	4ce5be4255	fixed leaking log::Writers Summary: Fixes valgrind errors in column_family_test. Test Plan: `make check`, `make valgrind_check` Reviewers: igor, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D41181	2015-07-07 12:10:10 -07:00
Igor Canadi	760e9a94de	Fail DB::Open() when the requested compression is not available Summary: Currently RocksDB silently ignores this issue and doesn't compress the data. Based on discussion, we agree that this is pretty bad because it can cause confusion for our users. This patch fails DB::Open() if we don't support the compression that is specified in the options. Test Plan: make check with LZ4 not present. If Snappy is not present all tests will just fail because Snappy is our default library. We should make Snappy the requirement, since without it our default DB::Open() fails. Reviewers: sdong, MarkCallaghan, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39687	2015-06-18 14:55:05 -07:00
Igor Canadi	4b8bb62f0a	Don't dump DBOptions for each column family Summary: Currently we dump DBOptions for each column family options we dump. This leads to duplicate lines in our LOG file. This diff fixes that. Test Plan: Check out the LOG Reviewers: sdong, rven, yhchiang Reviewed By: yhchiang Subscribers: IslamAbdelRahman, yoshinorim, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39729	2015-06-18 10:15:54 -07:00
sdong	7842920be5	Slow down writes by bytes written Summary: We slow down data into the database to the rate of options.delayed_write_rate (a new option) with this patch. The thread synchronization approach I take is to still synchronize write controller by DB mutex and GetDelay() is inside DB mutex. Try to minimize the frequency of getting time in GetDelay(). I verified it through db_bench and it seems to work hard_rate_limit is deprecated. options.delayed_write_rate is still not dynamically changeable. Need to work on it as a follow-up. Test Plan: Add new unit tests in db_test Reviewers: yhchiang, rven, kradhakrishnan, anthony, MarkCallaghan, igor Reviewed By: igor Subscribers: ikabiljo, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D36351	2015-06-11 20:42:18 -07:00
Yueh-Hsuan Chiang	fe5c6321cb	Allow EventListener::OnCompactionCompleted to return CompactionJobStats. Summary: Allow EventListener::OnCompactionCompleted to return CompactionJobStats, which contains useful information about a compaction. Example CompactionJobStats returned by OnCompactionCompleted(): smallest_output_key_prefix 05000000 largest_output_key_prefix 06990000 elapsed_time 42419 num_input_records 300 num_input_files 3 num_input_files_at_output_level 2 num_output_records 200 num_output_files 1 actual_bytes_input 167200 actual_bytes_output 110688 total_input_raw_key_bytes 5400 total_input_raw_value_bytes 300000 num_records_replaced 100 is_manual_compaction 1 Test Plan: Developed a mega test in db_test which covers 20 variables in CompactionJobStats. Reviewers: rven, igor, anthony, sdong Reviewed By: sdong Subscribers: tnovak, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38463	2015-06-02 17:07:16 -07:00
agiardullo	dc9d70de65	Optimistic Transactions Summary: Optimistic transactions supporting begin/commit/rollback semantics. Currently relies on checking the memtable to determine if there are any collisions at commit time. Not yet implemented would be a way of enuring the memtable has some minimum amount of history so that we won't fail to commit when the memtable is empty. You should probably start with transaction.h to get an overview of what is currently supported. Test Plan: Added a new test, but still need to look into stress testing. Reviewers: yhchiang, igor, rven, sdong Reviewed By: sdong Subscribers: adamretter, MarkCallaghan, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D33435	2015-05-29 14:36:35 -07:00
agiardullo	c815351038	Support saving history in memtable_list Summary: For transactions, we are using the memtables to validate that there are no write conflicts. But after flushing, we don't have any memtables, and transactions could fail to commit. So we want to someone keep around some extra history to use for conflict checking. In addition, we want to provide a way to increase the size of this history if too many transactions fail to commit. After chatting with people, it seems like everyone prefers just using Memtables to store this history (instead of a separate history structure). It seems like the best place for this is abstracted inside the memtable_list. I decide to create a separate list in MemtableListVersion as using the same list complicated the flush/installalflushresults logic too much. This diff adds a new parameter to control how much memtable history to keep around after flushing. However, it sounds like people aren't too fond of adding new parameters. So I am making the default size of flushed+not-flushed memtables be set to max_write_buffers. This should not change the maximum amount of memory used, but make it more likely we're using closer the the limit. (We are now postponing deleting flushed memtables until the max_write_buffer limit is reached). So while we might use more memory on average, we are still obeying the limit set (and you could argue it's better to go ahead and use up memory now instead of waiting for a write stall to happen to test this limit). However, if people are opposed to this default behavior, we can easily set it to 0 and require this parameter be set in order to use transactions. Test Plan: Added a xfunc test to play around with setting different values of this parameter in all tests. Added testing in memtablelist_test and planning on adding more testing here. Reviewers: sdong, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D37443	2015-05-28 16:34:24 -07:00
Yueh-Hsuan Chiang	672dda9b3b	[API Change] Move listeners from ColumnFamilyOptions to DBOptions Summary: Move listeners from ColumnFamilyOptions to DBOptions Test Plan: listener_test compact_files_test Reviewers: rven, anthony, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39087	2015-05-28 13:21:39 -07:00
Yueh-Hsuan Chiang	687214f878	Ensure ColumnFamilyOptions.num_levels >= 2 when level compaction is used. Summary: Ensure ColumnFamilyOptions.num_levels >= 2 when level compaction is used. Test Plan: Extend SanitizeOptions test in column_family_test Reviewers: sdong, rven, anthony, krishnanm86, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38829	2015-05-22 11:35:40 -07:00
sdong	6fa7085121	CompactRange skips levels 1 to base_level -1 for dynamic level base size Summary: CompactRange() now is much more expensive for dynamic level base size as it goes through all the levels. Skip those not used levels between level 0 an base level. Test Plan: Run all unit tests Reviewers: yhchiang, rven, anthony, kradhakrishnan, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D37125	2015-05-18 10:54:11 -07:00
Yueh-Hsuan Chiang	df1f87a882	Fixed compile error in db/column_family.cc Summary: Fixed the following compile error in db/column_family.cc db/column_family.cc:633:33: error: ‘ASSERT_GT’ was not declared in this scope 16:14:45 ASSERT_GT(listeners.size(), 0U); Test Plan: make db_test Reviewers: igor, sdong, rven Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38367	2015-05-12 16:20:03 -07:00
Yueh-Hsuan Chiang	14431e971d	Fixed a bug in EventListener::OnCompactionCompleted(). Summary: Fixed a bug in EventListener::OnCompactionCompleted() that returns incorrect list of input / output file names. Test Plan: Extend existing test in listener_test.cc Reviewers: sdong, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38349	2015-05-12 16:10:23 -07:00
clark.kang	6ede020dc4	fix typos	2015-04-25 18:14:27 +09:00
sdong	d01bbb53ae	Fix CompactRange for universal compaction with num_levels > 1 Summary: CompactRange for universal compaction with num_levels > 1 seems to have a bug. The unit test also has a bug so it doesn't capture the problem. Fix it. Revert the compact range to the logic equivalent to num_levels=1. Always compact all files together. It should also fix DBTest.IncreaseUniversalCompactionNumLevels. The issue was that options.write_buffer_size = 100 << 10 and options.write_buffer_size = 100 << 10 are not used in later test scenarios. So write_buffer_size of 4MB was used. The compaction trigger condition is not anymore obvious as expected. Test Plan: Run the new test and all test suites Reviewers: yhchiang, rven, kradhakrishnan, anthony, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D37551	2015-04-23 19:12:31 -07:00
Igor Canadi	5e067a7b19	Clean up compression logging Summary: Now we add warnings when user configures compression and the compression is not supported. Test Plan: Configured compression to non-supported values. Observed messages in my log: 2015/03/26-12:17:57.586341 7ffb8a496840 [WARN] Compression type chosen for level 2 is not supported: LZ4. RocksDB will not compress data on level 2. 2015/03/26-12:19:10.768045 7f36f15c5840 [WARN] Compression type chosen is not supported: LZ4. RocksDB will not compress data. Reviewers: rven, sdong, yhchiang Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35979	2015-04-06 12:50:44 -07:00
sdong	953a885ebf	A new call back to TablePropertiesCollector to allow users know the entry is add, delete or merge Summary: Currently users have no idea a key is add, delete or merge from TablePropertiesCollector call back. Add a new function to add it. Also refactor the codes so that (1) make table property collector and internal table property collector two separate data structures with the later one now exposed (2) table builders only receive internal table properties Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector. Reviewers: yhchiang, igor.sugak, rven, igor Reviewed By: rven, igor Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D35373	2015-04-06 10:27:21 -07:00
sdong	b23bbaa82a	Universal Compactions with Small Files Summary: With this change, we use L1 and up to store compaction outputs in universal compaction. The compaction pick logic stays the same. Outputs are stored in the largest "level" as possible. If options.num_levels=1, it behaves all the same as now. Test Plan: 1) convert most of existing unit tests for universal comapaction to include the option of one level and multiple levels. 2) add a unit test to cover parallel compaction in universal compaction and run it in one level and multiple levels 3) add unit test to migrate from multiple level setting back to one level setting 4) add a unit test to insert keys to trigger multiple rounds of compactions and verify results. Reviewers: rven, kradhakrishnan, yhchiang, igor Reviewed By: igor Subscribers: meyering, leveldb, MarkCallaghan, dhruba Differential Revision: https://reviews.facebook.net/D34539	2015-03-30 15:12:02 -07:00
Mark Callaghan	c8da670325	Stop printing per-level stall times. Summary: Per-level stall times are the suggested stall time, not the actual stall time so this change stops printing them both in the per-level output lines and in the summary. Also changed output for total stall time to include units in all cases. The new output looks like: Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) Stall(cnt) RecordIn RecordDrop ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 4/1 7 0.8 0.0 0.0 0.0 0.6 0.6 0.0 0.0 0.0 12.9 50 352 0.141 882 0 0 L1 5/0 9 0.9 0.0 0.0 0.0 0.0 0.0 0.6 0.0 0.0 0.0 0 0 0.000 0 0 0 L2 54/0 99 1.0 0.0 0.0 0.0 0.0 0.0 0.6 0.0 0.0 0.0 0 0 0.000 0 0 0 L3 289/0 527 0.5 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0 0 0.000 0 0 0 Sum 352/1 642 0.0 0.0 0.0 0.0 0.6 0.6 1.7 1.0 0.0 12.9 50 352 0.141 882 0 0 Int 0/0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 15.5 0 3 0.118 7 0 0 Flush(GB): accumulative 0.627, interval 0.005 Stalls(count): 0 level0_slowdown, 0 level0_numfiles, 882 memtable_compaction, 0 leveln_slowdown_soft, 0 leveln_slowdown_hard Task ID: #6493861 Blame Rev: Test Plan: run db_bench, look at output Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D35085	2015-03-14 15:01:43 -07:00
Igor Canadi	db03739340	options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically. Summary: When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases. In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier. Test Plan: New unit tests and pass tests suites including valgrind. Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo Reviewed By: ikabiljo Subscribers: yoshinorim, ikabiljo, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D31437	2015-03-02 22:40:41 -08:00
Jinfu Leng	96d989f70d	catch config errors with L0 file count triggers Test Plan: Run "make clean && make all check" Reviewers: rven, igor, yhchiang, kradhakrishnan, MarkCallaghan, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D33627	2015-02-23 16:08:27 -08:00
sdong	d45a6a4002	Add rocksdb.num-live-versions: number of live versions Summary: Add a DB property about live versions. It can be helpful to figure out whether there are files not live but not yet deleted, in some use cases. Test Plan: make all check Reviewers: rven, yhchiang, igor Reviewed By: igor Subscribers: yoshinorim, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D33327	2015-02-19 13:10:37 -08:00
Igor Canadi	e7ea51a8e7	Introduce job_id for flush and compaction Summary: It would be good to assing background job their IDs. Two benefits: 1) makes LOGs more readable 2) I might use it in my EventLogger, which will try to make our LOG easier to read/query/visualize Test Plan: ran rocksdb, read the LOG Reviewers: sdong, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D31617	2015-02-12 09:54:48 -08:00
Yueh-Hsuan Chiang	181191a1e4	Add a counter for collecting the wait time on db mutex. Summary: Add a counter for collecting the wait time on db mutex. Also add MutexWrapper and CondVarWrapper for measuring wait time. Test Plan: ./db_test export ROCKSDB_TESTS=MutexWaitStats ./db_test verify stats output using db_bench make clean make release ./db_bench --statistics=1 --benchmarks=fillseq,readwhilewriting --num=10000 --threads=10 Sample output: rocksdb.db.mutex.wait.micros COUNT : 7546866 Reviewers: MarkCallaghan, rven, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D32787	2015-02-04 21:39:45 -08:00
Ori Bernstein	f9758e0129	Add compaction listener. Summary: This adds a listener for compactions, and gives some useful statistics on each compaction pass. Test Plan: Unit tests. Reviewers: sdong, igor, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D31641	2015-01-27 14:44:02 -08:00
sdong	e919ecedfc	SuperVersion::Unref() to use sequential consistency to decrease ref counting Summary: I'm not sure the expected results of std::atomic::fetch_sub() when using memory_order_relaxed, and I suspect TSAN complains. Test Plan: make all check Reviewers: rven, yhchiang, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D32259	2015-01-27 14:08:08 -08:00
Igor Canadi	f1c8862479	Fix data race #1 Summary: This is first in a series of diffs that fixes data races detected by thread sanitizer. Here the problem is that we call Ref() on a column family during a single-threaded write, without holding a mutex. Test Plan: TSAN is no longer complaining about LevelLimitReopen. Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D32121	2015-01-26 11:48:07 -08:00
Igor Canadi	7731d51c82	Simplify column family concurrency Summary: This patch changes concurrency guarantees around ColumnFamilySet::column_families_ and ColumnFamilySet::column_families_data_. Before: * When mutating: lock DB mutex and spin lock * When reading: lock DB mutex OR spin lock After: * When mutating: lock DB mutex and be in write thread * When reading: lock DB mutex or be in write thread That way, we eliminate the spin lock that protects these hash maps and simplify concurrency. That means we don't need to lock the spin lock during writing, since writing is mutually exclusive with column family create/drop (the only operations that mutate those hash maps). With these new restrictions, I also needed to move column family create to the write thread (column family drop was already in the write thread). Even though we don't need to lock the spin lock during write, impact on performance should be minimal -- the spin lock is almost never busy, so locking it is almost free. This addresses task t5116919. Test Plan: make check Stress test with lots and lots of column family drop and create: time ./db_stress --threads=30 --ops_per_thread=5000000 --max_key=5000 --column_families=200 --clear_column_family_one_in=100000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress/ Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30651	2015-01-06 12:44:21 -08:00
Igor Canadi	fdb6be4e24	Rewritten system for scheduling background work Summary: When scaling to higher number of column families, the worst bottleneck was MaybeScheduleFlushOrCompaction(), which did a for loop over all column families while holding a mutex. This patch addresses the issue. The approach is similar to our earlier efforts: instead of a pull-model, where we do something for every column family, we can do a push-based model -- when we detect that column family is ready to be flushed/compacted, we add it to the flush_queue_/compaction_queue_. That way we don't need to loop over every column family in MaybeScheduleFlushOrCompaction. Here are the performance results: Command: ./db_bench --write_buffer_size=268435456 --db_write_buffer_size=268435456 --db=/fast-rocksdb-tmp/rocks_lots_of_cf --use_existing_db=0 --open_files=55000 --statistics=1 --histogram=1 --disable_data_sync=1 --max_write_buffer_number=2 --sync=0 --benchmarks=fillrandom --threads=16 --num_column_families=5000 --disable_wal=1 --max_background_flushes=16 --max_background_compactions=16 --level0_file_num_compaction_trigger=2 --level0_slowdown_writes_trigger=2 --level0_stop_writes_trigger=3 --hard_rate_limit=1 --num=33333333 --writes=33333333 Before the patch: fillrandom : 26.950 micros/op 37105 ops/sec; 4.1 MB/s After the patch: fillrandom : 17.404 micros/op 57456 ops/sec; 6.4 MB/s Next bottleneck is VersionSet::AddLiveFiles, which is painfully slow when we have a lot of files. This is coming in the next patch, but when I removed that code, here's what I got: fillrandom : 7.590 micros/op 131758 ops/sec; 14.6 MB/s Test Plan: make check two stress tests: Big number of compactions and flushes: ./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000 max_background_flushes=0, to verify that this case also works correctly ./db_stress --threads=30 --ops_per_thread=2000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=3 --max_background_compactions=3 --max_background_flushes=0 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000 Reviewers: ljin, rven, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30123	2014-12-19 20:38:12 +01:00
Jonah Cohen	a14b7873ee	Enforce write buffer memory limit across column families Summary: Introduces a new class for managing write buffer memory across column families. We supplement ColumnFamilyOptions::write_buffer_size with ColumnFamilyOptions::write_buffer, a shared pointer to a WriteBuffer instance that enforces memory limits before flushing out to disk. Test Plan: Added SharedWriteBuffer unit test to db_test.cc Reviewers: sdong, rven, ljin, igor Reviewed By: igor Subscribers: tnovak, yhchiang, dhruba, xjin, MarkCallaghan, yoshinorim Differential Revision: https://reviews.facebook.net/D22581	2014-12-02 12:09:20 -08:00
Igor Canadi	37d73d597e	Fix linters Summary: Two fixes: 1. if cpplint is not present on the system, don't return a confusing error in the linter 2. Add include_alpha, which means our includes should be sorted lexicographically Test Plan: Tried unsorting our includes, lint complained Reviewers: rven, ljin, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28845	2014-12-02 13:53:39 -05:00
Yueh-Hsuan Chiang	bcf9086899	Block Universal and FIFO compactions in ROCKSDB_LITE Summary: Block Universal and FIFO compactions in ROCKSDB_LITE Test Plan: make shared_lib -j32 make OPT=-DROCKSDB_LITE shared_lib Reviewers: ljin, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D29589	2014-11-26 15:45:11 -08:00
Lei Jin	8d3f8f9696	remove all remaining references to cfd->options() Summary: The very last reference happens in DBImpl::GetOptions() I built with both DBImpl::GetOptions() and ColumnFamilyData::options() commented out Test Plan: make all check Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D29073	2014-11-18 10:20:10 -08:00
Igor Canadi	772bc97f13	No CompactFiles in ROCKSDB_LITE Summary: It adds lots of code. Test Plan: compile for iOS, compile for mac. works. Reviewers: rven, sdong, ljin, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28857	2014-11-13 16:45:33 -05:00
Igor Canadi	25f273027b	Fix iOS compile with -Wshorten-64-to-32 Summary: So iOS size_t is 32-bit, so we need to static_cast<size_t> any uint64_t :( Test Plan: TARGET_OS=IOS make static_lib Reviewers: dhruba, ljin, yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28743	2014-11-13 14:39:30 -05:00
Yueh-Hsuan Chiang	28c82ff1b3	CompactFiles, EventListener and GetDatabaseMetaData Summary: This diff adds three sets of APIs to RocksDB. = GetColumnFamilyMetaData = * This APIs allow users to obtain the current state of a RocksDB instance on one column family. * See GetColumnFamilyMetaData in include/rocksdb/db.h = EventListener = * A virtual class that allows users to implement a set of call-back functions which will be called when specific events of a RocksDB instance happens. * To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners = CompactFiles = * CompactFiles API inputs a set of file numbers and an output level, and RocksDB will try to compact those files into the specified level. = Example = * Example code can be found in example/compact_files_example.cc, which implements a simple external compactor using EventListener, GetColumnFamilyMetaData, and CompactFiles API. Test Plan: listener_test compactor_test example/compact_files_example export ROCKSDB_TESTS=CompactFiles db_test export ROCKSDB_TESTS=MetaData db_test Reviewers: ljin, igor, rven, sdong Reviewed By: sdong Subscribers: MarkCallaghan, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D24705	2014-11-07 14:45:18 -08:00
Igor Canadi	9f20395cd6	Turn -Wshadow back on Summary: It turns out that -Wshadow has different rules for gcc than clang. Previous commit fixed clang. This commits fixes the rest of the warnings for gcc. Test Plan: compiles Reviewers: ljin, yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28131	2014-11-06 11:14:28 -08:00
Lei Jin	fd24ae9d05	SetOptions() to return status and also add it to StackableDB Summary: as title Test Plan: ./db_test Reviewers: sdong, yhchiang, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28269	2014-11-04 16:23:05 -08:00
sdong	ac6afaf9ef	Enforce naming convention of getters in version_set.h Summary: Enforce the accessier naming convention in functions in version_set.h Test Plan: make all check Reviewers: ljin, yhchiang, rven, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D28143	2014-11-04 09:59:05 -08:00
Igor Canadi	9f7fc3ac45	Turn on -Wshadow Summary: ...and fix all the errors :) Jim suggested turning on -Wshadow because it helped him fix number of critical bugs in fbcode. I think it's a good idea to be -Wshadow clean. Test Plan: compiles Reviewers: yhchiang, rven, sdong, ljin Reviewed By: ljin Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D27711	2014-10-31 11:59:54 -07:00
sdong	4d2ba38b65	Make VersionBuilder unit testable Summary: Rename Version::Builder to VersionBuilder and expose its definition to a header. Make VerisonBuilder not reference Version or ColumnFamilyData, only working with VersionStorageInfo. Add version_builder_test which has a simple test. Test Plan: make all check Reviewers: rven, yhchiang, igor, ljin Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D27969	2014-10-31 10:44:06 -07:00
sdong	76d1c28e82	Make CompactionPicker more easily tested Summary: Make compaction picker easier to test. The basic idea is to separate a minimum subcomponent of Version to VersionStorageInfo, which just responsible to LSM tree. A stub VersionStorageInfo can then be easily created and passed into compaction picker so that we can check the outputs. It now passes most tests. Still two things need to be done: (1) deal with the FIFO compaction's file size. (2) write an example test to make sure the interface can do the job. Add a compaction_picker_test to make sure compaction picker codes can be easily unit tested. Test Plan: Pass all unit tests and compaction_picker_test Reviewers: yhchiang, rven, igor, ljin Reviewed By: ljin Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D27639	2014-10-29 15:16:53 -07:00
Yueh-Hsuan Chiang	34d436b7db	Apply InfoLogLevel to the logs in db/column_family.cc Summary: Apply InfoLogLevel to the logs in db/column_family.cc Test Plan: make Reviewers: ljin, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D27843	2014-10-29 15:11:32 -07:00
Igor Canadi	a39e931e50	FlushProcess Summary: Abstract out FlushProcess and take it out of DBImpl. This also includes taking DeletionState outside of DBImpl. Currently this diff is only doing the refactoring. Future work includes: 1. Decoupling flush_process.cc, make it depend on less state 2. Write flush_process_test, which will mock out everything that FlushProcess depends on and test it in isolation Test Plan: make check Reviewers: rven, yhchiang, sdong, ljin Reviewed By: ljin Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D27561	2014-10-28 11:54:33 -07:00
Lei Jin	f981e08139	unfriend ColumnFamilyData from VersionSet Summary: as title Test Plan: make release will run full test on all stacked diffs before committing Reviewers: sdong, yhchiang, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D27591	2014-10-28 10:04:38 -07:00
Lei Jin	f1841985e4	dynamic inplace_update options Summary: Make inplace_update_support and inplace_update_num_locks dynamic. inplace_callback becomes immutable We are almost free of references to cfd->options() in db_impl Test Plan: unit test Reviewers: igor, yhchiang, rven, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D25293	2014-10-27 12:10:13 -07:00
Lei Jin	4d5708aa56	dynamic soft_rate_limit and hard_rate_limit Summary: as title Test Plan: unit test I am only able to build the test case for hard_rate_limit. soft_rate_limit is essentially the same thing as hard_rate_limit Reviewers: igor, sdong, yhchiang Reviewed By: yhchiang Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D24759	2014-10-16 17:21:31 -07:00
Lei Jin	dc50a1a593	make max_write_buffer_number dynamic Summary: as title Test Plan: unit test Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D24729	2014-10-16 16:57:59 -07:00
Lei Jin	5ec53f3edf	make compaction related options changeable Summary: make compaction related options changeable. Most of changes are tedious, following the same convention: grabs MutableCFOptions at the beginning of compaction under mutex, then pass it throughout the job and register it in SuperVersion at the end. Test Plan: make all check Reviewers: igor, yhchiang, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D23349	2014-10-01 16:19:16 -07:00
Igor Canadi	21ddcf6e4f	Remove allow_thread_local Summary: See https://reviews.facebook.net/D19365 Test Plan: compiles Reviewers: sdong, yhchiang, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D23907	2014-09-24 13:12:16 -07:00
sdong	d0de413f4d	WriteBatchWithIndex to allow different Comparators for different column families Summary: Previously, one single column family is given to WriteBatchWithIndex to index keys for all column families. An extra map from column family ID to comparator is maintained which can override the default comparator given in the constructor. A WriteBatchWithIndex::SetComparatorForCF() is added for user to add comparators per column family. Also move more codes into anonymous namespace. Test Plan: Add a unit test Reviewers: ljin, igor Reviewed By: igor Subscribers: dhruba, leveldb, yhchiang Differential Revision: https://reviews.facebook.net/D23355	2014-09-22 13:47:39 -07:00
Lei Jin	a062e1f2c4	SetOptions() for memtable related options Summary: as title Test Plan: make all check I will think a way to set up stress test for this Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D23055	2014-09-17 12:49:13 -07:00
Igor Canadi	3d9e6f7759	Push model for flushing memtables Summary: When memtable is full it calls the registered callback. That callback then registers column family as needing the flush. Every write checks if there are some column families that need to be flushed. This completely eliminates the need for MakeRoomForWrite() function and simplifies our Write code-path. There is some complexity with the concurrency when the column family is dropped. I made it a bit less complex by dropping the column family from the write thread in https://reviews.facebook.net/D22965. Let me know if you want to discuss this. Test Plan: make check works. I'll also run db_stress with creating and dropping column families for a while. Reviewers: yhchiang, sdong, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D23067	2014-09-10 18:46:09 -07:00
Lei Jin	52311463e9	MemTableOptions Summary: removed reference to options in WriteBatch and DBImpl::Get() Test Plan: make all check Reviewers: yhchiang, igor, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D23049	2014-09-08 18:46:52 -07:00
Igor Canadi	2d57828d0e	Check stop level trigger-0 before slowdown level-0 trigger Summary: ... Test Plan: Can't repro the test failure, but let's see what jenkins says Reviewers: zagfox, sdong, ljin Reviewed By: sdong, ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D23061	2014-09-08 15:23:58 -07:00
Igor Canadi	a2bb7c3c33	Push- instead of pull-model for managing Write stalls Summary: Introducing WriteController, which is a source of truth about per-DB write delays. Let's define an DB epoch as a period where there are no flushes and compactions (i.e. new epoch is started when flush or compaction finishes). Each epoch can either: * proceed with all writes without delay * delay all writes by fixed time * stop all writes The three modes are recomputed at each epoch change (flush, compaction), rather than on every write (which is currently the case). When we have a lot of column families, our current pull behavior adds a big overhead, since we need to loop over every column family for every write. With new push model, overhead on Write code-path is minimal. This is just the start. Next step is to also take care of stalls introduced by slow memtable flushes. The final goal is to eliminate function MakeRoomForWrite(), which currently needs to be called for every column family by every write. Test Plan: make check for now. I'll add some unit tests later. Also, perf test. Reviewers: dhruba, yhchiang, MarkCallaghan, sdong, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D22791	2014-09-08 11:20:25 -07:00
Lei Jin	5665e5e285	introduce ImmutableOptions Summary: As a preparation to support updating some options dynamically, I'd like to first introduce ImmutableOptions, which is a subset of Options that cannot be changed during the course of a DB lifetime without restart. ColumnFamily will keep both Options and ImmutableOptions. Any component below ColumnFamily should only take ImmutableOptions in their constructor. Other options should be taken from APIs, which will be allowed to adjust dynamically. I am yet to make changes to memtable and other related classes to take ImmutableOptions in their ctor. That can be done in a seprate diff as this one is already pretty big. Test Plan: make all check Reviewers: yhchiang, igor, sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D22545	2014-09-04 16:18:36 -07:00
Lei Jin	384400128f	move block based table related options BlockBasedTableOptions Summary: I will move compression related options in a separate diff since this diff is already pretty lengthy. I guess I will also need to change JNI accordingly :( Test Plan: make all check Reviewers: yhchiang, igor, sdong Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D21915	2014-08-25 14:22:05 -07:00
sdong	28b5c76004	WriteBatchWithIndex: a wrapper of WriteBatch, with a searchable index Summary: Add WriteBatchWithIndex so that a user can query data out of a WriteBatch, to support MongoDB's read-its-own-write. WriteBatchWithIndex uses a skiplist to store the binary index. The index stores the offset of the entry in the write batch. When searching for a key, the key for the entry is read by read the entry from the write batch from the offset. Define a new iterator class for querying data out of WriteBatchWithIndex. A user can create an iterator of the write batch for one column family, seek to a key and keep calling Next() to see next entries. I will add more unit tests if people are OK about this API. Test Plan: make all check Add unit tests. Reviewers: yhchiang, igor, MarkCallaghan, ljin Reviewed By: ljin Subscribers: dhruba, leveldb, xjin Differential Revision: https://reviews.facebook.net/D21381	2014-08-18 16:37:38 -07:00
miguelportilla	93e6b5e9d9	Changes to support unity build: * Script for building the unity.cc file via Makefile * Unity executable Makefile target for testing builds * Source code changes to fix compilation of unity build	2014-08-11 13:22:47 -04:00
Stanislau Hlebik	76286ee67e	Remove unnecessary constructor parameter from ColumnFamilyData Summary: const string& dbname parameter is not used Test Plan: make all Reviewers: sdong, igor Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D20703	2014-07-30 13:53:08 -07:00
Lei Jin	7e8bb71dd0	InternalStats to take cfd on constructor Summary: It has one-to-one relationship with CFD. Take a pointer to CFD on constructor to avoid passing cfd through member functions. Test Plan: make Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D20565	2014-07-28 12:27:08 -07:00
sdong	f6b7e1ed1a	Allow user to specify DB path of output file of manual compaction Summary: Add a parameter path_id to DB::CompactRange(), to indicate where the output file should be placed to. Test Plan: add a unit test Reviewers: yhchiang, ljin Reviewed By: ljin Subscribers: xjin, igor, dhruba, MarkCallaghan, leveldb Differential Revision: https://reviews.facebook.net/D20085	2014-07-21 19:06:00 -07:00
Lei Jin	f6f1533c6f	make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` Compaction Stats [default] Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown DB Stats Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917	2014-07-21 12:57:29 -07:00
sdong	2459f7ec4e	Support Multiple DB paths (without having an interface to expose to users) Summary: In this patch, we allow RocksDB to support multiple DB paths internally. No user interface is supported yet so this patch is silent to users. Test Plan: make all check Reviewers: igor, haobo, ljin, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D18921	2014-07-02 21:14:44 -07:00
Stanislau Hlebik	a3594867ba	Cache some conditions for DBImpl::MakeRoomForWrite Summary: Task 4580155. Some conditions in DBImpl::MakeRoomForWrite can be cached in ColumnFamilyData, because theirs value can be changed only during compaction, adding new memtable and/or add recalculation of compaction score. These conditions are: cfd->imm()->size() == cfd->options()->max_write_buffer_number - 1 cfd->current()->NumLevelFiles(0) >= cfd->options()->level0_stop_writes_trigger cfd->options()->soft_rate_limit > 0.0 && (score = cfd->current()->MaxCompactionScore()) > cfd->options()->soft_rate_limit cfd->options()->hard_rate_limit > 1.0 && (score = cfd->current()->MaxCompactionScore()) > cfd->options()->hard_rate_limit P.S. As it's my first diff, Siying suggested to add everybody as a reviewers for this diff. Sorry, if I forgot someone or add someone by mistake. Test Plan: make all check Reviewers: haobo, xjin, dhruba, yhchiang, zagfox, ljin, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19311	2014-06-26 16:45:27 -07:00
Yueh-Hsuan Chiang	e6e259b8ab	Include max_write_buffer_number >= 2 to SanitizeOptions. Summary: Include max_write_buffer_number >= 2 to SanitizeOptions. Test Plan: make all check Reviewers: haobo, sdong, igor, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19077	2014-06-16 16:26:46 -07:00
Igor Canadi	8cb7ad83c3	Flush stale column families less aggressively Summary: We've seen some production issues where column family is detected as stale, although there is only one column family in the system. This is a quick fix that: 1) doesn't flush stale column families if there's only one of them 2) Use 4 as a coefficient instead of 2 for determening when a column family is stale. This will make flushing less aggressive, while still keep a nice dynamic flushing of very stale CFs. Test Plan: make check Reviewers: dhruba, haobo, ljin, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D18861	2014-06-02 15:33:54 -07:00
Igor Canadi	6de6a06631	FIFO compaction style Summary: Introducing new compaction style -- FIFO. FIFO compaction style has write amplification of 1 (+1 for WAL) and it deletes the oldest files when the total DB size exceeds pre-configured values. FIFO compaction style is suited for storing high-frequency event logs. Test Plan: Added a unit test Reviewers: dhruba, haobo, sdong Reviewed By: dhruba Subscribers: alberts, leveldb Differential Revision: https://reviews.facebook.net/D18765	2014-05-21 11:43:35 -07:00
Igor Canadi	26f5dd9a5a	TablePropertiesCollectorFactory Summary: This diff addresses task #4296714 and rethinks how users provide us with TablePropertiesCollectors as part of Options. Here's description of task #4296714: I'm debugging #4295529 and noticed that our count of user properties kDeletedKeys is wrong. We're sharing one single InternalKeyPropertiesCollector with all Table Builders. In LOG Files, we're outputting number of kDeletedKeys as connected with a single table, while it's actually the total count of deleted keys since creation of the DB. For example, this table has 3155 entries and 1391828 deleted keys. The problem with current approach that we call methods on a single TablePropertiesCollector for all the tables we create. Even worse, we could do it from multiple threads at the same time and TablePropertiesCollector has no way of knowing which table we're calling it for. Good part: Looks like nobody inside Facebook is using Options::table_properties_collectors. This means we should be able to painfully change the API. In this change, I introduce TablePropertiesCollectorFactory. For every table we create, we call `CreateTablePropertiesCollector`, which creates a TablePropertiesCollector for a single table. We then use it sequentially from a single thread, which means it doesn't have to be thread-safe. Test Plan: Added a test in table_properties_collector_test that fails on master (build two tables, assert that kDeletedKeys count is correct for the second one). Also, all other tests Reviewers: sdong, dhruba, haobo, kailiu Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D18579	2014-05-13 12:30:55 -07:00
Igor Canadi	161d9e586b	Don't overflow size_t in mac	2014-04-16 15:15:22 -07:00
Lei Jin	82b37a18bd	thread local for tailing iterator Summary: replace the super version acquisision in tailing itrator with thread local Test Plan: will post results Reviewers: igor, haobo, sdong, yhchiang, dhruba Reviewed By: igor CC: leveldb Differential Revision: https://reviews.facebook.net/D17757	2014-04-14 10:48:01 -07:00
Lei Jin	539dd207df	using thread local SuperVersion for NewIterator Summary: Similar to GetImp(), use SuperVersion from thread local instead of acquriing mutex. I don't expect this change will make a dent on NewIterator() performance because the bottleneck seems to be on the rest part of the API Test Plan: make asan_check will post perf numbers Reviewers: haobo, igor, sdong, dhruba, yhchiang Reviewed By: sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D17643	2014-04-14 09:34:59 -07:00
Igor Canadi	b42ceb9598	Simplify cleanup of dead (refcount == 0) column families	2014-04-07 14:31:02 -07:00
Igor Canadi	ddbd1ece88	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_test.cc db/internal_stats.cc db/internal_stats.h db/version_edit.cc db/version_edit.h db/version_set.cc include/rocksdb/options.h util/options.cc	2014-03-31 13:39:24 -07:00
Igor Canadi	fb2346fc1f	[CF] Code cleanup part 1 Summary: I'm cleaning up some code preparing for the big diff review tomorrow. This is the first part of the cleanup. Changes are mostly cosmetic. The goal is to decrease amount of code difference between columnfamilies and master branch. This diff also fixes race condition when dropping column family. Test Plan: Ran db_stress with variety of parameters Reviewers: dhruba, haobo Differential Revision: https://reviews.facebook.net/D16833	2014-03-12 09:56:53 -07:00
Igor Canadi	9634ba42ac	Merge branch 'master' into columnfamilies Conflicts: db/compaction_picker.cc db/db_impl.cc db/db_impl.h db/tailing_iter.cc db/version_set.h include/rocksdb/options.h util/options.cc	2014-03-10 17:26:09 -07:00
Igor Canadi	1e0d47276c	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_impl.h	2014-03-07 16:59:47 -08:00
Igor Canadi	80a207fc90	Merge branch 'master' into columnfamilies Conflicts: db/compaction_picker.cc db/compaction_picker.h db/db_impl.cc db/version_set.cc db/version_set.h include/rocksdb/options.h util/options.cc	2014-03-05 16:59:22 -08:00
Igor Canadi	9625acbf70	[CF] Dont reuse dropped column family IDs Summary: Column family IDs should be unique, even if column family is dropped. To achieve this, we save max column family in manifest. Note that the diff is still not ready. I'm only using differential to move the patch to my Mac machine. Test Plan: added a test to column_family_test Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16581	2014-03-05 12:13:44 -08:00
Igor Canadi	335b207974	[CF] Delete SuperVersion in a special function Summary: Added a function DeleteSuperVersion that can be called in DBImpl destructor before PurgingObsoleteFiles. That way, PurgeObsoleteFiles will be able to delete all files held by alive super versions. Test Plan: column_family_test with valgrind Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16545	2014-03-04 09:35:44 -08:00
Igor Canadi	9d0577a6be	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_impl.h db/transaction_log_impl.cc db/transaction_log_impl.h include/rocksdb/options.h util/env.cc util/options.cc	2014-03-03 18:29:03 -08:00
Igor Canadi	8ea21a778b	[CF] Rething LogAndApply for column families Summary: I though I might get away with as little changes to LogAndApply() as possible. It turns out this is not the case. This diff introduces different behavior of LogAndApply() for three cases: 1. column family add 2. column family drop 3. no-column family manipulation (1) and (2) don't support group commit yet. There were a lot of problems with old version od LogAndApply, detected by db_stress. The biggest was non-atomicity of manifest writes and metadata changes (i.e. if column family add is in manifest, it also has to be in in-memory data structure). Test Plan: db_stress Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16491	2014-02-28 14:46:48 -08:00
Igor Canadi	8b7ab9951c	[CF] Handle failure in WriteBatch::Handler Summary: * Add ColumnFamilyHandle::GetID() function. Client needs to know column family's ID to be able to construct WriteBatch * Handle WriteBatch::Handler failure gracefully. Since WriteBatch is not a very smart function (it takes raw CF id), client can add data to WriteBatch for column family that doesn't exist. In that case, we need to gracefully return failure status from DB::Write(). To do that, I added a return Status to WriteBatch functions PutCF, DeleteCF and MergeCF. Test Plan: Added test to column_family_test Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16323	2014-02-26 10:10:00 -08:00
Igor Canadi	ccdb93e775	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_impl.h db/memtable_list.cc db/memtable_list.h db/version_set.cc db/version_set.h	2014-02-12 14:01:30 -08:00
Igor Canadi	b06840aa7d	[CF] Rethinking ColumnFamilyHandle and fix to dropping column families Summary: The change to the public behavior: * When opening a DB or creating new column family client gets a ColumnFamilyHandle. * As long as column family handle is alive, client can do whatever he wants with it, even drop it * Dropped column family can still be read from (using the column family handle) * Added a new call CloseColumnFamily(). Client has to close all column families that he has opened before deleting the DB * As soon as column family is closed, any calls to DB using that column family handle will fail (also any outstanding calls) Internally: * Ref-counting ColumnFamilyData * New thread-safety for ColumnFamilySet * Dropped column families are now completely dropped and their memory cleaned-up Test Plan: added some tests to column_family_test Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D16101	2014-02-12 13:47:09 -08:00
Igor Canadi	99e61fdd5c	[CF] Separate dumping of DBOptions and ColumnFamilyOptions Summary: When we open a DB, we should dump only DBOptions and then when we create a new column family, we dump ColumnFamilyOptions for each one. Test Plan: make check, confirm contents of the LOG Reviewers: dhruba, haobo, sdong, kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D16011	2014-02-07 13:52:51 -08:00
Igor Canadi	0143abdbb0	Merge branch 'master' into columnfamilies Conflicts: HISTORY.md db/db_impl.cc db/db_impl.h db/db_iter.cc db/db_test.cc db/dbformat.h db/memtable.cc db/memtable_list.cc db/memtable_list.h db/table_cache.cc db/table_cache.h db/version_edit.h db/version_set.cc db/version_set.h db/write_batch.cc db/write_batch_test.cc include/rocksdb/options.h util/options.cc	2014-02-06 15:58:20 -08:00
Igor Canadi	f8d5443efe	[CF] Thread-safety guarantees for ColumnFamilySet Summary: Revised thread-safety guarantees and implemented a way to spinlock the object. Test Plan: make check Reviewers: dhruba, haobo, sdong, kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15975	2014-02-06 11:57:36 -08:00
Igor Canadi	8fa8a708ef	[CF] Propagate correct options to WriteBatch::InsertInto Summary: WriteBatch can have multiple column families in one batch. Every column family has different options. So we have to add a way for write batch to get options for an arbitrary column family. This required a bit more acrobatics since lots of interfaces had to be changed. Test Plan: make check Reviewers: dhruba, haobo, sdong, kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15957	2014-02-06 10:23:31 -08:00
Igor Canadi	f276e0e59d	[CF] Options -> DBOptions Summary: Replaced most of occurrences of Options with more specific DBOptions. This brings us very close to supporting different configuration options for each column family. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15933	2014-02-05 14:56:09 -08:00
Igor Canadi	6e56ab5702	[CF] Add full_options_ to ColumnFamilyData Summary: Lots of code expects Options on construction/function call. My original idea was to split Options argument into ColumnFamilyOptions and DBOptions (the latter only if needed). However, this will require huge code changes very deep in the stack. The better idea is to have ColumnFamilyData hold both ColumnFamilyOptions and Options. ColumnFamilyData::Options would be constructed from DBOptions (same for each column family) and ColumnFamilyOptions (different for each column family) Now when we construct a class or call any method that requires Options, we can just push him ColumnFamilyData::Options and be sure that it's using column-family-specific settings. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15927	2014-02-05 12:26:40 -08:00
Igor Canadi	c24d8c4e90	[CF] Rethink table cache Summary: Adapting table cache to column families is interesting. We want table cache to be global LRU, so if some column families are use not as often as others, we want them to be evicted from cache. However, current TableCache object also constructs tables on its own. If table is not found in the cache, TableCache automatically creates new table. We want each column family to be able to specify different table factory. To solve the problem, we still have a single LRU, but we provide the LRUCache object to TableCache on construction. We have one TableCache per column family, but the underyling cache is shared by all TableCache objects. This allows us to have a global LRU, but still be able to support different table factories for different column families. Also, in the future it will also be able to support different directories for different column families. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15915	2014-02-05 11:55:30 -08:00
Igor Canadi	7b9f134959	[CF] Move InternalStats to ColumnFamilyData Summary: InternalStats is a messy thing, keeping both DB data and column family data. However, it's better off living in ColumnFamilyData than in DBImpl. For now, at least. Test Plan: make check Reviewers: dhruba, kailiu, haobo, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15879	2014-02-04 17:59:50 -08:00
Igor Canadi	73f62255c1	[CF] Split SanitizeOptions into two Summary: There are three SanitizeOption-s now : one for DBOptions, one for ColumnFamilyOptions and one for Options (which just calls the other two) I have also reshuffled some options -- table_cache options and info_log should live in DBOptions, for example. Test Plan: make check doesn't complain Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15873	2014-02-04 17:26:51 -08:00
Igor Canadi	0e22badc08	[column families] Iterator and MultiGet Summary: Support for different column families in Iterator and MultiGet code path. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15849	2014-02-03 17:44:40 -08:00
Igor Canadi	29bacb2eb6	VersionSet cleanup Summary: Removed icmp_ from VersionSet (since it's per-column-family, not per-DB-instance) Unfriended VersionSet and ColumnFamilyData (yay!) Removed VersionSet::NumberLevels() Cleaned up DBImpl Test Plan: make check Reviewers: dhruba, haobo, kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15819	2014-02-03 13:10:47 -08:00
Igor Canadi	f7489123e2	Move compaction picker and internal key comparator to ColumnFamilyData Summary: Compaction picker and internal key comparator are different for each column family (not global), so they should live in ColumnFamilyData Test Plan: make check Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15801	2014-01-31 16:06:55 -08:00
Igor Canadi	9ca638a86d	Enable iterating column families with a concurrent writer Summary: Sometimes we iterate through column families, and unlock the mutex in the body of the iteration. While mutex is unlocked, some column family might be created or dropped. We need to be able to continue iterating through column families even though our current column family got dropped. This diff implements circular linked lists that connect all column families. It then uses the link list to enable iterating through linked lists. Even if the column family is dropped, its next_ pointer still can be used to advance to another alive column family. Test Plan: make check Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15603	2014-01-30 16:57:52 -08:00
Igor Canadi	6973bb1722	MakeRoomForWrite() support for column families Summary: Making room for write will be the hardest part of the column family implementation. For now, I just iterate through all column families and run MakeRoomForWrite() for every one. Test Plan: make check does not complain Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15597	2014-01-30 16:12:08 -08:00
Igor Canadi	fa99d53e55	Change ColumnFamilyData from struct to class Summary: ColumnFamilyData grew a lot, there's much more data that it holds now. It makes more sense to encapsulate it better by making it a class. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15579	2014-01-29 15:18:36 -08:00
Igor Canadi	f24a3ee52d	Read from and write to different column families Summary: This one is big. It adds ability to write to and read from different column families (see the unit test). It also supports recovery of different column families from log, which was the hardest part to reason about. We need to make sure to never delete the log file which has unflushed data from any column family. To support that, I added another concept, which is versions_->MinLogNumber() Test Plan: Added a unit test in column_family_test Reviewers: dhruba, haobo, sdong, kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15537	2014-01-29 11:38:16 -08:00
Igor Canadi	eb055609e4	[column families] Move memtable and immutable memtable list to column family data Summary: All memtables and immutable memtables are moved from DBImpl to ColumnFamilyData. For now, they are all referenced from default column family in DBImpl. It shouldn't be hard to get them from custom column family. Test Plan: make check Reviewers: dhruba, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15459	2014-01-27 13:37:14 -08:00
Igor Canadi	7c5e583a27	ColumnFamilySet Summary: I created a separate class ColumnFamilySet to keep track of column families. Before we did this in VersionSet and I believe this approach is cleaner. Let me know if you have any comments. I will commit tomorrow. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15357	2014-01-23 14:03:38 -08:00

... 2 3 4 5 6

269 commits