rocksdb

Commit Graph

Author	SHA1	Message	Date
Venkatesh Radhakrishnan	7b12ae97d4	Add signalall after removing item from manual_compaction deque Summary: When there are waiting manual compactions, we need to signal them after removing the current manual compaction from the deque. Test Plan: ColumnFamilytTest.SameCFManualManualCommaction Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D52119	2015-12-17 16:59:00 -08:00
Venkatesh Radhakrishnan	030215bf01	Running manual compactions in parallel with other automatic or manual compactions in restricted cases Summary: This diff provides a framework for doing manual compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to BackgroundCompactions, so that RunManualCompactions can be reentrant. Parallelism is controlled by the two routines ConflictingManualCompaction to allow/disallow new parallel/manual compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs. I will be adding more tests later. Test Plan: Rocksdb regression + new tests + valgrind Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong Reviewed By: sdong Subscribers: yoshinorim, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D47973	2015-12-14 11:20:34 -08:00
agiardullo	3bfd3d39a3	Use SST files for Transaction conflict detection Summary: Currently, transactions can fail even if there is no actual write conflict. This is due to relying on only the memtables to check for write-conflicts. Users have to tune memtable settings to try to avoid this, but it's hard to figure out exactly how to tune these settings. With this diff, TransactionDB will use both memtables and SST files to determine if there are any write conflicts. This relies on the fact that BlockBasedTable stores sequence numbers for all writes that happen after any open snapshot. Also, D50295 is needed to prevent SingleDelete from disappearing writes (the TODOs in this test code will be fixed once the other diff is approved and merged). Note that Optimistic transactions will still rely on tuning memtable settings as we do not want to read from SST while on the write thread. Also, memtable settings can still be used to reduce how often TransactionDB needs to read SST files. Test Plan: unit tests, db bench Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb, yoshinorim Differential Revision: https://reviews.facebook.net/D50475	2015-12-11 12:34:11 -08:00
Yueh-Hsuan Chiang	774b80e99e	Resubmit the fix for a race condition in persisting options Summary: This patch fix a race condition in persisting options which will cause a crash when: * Thread A obtain cf options and start to persist options based on that cf options. * Thread B kicks in and finish DropColumnFamily and delete cf_handle. * Thread A wakes up and tries to finish the persisting options and crashes. Test Plan: Add a test in column_family_test that can reproduce the crash Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D51717	2015-12-08 17:01:02 -08:00
agiardullo	e5c5f23814	Support marking snapshots for write-conflict checking - Take 2 Summary: D51183 was reverted due to breaking the LITE build. This diff is the same as D51183 but with a fix for the LITE BUILD(D51693) Test Plan: run all unit tests Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D51711	2015-12-08 16:47:31 -08:00
sdong	1d63c3d610	Revert "Support marking snapshots for write-conflict checking" This reverts commit `ec704aafdc` for it broke RocksDB LITE build.	2015-12-08 09:27:17 -08:00
agiardullo	ec704aafdc	Support marking snapshots for write-conflict checking Summary: D50475 enables using SST files for transaction write-conflict checking. In order for this to work, we need to make sure not to compact out SingleDeletes when there is an earlier transaction snapshot(D50295). If there is a long-held snapshot, this could reduce the benefit of the SingleDelete optimization. This diff allows Transactions to mark snapshots as being used for write-conflict checking. Then, during compaction, we will be able to optimize SingleDeletes better in the future. This diff adds a flag to SnapshotImpl which is used by Transactions. This diff also passes the earliest write-conflict snapshot's sequence number to CompactionIterator. This diff does not actually change Compaction (after this diff is pushed, D50295 will be able to use this information). Test Plan: no behavior change, ran existing tests Reviewers: rven, kradhakrishnan, yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D51183	2015-12-07 19:40:51 -08:00
sdong	f307036bde	Revert "Fix a race condition in persisting options" This reverts commit `2fa3ed5180`. It breaks RocksDB lite build	2015-12-07 17:09:12 -08:00
Yueh-Hsuan Chiang	2fa3ed5180	Fix a race condition in persisting options Summary: This patch fix a race condition in persisting options which will cause a crash when: * Thread A obtain cf options and start to persist options based on that cf options. * Thread B kicks in and finish DropColumnFamily and delete cf_handle. * Thread A wakes up and tries to finish the persisting options and crashes. Test Plan: Add a test in column_family_test that can reproduce the crash Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D51609	2015-12-07 15:25:12 -08:00
Alex Yang	e8180f9901	added public api to schedule flush/compaction, code to prevent race with db::open Summary: Fixes T8781168. Added a new function EnableAutoCompactions in db.h to be publicly avialable. This allows compaction to be re-enabled after disabling it via SetOptions Refactored code to set the dbptr earlier on in TransactionDB::Open and DB::Open Temporarily disable auto_compaction in TransactionDB::Open until dbptr is set to prevent race condition. Test Plan: Ran make all check verified fix on myrocks side: was able to reproduce the seg fault with ../tools/mysqltest.sh --mem --force rocksdb.drop_table method was to manually sleep the thread after DB::Open but before TransactionDB ptr was assigned in transaction_db_impl.cc: DB::Open(db_options, dbname, column_families_copy, handles, &db); clock_t goal = (60000 * 10) + clock(); while (goal > clock()); ...dbptr(aka rdb) gets assigned below verified my changes fixed the issue. Also added unit test 'ToggleAutoCompaction' in transaction_test.cc Reviewers: hermanlee4, anthony Reviewed By: anthony Subscribers: alex, dhruba Differential Revision: https://reviews.facebook.net/D51147	2015-12-03 22:59:44 -08:00
sdong	f9103d9a30	DBTest.DynamicCompactionOptions: More deterministic and readable Summary: DBTest.DynamicCompactionOptions sometimes fails the assert but I can't repro it locally. Make it more deterministic and readable and see whether the problem is still there. Test Plan: Run tht test and make sure it passes Reviewers: kradhakrishnan, yhchiang, igor, rven, IslamAbdelRahman, anthony Reviewed By: anthony Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D51309	2015-12-01 16:49:47 -08:00
Yueh-Hsuan Chiang	e114f0abb8	Enable RocksDB to persist Options file. Summary: This patch allows rocksdb to persist options into a file on DB::Open, SetOptions, and Create / Drop ColumnFamily. Options files are created under the same directory as the rocksdb instance. In addition, this patch also adds a fail_if_missing_options_file in DBOptions that makes any function call return non-ok status when it is not able to persist options properly. // If true, then DB::Open / CreateColumnFamily / DropColumnFamily // / SetOptions will fail if options file is not detected or properly // persisted. // // DEFAULT: false bool fail_if_missing_options_file; Options file names are formatted as OPTIONS-<number>, and RocksDB will always keep the latest two options files. Test Plan: Add options_file_test. options_test column_family_test Reviewers: igor, IslamAbdelRahman, sdong, anthony Reviewed By: anthony Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48285	2015-11-10 22:58:01 -08:00
Yueh-Hsuan Chiang	7d7ee2b654	Add Memory Insight support to utilities Summary: This patch introduces utilities/memory, which currently includes GetApproximateMemoryUsageByType that reports different types of rocksdb memory usage given a list of input DBs. The API also take care of the case where Cache could be shared across multiple column families / multiple db instances. Currently, it reports memory usage of memtable, table-readers and cache. Test Plan: utilities/memory/memory_test.cc Reviewers: igor, anthony, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D49257	2015-11-03 17:52:17 -08:00
Yueh-Hsuan Chiang	3ecbab0040	Add GetAggregatedIntProperty(): returns the aggregated value from all CFs Summary: This patch adds GetAggregatedIntProperty() that returns the aggregated value from all CFs Test Plan: Added a test in db_test Reviewers: igor, sdong, anthony, IslamAbdelRahman, rven Reviewed By: rven Subscribers: rven, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D49497	2015-11-03 15:54:18 -08:00
Siying Dong	138876a62c	Merge pull request #746 from ceph/wip-recycle Add Options.recycle_log_file_num for Recycling WAL Files	2015-10-26 15:01:28 -07:00
Alexey Maykov	f18acd8875	Fixed the clang compilation failure Summary: As above. Test Plan: USE_CLANG=1 make check -j Reviewers: igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48981	2015-10-19 10:38:50 -07:00
Sage Weil	666376150c	db_impl: recycle log files If log recycling is enabled, put old WAL files on a recycle queue instead of deleting them. When we need a new log file, take a recycled file off the list if one is available. Signed-off-by: Sage Weil <sage@redhat.com>	2015-10-18 21:24:32 -04:00
Alexey Maykov	e1a09a7703	Implementation for GetPropertiesOfTablesInRange Summary: In MyRocks, it is sometimes important to get propeties only for the subset of the database. This diff implements the API in RocksDB. Test Plan: ran the GetPropertiesOfTablesInRange Reviewers: rven, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48651	2015-10-17 13:34:43 -07:00
Yueh-Hsuan Chiang	ad471453e8	Allow GetProperty to report the number of currently running flushes / compactions. Summary: Add rocksdb.num-running-compactions and rocksdb.num-running-flushes to GetIntProperty() that reports the number of currently running compactions / flushes. Test Plan: augmented existing tests in db_test Reviewers: igor, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D48693	2015-10-17 00:16:36 -07:00
sdong	35ad531be3	Seperate InternalIterator from Iterator Summary: Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type. This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's. At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it. Test Plan: Run all existing tests. Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven Reviewed By: rven Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D48549	2015-10-13 15:32:13 -07:00
Islam AbdelRahman	7f58ff7c31	Remove db_impl_debug from release build Summary: Remove db_impl_debug from NDEBUG, but allow it in ROCKSDB_LITE These functions by definition should not be included in NDEBUG and they are only used for testing This is based on offline discussion with @yhchiang and @igor Test Plan: make static_lib make check Reviewers: igor, sdong, yhchiang Reviewed By: yhchiang Subscribers: igor, yhchiang, dhruba Differential Revision: https://reviews.facebook.net/D48573	2015-10-12 17:35:43 -07:00
Islam AbdelRahman	c64ae05b1c	Move TEST_NewInternalIterator to NewInternalIterator Summary: Long time ago we add InternalDumpCommand to ldb_tool https://reviews.facebook.net/D11517 This command is using TEST_NewInternalIterator although it's not a test. This patch move TEST_NewInternalIterator outside of db_impl_debug.cc Test Plan: make check make static_lib Reviewers: yhchiang, igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48561	2015-10-12 17:22:37 -07:00
Alexey Maykov	3d07b815f6	Passing table properties to compaction callback Summary: It would be nice to have and access to table properties in compaction callbacks. In MyRocks project, it will make possible to update optimizer statistics online. Test Plan: ran the unit test. Ran myrocks with the new way of collecting stats. Reviewers: igor, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48267	2015-10-09 18:10:55 -07:00
Igor Canadi	9803e0d813	compaction_filter.h cleanup Summary: Two changes: 1. remove *V2 filter stuff. we deprecated that a while ago 2. clarify what happens when user sets max_subcompactions to bigger than 1 Test Plan: none Reviewers: yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D47871	2015-10-08 09:32:50 -07:00
Igor Canadi	115427ef63	Add APIs PauseBackgroundWork() and ContinueBackgroundWork() Summary: To support a new MongoDB capability, we need to make sure that we don't do any IO for a short period of time. For background, see: * https://jira.mongodb.org/browse/SERVER-20704 * https://jira.mongodb.org/browse/SERVER-18899 To implement that, I add a new API calls PauseBackgroundWork() and ContinueBackgroundWork() which reuse the capability we already have in place for RefitLevel() function. Test Plan: Added a new test in db_test. Made sure that test fails when PauseBackgroundWork() is commented out. Reviewers: IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D47901	2015-10-02 13:17:34 -07:00
Islam AbdelRahman	f03b5c987b	Add experimental DB::AddFile() to plug sst files into empty DB Summary: This is an initial version of bulk load feature This diff allow us to create sst files, and then bulk load them later, right now the restrictions for loading an sst file are (1) Memtables are empty (2) Added sst files have sequence number = 0, and existing values in database have sequence number = 0 (3) Added sst files values are not overlapping Test Plan: unit testing Reviewers: igor, ott, sdong Reviewed By: sdong Subscribers: leveldb, ott, dhruba Differential Revision: https://reviews.facebook.net/D39081	2015-09-23 12:42:43 -07:00
Andres Noetzli	014fd55adc	Support for SingleDelete() Summary: This patch fixes #7460559. It introduces SingleDelete as a new database operation. This operation can be used to delete keys that were never overwritten (no put following another put of the same key). If an overwritten key is single deleted the behavior is undefined. Single deletion of a non-existent key has no effect but multiple consecutive single deletions are not allowed (see limitations). In contrast to the conventional Delete() operation, the deletion entry is removed along with the value when the two are lined up in a compaction. Note: The semantics are similar to @igor's prototype that allowed to have this behavior on the granularity of a column family ( https://reviews.facebook.net/D42093 ). This new patch, however, is more aggressive when it comes to removing tombstones: It removes the SingleDelete together with the value whenever there is no snapshot between them while the older patch only did this when the sequence number of the deletion was older than the earliest snapshot. Most of the complex additions are in the Compaction Iterator, all other changes should be relatively straightforward. The patch also includes basic support for single deletions in db_stress and db_bench. Limitations: - Not compatible with cuckoo hash tables - Single deletions cannot be used in combination with merges and normal deletions on the same key (other keys are not affected by this) - Consecutive single deletions are currently not allowed (and older version of this patch supported this so it could be resurrected if needed) Test Plan: make all check Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor Reviewed By: igor Subscribers: maykov, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D43179	2015-09-17 11:42:56 -07:00
Andres Noetzli	3c9cef1eed	Unified maps with Comparator for sorting, other cleanup Summary: This diff is a collection of cleanups that were initially part of D43179. Additionally it adds a unified way of defining key-value maps that use a Comparator for sorting (this was previously implemented in four different places). Test Plan: make clean check all Reviewers: rven, anthony, yhchiang, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D45993	2015-09-02 13:58:22 -07:00
agiardullo	c2f2cb0214	Pessimistic Transactions Summary: Initial implementation of Pessimistic Transactions. This diff contains the api changes discussed in D38913. This diff is pretty large, so let me know if people would prefer to meet up to discuss it. MyRocks folks: please take a look at the API in include/rocksdb/utilities/transaction[_db].h and let me know if you have any issues. Also, you'll notice a couple of TODOs in the implementation of RollbackToSavePoint(). After chatting with Siying, I'm going to send out a separate diff for an alternate implementation of this feature that implements the rollback inside of WriteBatch/WriteBatchWithIndex. We can then decide which route is preferable. Next, I'm planning on doing some perf testing and then integrating this diff into MongoRocks for further testing. Test Plan: Unit tests, db_bench parallel testing. Reviewers: igor, rven, sdong, yhchiang, yoshinorim Reviewed By: sdong Subscribers: hermanlee4, maykov, spetrunia, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D40869	2015-08-11 17:52:23 -07:00
agiardullo	16ea1c7d1c	simple ManagedSnapshot wrapper Summary: Implemented this simple wrapper for something else I was working on. Seemed like it makes sense to expose it instead of burying it in some random code. Test Plan: added test Reviewers: rven, kradhakrishnan, sdong, yhchiang Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D43293	2015-08-06 17:59:05 -07:00
sdong	6a4aaadcd7	Avoid type unique_ptr in LogWriterNumber::writer for Windows build break Summary: Visual Studio complains about deque<LogWriterNumber> because LogWriterNumber is non-copyable for its unique_ptr member writer. Move away from it, and do explit free. It is less safe but I can't think of a better way to unblock it. Test Plan: valgrind check test Reviewers: anthony, IslamAbdelRahman, kolmike, rven, yhchiang Reviewed By: yhchiang Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D43647	2015-08-06 10:52:41 -07:00
Mike Kolupaev	e06cf1a098	[wal changes 3/3] method in DB to sync WAL without blocking writers Summary: Subj. We really need this feature. Previous diff D40899 has most of the changes to make this possible, this diff just adds the method. Test Plan: `make check`, the new test fails without this diff; ran with ASAN, TSAN and valgrind. Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, tnovak, yhchiang, sdong Reviewed By: sdong Subscribers: MarkCallaghan, maykov, hermanlee4, yoshinorim, tnovak, dhruba Differential Revision: https://reviews.facebook.net/D40905	2015-08-05 06:06:39 -07:00
Mike Kolupaev	fe09a6dae3	[wal changes 2/3] write with sync=true syncs previous unsynced wals to prevent illegal data loss Summary: I'll just copy internal task summary here: " This sequence will cause data loss in the middle after an sync write: non-sync write key 1 flush triggered, not yet scheduled sync write key 2 system crash After rebooting, users might see key 2 but not key 1, which violates the API of sync write. This can be reproduced using unit test FaultInjectionTest::DISABLED_WriteOptionSyncTest. One way to fix it is for a sync write, if there is outstanding unsynced log files, we need to syc them too. " This diff should be considered together with the next diff D40905; in isolation this fix probably could be a little simpler. Test Plan: `make check`; added a test for that (DBTest.SyncingPreviousLogs) before noticing FaultInjectionTest.WriteOptionSyncTest (keeping both since mine asserts a bit more); both tests fail without this diff; for D40905 stacked on top of this diff, ran tests with ASAN, TSAN and valgrind Reviewers: rven, yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40899	2015-07-22 03:28:08 -07:00
Igor Canadi	35ca59364c	Don't let flushes preempt compactions Summary: When we first started, max_background_flushes was 0 by default and compaction thread was executing flushes (since there was no flush thread). Then, we switched the default max_background_flushes to 1. However, we still support the case where there is no flush thread and flushes are done in compaction. This is making our code a bit more complicated. By not supporting this use-case we can make our code simpler. We have a special case that when you set max_background_flushes to 0, we schedule the flush to execute on the compaction thread. Test Plan: make check (there might be some unit tests that depend on this behavior) Reviewers: IslamAbdelRahman, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D41931	2015-07-17 12:02:52 -07:00
Igor Canadi	5aea98ddd8	Deprecate WriteOptions::timeout_hint_us Summary: In one of our recent meetings, we discussed deprecating features that are not being actively used. One of those features, at least within Facebook, is timeout_hint. The feature is really nicely implemented, but if nobody needs it, we should remove it from our code-base (until we get a valid use-case). Some arguments: * Less code == better icache hit rate, smaller builds, simpler code * The motivation for adding timeout_hint_us was to work-around RocksDB's stall issue. However, we're currently addressing the stall issue itself (see @sdong's recent work on stall write_rate), so we should never see sharp lock-ups in the future. * Nobody is using the feature within Facebook's code-base. Googling for `timeout_hint_us` also doesn't yield any users. Test Plan: make check Reviewers: anthony, kradhakrishnan, sdong, yhchiang Reviewed By: yhchiang Subscribers: sdong, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D41937	2015-07-14 09:35:48 +02:00
sdong	76d3cd3286	Fix public API dependency on internal codes and dependency on MAX_INT32 Summary: Public API depends on port/port.h which is wrong. Fix it. Also with gcc 4.8.1 build was broken as MAX_INT32 was not recognized. Fix it by using ::max in linux. Test Plan: Build it and try to build an external project on top of it. Reviewers: anthony, yhchiang, kradhakrishnan, igor Reviewed By: igor Subscribers: yoshinorim, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D41745	2015-07-11 10:32:11 -07:00
Dmitri Smirnov	c903ccc4c2	Merge from github/master	2015-07-09 18:01:08 -07:00
Mike Kolupaev	218487d8dc	[wal changes 1/3] fixed unbounded wal growth in some workloads Summary: This fixes the following scenario we've hit: - we reached max_total_wal_size, created a new wal and scheduled flushing all memtables corresponding to the old one, - before the last of these flushes started its column family was dropped; the last background flush call was a no-op; no one removed the old wal from alive_logs_, - hours have passed and no flushes happened even though lots of data was written; data is written to different column families, compactions are disabled; old column families are dropped before memtable grows big enough to trigger a flush; the old wal still sits in alive_logs_ preventing max_total_wal_size limit from kicking in, - a few more hours pass and we run out disk space because of one huge .log file. Test Plan: `make check`; backported the new test, checked that it fails without this diff Reviewers: igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40893	2015-07-02 14:27:00 -07:00
Dmitri Smirnov	18285c1e2f	Windows Port from Microsoft Summary: Make RocksDb build and run on Windows to be functionally complete and performant. All existing test cases run with no regressions. Performance numbers are in the pull-request. Test plan: make all of the existing unit tests pass, obtain perf numbers. Co-authored-by: Praveen Rao praveensinghrao@outlook.com Co-authored-by: Sherlock Huang baihan.huang@gmail.com Co-authored-by: Alex Zinoviev alexander.zinoviev@me.com Co-authored-by: Dmitri Smirnov dmitrism@microsoft.com	2015-07-01 16:13:56 -07:00
Islam AbdelRahman	12e030a992	Use CompactRangeOptions for CompactRange Summary: This diff update DB::CompactRange to use RangeCompactionOptions instead of using multiple parameters Old CompactRange is still available but deprecated Test Plan: make all check make rocksdbjava USE_CLANG=1 make all OPT=-DROCKSDB_LITE make release Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40209	2015-06-17 14:36:14 -07:00
Igor Canadi	25d600569d	Clean up InstallSuperVersion Summary: We go to great lengths to make sure MaybeScheduleFlushOrCompaction() is called outside of write thread. But anyway, it's still called in the mutex, so it's not that much cheaper. This diff removes the "optimization" and cleans up the code a bit. Test Plan: make check Reviewers: rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D40113	2015-06-17 12:37:59 -07:00
sdong	40f562e747	Allow GetApproximateSize() to include mem table size if it is skip list memtable Summary: Add an option in GetApproximateSize() so that the result will include estimated sizes in mem tables. To implement it, implement an estimated count from the beginning to a key in skip list. The approach is to count to find the entry, how many Next() is issued from each level, and sum them with a weight that is <branching factor> ^ <level>. Test Plan: Add a test case Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D40119	2015-06-16 18:13:23 -07:00
sdong	7842920be5	Slow down writes by bytes written Summary: We slow down data into the database to the rate of options.delayed_write_rate (a new option) with this patch. The thread synchronization approach I take is to still synchronize write controller by DB mutex and GetDelay() is inside DB mutex. Try to minimize the frequency of getting time in GetDelay(). I verified it through db_bench and it seems to work hard_rate_limit is deprecated. options.delayed_write_rate is still not dynamically changeable. Need to work on it as a follow-up. Test Plan: Add new unit tests in db_test Reviewers: yhchiang, rven, kradhakrishnan, anthony, MarkCallaghan, igor Reviewed By: igor Subscribers: ikabiljo, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D36351	2015-06-11 20:42:18 -07:00
Islam AbdelRahman	d6ce0f7c61	Add largest sequence to FlushJobInfo Summary: Adding largest sequence number to FlushJobInfo and passing flushed file metadata to NotifyOnFlushCompleted which include alot of other values that we may want to expose in FlushJobInfo Test Plan: make check Reviewers: igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D39927	2015-06-11 15:22:22 -07:00
Yueh-Hsuan Chiang	2e764f06ea	[API Change] Improve EventListener::OnFlushCompleted interface Summary: EventListener::OnFlushCompleted() now passes a structure instead of a list of parameters. This minimizes the API change in the future. Test Plan: listener_test compact_files_test example/compact_files_example Reviewers: kradhakrishnan, sdong, IslamAbdelRahman, rven, igor Reviewed By: rven, igor Subscribers: IslamAbdelRahman, rven, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39543	2015-06-05 12:28:51 -07:00
Islam AbdelRahman	3ce3bb3da2	Allowing L0 -> L1 trivial move on sorted data Summary: This diff updates the logic of how we do trivial move, now trivial move can run on any number of files in input level as long as they are not overlapping The conditions for trivial move have been updated Introduced conditions: - Trivial move cannot happen if we have a compaction filter (except if the compaction is not manual) - Input level files cannot be overlapping Removed conditions: - Trivial move only run when the compaction is not manual - Input level should can contain only 1 file More context on what tests failed because of Trivial move ``` DBTest.CompactionsGenerateMultipleFiles This test is expecting compaction on a file in L0 to generate multiple files in L1, this test will fail with trivial move because we end up with one file in L1 ``` ``` DBTest.NoSpaceCompactRange This test expect compaction to fail when we force environment to report running out of space, of course this is not valid in trivial move situation because trivial move does not need any extra space, and did not check for that ``` ``` DBTest.DropWrites Similar to DBTest.NoSpaceCompactRange ``` ``` DBTest.DeleteObsoleteFilesPendingOutputs This test expect that a file in L2 is deleted after it's moved to L3, this is not valid with trivial move because although the file was moved it is now used by L3 ``` ``` CuckooTableDBTest.CompactionIntoMultipleFiles Same as DBTest.CompactionsGenerateMultipleFiles ``` This diff is based on a work by @sdong https://reviews.facebook.net/D34149 Test Plan: make -j64 check Reviewers: rven, sdong, igor Reviewed By: igor Subscribers: yhchiang, ott, march, dhruba, sdong Differential Revision: https://reviews.facebook.net/D34797	2015-06-04 16:51:25 -07:00
Yueh-Hsuan Chiang	8afafc2783	Fix compile warning in db/db_impl Summary: Fix the following compile warning in db/db_impl db/db_impl.cc:1603:19: error: implicit conversion loses integer precision: 'const uint64_t' (aka 'const unsigned long') to 'int' [-Werror,-Wshorten-64-to-32] info.job_id = job_id; ~ ^~~~~~ Test Plan: db_test Reviewers: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39423	2015-06-02 17:36:45 -07:00
Yueh-Hsuan Chiang	fe5c6321cb	Allow EventListener::OnCompactionCompleted to return CompactionJobStats. Summary: Allow EventListener::OnCompactionCompleted to return CompactionJobStats, which contains useful information about a compaction. Example CompactionJobStats returned by OnCompactionCompleted(): smallest_output_key_prefix 05000000 largest_output_key_prefix 06990000 elapsed_time 42419 num_input_records 300 num_input_files 3 num_input_files_at_output_level 2 num_output_records 200 num_output_files 1 actual_bytes_input 167200 actual_bytes_output 110688 total_input_raw_key_bytes 5400 total_input_raw_value_bytes 300000 num_records_replaced 100 is_manual_compaction 1 Test Plan: Developed a mega test in db_test which covers 20 variables in CompactionJobStats. Reviewers: rven, igor, anthony, sdong Reviewed By: sdong Subscribers: tnovak, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38463	2015-06-02 17:07:16 -07:00
Yueh-Hsuan Chiang	fc83821270	Add EventListener::OnTableFileCreated() Summary: Add EventListener::OnTableFileCreated(), which will be called when a table file is created. This patch is part of the EventLogger and EventListener integration. Test Plan: Augment existing test in db/listener_test.cc Reviewers: anthony, kradhakrishnan, rven, igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38865	2015-06-02 14:12:23 -07:00
sdong	4266d4fd90	Allow users to migrate to options.level_compaction_dynamic_level_bytes=true using CompactRange() Summary: In DB::CompactRange(), change parameter "reduce_level" to "change_level". Users can compact all data to the last level if needed. By doing it, users can migrate the DB to options.level_compaction_dynamic_level_bytes=true. Test Plan: Add a unit test for it. Reviewers: yhchiang, anthony, kradhakrishnan, igor, rven Reviewed By: rven Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D39099	2015-06-01 18:21:14 -07:00
Yueh-Hsuan Chiang	d333820bad	Removed DBImpl::notifying_events_ Summary: DBImpl::notifying_events_ is a internal counter in DBImpl which is used to prevent DB close when DB is notifying events. However, as the current events all rely on either compaction or flush which already have similar counters to prevent DB close, it is safe to remove notifying_events_. Test Plan: listener_test examples/compact_files_example Reviewers: igor, anthony, kradhakrishnan, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39315	2015-06-01 15:32:23 -07:00
agiardullo	dc9d70de65	Optimistic Transactions Summary: Optimistic transactions supporting begin/commit/rollback semantics. Currently relies on checking the memtable to determine if there are any collisions at commit time. Not yet implemented would be a way of enuring the memtable has some minimum amount of history so that we won't fail to commit when the memtable is empty. You should probably start with transaction.h to get an overview of what is currently supported. Test Plan: Added a new test, but still need to look into stress testing. Reviewers: yhchiang, igor, rven, sdong Reviewed By: sdong Subscribers: adamretter, MarkCallaghan, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D33435	2015-05-29 14:36:35 -07:00
Yueh-Hsuan Chiang	a0580205c8	Removed an unused private variable in db_impl.h Summary: Removed an unused private variable in db_impl.h Test Plan: make db_test Reviewers: sdong, anthony, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38925	2015-05-26 10:46:26 -07:00
Yueh-Hsuan Chiang	e2c1d4b57f	[Public API Change] Make DB::GetDbIdentity() be const function. Summary: Make DB::GetDbIdentity() be const function. Test Plan: make db_test Reviewers: igor, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38745	2015-05-21 11:01:48 -07:00
agiardullo	711465ccec	API to fetch from both a WriteBatchWithIndex and the db Summary: Added a couple functions to WriteBatchWithIndex to make it easier to query the value of a key including reading pending writes from a batch. (This is needed for transactions). I created write_batch_with_index_internal.h to use to store an internal-only helper function since there wasn't a good place in the existing class hierarchy to store this function (and it didn't seem right to stick this function inside WriteBatchInternal::Rep). Since I needed to access the WriteBatchEntryComparator, I moved some helper classes from write_batch_with_index.cc into write_batch_with_index_internal.h/.cc. WriteBatchIndexEntry, ReadableWriteBatch, and WriteBatchEntryComparator are all unchanged (just moved to a different file(s)). Test Plan: Added new unit tests. Reviewers: rven, yhchiang, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38037	2015-05-11 14:51:51 -07:00
Laurent Demailly	df4130ad85	fix crashes in stats and compaction filter for db_ttl_impl Summary: fix crashes in stats and compaction filter for db_ttl_impl Test Plan: Ran build with lots of debugging https://reviews.facebook.net/differential/diff/194175/ Reviewers: yhchiang, igor, rven Reviewed By: igor Subscribers: rven, dhruba Differential Revision: https://reviews.facebook.net/D38001	2015-05-05 16:54:47 -07:00
Igor Canadi	1bb4928da9	Include bunch of more events into EventLogger Summary: Added these events: * Recovery start, finish and also when recovery creates a file * Trivial move * Compaction start, finish and when compaction creates a file * Flush start, finish Also includes small fix to EventLogger Also added option ROCKSDB_PRINT_EVENTS_TO_STDOUT which is useful when we debug things. I've spent far too much time chasing LOG files. Still didn't get sst table properties in JSON. They are written very deeply into the stack. I'll address in separate diff. TODO: * Write specification. Let's first use this for a while and figure out what's good data to put here, too. After that we'll write spec * Write tools that parse and analyze LOGs. This can be in python or go. Good intern task. Test Plan: Ran db_bench with ROCKSDB_PRINT_EVENTS_TO_STDOUT. Here's the output: https://phabricator.fb.com/P19811976 Reviewers: sdong, yhchiang, rven, MarkCallaghan, kradhakrishnan, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D37521	2015-04-27 15:20:02 -07:00
Giuseppe Ottaviano	2dc421df48	Implement DB::PromoteL0 method Summary: This diff implements a new `DB` method `PromoteL0` which moves all files in L0 to a given level skipping compaction, provided that the files have disjoint ranges and all levels up to the target level are empty. This method provides finer-grain control for trivial compactions, and it is useful for bulk-loading pre-sorted keys. Compared to D34797, it does not change the semantics of an existing operation, which can impact existing code. PromoteL0 is designed to work well in combination with the proposed `GetSstFileWriter`/`AddFile` interface, enabling to "design" the level structure by populating one level at a time. Such fine-grained control can be very useful for static or mostly-static databases. Test Plan: `make check` Reviewers: IslamAbdelRahman, philipp, MarkCallaghan, yhchiang, igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D37107	2015-04-23 12:10:36 -07:00
Igor Canadi	6059bdf86a	Add experimental API MarkForCompaction() Summary: Some Mongo+Rocks datasets in Parse's environment are not doing compactions very frequently. During the quiet period (with no IO), we'd like to schedule compactions so that our reads become faster. Also, aggressively compacting during quiet periods helps when write bursts happen. In addition, we also want to compact files that are containing deleted key ranges (like old oplog keys). All of this is currently not possible with CompactRange() because it's single-threaded and blocks all other compactions from happening. Running CompactRange() risks an issue of blocking writes because we generate too much Level 0 files before the compaction is over. Stopping writes is very dangerous because they hold transaction locks. We tried running manual compaction once on Mongo+Rocks and everything fell apart. MarkForCompaction() solves all of those problems. This is very light-weight manual compaction. It is lower priority than automatic compactions, which means it shouldn't interfere with background process keeping the LSM tree clean. However, if no automatic compactions need to be run (or we have extra background threads available), we will start compacting files that are marked for compaction. Test Plan: added a new unit test Reviewers: yhchiang, rven, MarkCallaghan, sdong Reviewed By: sdong Subscribers: yoshinorim, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D37083	2015-04-17 16:44:45 -07:00
Igor Canadi	fd3dbef22b	Clean up old log files in background threads Summary: Cleaning up log files can do heavy IO, since we call ftruncate() in the destructor. We don't want to call ftruncate() in user threads. This diff moves cleaning to background threads (flush and compaction) Test Plan: make check, will also run valgrind Reviewers: yhchiang, rven, MarkCallaghan, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D36177	2015-03-30 15:04:10 -04:00
Venkatesh Radhakrishnan	98c37fda5d	Remove unused parameter in CancelAllBackgroundWork Summary: Some suggestions for cleanup from Igor. Test Plan: Regression tests. Reviewers: igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D35169	2015-03-16 21:07:54 -07:00
Igor Canadi	52d8347a91	EventLogger Summary: Here's my proposal for making our LOGs easier to read by machines. The idea is to dump all events as JSON objects. JSON is easy to read by humans, but more importantly, it's easy to read by machines. That way, we can parse this, load into SQLite/mongo and then query or visualize. I started with table_create and table_delete events, but if everybody agrees, I'll continue by adding more events (flush/compaction/etc etc) Test Plan: Ran db_bench. Observed: 2015/01/15-14:13:25.788019 1105ef000 EVENT_LOG_v1 {"time_micros": 1421360005788015, "event": "table_file_creation", "file_number": 12, "file_size": 1909699} 2015/01/15-14:13:25.956500 110740000 EVENT_LOG_v1 {"time_micros": 1421360005956498, "event": "table_file_deletion", "file_number": 12} Reviewers: yhchiang, rven, dhruba, MarkCallaghan, lgalanis, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D31647	2015-03-13 10:15:54 -07:00
Yueh-Hsuan Chiang	2b785d76b8	Fixed a bug where CompactFiles won't delete obsolete files until flush. Summary: Fixed a bug where CompactFiles won't delete obsolete files until flush. Test Plan: ./compact_files_test export ROCKSDB_TESTS=CompactFiles ./db_test Reviewers: rven, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D34671	2015-03-11 13:06:59 -07:00
Venkatesh Radhakrishnan	284be570c8	Provide a mechanism to inform Rocksdb that it is shutting down Summary: Provide an API which enables users to infor Rocksdb that it is shutting down. Test Plan: db_test Reviewers: sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D34617	2015-03-11 10:31:02 -07:00
Igor Sugak	62247ffa3b	rocksdb: Add missing override Summary: When using latest clang (3.6 or 3.7/trunck) rocksdb is failing with many errors. Almost all of them are missing override errors. This diff adds missing override keyword. No manual changes. Prerequisites: bear and clang 3.5 build with extra tools ```lang=bash % USE_CLANG=1 bear make all # generate a compilation database http://clang.llvm.org/docs/JSONCompilationDatabase.html % clang-modernize -p . -include . -add-override % make format ``` Test Plan: Make sure all tests are passing. ```lang=bash % #Use default fb code clang. % make check ``` Verify less error and no missing override errors. ```lang=bash % # Have trunk clang present in path. % ROCKSDB_NO_FBCODE=1 CC=clang CXX=clang++ make ``` Reviewers: igor, kradhakrishnan, rven, meyering, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D34077	2015-02-26 11:28:41 -08:00
Igor Canadi	e7ea51a8e7	Introduce job_id for flush and compaction Summary: It would be good to assing background job their IDs. Two benefits: 1) makes LOGs more readable 2) I might use it in my EventLogger, which will try to make our LOG easier to read/query/visualize Test Plan: ran rocksdb, read the LOG Reviewers: sdong, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D31617	2015-02-12 09:54:48 -08:00
Yueh-Hsuan Chiang	181191a1e4	Add a counter for collecting the wait time on db mutex. Summary: Add a counter for collecting the wait time on db mutex. Also add MutexWrapper and CondVarWrapper for measuring wait time. Test Plan: ./db_test export ROCKSDB_TESTS=MutexWaitStats ./db_test verify stats output using db_bench make clean make release ./db_bench --statistics=1 --benchmarks=fillseq,readwhilewriting --num=10000 --threads=10 Sample output: rocksdb.db.mutex.wait.micros COUNT : 7546866 Reviewers: MarkCallaghan, rven, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D32787	2015-02-04 21:39:45 -08:00
Ori Bernstein	f9758e0129	Add compaction listener. Summary: This adds a listener for compactions, and gives some useful statistics on each compaction pass. Test Plan: Unit tests. Reviewers: sdong, igor, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D31641	2015-01-27 14:44:02 -08:00
sdong	be8f0b12ed	Rename DBImpl::log_dir_unsynced_ to log_dir_synced_ Summary: log_dir_unsynced_ is a confusing name. Rename it to log_dir_synced_ and flip the value. Test Plan: Run ./fault_injection_test Reviewers: rven, yhchiang, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D32235	2015-01-26 16:01:36 -08:00
sdong	d888c95748	Sync WAL Directory and DB Path if different from DB directory Summary: 1. If WAL directory is different from db directory. Sync the directory after creating a log file under it. 2. After creating an SST file, sync its parent directory instead of DB directory. 3. change the check of kResetDeleteUnsyncedFiles in fault_injection_test. Since we changed the behavior to sync log files' parent directory after first WAL sync, instead of creating, kResetDeleteUnsyncedFiles will not guarantee to show post sync updates. Test Plan: make all check Reviewers: yhchiang, rven, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D32067	2015-01-26 14:17:45 -08:00
Yueh-Hsuan Chiang	c91cdd59c1	Allow GetThreadList() to indicate a thread is doing Compaction. Summary: Allow GetThreadList() to indicate a thread is doing Compaction. Test Plan: export ROCKSDB_TESTS=ThreadStatus ./db_test Reviewers: ljin, igor, sdong Reviewed By: sdong Subscribers: leveldb, dhruba, jonahcohen, rven Differential Revision: https://reviews.facebook.net/D30105	2015-01-13 00:04:08 -08:00
Igor Canadi	0acc738810	Speed up FindObsoleteFiles() Summary: There are two versions of FindObsoleteFiles(): * full scan, which is executed every 6 hours (and it's terribly slow) * no full scan, which is executed every time a background process finishes and iterator is deleted This diff is optimizing the second case (no full scan). Here's what we do before the diff: * Get the list of obsolete files (files with ref==0). Some files in obsolete_files set might actually be live. * Get the list of live files to avoid deleting files that are live. * Delete files that are in obsolete_files and not in live_files. After this diff: * The only files with ref==0 that are still live are files that have been part of move compaction. Don't include moved files in obsolete_files. * Get the list of obsolete files (which exclude moved files). * No need to get the list of live files, since all files in obsolete_files need to be deleted. I'll post the benchmark results, but you can get the feel of it here: https://reviews.facebook.net/D30123 This depends on D30123. P.S. We should do full scan only in failure scenarios, not every 6 hours. I'll do this in a follow-up diff. Test Plan: One new unit test. Made sure that unit test fails if we don't have a `if (!f->moved)` safeguard in ~Version. make check Big number of compactions and flushes: ./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000 Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30249	2014-12-22 12:04:45 +01:00
Igor Canadi	fdb6be4e24	Rewritten system for scheduling background work Summary: When scaling to higher number of column families, the worst bottleneck was MaybeScheduleFlushOrCompaction(), which did a for loop over all column families while holding a mutex. This patch addresses the issue. The approach is similar to our earlier efforts: instead of a pull-model, where we do something for every column family, we can do a push-based model -- when we detect that column family is ready to be flushed/compacted, we add it to the flush_queue_/compaction_queue_. That way we don't need to loop over every column family in MaybeScheduleFlushOrCompaction. Here are the performance results: Command: ./db_bench --write_buffer_size=268435456 --db_write_buffer_size=268435456 --db=/fast-rocksdb-tmp/rocks_lots_of_cf --use_existing_db=0 --open_files=55000 --statistics=1 --histogram=1 --disable_data_sync=1 --max_write_buffer_number=2 --sync=0 --benchmarks=fillrandom --threads=16 --num_column_families=5000 --disable_wal=1 --max_background_flushes=16 --max_background_compactions=16 --level0_file_num_compaction_trigger=2 --level0_slowdown_writes_trigger=2 --level0_stop_writes_trigger=3 --hard_rate_limit=1 --num=33333333 --writes=33333333 Before the patch: fillrandom : 26.950 micros/op 37105 ops/sec; 4.1 MB/s After the patch: fillrandom : 17.404 micros/op 57456 ops/sec; 6.4 MB/s Next bottleneck is VersionSet::AddLiveFiles, which is painfully slow when we have a lot of files. This is coming in the next patch, but when I removed that code, here's what I got: fillrandom : 7.590 micros/op 131758 ops/sec; 14.6 MB/s Test Plan: make check two stress tests: Big number of compactions and flushes: ./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000 max_background_flushes=0, to verify that this case also works correctly ./db_stress --threads=30 --ops_per_thread=2000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=3 --max_background_compactions=3 --max_background_flushes=0 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000 Reviewers: ljin, rven, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30123	2014-12-19 20:38:12 +01:00
sdong	d7a486668c	Improve scalability of DB::GetSnapshot() Summary: Now DB::GetSnapshot() doesn't scale to more column families, as it needs to go through all the column families to find whether snapshot is supported. This patch optimizes it. Test Plan: Add unit tests to cover negative cases. make all check Reviewers: yhchiang, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30093	2014-12-11 13:27:57 -08:00
sdong	046ba7d47c	Fix calculation of max_total_wal_size in db_options_.max_total_wal_size == 0 case Summary: This is a regression bug introduced by https://reviews.facebook.net/D24729 . max_total_wal_size would be off the target it should be more and more in the case that the a user holds the current super version after flush or compaction. This patch fixes it Test Plan: make all check Reviewers: yhchiang, rven, igor Reviewed By: igor Subscribers: ljin, yoshinorim, MarkCallaghan, hermanlee4, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D29961	2014-12-08 15:26:35 -08:00
sdong	1f04066cab	Add DBProperty to return number of snapshots and time for oldest snapshot Summary: Add a counter in SnapshotList to show number of snapshots. Also a unix timestamp in every snapshot. Add two DB Properties to return number of snapshots and timestamp of the oldest one. Test Plan: Add unit test checking Reviewers: yhchiang, rven, igor Reviewed By: igor Subscribers: leveldb, dhruba, MarkCallaghan Differential Revision: https://reviews.facebook.net/D29919	2014-12-05 17:07:49 -08:00
Jonah Cohen	a14b7873ee	Enforce write buffer memory limit across column families Summary: Introduces a new class for managing write buffer memory across column families. We supplement ColumnFamilyOptions::write_buffer_size with ColumnFamilyOptions::write_buffer, a shared pointer to a WriteBuffer instance that enforces memory limits before flushing out to disk. Test Plan: Added SharedWriteBuffer unit test to db_test.cc Reviewers: sdong, rven, ljin, igor Reviewed By: igor Subscribers: tnovak, yhchiang, dhruba, xjin, MarkCallaghan, yoshinorim Differential Revision: https://reviews.facebook.net/D22581	2014-12-02 12:09:20 -08:00
Venkatesh Radhakrishnan	004f416b77	Moved checkpoint to utilities Summary: Moved checkpoint to utilities. Addressed comments by Igor, Siying, Dhruba Test Plan: db_test/SnapshotLink Reviewers: dhruba, igor, sdong Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D29079	2014-11-20 15:54:47 -08:00
Yueh-Hsuan Chiang	d0c5f28a5c	Introduce GetThreadList API Summary: Add GetThreadList API, which allows developer to track the status of each process. Currently, calling GetThreadList will only get the list of background threads in RocksDB with their thread-id and thread-type (priority) set. Will add more support on this in the later diffs. ThreadStatus currently has the following properties: // An unique ID for the thread. const uint64_t thread_id; // The type of the thread, it could be ROCKSDB_HIGH_PRIORITY, // ROCKSDB_LOW_PRIORITY, and USER_THREAD const ThreadType thread_type; // The name of the DB instance where the thread is currently // involved with. It would be set to empty string if the thread // does not involve in any DB operation. const std::string db_name; // The name of the column family where the thread is currently // It would be set to empty string if the thread does not involve // in any column family. const std::string cf_name; // The event that the current thread is involved. // It would be set to empty string if the information about event // is not currently available. Test Plan: ./thread_list_test export ROCKSDB_TESTS=GetThreadList ./db_test Reviewers: rven, igor, sdong, ljin Reviewed By: ljin Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D25047	2014-11-20 10:49:32 -08:00
Lei Jin	1e4a45aac8	remove cfd->options() in DBImpl::NotifyOnFlushCompleted Summary: We should not reference cfd->options() directly! Test Plan: make release Reviewers: sdong, rven, igor, yhchiang Reviewed By: igor, yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D29061	2014-11-18 10:19:48 -08:00
Igor Canadi	5c04acda08	Explicitly clean JobContext Summary: This way we can gurantee that old MemTables get destructed before DBImpl gets destructed, which might be useful if we want to make them depend on state from DBImpl. Test Plan: make check with asserts in JobContext's destructor Reviewers: ljin, sdong, yhchiang, rven, jonahcohen Reviewed By: jonahcohen Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28959	2014-11-14 15:43:10 -08:00
Venkatesh Radhakrishnan	6c1b040cc9	Provide openable snapshots Summary: Store links to live files in directory on same disk Test Plan: Take snapshot and open it. Added a test GetSnapshotLink in db_test. Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28713	2014-11-14 11:38:26 -08:00
Igor Canadi	772bc97f13	No CompactFiles in ROCKSDB_LITE Summary: It adds lots of code. Test Plan: compile for iOS, compile for mac. works. Reviewers: rven, sdong, ljin, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28857	2014-11-13 16:45:33 -05:00
Igor Canadi	25f273027b	Fix iOS compile with -Wshorten-64-to-32 Summary: So iOS size_t is 32-bit, so we need to static_cast<size_t> any uint64_t :( Test Plan: TARGET_OS=IOS make static_lib Reviewers: dhruba, ljin, yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28743	2014-11-13 14:39:30 -05:00
Yueh-Hsuan Chiang	28c82ff1b3	CompactFiles, EventListener and GetDatabaseMetaData Summary: This diff adds three sets of APIs to RocksDB. = GetColumnFamilyMetaData = * This APIs allow users to obtain the current state of a RocksDB instance on one column family. * See GetColumnFamilyMetaData in include/rocksdb/db.h = EventListener = * A virtual class that allows users to implement a set of call-back functions which will be called when specific events of a RocksDB instance happens. * To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners = CompactFiles = * CompactFiles API inputs a set of file numbers and an output level, and RocksDB will try to compact those files into the specified level. = Example = * Example code can be found in example/compact_files_example.cc, which implements a simple external compactor using EventListener, GetColumnFamilyMetaData, and CompactFiles API. Test Plan: listener_test compactor_test example/compact_files_example export ROCKSDB_TESTS=CompactFiles db_test export ROCKSDB_TESTS=MetaData db_test Reviewers: ljin, igor, rven, sdong Reviewed By: sdong Subscribers: MarkCallaghan, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D24705	2014-11-07 14:45:18 -08:00
Igor Canadi	53af5d877d	Redesign pending_outputs_ Summary: Here's a prototype of redesigning pending_outputs_. This way, we don't have to expose pending_outputs_ to other classes (CompactionJob, FlushJob, MemtableList). DBImpl takes care of it. Still have to write some comments, but should be good enough to start the discussion. Test Plan: make check, will also run stress test Reviewers: ljin, sdong, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28353	2014-11-07 11:50:34 -08:00
Lei Jin	fd24ae9d05	SetOptions() to return status and also add it to StackableDB Summary: as title Test Plan: ./db_test Reviewers: sdong, yhchiang, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28269	2014-11-04 16:23:05 -08:00
Igor Canadi	74eb4fbe93	CompactionJob Summary: Long awaited CompactionJob class! Move most compaction-related things from DBImpl to CompactionJob, making CompactionJob easier to test and understand. Currently this is just replicating exactly the same functionality with as little as change as possible. As future work, we should: 1. Add CompactionJob tests (I think I'll do that tomorrow) 2. Reduce CompactionJob's state that it inherits from DBImpl 3. Figure out how to do yielding to flush better. Currently I implemented a callback as we agreed yesterday, but I don't think it's a good long term solution. This reduces db_impl.cc from 5000+ LOC to 3400! Test Plan: make check, will add CompactionJob-specific tests, probably also move some tests from db_test to compaction_job_test Reviewers: rven, yhchiang, sdong, ljin Reviewed By: ljin Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D27957	2014-10-31 16:31:25 -07:00
Igor Canadi	635905481d	WalManager Summary: Decoupling code that deals with archived log files outside of DBImpl. That will make this code easier to reason about and test. It will also make the code easier to improve, because an improver doesn't have to understand DBImpl code in entirety. Test Plan: added test Reviewers: ljin, yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D27873	2014-10-29 17:43:37 -07:00
Igor Canadi	a39e931e50	FlushProcess Summary: Abstract out FlushProcess and take it out of DBImpl. This also includes taking DeletionState outside of DBImpl. Currently this diff is only doing the refactoring. Future work includes: 1. Decoupling flush_process.cc, make it depend on less state 2. Write flush_process_test, which will mock out everything that FlushProcess depends on and test it in isolation Test Plan: make check Reviewers: rven, yhchiang, sdong, ljin Reviewed By: ljin Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D27561	2014-10-28 11:54:33 -07:00
Igor Canadi	48842ab316	Deprecate AtomicPointer Summary: RocksDB already depends on C++11, so we might as well all the goodness that C++11 provides. This means that we don't need AtomicPointer anymore. The less things in port/, the easier it will be to port to other platforms. Test Plan: make check + careful visual review verifying that NoBarried got memory_order_relaxed, while Acquire/Release methods got memory_order_acquire and memory_order_release Reviewers: rven, yhchiang, ljin, sdong Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D27543	2014-10-27 14:50:21 -07:00
Lei Jin	dc50a1a593	make max_write_buffer_number dynamic Summary: as title Test Plan: unit test Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D24729	2014-10-16 16:57:59 -07:00
Igor Canadi	cc6c883f59	Stop stopping writes on bg_error_ Summary: This might have caused https://github.com/facebook/rocksdb/issues/345. If we're stopping writes and bg_error comes along, we will never unblock the write. Test Plan: compiles Reviewers: ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D24807	2014-10-13 14:25:55 -07:00
sdong	8ea232b9e3	Add number of records dropped in compaction summary Summary: Add two stats to compaction summary: 1. Total input records from previous level 2. Total number of records dropped after compaction Test Plan: See outputs of printing when runnning locally Reviewers: ljin, igor, MarkCallaghan Reviewed By: MarkCallaghan Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D24411	2014-10-02 17:54:25 -07:00
Lei Jin	5ec53f3edf	make compaction related options changeable Summary: make compaction related options changeable. Most of changes are tedious, following the same convention: grabs MutableCFOptions at the beginning of compaction under mutex, then pass it throughout the job and register it in SuperVersion at the end. Test Plan: make all check Reviewers: igor, yhchiang, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D23349	2014-10-01 16:19:16 -07:00
Lei Jin	3c68006109	CompactedDBImpl Summary: Add a CompactedDBImpl that will enabled when calling OpenForReadOnly() and the DB only has one level (>0) of files. As a performan comparison, CuckooTable performs 2.1M/s with CompactedDBImpl vs. 1.78M/s with ReadOnlyDBImpl. Test Plan: db_bench Reviewers: yhchiang, igor, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D23553	2014-09-25 11:14:01 -07:00
Lei Jin	a062e1f2c4	SetOptions() for memtable related options Summary: as title Test Plan: make all check I will think a way to set up stress test for this Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D23055	2014-09-17 12:49:13 -07:00
Igor Canadi	dee91c259d	WriteThread Summary: This diff just moves the write thread control out of the DBImpl. I will need this as I will control column family data concurrency by only accessing some data in the write thread. That way, we won't have to lock our accesses to column family hash table (mappings from IDs to CFDs). Test Plan: make check Reviewers: sdong, yhchiang, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D23301	2014-09-12 16:23:58 -07:00
Igor Canadi	3d9e6f7759	Push model for flushing memtables Summary: When memtable is full it calls the registered callback. That callback then registers column family as needing the flush. Every write checks if there are some column families that need to be flushed. This completely eliminates the need for MakeRoomForWrite() function and simplifies our Write code-path. There is some complexity with the concurrency when the column family is dropped. I made it a bit less complex by dropping the column family from the write thread in https://reviews.facebook.net/D22965. Let me know if you want to discuss this. Test Plan: make check works. I'll also run db_stress with creating and dropping column families for a while. Reviewers: yhchiang, sdong, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D23067	2014-09-10 18:46:09 -07:00
Stanislau Hlebik	d343c3fe46	Improve db recovery Summary: Avoid creating unnecessary sst files while db opening Test Plan: make all check Reviewers: sdong, igor Reviewed By: igor Subscribers: zagfox, yhchiang, ljin, leveldb Differential Revision: https://reviews.facebook.net/D20661	2014-09-09 11:18:50 -07:00
Lei Jin	171d4ff4a2	remove TailingIterator reference in db_impl.h Summary: as title Test Plan: make release Reviewers: igor Differential Revision: https://reviews.facebook.net/D23073	2014-09-08 15:39:53 -07:00
Igor Canadi	a2bb7c3c33	Push- instead of pull-model for managing Write stalls Summary: Introducing WriteController, which is a source of truth about per-DB write delays. Let's define an DB epoch as a period where there are no flushes and compactions (i.e. new epoch is started when flush or compaction finishes). Each epoch can either: * proceed with all writes without delay * delay all writes by fixed time * stop all writes The three modes are recomputed at each epoch change (flush, compaction), rather than on every write (which is currently the case). When we have a lot of column families, our current pull behavior adds a big overhead, since we need to loop over every column family for every write. With new push model, overhead on Write code-path is minimal. This is just the start. Next step is to also take care of stalls introduced by slow memtable flushes. The final goal is to eliminate function MakeRoomForWrite(), which currently needs to be called for every column family by every write. Test Plan: make check for now. I'll add some unit tests later. Also, perf test. Reviewers: dhruba, yhchiang, MarkCallaghan, sdong, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D22791	2014-09-08 11:20:25 -07:00
Igor Canadi	9f1c80b556	Drop column family from write thread Summary: If we drop column family only from (single) write thread, we can be sure that nobody will drop the column family while we're writing (and our mutex is released). This greatly simplifies my patch that's getting rid of MakeRoomForWrite(). Test Plan: make check, but also running stress test Reviewers: ljin, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D22965	2014-09-05 15:20:05 -07:00
Lei Jin	c9e419ccb6	rename options_ to db_options_ in DBImpl to avoid confusion Summary: as title Test Plan: make release Reviewers: sdong, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D22935	2014-09-05 11:48:17 -07:00
Stanislau Hlebik	45a5e3ede0	Remove path with arena==nullptr from NewInternalIterator Summary: Simply code by removing code path which does not use Arena from NewInternalIterator Test Plan: make all check make valgrind_check Reviewers: sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D22395	2014-09-04 17:40:41 -07:00
Lei Jin	5665e5e285	introduce ImmutableOptions Summary: As a preparation to support updating some options dynamically, I'd like to first introduce ImmutableOptions, which is a subset of Options that cannot be changed during the course of a DB lifetime without restart. ColumnFamily will keep both Options and ImmutableOptions. Any component below ColumnFamily should only take ImmutableOptions in their constructor. Other options should be taken from APIs, which will be allowed to adjust dynamically. I am yet to make changes to memtable and other related classes to take ImmutableOptions in their ctor. That can be done in a seprate diff as this one is already pretty big. Test Plan: make all check Reviewers: yhchiang, igor, sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D22545	2014-09-04 16:18:36 -07:00
Stanislau Hlebik	9dcb75b6d9	Add is-file-deletions-enabled property Summary: Add property 'rocksdb.is-file-deletions-enable' which equals disable_delete_obsole_file_ Test Plan: make all check Reviewers: sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D22119	2014-08-26 16:26:29 -07:00
Lei Jin	384400128f	move block based table related options BlockBasedTableOptions Summary: I will move compression related options in a separate diff since this diff is already pretty lengthy. I guess I will also need to change JNI accordingly :( Test Plan: make all check Reviewers: yhchiang, igor, sdong Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D21915	2014-08-25 14:22:05 -07:00
Feng Zhu	5e642403a9	log db path info before open Summary: 1. write db MANIFEST, CURRENT, IDENTITY, sst files, log files to log before open Test Plan: run db and check LOG file Reviewers: ljin, yhchiang, igor, dhruba, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D21459	2014-08-13 13:45:13 -07:00
Stanislau Hlebik	06a52bda64	Flush only one column family Summary: Currently DBImpl::Flush() triggers flushes in all column families. Instead we need to trigger just the column family specified. Test Plan: make all check Reviewers: igor, ljin, yhchiang, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D20841	2014-08-11 22:10:32 -07:00
miguelportilla	93e6b5e9d9	Changes to support unity build: * Script for building the unity.cc file via Makefile * Unity executable Makefile target for testing builds * Source code changes to fix compilation of unity build	2014-08-11 13:22:47 -04:00
sdong	1242bfcad7	Add DB property "rocksdb.estimate-table-readers-mem" Summary: Add a DB Property "rocksdb.estimate-table-readers-mem" to return estimated memory usage by all loaded table readers, other than allocated from block cache. Refactor the property codes to allow getting property from a version, with DB mutex not acquired. Test Plan: Add several checks of this new property in existing codes for various cases. Reviewers: yhchiang, ljin Reviewed By: ljin Subscribers: xjin, igor, leveldb Differential Revision: https://reviews.facebook.net/D20733	2014-08-06 11:39:46 -07:00
sdong	f04356e660	Add DB::GetIntProperty() to return integer properties to be returned as integers Summary: We have quite some properties that are integers and we are adding more. Add a function to directly return them as an integer, instead of a string Test Plan: Add several unit test checks Reviewers: yhchiang, igor, dhruba, haobo, ljin Reviewed By: ljin Subscribers: yoshinorim, leveldb Differential Revision: https://reviews.facebook.net/D20637	2014-07-28 16:55:57 -07:00
Lei Jin	40fa8a4cd5	make statistics forward-able Summary: Make StatisticsImpl being able to forward stats to provided statistics implementation. The main purpose is to allow us to collect internal stats in the future even when user supplies custom statistics implementation. It avoids intrumenting 2 sets of stats collection code. One immediate use case is tuning advisor, which needs to collect some internal stats, users may not be interested. Test Plan: ran db_bench and see stats show up at the end of run Will run make all check since some tests rely on statistics Reviewers: yhchiang, sdong, igor Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D20145	2014-07-28 12:05:36 -07:00
sdong	f6b7e1ed1a	Allow user to specify DB path of output file of manual compaction Summary: Add a parameter path_id to DB::CompactRange(), to indicate where the output file should be placed to. Test Plan: add a unit test Reviewers: yhchiang, ljin Reviewed By: ljin Subscribers: xjin, igor, dhruba, MarkCallaghan, leveldb Differential Revision: https://reviews.facebook.net/D20085	2014-07-21 19:06:00 -07:00
Lei Jin	f6f1533c6f	make internal stats independent of statistics Summary: also make it aware of column family output from db_bench ``` Compaction Stats [default] Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15 L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06 L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00 L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00 L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00 Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22 Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00 Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown DB Stats Uptime(secs): 202.1 total, 13.5 interval Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written Test Plan: ran it Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19917	2014-07-21 12:57:29 -07:00
Igor Canadi	20c056306b	Remove stats logger Summary: Browsing through the code, looks like StatsLogger is not used at all! Test Plan: compiles Reviewers: ljin, sdong, yhchiang, dhruba Reviewed By: dhruba Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D19827	2014-07-15 09:16:32 -04:00
Yueh-Hsuan Chiang	90a6aca48e	Finer report I/O stats about Flush and Compaction. Summary: This diff allows the I/O stats about Flush and Compaction to be reported in a more accurate way. Instead of measuring the size of a file, it measure I/O cost in per read / write basis. Test Plan: make all check Reviewers: sdong, igor, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19383	2014-07-03 16:28:03 -07:00
Yueh-Hsuan Chiang	d4d338de33	Add timeout_hint_us to WriteOptions and introduce Status::TimeOut. Summary: This diff adds timeout_hint_us to WriteOptions. If it's non-zero, then 1) writes associated with this options MAY be aborted when it has been waiting for longer than the specified time. If an abortion happens, associated writes will return Status::TimeOut. 2) the stall time of the associated write caused by flush or compaction will be limited by timeout_hint_us. The default value of timeout_hint_us is 0 (i.e., OFF.) The statistics of timeout writes will be recorded in WRITE_TIMEDOUT. Test Plan: export ROCKSDB_TESTS=WriteTimeoutAndDelayTest make db_test ./db_test Reviewers: igor, ljin, haobo, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D18837	2014-07-03 15:47:02 -07:00
sdong	2459f7ec4e	Support Multiple DB paths (without having an interface to expose to users) Summary: In this patch, we allow RocksDB to support multiple DB paths internally. No user interface is supported yet so this patch is silent to users. Test Plan: make all check Reviewers: igor, haobo, ljin, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D18921	2014-07-02 21:14:44 -07:00
Igor Canadi	f146cab261	Centralize compression decision to compaction picker Summary: Before this diff, we're deciding enable_compression in CompactionPicker and then we're deciding final compression type in DBImpl. This is kind of confusing. After the diff, the final compression type will be decided in CompactionPicker. The reason for this is that I want CompactFiles() to specify output compression type, so that people can mix and match compression styles in their compaction algorithms. This diff makes it much easier to do that. Test Plan: make check Reviewers: dhruba, haobo, sdong, yhchiang, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19137	2014-07-02 20:40:57 +02:00
Lei Jin	c4e90c79ed	bug fix: iteration over ColumnFamilySet needs to be under mutex Summary: asan_crash_test is failing on segfault Test Plan: running asan_crash_test Reviewers: sdong, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19149	2014-06-19 09:31:14 -07:00
Igor Canadi	99d3eed2fd	Write Fast-path for single column family Summary: We have a perf regression of Write() even with one column family. Make fast path for single column family to avoid the perf regression. See task #4455480 Test Plan: make check Reviewers: sdong, ljin Reviewed By: sdong, ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D18963	2014-06-06 17:26:23 -07:00
sdong	df9069d23f	In DB::NewIterator(), try to allocate the whole iterator tree in an arena Summary: In this patch, try to allocate the whole iterator tree starting from DBIter from an arena 1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it. 2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator. 3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it. Limitations: (1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc (2) Two level iterator itself is allocated in arena, but not iterators inside it. Test Plan: make all check Reviewers: ljin, haobo Reviewed By: haobo Subscribers: leveldb, dhruba, yhchiang, igor Differential Revision: https://reviews.facebook.net/D18513	2014-06-02 17:44:57 -07:00
Igor Canadi	91ddd587cc	Only signal cond variable if need to Summary: At the end of BackgroundCallCompaction(), we call SignalAll(), even though we don't need to. If compaction hasn't done anything and there's another compaction running, there is no need to signal on the condition variable. Doing so creates a tight feedback loop which results in log files like: wait for memtable flush compaction nothing to do wait for memtable flush compaction nothing to do This change eliminates that Test Plan: make check Also: icanadi@dev1440 ~ $ grep "nothing to do" /fast-rocksdb-tmp/rocksdb_test/column_family_test/LOG \| wc -l 7435 icanadi@dev1440 ~ $ grep "nothing to do" /fast-rocksdb-tmp/rocksdb_test/column_family_test/LOG \| wc -l 372 First version is before the change, second version is after the change. Reviewers: dhruba, ljin, haobo, yhchiang, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D18855	2014-06-02 17:23:55 -07:00
Lei Jin	388d2054c7	forward iterator Summary: Forward iterator puts everything together in a flat structure instead of a hierarchy of nested iterators. this should simplify the code and provide better performance. It also enables more optimization since all information are accessiable in one place. Init evaluation shows about 6% improvement Test Plan: db_test and db_bench Reviewers: dhruba, igor, tnovak, sdong, haobo Reviewed By: haobo Subscribers: sdong, leveldb Differential Revision: https://reviews.facebook.net/D18795	2014-05-30 14:31:55 -07:00
Igor Canadi	df70047669	Flush stale column families Summary: Added a new option `max_total_wal_size`. Once the total WAL size goes over that, we make an attempt to flush all column families that still have data in the earliest WAL file. By default, I calculate `max_total_wal_size` dynamically, that should be good-enough for non-advanced customers. Test Plan: Added a test Reviewers: dhruba, haobo, sdong, ljin, yhchiang Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D18345	2014-04-30 14:33:40 -04:00
Yueh-Hsuan Chiang	9d9d2965cb	Add a new mem-table representation based on cuckoo hash. Summary: = Major Changes = * Add a new mem-table representation, HashCuckooRep, which is based cuckoo hash. Cuckoo hash uses multiple hash functions. This allows each key to have multiple possible locations in the mem-table. - Put: When insert a key, it will try to find whether one of its possible locations is vacant and store the key. If none of its possible locations are available, then it will kick out a victim key and store at that location. The kicked-out victim key will then be stored at a vacant space of its possible locations or kick-out another victim. In this diff, the kick-out path (known as cuckoo-path) is found using BFS, which guarantees to be the shortest. - Get: Simply tries all possible locations of a key --- this guarantees worst-case constant time complexity. - Time complexity: O(1) for Get, and average O(1) for Put if the fullness of the mem-table is below 80%. - Default using two hash functions, the number of hash functions used by the cuckoo-hash may dynamically increase if it fails to find a short-enough kick-out path. - Currently, HashCuckooRep does not support iteration and snapshots, as our current main purpose of this is to optimize point access. = Minor Changes = * Add IsSnapshotSupported() to DB to indicate whether the current DB supports snapshots. If it returns false, then DB::GetSnapshot() will always return nullptr. Test Plan: Run existing tests. Will develop a test specifically for cuckoo hash in the next diff. Reviewers: sdong, haobo Reviewed By: sdong CC: leveldb, dhruba, igor Differential Revision: https://reviews.facebook.net/D16155	2014-04-29 17:13:46 -07:00
Igor Canadi	dd9eb7a7d5	Cache result of ReadFirstRecord() Summary: ReadFirstRecord() reads the actual log file from disk on every call. This diff introduces a cache layer on top of ReadFirstRecord(), which should significantly speed up repeated calls to GetUpdatesSince(). I also cleaned up some stuff, but the whole TransactionLogIterator could use some refactoring, especially if we see increased usage. Test Plan: make check Reviewers: haobo, sdong, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D18387	2014-04-29 13:27:58 -04:00
Igor Canadi	8ce5492623	Delete superversion and log outside of mutex Summary: As summary. Add two autovectors that get filled up in MakeRoomForWrite and they get deleted outside of mutex Test Plan: make check Reviewers: dhruba, haobo, ljin, sdong Reviewed By: ljin CC: leveldb Differential Revision: https://reviews.facebook.net/D18249	2014-04-25 14:58:02 -04:00
Igor Canadi	7d838856cf	Fix compile issues when doing make release	2014-04-15 16:00:10 -07:00
Igor Canadi	588bca2020	RocksDBLite Summary: Introducing RocksDBLite! Removes all the non-essential features and reduces the binary size. This effort should help our adoption on mobile. Binary size when compiling for IOS (`TARGET_OS=IOS m static_lib`) is down to 9MB from 15MB (without stripping) Test Plan: compiles :) Reviewers: dhruba, haobo, ljin, sdong, yhchiang Reviewed By: yhchiang CC: leveldb Differential Revision: https://reviews.facebook.net/D17835	2014-04-15 13:39:26 -07:00
Igor Canadi	e6acb874cd	Don't roll empty logs Summary: With multiple column families, especially when manual Flush is executed, we might roll the log file, although the current log file is empty (no data has been written to the log). After the diff, we won't create new log file if current is empty. Next, I will write an algorithm that will flush column families that reference old log files (i.e., that weren't flushed in a while) Test Plan: Added an unit test. Confirmed that unit test failes in master Reviewers: dhruba, haobo, ljin, sdong Reviewed By: ljin CC: leveldb Differential Revision: https://reviews.facebook.net/D17631	2014-04-15 09:57:25 -07:00
Lei Jin	82b37a18bd	thread local for tailing iterator Summary: replace the super version acquisision in tailing itrator with thread local Test Plan: will post results Reviewers: igor, haobo, sdong, yhchiang, dhruba Reviewed By: igor CC: leveldb Differential Revision: https://reviews.facebook.net/D17757	2014-04-14 10:48:01 -07:00
Igor Canadi	d1e2bce42d	CallFlushDuringCompaction	2014-04-07 15:03:15 -07:00
Igor Canadi	3d2fe844ab	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_impl.h db/memtable_list.cc db/version_set.cc	2014-04-07 11:31:11 -07:00
sdong	ea0198fe9a	Create log::Writer out of DB Mutex Summary: Our measurement shows that sometimes new log::Write's constructor can take hundreds of milliseconds. It's unclear why but just simply move it out of DB mutex. Test Plan: make all check Reviewers: haobo, ljin, igor Reviewed By: haobo CC: nkg-, yhchiang, leveldb Differential Revision: https://reviews.facebook.net/D17487	2014-04-04 15:46:28 -07:00
sdong	b9767d0e09	Move several more logging inside DB mutex to log buffer Summary: Move several some common logging still in DB mutex to log buffer. Test Plan: make all check Reviewers: haobo, igor, ljin, nkg- Reviewed By: nkg- CC: nkg-, yhchiang, leveldb Differential Revision: https://reviews.facebook.net/D17439	2014-04-03 10:47:18 -07:00
Haobo Xu	48bc0c6ad3	[RocksDB] Fix a race condition in GetSortedWalFiles Summary: This patch fixed a race condition where a log file is moved to archived dir in the middle of GetSortedWalFiles. Without the fix, the log file would be missed in the result, which leads to transaction log iterator gap. A test utility SyncPoint is added to help reproducing the race condition. Test Plan: TransactionLogIteratorRace; make check Reviewers: dhruba, ljin Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D17121	2014-04-02 22:12:29 -07:00
Igor Canadi	ddbd1ece88	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_test.cc db/internal_stats.cc db/internal_stats.h db/version_edit.cc db/version_edit.h db/version_set.cc include/rocksdb/options.h util/options.cc	2014-03-31 13:39:24 -07:00
Haobo Xu	a92194e5b2	[RocksDB] Add db property "rocksdb.cur-size-active-mem-table" Summary: as title Test Plan: db_test Reviewers: sdong Reviewed By: sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D17217	2014-03-27 15:14:04 -07:00
Igor Canadi	e8168382c4	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc include/rocksdb/options.h util/options.cc	2014-03-25 11:09:40 -07:00
Danny Guo	b47812fba6	[rocksdb] new CompactionFilterV2 API Summary: This diff adds a new CompactionFilterV2 API that roll up the decisions of kv pairs during compactions. These kv pairs must share the same key prefix. They are buffered inside the db. typedef std::vector<Slice> SliceVector; virtual std::vector<bool> Filter(int level, const SliceVector& keys, const SliceVector& existing_values, std::vector<std::string>* new_values, std::vector<bool>* values_changed ) const = 0; Application can override the Filter() function to operate on the buffered kv pairs. More details in the inline documentation. Test Plan: make check. Added unit tests to make sure Keep, Delete, Change all works. Reviewers: haobo CCs: leveldb Differential Revision: https://reviews.facebook.net/D15087	2014-03-24 20:47:53 -07:00
Igor Canadi	ac328a86b9	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_test.cc	2014-03-20 14:41:37 -07:00
Igor Canadi	e67241f0b9	Sanity check on Open Summary: Everytime a client opens a DB, we do a sanity check that: * checks the existance of all the necessary files * verifies that file sizes are correct Some of the code was stolen from https://reviews.facebook.net/D16935 Test Plan: added a unit test Reviewers: dhruba, haobo, sdong Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D17097	2014-03-20 14:18:29 -07:00
Igor Canadi	e20fa3f8a4	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/internal_stats.cc db/internal_stats.h db/version_set.cc	2014-03-19 17:22:20 -07:00
sdong	71e6a34271	Add a DB property to indicate number of background errors encountered Summary: Add a property to calculate number of background errors encountered to help users build their monitoring Test Plan: Add a unit test. make all check Reviewers: haobo, igor, dhruba Reviewed By: igor CC: ljin, nkg-, yhchiang, leveldb Differential Revision: https://reviews.facebook.net/D16959	2014-03-18 14:28:30 -07:00
Igor Canadi	3055a15b29	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/version_edit.cc db/version_edit.h db/version_set.cc	2014-03-18 13:24:27 -07:00
Igor Canadi	ae25742af9	Fix race condition in manifest roll Summary: When the manifest is getting rolled the following happens: 1) manifest_file_number_ is assigned to a new manifest number (even though the old one is still current) 2) mutex is unlocked 3) SetCurrentFile() creates temporary file manifest_file_number_.dbtmp 4) SetCurrentFile() renames manifest_file_number_.dbtmp to CURRENT 5) mutex is locked If FindObsoleteFiles happens between (3) and (4) it will: 1) Delete manifest_file_number_.dbtmp (because it's not in pending_outputs_) 2) Delete old manifest (because the manifest_file_number_ already points to a new one) I introduce the concept of prev_manifest_file_number_ that will avoid the race condition. However, we should discuss the future of MANIFEST file rolling. We found some race conditions with it last week and who knows how many more are there. Nobody is using it in production because we don't trust the implementation. Should we even support it? Test Plan: make check Reviewers: ljin, dhruba, haobo, sdong Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16929	2014-03-17 21:50:15 -07:00
Igor Canadi	dff9214165	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc tools/db_stress.cc	2014-03-12 10:17:41 -07:00
sdong	bd45633b71	Fix data race against logging data structure because of LogBuffer Summary: @igor pointed out that there is a potential data race because of the way we use the newly introduced LogBuffer. After "bg_compaction_scheduled_--" or "bg_flush_scheduled_--", they can both become 0. As soon as the lock is released after that, DBImpl's deconstructor can go ahead and deconstruct all the states inside DB, including the info_log object hold in a shared pointer of the options object it keeps. At that point it is not safe anymore to continue using the info logger to write the delayed logs. With the patch, lock is released temporarily for log buffer to be flushed before "bg_compaction_scheduled_--" or "bg_flush_scheduled_--". In order to make sure we don't miss any pending flush or compaction, a new flag bg_schedule_needed_ is added, which is set to be true if there is a pending flush or compaction but not scheduled because of the max thread limit. If the flag is set to be true, the scheduling function will be called before compaction or flush thread finishes. Thanks @igor for this finding! Test Plan: make all check Reviewers: haobo, igor Reviewed By: haobo CC: dhruba, ljin, yhchiang, igor, leveldb Differential Revision: https://reviews.facebook.net/D16767	2014-03-11 16:09:53 -07:00
Igor Canadi	457c78eb89	[CF] db_stress for column families Summary: I had this diff for a while to test column families implementation. Last night, I ran it sucessfully for 10 hours with the command: time ./db_stress --threads=30 --ops_per_thread=200000000 --max_key=5000 --column_families=20 --clear_column_family_one_in=3000000 --verify_before_write=1 --reopen=50 --max_background_compactions=10 --max_background_flushes=10 --db=/tmp/db_stress It is ready to be committed :) Test Plan: Ran it for 10 hours Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16797	2014-03-11 12:06:12 -07:00
Igor Canadi	9634ba42ac	Merge branch 'master' into columnfamilies Conflicts: db/compaction_picker.cc db/db_impl.cc db/db_impl.h db/tailing_iter.cc db/version_set.h include/rocksdb/options.h util/options.cc	2014-03-10 17:26:09 -07:00
Haobo Xu	9e0e6aa7f6	[RocksDB] make sure KSVObsolete does not get accessed as a valid pointer. Summary: KSVObsolete is no longer nullptr and needs to be checked explicitly. Also did some minor code cleanup and added a stat counter to track superversion cleanups incurred in the foreground. Test Plan: make check Reviewers: ljin Reviewed By: ljin CC: leveldb Differential Revision: https://reviews.facebook.net/D16701	2014-03-10 12:55:25 -07:00
Haobo Xu	66da467983	[RocksDB] LogBuffer Cleanup Summary: Moved LogBuffer class to an internal header. Removed some unneccesary indirection. Enabled log buffer for BackgroundCallFlush. Forced log buffer flush right after Unlock to improve time ordering of info log. Test Plan: make check; db_bench compare LOG output Reviewers: sdong Reviewed By: sdong CC: leveldb, igor Differential Revision: https://reviews.facebook.net/D16707	2014-03-10 11:05:44 -07:00
Igor Canadi	9f15092ebd	[CF] NewIterators Summary: Adding the last missing function -- NewIterators(). Pretty simple implementation Test Plan: added a unit test Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16689	2014-03-07 16:15:25 -08:00
Lei Jin	e5fa4944fc	use CAS when returning SuperVersion to ThreadLocal Summary: Add a check at the end of GetImpl to release SuperVersion if it becomes obsolete. Also do Scrape() inside InstallSuperVersion so it happens more frequent. Test Plan: make all check running asan_check now Reviewers: igor, haobo, sdong, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16641	2014-03-07 14:43:22 -08:00
Igor Canadi	80a207fc90	Merge branch 'master' into columnfamilies Conflicts: db/compaction_picker.cc db/compaction_picker.h db/db_impl.cc db/version_set.cc db/version_set.h include/rocksdb/options.h util/options.cc	2014-03-05 16:59:22 -08:00
sdong	ecb1ffa2a8	Buffer info logs when picking compactions and write them out after releasing the mutex Summary: Now while the background thread is picking compactions, it writes out multiple info_logs, especially for universal compaction, which introduces a chance of waiting log writing in mutex, which is bad. To remove this risk, write all those info logs to a buffer and flush it after releasing the mutex. Test Plan: make all check check the log lines while running some tests that trigger compactions. Reviewers: haobo, igor, dhruba Reviewed By: dhruba CC: i.am.jin.lei, dhruba, yhchiang, leveldb, nkg- Differential Revision: https://reviews.facebook.net/D16515	2014-03-05 15:36:32 -08:00
Igor Canadi	9d0577a6be	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_impl.h db/transaction_log_impl.cc db/transaction_log_impl.h include/rocksdb/options.h util/env.cc util/options.cc	2014-03-03 18:29:03 -08:00
Yueh-Hsuan Chiang	a77527f2af	Add ReadOptions to TransactionLogIterator. Summary: Add an optional input parameter ReadOptions to DB::GetUpdateSince(), which allows the verification of checksums to be disabled by setting ReadOptions::verify_checksums to false. Test Plan: Tests are done off-line and will not be included in the regular unit test. Reviewers: igor Reviewed By: igor CC: leveldb, xjin, dhruba Differential Revision: https://reviews.facebook.net/D16305	2014-02-28 11:50:36 -08:00
Lei Jin	ad0c3747cb	cache SuperVersion in thread local storage to avoid mutex lock Summary: as title Test Plan: asan_check will post results later Reviewers: haobo, igor, dhruba, sdong Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16257	2014-02-27 11:38:55 -08:00
Igor Canadi	9bce2b2a84	[CF] Fix lint errors in CF code Summary: Big CF diff uncovered some lint errors. This diff fixes some of them. Not much to see here Test Plan: make check Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16347	2014-02-26 10:10:00 -08:00
Igor Canadi	5a91746277	log file is uint64_t	2014-02-25 12:57:43 -08:00
Igor Canadi	b69e7d99d5	[CF] Better handling of memtable logs Summary: DBImpl now keeps a list of alive_log_files_. On every FindObsoleteFiles, it deletes all alive log files that are smaller than versions_->MinLogNumber() Test Plan: make check passes no specific unit tests yet, will add Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16293	2014-02-25 09:55:13 -08:00
Igor Canadi	422bb09cb0	Fix table properties Summary: Adapt table properties to column family world Test Plan: make check Reviewers: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D16161	2014-02-14 17:13:10 -08:00
Igor Canadi	76c048183c	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_test.cc include/rocksdb/db.h	2014-02-14 16:46:03 -08:00
Igor Canadi	c67d48c852	[CF] DB test to run on non-default column family Summary: This is a huge diff and it was hectic, but the idea is actually quite simple. Every operation (Put, Get, etc.) done on default column family in DBTest is now forwarded to non-default ("pikachu"). The good news is that we had zero test failures! Column families look stable so far. One interesting test that I adapted for column families is MultiThreadedTest. I replaced every Put() with a WriteBatch writing to all column families concurrently. Every Put in the write batch contains unique_id. Instead of Get() I do a multiget across all column families with the same key. If atomicity holds, I expect to see the same unique_id in all column families. Test Plan: This is a test! Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D16149	2014-02-14 16:08:59 -08:00
kailiu	63690625cd	Expose the table properties to application Summary: Provide a public API for users to access the table properties for each SSTable. Test Plan: Added a unit tests to test the function correctness under differnet conditions. Reviewers: haobo, dhruba, sdong Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16083	2014-02-13 16:28:21 -08:00
Igor Canadi	ccdb93e775	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_impl.h db/memtable_list.cc db/memtable_list.h db/version_set.cc db/version_set.h	2014-02-12 14:01:30 -08:00
Igor Canadi	b06840aa7d	[CF] Rethinking ColumnFamilyHandle and fix to dropping column families Summary: The change to the public behavior: * When opening a DB or creating new column family client gets a ColumnFamilyHandle. * As long as column family handle is alive, client can do whatever he wants with it, even drop it * Dropped column family can still be read from (using the column family handle) * Added a new call CloseColumnFamily(). Client has to close all column families that he has opened before deleting the DB * As soon as column family is closed, any calls to DB using that column family handle will fail (also any outstanding calls) Internally: * Ref-counting ColumnFamilyData * New thread-safety for ColumnFamilySet * Dropped column families are now completely dropped and their memory cleaned-up Test Plan: added some tests to column_family_test Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D16101	2014-02-12 13:47:09 -08:00
Lei Jin	5fbf2ef42d	preload table handle on Recover() when max_open_files == -1 Summary: This covers existing table files before DB open happens and avoids contention on table cache Test Plan: db_test Reviewers: haobo, sdong, igor, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16089	2014-02-12 10:43:27 -08:00
Igor Canadi	0143abdbb0	Merge branch 'master' into columnfamilies Conflicts: HISTORY.md db/db_impl.cc db/db_impl.h db/db_iter.cc db/db_test.cc db/dbformat.h db/memtable.cc db/memtable_list.cc db/memtable_list.h db/table_cache.cc db/table_cache.h db/version_edit.h db/version_set.cc db/version_set.h db/write_batch.cc db/write_batch_test.cc include/rocksdb/options.h util/options.cc	2014-02-06 15:58:20 -08:00
kailiu	84f8185fc0	Merge branch 'master' into performance Conflicts: HISTORY.md db/db_impl.cc db/memtable.cc	2014-02-05 21:21:00 -08:00
Igor Canadi	f276e0e59d	[CF] Options -> DBOptions Summary: Replaced most of occurrences of Options with more specific DBOptions. This brings us very close to supporting different configuration options for each column family. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15933	2014-02-05 14:56:09 -08:00
Igor Canadi	c24d8c4e90	[CF] Rethink table cache Summary: Adapting table cache to column families is interesting. We want table cache to be global LRU, so if some column families are use not as often as others, we want them to be evicted from cache. However, current TableCache object also constructs tables on its own. If table is not found in the cache, TableCache automatically creates new table. We want each column family to be able to specify different table factory. To solve the problem, we still have a single LRU, but we provide the LRUCache object to TableCache on construction. We have one TableCache per column family, but the underyling cache is shared by all TableCache objects. This allows us to have a global LRU, but still be able to support different table factories for different column families. Also, in the future it will also be able to support different directories for different column families. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15915	2014-02-05 11:55:30 -08:00
Igor Canadi	7b9f134959	[CF] Move InternalStats to ColumnFamilyData Summary: InternalStats is a messy thing, keeping both DB data and column family data. However, it's better off living in ColumnFamilyData than in DBImpl. For now, at least. Test Plan: make check Reviewers: dhruba, kailiu, haobo, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15879	2014-02-04 17:59:50 -08:00
Igor Canadi	73f62255c1	[CF] Split SanitizeOptions into two Summary: There are three SanitizeOption-s now : one for DBOptions, one for ColumnFamilyOptions and one for Options (which just calls the other two) I have also reshuffled some options -- table_cache options and info_log should live in DBOptions, for example. Test Plan: make check doesn't complain Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15873	2014-02-04 17:26:51 -08:00
Igor Canadi	5e2c4fe766	Get rid of DBImpl::user_comparator() Summary: user_comparator() is a Column Family property, not DBImpl Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15855	2014-02-03 17:50:47 -08:00
Igor Canadi	0e22badc08	[column families] Iterator and MultiGet Summary: Support for different column families in Iterator and MultiGet code path. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15849	2014-02-03 17:44:40 -08:00
Lei Jin	5b3b6549d6	use super_version in NewIterator() and MultiGet() function Summary: Use super_version insider NewIterator to avoid Ref() each component separately under mutex The new added bench shows NewIterator QPS increases from 515K to 719K No meaningful improvement for multiget I guess due to its relatively small cost comparing to 90 keys fetch in the test. Test Plan: unit test and db_bench Reviewers: igor, sdong Reviewed By: igor CC: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D15609	2014-02-03 13:13:36 -08:00
Igor Canadi	27a8856c23	Compacting column families Summary: This diff enables non-default column families to get compacted both automatically and also by calling CompactRange() Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15813	2014-01-31 19:54:03 -08:00
Igor Canadi	3615f534d1	Enable flushing memtables from arbitrary column families Summary: Removed default_cfd_ from all flush code paths. This means we can now flush memtables from arbitrary column families! Test Plan: Added a new unit test Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15789	2014-01-31 14:42:52 -08:00
Igor Canadi	6973bb1722	MakeRoomForWrite() support for column families Summary: Making room for write will be the hardest part of the column family implementation. For now, I just iterate through all column families and run MakeRoomForWrite() for every one. Test Plan: make check does not complain Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15597	2014-01-30 16:12:08 -08:00
Igor Canadi	c37e7de669	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_impl.h	2014-01-30 11:47:23 -08:00
Igor Canadi	ac92420fc5	Merge branch 'master' into performance Conflicts: db/db_impl.h	2014-01-30 10:09:23 -08:00
Igor Canadi	3c0dcf0e25	InternalStatistics Summary: In DBImpl we keep track of some statistics internally and expose them via GetProperty(). This diff encapsulates all the internal statistics into a class InternalStatisics. Most of it is copy/paste. Apart from cleaning up db_impl.cc, this diff is also necessary for Column families, since every column family should have its own CompactionStats, MakeRoomForWrite-stall stats, etc. It's much easier to keep track of it in every column family if it's nicely encapsulated in its own class. Test Plan: make check Reviewers: dhruba, kailiu, haobo, sdong, emayanke Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15273	2014-01-29 20:40:41 -08:00
Igor Canadi	fa99d53e55	Change ColumnFamilyData from struct to class Summary: ColumnFamilyData grew a lot, there's much more data that it holds now. It makes more sense to encapsulate it better by making it a class. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15579	2014-01-29 15:18:36 -08:00
Igor Canadi	f24a3ee52d	Read from and write to different column families Summary: This one is big. It adds ability to write to and read from different column families (see the unit test). It also supports recovery of different column families from log, which was the hardest part to reason about. We need to make sure to never delete the log file which has unflushed data from any column family. To support that, I added another concept, which is versions_->MinLogNumber() Test Plan: Added a unit test in column_family_test Reviewers: dhruba, haobo, sdong, kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15537	2014-01-29 11:38:16 -08:00
kailiu	a5e220f5ef	Merge branch 'master' into performance Conflicts: Makefile db/db_impl.cc db/db_test.cc db/memtable_list.cc db/memtable_list.h table/block_based_table_reader.cc table/table_test.cc util/cache.cc util/coding.cc	2014-01-28 10:35:55 -08:00
Igor Canadi	4bf25357ae	[column families] Removing VersionSet::current() Summary: Instead of VersionSet::current(), DBImpl uses default_cfd_->current directly. Test Plan: make check Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15483	2014-01-27 17:04:46 -08:00
Igor Canadi	eb055609e4	[column families] Move memtable and immutable memtable list to column family data Summary: All memtables and immutable memtables are moved from DBImpl to ColumnFamilyData. For now, they are all referenced from default column family in DBImpl. It shouldn't be hard to get them from custom column family. Test Plan: make check Reviewers: dhruba, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15459	2014-01-27 13:37:14 -08:00
Igor Canadi	ae16606f98	Merge branch 'master' into columnfamilies Conflicts: db/version_set.cc db/version_set.h	2014-01-27 11:11:51 -08:00
Igor Canadi	832158e7f7	Fsync directory after we create a new file Summary: @dhruba, I'm not sure where we need to sync the directory. I implemented the function in Env() and added the dir sync just after we close the newly created file in the builder. Should I also add FsyncDir() to new files that get created by a compaction? Test Plan: Confirmed that FsyncDir is returning Status::OK() Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D14751	2014-01-27 11:02:21 -08:00
Igor Canadi	1423e7c9de	Merge branch 'master' into columnfamilies Conflicts: db/version_set.cc db/version_set_reduce_num_levels.cc util/ldb_cmd.cc	2014-01-24 15:03:54 -08:00
Igor Canadi	c583157d49	MemTableListVersion Summary: MemTableListVersion is to MemTableList what Version is to VersionSet. I took almost the same ideas to develop MemTableListVersion. The reason is to have copying std::list done in background, while flushing, rather than in foreground (MultiGet() and NewIterator()) under a mutex! Also, whenever we copied MemTableList, we copied also some MemTableList metadata (flush_requested_, commit_in_progress_, etc.), which was wasteful. This diff avoids std::list copy under a mutex in both MultiGet() and NewIterator(). I created a small database with some number of immutable memtables, and creating 100.000 iterators in a single-thread (!) decreased from {188739, 215703, 198028} to {154352, 164035, 159817}. A lot of the savings come from code under a mutex, so we should see much higher savings with multiple threads. Creating new iterator is very important to LogDevice team. I also think this diff will make SuperVersion obsolete for performance reasons. I will try it in the next diff. SuperVersion gave us huge savings on Get() code path, but I think that most of the savings came from copying MemTableList under a mutex. If we had MemTableListVersion, we would never need to copy the entire object (like we still do in NewIterator() and MultiGet()) Test Plan: `make check` works. I will also do `make valgrind_check` before commit Reviewers: dhruba, haobo, kailiu, sdong, emayanke, tnovak Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15255	2014-01-24 14:52:08 -08:00
Igor Canadi	28d1a0c6f5	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_impl.h db/db_impl_readonly.h db/db_test.cc include/rocksdb/db.h include/utilities/stackable_db.h	2014-01-24 09:27:29 -08:00
Lei Jin	aba2acb5ec	CompactRange() to return status Summary: as title Test Plan: make all check What else tests shall I cover? Reviewers: igor, haobo CC: Differential Revision: https://reviews.facebook.net/D15339	2014-01-23 16:41:46 -08:00
Kai Liu	054c5dda8c	Merge branch 'master' into performance Conflicts: db/db_impl.cc db/db_test.cc db/memtable.cc db/version_set.cc include/rocksdb/statistics.h util/statistics_imp.h	2014-01-23 16:32:49 -08:00
Tomislav Novak	81c9cc9b3b	Tailing iterator Summary: This diff implements a special type of iterator that doesn't create a snapshot (can be used to read newly inserted data) and is optimized for doing sequential reads. TailingIterator uses current superversion number to determine whether to invalidate its internal iterators. If the version hasn't changed, it can often avoid doing expensive seeks over immutable structures (sst files and immutable memtables). Test Plan: * new unit tests * running LD with this patch Reviewers: igor, dhruba, haobo, sdong, kailiu Reviewed By: sdong CC: leveldb, lovro, march Differential Revision: https://reviews.facebook.net/D15285	2014-01-23 16:26:08 -08:00
Igor Canadi	92a022ad07	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_impl.h db/db_impl_readonly.cc db/version_set.cc	2014-01-22 10:59:07 -08:00
Igor Canadi	6fe9b57748	Refactor Recover() code Summary: This diff does two things: * Rethinks how we call Recover() with read_only option. Before, we call it with pointer to memtable where we'd like to apply those changes to. This memtable is set in db_impl_readonly.cc and it's actually DBImpl::mem_. Why don't we just apply updates to mem_ right away? It seems more intuitive. * Changes when we apply updates to manifest. Before, the process is to recover all the logs, flush it to sst files and then do one giant commit that atomically adds all recovered sst files and sets the next log number. This works good enough, but causes some small troubles for my column family approach, since I can't have one VersionEdit apply to more than single column family[1]. The change here is to commit the files recovered from logs right away. Here is the state of the world before the change: 1. Recover log 5, add new sst files to edit 2. Recover log 7, add new sst files to edit 3. Recover log 8, add new sst files to edit 4. Commit all added sst files to manifest and mark log files 5, 7 and 8 as recoverd (via SetLogNumber(9) function) After the change, we'll do: 1. Recover log 5, commit the new sst files and set log 5 as recovered 2. Recover log 7, commit the new sst files and set log 7 as recovered 3. Recover log 8, commit the new sst files and set log 8 as recovered The added (small) benefit is that if we fail after (2), the new recovery will only have to recover log 8. In previous case, we'll have to restart the recovery from the beginning. The bigger benefit will be to enable easier integration of multiple column families in Recovery code path. [1] I'm happy to dicuss this decison, but I believe this is the cleanest way to go. It also makes backward compatibility much easier. We don't have a requirement of adding multiple column families atomically. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15237	2014-01-22 10:45:26 -08:00
Igor Canadi	23f6791c9e	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_impl_readonly.cc db/db_test.cc db/version_edit.cc db/version_edit.h db/version_set.cc db/version_set.h db/version_set_reduce_num_levels.cc	2014-01-21 17:01:52 -08:00
Igor Canadi	8079dd5d24	Merge branch 'master' into performance	2014-01-17 12:20:07 -08:00
Mark Callaghan	439e36db21	Fix SlowdownAmount Summary: This had a few bugs. 1) bottom and top were reversed. top is for the max value but the callers were passing the max value to bottom. The result is that the max sleep is used when n >= bottom. 2) one of the callers passed values with type double and these values are frequently between 1.0 and 2.0 so rounding will do some bad things 3) sometimes the function returned 0 when there should be a stall With this change and one other diff (out for review soon) there are slightly fewer stalls on one workload. With the fix. Stalls(secs): 160.166 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 58.495 leveln_slowdown Stalls(count): 910261 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 54526 leveln_slowdown Without the fix. Stalls(secs): 172.227 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 56.538 leveln_slowdown Stalls(count): 160831 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 52845 leveln_slowdown Task ID: # Blame Rev: Test Plan: run db_bench for --benchmarks=overwrite with IO-bound database Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15243	2014-01-17 10:15:30 -08:00
kailiu	1304d8c8ce	Merge branch 'master' into performance Conflicts: Makefile db/db_impl.cc db/db_impl.h db/db_test.cc db/memtable.cc db/memtable.h db/version_edit.h db/version_set.cc include/rocksdb/options.h util/hash_skiplist_rep.cc util/options.cc	2014-01-15 23:12:31 -08:00
Igor Canadi	2f4eda7890	Move functions from VersionSet to Version Summary: There were some functions in VersionSet that had no reason to be there instead of Version. Moving them to Version will make column families implementation easier. The functions moved are: * NumLevelBytes * LevelSummary * LevelFileSummary * MaxNextLevelOverlappingBytes * AddLiveFiles (previously AddLiveFilesCurrentVersion()) * NeedSlowdownForNumLevel0Files The diff continues on (and depends on) D15171 Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong, emayanke Reviewed By: sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15183	2014-01-15 16:18:04 -08:00
Igor Canadi	d9cd7a063f	Fix CompactRange to apply filter to every key Summary: When doing CompactRange(), we should first flush the memtable and then calculate max_level_with_files. Also, we want to compact all the levels that have files, including level `max_level_with_files`. This patch fixed the unit test. Test Plan: Added a failing unit test and a fix, so it's not failing anymore. Reviewers: dhruba, haobo, sdong Reviewed By: haobo CC: leveldb, xjin Differential Revision: https://reviews.facebook.net/D14421	2014-01-14 16:19:09 -08:00
Igor Canadi	7d9f21cf23	BuildBatchGroup -- memcpy outside of lock Summary: When building batch group, don't actually build a new batch since it requires heavy-weight mem copy and malloc. Only store references to the batches and build the batch group without lock held. Test Plan: `make check` I am also planning to run performance tests. The workload that will benefit from this change is readwhilewriting. I will post the results once I have them. Reviewers: dhruba, haobo, kailiu Reviewed By: haobo CC: leveldb, xjin Differential Revision: https://reviews.facebook.net/D15063	2014-01-14 14:49:31 -08:00
Igor Canadi	151f9e144f	Merge branch 'master' into columnfamilies	2014-01-13 09:09:51 -08:00
Mark Callaghan	50994bf699	Don't always compress L0 files written by memtable flush Summary: Code was always compressing L0 files written by a memtable flush when compression was enabled. Now this is done when min_level_to_compress=0 for leveled compaction and when universal_compaction_size_percent=-1 for universal compaction. Task ID: #3416472 Blame Rev: Test Plan: ran db_bench with compression options Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: dhruba, igor, sdong Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14757	2014-01-07 21:50:26 -08:00
Igor Canadi	72918efffe	[column families] Implement DB::OpenWithColumnFamilies() Summary: In addition to implementing OpenWithColumnFamilies, this diff also includes some minor changes: * Changed all column family names from Slice() to std::string. The performance of column family name handling is not critical, and it's more convenient and cleaner to have names as std::strings * Implemented ColumnFamilyOptions(const Options&) and DBOptions(const Options&) * Added ColumnFamilyOptions to VersionSet::ColumnFamilyData. ColumnFamilyOptions are specified on OpenWithColumnFamilies() and CreateColumnFamily() I will keep the diff in the Phabricator for a day or two and will push to the branch then. Feel free to comment even after the diff has been pushed. Test Plan: Added a simple unit test Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15033	2014-01-07 11:05:50 -08:00
Igor Canadi	d3a2ba9c64	Merge branch 'master' into columnfamilies	2014-01-07 11:05:03 -08:00
Igor Canadi	17a222670b	Merge branch 'master' into performance	2014-01-07 11:04:21 -08:00
Tomislav Novak	9f690ec62c	Fix a deadlock in CompactRange() Summary: The way DBImpl::TEST_CompactRange() throttles down the number of bg compactions can cause it to deadlock when CompactRange() is called concurrently from multiple threads. Imagine a following scenario with only two threads (max_background_compactions is 10 and bg_compaction_scheduled_ is initially 0): 1. Thread #1 increments bg_compaction_scheduled_ (to LargeNumber), sets bg_compaction_scheduled_ to 9 (newvalue), schedules the compaction (bg_compaction_scheduled_ is now 10) and waits for it to complete. 2. Thread #2 calls TEST_CompactRange(), increments bg_compaction_scheduled_ (now LargeNumber + 10) and waits on a cv for bg_compaction_scheduled_ to drop to LargeNumber. 3. BG thread completes the first manual compaction, decrements bg_compaction_scheduled_ and wakes up all threads waiting on bg_cv_. Thread #1 runs, increments bg_compaction_scheduled_ by LargeNumber again (now 2*LargeNumber + 9). Since that's more than LargeNumber + newvalue, thread #2 also goes to sleep (waiting on bg_cv_), without resetting bg_compaction_scheduled_. This diff attempts to address the problem by introducing a new counter bg_manual_only_ (when positive, MaybeScheduleFlushOrCompaction() will only schedule manual compactions). Test Plan: I could pretty much consistently reproduce the deadlock with a program that calls CompactRange(nullptr, nullptr) immediately after Write() from multiple threads. This no longer happens with this patch. Tests (make check) pass. Reviewers: dhruba, igor, sdong, haobo Reviewed By: igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14799	2014-01-07 10:37:34 -08:00
Igor Canadi	ef6ad1708d	[column families] Support to create and drop column families Summary: This diff provides basic implementations of CreateColumnFamily(), DropColumnFamily() and ListColumnFamilies(). It builds on top of https://reviews.facebook.net/D14733 It also includes a bug fix for DBImplReadOnly, where Get implementation would be redirected to DBImpl instead of DBImplReadOnly. Test Plan: Added unit test Reviewers: dhruba, haobo, kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15021	2014-01-03 01:12:16 -08:00
Kai Liu	774ed89c24	Replace vector with autovector Summary: this diff only replace the cases when we need to frequently create vector with small amount of entries. This diff doesn't aim to improve performance of a specific area, but more like a small scale test for the autovector and see how it works in real life. Test Plan: make check I also ran the performance tests, however there is no performance gain/loss. All performance numbers are pretty much the same before/after the change. Reviewers: dhruba, haobo, sdong, igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14985	2014-01-02 16:43:35 -08:00
kailiu	e72aa37cc5	Merge branch 'master' into performance Conflicts: db/table_cache.cc	2014-01-02 16:34:59 -08:00
kailiu	476416c27c	Some minor refactoring on the code Summary: I made some cleanup while reading the source code in `db`. Most changes are about style, naming or C++ 11 new features. Test Plan: ran `make check` Reviewers: haobo, dhruba, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15009	2014-01-02 16:32:31 -08:00
Igor Canadi	6de1b5b83e	Merge branch 'master' into columnfamilies	2014-01-02 04:18:07 -08:00
Igor Canadi	b60c14f6ee	Support multi-threaded DisableFileDeletions() and EnableFileDeletions() Summary: We don't want two threads to clash if they concurrently call DisableFileDeletions() and EnableFileDeletions(). I'm adding a counter that will enable file deletions only after all DisableFileDeletions() calls have been negated with EnableFileDeletions(). However, we also don't want to break the old behavior, so I added a parameter force to EnableFileDeletions(). If force is true, we will still enable file deletions after every call to EnableFileDeletions(), which is what is happening now. Test Plan: make check Reviewers: dhruba, haobo, sanketh Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14781	2014-01-02 03:33:42 -08:00
kailiu	f1cec73a76	Merge branch 'master' into performance Conflicts: db/db_impl.cc db/db_test.cc db/memtable.cc db/version_set.cc include/rocksdb/statistics.h	2013-12-27 12:23:17 -08:00
Igor Canadi	1fdb3f7dc6	[RocksDB] Optimize locking for Get Summary: Instead of locking and saving a DB state, we can cache a DB state and update it only when it changes. This change reduces lock contention and speeds up read operations on the DB. Performance improvements are substantial, although there is some cost in no-read workloads. I ran the regression tests on my devserver and here are the numbers: overwrite 56345 -> 63001 fillseq 193730 -> 185296 readrandom 771301 -> 1219803 (58% improvement!) readrandom_smallblockcache 677609 -> 862850 readrandom_memtable_sst 710440 -> 1109223 readrandom_fillunique_random 221589 -> 247869 memtablefillrandom 105286 -> 92643 memtablereadrandom 763033 -> 1288862 Test Plan: make asan_check I am also running db_stress Reviewers: dhruba, haobo, sdong, kailiu Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D14679	2013-12-20 09:57:58 -08:00
Igor Canadi	9385a5247e	[RocksDB] [Column Family] Interface proposal Summary: <This diff is for Column Family branch> Sharing some of the work I've done so far. This diff compiles and passes the tests. The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all. Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility. There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on] Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families. Please provide feedback. Test Plan: make check works, the code is backward compatible Reviewers: dhruba, haobo, sdong, kailiu, emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D14445	2013-12-18 13:08:22 -08:00
Mark Callaghan	e9e6b00d29	Add monitoring for universal compaction and add counters for compaction IO Summary: Adds these counters { WAL_FILE_SYNCED, "rocksdb.wal.synced" } number of writes that request a WAL sync { WAL_FILE_BYTES, "rocksdb.wal.bytes" }, number of bytes written to the WAL { WRITE_DONE_BY_SELF, "rocksdb.write.self" }, number of writes processed by the calling thread { WRITE_DONE_BY_OTHER, "rocksdb.write.other" }, number of writes not processed by the calling thread. Instead these were processed by the current holder of the write lock { WRITE_WITH_WAL, "rocksdb.write.wal" }, number of writes that request WAL logging { COMPACT_READ_BYTES, "rocksdb.compact.read.bytes" }, number of bytes read during compaction { COMPACT_WRITE_BYTES, "rocksdb.compact.write.bytes" }, number of bytes written during compaction Per-interval stats output was updated with WAL stats and correct stats for universal compaction including a correct value for write-amplification. It now looks like: Compactions Level Files Size(MB) Score Time(sec) Read(MB) Write(MB) Rn(MB) Rnp1(MB) Wnew(MB) RW-Amplify Read(MB/s) Write(MB/s) Rn Rnp1 Wnp1 NewW Count Ln-stall Stall-cnt -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 0 7 464 46.4 281 3411 3875 3411 0 3875 2.1 12.1 13.8 621 0 240 240 628 0.0 0 Uptime(secs): 310.8 total, 2.0 interval Writes cumulative: 9999999 total, 9999999 batches, 1.0 per batch, 1.22 ingest GB WAL cumulative: 9999999 WAL writes, 9999999 WAL syncs, 1.00 writes per sync, 1.22 GB written Compaction IO cumulative (GB): 1.22 new, 3.33 read, 3.78 write, 7.12 read+write Compaction IO cumulative (MB/sec): 4.0 new, 11.0 read, 12.5 write, 23.4 read+write Amplification cumulative: 4.1 write, 6.8 compaction Writes interval: 100000 total, 100000 batches, 1.0 per batch, 12.5 ingest MB WAL interval: 100000 WAL writes, 100000 WAL syncs, 1.00 writes per sync, 0.01 MB written Compaction IO interval (MB): 12.49 new, 14.98 read, 21.50 write, 36.48 read+write Compaction IO interval (MB/sec): 6.4 new, 7.6 read, 11.0 write, 18.6 read+write Amplification interval: 101.7 write, 102.9 compaction Stalls(secs): 142.924 level0_slowdown, 0.000 level0_numfiles, 0.805 memtable_compaction, 0.000 leveln_slowdown Stalls(count): 132461 level0_slowdown, 0 level0_numfiles, 3 memtable_compaction, 0 leveln_slowdown Task ID: #3329644, #3301695 Blame Rev: Test Plan: Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14583	2013-12-12 13:27:43 -08:00
kailiu	a82f42b765	rename db/memtablelist.{h,cc}	2013-12-10 19:03:13 -08:00
Igor Canadi	fb9fce4fc3	[RocksDB] BackupableDB Summary: In this diff I present you BackupableDB v1. You can easily use it to backup your DB and it will do incremental snapshots for you. Let's first describe how you would use BackupableDB. It's inheriting StackableDB interface so you can easily construct it with your DB object -- it will add a method RollTheSnapshot() to the DB object. When you call RollTheSnapshot(), current snapshot of the DB will be stored in the backup dir. To restore, you can just call RestoreDBFromBackup() on a BackupableDB (which is a static method) and it will restore all files from the backup dir. In the next version, it will even support automatic backuping every X minutes. There are multiple things you can configure: 1. backup_env and db_env can be different, which is awesome because then you can easily backup to HDFS or wherever you feel like. 2. sync - if true, it guarantees backup consistency on machine reboot 3. number of snapshots to keep - this will keep last N snapshots around if you want, for some reason, be able to restore from an earlier snapshot. All the backuping is done in incremental fashion - if we already have 00010.sst, we will not copy it again. IMPORTANT -- This is based on assumption that 00010.sst never changes - two files named 00010.sst from the same DB will always be exactly the same. Is this true? I always copy manifest, current and log files. 4. You can decide if you want to flush the memtables before you backup, or you're fine with backing up the log files -- either way, you get a complete and consistent view of the database at a time of backup. 5. More things you can find in BackupableDBOptions Here is the directory structure I use: backup_dir/CURRENT_SNAPSHOT - just 4 bytes holding the latest snapshot 0, 1, 2, ... - files containing serialized version of each snapshot - containing a list of files files/*.sst - sst files shared between snapshots - if one snapshot references 00010.sst and another one needs to backup it from the DB, it will just reference the same file files/ 0/, 1/, 2/, ... - snapshot directories containing private snapshot files - current, manifest and log files All the files are ref counted and deleted immediatelly when they get out of scope. Some other stuff in this diff: 1. Added GetEnv() method to the DB. Discussed with @haobo and we agreed that it seems right thing to do. 2. Fixed StackableDB interface. The way it was set up before, I was not able to implement BackupableDB. Test Plan: I have a unittest, but please don't look at this yet. I just hacked it up to help me with debugging. I will write a lot of good tests and update the diff. Also, `make asan_check` Reviewers: dhruba, haobo, emayanke Reviewed By: dhruba CC: leveldb, haobo Differential Revision: https://reviews.facebook.net/D14295	2013-12-09 14:06:52 -08:00
Mayank Agarwal	18802689b8	Make an API to get database identity from the IDENTITY file Summary: This would enable rocksdb users to get the db identity without depending on implementation details(storing that in IDENTITY file) Test Plan: db/db_test (has identity checks) Reviewers: dhruba, haobo, igor, kailiu Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14463	2013-12-04 22:39:17 -08:00
Igor Canadi	043fc14c3e	Get rid of some shared_ptrs Summary: I went through all remaining shared_ptrs and removed the ones that I found not-necessary. Only GenerateCachePrefix() is called fairly often, so don't expect much perf wins. The ones that are left are accessed infrequently and I think we're fine with keeping them. Test Plan: make asan_check Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14427	2013-12-03 11:17:58 -08:00
Dhruba Borthakur	98968ba937	Free obsolete memtables outside the dbmutex had a memory leak. Summary: The commit at `27bbef1180` had a memory leak that was detected by valgrind. The memtable that has a refcount decrement in MemTableList::InstallMemtableFlushResults was not freed. Test Plan: valgrind ./db_test --leak-check=full Reviewers: igor Reviewed By: igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14391	2013-11-28 10:25:22 -08:00
Igor Canadi	3ce3658411	DB::GetOptions() Summary: We need access to options for BackupableDB Test Plan: make check Reviewers: dhruba Reviewed By: dhruba CC: leveldb, reconnect.grayhat Differential Revision: https://reviews.facebook.net/D14331	2013-11-25 15:51:50 -08:00
Igor Canadi	11c26bd4a4	[RocksDB] Interface changes required for BackupableDB Summary: This is part of https://reviews.facebook.net/D14295 -- smaller diff that is easier to review Test Plan: make asan_check Reviewers: dhruba, haobo, emayanke Reviewed By: emayanke CC: leveldb, kailiu, reconnect.grayhat Differential Revision: https://reviews.facebook.net/D14301	2013-11-25 12:39:23 -08:00
Igor Canadi	a0ce3fd00a	PurgeObsoleteFiles() unittest Summary: Created a unittest that verifies that automatic deletion performed by PurgeObsoleteFiles() works correctly. Also, few small fixes on the logic part -- call version_set_->GetObsoleteFiles() in FindObsoleteFiles() instead of on some arbitrary positions. Test Plan: Created a unit test Reviewers: dhruba, haobo, nkg- Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D14079	2013-11-14 18:03:57 -08:00
Igor Canadi	9bc4a26f56	Small changes in Deleting obsolete files Summary: @haobo's suggestions from https://reviews.facebook.net/D13827 Renaming some variables, deprecating purge_log_after_flush, changing for loop into auto for loop. I have not implemented deleting objects outside of mutex yet because it would require a big code change - we would delete object in db_impl, which currently does not know anything about object because it's defined in version_edit.h (FileMetaData). We should do it at some point, though. Test Plan: Ran deletefile_test Reviewers: haobo Reviewed By: haobo CC: leveldb, haobo Differential Revision: https://reviews.facebook.net/D14025	2013-11-12 11:53:26 -08:00
Dhruba Borthakur	318a4919d2	Fix valgrind check by initialising DeletionState. Summary: The valgrind error was introduced by commit `1510339e52`. Initialize DeletionState in constructor. Test Plan: valgrind --leak-check=yes ./deletefile_test Reviewers: igor, kailiu Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D13983	2013-11-11 16:01:13 -08:00
Igor Canadi	1510339e52	Speed up FindObsoleteFiles Summary: Here's one solution we discussed on speeding up FindObsoleteFiles. Keep a set of all files in DBImpl and update the set every time we create a file. I probably missed few other spots where we create a file. It might speed things up a bit, but makes code uglier. I don't really like it. Much better approach would be to abstract all file handling to a separate class. Think of it as layer between DBImpl and Env. Having a separate class deal with file namings and deletion would benefit both code cleanliness (especially with huge DBImpl) and speed things up. It will take a huge effort to do this, though. Let's discuss offline today. Test Plan: Ran ./db_stress, verified that files are getting deleted Reviewers: dhruba, haobo, kailiu, emayanke Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D13827	2013-11-08 15:23:46 -08:00
shamdor	c2be2cba04	WAL log retention policy based on archive size. Summary: Archive cleaning will still happen every WAL_ttl seconds but archived logs will be deleted only if archive size is greater then a WAL_size_limit value. Empty archived logs will be deleted evety WAL_ttl. Test Plan: 1. Unit tests pass. 2. Benchmark. Reviewers: emayanke, dhruba, haobo, sdong, kailiu, igor Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D13869	2013-11-06 18:46:28 -08:00
Mayank Agarwal	f837f5b1c9	Making the transaction log iterator more robust Summary: strict essentially means that we MUST find the startsequence. Thus we should return if starteSequence is not found in the first file in case strict is set. This will take care of ending the iterator in case of permanent gaps due to corruptions in the log files Also created NextImpl function that will have internal variable to distinguish whether Next is being called from StartSequence or by application. Set NotFoudn::gaps status to give an indication of gaps happeneing. Polished the inline documentation at various places Test Plan: * db_repl_stress test * db_test relating to transaction log iterator * fbcode/wormhole/rocksdb/rocks_log_iterator * sigma production machine sigmafio032.prn1 Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13689	2013-11-04 20:49:03 -08:00
Dhruba Borthakur	b4ad5e89ae	Implement a compressed block cache. Summary: Rocksdb can now support a uncompressed block cache, or a compressed block cache or both. Lookups first look for a block in the uncompressed cache, if it is not found only then it is looked up in the compressed cache. If it is found in the compressed cache, then it is uncompressed and inserted into the uncompressed cache. It is possible that the same block resides in the compressed cache as well as the uncompressed cache at the same time. Both caches have their own individual LRU policy. Test Plan: Unit test case attached. Reviewers: kailiu, sdong, haobo, leveldb Reviewed By: haobo CC: xjin, haobo Differential Revision: https://reviews.facebook.net/D12675	2013-11-01 14:31:35 -07:00
Siying Dong	f03b2df010	Follow-up Cleaning-up After D13521 Summary: This patch is to address @haobo's comments on D13521: 1. rename Table to be TableReader and make its factory function to be GetTableReader 2. move the compression type selection logic out of TableBuilder but to compaction logic 3. more accurate comments 4. Move stat name constants into BlockBasedTable implementation. 5. remove some uncleaned codes in simple_table_db_test Test Plan: pass test suites. Reviewers: haobo, dhruba, kailiu Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13785	2013-10-30 10:52:33 -07:00
Mayank Agarwal	56305221c4	Unify DeleteFile and DeleteWalFiles Summary: This is to simplify rocksdb public APIs and improve the code quality. Created an additional parameter to ParseFileName for log sub type and improved the code for deleting a wal file. Wrote exhaustive unit-tests in delete_file_test Unification of other redundant APIs can be taken up in a separate diff Test Plan: Expanded delete_file test Reviewers: dhruba, haobo, kailiu, sdong Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13647	2013-10-25 08:32:14 -07:00
Siying Dong	9edda37027	Universal Compaction to Have a Size Percentage Threshold To Decide Whether to Compress Summary: This patch adds a option for universal compaction to allow us to only compress output files if the files compacted previously did not yet reach a specified ratio, to save CPU costs in some cases. Compression is always skipped for flushing. This is because the size information is not easy to evaluate for flushing case. We can improve it later. Test Plan: add test DBTest.UniversalCompactionCompressRatio1 and DBTest.UniversalCompactionCompressRatio12 Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13467	2013-10-17 13:33:39 -07:00
Dhruba Borthakur	9cd221094c	Add appropriate LICENSE and Copyright message. Summary: Add appropriate LICENSE and Copyright message. Test Plan: make check Reviewers: CC: Task ID: # Blame Rev:	2013-10-16 17:48:41 -07:00
Siying Dong	073cbfc8f0	Enable background flush thread by default and fix issues related to it Summary: Enable background flush thread in this patch and fix unit tests with: (1) After background flush, schedule a background compaction if condition satisfied; (2) Fix a bug that if universal compaction is enabled and number of levels are set to be 0, compaction will not be automatically triggered (3) Fix unit tests to wait for compaction to finish instead of flush, before checking the compaction results. Test Plan: pass all unit tests Reviewers: haobo, xjin, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13461	2013-10-16 13:32:53 -07:00
Siying Dong	88f2f89068	Change Function names from Compaction->Flush When they really mean Flush Summary: When I debug the unit test failures when enabling background flush thread, I feel the function names can be made clearer for people to understand. Also, if the names are fixed, in many places, some tests' bugs are obvious (and some of those tests are failing). This patch is to clean it up for future maintenance. Test Plan: Run test suites. Reviewers: haobo, dhruba, xjin Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13431	2013-10-14 15:12:15 -07:00
Dhruba Borthakur	4463b11cad	Migrate names of properties from 'leveldb' prefix to 'rocksdb' prefix. Summary: Migrate names of properties from 'leveldb' prefix to 'rocksdb' prefix. Test Plan: make check Reviewers: emayanke, haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13311	2013-10-06 00:14:26 -07:00
Dhruba Borthakur	0a9f873f4b	Removed scribe, thrift and java modules. Summary: Removed scribe, thrift and java modules. Test Plan: make release make check Reviewers: emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D13293	2013-10-04 15:36:00 -07:00
Dhruba Borthakur	a143ef9b38	Change namespace from leveldb to rocksdb Summary: Change namespace from leveldb to rocksdb. This allows a single application to link in open-source leveldb code as well as rocksdb code into the same process. Test Plan: compile rocksdb Reviewers: emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D13287	2013-10-04 11:59:26 -07:00
Mayank Agarwal	854d236361	Add backward compatible option in GetLiveFiles to choose whether to not Flush first Summary: As explained in comments in GetLiveFiles in db.h, this option will cause flush to be skipped in GetLiveFiles because some use-cases use GetSortedWalFiles after GetLiveFiles to generate more complete snapshots. Using GetSortedWalFiles after GetLiveFiles allows us to not Flush in GetLiveFiles first because wals have everything. Note: file deletions will be disabled before calling GLF or GSWF so live logs will not move to archive logs or get delted. Note: Manifest file is truncated to a proper value in GLF, so it will always reply from the proper wal files on a restart Test Plan: make Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13257	2013-10-04 10:20:10 -07:00
Haobo Xu	fa798e9e28	[Rocksdb] Submit mem table flush job in a different thread pool Summary: As title. This is just a quick hack and not ready for commit. fails a lot of unit test. I will test/debug it directly in ViewState shadow . Test Plan: Try it in shadow test. Reviewers: dhruba, xjin CC: leveldb Differential Revision: https://reviews.facebook.net/D12933	2013-10-03 14:37:19 -07:00

... 3 4 5 6 7 ...

533 Commits