rocksdb

mirror of https://github.com/facebook/rocksdb.git synced 2024-11-27 02:44:18 +00:00

Author	SHA1	Message	Date
Islam AbdelRahman	c70a9335de	Fix mutex unlock issue between scheduled compaction and ReleaseCompactionFiles() Summary: NotifyOnCompactionCompleted can unlock the mutex. That mean that we can schedule a background compaction that will start before we ReleaseCompactionFiles(). Test Plan: added unittest existing unittest Reviewers: yhchiang, sdong Reviewed By: sdong Subscribers: yoshinorim, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58065	2016-05-18 14:56:30 -07:00
Aaron Gao	43afd72bee	[rocksdb] make more options dynamic Summary: make more ColumnFamilyOptions dynamic: - compression - soft_pending_compaction_bytes_limit - hard_pending_compaction_bytes_limit - min_partial_merge_operands - report_bg_io_stats - paranoid_file_checks Test Plan: Add sanity check in `db_test.cc` for all above options except for soft_pending_compaction_bytes_limit and hard_pending_compaction_bytes_limit. All passed. Reviewers: andrewkr, sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57519	2016-05-17 13:11:56 -07:00
Islam AbdelRahman	f6aedb62c0	Fix Transaction memory leak Summary: - Make sure we clean up recovered_transactions_ on DBImpl destructor - delete leaked txns and env in TransactionTest Test Plan: Run transaction_test under valgrind Reviewers: sdong, andrewkr, yhchiang, horuff Reviewed By: horuff Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58263	2016-05-16 16:32:55 -07:00
Reid Horuff	a400336398	TransactionLogIterator sequence gap fix Summary: DBTestXactLogIterator.TransactionLogIterator was failing due the sequence gaps. This was caused by an off-by-one error when calculating the new sequence number after recovering from logs. Test Plan: db_log_iter_test Reviewers: andrewkr Subscribers: andrewkr, hermanlee4, dhruba, IslamAbdelRahman Differential Revision: https://reviews.facebook.net/D58053	2016-05-12 13:54:08 -07:00
Reid Horuff	c27061dae7	[rocksdb] 2PC double recovery bug fix Summary: 1. prepare() 2. crash 3. recover 4. commit() 5. crash 6. data is lost This is due to the transaction data still only residing in the WAL but because the logs were flushed on the first recovery the data is ignored on the second recovery. We must scan all logs found on recovery and only ignore redundant data at the time of replay. It is not possible to know which logs still contain relevant data at time of recovery. We cannot simply ignore a log because all of the non-2pc data it contains has already been written to L0. The changes made to MemTableInserter are to ensure that prepared sections are still recovered even if all of the non-2pc data in that log has already been flushed to L0. Test Plan: Provided test. Reviewers: sdong Subscribers: andrewkr, hermanlee4, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57729	2016-05-10 14:06:07 -07:00
Reid Horuff	a657ee9a9c	[rocksdb] Recovery path sequence miscount fix Summary: Consider the following WAL with 4 batch entries prefixed with their sequence at time of memtable insert. [1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(a)] [1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(b)] [4: COMMIT(a)] [7: COMMIT(b)] The first two batches do not consume any sequence numbers so are both prefixed with seq=1. For 2pc commit, memtable insertion takes place before COMMIT batch is written to WAL. We can see that sequence number consumption takes place between WAL entries giving us the seemingly sparse sequence prefix for WAL entries. This is a valid WAL. Because with 2PC markers one WriteBatch points to another batch containing its inserts a writebatch can consume more or less sequence numbers than the number of sequence consuming entries that it contains. We can see that, given the entries in the WAL, 6 sequence ids were consumed. Yet on recovery the maximum sequence consumed would be 7 + 3 (the number of sequence numbers consumed by COMMIT(b)) So, now upon recovery we must track the actual consumption of sequence numbers. In the provided scenario there will be no sequence gaps, but it is possible to produce a sequence gap. This should not be a problem though. correct? Test Plan: provided test. Reviewers: sdong Subscribers: andrewkr, leveldb, dhruba, hermanlee4 Differential Revision: https://reviews.facebook.net/D57645	2016-05-10 14:06:07 -07:00
Reid Horuff	1b8a2e8fdd	[rocksdb] Memtable Log Referencing and Prepared Batch Recovery Summary: This diff is built on top of WriteBatch modification: https://reviews.facebook.net/D54093 and adds the required functionality to rocksdb core necessary for rocksdb to support 2PC. modfication of DBImpl::WriteImpl() - added two arguments uint64_t log_used = nullptr, uint64_t log_ref = 0; - log_used is an output argument which will return the log number which the incoming batch was inserted into, 0 if no WAL insert took place. - log_ref is a supplied log_number which all memtables inserted into will reference after the batch insert takes place. This number will reside in 'FindMinPrepLogReferencedByMemTable()' until all Memtables insertinto have flushed. - Recovery/writepath is now aware of prepared batches and commit and rollback markers. Test Plan: There is currently no test on this diff. All testing of this functionality takes place in the Transaction layer/diff but I will add some testing. Reviewers: IslamAbdelRahman, sdong Subscribers: leveldb, santoshb, andrewkr, vasilep, dhruba, hermanlee4 Differential Revision: https://reviews.facebook.net/D56919	2016-05-10 14:06:07 -07:00
Islam AbdelRahman	4b31723433	Add bottommost_compression option Summary: Add a new option that can be used to set a specific compression algorithm for bottommost level. This option will only affect levels larger than base level. I have also updated CompactionJobInfo to include the compression algorithm used in compaction Test Plan: added new unittest existing unittests Reviewers: andrewkr, yhchiang, sdong Reviewed By: sdong Subscribers: lightmark, andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D57669	2016-05-09 15:57:19 -07:00
Yi Wu	a92049e3e7	Added EventListener::OnTableFileCreationStarted() callback Summary: Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status. Test Plan: unit test. Reviewers: dhruba, yhchiang, ott, sdong Reviewed By: sdong Subscribers: sdong, kradhakrishnan, IslamAbdelRahman, andrewkr, yhchiang, leveldb, ott, dhruba Differential Revision: https://reviews.facebook.net/D56337	2016-04-29 11:35:00 -07:00
sdong	992a8f83b7	Not enable jemalloc status printing if USE_CLANG=1 Summary: Warning is printed out with USE_CLANG=1 when including jemalloc.h. Disable it in that case. Test Plan: Run db_bench with USE_CLANG=1 and not. Make sure they can all build and jemalloc status is printed out in the case where USE_CLANG is not set. Reviewers: andrewkr, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57399	2016-04-28 16:16:14 -07:00
Islam AbdelRahman	0850bc5147	Fix build on machines without jemalloc Summary: It looks like we mistakenly enable JEMALLOC even if it's not available on the machine, that's why travis is failing Test Plan: check on my devserver check on my mac Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57345	2016-04-27 18:25:19 -07:00
Sergey Makarenko	1c80dfab24	Print memory allocation counters Summary: Introduced option to dump malloc statistics using new option flag. Added new command line option to db_bench tool to enable this funtionality. Also extended build to support environments with/without jemalloc. Test Plan: 1) Build rocksdb using `make` command. Launch the following command `./db_bench --benchmarks=fillrandom --dump_malloc_stats=true --num=10000000` end verified that jemalloc dump is present in LOG file. 2) Build rocksdb using `DISABLE_JEMALLOC=1 make db_bench -j32` and ran the same db_bench tool and found the following message in LOG file: "Please compile with jemalloc to enable malloc dump". 3) Also built rocksdb using `make` command on MacOS to verify behavior in non-FB environment. Also to debug build configuration change temporary changed AM_DEFAULT_VERBOSITY = 1 in Makefile to see compiler and build tools output. For case 1) -DROCKSDB_JEMALLOC was present in compiler command line. For both 2) and 3) this flag was not present. Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57321	2016-04-27 16:23:33 -07:00
sdong	ac0e54b4c6	CompactedDB should not be used if there is outstanding WAL files Summary: CompactedDB skips memtable. So we shouldn't use compacted DB if there is outstanding WAL files. Test Plan: Change to options.max_open_files = -1 perf context test to create a compacted DB, which we shouldn't do. Reviewers: yhchiang, kradhakrishnan, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57057	2016-04-26 14:22:07 -07:00
Yueh-Hsuan Chiang	24110ce90c	Correct Statistics FLUSH_WRITE_BYTES Summary: In https://reviews.facebook.net/D56271, we fixed an issue where we consider flush as compaction. However, that makes us mistakenly count FLUSH_WRITE_BYTES twice (one in flush_job and one in db_impl.) This patch removes the one incremented in db_impl. Test Plan: db_test Reviewers: yiwu, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57111	2016-04-25 12:01:01 -07:00
sdong	4b6833aec1	Rename options.compaction_measure_io_stats to options.report_bg_io_stats and include flush too. Summary: It is useful to print out IO stats in flush jobs too. Extend options.compaction_measure_io_stats to flush jobs and raname it. Test Plan: Try db_bench and see the stats are printed out. Reviewers: yhchiang Reviewed By: yhchiang Subscribers: kradhakrishnan, yiwu, IslamAbdelRahman, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56769	2016-04-15 10:22:18 -07:00
Jay Edgar	2448f80375	Make sure that if use_mmap_reads is on use_os_buffer is also on Summary: The code assumes that if use_mmap_reads is on then use_os_buffer is also on. This make sense as by using memory mapped files for reading you are expecting the OS to cache what it needs. Add code to make sure the user does not turn off use_os_buffer when they turn on use_mmap_reads Test Plan: New test: DBTest.MMapAndBufferOptions Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56397	2016-04-08 14:30:15 -07:00
Andrew Kryczka	2391ef7214	Embed column family name in SST file Summary: Added the column family name to the properties block. This property is omitted only if the property is unavailable, such as when RepairDB() writes SST files. In a next diff, I will change RepairDB to use this new property for deciding to which column family an existing SST file belongs. If this property is missing, it will add it to the "unknown" column family (same as its existing behavior). Test Plan: New unit test: $ ./db_table_properties_test --gtest_filter=DBTablePropertiesTest.GetColumnFamilyNameProperty Reviewers: IslamAbdelRahman, yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D55605	2016-04-06 23:10:32 -07:00
Marton Trencseni	9b51987521	Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes. Summary: When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache. What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention. Test Plan: 'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK. I didn't run the Java tests, I don't have Java set up on my devserver. Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56133	2016-04-01 10:42:39 -07:00
zensan	78711524b7	In all the places where log records are read, there was a check that record.size() should not be less than 12. This "magic number" seems to be the WriteBatch header (8 byte sequence and 4 byte count). Replaced all the places where "12" was used by WriteBatchInternal::kHeader.	2016-03-30 23:05:22 +05:30
Praveen Rao	583157f710	Avoid overloaded virtual function	2016-03-22 17:10:31 -07:00
Praveen Rao	4f1c74a46e	merge from master	2016-03-18 12:48:01 -07:00
Praveen Rao	f8c2189307	Publish log numbers for column family to wal_filter, and provide log number in the record callback	2016-03-18 12:32:15 -07:00
Baris Yazici	e8e6cf0173	fix: handle_fatal_signal (sig=6) in std::vector<std::string, std::allocator<std::string> >::_M_range_check \| c++/4.8.2/bits/stl_vector.h:794 #174 Summary: Fix for https://github.com/facebook/mysql-5.6/issues/174 When there is no old files to purge, vector.at(i) function was crashing if (old_info_log_file_count != 0 && old_info_log_file_count >= db_options_.keep_log_file_num) { std::sort(old_info_log_files.begin(), old_info_log_files.end()); size_t end = old_info_log_file_count - db_options_.keep_log_file_num; for (unsigned int i = 0; i <= end; i++) { std::string& to_delete = old_info_log_files.at(i); Added check to old_info_log_file_count be non zero. Test Plan: run existing tests Reviewers: gunnarku, vasilep, sdong, yhchiang Reviewed By: yhchiang Subscribers: andrewkr, webscalesql-eng, dhruba Differential Revision: https://reviews.facebook.net/D55245	2016-03-11 11:11:45 -08:00
Andrew Kryczka	d9620239d2	Cleanup stale manifests outside of full purge Summary: - Keep track of obsolete manifests in VersionSet - Updated FindObsoleteFiles() to put obsolete manifests in the JobContext for later use by PurgeObsoleteFiles() - Added test case that verifies a stale manifest is deleted by a non-full purge Test Plan: $ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation Reviewers: IslamAbdelRahman, yoshinorim, sdong Reviewed By: sdong Subscribers: andrewkr, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D55269	2016-03-10 18:16:21 -08:00
Yueh-Hsuan Chiang	765597fa78	Update compaction score right after CompactFiles forms a compaction Summary: This is a follow-up patch of https://reviews.facebook.net/D54891. As the information about files being compacted will also be used when making compaction decision, it is necessary to update the compaction score when a compaction plan has been made but not yet execute. This patch adds a missing call to update the compaction score in CompactFiles(). Test Plan: compact_files_test Reviewers: sdong, IslamAbdelRahman, kradhakrishnan, yiwu, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D55227	2016-03-10 14:34:28 -08:00
Yueh-Hsuan Chiang	a7d4eb2f34	Fix a bug where flush does not happen when a manual compaction is running Summary: Currently, when rocksdb tries to run manual compaction to refit data into a level, there's a ReFitLevel() process that requires no bg work is currently running. When RocksDB plans to ReFitLevel(), it will do the following: 1. pause scheduling new bg work. 2. wait until all bg work finished 3. do the ReFitLevel() 4. unpause scheduling new bg work. However, as it pause scheduling new bg work at step one and waiting for all bg work finished in step 2, RocksDB will stop flushing until all bg work is done (which could take a long time.) This patch fix this issue by changing the way ReFitLevel() pause the background work: 1. pause scheduling compaction. 2. wait until all bg work finished. 3. pause scheduling flush 4. do ReFitLevel() 5. unpause both flush and compaction. The major difference is that. We only pause scheduling compaction in step 1 and wait for all bg work finished in step 2. This prevent flush being blocked for a long time. Although there's a very rare case that ReFitLevel() might be in starvation in step 2, but it's less likely the case as flush typically finish very fast. Test Plan: existing test. Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D55029	2016-03-04 14:24:52 -08:00
Islam AbdelRahman	dfe96c72c3	Fix WriteLevel0TableForRecovery file delete protection Summary: The call to ``` CaptureCurrentFileNumberInPendingOutputs() ``` should be before ``` versions_->NewFileNumber() ``` Right now we are not actually protecting the file from being deleted Test Plan: make check Reviewers: sdong, anthony, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D54645	2016-03-03 18:25:07 -08:00
sdong	e79ad9e184	Add Iterator Property rocksdb.iterator.version_number Summary: We want to provide a way to detect whether an iterator is stale and needs to be recreated. Add a iterator property to return version number. Test Plan: Add two unit tests for it. Reviewers: IslamAbdelRahman, yhchiang, anthony, kradhakrishnan, andrewkr Reviewed By: andrewkr Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54921	2016-03-02 16:23:59 -08:00
Islam AbdelRahman	6743135ea1	Fix DB::AddFile() issue when PurgeObsoleteFiles() is called Summary: In some situations the DB will scan all existing files in the DB path and delete the ones that are Obsolete. If this happen during adding an external sst file. this could cause the file to be deleted while we are adding it. This diff fix this issue Test Plan: unit test to reproduce the bug existing unit tests Reviewers: sdong, yhchiang, andrewkr Reviewed By: andrewkr Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D54627	2016-03-01 12:05:29 -08:00
sdong	38201b3599	Fix assert failure when DBImpl::SyncWAL() conflicts with log rolling Summary: DBImpl::SyncWAL() releases db mutex before calling DBImpl::MarkLogsSynced(), while inside DBImpl::MarkLogsSynced() we assert there is none or one outstanding log file. However, a memtable switch can happen in between and causing two or outstanding logs there, failing the assert. The diff adds a unit test that repros the issue and fix the assert so that the unit test passes. Test Plan: Run the new tests. Reviewers: anthony, kolmike, yhchiang, IslamAbdelRahman, kradhakrishnan, andrewkr Reviewed By: andrewkr Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54621	2016-02-23 11:42:15 -08:00
Mike Kolupaev	eef63ef807	Fixed CompactFiles() spuriously failing or corrupting DB Summary: We started getting two kinds of crashes since we started using `DB::CompactFiles()`: (1) `CompactFiles()` fails saying something like "/data/logdevice/4440/shard12/012302.sst: No such file or directory", and presumably makes DB read-only, (2) DB fails to open saying "Corruption: Can't access /267000.sst: IO error: /data/logdevice/4440/shard1/267000.sst: No such file or directory". AFAICT, both can be explained by background thread deleting compaction output as "obsolete" while it's being written, before it's committed to manifest. If it ends up committed to the manifest, we get (2); if compaction notices the disappearance and fails, we get (1). The internal tasks t10068021 and t10134177 have some details about the investigation that led to this. Test Plan: `make -j check`; the new test fails to reopen the DB without the fix Reviewers: yhchiang Reviewed By: yhchiang Subscribers: dhruba, sdong Differential Revision: https://reviews.facebook.net/D54561	2016-02-22 13:54:58 -08:00
Islam AbdelRahman	df9ba6df62	Introduce SstFileManager::SetMaxAllowedSpaceUsage() to cap disk space usage Summary: Introude SstFileManager::SetMaxAllowedSpaceUsage() that can be used to limit the maximum space usage allowed for RocksDB. When this limit is exceeded WriteImpl() will fail and return Status::Aborted() Test Plan: unit testing Reviewers: yhchiang, anthony, andrewkr, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D53763	2016-02-17 15:20:23 -08:00
reid horuff	5bcf952a87	Fix WriteImpl empty batch hanging issue Summary: There is an issue in DBImpl::WriteImpl where if an empty writebatch comes in and sync=true then the logs will be marked as being synced yet the sync never actually happens because there is no data in the writebatch. This causes the next incoming batch to hang while waiting for the logs to complete syncing. This fix syncs logs even if the writebatch is empty. Test Plan: DoubleEmptyBatch unit test in transaction_test. Reviewers: yoshinorim, hermanlee4, sdong, ngbronson, anthony Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54057	2016-02-16 12:21:33 -08:00
Mike Kolupaev	44371501f0	Fixed a segfault when compaction fails Summary: We've hit it today. Test Plan: `make -j check`; didn't reproduce the issue Reviewers: yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D54219	2016-02-16 11:11:16 -08:00
Baraa Hamodi	21e95811d1	Updated all copyright headers to the new format.	2016-02-09 15:12:00 -08:00
Yueh-Hsuan Chiang	4a8cbf4e31	Allows Get and MultiGet to read directly from SST files. Summary: Add kSstFileTier to ReadTier, which allows Get and MultiGet to read only directly from SST files and skip mem-tables. kSstFileTier = 0x2 // data in SST files. // Note that this ReadTier currently only supports // Get and MultiGet and does not support iterators. Test Plan: add new test in db_test. Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: igor, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D53511	2016-02-09 11:20:22 -08:00
reid horuff	6f71d3b68b	Improve perf of Pessimistic Transaction expirations (and optimistic transactions) Summary: copy from task 8196669: 1) Optimistic transactions do not support batching writes from different threads. 2) Pessimistic transactions do not support batching writes if an expiration time is set. In these 2 cases, we currently do not do any write batching in DBImpl::WriteImpl() because there is a WriteCallback that could decide at the last minute to abort the write. But we could support batching write operations with callbacks if we make sure to process the callbacks correctly. To do this, we would first need to modify write_thread.cc to stop preventing writes with callbacks from being batched together. Then we would need to change DBImpl::WriteImpl() to call all WriteCallback's in a batch, only write the batches that succeed, and correctly set the state of each batch's WriteThread::Writer. Test Plan: Added test WriteWithCallbackTest to write_callback_test.cc which creates multiple client threads and verifies that writes are batched and executed properly. Reviewers: hermanlee4, anthony, ngbronson Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D52863	2016-02-05 10:44:13 -08:00
Andrew Kryczka	284aa613a7	Eliminate duplicated property constants Summary: Before this diff, there were duplicated constants to refer to properties (user- facing API had strings and InternalStats had an enum). I noticed these were inconsistent in terms of which constants are provided, names of constants, and documentation of constants. Overall it seemed annoying/error-prone to maintain these duplicated constants. So, this diff gets rid of InternalStats's constants and replaces them with a map keyed on the user-facing constant. The value in that map contains a function pointer to get the property value, so we don't need to do string matching while holding db->mutex_. This approach has a side benefit of making many small handler functions rather than a giant switch-statement. Test Plan: db_properties_test passes, running "make commit-prereq -j32" Reviewers: sdong, yhchiang, kradhakrishnan, IslamAbdelRahman, rven, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D53253	2016-02-02 19:14:56 -08:00
Nathan Bronson	9c2cf9479b	Fix for --allow_concurrent_memtable_write with batching Summary: Concurrent memtable adds were incorrectly computing the last sequence number for a write batch group when the write batches were not solitary. This is the cause of https://github.com/facebook/mysql-5.6/issues/155 Test Plan: 1. unit tests 2. new unit test 3. parallel db_bench stress tests with batch size of 10 and asserts enabled Reviewers: igor, sdong Reviewed By: sdong Subscribers: IslamAbdelRahman, MarkCallaghan, dhruba Differential Revision: https://reviews.facebook.net/D53595	2016-02-01 20:41:57 -08:00
SherlockNoMad	37159a6448	Add histogram for value size per operation	2016-01-31 18:09:24 -08:00
Venkatesh Radhakrishnan	3b2a1ddd2e	Add options.base_background_compactions as a number of compaction threads for low compaction debt Summary: If options.base_background_compactions is given, we try to schedule number of compactions not existing this number, only when L0 files increase to certain number, or pending compaction bytes more than certain threshold, we schedule compactions based on options.max_background_compactions. The watermarks are calculated based on slowdown thresholds. Test Plan: Add new test cases in column_family_test. Adding more unit tests. Reviewers: IslamAbdelRahman, yhchiang, kradhakrishnan, rven, anthony Reviewed By: anthony Subscribers: leveldb, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D53409	2016-01-29 16:15:53 -08:00
Islam AbdelRahman	d6c838f1e1	Add SstFileManager (component tracking all SST file in DBs and control the deletion rate) Summary: Add a new class SstFileTracker that will be notified whenever a DB add/delete/move and sst file, it will also replace DeleteScheduler SstFileTracker can be used later to abort writes when we exceed a specific size Test Plan: unit tests Reviewers: rven, anthony, yhchiang, sdong Reviewed By: sdong Subscribers: igor, lovro, march, dhruba Differential Revision: https://reviews.facebook.net/D50469	2016-01-28 18:35:01 -08:00
Andrew Kryczka	167bd8856d	[directory includes cleanup] Finish removing util->db dependencies	2016-01-26 10:49:24 -08:00
Andrew Kryczka	acd7d58695	[directory includes cleanup] Remove util->db dependency for ThreadStatusUtil Summary: We can avoid the dependency by forward-declaring ColumnFamilyData and then treating it as a black box. That means callers of ThreadStatusUtil need to explicitly provide more options, even if they can be derived from the ColumnFamilyData, since ThreadStatusUtil doesn't include the definition. This is part of a series of diffs to eliminate circular dependencies between directories (e.g., db/* files depending on util/* files and vice-versa). Test Plan: $ ./db_test --gtest_filter=DBTest.GetThreadStatus $ make -j32 commit-prereq Reviewers: sdong, yhchiang, IslamAbdelRahman Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D53361	2016-01-26 10:49:16 -08:00
Andrew Kryczka	46f9cd46af	[directory includes cleanup] Move cross-function test points Summary: I split the db-specific test points out into a separate file under db/ directory. There were also a few bugs to fix in xfunc.{h,cc} that prevented it from compiling previously; see https://reviews.facebook.net/D36825. Test Plan: compilation works now, below command works, will also run "make xfunc". $ make check ROCKSDB_XFUNC_TEST='managed_new' tests-regexp='DBTest' -j32 Reviewers: sdong, yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D53343	2016-01-26 10:49:05 -08:00
Islam AbdelRahman	2fbc59a348	Disallow SstFileWriter from creating empty sst files Summary: SstFileWriter may create an sst file with no entries Right now this will fail when being ingested using DB::AddFile() saying that the keys are corrupted Test Plan: make check Reviewers: yhchiang, rven, anthony, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D52815	2016-01-25 13:47:07 -08:00
David Bernard	3f12e16f27	Make alloca.h optional	2016-01-19 06:17:31 +00:00
David Bernard	d78c6b28c4	Changes for build on solaris Makefile adjust paths for solaris build Makefile enable _GLIBCXX_USE_C99 so that std::to_string is available db_compaction_test.cc Initialise a variable to avoid a compilation error db_impl.cc Include <alloca.h> db_test.cc Include <alloca.h> Environment.java recognise solaris envrionment options_bulder.cc Make log unambiguous geodb_impl.cc Make log and floor unambiguous	2016-01-19 04:45:21 +00:00
Venkatesh Radhakrishnan	7ece10ecb6	DeleteFilesInRange: Mark files to be deleted as being compacted before applying change Summary: While running the myrocks regression suite, I found that while dropping a table soon after inserting rows into it resulted in an assertion failure in CheckConsistencyForDeletes for not finding a file which was recently added or moved. Marking the files to be deleted as being compacted before calling LogAndApplyChange fixed the assertion failures. Test Plan: DBCompactionTest.DeleteFileRange Reviewers: IslamAbdelRahman, anthony, yhchiang, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba, yoshinorim, leveldb Differential Revision: https://reviews.facebook.net/D52599	2016-01-07 14:48:45 -08:00
Reid Horuff	da032495d3	Optimize GetLatestSequenceForKey Summary: DBImpl::GetLatestSequenceForKey() can do memcpy's to load a value that will never be used. This can be optimized by changing all the Get() functions called to optionally not fetch the value (and only fetch the sequencenumber). Test Plan: optimistic_transaction_test and transaction_test Reviewers: anthony Reviewed By: anthony Subscribers: leveldb, dhruba, hermanlee4 Differential Revision: https://reviews.facebook.net/D52227	2016-01-06 13:43:22 -08:00
Venkatesh Radhakrishnan	d74c9f0a57	DeleteFilesInRange: Clean job context if no files deleted Summary: We need to clean the job context if we end up not deleting any files because no files are in the range specified. Test Plan: DBCompactionTest.DeleteFileRange Reviewers: sdong, anthony, yhchiang, kradhakrishnan, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D52467	2016-01-04 10:55:31 -08:00
Nathan Bronson	ac16663bd6	use -Werror=missing-field-initializers, to closer match MyRocks build Summary: myrocks seems to build rocksdb using -Wmissing-field-initializers (and treats warnings as errors). This diff adds that flag to the rocksdb build, and fixes the compilation failures that result. I have not checked for any other differences in the build flags for rocksdb build as part of myrocks. Test Plan: make check Reviewers: sdong, rven Reviewed By: rven Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D52443	2015-12-30 14:56:18 -08:00
Venkatesh Radhakrishnan	63ddb783db	Delete files in given key range Summary: This is an initial diff for providing the ability to delete files which are completely within a given range of keys. Test Plan: DBCompactionTest.DeleteRange Reviewers: IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: yoshinorim, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D52293	2015-12-29 13:22:13 -08:00
Siying Dong	22c0ed8a5f	Disable Visual Studio Warning C4351 Currently Windows build is broken because of Warning C4351. Disable the warning before figuring out the right way to fix it.	2015-12-28 15:06:34 -08:00
sdong	11672df19a	Fix CLANG errors introduced by `7d87f02799` Summary: Fix some CLANG errors introduced in `7d87f02799` Test Plan: Build with both of CLANG and gcc Reviewers: rven, yhchiang, kradhakrishnan, anthony, IslamAbdelRahman, ngbronson Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D52329	2015-12-28 10:00:58 -08:00
Nathan Bronson	7d87f02799	support for concurrent adds to memtable Summary: This diff adds support for concurrent adds to the skiplist memtable implementations. Memory allocation is made thread-safe by the addition of a spinlock, with small per-core buffers to avoid contention. Concurrent memtable writes are made via an additional method and don't impose a performance overhead on the non-concurrent case, so parallelism can be selected on a per-batch basis. Write thread synchronization is an increasing bottleneck for higher levels of concurrency, so this diff adds --enable_write_thread_adaptive_yield (default off). This feature causes threads joining a write batch group to spin for a short time (default 100 usec) using sched_yield, rather than going to sleep on a mutex. If the timing of the yield calls indicates that another thread has actually run during the yield then spinning is avoided. This option improves performance for concurrent situations even without parallel adds, although it has the potential to increase CPU usage (and the heuristic adaptation is not yet mature). Parallel writes are not currently compatible with inplace updates, update callbacks, or delete filtering. Enable it with --allow_concurrent_memtable_write (and --enable_write_thread_adaptive_yield). Parallel memtable writes are performance neutral when there is no actual parallelism, and in my experiments (SSD server-class Linux and varying contention and key sizes for fillrandom) they are always a performance win when there is more than one thread. Statistics are updated earlier in the write path, dropping the number of DB mutex acquisitions from 2 to 1 for almost all cases. This diff was motivated and inspired by Yahoo's cLSM work. It is more conservative than cLSM: RocksDB's write batch group leader role is preserved (along with all of the existing flush and write throttling logic) and concurrent writers are blocked until all memtable insertions have completed and the sequence number has been advanced, to preserve linearizability. My test config is "db_bench -benchmarks=fillrandom -threads=$T -batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000 --block_size=16384 --allow_concurrent_memtable_write" on a two-socket Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive. With 1 thread I get ~440Kops/sec. Peak performance for 1 socket (numactl -N1) is slightly more than 1Mops/sec, at 16 threads. Peak performance across both sockets happens at 30 threads, and is ~900Kops/sec, although with fewer threads there is less performance loss when the system has background work. Test Plan: 1. concurrent stress tests for InlineSkipList and DynamicBloom 2. make clean; make check 3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench 4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench 5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench 6. make clean; OPT=-DROCKSDB_LITE make check 7. verify no perf regressions when disabled Reviewers: igor, sdong Reviewed By: sdong Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba Differential Revision: https://reviews.facebook.net/D50589	2015-12-25 11:03:40 -08:00
Siying Dong	298ba27ae2	Merge pull request #846 from yuslepukhin/enble_c4244_lossofdata Enable MS compiler warning c4244.	2015-12-23 22:59:42 -08:00
Islam AbdelRahman	d005c66faf	Report compaction reason in CompactionListener Summary: Add CompactionReason to CompactionJobInfo This will allow users to understand why compaction started which will help options tuning Test Plan: added new tests make check -j64 Reviewers: yhchiang, anthony, kradhakrishnan, sdong, rven Reviewed By: rven Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D51975	2015-12-22 11:37:19 -08:00
Alex Yang	33e09c0e19	add call to install superversion and schedule work in enableautocompactions Summary: This patch fixes https://github.com/facebook/mysql-5.6/issues/121 There is a recent change in rocksdb to disable auto compactions on startup: https://reviews.facebook.net/D51147. However, there is a small timing window where a column family needs to be compacted and schedules a compaction, but the scheduled compaction fails when it checks the disable_auto_compactions setting. The expectation is once the application is ready, it will call EnableAutoCompactions() to allow new compactions to go through. However, if the Column family is stalled because L0 is full, and no writes can go through, it is possible the column family may never have a new compaction request get scheduled. EnableAutoCompaction() should probably schedule an new flush and compaction event when it resets disable_auto_compaction. Using InstallSuperVersionAndScheduleWork, we call SchedulePendingFlush, SchedulePendingCompaction, as well as MaybeScheduleFlushOrcompaction on all the column families to avoid the situation above. This is still a first pass for feedback. Could also just call SchedePendingFlush and SchedulePendingCompaction directly. Test Plan: Run on Asan build cd _build-5.6-ASan/ && ./mysql-test/mtr --mem --big --testcase-timeout=36000 --suite-timeout=12000 --parallel=16 --suite=rocksdb,rocksdb_rpl,rocksdb_sys_vars --mysqld=--default-storage-engine=rocksdb --mysqld=--skip-innodb --mysqld=--default-tmp-storage-engine=MyISAM --mysqld=--rocksdb rocksdb_rpl.rpl_rocksdb_stress_crash --repeat=1000 Ensure that it no longer hangs during the test. Reviewers: hermanlee4, yhchiang, anthony Reviewed By: anthony Subscribers: leveldb, yhchiang, dhruba Differential Revision: https://reviews.facebook.net/D51747	2015-12-21 10:06:49 -08:00
Venkatesh Radhakrishnan	7b12ae97d4	Add signalall after removing item from manual_compaction deque Summary: When there are waiting manual compactions, we need to signal them after removing the current manual compaction from the deque. Test Plan: ColumnFamilytTest.SameCFManualManualCommaction Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D52119	2015-12-17 16:59:00 -08:00
Islam AbdelRahman	aececc209e	Introduce ReadOptions::pin_data (support zero copy for keys) Summary: This patch update the Iterator API to introduce new functions that allow users to keep the Slices returned by key() valid as long as the Iterator is not deleted ReadOptions::pin_data : If true keep loaded blocks in memory as long as the iterator is not deleted Iterator::IsKeyPinned() : If true, this mean that the Slice returned by key() is valid as long as the iterator is not deleted Also add a new option BlockBasedTableOptions::use_delta_encoding to allow users to disable delta_encoding if needed. Benchmark results (using https://phabricator.fb.com/P20083553) ``` // $ du -h /home/tec/local/normal.4K.Snappy/db10077 // 6.1G /home/tec/local/normal.4K.Snappy/db10077 // $ du -h /home/tec/local/zero.8K.LZ4/db10077 // 6.4G /home/tec/local/zero.8K.LZ4/db10077 // Benchmarks for shard db10077 // _build/opt/rocks/benchmark/rocks_copy_benchmark \ // --normal_db_path="/home/tec/local/normal.4K.Snappy/db10077" \ // --zero_db_path="/home/tec/local/zero.8K.LZ4/db10077" // First run // ============================================================================ // rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s // ============================================================================ // BM_StringCopy 1.73s 576.97m // BM_StringPiece 103.74% 1.67s 598.55m // ============================================================================ // Match rate : 1000000 / 1000000 // Second run // ============================================================================ // rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s // ============================================================================ // BM_StringCopy 611.99ms 1.63 // BM_StringPiece 203.76% 300.35ms 3.33 // ============================================================================ // Match rate : 1000000 / 1000000 ``` Test Plan: Unit tests Reviewers: sdong, igor, anthony, yhchiang, rven Reviewed By: rven Subscribers: dhruba, lovro, adsharma Differential Revision: https://reviews.facebook.net/D48999	2015-12-16 12:08:30 -08:00
Gunnar Kudrjavets	97265f5f14	Fix minor bugs in delete operator, snprintf, and size_t usage Summary: List of changes: 1) Fix the snprintf() usage in cases where wrong variable was used to determine the output buffer size. 2) Remove unnecessary checks before calling delete operator. 3) Increase code correctness by using size_t type when getting vector's size. 4) Unify the coding style by removing namespace::std usage at the top of the file to confirm to the majority usage. 5) Fix various lint errors pointed out by 'arc lint'. Test Plan: Code review and build: git diff make clean make -j 32 commit-prereq arc lint Reviewers: kradhakrishnan, sdong, rven, anthony, yhchiang, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D51849	2015-12-15 15:26:20 -08:00
Venkatesh Radhakrishnan	030215bf01	Running manual compactions in parallel with other automatic or manual compactions in restricted cases Summary: This diff provides a framework for doing manual compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to BackgroundCompactions, so that RunManualCompactions can be reentrant. Parallelism is controlled by the two routines ConflictingManualCompaction to allow/disallow new parallel/manual compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs. I will be adding more tests later. Test Plan: Rocksdb regression + new tests + valgrind Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong Reviewed By: sdong Subscribers: yoshinorim, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D47973	2015-12-14 11:20:34 -08:00
Dmitri Smirnov	236fe21c92	Enable MS compiler warning c4244. Mostly due to the fact that there are differences in sizes of int,long on 64 bit systems vs GNU.	2015-12-11 16:47:34 -08:00
agiardullo	3bfd3d39a3	Use SST files for Transaction conflict detection Summary: Currently, transactions can fail even if there is no actual write conflict. This is due to relying on only the memtables to check for write-conflicts. Users have to tune memtable settings to try to avoid this, but it's hard to figure out exactly how to tune these settings. With this diff, TransactionDB will use both memtables and SST files to determine if there are any write conflicts. This relies on the fact that BlockBasedTable stores sequence numbers for all writes that happen after any open snapshot. Also, D50295 is needed to prevent SingleDelete from disappearing writes (the TODOs in this test code will be fixed once the other diff is approved and merged). Note that Optimistic transactions will still rely on tuning memtable settings as we do not want to read from SST while on the write thread. Also, memtable settings can still be used to reduce how often TransactionDB needs to read SST files. Test Plan: unit tests, db bench Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb, yoshinorim Differential Revision: https://reviews.facebook.net/D50475	2015-12-11 12:34:11 -08:00
agiardullo	9e44629061	Change SingleDelete to support conflict checking Summary: For Transactions, we want to start using the SST files to do write conflict checking. To do this, we need to make sure that compaction never removes all writes if an earlier snapshot exists. So I had to change the way we process SingleDeletes to sometimes leave a SingleDelete behind when we encounter a Put followed by a SingleDelete. See the comments in this diff for a more detailed explanation. Test Plan: added more unit tests Reviewers: rven, igor, kradhakrishnan, IslamAbdelRahman, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D50295	2015-12-10 11:35:38 -08:00
Yueh-Hsuan Chiang	774b80e99e	Resubmit the fix for a race condition in persisting options Summary: This patch fix a race condition in persisting options which will cause a crash when: * Thread A obtain cf options and start to persist options based on that cf options. * Thread B kicks in and finish DropColumnFamily and delete cf_handle. * Thread A wakes up and tries to finish the persisting options and crashes. Test Plan: Add a test in column_family_test that can reproduce the crash Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D51717	2015-12-08 17:01:02 -08:00
agiardullo	e5c5f23814	Support marking snapshots for write-conflict checking - Take 2 Summary: D51183 was reverted due to breaking the LITE build. This diff is the same as D51183 but with a fix for the LITE BUILD(D51693) Test Plan: run all unit tests Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D51711	2015-12-08 16:47:31 -08:00
sdong	1d63c3d610	Revert "Support marking snapshots for write-conflict checking" This reverts commit `ec704aafdc` for it broke RocksDB LITE build.	2015-12-08 09:27:17 -08:00
agiardullo	ec704aafdc	Support marking snapshots for write-conflict checking Summary: D50475 enables using SST files for transaction write-conflict checking. In order for this to work, we need to make sure not to compact out SingleDeletes when there is an earlier transaction snapshot(D50295). If there is a long-held snapshot, this could reduce the benefit of the SingleDelete optimization. This diff allows Transactions to mark snapshots as being used for write-conflict checking. Then, during compaction, we will be able to optimize SingleDeletes better in the future. This diff adds a flag to SnapshotImpl which is used by Transactions. This diff also passes the earliest write-conflict snapshot's sequence number to CompactionIterator. This diff does not actually change Compaction (after this diff is pushed, D50295 will be able to use this information). Test Plan: no behavior change, ran existing tests Reviewers: rven, kradhakrishnan, yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D51183	2015-12-07 19:40:51 -08:00
sdong	f307036bde	Revert "Fix a race condition in persisting options" This reverts commit `2fa3ed5180`. It breaks RocksDB lite build	2015-12-07 17:09:12 -08:00
Yueh-Hsuan Chiang	2fa3ed5180	Fix a race condition in persisting options Summary: This patch fix a race condition in persisting options which will cause a crash when: * Thread A obtain cf options and start to persist options based on that cf options. * Thread B kicks in and finish DropColumnFamily and delete cf_handle. * Thread A wakes up and tries to finish the persisting options and crashes. Test Plan: Add a test in column_family_test that can reproduce the crash Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D51609	2015-12-07 15:25:12 -08:00
Alex Yang	e8180f9901	added public api to schedule flush/compaction, code to prevent race with db::open Summary: Fixes T8781168. Added a new function EnableAutoCompactions in db.h to be publicly avialable. This allows compaction to be re-enabled after disabling it via SetOptions Refactored code to set the dbptr earlier on in TransactionDB::Open and DB::Open Temporarily disable auto_compaction in TransactionDB::Open until dbptr is set to prevent race condition. Test Plan: Ran make all check verified fix on myrocks side: was able to reproduce the seg fault with ../tools/mysqltest.sh --mem --force rocksdb.drop_table method was to manually sleep the thread after DB::Open but before TransactionDB ptr was assigned in transaction_db_impl.cc: DB::Open(db_options, dbname, column_families_copy, handles, &db); clock_t goal = (60000 * 10) + clock(); while (goal > clock()); ...dbptr(aka rdb) gets assigned below verified my changes fixed the issue. Also added unit test 'ToggleAutoCompaction' in transaction_test.cc Reviewers: hermanlee4, anthony Reviewed By: anthony Subscribers: alex, dhruba Differential Revision: https://reviews.facebook.net/D51147	2015-12-03 22:59:44 -08:00
sdong	db320b1b82	DB to only flush the column family with the largest memtable while option.db_write_buffer_size is hit Summary: When option.db_write_buffer_size is hit, we currently flush all column families. Move to flush the column family with the largest active memt table instead. In this way, we can avoid too many small files in some cases. Test Plan: Modify test DBTest.SharedWriteBuffer to work with the updated behavior Reviewers: kradhakrishnan, yhchiang, rven, anthony, IslamAbdelRahman, igor Reviewed By: igor Subscribers: march, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D51291	2015-11-30 13:36:57 -08:00
Reid Horuff	3381e2c3e7	Handle multiple calls to DBImpl::PauseBackgroundWork() and DBImpl::ContinueBackgroundWork() Summary: Handle multiple calls to DBImpl::PauseBackgroundWork() and DBImpl::ContinueBackgroundWork() Test Plan: rocksdb.information_schema handles this case. Reviewers: igor Reviewed By: igor Subscribers: hermanlee4, jkedgar, dhruba Differential Revision: https://reviews.facebook.net/D50781	2015-11-16 14:20:18 -08:00
Venkatesh Radhakrishnan	2ae4d7d708	Make sure that CompactFiles does not run two parallel Level 0 compactions Summary: Since level 0 files can overlap, two level 0 compactions cannot run in parallel. Compact files needs to check this before running a compaction. Test Plan: CompactFilesTest.L0ConflictsFiles Reviewers: igor, IslamAbdelRahman, anthony, sdong, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D50079	2015-11-13 12:01:00 -08:00
Nathan Bronson	6ce42dd075	Don't merge WriteBatch-es if WAL is disabled Summary: There's no need for WriteImpl to flatten the write batch group into a single WriteBatch if the WAL is disabled. This diff moves the flattening into the WAL step, and skips flattening entirely if it isn't needed. It's good for about 5% speedup on a multi-threaded workload with no WAL. This diff also adds clarifying comments about the chance for partial failure of WriteBatchInternal::InsertInto, and always sets bg_error_ if the memtable state diverges from the logged state or if a WriteBatch succeeds only partially. Benchmark for speedup: db_bench -benchmarks=fillrandom -threads=16 -batch_size=1 -memtablerep=skip_list -value_size=0 --num=200000 -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000 Test Plan: asserts + make check Reviewers: sdong, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D50583	2015-11-12 10:50:38 -08:00
Yueh-Hsuan Chiang	e114f0abb8	Enable RocksDB to persist Options file. Summary: This patch allows rocksdb to persist options into a file on DB::Open, SetOptions, and Create / Drop ColumnFamily. Options files are created under the same directory as the rocksdb instance. In addition, this patch also adds a fail_if_missing_options_file in DBOptions that makes any function call return non-ok status when it is not able to persist options properly. // If true, then DB::Open / CreateColumnFamily / DropColumnFamily // / SetOptions will fail if options file is not detected or properly // persisted. // // DEFAULT: false bool fail_if_missing_options_file; Options file names are formatted as OPTIONS-<number>, and RocksDB will always keep the latest two options files. Test Plan: Add options_file_test. options_test column_family_test Reviewers: igor, IslamAbdelRahman, sdong, anthony Reviewed By: anthony Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48285	2015-11-10 22:58:01 -08:00
Venkatesh Radhakrishnan	9d50afc3b9	Prefix-based iterating only shows keys in prefix Summary: MyRocks testing found an issue that while iterating over keys that are outside the prefix, sometimes wrong results were seen for keys outside the prefix. We now tighten the range of keys seen with a new read option called prefix_seen_at_start. This remembers the starting prefix and then compares it on a Next for equality of prefix. If they are from a different prefix, it sets valid to false. Test Plan: PrefixTest.PrefixValid Reviewers: IslamAbdelRahman, sdong, yhchiang, anthony Reviewed By: anthony Subscribers: spetrunia, hermanlee4, yoshinorim, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D50211	2015-11-05 13:24:05 -08:00
Yueh-Hsuan Chiang	3ecbab0040	Add GetAggregatedIntProperty(): returns the aggregated value from all CFs Summary: This patch adds GetAggregatedIntProperty() that returns the aggregated value from all CFs Test Plan: Added a test in db_test Reviewers: igor, sdong, anthony, IslamAbdelRahman, rven Reviewed By: rven Subscribers: rven, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D49497	2015-11-03 15:54:18 -08:00
Islam AbdelRahman	ff4499e297	Update DB::AddFile() to have less restrictions Summary: Update DB::AddFile() restrictions to be - Key range in loaded table file don't overlap with existing keys or tombstones in DB. - No other writes happen during AddFile call. The updated AddFile() will verify that the file key range don't overlap with any keys or tombstones in the DB, and then add the file to L0 Test Plan: unit tests Reviewers: igor, rven, anthony, kradhakrishnan, sdong Reviewed By: sdong Subscribers: adsharma, ameyag, dhruba Differential Revision: https://reviews.facebook.net/D49233	2015-10-30 16:38:10 -07:00
Islam AbdelRahman	2872e0c8c2	Clean and expose CreateLoggerFromOptions Summary: CreateLoggerFromOptions have some parameters like db_log_dir and env, these parameters are redundant since they already exist in DBOptions this patch remove the redundant parameters and expose CreateLoggerFromOptions to users Test Plan: make check Reviewers: igor, anthony, yhchiang, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba, hermanlee4 Differential Revision: https://reviews.facebook.net/D49713	2015-10-29 18:07:37 -07:00
sdong	296c3a1f94	"make format" in some recent commits Summary: Run "make format" for some recent commits. Test Plan: Build and run tests Reviewers: IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D49707	2015-10-29 17:11:14 -07:00
Herman Lee	0d720dfc17	Use the correct variable when fetching table properties. Summary: An uninitialized parameter was being passed into the call to fetch the table properties during the compaction notification callbacks. Test Plan: Build it with myrocks and verify unit test passed. Run unit tests. Reviewers: rven, yhchiang, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D49635	2015-10-28 16:28:11 -07:00
Praveen Rao	4ce117c4d5	Merge branch 'master' into wal_filter	2015-10-26 19:03:34 -07:00
Praveen Rao	32cdec634e	Fail recovery if filter provides more records than original and corresponding unit-test, fix naming conventions	2015-10-26 18:11:18 -07:00
Siying Dong	138876a62c	Merge pull request #746 from ceph/wip-recycle Add Options.recycle_log_file_num for Recycling WAL Files	2015-10-26 15:01:28 -07:00
Praveen Rao	2938c5c137	merge upstream changes	2015-10-19 15:21:33 -07:00
Praveen Rao	0c59691dde	Handle multiple batches in single log record - allow app to return a new batch + allow app to return corrupted record status	2015-10-19 13:27:40 -07:00
Alexey Maykov	f18acd8875	Fixed the clang compilation failure Summary: As above. Test Plan: USE_CLANG=1 make check -j Reviewers: igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48981	2015-10-19 10:38:50 -07:00
Sage Weil	9c33f64d19	log_reader: pass in WALRecoveryMode instead of bool report_eof_inconsistency Soon our behavior will depend on more than just whther we are in kAbsoluteConsistency or not. Signed-off-by: Sage Weil <sage@redhat.com>	2015-10-18 21:24:32 -04:00
Sage Weil	3ac13c99d1	log_reader: pass log_number and optional info_log to ctor We will need the log number to validate the recycle-style CRCs. The log is helpful for debugging, but optional, as not all callers have it. Signed-off-by: Sage Weil <sage@redhat.com>	2015-10-18 21:24:32 -04:00
Sage Weil	5830c699f2	log_writer: pass log number and whether recycling is enabled to ctor When we recycle log files, we need to mix the log number into the CRC for each record. Note that for logs that don't get recycled (like the manifest), we always pass a log_number of 0 and false. Signed-off-by: Sage Weil <sage@redhat.com>	2015-10-18 21:24:32 -04:00
Sage Weil	666376150c	db_impl: recycle log files If log recycling is enabled, put old WAL files on a recycle queue instead of deleting them. When we need a new log file, take a recycled file off the list if one is available. Signed-off-by: Sage Weil <sage@redhat.com>	2015-10-18 21:24:32 -04:00
Sage Weil	d666225a0a	db_impl: disable recycle_log_files if WAL archive is enabled We can't recycle the files if they are being archived. Signed-off-by: Sage Weil <sage@redhat.com>	2015-10-18 21:21:24 -04:00
Alexey Maykov	e1a09a7703	Implementation for GetPropertiesOfTablesInRange Summary: In MyRocks, it is sometimes important to get propeties only for the subset of the database. This diff implements the API in RocksDB. Test Plan: ran the GetPropertiesOfTablesInRange Reviewers: rven, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48651	2015-10-17 13:34:43 -07:00
Yueh-Hsuan Chiang	ad471453e8	Allow GetProperty to report the number of currently running flushes / compactions. Summary: Add rocksdb.num-running-compactions and rocksdb.num-running-flushes to GetIntProperty() that reports the number of currently running compactions / flushes. Test Plan: augmented existing tests in db_test Reviewers: igor, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D48693	2015-10-17 00:16:36 -07:00
sdong	277dea78f0	Add more kill points Summary: Add kill points in: 1. after creating a file 2. before writing a manifest record 3. before syncing manifest 4. before creating a new current file 5. after creating a new current file Test Plan: Run all current tests. Reviewers: yhchiang, igor, anthony, IslamAbdelRahman, rven, kradhakrishnan Reviewed By: kradhakrishnan Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D48855	2015-10-16 14:35:12 -07:00
Venkatesh Radhakrishnan	a98fbacfa0	Moving memtable related files from util to a new directory memtable Summary: We are cleaning up dependencies. This diff takes a first step at moving memtable files to their own directory called memtable. In future diffs, we will move other memtable files from db to memtable. Test Plan: make check Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D48915	2015-10-16 14:10:33 -07:00
Islam AbdelRahman	f55d3009c0	Make db_test_util compile under ROCKSDB_LITE Summary: db_test_util is used in multiple test files but it dont compile under ROCKSDB_LITE Test Plan: make check make static_lib OPT=-DROCKSDB_LITE make db_wal_test Reviewers: igor, yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48579	2015-10-13 17:33:23 -07:00
sdong	35ad531be3	Seperate InternalIterator from Iterator Summary: Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type. This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's. At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it. Test Plan: Run all existing tests. Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven Reviewed By: rven Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D48549	2015-10-13 15:32:13 -07:00
Praveen Rao	cc4d13e0a8	Put wal_filter under #ifndef ROCKSDB_LITE	2015-10-13 11:10:14 -07:00
Praveen Rao	eb24178553	merge from master	2015-10-12 17:24:21 -07:00
Islam AbdelRahman	c64ae05b1c	Move TEST_NewInternalIterator to NewInternalIterator Summary: Long time ago we add InternalDumpCommand to ldb_tool https://reviews.facebook.net/D11517 This command is using TEST_NewInternalIterator although it's not a test. This patch move TEST_NewInternalIterator outside of db_impl_debug.cc Test Plan: make check make static_lib Reviewers: yhchiang, igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48561	2015-10-12 17:22:37 -07:00
Praveen Rao	59a0c219bb	Adding log filter to inspect and filter log records on recovery	2015-10-12 17:03:03 -07:00
Alexey Maykov	3d07b815f6	Passing table properties to compaction callback Summary: It would be nice to have and access to table properties in compaction callbacks. In MyRocks project, it will make possible to update optimizer statistics online. Test Plan: ran the unit test. Ran myrocks with the new way of collecting stats. Reviewers: igor, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48267	2015-10-09 18:10:55 -07:00
sdong	776bd8d5eb	Pass column family ID to table property collector Summary: Pass column family ID through TablePropertiesCollectorFactory::CreateTablePropertiesCollector() so that users can identify which column family this file is for and handle it differently. Test Plan: Add unit test scenarios in tests related to table properties collectors to verify the information passed in is correct. Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: yoshinorim, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D48411	2015-10-09 14:36:51 -07:00
dyniusz	0267502655	Support for LevelDB SST with .ldb suffix Summary: Handle SST files with both ".sst" and ".ldb" suffix. This enables user to migrate from leveldb to rocksdb. Test Plan: Added unit test with DB operating on SSTs with names schema. See db/dc_test.cc:SSTsWithLdbSuffixHandling for details Reviewers: yhchiang, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D48003	2015-10-06 17:46:22 -07:00
Igor Canadi	115427ef63	Add APIs PauseBackgroundWork() and ContinueBackgroundWork() Summary: To support a new MongoDB capability, we need to make sure that we don't do any IO for a short period of time. For background, see: * https://jira.mongodb.org/browse/SERVER-20704 * https://jira.mongodb.org/browse/SERVER-18899 To implement that, I add a new API calls PauseBackgroundWork() and ContinueBackgroundWork() which reuse the capability we already have in place for RefitLevel() function. Test Plan: Added a new test in db_test. Made sure that test fails when PauseBackgroundWork() is commented out. Reviewers: IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D47901	2015-10-02 13:17:34 -07:00
Islam AbdelRahman	f03b5c987b	Add experimental DB::AddFile() to plug sst files into empty DB Summary: This is an initial version of bulk load feature This diff allow us to create sst files, and then bulk load them later, right now the restrictions for loading an sst file are (1) Memtables are empty (2) Added sst files have sequence number = 0, and existing values in database have sequence number = 0 (3) Added sst files values are not overlapping Test Plan: unit testing Reviewers: igor, ott, sdong Reviewed By: sdong Subscribers: leveldb, ott, dhruba Differential Revision: https://reviews.facebook.net/D39081	2015-09-23 12:42:43 -07:00
sdong	d0c31641d2	Internal stats WAL file synced to match meaning of the stats of the same name Summary: https://reviews.facebook.net/D23343 changed WAL sync bytes to extra fsync. This change does the same for internal stats. Test Plan: Run all existing unit tests and verify results in db_bench. Reviewers: anthony, rven, igor, MarkCallaghan, kradhakrishnan, yhchiang Reviewed By: yhchiang Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D47349	2015-09-22 14:23:11 -07:00
Andres Noetzli	014fd55adc	Support for SingleDelete() Summary: This patch fixes #7460559. It introduces SingleDelete as a new database operation. This operation can be used to delete keys that were never overwritten (no put following another put of the same key). If an overwritten key is single deleted the behavior is undefined. Single deletion of a non-existent key has no effect but multiple consecutive single deletions are not allowed (see limitations). In contrast to the conventional Delete() operation, the deletion entry is removed along with the value when the two are lined up in a compaction. Note: The semantics are similar to @igor's prototype that allowed to have this behavior on the granularity of a column family ( https://reviews.facebook.net/D42093 ). This new patch, however, is more aggressive when it comes to removing tombstones: It removes the SingleDelete together with the value whenever there is no snapshot between them while the older patch only did this when the sequence number of the deletion was older than the earliest snapshot. Most of the complex additions are in the Compaction Iterator, all other changes should be relatively straightforward. The patch also includes basic support for single deletions in db_stress and db_bench. Limitations: - Not compatible with cuckoo hash tables - Single deletions cannot be used in combination with merges and normal deletions on the same key (other keys are not affected by this) - Consecutive single deletions are currently not allowed (and older version of this patch supported this so it could be resurrected if needed) Test Plan: make all check Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor Reviewed By: igor Subscribers: maykov, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D43179	2015-09-17 11:42:56 -07:00
Venkatesh Radhakrishnan	51e1c11254	Do not flag error if file to be deleted does not exist Summary: Some users have observed errors in the log file when the log file or sst file is already deleted. Test Plan: Make sure that the errors do not appear for already deleted files. Reviewers: sdong Reviewed By: sdong Subscribers: anthony, kradhakrishnan, yhchiang, rven, igor, IslamAbdelRahman, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D47115	2015-09-17 10:21:34 -07:00
sdong	9aca7cd6d8	DB::Open() to flush info log after printing DB pointer Summary: Now DB::Open() flushes info log before printing DB pointer, so it may not show up if no activity after DB open. Move log flushing from after printing options to printing DB pointer. Test Plan: make commit-prereq Reviewers: igor, IslamAbdelRahman, yhchiang, kradhakrishnan, anthony, rven Reviewed By: rven Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D47121	2015-09-16 16:33:39 -07:00
Yueh-Hsuan Chiang	f21c7415a7	Change the log level of DB start-up log from Warn to Header. Summary: Change the log level of DB start-up log from Warn to Header. Test Plan: db_bench and observe the LOG header Reviewers: igor, anthony, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D47067	2015-09-16 11:31:45 -07:00
Alexey Maykov	3ebf11ed16	Adding the increment for a counter for a number of WAL syncs Summary: This will unblock the corresponding change in MyRocks Test Plan: ran rocksdb.write_sync test Reviewers: sdong, kolmike Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D46911	2015-09-16 11:00:49 -07:00
sdong	f3170b6f6c	DBImpl::FindObsoleteFiles() shouldn't release mutex between getting min_pending_output and scanning files Summary: Releasing mutex between getting min_pending_output and scanning files may cause min_pending_output to be max but some non-final files are found in file scanning, ending up with deleting wrong files. As a recent regression, mutex can be released while waiting for log sync. We move it to after file scanning. Test Plan: Run all existing tests. Don't think it is easy to write a unit test. Maybe we should find a way to assert lock not released so that we can have some test verification for similar cases. Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, kolmike, rven Reviewed By: rven Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D46899	2015-09-14 23:39:30 -07:00
krad	1126644082	Relaxing consistency detection to include errors while inserting to memtable as WAL recovery error. Summary: The current code, considers data to be consistent if the record checksum passes. We do have customer issues where the record checksum passed but the data was incomprehensible. There is no way to get out of this error case since all WAL recovery model will consider this error as unrelated to WAL. Relaxing the definition and including errors while inserting to memtable as WAL errors and handing them as per the recovery level. Test Plan: Used customer dump to verify the fix for different level. The db opens for kSkipAnyCorruptedRecords and kPointInTimeRecovery, but fails for kAbsoluteConsistency and kTolerateCorruptedTailRecords. Reviewers: sdon igor CC: leveldb@ Task ID: #7918721 Blame Rev:	2015-09-10 12:56:17 -07:00
Igor Canadi	ac9bcb55ce	Set max_open_files based on ulimit Summary: We should never set max_open_files to be bigger than the system's ulimit. Otherwise we will get "Too many open files" errors. See an example in this Travis run: https://travis-ci.org/facebook/rocksdb/jobs/79591566 Test Plan: make check I will also verify that max_max_open_files is reasonable. Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D46551	2015-09-10 10:49:28 -07:00
Andres Noetzli	3c9cef1eed	Unified maps with Comparator for sorting, other cleanup Summary: This diff is a collection of cleanups that were initially part of D43179. Additionally it adds a unified way of defining key-value maps that use a Comparator for sorting (this was previously implemented in four different places). Test Plan: make clean check all Reviewers: rven, anthony, yhchiang, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D45993	2015-09-02 13:58:22 -07:00
Andres Noetzli	effd9dd1e1	Fix deadlock in WAL sync Summary: MarkLogsSynced() was doing `logs_.erase(it++);`. The standard is saying: ``` all iterators and references are invalidated, unless the erased members are at an end (front or back) of the deque (in which case only iterators and references to the erased members are invalidated) ``` Because `it` is an iterator to the first element of the container, it is invalidated, only one iteration is executed and `log.getting_synced = false;` is not being done, so `while (logs_.front().getting_synced)` in `WriteImpl()` is not terminating. Test Plan: make db_bench && ./db_bench --benchmarks=fillsync Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, yhchiang, sdong, tnovak Reviewed By: tnovak Subscribers: kolmike, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D45807	2015-08-28 18:06:32 -07:00
Igor Canadi	5f4166c90e	ReadaheadRandomAccessFile -- userspace readahead Summary: ReadaheadRandomAccessFile acts as a transparent layer on top of RandomAccessFile. When a Read() request is issued, it issues a much bigger request to the OS and caches the result. When a new request comes in and we already have the data cached, it doesn't have to issue any requests to the OS. We add ReadaheadRandomAccessFile layer only when file is read during compactions. D45105 was incorrectly closed by Phabricator because I committed it to a separate branch (not master), so I'm resubmitting the diff. Test Plan: make check Reviewers: MarkCallaghan, sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D45123	2015-08-26 15:25:59 -07:00
Andres Notzli	09d982f9e0	Fix compact_files_example Summary: See task #7983654. The example was triggering an assert in compaction job because the compaction was not marked as manual. With this patch, CompactionPicker::FormCompaction() marks compactions as manual. This patch also fixes a couple of typos, adds optimistic_transaction_example to .gitignore and librocksdb as a dependency for examples. Adding librocksdb as a dependency makes sure that the examples are built with the latest changes in librocksdb. Test Plan: make clean && cd examples && make all && ./compact_files_example Reviewers: rven, sdong, anthony, igor, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D45117	2015-08-25 12:29:44 -07:00
Andres Noetzli	2050832974	Fixing race condition in DBTest.DynamicMemtableOptions Summary: This patch fixes a race condition in DBTEst.DynamicMemtableOptions. In rare cases, it was possible that the main thread would fill up both memtables before the flush job acquired its work. Then, the flush job was flushing both memtables together, producing only one L0 file while the test expected two. Now, the test waits for flushes to finish earlier, to make sure that the memtables are flushed in separate flush jobs. Test Plan: Insert "usleep(10000);" after "IOSTATS_SET_THREAD_POOL_ID(Env::Priority::HIGH);" in BGWorkFlush() to make the issue more likely. Then test with: make db_test && time while ./db_test --gtest_filter=*DynamicMemtableOptions; do true; done Reviewers: rven, sdong, yhchiang, anthony, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D45429	2015-08-24 17:04:18 -07:00
Igor Canadi	4ab26c5ad1	Smarter purging during flush Summary: Currently, we only purge duplicate keys and deletions during flush if `earliest_seqno_in_memtable <= newest_snapshot`. This means that the newest snapshot happened before we first created the memtable. This is almost never true for MyRocks and MongoRocks. This patch makes purging during flush able to understand snapshots. The main logic is copied from compaction_job.cc, although the logic over there is much more complicated and extensive. However, we should try to merge the common functionality at some point. I need this patch to implement no_overwrite_i_promise functionality for flush. We'll also need this to support SingleDelete() during Flush(). @yoshinorim requested the feature. Test Plan: make check I had to adjust some unit tests to understand this new behavior Reviewers: yhchiang, yoshinorim, anthony, sdong, noetzli Reviewed By: noetzli Subscribers: yoshinorim, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42087	2015-08-24 11:11:12 -07:00
Islam AbdelRahman	3fd70b05b8	Rate limit deletes issued by DestroyDB Summary: Update DestroyDB so that all SST files in the first path id go through DeleteScheduler instead of being deleted immediately Test Plan: added a unittest Reviewers: igor, yhchiang, anthony, kradhakrishnan, rven, sdong Reviewed By: sdong Subscribers: jeanxu2012, dhruba Differential Revision: https://reviews.facebook.net/D44955	2015-08-19 15:02:17 -07:00
Ari Ekmekji	f0da6977a3	[Parallel L0-L1 Compaction Prep]: Giving Subcompactions Their Own State Summary: In prepration for running multiple threads at the same time during a compaction job, this patch assigns each subcompaction its own state (instead of sharing the one global CompactionState). Each subcompaction then uses this state to update its statistics, keep track of its snapshots, etc. during the course of execution. Then at the end of all the executions the statistics are aggregated across the subcompactions so that the final result is the same as if only one larger compaction had run. Test Plan: ./db_test ./db_compaction_test ./compaction_job_test Reviewers: sdong, anthony, igor, noetzli, yhchiang Reviewed By: yhchiang Subscribers: MarkCallaghan, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D43239	2015-08-18 11:06:23 -07:00
sdong	72613657f0	Measure file read latency histogram per level Summary: In internal stats, remember read latency histogram, if statistics is enabled. It can be retrieved from DB::GetProperty() with "rocksdb.dbstats" property, if it is enabled. Test Plan: Manually run db_bench and prints out "rocksdb.dbstats" by hand and make sure it prints out as expected Reviewers: igor, IslamAbdelRahman, rven, kradhakrishnan, anthony, yhchiang Reviewed By: yhchiang Subscribers: MarkCallaghan, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D44193	2015-08-14 17:32:42 -07:00
Nathan Bronson	b7198c3afe	reduce db mutex contention for write batch groups Summary: This diff allows a Writer to join the next write batch group without acquiring any locks. Waiting is performed via a per-Writer mutex, so all of the non-leader writers never need to acquire the db mutex. It is now possible to join a write batch group after the leader has been chosen but before the batch has been constructed. This diff doesn't increase parallelism, but reduces synchronization overheads. For some CPU-bound workloads (no WAL, RAM-sized working set) this can substantially reduce contention on the db mutex in a multi-threaded environment. With T=8 N=500000 in a CPU-bound scenario (see the test plan) this is good for a 33% perf win. Not all scenarios see such a win, but none show a loss. This code is slightly faster even for the single-threaded case (about 2% for the CPU-bound scenario below). Test Plan: 1. unit tests 2. COMPILE_WITH_TSAN=1 make check 3. stress high-contention scenarios with db_bench -benchmarks=fillrandom -threads=$T -batch_size=1 -memtablerep=skip_list -value_size=0 --num=$N -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000 Reviewers: sdong, igor, rven, ljin, yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D43887	2015-08-14 10:55:43 -07:00
sdong	603b6da8b8	Add options.compaction_measure_io_stats to print write I/O stats in compactions Summary: Add options.compaction_measure_io_stats to print out / pass to listener accumulated time spent on write calls. Example outputs in info logs: 2015/08/12-16:27:59.463944 7fd428bff700 (Original Log Time 2015/08/12-16:27:59.463922) EVENT_LOG_v1 {"time_micros": 1439422079463897, "job": 6, "event": "compaction_finished", "output_level": 1, "num_output_files": 4, "total_output_size": 6900525, "num_input_records": 111483, "num_output_records": 106877, "file_write_nanos": 15663206, "file_range_sync_nanos": 649588, "file_fsync_nanos": 349614797, "file_prepare_write_nanos": 1505812, "lsm_state": [2, 4, 0, 0, 0, 0, 0]} Add two more counters in iostats_context. Also add a parameter of db_bench. Test Plan: Add a unit test. Also manually verify LOG outputs in db_bench Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D44115	2015-08-13 16:52:26 -07:00
agiardullo	0db807ec28	Transaction error statuses Summary: Based on feedback from spetrunia, we should better differentiate error statuses for transaction failures. https://github.com/MySQLOnRocksDB/mysql-5.6/issues/86#issuecomment-124605954 Test Plan: unit tests Reviewers: rven, kradhakrishnan, spetrunia, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D43323	2015-08-11 17:52:56 -07:00
agiardullo	c2f2cb0214	Pessimistic Transactions Summary: Initial implementation of Pessimistic Transactions. This diff contains the api changes discussed in D38913. This diff is pretty large, so let me know if people would prefer to meet up to discuss it. MyRocks folks: please take a look at the API in include/rocksdb/utilities/transaction[_db].h and let me know if you have any issues. Also, you'll notice a couple of TODOs in the implementation of RollbackToSavePoint(). After chatting with Siying, I'm going to send out a separate diff for an alternate implementation of this feature that implements the rollback inside of WriteBatch/WriteBatchWithIndex. We can then decide which route is preferable. Next, I'm planning on doing some perf testing and then integrating this diff into MongoRocks for further testing. Test Plan: Unit tests, db_bench parallel testing. Reviewers: igor, rven, sdong, yhchiang, yoshinorim Reviewed By: sdong Subscribers: hermanlee4, maykov, spetrunia, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D40869	2015-08-11 17:52:23 -07:00
sdong	6a4aaadcd7	Avoid type unique_ptr in LogWriterNumber::writer for Windows build break Summary: Visual Studio complains about deque<LogWriterNumber> because LogWriterNumber is non-copyable for its unique_ptr member writer. Move away from it, and do explit free. It is less safe but I can't think of a better way to unblock it. Test Plan: valgrind check test Reviewers: anthony, IslamAbdelRahman, kolmike, rven, yhchiang Reviewed By: yhchiang Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D43647	2015-08-06 10:52:41 -07:00
Mike Kolupaev	e06cf1a098	[wal changes 3/3] method in DB to sync WAL without blocking writers Summary: Subj. We really need this feature. Previous diff D40899 has most of the changes to make this possible, this diff just adds the method. Test Plan: `make check`, the new test fails without this diff; ran with ASAN, TSAN and valgrind. Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, tnovak, yhchiang, sdong Reviewed By: sdong Subscribers: MarkCallaghan, maykov, hermanlee4, yoshinorim, tnovak, dhruba Differential Revision: https://reviews.facebook.net/D40905	2015-08-05 06:06:39 -07:00
Islam AbdelRahman	c45a57b41e	Support delete rate limiting Summary: Introduce DeleteScheduler that allow enforcing a rate limit on file deletion Instead of deleting files immediately, files are moved to trash directory and deleted in a background thread that apply sleep penalty between deletes if needed. I have updated PurgeObsoleteFiles and PurgeObsoleteWALFiles to use the delete_scheduler instead of env_->DeleteFile Test Plan: added delete_scheduler_test existing unit tests Reviewers: kradhakrishnan, anthony, rven, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D43221	2015-08-04 20:45:27 -07:00
Poornima Chozhiyath Raman	1bdfcef7bf	Fix when output level is 0 of universal compaction with trivial move Summary: Fix for universal compaction with trivial move, when the ouput level is 0. The tests where failing. Fixed by allowing normal compaction when output level is 0. Test Plan: modified test cases run successfully. Reviewers: sdong, yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: anthony, kradhakrishnan, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D42933	2015-07-27 14:25:57 -07:00
Mike Kolupaev	fe09a6dae3	[wal changes 2/3] write with sync=true syncs previous unsynced wals to prevent illegal data loss Summary: I'll just copy internal task summary here: " This sequence will cause data loss in the middle after an sync write: non-sync write key 1 flush triggered, not yet scheduled sync write key 2 system crash After rebooting, users might see key 2 but not key 1, which violates the API of sync write. This can be reproduced using unit test FaultInjectionTest::DISABLED_WriteOptionSyncTest. One way to fix it is for a sync write, if there is outstanding unsynced log files, we need to syc them too. " This diff should be considered together with the next diff D40905; in isolation this fix probably could be a little simpler. Test Plan: `make check`; added a test for that (DBTest.SyncingPreviousLogs) before noticing FaultInjectionTest.WriteOptionSyncTest (keeping both since mine asserts a bit more); both tests fail without this diff; for D40905 stacked on top of this diff, ran tests with ASAN, TSAN and valgrind Reviewers: rven, yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40899	2015-07-22 03:28:08 -07:00
agiardullo	064294081b	Improved FileExists API Summary: Add new CheckFileExists method. Considered changing the FileExists api but didn't want to break anyone's builds. Test Plan: unit tests Reviewers: yhchiang, igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42003	2015-07-20 17:20:40 -07:00
sdong	6e9fbeb27c	Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future. Test Plan: Run all existing unit tests. Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D42321	2015-07-17 16:58:18 -07:00
Igor Canadi	35ca59364c	Don't let flushes preempt compactions Summary: When we first started, max_background_flushes was 0 by default and compaction thread was executing flushes (since there was no flush thread). Then, we switched the default max_background_flushes to 1. However, we still support the case where there is no flush thread and flushes are done in compaction. This is making our code a bit more complicated. By not supporting this use-case we can make our code simpler. We have a special case that when you set max_background_flushes to 0, we schedule the flush to execute on the compaction thread. Test Plan: make check (there might be some unit tests that depend on this behavior) Reviewers: IslamAbdelRahman, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D41931	2015-07-17 12:02:52 -07:00
Igor Canadi	a96fcd09b7	Deprecate CompactionFilterV2 Summary: It has been around for a while and it looks like it never found any uses in the wild. It's also complicating our compaction_job code quite a bit. We're deprecating it in 3.13, but will put it back in 3.14 if we actually find users that need this feature. Test Plan: make check Reviewers: noetzli, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42405	2015-07-17 18:59:11 +02:00
sdong	6c0c8dee7b	Fix data loss after DB recovery by not allowing flush/compaction to be scheduled until DB opened Summary: Previous run may leave some SST files with higher file numbers than manifest indicates. Compaction or flush may start to run while DB::Open() is still going on. SST file garbage collection may happen interleaving with compaction or flush, and overwrite files generated by compaction of flushes after they are generated. This might cause data loss. This possibility of interleaving is recently introduced. Fix it by not allowing compaction or flush to be scheduled before DB::Open() finishes. Test Plan: Add a unit test. This verification will have a chance to fail without the fix but doesn't fix without the fix. Reviewers: kradhakrishnan, anthony, yhchiang, IslamAbdelRahman, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42399	2015-07-16 10:57:41 -07:00
Poornima Chozhiyath Raman	beb19ad0dd	Fixing delete files in Trivial move of universal compaction Summary: Trvial move in universal compaction was failing when trying to move files from levels other than 0. This was because the DeleteFile while trivially moving, was only deleting files of level 0 which caused duplication of same file in different levels. This is fixed by passing the right level as argument in the call of DeleteFile while doing trivial move. Test Plan: ./db_test ran successfully with the new test cases. Reviewers: sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D42135	2015-07-15 12:28:22 -07:00
Igor Canadi	5aea98ddd8	Deprecate WriteOptions::timeout_hint_us Summary: In one of our recent meetings, we discussed deprecating features that are not being actively used. One of those features, at least within Facebook, is timeout_hint. The feature is really nicely implemented, but if nobody needs it, we should remove it from our code-base (until we get a valid use-case). Some arguments: * Less code == better icache hit rate, smaller builds, simpler code * The motivation for adding timeout_hint_us was to work-around RocksDB's stall issue. However, we're currently addressing the stall issue itself (see @sdong's recent work on stall write_rate), so we should never see sharp lock-ups in the future. * Nobody is using the feature within Facebook's code-base. Googling for `timeout_hint_us` also doesn't yield any users. Test Plan: make check Reviewers: anthony, kradhakrishnan, sdong, yhchiang Reviewed By: yhchiang Subscribers: sdong, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D41937	2015-07-14 09:35:48 +02:00
sdong	f9728640f3	"make format" against last 10 commits Summary: This helps Windows port to format their changes, as discussed. Might have formatted some other codes too becasue last 10 commits include more. Test Plan: Build it. Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D41961	2015-07-13 13:50:18 -07:00
sdong	5fd11853cb	Print Fast CRC32 support information in DB LOG Summary: Print whether fast CRC32 is supported in DB info LOG Test Plan: Run db_bench and see it prints out correctly. Reviewers: yhchiang, anthony, kradhakrishnan, igor Reviewed By: igor Subscribers: MarkCallaghan, yoshinorim, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D41733	2015-07-10 17:59:36 -07:00
Dmitri Smirnov	c903ccc4c2	Merge from github/master	2015-07-09 18:01:08 -07:00
Poornima Chozhiyath Raman	4bed00a44b	Fix function name format according to google style Summary: Change the naming style of getter and setters according to Google C++ style in compaction.h file Test Plan: Compilation success Reviewers: sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D41265	2015-07-08 15:21:10 -07:00
Poornima Chozhiyath Raman	c0b23dd5b0	Enabling trivial move in universal compaction Summary: This change enables trivial move if all the input files are non onverlapping while doing Universal Compaction. Test Plan: ./compaction_picker_test and db_test ran successfully with the new testcases. Reviewers: sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D40875	2015-07-07 14:18:55 -07:00
Yueh-Hsuan Chiang	4ce5be4255	fixed leaking log::Writers Summary: Fixes valgrind errors in column_family_test. Test Plan: `make check`, `make valgrind_check` Reviewers: igor, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D41181	2015-07-07 12:10:10 -07:00
Mike Kolupaev	218487d8dc	[wal changes 1/3] fixed unbounded wal growth in some workloads Summary: This fixes the following scenario we've hit: - we reached max_total_wal_size, created a new wal and scheduled flushing all memtables corresponding to the old one, - before the last of these flushes started its column family was dropped; the last background flush call was a no-op; no one removed the old wal from alive_logs_, - hours have passed and no flushes happened even though lots of data was written; data is written to different column families, compactions are disabled; old column families are dropped before memtable grows big enough to trigger a flush; the old wal still sits in alive_logs_ preventing max_total_wal_size limit from kicking in, - a few more hours pass and we run out disk space because of one huge .log file. Test Plan: `make check`; backported the new test, checked that it fails without this diff Reviewers: igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40893	2015-07-02 14:27:00 -07:00
Dmitri Smirnov	9dbde7277c	Merge remote-tracking branch 'origin' into ms_win_port	2015-07-02 11:34:22 -07:00
Dmitri Smirnov	18285c1e2f	Windows Port from Microsoft Summary: Make RocksDb build and run on Windows to be functionally complete and performant. All existing test cases run with no regressions. Performance numbers are in the pull-request. Test plan: make all of the existing unit tests pass, obtain perf numbers. Co-authored-by: Praveen Rao praveensinghrao@outlook.com Co-authored-by: Sherlock Huang baihan.huang@gmail.com Co-authored-by: Alex Zinoviev alexander.zinoviev@me.com Co-authored-by: Dmitri Smirnov dmitrism@microsoft.com	2015-07-01 16:13:56 -07:00
Venkatesh Radhakrishnan	c9cd404bcd	Make flush check for shutdown Summary: Fixes task 7156865 where a compaction causes a hang in flush memtable if CancelAllBackgroundWork was called prior to it. Stack trace is in : https://phabricator.fb.com/P19848829 We end up waiting for a flush which will never happen because there are no background threads. Test Plan: PreShutdownFlush Reviewers: sdong, igor Reviewed By: sdong, igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D40617	2015-06-25 14:43:25 -07:00
Islam AbdelRahman	674b1181cf	Bottommost level compaction option Summary: Replace force_bottommost_level_compaction in CompactRangeOption with an option that allow the user to (always skip, always compact, compact if compaction filter is present) the bottommost level for level based compaction. Test Plan: make check Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40527	2015-06-23 13:32:40 -07:00
Giuseppe Ottaviano	782a1590f9	Implement a table-level row cache Summary: Implementation of a table-level row cache. It only caches point queries done through the `DB::Get` interface, queries done through the `Iterator` interface will completely skip the cache. Supports snapshots and merge operations. Test Plan: Ran `make valgrind_check commit-prereq` Reviewers: igor, philipp, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D39849	2015-06-23 10:25:45 -07:00
krad	de85e4cadf	Introduce WAL recovery consistency levels Summary: The "one size fits all" approach with WAL recovery will only introduce inconvenience for our varied clients as we go forward. The current recovery is a bit heuristic. We introduce the following levels of consistency while replaying the WAL. 1. RecoverAfterRestart (kTolerateCorruptedTailRecords) This mocks the current recovery mode. 2. RecoverAfterCleanShutdown (kAbsoluteConsistency) This is ideal for unit test and cases where the store is shutdown cleanly. We tolerate no corruption or incomplete writes. 3. RecoverPointInTime (kPointInTimeRecovery) This is ideal when using devices with controller cache or file systems which can loose data on restart. We recover upto the point were is no corruption or incomplete write. 4. RecoverAfterDisaster (kSkipAnyCorruptRecord) This is ideal mode to recover data. We tolerate corruption and incomplete writes, and we hop over those sections that we cannot make sense of salvaging as many records as possible. Test Plan: (1) Run added unit test to cover all levels. (2) Run make check. Reviewers: leveldb, sdong, igor Subscribers: yoshinorim, dhruba Differential Revision: https://reviews.facebook.net/D38487	2015-06-22 15:28:12 -07:00
Islam AbdelRahman	530534fceb	Fix trivial move merge Summary: Fixing bad merge Test Plan: make -j64 check (this is not enough to verify the fix) Reviewers: igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40521	2015-06-22 15:20:30 -07:00
Igor Canadi	760e9a94de	Fail DB::Open() when the requested compression is not available Summary: Currently RocksDB silently ignores this issue and doesn't compress the data. Based on discussion, we agree that this is pretty bad because it can cause confusion for our users. This patch fails DB::Open() if we don't support the compression that is specified in the options. Test Plan: make check with LZ4 not present. If Snappy is not present all tests will just fail because Snappy is our default library. We should make Snappy the requirement, since without it our default DB::Open() fails. Reviewers: sdong, MarkCallaghan, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39687	2015-06-18 14:55:05 -07:00
Islam AbdelRahman	4eabbdb7ec	Skip bottommost level compaction if possible Summary: This is https://reviews.facebook.net/D39999 but after introducing an option to force compaction the bottom most level Changes in this patch - Introduce force_bottommost_level_compaction to CompactRangeOptions that force compacting bottommost level during compaction - Skip bottommost level compaction if we dont have a compaction filter and force_bottommost_level_compaction options is not set Although tests pass on my machine but I suspect that there maybe some tests that I am not aware of that should use force_bottommost_level_compaction to pass in a deterministic way Test Plan: make check adding new tests Reviewers: igor, sdong, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40059	2015-06-18 11:03:31 -07:00
Yueh-Hsuan Chiang	bb1c74ce18	Fixed a bug of CompactionStats in multi-level universal compaction case Summary: Universal compaction can involves in multiple levels. However, the current implementation of bytes_readn and bytes_readnp1 (and some other stats with postfix `n` and `np1`) assumes compaction can only have two levels. This patch fixes this bug and redefines bytes_readn and bytes_readnp1: * bytes_readnp1: the number of bytes read in the compaction output level. * bytes_readn: the total number of bytes read minus bytes_readnp1 Test Plan: Add a test in compaction_job_stats_test Reviewers: igor, sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D40239	2015-06-17 23:40:34 -07:00
Islam AbdelRahman	12e030a992	Use CompactRangeOptions for CompactRange Summary: This diff update DB::CompactRange to use RangeCompactionOptions instead of using multiple parameters Old CompactRange is still available but deprecated Test Plan: make all check make rocksdbjava USE_CLANG=1 make all OPT=-DROCKSDB_LITE make release Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40209	2015-06-17 14:36:14 -07:00
Igor Canadi	25d600569d	Clean up InstallSuperVersion Summary: We go to great lengths to make sure MaybeScheduleFlushOrCompaction() is called outside of write thread. But anyway, it's still called in the mutex, so it's not that much cheaper. This diff removes the "optimization" and cleans up the code a bit. Test Plan: make check Reviewers: rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D40113	2015-06-17 12:37:59 -07:00
sdong	40f562e747	Allow GetApproximateSize() to include mem table size if it is skip list memtable Summary: Add an option in GetApproximateSize() so that the result will include estimated sizes in mem tables. To implement it, implement an estimated count from the beginning to a key in skip list. The approach is to count to find the entry, how many Next() is issued from each level, and sum them with a weight that is <branching factor> ^ <level>. Test Plan: Add a test case Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D40119	2015-06-16 18:13:23 -07:00
Islam AbdelRahman	cccd2199a6	Revert skip bottommost compaction Summary: Reverting this diff https://reviews.facebook.net/D39999 Will add an option to force bottom most level compaction and then re submit it Test Plan: make check Reviewers: igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40041	2015-06-12 10:43:33 -07:00
Islam AbdelRahman	20f2b54252	Skip bottom most level compaction if no compaction filter Summary: If we don't have a compaction filter then we can skip compacting the bottom most level Test Plan: make check added unit tests Reviewers: yhchiang, sdong, igor Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D39999	2015-06-12 09:56:08 -07:00
sdong	7842920be5	Slow down writes by bytes written Summary: We slow down data into the database to the rate of options.delayed_write_rate (a new option) with this patch. The thread synchronization approach I take is to still synchronize write controller by DB mutex and GetDelay() is inside DB mutex. Try to minimize the frequency of getting time in GetDelay(). I verified it through db_bench and it seems to work hard_rate_limit is deprecated. options.delayed_write_rate is still not dynamically changeable. Need to work on it as a follow-up. Test Plan: Add new unit tests in db_test Reviewers: yhchiang, rven, kradhakrishnan, anthony, MarkCallaghan, igor Reviewed By: igor Subscribers: ikabiljo, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D36351	2015-06-11 20:42:18 -07:00
Islam AbdelRahman	d6ce0f7c61	Add largest sequence to FlushJobInfo Summary: Adding largest sequence number to FlushJobInfo and passing flushed file metadata to NotifyOnFlushCompleted which include alot of other values that we may want to expose in FlushJobInfo Test Plan: make check Reviewers: igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D39927	2015-06-11 15:22:22 -07:00
Yueh-Hsuan Chiang	3eddd1abe9	Add Env::GetThreadID(), which returns the ID of the current thread. Summary: Add Env::GetThreadID(), which returns the ID of the current thread. In addition, make GetThreadList() and InfoLog use same unique ID for the same thread. Test Plan: db_test listener_test Reviewers: igor, rven, IslamAbdelRahman, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D39735	2015-06-11 14:18:02 -07:00
Islam AbdelRahman	73faa3d41d	Handling edge cases for ReFitLevel Summary: Right now the level we pass to ReFitLevel is the maximum level with files (before compaction), there are multiple cases where this maximum level have changed after compaction - all files where in L0 (now maximum level is L1) - using kCompactionStyleUniversal (now maximum level in the last level) - level_compaction_dynamic_level_bytes ?? We can handle each of these cases individually, but I felt it's safer to calculate max_level_with_files again if we want to do a ReFitLevel Test Plan: adding some tests make -j64 check Reviewers: igor, sdong Reviewed By: sdong Subscribers: ott, dhruba Differential Revision: https://reviews.facebook.net/D39663	2015-06-11 14:15:52 -07:00
Venkatesh Radhakrishnan	406a5682eb	Fix hang when closing a DB after doing loads with WAL disabled. Summary: There is a hang during DB close in the following scenario: a) a load with WAL disabled was done, b) CancelAllBackgroundWork was called, c) DB Close was called This was because in that we will wait for a flush but we cannot do a background flush because we have called CancelAllBackgroundWork which marks the DB as shutting downn. Test Plan: Added DBTest FlushOnDestroy Reviewers: sdong Reviewed By: sdong Subscribers: yoshinorim, hermanlee4, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39747	2015-06-09 10:39:49 -07:00
sdong	d8c8f08c12	GetSnapshot() and ReleaseSnapshot() to move new and free out of DB mutex Summary: We currently issue malloc and free inside DB mutex in GetSnapshot() and ReleaseSnapshot(). Move them out. Test Plan: Go through all tests make valgrind_check Reviewers: yhchiang, rven, IslamAbdelRahman, anthony, igor Reviewed By: igor Subscribers: maykov, hermanlee4, MarkCallaghan, yoshinorim, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D39753	2015-06-08 21:57:02 -07:00
sdong	6df589b446	Add TablePropertiesCollector::NeedCompact() to suggest DB to further compact output files Summary: It is experimental. Allow users to return from a call back function TablePropertiesCollector::NeedCompact(), based on the data in the file. It can be used to allow users to suggest DB to clear up delete tombstones faster. Test Plan: Add a unit test. Reviewers: igor, yhchiang, kradhakrishnan, rven Reviewed By: rven Subscribers: yoshinorim, MarkCallaghan, maykov, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D39585	2015-06-05 20:18:21 -07:00
Yueh-Hsuan Chiang	2e764f06ea	[API Change] Improve EventListener::OnFlushCompleted interface Summary: EventListener::OnFlushCompleted() now passes a structure instead of a list of parameters. This minimizes the API change in the future. Test Plan: listener_test compact_files_test example/compact_files_example Reviewers: kradhakrishnan, sdong, IslamAbdelRahman, rven, igor Reviewed By: rven, igor Subscribers: IslamAbdelRahman, rven, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39543	2015-06-05 12:28:51 -07:00
Yueh-Hsuan Chiang	7322c74012	Revert incorrect commit Summary: Revert incorrect commit Test Plan: db_test Reviewers: sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39651	2015-06-05 11:23:09 -07:00
Islam AbdelRahman	31e60e2a77	Unlock mutex in ReFitLevel Summary: I encountered an issue where the database hang, it looks like the mutex is not unlocked on return in ReFitLevel function Test Plan: make -j64 check Reviewers: yhchiang, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D39609	2015-06-05 11:06:14 -07:00
Yueh-Hsuan Chiang	7647df8f9e	Fixed the tsan failure in util/compaction_job_stats_impl.cc Summary: The type of smallest_output_key_prefix and largest_output_key_prefix have been changed to std::string in https://reviews.facebook.net/D39537. As a result, we shouldn't do smallest_output_key_prefix[0] = 0 in the initialization. Test Plan: compile db_test with tsan enabled and repeat DBTest.CompactionDeletionTrigger test to verify the tsan issue has been gone. Reviewers: igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39645	2015-06-05 11:05:35 -07:00
Islam AbdelRahman	3ce3bb3da2	Allowing L0 -> L1 trivial move on sorted data Summary: This diff updates the logic of how we do trivial move, now trivial move can run on any number of files in input level as long as they are not overlapping The conditions for trivial move have been updated Introduced conditions: - Trivial move cannot happen if we have a compaction filter (except if the compaction is not manual) - Input level files cannot be overlapping Removed conditions: - Trivial move only run when the compaction is not manual - Input level should can contain only 1 file More context on what tests failed because of Trivial move ``` DBTest.CompactionsGenerateMultipleFiles This test is expecting compaction on a file in L0 to generate multiple files in L1, this test will fail with trivial move because we end up with one file in L1 ``` ``` DBTest.NoSpaceCompactRange This test expect compaction to fail when we force environment to report running out of space, of course this is not valid in trivial move situation because trivial move does not need any extra space, and did not check for that ``` ``` DBTest.DropWrites Similar to DBTest.NoSpaceCompactRange ``` ``` DBTest.DeleteObsoleteFilesPendingOutputs This test expect that a file in L2 is deleted after it's moved to L3, this is not valid with trivial move because although the file was moved it is now used by L3 ``` ``` CuckooTableDBTest.CompactionIntoMultipleFiles Same as DBTest.CompactionsGenerateMultipleFiles ``` This diff is based on a work by @sdong https://reviews.facebook.net/D34149 Test Plan: make -j64 check Reviewers: rven, sdong, igor Reviewed By: igor Subscribers: yhchiang, ott, march, dhruba, sdong Differential Revision: https://reviews.facebook.net/D34797	2015-06-04 16:51:25 -07:00
Yueh-Hsuan Chiang	0b3172d071	Add EventListener::OnTableFileDeletion() Summary: Add EventListener::OnTableFileDeletion(), which will be called when a table file is deleted. Test Plan: Extend three existing tests in db_test to verify the deleted files. Reviewers: rven, anthony, kradhakrishnan, igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38931	2015-06-03 19:57:01 -07:00
Yueh-Hsuan Chiang	8afafc2783	Fix compile warning in db/db_impl Summary: Fix the following compile warning in db/db_impl db/db_impl.cc:1603:19: error: implicit conversion loses integer precision: 'const uint64_t' (aka 'const unsigned long') to 'int' [-Werror,-Wshorten-64-to-32] info.job_id = job_id; ~ ^~~~~~ Test Plan: db_test Reviewers: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39423	2015-06-02 17:36:45 -07:00
Yueh-Hsuan Chiang	fe5c6321cb	Allow EventListener::OnCompactionCompleted to return CompactionJobStats. Summary: Allow EventListener::OnCompactionCompleted to return CompactionJobStats, which contains useful information about a compaction. Example CompactionJobStats returned by OnCompactionCompleted(): smallest_output_key_prefix 05000000 largest_output_key_prefix 06990000 elapsed_time 42419 num_input_records 300 num_input_files 3 num_input_files_at_output_level 2 num_output_records 200 num_output_files 1 actual_bytes_input 167200 actual_bytes_output 110688 total_input_raw_key_bytes 5400 total_input_raw_value_bytes 300000 num_records_replaced 100 is_manual_compaction 1 Test Plan: Developed a mega test in db_test which covers 20 variables in CompactionJobStats. Reviewers: rven, igor, anthony, sdong Reviewed By: sdong Subscribers: tnovak, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38463	2015-06-02 17:07:16 -07:00
Yueh-Hsuan Chiang	fc83821270	Add EventListener::OnTableFileCreated() Summary: Add EventListener::OnTableFileCreated(), which will be called when a table file is created. This patch is part of the EventLogger and EventListener integration. Test Plan: Augment existing test in db/listener_test.cc Reviewers: anthony, kradhakrishnan, rven, igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38865	2015-06-02 14:12:23 -07:00
Yueh-Hsuan Chiang	898e803fc5	Add a stats counter for DB_WRITE back which was mistakenly removed. Summary: Add a stats counter for DB_WRITE back which was mistakenly removed. Test Plan: augment GroupCommitTest Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39399	2015-06-02 12:35:12 -07:00
Mike Kolupaev	ec7a944360	more times in perf_context and iostats_context Summary: We occasionally get write stalls (>1s Write() calls) on HDD under read load. The following timers explain almost all of the stalls: - perf_context.db_mutex_lock_nanos - perf_context.db_condition_wait_nanos - iostats_context.open_time - iostats_context.allocate_time - iostats_context.write_time - iostats_context.range_sync_time - iostats_context.logger_time In my experiments each of these occasionally takes >1s on write path under some workload. There are rare cases when Write() takes long but none of these takes long. Test Plan: Added code to our application to write the listed timings to log for slow writes. They usually add up to almost exactly the time Write() call took. Reviewers: rven, yhchiang, sdong Reviewed By: sdong Subscribers: march, dhruba, tnovak Differential Revision: https://reviews.facebook.net/D39177	2015-06-02 02:07:58 -07:00
sdong	4266d4fd90	Allow users to migrate to options.level_compaction_dynamic_level_bytes=true using CompactRange() Summary: In DB::CompactRange(), change parameter "reduce_level" to "change_level". Users can compact all data to the last level if needed. By doing it, users can migrate the DB to options.level_compaction_dynamic_level_bytes=true. Test Plan: Add a unit test for it. Reviewers: yhchiang, anthony, kradhakrishnan, igor, rven Reviewed By: rven Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D39099	2015-06-01 18:21:14 -07:00
Yueh-Hsuan Chiang	d333820bad	Removed DBImpl::notifying_events_ Summary: DBImpl::notifying_events_ is a internal counter in DBImpl which is used to prevent DB close when DB is notifying events. However, as the current events all rely on either compaction or flush which already have similar counters to prevent DB close, it is safe to remove notifying_events_. Test Plan: listener_test examples/compact_files_example Reviewers: igor, anthony, kradhakrishnan, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39315	2015-06-01 15:32:23 -07:00
agiardullo	bc7a7a400c	fix LITE build Summary: Broken by optimistic transaction diff. (I only built 'release' not 'static_lib' when testing). Test Plan: build Reviewers: yhchiang, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39219	2015-05-29 15:22:00 -07:00
agiardullo	dc9d70de65	Optimistic Transactions Summary: Optimistic transactions supporting begin/commit/rollback semantics. Currently relies on checking the memtable to determine if there are any collisions at commit time. Not yet implemented would be a way of enuring the memtable has some minimum amount of history so that we won't fail to commit when the memtable is empty. You should probably start with transaction.h to get an overview of what is currently supported. Test Plan: Added a new test, but still need to look into stress testing. Reviewers: yhchiang, igor, rven, sdong Reviewed By: sdong Subscribers: adamretter, MarkCallaghan, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D33435	2015-05-29 14:36:35 -07:00
agiardullo	c815351038	Support saving history in memtable_list Summary: For transactions, we are using the memtables to validate that there are no write conflicts. But after flushing, we don't have any memtables, and transactions could fail to commit. So we want to someone keep around some extra history to use for conflict checking. In addition, we want to provide a way to increase the size of this history if too many transactions fail to commit. After chatting with people, it seems like everyone prefers just using Memtables to store this history (instead of a separate history structure). It seems like the best place for this is abstracted inside the memtable_list. I decide to create a separate list in MemtableListVersion as using the same list complicated the flush/installalflushresults logic too much. This diff adds a new parameter to control how much memtable history to keep around after flushing. However, it sounds like people aren't too fond of adding new parameters. So I am making the default size of flushed+not-flushed memtables be set to max_write_buffers. This should not change the maximum amount of memory used, but make it more likely we're using closer the the limit. (We are now postponing deleting flushed memtables until the max_write_buffer limit is reached). So while we might use more memory on average, we are still obeying the limit set (and you could argue it's better to go ahead and use up memory now instead of waiting for a write stall to happen to test this limit). However, if people are opposed to this default behavior, we can easily set it to 0 and require this parameter be set in order to use transactions. Test Plan: Added a xfunc test to play around with setting different values of this parameter in all tests. Added testing in memtablelist_test and planning on adding more testing here. Reviewers: sdong, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D37443	2015-05-28 16:34:24 -07:00
Yueh-Hsuan Chiang	ec4ff4e99c	Rename EventLoggerHelpers EventHelpers Summary: Rename EventLoggerHelpers EventHelpers, as it's going to include all event-related helper functions instead of EventLogger only stuffs. Test Plan: make Reviewers: sdong, rven, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39093	2015-05-28 13:37:47 -07:00
Yueh-Hsuan Chiang	672dda9b3b	[API Change] Move listeners from ColumnFamilyOptions to DBOptions Summary: Move listeners from ColumnFamilyOptions to DBOptions Test Plan: listener_test compact_files_test Reviewers: rven, anthony, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39087	2015-05-28 13:21:39 -07:00
Yueh-Hsuan Chiang	a0580205c8	Removed an unused private variable in db_impl.h Summary: Removed an unused private variable in db_impl.h Test Plan: make db_test Reviewers: sdong, anthony, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38925	2015-05-26 10:46:26 -07:00
Yueh-Hsuan Chiang	5c224d1b70	Fixed two bugs on logging file deletion. Summary: This patch fixes the following two bugs on logging file deletion. 1. Previously, file deletion failure was only logged in INFO_LEVEL. This patch changes it to ERROR_LEVEL and does some code clean. 2. EventLogger previously will always generate the same log on table file deletion even when file deletion is not successful. Now the resulting status of file deletion will also be logged. Test Plan: make all check Reviewers: sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38817	2015-05-22 12:10:51 -07:00
Yueh-Hsuan Chiang	dc81efe415	Change the log-level of DB summary and options from INFO_LEVEL to WARN_LEVEL Summary: Change the log-level of DB summary and options from INFO_LEVEL to WARN_LEVEL Test Plan: Use db_bench to verify the log level. Sample output: 2015/05/22-00:20:39.778064 7fff75b41300 [WARN] RocksDB version: 3.11.0 2015/05/22-00:20:39.778095 7fff75b41300 [WARN] Git sha rocksdb_build_git_sha:7fee8775a459134c4cb04baae5bd1687e268f2a0 2015/05/22-00:20:39.778099 7fff75b41300 [WARN] Compile date May 22 2015 2015/05/22-00:20:39.778101 7fff75b41300 [WARN] DB SUMMARY 2015/05/22-00:20:39.778145 7fff75b41300 [WARN] SST files in /tmp/rocksdbtest-691931916/dbbench dir, Total Num: 0, files: 2015/05/22-00:20:39.778148 7fff75b41300 [WARN] Write Ahead Log file in /tmp/rocksdbtest-691931916/dbbench: 2015/05/22-00:20:39.778150 7fff75b41300 [WARN] Options.error_if_exists: 0 2015/05/22-00:20:39.778152 7fff75b41300 [WARN] Options.create_if_missing: 1 2015/05/22-00:20:39.778153 7fff75b41300 [WARN] Options.paranoid_checks: 1 Reviewers: MarkCallaghan, igor, kradhakrishnan Reviewed By: igor Subscribers: sdong, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38835	2015-05-22 11:54:59 -07:00
Yueh-Hsuan Chiang	2abb592688	Avoid logging under mutex in DBImpl::WriteLevel0TableForRecovery(). Summary: Avoid logging under mutex in DBImpl::WriteLevel0TableForRecovery(). Test Plan: make all check Reviewers: igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38823	2015-05-22 11:24:12 -07:00
Yueh-Hsuan Chiang	e2c1d4b57f	[Public API Change] Make DB::GetDbIdentity() be const function. Summary: Make DB::GetDbIdentity() be const function. Test Plan: make db_test Reviewers: igor, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38745	2015-05-21 11:01:48 -07:00
Yueh-Hsuan Chiang	812c461c96	Dump db stats in WARN level Summary: Dump db stats in WARN level Test Plan: run db_bench and verify the LOG Reviewers: igor, MarkCallaghan Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38691	2015-05-19 18:42:17 -07:00
Igor Canadi	4a855c0799	Add an option wal_bytes_per_sync to control sync_file_range for WAL files Summary: sync_file_range is not always asyncronous and thus can block writes if we do this for WAL in the foreground thread. See more here: http://yoshinorimatsunobu.blogspot.com/2014/03/how-syncfilerange-really-works.html Some users don't want us to call sync_file_range on WALs. Some other do. Thus, I'm adding a separate option wal_bytes_per_sync to control calling sync_file_range on WAL files. bytes_per_sync will apply only to table files now. Test Plan: no more sync_file_range for WAL as evidenced by strace Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38253	2015-05-18 17:03:59 -07:00
Igor Canadi	b0fdda4ff0	Allow flushes to run in parallel with manual compaction Summary: As title. I spent some time thinking about it and I don't think there should be any issue with running manual compaction and flushes in parallel Test Plan: make check works Reviewers: rven, yhchiang, sdong Reviewed By: yhchiang, sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38355	2015-05-18 15:34:33 -07:00
sdong	6fa7085121	CompactRange skips levels 1 to base_level -1 for dynamic level base size Summary: CompactRange() now is much more expensive for dynamic level base size as it goes through all the levels. Skip those not used levels between level 0 an base level. Test Plan: Run all unit tests Reviewers: yhchiang, rven, anthony, kradhakrishnan, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D37125	2015-05-18 10:54:11 -07:00
Igor Canadi	dbd95b7532	Add more table properties to EventLogger Summary: Example output: {"time_micros": 1431463794310521, "job": 353, "event": "table_file_creation", "file_number": 387, "file_size": 86937, "table_info": {"data_size": "81801", "index_size": "9751", "filter_size": "0", "raw_key_size": "23448", "raw_average_key_size": "24.000000", "raw_value_size": "990571", "raw_average_value_size": "1013.890481", "num_data_blocks": "245", "num_entries": "977", "filter_policy_name": "", "kDeletedKeys": "0"}} Also fixed a bug where BuildTable() in recovery was passing Env::IOHigh argument into paranoid_checks_file parameter. Test Plan: make check + check out the output in the log Reviewers: sdong, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38343	2015-05-12 15:53:55 -07:00
agiardullo	711465ccec	API to fetch from both a WriteBatchWithIndex and the db Summary: Added a couple functions to WriteBatchWithIndex to make it easier to query the value of a key including reading pending writes from a batch. (This is needed for transactions). I created write_batch_with_index_internal.h to use to store an internal-only helper function since there wasn't a good place in the existing class hierarchy to store this function (and it didn't seem right to stick this function inside WriteBatchInternal::Rep). Since I needed to access the WriteBatchEntryComparator, I moved some helper classes from write_batch_with_index.cc into write_batch_with_index_internal.h/.cc. WriteBatchIndexEntry, ReadableWriteBatch, and WriteBatchEntryComparator are all unchanged (just moved to a different file(s)). Test Plan: Added new unit tests. Reviewers: rven, yhchiang, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38037	2015-05-11 14:51:51 -07:00
Igor Canadi	65fe1cfbb3	Cleanup CompactionJob Summary: Couple changes: 1. instead of SnapshotList, just take a vector of snapshots 2. don't take a separate parameter is_snapshots_supported. If there are snapshots in the list, that means they are supported. I actually think we should get rid of this notion of snapshots not being supported. 3. don't pass in mutable_cf_options as a parameter. Lifetime of mutable_cf_options is a bit tricky to maintain, so it's better to not pass it in for the whole compaction job. We only really need it when we install the compaction results. Test Plan: make check Reviewers: sdong, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D36627	2015-05-05 19:01:12 -07:00
Laurent Demailly	df4130ad85	fix crashes in stats and compaction filter for db_ttl_impl Summary: fix crashes in stats and compaction filter for db_ttl_impl Test Plan: Ran build with lots of debugging https://reviews.facebook.net/differential/diff/194175/ Reviewers: yhchiang, igor, rven Reviewed By: igor Subscribers: rven, dhruba Differential Revision: https://reviews.facebook.net/D38001	2015-05-05 16:54:47 -07:00
Igor Canadi	36a7408896	Fix UNLIKELY parenthesis Summary: Ooops :) status.ok() is acutally highly likely :) Test Plan: none Reviewers: rven, yhchiang, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38043	2015-05-05 08:57:34 -07:00
Venkatesh Radhakrishnan	d2346c2cf0	Fix hang with large write batches and column families. Summary: This diff fixes a hang reported by a Github user. https://www.facebook.com/l.php?u=https%3A%2F%2Fgithub.com%2Ffacebook%2Frocksdb%2Fissues%2F595%23issuecomment-96983273&h=9AQFYOWlo Multiple large write batches with column families cause a hang. The issue was caused by not doing flushes/compaction when the write controller was stopped. Test Plan: Create a DBTest from the user's test case Reviewers: igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D37929	2015-05-01 15:41:50 -07:00
krad	d4540654e9	Optimize GetApproximateSizes() to use lesser CPU cycles. Summary: CPU profiling reveals GetApproximateSizes as a bottleneck for performance. The current implementation is sub-optimal, it scans every file in every level to compute the result. We can take advantage of the fact that all levels above 0 are sorted in the increasing order of key ranges and use binary search to locate the starting index. This can reduce the number of comparisons required to compute the result. Test Plan: We have good test coverage. Run the tests. Reviewers: sdong, igor, rven, dynamike Subscribers: dynamike, maykov, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D37755	2015-04-30 10:55:03 -07:00
Igor Canadi	7f47ba0e26	Fix possible SIGSEGV in CompactRange (github issue #596 ) Summary: For very detailed explanation of what's happening read this: https://github.com/facebook/rocksdb/issues/596 Test Plan: make check + new unit test Reviewers: yhchiang, anthony, rven Reviewed By: rven Subscribers: adamretter, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D37779	2015-04-29 10:52:31 -07:00
Igor Canadi	1bb4928da9	Include bunch of more events into EventLogger Summary: Added these events: * Recovery start, finish and also when recovery creates a file * Trivial move * Compaction start, finish and when compaction creates a file * Flush start, finish Also includes small fix to EventLogger Also added option ROCKSDB_PRINT_EVENTS_TO_STDOUT which is useful when we debug things. I've spent far too much time chasing LOG files. Still didn't get sst table properties in JSON. They are written very deeply into the stack. I'll address in separate diff. TODO: * Write specification. Let's first use this for a while and figure out what's good data to put here, too. After that we'll write spec * Write tools that parse and analyze LOGs. This can be in python or go. Good intern task. Test Plan: Ran db_bench with ROCKSDB_PRINT_EVENTS_TO_STDOUT. Here's the output: https://phabricator.fb.com/P19811976 Reviewers: sdong, yhchiang, rven, MarkCallaghan, kradhakrishnan, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D37521	2015-04-27 15:20:02 -07:00
sdong	d01bbb53ae	Fix CompactRange for universal compaction with num_levels > 1 Summary: CompactRange for universal compaction with num_levels > 1 seems to have a bug. The unit test also has a bug so it doesn't capture the problem. Fix it. Revert the compact range to the logic equivalent to num_levels=1. Always compact all files together. It should also fix DBTest.IncreaseUniversalCompactionNumLevels. The issue was that options.write_buffer_size = 100 << 10 and options.write_buffer_size = 100 << 10 are not used in later test scenarios. So write_buffer_size of 4MB was used. The compaction trigger condition is not anymore obvious as expected. Test Plan: Run the new test and all test suites Reviewers: yhchiang, rven, kradhakrishnan, anthony, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D37551	2015-04-23 19:12:31 -07:00
Igor Canadi	e003d3864c	Abstract out SetMaxPossibleForUserKey() and SetMinPossibleForUserKey Summary: Based on feedback from D37083. Are all of these correct? In some spaces it seems like we're doing SetMaxPossibleForUserKey() although we want the smallest possible internal key for user key. Test Plan: make check Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D37341	2015-04-23 18:08:37 -07:00
sdong	debaf85ef5	Bug of trivial move of dynamic level Summary: D36669 introduces a bug that trivial moved data is not going to specific level but the next level, which will incorrectly be level 1 for level 0 compaciton if base level is not level 1. Fixing it by appreciating the output level Test Plan: Run all tests Reviewers: MarkCallaghan, rven, yhchiang, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D37119	2015-04-14 21:42:08 -07:00
Igor Canadi	47b8743984	Make Compaction class easier to use Summary: The goal of this diff is to make Compaction class easier to use. This should also make new compaction algorithms easier to write (like CompactFiles from @yhchiang and dynamic leveled and multi-leveled universal from @sdong). Here are couple of things demonstrating that Compaction class is hard to use: 1. we have two constructors of Compaction class 2. there's this thing called grandparents_, but it appears to only be setup for leveled compaction and not compactfiles 3. it's easy to introduce a subtle and dangerous bug like this: D36225 4. SetupBottomMostLevel() is hard to understand and it shouldn't be. See this comment: `afbafeaeae/db/compaction.cc (L236-L241)`. It also made it harder for @yhchiang to write CompactFiles, as evidenced by this: `afbafeaeae/db/compaction_picker.cc (L204-L210)` The problem is that we create Compaction object, which holds a lot of state, and then pass it around to some functions. After those functions are done mutating, then we call couple of functions on Compaction object, like SetupBottommostLevel() and MarkFilesBeingCompacted(). It is very hard to see what's happening with all that Compaction's state while it's travelling across different functions. If you're writing a new PickCompaction() function you need to try really hard to understand what are all the functions you need to run on Compaction object and what state you need to setup. My proposed solution is to make important parts of Compaction immutable after construction. PickCompaction() should calculate compaction inputs and then pass them onto Compaction object once they are finalized. That makes it easy to create a new compaction -- just provide all the parameters to the constructor and you're done. No need to call confusing functions after you created your object. This diff doesn't fully achieve that goal, but it comes pretty close. Here are some of the changes: * have one Compaction constructor instead of two. * inputs_ is constant after construction * MarkFilesBeingCompacted() is now private to Compaction class and automatically called on construction/destruction. * SetupBottommostLevel() is gone. Compaction figures it out on its own based on the input. * CompactionPicker's functions are not passing around Compaction object anymore. They are only passing around the state that they need. Test Plan: make check make asan_check make valgrind_check Reviewers: rven, anthony, sdong, yhchiang Reviewed By: yhchiang Subscribers: sdong, yhchiang, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D36687	2015-04-10 15:01:54 -07:00
Yueh-Hsuan Chiang	9741dec0e5	Fix a compile error in ROCKSDB_LITE in db/db_impl.cc Summary: Fix a compile error in ROCKSDB_LITE in db/db_impl.cc related to internal_stats. Test Plan: make OPT=-DROCKSDB_LITE shared_lib Reviewers: sdong, igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D36819	2015-04-09 17:07:29 -07:00
sdong	b1bbdd7919	Create EnvOptions using sanitized DB Options Summary: Now EnvOptions uses unsanitized DB options. bytes_per_sync is tuned off when rate_limiter is used, but this change doesn't take effort. Test Plan: See different I/O pattern in db_bench running fillseq. Reviewers: yhchiang, kradhakrishnan, rven, anthony, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D36723	2015-04-08 14:40:42 -07:00
Igor Canadi	5e067a7b19	Clean up compression logging Summary: Now we add warnings when user configures compression and the compression is not supported. Test Plan: Configured compression to non-supported values. Observed messages in my log: 2015/03/26-12:17:57.586341 7ffb8a496840 [WARN] Compression type chosen for level 2 is not supported: LZ4. RocksDB will not compress data on level 2. 2015/03/26-12:19:10.768045 7f36f15c5840 [WARN] Compression type chosen is not supported: LZ4. RocksDB will not compress data. Reviewers: rven, sdong, yhchiang Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35979	2015-04-06 12:50:44 -07:00
sdong	953a885ebf	A new call back to TablePropertiesCollector to allow users know the entry is add, delete or merge Summary: Currently users have no idea a key is add, delete or merge from TablePropertiesCollector call back. Add a new function to add it. Also refactor the codes so that (1) make table property collector and internal table property collector two separate data structures with the later one now exposed (2) table builders only receive internal table properties Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector. Reviewers: yhchiang, igor.sugak, rven, igor Reviewed By: rven, igor Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D35373	2015-04-06 10:27:21 -07:00
Venkatesh Radhakrishnan	afbafeaeae	Disallow trivial move if compression level is different Summary: Check compression level of start_level with output_compression before allowing trivial move Test Plan: New DBTest CompressLevelCompactionThirdPath added Reviewers: igor, yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D36213	2015-04-02 11:06:30 -07:00
sdong	b23bbaa82a	Universal Compactions with Small Files Summary: With this change, we use L1 and up to store compaction outputs in universal compaction. The compaction pick logic stays the same. Outputs are stored in the largest "level" as possible. If options.num_levels=1, it behaves all the same as now. Test Plan: 1) convert most of existing unit tests for universal comapaction to include the option of one level and multiple levels. 2) add a unit test to cover parallel compaction in universal compaction and run it in one level and multiple levels 3) add unit test to migrate from multiple level setting back to one level setting 4) add a unit test to insert keys to trigger multiple rounds of compactions and verify results. Reviewers: rven, kradhakrishnan, yhchiang, igor Reviewed By: igor Subscribers: meyering, leveldb, MarkCallaghan, dhruba Differential Revision: https://reviews.facebook.net/D34539	2015-03-30 15:12:02 -07:00
Igor Canadi	fd3dbef22b	Clean up old log files in background threads Summary: Cleaning up log files can do heavy IO, since we call ftruncate() in the destructor. We don't want to call ftruncate() in user threads. This diff moves cleaning to background threads (flush and compaction) Test Plan: make check, will also run valgrind Reviewers: yhchiang, rven, MarkCallaghan, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D36177	2015-03-30 15:04:10 -04:00
Herman Lee	e018892bb6	Formalize the DB properties string definitions. Summary: Assign the string properties to const string variables under the DB::Properties namespace. This helps catch typos during compilation and also consolidates the property definition in one place. Test Plan: Run rocksdb unit tests Reviewers: sdong, yoshinorim, igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D35991	2015-03-27 14:50:20 -07:00
Igor Canadi	030859eb5d	Dump compression info on startup Summary: It's useful to know if we have compression support or no Test Plan: Observed this in my LOG: 2015/03/26-10:34:35.460681 7f5b322b7840 Snappy supported 2015/03/26-10:34:35.460682 7f5b322b7840 Zlib supported 2015/03/26-10:34:35.460686 7f5b322b7840 Bzip supported 2015/03/26-10:34:35.460687 7f5b322b7840 LZ4 NOT supported Reviewers: sdong, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35955	2015-03-26 11:22:20 -07:00
Yueh-Hsuan Chiang	a057bb2a8e	Improve ThreadStatusSingleCompaction Summary: Improve ThreadStatusSingleCompaction in two ways: 1. Use SYNC_POINT to ensure compaction won't happen before the test finishes its "Put Phase" instead of using sleep. 2. In Put Phase, it continues until we have sufficient number of L0 files. Note that during the put phase, there won't be any compaction that consumes L0 files because of item 1. Test Plan: ./db_test --gtest_filter="ThreadStatusSingleCompaction" Reviewers: sdong, igor, rven Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35727	2015-03-23 15:30:45 -07:00
Igor Canadi	b088c83e6e	Don't delete files when column family is dropped Summary: To understand the bug read t5943287 and check out the new test in column_family_test (ReadDroppedColumnFamily), iter 0. RocksDB contract allowes you to read a drop column family as long as there is a live reference. However, since our iteration ignores dropped column families, AddLiveFiles() didn't mark files of a dropped column families as live. So we deleted them. In this patch I no longer ignore dropped column families in the iteration. I think this behavior was confusing and it also led to this bug. Now if an iterator client wants to ignore dropped column families, he needs to do it explicitly. Test Plan: Added a new unit test that is failing on master. Unit test succeeds now. Reviewers: sdong, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D32535	2015-03-19 17:04:29 -07:00
Igor Canadi	c88ff4ca76	Deprecate removeScanCountLimit in NewLRUCache Summary: It is no longer used by the implementation, so we should also remove it from the public API. Test Plan: make check Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D34971	2015-03-17 15:04:37 -07:00
Venkatesh Radhakrishnan	98c37fda5d	Remove unused parameter in CancelAllBackgroundWork Summary: Some suggestions for cleanup from Igor. Test Plan: Regression tests. Reviewers: igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D35169	2015-03-16 21:07:54 -07:00
Venkatesh Radhakrishnan	b2b3086524	Speed up rocksDB close call. Summary: On RocksDB, when there are multiple instances doing flushes/compactions in the background, the close call takes a long time because the flushes/compactions need to complete before the database can shut down. If another instance is using the background threads and the compaction for this instance is in the queue since it has been scheduled, we still cannot shutdown. We now remove the scheduled background tasks which have not yet started running, so that shutdown is speeded up. Test Plan: DB Test added. Reviewers: yhchiang, igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D33741	2015-03-16 18:49:14 -07:00
Igor Canadi	52d8347a91	EventLogger Summary: Here's my proposal for making our LOGs easier to read by machines. The idea is to dump all events as JSON objects. JSON is easy to read by humans, but more importantly, it's easy to read by machines. That way, we can parse this, load into SQLite/mongo and then query or visualize. I started with table_create and table_delete events, but if everybody agrees, I'll continue by adding more events (flush/compaction/etc etc) Test Plan: Ran db_bench. Observed: 2015/01/15-14:13:25.788019 1105ef000 EVENT_LOG_v1 {"time_micros": 1421360005788015, "event": "table_file_creation", "file_number": 12, "file_size": 1909699} 2015/01/15-14:13:25.956500 110740000 EVENT_LOG_v1 {"time_micros": 1421360005956498, "event": "table_file_deletion", "file_number": 12} Reviewers: yhchiang, rven, dhruba, MarkCallaghan, lgalanis, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D31647	2015-03-13 10:15:54 -07:00
Yueh-Hsuan Chiang	2b785d76b8	Fixed a bug where CompactFiles won't delete obsolete files until flush. Summary: Fixed a bug where CompactFiles won't delete obsolete files until flush. Test Plan: ./compact_files_test export ROCKSDB_TESTS=CompactFiles ./db_test Reviewers: rven, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D34671	2015-03-11 13:06:59 -07:00
Venkatesh Radhakrishnan	284be570c8	Provide a mechanism to inform Rocksdb that it is shutting down Summary: Provide an API which enables users to infor Rocksdb that it is shutting down. Test Plan: db_test Reviewers: sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D34617	2015-03-11 10:31:02 -07:00
Igor Canadi	485ac0dbd0	Add rate_limiter to string options Summary: I want to be able to set this through mongo config. Test Plan: added unit test Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D34599	2015-03-06 14:21:15 -08:00
Yueh-Hsuan Chiang	694988b627	Fix a bug in stall time counter. Improve its output format. Summary: Fix a bug in stall time counter. Improve its output format. Test Plan: export ROCKSDB_TESTS=Timeout ./db_test ./db_bench --benchmarks=fillrandom --stats_interval=10000 --statistics=true --stats_per_interval=1 --num=1000000 --threads=4 --level0_stop_writes_trigger=3 --level0_slowdown_writes_trigger=2 sample output: Uptime(secs): 35.8 total, 0.0 interval Cumulative writes: 359590 writes, 359589 keys, 183047 batches, 2.0 writes per batch, 0.04 GB user ingest, stall seconds: 1786.008 ms Cumulative WAL: 359591 writes, 183046 syncs, 1.96 writes per sync, 0.04 GB written Interval writes: 253 writes, 253 keys, 128 batches, 2.0 writes per batch, 0.0 MB user ingest, stall time: 0 us Interval WAL: 253 writes, 128 syncs, 1.96 writes per sync, 0.00 MB written Reviewers: MarkCallaghan, igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D34275	2015-03-03 12:48:12 -08:00
Igor Canadi	db03739340	options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically. Summary: When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases. In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier. Test Plan: New unit tests and pass tests suites including valgrind. Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo Reviewed By: ikabiljo Subscribers: yoshinorim, ikabiljo, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D31437	2015-03-02 22:40:41 -08:00
Mark Callaghan	c4bd03a97e	Fix typo in log message Summary: fix typo Task ID: # Blame Rev: Test Plan: Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D34251	2015-03-02 09:35:50 -08:00
Igor Sugak	62247ffa3b	rocksdb: Add missing override Summary: When using latest clang (3.6 or 3.7/trunck) rocksdb is failing with many errors. Almost all of them are missing override errors. This diff adds missing override keyword. No manual changes. Prerequisites: bear and clang 3.5 build with extra tools ```lang=bash % USE_CLANG=1 bear make all # generate a compilation database http://clang.llvm.org/docs/JSONCompilationDatabase.html % clang-modernize -p . -include . -add-override % make format ``` Test Plan: Make sure all tests are passing. ```lang=bash % #Use default fb code clang. % make check ``` Verify less error and no missing override errors. ```lang=bash % # Have trunk clang present in path. % ROCKSDB_NO_FBCODE=1 CC=clang CXX=clang++ make ``` Reviewers: igor, kradhakrishnan, rven, meyering, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D34077	2015-02-26 11:28:41 -08:00
Jinfu Leng	96d989f70d	catch config errors with L0 file count triggers Test Plan: Run "make clean && make all check" Reviewers: rven, igor, yhchiang, kradhakrishnan, MarkCallaghan, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D33627	2015-02-23 16:08:27 -08:00
Jim Meyering	a42324e370	build: do not relink every single binary just for a timestamp Summary: Prior to this change, "make check" would always waste a lot of time relinking 60+ binaries. With this change, it does that only when the generated file, util/build_version.cc, changes, and that happens only when the date changes or when the current git SHA changes. This change makes some other improvements: before, there was no rule to build a deleted util/build_version.cc. If it was somehow removed, any attempt to link a program would fail. There is no longer any need for the separate file, build_tools/build_detect_version. Its functionality is now in the Makefile. * Makefile (DEPFILES): Don't filter-out util/build_version.cc. No need, and besides, removing that dependency was wrong. (date, git_sha, gen_build_version): New helper variables. (util/build_version.cc): New rule, to create this file and update it only if it would contain new information. * build_tools/build_detect_platform: Remove file. * db/db_impl.cc: Now, print only date (not the time). * util/build_version.h (rocksdb_build_compile_time): Remove declaration. No longer used. Test Plan: - Run "make check" twice, and note that the second time no linking is performed. - Remove util/build_version.cc and ensure that any "make" command regenerates it before doing anything else. - Run this: strings librocksdb.a\|grep _build_. That prints output including the following: rocksdb_build_git_date:2015-02-19 rocksdb_build_git_sha:2.8.fb-1792-g3cb6cc0 Reviewers: ljin, sdong, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D33591	2015-02-19 13:11:10 -08:00
Venkatesh Radhakrishnan	7d817268b9	Managed iterator Summary: This is a diff for managed iterator. A managed iterator is a wrapper around an iterator which saves the options for that iterator as well as the current key/value so that the underlying iterator and its associated memory can be released when it is aged out automatically or on the request of the user. Will provide the automatic release as a follow-up diff. Test Plan: Managed* tests in db_test and XF tests for managed iterator Reviewers: igor, yhchiang, anthony, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D31401	2015-02-18 11:49:31 -08:00
Yueh-Hsuan Chiang	e60bc99fe0	Allow GetThreadList to reflect flush activity. Summary: Allow GetThreadList to reflect flush activity. Test Plan: Developed ThreadStatusFlush test and updated ThreadStatusMultiCompaction test. ./db_test ./thread_list_test Reviewers: sdong, rven, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D32871	2015-02-17 10:13:52 -08:00
Igor Canadi	e7ea51a8e7	Introduce job_id for flush and compaction Summary: It would be good to assing background job their IDs. Two benefits: 1) makes LOGs more readable 2) I might use it in my EventLogger, which will try to make our LOG easier to read/query/visualize Test Plan: ran rocksdb, read the LOG Reviewers: sdong, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D31617	2015-02-12 09:54:48 -08:00
Igor Canadi	863009b5a5	Fix deleting obsolete files #2 Summary: For description of the bug, see comment in db_test. The fix is pretty straight forward. Test Plan: added unit test. eventually we need better testing of FOF/POF process. Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D33081	2015-02-09 17:38:32 -08:00
sdong	91ac3b2067	Print DB pointer when opening a DB Summary: Having a pointer for DB will be helpful to debug when GDB or working on a dump. If the client process doesn't have any thread actively working on RocksDB, it can be hard to find out. Test Plan: make all check Reviewers: rven, yhchiang, igor Reviewed By: igor Subscribers: yoshinorim, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D33159	2015-02-09 12:52:58 -08:00
Igor Canadi	2a979822b6	Fix deleting obsolete files Summary: This diff basically reverts D30249 and also adds a unit test that was failing before this patch. I have no idea how I didn't catch this terrible bug when writing a diff, sorry about that :( I think we should redesign our system of keeping track of and deleting files. This is already a second bug in this critical piece of code. I'll think of few ideas. BTW this diff is also a regression when running lots of column families. I plan to revisit this separately. Test Plan: added a unit test Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D33045	2015-02-06 08:44:30 -08:00
Igor Canadi	6f10130354	Fix DestroyDB Summary: When DestroyDB() finds a wal file in the DB directory, it assumes it is actually in WAL directory. This can lead to confusion, since it reports IO error when it tries to delete wal file from DB directory. For example: https://ci-builds.fb.com/job/rocksdb_clang_build/296/console This change will fix our unit tests. Test Plan: unit tests work Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D32907	2015-02-05 20:09:42 -08:00
Yueh-Hsuan Chiang	181191a1e4	Add a counter for collecting the wait time on db mutex. Summary: Add a counter for collecting the wait time on db mutex. Also add MutexWrapper and CondVarWrapper for measuring wait time. Test Plan: ./db_test export ROCKSDB_TESTS=MutexWaitStats ./db_test verify stats output using db_bench make clean make release ./db_bench --statistics=1 --benchmarks=fillseq,readwhilewriting --num=10000 --threads=10 Sample output: rocksdb.db.mutex.wait.micros COUNT : 7546866 Reviewers: MarkCallaghan, rven, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D32787	2015-02-04 21:39:45 -08:00
Ori Bernstein	f9758e0129	Add compaction listener. Summary: This adds a listener for compactions, and gives some useful statistics on each compaction pass. Test Plan: Unit tests. Reviewers: sdong, igor, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D31641	2015-01-27 14:44:02 -08:00
sdong	be8f0b12ed	Rename DBImpl::log_dir_unsynced_ to log_dir_synced_ Summary: log_dir_unsynced_ is a confusing name. Rename it to log_dir_synced_ and flip the value. Test Plan: Run ./fault_injection_test Reviewers: rven, yhchiang, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D32235	2015-01-26 16:01:36 -08:00
sdong	d888c95748	Sync WAL Directory and DB Path if different from DB directory Summary: 1. If WAL directory is different from db directory. Sync the directory after creating a log file under it. 2. After creating an SST file, sync its parent directory instead of DB directory. 3. change the check of kResetDeleteUnsyncedFiles in fault_injection_test. Since we changed the behavior to sync log files' parent directory after first WAL sync, instead of creating, kResetDeleteUnsyncedFiles will not guarantee to show post sync updates. Test Plan: make all check Reviewers: yhchiang, rven, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D32067	2015-01-26 14:17:45 -08:00
Igor Canadi	42189612c3	Fix data race #2 Summary: We should not be calling InternalStats methods outside of the mutex. Test Plan: COMPILE_WITH_TSAN=1 m db_test && ROCKSDB_TESTS=CompactionTrigger ./db_test failing before the diff, works now Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D32127	2015-01-23 18:04:39 -08:00
sdong	4e48753b73	Sync manifest file when initializing it Summary: Now we don't sync manifest file when initializing it, so DB cannot be safely reopened before the first mem table flush. Fix it by syncing it. This fixes fault_injection_test. Test Plan: make all check Reviewers: rven, yhchiang, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D32001	2015-01-22 14:32:03 -08:00
sdong	206237d121	DBImpl::CheckConsistency() shouldn't create path name with double "/" Summary: GetLiveFilesMetaData() already adds a leading "/" in file name. No need to add one extra "/" in DBImpl::CheckConsistency() Test Plan: make all check Reviewers: yhchiang, rven, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D31779	2015-01-21 16:36:13 -08:00
Yueh-Hsuan Chiang	b229f970df	Remove Compaction::ReleaseInputs(). Summary: This patch remove the unnecessary Compaction::ReleaseInputs(). Compaction::ReleaseInputs() tries to unref its input_version and column_family. However, such unref is always done in ~Compaction(), and all current ReleaseInputs() calls are right before the destructor. Test Plan: ./db_test Reviewers: igor Reviewed By: igor Subscribers: igor, rven, dhruba, sdong Differential Revision: https://reviews.facebook.net/D31605	2015-01-15 12:44:19 -08:00
Yueh-Hsuan Chiang	c91cdd59c1	Allow GetThreadList() to indicate a thread is doing Compaction. Summary: Allow GetThreadList() to indicate a thread is doing Compaction. Test Plan: export ROCKSDB_TESTS=ThreadStatus ./db_test Reviewers: ljin, igor, sdong Reviewed By: sdong Subscribers: leveldb, dhruba, jonahcohen, rven Differential Revision: https://reviews.facebook.net/D30105	2015-01-13 00:04:08 -08:00
sdong	9132e52ea4	DB Stats Dump to print total stall time Summary: Add printing of stall time in DB Stats: Sample outputs: DB Stats Uptime(secs): 53.2 total, 1.7 interval Cumulative writes: 625940 writes, 625939 keys, 625940 batches, 1.0 writes per batch, 0.49 GB user ingest, stall micros: 50691070 Cumulative WAL: 625940 writes, 625939 syncs, 1.00 writes per sync, 0.49 GB written Interval writes: 10859 writes, 10859 keys, 10859 batches, 1.0 writes per batch, 8.7 MB user ingest, stall micros: 1692319 Interval WAL: 10859 writes, 10859 syncs, 1.00 writes per sync, 0.01 MB written Test Plan: make all check verify printing using db_bench Reviewers: igor, yhchiang, rven, MarkCallaghan Reviewed By: MarkCallaghan Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D31239	2015-01-09 11:44:19 -08:00
Igor Canadi	7731d51c82	Simplify column family concurrency Summary: This patch changes concurrency guarantees around ColumnFamilySet::column_families_ and ColumnFamilySet::column_families_data_. Before: * When mutating: lock DB mutex and spin lock * When reading: lock DB mutex OR spin lock After: * When mutating: lock DB mutex and be in write thread * When reading: lock DB mutex or be in write thread That way, we eliminate the spin lock that protects these hash maps and simplify concurrency. That means we don't need to lock the spin lock during writing, since writing is mutually exclusive with column family create/drop (the only operations that mutate those hash maps). With these new restrictions, I also needed to move column family create to the write thread (column family drop was already in the write thread). Even though we don't need to lock the spin lock during write, impact on performance should be minimal -- the spin lock is almost never busy, so locking it is almost free. This addresses task t5116919. Test Plan: make check Stress test with lots and lots of column family drop and create: time ./db_stress --threads=30 --ops_per_thread=5000000 --max_key=5000 --column_families=200 --clear_column_family_one_in=100000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress/ Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30651	2015-01-06 12:44:21 -08:00
Igor Canadi	07aa4e0e35	Fix compaction summary log for trivial move Summary: When trivial move commit is done, we log the summary of the input version instead of current. This is inconsistent with other log messages and confusing. Test Plan: compiles Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30939	2015-01-05 17:32:49 -08:00
Igor Canadi	62ad0a9b19	Deprecating skip_log_error_on_recovery Summary: Since https://reviews.facebook.net/D16119, we ignore partial tailing writes. Because of that, we no longer need skip_log_error_on_recovery. The documentation says "Skip log corruption error on recovery (If client is ok with losing most recent changes)", while the option actually ignores any corruption of the WAL (not only just the most recent changes). This is very dangerous and can lead to DB inconsistencies. This was originally set up to ignore partial tailing writes, which we now do automatically (after D16119). I have digged up old task t2416297 which confirms my findings. Test Plan: There was actually no tests that verified correct behavior of skip_log_error_on_recovery. Reviewers: yhchiang, rven, dhruba, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30603	2015-01-05 13:35:56 -08:00
Igor Canadi	fa0b126c0c	Fix corruption_test -- if status is not OK, return status -- during recovery	2015-01-05 10:49:41 -08:00
Igor Canadi	d7b4bb62a7	Fail DB::Open() on WAL corruption Summary: This is a serious bug. If paranod_check == true and WAL is corrupted, we don't fail DB::Open(). I tried going into history and it seems we've been doing this for a long long time. I found this when investigating t5852041. Test Plan: Added unit test to verify correct behavior. Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30597	2015-01-05 10:26:34 -08:00
Yueh-Hsuan Chiang	a944afd356	Fixed a compile error in db/db_impl.cc on ROCKSDB_LITE	2014-12-23 16:19:40 -08:00
Yueh-Hsuan Chiang	45bab305f9	Move GetThreadList() feature under Env. Summary: GetThreadList() feature depends on the thread creation and destruction, which is currently handled under Env. This patch moves GetThreadList() feature under Env to better manage the dependency of GetThreadList() feature on thread creation and destruction. Renamed ThreadStatusImpl to ThreadStatusUpdater. Add ThreadStatusUtil, which is a static class contains utility functions for ThreadStatusUpdater. Test Plan: run db_test, thread_list_test and db_bench and verify the life cycle of Env and ThreadStatusUpdater is properly managed. Reviewers: igor, sdong Reviewed By: sdong Subscribers: ljin, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30057	2014-12-22 12:20:17 -08:00
Igor Canadi	4fd26f287c	Only execute flush from compaction if max_background_flushes = 0 Summary: As title. We shouldn't need to execute flush from compaction if there are dedicated threads doing flushes. Test Plan: make check Reviewers: rven, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30579	2014-12-22 12:05:14 +01:00
Igor Canadi	0acc738810	Speed up FindObsoleteFiles() Summary: There are two versions of FindObsoleteFiles(): * full scan, which is executed every 6 hours (and it's terribly slow) * no full scan, which is executed every time a background process finishes and iterator is deleted This diff is optimizing the second case (no full scan). Here's what we do before the diff: * Get the list of obsolete files (files with ref==0). Some files in obsolete_files set might actually be live. * Get the list of live files to avoid deleting files that are live. * Delete files that are in obsolete_files and not in live_files. After this diff: * The only files with ref==0 that are still live are files that have been part of move compaction. Don't include moved files in obsolete_files. * Get the list of obsolete files (which exclude moved files). * No need to get the list of live files, since all files in obsolete_files need to be deleted. I'll post the benchmark results, but you can get the feel of it here: https://reviews.facebook.net/D30123 This depends on D30123. P.S. We should do full scan only in failure scenarios, not every 6 hours. I'll do this in a follow-up diff. Test Plan: One new unit test. Made sure that unit test fails if we don't have a `if (!f->moved)` safeguard in ~Version. make check Big number of compactions and flushes: ./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000 Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30249	2014-12-22 12:04:45 +01:00
Igor Canadi	f8999fcf31	Fix a SIGSEGV in BackgroundFlush Summary: This one wasn't easy to find :) What happens is we go through all cfds on flush_queue_ and find no cfds to flush, but the cfd is set to the last CF we looped through and following code assumes we want it flushed. BTW @sdong do you think we should also make BackgroundFlush() only check a single cfd for flushing instead of doing this `while (!flush_queue_.empty())`? Test Plan: regression test no longer fails Reviewers: sdong, rven, yhchiang Reviewed By: yhchiang Subscribers: sdong, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30591	2014-12-21 00:23:28 -08:00
Igor Canadi	fdb6be4e24	Rewritten system for scheduling background work Summary: When scaling to higher number of column families, the worst bottleneck was MaybeScheduleFlushOrCompaction(), which did a for loop over all column families while holding a mutex. This patch addresses the issue. The approach is similar to our earlier efforts: instead of a pull-model, where we do something for every column family, we can do a push-based model -- when we detect that column family is ready to be flushed/compacted, we add it to the flush_queue_/compaction_queue_. That way we don't need to loop over every column family in MaybeScheduleFlushOrCompaction. Here are the performance results: Command: ./db_bench --write_buffer_size=268435456 --db_write_buffer_size=268435456 --db=/fast-rocksdb-tmp/rocks_lots_of_cf --use_existing_db=0 --open_files=55000 --statistics=1 --histogram=1 --disable_data_sync=1 --max_write_buffer_number=2 --sync=0 --benchmarks=fillrandom --threads=16 --num_column_families=5000 --disable_wal=1 --max_background_flushes=16 --max_background_compactions=16 --level0_file_num_compaction_trigger=2 --level0_slowdown_writes_trigger=2 --level0_stop_writes_trigger=3 --hard_rate_limit=1 --num=33333333 --writes=33333333 Before the patch: fillrandom : 26.950 micros/op 37105 ops/sec; 4.1 MB/s After the patch: fillrandom : 17.404 micros/op 57456 ops/sec; 6.4 MB/s Next bottleneck is VersionSet::AddLiveFiles, which is painfully slow when we have a lot of files. This is coming in the next patch, but when I removed that code, here's what I got: fillrandom : 7.590 micros/op 131758 ops/sec; 14.6 MB/s Test Plan: make check two stress tests: Big number of compactions and flushes: ./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000 max_background_flushes=0, to verify that this case also works correctly ./db_stress --threads=30 --ops_per_thread=2000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=3 --max_background_compactions=3 --max_background_flushes=0 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000 Reviewers: ljin, rven, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30123	2014-12-19 20:38:12 +01:00
Yueh-Hsuan Chiang	25f70a5abb	Avoid unnecessary unlock and lock mutex when notifying events. Summary: Avoid unnecessary unlock and lock mutex when notifying events. Test Plan: ./listener_test Reviewers: igor Reviewed By: igor Subscribers: sdong, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30267	2014-12-16 17:10:23 -08:00
Venkatesh Radhakrishnan	7661e5a76e	Move the file copy out of the mutex. Summary: We now release the mutex before copying the files in the case of the trivial move. This path does not use the compaction job. Test Plan: DBTest.LevelCompactionThirdPath Reviewers: yhchiang, igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30381	2014-12-16 16:57:22 -08:00
Venkatesh Radhakrishnan	153f4f0719	RocksDB: Allow Level-Style Compaction to Place Files in Different Paths Summary: Allow Level-style compaction to place files in different paths This diff provides the code for task 4854591. We now support level-compaction to place files in different paths by specifying them in db_paths along with the minimum level for files to store in that path. Test Plan: ManualLevelCompactionOutputPathId in db_test.cc Reviewers: yhchiang, MarkCallaghan, dhruba, yoshinorim, sdong Reviewed By: sdong Subscribers: yoshinorim, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D29799	2014-12-15 21:48:16 -08:00
sdong	d7a486668c	Improve scalability of DB::GetSnapshot() Summary: Now DB::GetSnapshot() doesn't scale to more column families, as it needs to go through all the column families to find whether snapshot is supported. This patch optimizes it. Test Plan: Add unit tests to cover negative cases. make all check Reviewers: yhchiang, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30093	2014-12-11 13:27:57 -08:00
sdong	046ba7d47c	Fix calculation of max_total_wal_size in db_options_.max_total_wal_size == 0 case Summary: This is a regression bug introduced by https://reviews.facebook.net/D24729 . max_total_wal_size would be off the target it should be more and more in the case that the a user holds the current super version after flush or compaction. This patch fixes it Test Plan: make all check Reviewers: yhchiang, rven, igor Reviewed By: igor Subscribers: ljin, yoshinorim, MarkCallaghan, hermanlee4, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D29961	2014-12-08 15:26:35 -08:00
sdong	1f04066cab	Add DBProperty to return number of snapshots and time for oldest snapshot Summary: Add a counter in SnapshotList to show number of snapshots. Also a unix timestamp in every snapshot. Add two DB Properties to return number of snapshots and timestamp of the oldest one. Test Plan: Add unit test checking Reviewers: yhchiang, rven, igor Reviewed By: igor Subscribers: leveldb, dhruba, MarkCallaghan Differential Revision: https://reviews.facebook.net/D29919	2014-12-05 17:07:49 -08:00
Mark Callaghan	32a0a03844	Add Moved(GB) to Compaction IO stats Summary: Adds counter for bytes moved (files pushed down a level rather than compacted) to compaction IO stats as Moved(GB). From the output removed these infrequently used columns: RW-Amp, Rn(cnt), Rnp1(cnt), Wnp1(cnt), Wnew(cnt). Example old output: Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) RecordIn RecordDrop ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 0/0 0 0.0 0.0 0.0 0.0 2130.8 2130.8 0.0 0.0 0.0 109.1 0 0 0 0 20002 25068 0.798 28.75 182059 0.16 0 0 L1 142/0 509 1.0 4618.5 2036.5 2582.0 4602.1 2020.2 4.5 2.3 88.5 88.1 24220 701246 1215528 514282 53466 4229 12.643 0.00 0 0.002032745988 300688729 Example new output: Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms) RecordIn RecordDrop -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- L0 7/0 13 1.8 0.0 0.0 0.0 0.6 0.6 0.0 0.0 0.0 14.7 44 353 0.124 0.03 626 0.05 0 0 L1 9/0 16 1.6 0.0 0.0 0.0 0.0 0.0 0.6 0.0 0.0 0.0 0 0 0.000 0.00 0 0.00 0 0 Task ID: # Blame Rev: Test Plan: make check, run db_bench --fillseq --stats_per_interval --stats_interval and look at output Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D29787	2014-12-03 18:28:39 -08:00
Jonah Cohen	a14b7873ee	Enforce write buffer memory limit across column families Summary: Introduces a new class for managing write buffer memory across column families. We supplement ColumnFamilyOptions::write_buffer_size with ColumnFamilyOptions::write_buffer, a shared pointer to a WriteBuffer instance that enforces memory limits before flushing out to disk. Test Plan: Added SharedWriteBuffer unit test to db_test.cc Reviewers: sdong, rven, ljin, igor Reviewed By: igor Subscribers: tnovak, yhchiang, dhruba, xjin, MarkCallaghan, yoshinorim Differential Revision: https://reviews.facebook.net/D22581	2014-12-02 12:09:20 -08:00
Yueh-Hsuan Chiang	13de000f07	Add rocksdb::ToString() to address cases where std::to_string is not available. Summary: In some environment such as android, the c++ library does not have std::to_string. This path adds rocksdb::ToString(), which wraps std::to_string when std::to_string is not available, and implements std::to_string in the other case. Test Plan: make dbg -j32 ./db_test make clean make dbg OPT=-DOS_ANDROID -j32 ./db_test Reviewers: ljin, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D29181	2014-11-24 20:44:49 -08:00
Yueh-Hsuan Chiang	569853ed10	Fix leak when create_missing_column_families=true on ThreadStatus Summary: An entry of ConstantColumnFamilyInfo is created when: 1. DB::Open 2. CreateColumnFamily. However, there are cases that DB::Open could also call CreateColumnFamily when create_missing_column_families=true. As a result, it will create duplicate ConstantColumnFamilyInfo and one of them would be leaked. Test Plan: ./deletefile_test Reviewers: igor, sdong, ljin Reviewed By: ljin Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D29307	2014-11-22 00:04:41 -08:00
Yueh-Hsuan Chiang	4b63fcbff3	Add enable_thread_tracking to DBOptions Summary: Add enable_thread_tracking to DBOptions to allow tracking thread status related to the DB. Default is off. Test Plan: export ROCKSDB_TESTS=ThreadList ./db_test Reviewers: ljin, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D29289	2014-11-20 21:13:18 -08:00
Yueh-Hsuan Chiang	d0c5f28a5c	Introduce GetThreadList API Summary: Add GetThreadList API, which allows developer to track the status of each process. Currently, calling GetThreadList will only get the list of background threads in RocksDB with their thread-id and thread-type (priority) set. Will add more support on this in the later diffs. ThreadStatus currently has the following properties: // An unique ID for the thread. const uint64_t thread_id; // The type of the thread, it could be ROCKSDB_HIGH_PRIORITY, // ROCKSDB_LOW_PRIORITY, and USER_THREAD const ThreadType thread_type; // The name of the DB instance where the thread is currently // involved with. It would be set to empty string if the thread // does not involve in any DB operation. const std::string db_name; // The name of the column family where the thread is currently // It would be set to empty string if the thread does not involve // in any column family. const std::string cf_name; // The event that the current thread is involved. // It would be set to empty string if the information about event // is not currently available. Test Plan: ./thread_list_test export ROCKSDB_TESTS=GetThreadList ./db_test Reviewers: rven, igor, sdong, ljin Reviewed By: ljin Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D25047	2014-11-20 10:49:32 -08:00
Lei Jin	1e4a45aac8	remove cfd->options() in DBImpl::NotifyOnFlushCompleted Summary: We should not reference cfd->options() directly! Test Plan: make release Reviewers: sdong, rven, igor, yhchiang Reviewed By: igor, yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D29061	2014-11-18 10:19:48 -08:00
Igor Canadi	84af2ff8d3	Clean job context in DeleteFile	2014-11-14 16:20:24 -08:00
Igor Canadi	5c04acda08	Explicitly clean JobContext Summary: This way we can gurantee that old MemTables get destructed before DBImpl gets destructed, which might be useful if we want to make them depend on state from DBImpl. Test Plan: make check with asserts in JobContext's destructor Reviewers: ljin, sdong, yhchiang, rven, jonahcohen Reviewed By: jonahcohen Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28959	2014-11-14 15:43:10 -08:00
Yueh-Hsuan Chiang	4161de92a3	Fix SIGSEGV Summary: As a short-term fix, let's go back to previous way of calculating NeedsCompaction(). SIGSEGV happens because NeedsCompaction() can happen before super_version (and thus MutableCFOptions) is initialized. Test Plan: make check Reviewers: ljin, sdong, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28875	2014-11-13 15:21:04 -08:00
Igor Canadi	772bc97f13	No CompactFiles in ROCKSDB_LITE Summary: It adds lots of code. Test Plan: compile for iOS, compile for mac. works. Reviewers: rven, sdong, ljin, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28857	2014-11-13 16:45:33 -05:00
Yueh-Hsuan Chiang	1d1a64f58a	Move NeedsCompaction() from VersionStorageInfo to CompactionPicker Summary: Move NeedsCompaction() from VersionStorageInfo to CompactionPicker to allow different compaction strategy to have their own way to determine whether doing compaction is necessary. When compaction style is set to kCompactionStyleNone, then NeedsCompaction() will always return false. Test Plan: export ROCKSDB_TESTS=Compact ./db_test Reviewers: ljin, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28719	2014-11-13 13:41:43 -08:00
Igor Canadi	25f273027b	Fix iOS compile with -Wshorten-64-to-32 Summary: So iOS size_t is 32-bit, so we need to static_cast<size_t> any uint64_t :( Test Plan: TARGET_OS=IOS make static_lib Reviewers: dhruba, ljin, yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28743	2014-11-13 14:39:30 -05:00
Igor Canadi	767777c2bd	Turn on -Wshorten-64-to-32 and fix all the errors Summary: We need to turn on -Wshorten-64-to-32 for mobile. See D1671432 (internal phabricator) for details. This diff turns on the warning flag and fixes all the errors. There were also some interesting errors that I might call bugs, especially in plain table. Going forward, I think it makes sense to have this flag turned on and be very very careful when converting 64-bit to 32-bit variables. Test Plan: compiles Reviewers: ljin, rven, yhchiang, sdong Reviewed By: yhchiang Subscribers: bobbaldwin, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28689	2014-11-11 16:47:22 -05:00
Igor Canadi	113796c493	Fix NewFileNumber() Summary: I mistakenly changed the behavior to ++next_file_number_ instead of next_file_number_++, as it should have been: `344edbb044/db/version_set.h (L539)` Test Plan: none. not sure if this would break anything. It's just different behavior, so I'd rather not risk Reviewers: ljin, rven, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28557	2014-11-11 06:58:47 -08:00
Igor Canadi	4a3bd2bad2	Optimize usage of Status in CompactionJob Summary: Based on @ljin feedback Test Plan: compiles Reviewers: ljin, yhchiang, sdong Reviewed By: sdong Subscribers: ljin, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28515	2014-11-10 11:57:58 -08:00
Igor Canadi	e3d3567b5b	Get rid of mutex in CompactionJob's state Summary: Based on @sdong's feedback in the diff, we shouldn't keep db_mutex in CompactionJob's state. This diff removes db_mutex from CompactionJob state, by making next_file_number_ atomic. That way we only need to pass the lock to InstallCompactionResults() because of LogAndApply() Test Plan: make check Reviewers: ljin, yhchiang, rven, sdong Reviewed By: sdong Subscribers: sdong, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28491	2014-11-07 15:44:12 -08:00
Yueh-Hsuan Chiang	b8b3903429	Fixed compile error in db/db_impl.cc Summary: Fixed compile error in db/db_impl.cc Test Plan: make	2014-11-07 15:13:01 -08:00
Yueh-Hsuan Chiang	28c82ff1b3	CompactFiles, EventListener and GetDatabaseMetaData Summary: This diff adds three sets of APIs to RocksDB. = GetColumnFamilyMetaData = * This APIs allow users to obtain the current state of a RocksDB instance on one column family. * See GetColumnFamilyMetaData in include/rocksdb/db.h = EventListener = * A virtual class that allows users to implement a set of call-back functions which will be called when specific events of a RocksDB instance happens. * To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners = CompactFiles = * CompactFiles API inputs a set of file numbers and an output level, and RocksDB will try to compact those files into the specified level. = Example = * Example code can be found in example/compact_files_example.cc, which implements a simple external compactor using EventListener, GetColumnFamilyMetaData, and CompactFiles API. Test Plan: listener_test compactor_test example/compact_files_example export ROCKSDB_TESTS=CompactFiles db_test export ROCKSDB_TESTS=MetaData db_test Reviewers: ljin, igor, rven, sdong Reviewed By: sdong Subscribers: MarkCallaghan, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D24705	2014-11-07 14:45:18 -08:00
Igor Canadi	53af5d877d	Redesign pending_outputs_ Summary: Here's a prototype of redesigning pending_outputs_. This way, we don't have to expose pending_outputs_ to other classes (CompactionJob, FlushJob, MemtableList). DBImpl takes care of it. Still have to write some comments, but should be good enough to start the discussion. Test Plan: make check, will also run stress test Reviewers: ljin, sdong, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28353	2014-11-07 11:50:34 -08:00
Lei Jin	fd24ae9d05	SetOptions() to return status and also add it to StackableDB Summary: as title Test Plan: ./db_test Reviewers: sdong, yhchiang, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28269	2014-11-04 16:23:05 -08:00
Lei Jin	b1267750fb	fix the asan check Summary: as title Test Plan: ran it Reviewers: yhchiang, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28311	2014-11-04 15:58:14 -08:00
Yueh-Hsuan Chiang	469d474ba0	Apply InfoLogLevel to the logs in db/db_impl.cc Summary: Apply InfoLogLevel to the logs in db/db_impl.cc Test Plan: db_test db_bench Reviewers: ljin, sdong, igor Reviewed By: igor Subscribers: leveldb, MarkCallaghan, dhruba Differential Revision: https://reviews.facebook.net/D28233	2014-11-04 10:28:08 -08:00
sdong	ac6afaf9ef	Enforce naming convention of getters in version_set.h Summary: Enforce the accessier naming convention in functions in version_set.h Test Plan: make all check Reviewers: ljin, yhchiang, rven, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D28143	2014-11-04 09:59:05 -08:00
sdong	09899f0b51	DB::Open() to automatically increase thread pool size if it is smaller than max number of parallel compactions or flushes Summary: With the patch, thread pool size will be automatically increased if DB's options ask for more parallelism of compactions or flushes. Too many users have been confused by the API. Change it to make it harder for users to make mistakes Test Plan: Add two unit tests to cover the function. Reviewers: yhchiang, rven, igor, MarkCallaghan, ljin Reviewed By: ljin Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D27555	2014-11-03 17:22:34 -08:00
Igor Canadi	74eb4fbe93	CompactionJob Summary: Long awaited CompactionJob class! Move most compaction-related things from DBImpl to CompactionJob, making CompactionJob easier to test and understand. Currently this is just replicating exactly the same functionality with as little as change as possible. As future work, we should: 1. Add CompactionJob tests (I think I'll do that tomorrow) 2. Reduce CompactionJob's state that it inherits from DBImpl 3. Figure out how to do yielding to flush better. Currently I implemented a callback as we agreed yesterday, but I don't think it's a good long term solution. This reduces db_impl.cc from 5000+ LOC to 3400! Test Plan: make check, will add CompactionJob-specific tests, probably also move some tests from db_test to compaction_job_test Reviewers: rven, yhchiang, sdong, ljin Reviewed By: ljin Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D27957	2014-10-31 16:31:25 -07:00
Igor Canadi	9f7fc3ac45	Turn on -Wshadow Summary: ...and fix all the errors :) Jim suggested turning on -Wshadow because it helped him fix number of critical bugs in fbcode. I think it's a good idea to be -Wshadow clean. Test Plan: compiles Reviewers: yhchiang, rven, sdong, ljin Reviewed By: ljin Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D27711	2014-10-31 11:59:54 -07:00
sdong	4d2ba38b65	Make VersionBuilder unit testable Summary: Rename Version::Builder to VersionBuilder and expose its definition to a header. Make VerisonBuilder not reference Version or ColumnFamilyData, only working with VersionStorageInfo. Add version_builder_test which has a simple test. Test Plan: make all check Reviewers: rven, yhchiang, igor, ljin Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D27969	2014-10-31 10:44:06 -07:00
Igor Canadi	635905481d	WalManager Summary: Decoupling code that deals with archived log files outside of DBImpl. That will make this code easier to reason about and test. It will also make the code easier to improve, because an improver doesn't have to understand DBImpl code in entirety. Test Plan: added test Reviewers: ljin, yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D27873	2014-10-29 17:43:37 -07:00

... 4 5 6 7 8 ...

1163 commits