rocksdb

Commit Graph

Author	SHA1	Message	Date
Aaron Gao	972f96b3fb	direct io write support Summary: rocksdb direct io support ``` [gzh@dev11575.prn2 ~/rocksdb] ./db_bench -benchmarks=fillseq --num=1000000 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags RocksDB: version 5.0 Date: Wed Nov 23 13:17:43 2016 CPU: 40 * Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz CPUCache: 25600 KB Keys: 16 bytes each Values: 100 bytes each (50 bytes after compression) Entries: 1000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 110.6 MB (estimated) FileSize: 62.9 MB (estimated) Write rate: 0 bytes/second Compression: Snappy Memtablerep: skip_list Perf Level: 1 WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags DB path: [/tmp/rocksdbtest-112628/dbbench] fillseq : 4.393 micros/op 227639 ops/sec; 25.2 MB/s [gzh@dev11575.prn2 ~/roc Closes https://github.com/facebook/rocksdb/pull/1564 Differential Revision: D4241093 Pulled By: lightmark fbshipit-source-id: 98c29e3	2016-12-22 13:09:19 -08:00
Islam AbdelRahman	989e644ed8	Remove sst_file_manager option from LITE Summary: Remove sst_file_manager option from LITE Closes https://github.com/facebook/rocksdb/pull/1690 Differential Revision: D4341331 Pulled By: IslamAbdelRahman fbshipit-source-id: 9f9328d	2016-12-21 17:54:21 -08:00
Daniel Black	bfbcec2339	Gcc 7 error expansion to defined Summary: sorry if these gcc-7/clang-4 cleanups are getting tedious. Closes https://github.com/facebook/rocksdb/pull/1658 Differential Revision: D4318792 Pulled By: yiwu-arbug fbshipit-source-id: 8e85891	2016-12-13 18:39:14 -08:00
Islam AbdelRahman	1a146f89c7	break Flush wait for dropped CF Summary: In FlushJob we dont do the Flush if the CF is dropped https://github.com/facebook/rocksdb/blob/master/db/flush_job.cc#L184-L188 but inside WaitForFlushMemTable we keep waiting forever even if the CF is dropped. Closes https://github.com/facebook/rocksdb/pull/1664 Differential Revision: D4321032 Pulled By: IslamAbdelRahman fbshipit-source-id: 6e2b25d	2016-12-13 14:09:12 -08:00
Islam AbdelRahman	2ba59b5a1e	Disallow ingesting files into dropped CFs Summary: This PR update IngestExternalFile to return an error if we try to ingest a file into a dropped CF. Right now if IngestExternalFile want to flush a memtable, and it's ingesting a file into a dropped CF, it will wait forever since flushing is not possible for the dropped CF Closes https://github.com/facebook/rocksdb/pull/1657 Differential Revision: D4318657 Pulled By: IslamAbdelRahman fbshipit-source-id: ed6ea2b	2016-12-13 00:54:14 -08:00
Islam AbdelRahman	ed8fbdb560	Add EventListener::OnExternalFileIngested() event Summary: Add EventListener::OnExternalFileIngested() to allow user to subscribe to external file ingestion events Closes https://github.com/facebook/rocksdb/pull/1623 Differential Revision: D4285844 Pulled By: IslamAbdelRahman fbshipit-source-id: 0b95a88	2016-12-06 14:09:17 -08:00
Anton Safonov	9053fe2a5c	Made delete_obsolete_files_period_micros option dynamic Summary: Made delete_obsolete_files_period_micros option dynamic. It can be updating using DB::SetDBOptions(). Closes https://github.com/facebook/rocksdb/pull/1595 Differential Revision: D4246569 Pulled By: tonek fbshipit-source-id: d23f560	2016-12-05 14:24:16 -08:00
Maysam Yabandeh	182b940e70	Add WriteOptions.no_slowdown Summary: If the WriteOptions.no_slowdown flag is set AND we need to wait or sleep for the write request, then fail immediately with Status::Incomplete(). Closes https://github.com/facebook/rocksdb/pull/1527 Differential Revision: D4191405 Pulled By: maysamyabandeh fbshipit-source-id: 7f3ce3f	2016-11-21 18:09:13 -08:00
Andrew Kryczka	fe349db57b	Remove Arena in RangeDelAggregator Summary: The Arena construction/destruction introduced significant overhead to read-heavy workload just by creating empty vectors for its blocks, so avoid it in RangeDelAggregator. Closes https://github.com/facebook/rocksdb/pull/1547 Differential Revision: D4207781 Pulled By: ajkr fbshipit-source-id: 9d1c130	2016-11-19 14:24:12 -08:00
Andrew Kryczka	3f62215210	Lazily initialize RangeDelAggregator's map and pinning manager Summary: Since a RangeDelAggregator is created for each read request, these heap-allocating member variables were consuming significant CPU (~3% total) which slowed down request throughput. The map and pinning manager are only necessary when range deletions exist, so we can defer their initialization until the first range deletion is encountered. Currently lazy initialization is done for reads only since reads pass us a single snapshot, which is easier to store on the stack for later insertion into the map than the vector passed to us by flush or compaction. Note the Arena member variable is still expensive, I will figure out what to do with it in a subsequent diff. It cannot be lazily initialized because we currently use this arena even to allocate empty iterators, which is necessary even when no range deletions exist. Closes https://github.com/facebook/rocksdb/pull/1539 Differential Revision: D4203488 Pulled By: ajkr fbshipit-source-id: 3b36279	2016-11-18 17:09:11 -08:00
Yi Wu	36e4762ce0	Remove Ticker::SEQUENCE_NUMBER Summary: Remove the ticker count because: * Having to reset the ticker count in WriteImpl is ineffiecent; * It doesn't make sense to have it as a ticker count if multiple db instance share a statistics object. Closes https://github.com/facebook/rocksdb/pull/1531 Differential Revision: D4194442 Pulled By: yiwu-arbug fbshipit-source-id: e2110a9	2016-11-16 22:39:09 -08:00
Andrew Kryczka	489d142808	DeleteRange interface Summary: Expose DeleteRange() interface since we think the implementation is functionally correct now. Closes https://github.com/facebook/rocksdb/pull/1503 Differential Revision: D4171921 Pulled By: ajkr fbshipit-source-id: 5e21c98	2016-11-15 15:24:16 -08:00
Artemiy Kolesnikov	91300d01f6	Dynamic max_total_wal_size option Summary: Closes https://github.com/facebook/rocksdb/pull/1509 Differential Revision: D4176426 Pulled By: yiwu-arbug fbshipit-source-id: b57689d	2016-11-14 22:54:17 -08:00
Lijun Tang	adb665e0bf	Allowed delayed_write_rate option to be dynamically set. Summary: Closes https://github.com/facebook/rocksdb/pull/1488 Differential Revision: D4157784 Pulled By: siying fbshipit-source-id: f150081	2016-11-12 15:54:11 -08:00
Maysam Yabandeh	361010d447	Exporting compaction stats in the form of a map Summary: Currently the compaction stats are printed to stdout. We want to export the compaction stats in a map format so that the upper layer apps (e.g., MySQL) could present the stats in any format required by the them. Closes https://github.com/facebook/rocksdb/pull/1477 Differential Revision: D4149836 Pulled By: maysamyabandeh fbshipit-source-id: b3df19f	2016-11-11 20:54:14 -08:00
Reid Horuff	1ca5f6d132	Fix 2PC Recovery SeqId Miscount Summary: Originally sequence ids were calculated, in recovery, based off of the first seqid found if the first log recovered. The working seqid was then incremented from that value based on every insertion that took place. This was faulty because of the potential for missing log files or inserts that skipped the WAL. The current recovery scheme grabs sequence from current recovering batch and increments using memtableinserter to track how many actual inserts take place. This works for 2PC batches as well scenarios where some logs are missing or inserts that skip the WAL. Closes https://github.com/facebook/rocksdb/pull/1486 Differential Revision: D4156064 Pulled By: reidHoruff fbshipit-source-id: a6da8d9	2016-11-10 11:09:22 -08:00
Andrew Kryczka	c90fef88b1	fix open failure with empty wal Summary: Closes https://github.com/facebook/rocksdb/pull/1490 Differential Revision: D4158821 Pulled By: IslamAbdelRahman fbshipit-source-id: 59b73f4	2016-11-09 22:24:26 -08:00
Reid Horuff	d133b08f68	Use correct sequence number when creating memtable Summary: copied from: `5ebfd2623a` Opening existing RocksDB attempts recovery from log files, which uses wrong sequence number to create the memtable. This is a regression introduced in change `a400336`. This change includes a test demonstrating the problem, without the fix the test fails with "Operation failed. Try again.: Transaction could not check for conflicts for operation at SequenceNumber 1 as the MemTable only contains changes newer than SequenceNumber 2. Increasing the value of the max_write_buffer_number_to_maintain option could reduce the frequency of this error" This change is a joint effort by Peter 'Stig' Edwards thatsafunnyname and me. Closes https://github.com/facebook/rocksdb/pull/1458 Differential Revision: D4143791 Pulled By: reidHoruff fbshipit-source-id: 5a25033	2016-11-09 12:24:17 -08:00
Islam AbdelRahman	9bd191d2f4	Fix deadlock between (WriterThread/Compaction/IngestExternalFile) Summary: A deadlock is possible if this happen (1) Writer thread is stopped because it's waiting for compaction to finish (2) Compaction is waiting for current IngestExternalFile() calls to finish (3) IngestExternalFile() is waiting to be able to acquire the writer thread (4) WriterThread is held by stopped writes that are waiting for compactions to finish This patch fix the issue by not incrementing num_running_ingest_file_ except when we acquire the writer thread. This patch include a unittest to reproduce the described scenario Closes https://github.com/facebook/rocksdb/pull/1480 Differential Revision: D4151646 Pulled By: IslamAbdelRahman fbshipit-source-id: 09b39db	2016-11-09 10:54:10 -08:00
Andrew Kryczka	9e7cf3469b	DeleteRange user iterator support Summary: Note: reviewed in https://reviews.facebook.net/D65115 - DBIter maintains a range tombstone accumulator. We don't cleanup obsolete tombstones yet, so if the user seeks back and forth, the same tombstones would be added to the accumulator multiple times. - DBImpl::NewInternalIterator() (used to make DBIter's underlying iterator) adds memtable/L0 range tombstones, L1+ range tombstones are added on-demand during NewSecondaryIterator() (see D62205) - DBIter uses ShouldDelete() when advancing to check whether keys are covered by range tombstones Closes https://github.com/facebook/rocksdb/pull/1464 Differential Revision: D4131753 Pulled By: ajkr fbshipit-source-id: be86559	2016-11-04 12:09:22 -07:00
Andrew Kryczka	f998c9790f	DeleteRange Get support Summary: During Get()/MultiGet(), build up a RangeDelAggregator with range tombstones as we search through live memtable, immutable memtables, and SST files. This aggregator is then used by memtable.cc's SaveValue() and GetContext::SaveValue() to check whether keys are covered. added tests for Get on memtables/files; end-to-end tests mainly in https://reviews.facebook.net/D64761 Closes https://github.com/facebook/rocksdb/pull/1456 Differential Revision: D4111271 Pulled By: ajkr fbshipit-source-id: 6e388d4	2016-11-03 18:54:20 -07:00
Yi Wu	437942e481	Add avoid_flush_during_shutdown DB option Summary: Add avoid_flush_during_shutdown DB option. Closes https://github.com/facebook/rocksdb/pull/1451 Differential Revision: D4108643 Pulled By: yiwu-arbug fbshipit-source-id: abdaf4d	2016-11-02 15:39:18 -07:00
Andrew Kryczka	40a2e406f8	DeleteRange flush support Summary: Changed BuildTable() (used for flush) to (1) add range tombstones to the aggregator, which is used by CompactionIterator to determine which keys can be removed; and (2) add aggregator's range tombstones to the table that is output for the flush. Closes https://github.com/facebook/rocksdb/pull/1438 Differential Revision: D4100025 Pulled By: ajkr fbshipit-source-id: cb01a70	2016-10-31 20:54:18 -07:00
Siying Dong	da61f348d3	Print compression and Fast CRC support info as Header level Summary: Currently the compression suppport and fast CRC support information is printed as info level. They should be in the same level as options, which is header level. Also add ZSTD to this printing. Closes https://github.com/facebook/rocksdb/pull/1448 Differential Revision: D4106608 Pulled By: yiwu-arbug fbshipit-source-id: cb9a076	2016-10-31 16:09:13 -07:00
Siying Dong	c90c48d3c8	Show More DB Stats in info logs Summary: DB Stats now are truncated if there are too many CFs. Extend the buffer size to allow more to be printed out. Also, separate out malloc to another log line. Closes https://github.com/facebook/rocksdb/pull/1439 Differential Revision: D4100943 Pulled By: yiwu-arbug fbshipit-source-id: 79f7218	2016-10-29 16:09:18 -07:00
Siying Dong	04751d5345	L0 compression should follow options.compression_per_level if not empty Summary: Currently, we don't use options.compression_per_level[0] as the compression style for L0 compression type, unless it is None. This behavior doesn't look like on purpose. This diff will make sure L0 compress using the style of options.compression_per_level[0]. Reviewed and accepted in: https://reviews.facebook.net/D65607 Closes https://github.com/facebook/rocksdb/pull/1435 Differential Revision: D4099368 Pulled By: siying fbshipit-source-id: cfbbdcd	2016-10-28 17:39:20 -07:00
Aaron Gao	9de2f75216	revert Prev() in MergingIterator to use previous code in non-prefix-seek mode Summary: Siying suggested to keep old code for normal mode prev() for safety Test Plan: make check -j64 Reviewers: yiwu, andrewkr, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65439	2016-10-24 13:13:01 -07:00
Aaron Gao	59a7c0337b	Change ioptions to store user_comparator, fix bug Summary: change ioptions.comparator to user_comparator instread of internal_comparator. Also change Comparator* to InternalKeyComparator* to make its type explicitly. Test Plan: make all check -j64 Reviewers: andrewkr, sdong, yiwu Reviewed By: yiwu Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65121	2016-10-21 11:31:42 -07:00
Islam AbdelRahman	869ae5d786	Support IngestExternalFile (remove AddFile restrictions) Summary: Changes in the diff API changes: - Introduce IngestExternalFile to replace AddFile (I think this make the API more clear) - Introduce IngestExternalFileOptions (This struct will encapsulate the options for ingesting the external file) - Deprecate AddFile() API Logic changes: - If our file overlap with the memtable we will flush the memtable - We will find the first level in the LSM tree that our file key range overlap with the keys in it - We will find the lowest level in the LSM tree above the the level we found in step 2 that our file can fit in and ingest our file in it - We will assign a global sequence number to our new file - Remove AddFile restrictions by using global sequence numbers Other changes: - Refactor all AddFile logic to be encapsulated in ExternalSstFileIngestionJob Test Plan: unit tests (still need to add more) addfile_stress (https://reviews.facebook.net/D65037) Reviewers: yiwu, andrewkr, lightmark, yhchiang, sdong Reviewed By: sdong Subscribers: jkedgar, hcz, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D65061	2016-10-20 17:05:32 -07:00
Andrew Kryczka	f47054015d	Handle WAL deletion when using avoid_flush_during_recovery Summary: Previously the WAL files that were avoided during recovery would never be considered for deletion. That was because alive_log_files_ was only populated when log files are created. This diff further populates alive_log_files_ with existing log files that aren't flushed during recovery, such that FindObsoleteFiles() can find them later. Depends on D64053. Test Plan: new unit test, verifies it fails before this change and passes after Reviewers: sdong, IslamAbdelRahman, yiwu Reviewed By: yiwu Subscribers: leveldb, dhruba, andrewkr Differential Revision: https://reviews.facebook.net/D64059	2016-10-14 12:59:51 -07:00
Yi Wu	e29d3b67c2	Make max_background_compactions and base_background_compactions dynamic changeable Summary: Add DB::SetDBOptions to dynamic change max_background_compactions and base_background_compactions. I'll add more dynamic changeable options soon. Test Plan: unit test. Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64749	2016-10-14 12:25:39 -07:00
Islam AbdelRahman	5691a1d8a4	Fix compaction conflict with running compaction Summary: Issue scenario: (1) We have 3 files in L1 and we issue a compaction that will compact them into 1 file in L2 (2) While compaction (1) is running, we flush a file into L0 and trigger another compaction that decide to move this file to L1 and then move it again to L2 (this file don't overlap with any other files) (3) compaction (1) finishes and install the file it generated in L2, but this file overlap with the file we generated in (2) so we break the LSM consistency Looks like this issue can be triggered by using non-exclusive manual compaction or AddFile() Test Plan: unit tests Reviewers: sdong Reviewed By: sdong Subscribers: hermanlee4, jkedgar, andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D64947	2016-10-13 10:49:06 -07:00
Andrew Kryczka	017de666c7	fixup commit Summary: I accidentally left out these changes from my commit of D64053 due to messing up the merge conflict resolution. Test Plan: ./db_wal_test Reviewers: Subscribers: Tasks: Blame Revision: D64053	2016-10-13 08:48:40 -07:00
Andrew Kryczka	1b7af5fb1a	Redo handling of recycled logs in full purge Summary: This reverts commit `9e4aa798c3`, which doesn't handle all cases (see inline comment). I reimplemented the logic as suggested in the initial PR: https://github.com/facebook/rocksdb/pull/1313. This approach has two benefits: - All the parsing/filtering of full_scan_candidate_files is kept together in PurgeObsoleteFiles. - We only need to check whether log file is recycled in one place where we've already determined it's a log file Test Plan: new unit test, verified fails before the original fix, still passes now. Reviewers: IslamAbdelRahman, yiwu, sdong Reviewed By: yiwu, sdong Subscribers: leveldb, dhruba, andrewkr Differential Revision: https://reviews.facebook.net/D64053	2016-10-12 23:13:09 -07:00
Aaron Gao	447f17127c	new Prev() prefix support using SeekForPrev() Summary: 1) The previous solution for Prev() prefix support is not clean. Since I add api SeekForPrev(), now the Prev() can be symmetric to Next(). and we do not need SeekToLast() to be called in Prev() any more. Also, Next() will Seek(prefix_seek_key_) to solve the problem of possible inconsistency between db_iter and merge_iter when there is merge_operator. And prefix_seek_key is only refreshed when change direction to forward. 2) This diff also solves the bug of Iterator::SeekToLast() with iterate_upper_bound_ with prefix extractor. add test cases for the above two cases. There are some tests for the SeekToLast() in Prev(), I will clean them later. Test Plan: make all check Reviewers: IslamAbdelRahman, andrewkr, yiwu, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D63933	2016-10-11 13:54:26 -07:00
Reid Horuff	2c1f95291d	Add facility to write only a portion of WriteBatch to WAL Summary: When constructing a write batch a client may now call MarkWalTerminationPoint() on that batch. No batch operations after this call will be added written to the WAL but will still be inserted into the Memtable. This facility is used to remove one of the three WriteImpl calls in 2PC transactions. This produces a ~1% perf improvement. ``` RocksDB - unoptimized 2pc, sync_binlog=1, disable_2pc=off INFO 2016-08-31 14:30:38,814 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2619 seconds. Requests/second = 28628 RocksDB - optimized 2pc , sync_binlog=1, disable_2pc=off INFO 2016-08-31 16:26:59,442 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2581 seconds. Requests/second = 29054 ``` Test Plan: Two unit tests added. Reviewers: sdong, yiwu, IslamAbdelRahman Reviewed By: yiwu Subscribers: hermanlee4, dhruba, andrewkr Differential Revision: https://reviews.facebook.net/D64599	2016-10-07 11:32:10 -07:00
Islam AbdelRahman	87dfc1d23e	Fix conflict between AddFile() and CompactRange() Summary: Fix the conflict bug between AddFile() and CompactRange() by - Make sure that no AddFile calls are running when asking CompactionPicker to pick compaction for manual compaction - If AddFile() run after we pick the compaction for the manual compaction it will be aware of it since we will add the manual compaction to running_compactions_ after picking it This will solve these 2 scenarios - If AddFile() is running, we will wait for it to finish before we pick a compaction for the manual compaction - If we already picked a manual compaction and then AddFile() started ... we ensure that it never ingest a file in a level that will overlap with the manual compaction Test Plan: unit tests Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, yoshinorim, jkedgar, dhruba Differential Revision: https://reviews.facebook.net/D64449	2016-09-28 15:42:06 -07:00
yiwu-arbug	eb3894cf42	Recompute compaction score on SetOptions (#1346 ) Summary: We didn't recompute compaction score on SetOptions, and end up not having compaction if no flush happens afterward. The PR fixing it. Test Plan: See unit test. Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D64167	2016-09-27 11:17:15 -07:00
Islam AbdelRahman	5c64fb67d2	Fix AddFile() conflict with compaction output [WaitForAddFile()] Summary: Since AddFile unlock/lock the mutex inside LogAndApply() we need to ensure that during this period other compactions cannot run since such compactions are not aware of the file we are ingesting and could create a compaction that overlap wit this file this diff add - WaitForAddFile() call that will ensure that no AddFile() calls are being processed right now - Call `WaitForAddFile()` in 3 locations -- When doing manual Compaction -- When starting automatic Compaction -- When doing CompactFiles() Test Plan: unit test Reviewers: lightmark, yiwu, andrewkr, sdong Reviewed By: sdong Subscribers: andrewkr, yoshinorim, jkedgar, dhruba Differential Revision: https://reviews.facebook.net/D64383	2016-09-27 00:14:55 -07:00
Yi Wu	9ed928e7a9	Split DBOptions into ImmutableDBOptions and MutableDBOptions Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs. Test Plan: make all check Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64065	2016-09-23 16:34:04 -07:00
yiwu-arbug	4bc8c88e6b	Recover same sequence id from WAL (#1350 ) Summary: Revert the behavior where we don't read sequence id from WAL, but increase it as we replay the log. We still keep the behave for 2PC for now but will fix later. This change fixes github issue 1339, where some writes come with WAL disabled and we may recover records with wrong sequence id. Test Plan: Added unit test. Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D64275	2016-09-23 16:15:14 -07:00
Aaron Gao	715256338a	forbid merge during recovery Summary: Mitigate regression bug of options.max_successive_merges hit during DB Recovery For https://reviews.facebook.net/D62625 Test Plan: make all check Reviewers: horuff, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D62655	2016-09-21 11:05:07 -07:00
Yi Wu	e4d3f5d9b8	Fix DBImpl::GetWalPreallocateBlockSize Mac build error Summary: Specify type param with std::min to resolve compile error on Mac. Test Plan: https://travis-ci.org/facebook/rocksdb/builds/161223845 Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64143	2016-09-20 10:17:28 -07:00
sdong	d78a4401b5	DBImpl::GetWalPreallocateBlockSize() should return size_t Summary: WritableFile::SetPreallocationBlockSize() requires parameter as size_t, and options used in DBImpl::GetWalPreallocateBlockSize() are all size_t. WritableFile::SetPreallocationBlockSize() should return size_t to avoid build break if size_t is not uint64_t. Test Plan: Run existing tests. Reviewers: andrewkr, IslamAbdelRahman, yiwu Reviewed By: yiwu Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D64137	2016-09-19 16:51:38 -07:00
sdong	b666f85445	Consider more factors when determining preallocation size of WAL files Summary: Currently the WAL file preallocation size is 1.1 * write_buffer_size. This, however, will be over-estimated if options.db_write_buffer_size or options.max_total_wal_size is set and is much smaller. Test Plan: Add a unit test. Reviewers: andrewkr, yiwu Reviewed By: yiwu Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D63957	2016-09-19 12:04:35 -07:00
Yi Wu	0a88f38b7e	Remove ColumnFamilyData::options() Summary: One more small refactor before I split DBOptions into mutable and immutable parts. Test Plan: existing unit tests. Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64047	2016-09-16 15:09:14 -07:00
Andrew Kryczka	06b4785fec	Fix recovery for WALs without data for all CFs Summary: if one or more CFs had no data in the WAL, the log number that's used by FindObsoleteFiles() wasn't updated. We need to treat this case the same as if the data for that WAL had been flushed. Test Plan: new unit test Reviewers: IslamAbdelRahman, yiwu, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D63963	2016-09-15 11:40:48 -07:00
Yi Wu	17f76fc564	DB::GetOptions() reflect dynamic changed options Summary: DB::GetOptions() reflect dynamic changed options. Test Plan: See the new unit test. Reviewers: yhchiang, sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D63903	2016-09-14 22:10:28 -07:00
Yi Wu	81747f1be6	Refactor MutableCFOptions Summary: * Change constructor of MutableCFOptions to depends only on ColumnFamilyOptions. * Move `max_subcompactions`, `compaction_options_fifo` and `compaction_pri` to ImmutableCFOptions to make it clear that they are immutable. Test Plan: existing unit tests. Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D63945	2016-09-13 21:11:59 -07:00
somnathr	9e4aa798c3	Summary: (#1313 ) If log recycling is enabled with the rocksdb (recycle_log_file_num=16) db->Writebatch is erroring out with keynotfound after ~5-6 hours of run (1M seq but can happen to any workload I guess).See my detailed bug report here (https://github.com/facebook/rocksdb/issues/1303). This commit is the fix for this, a check is been added not to delete the log file if it is already there in the recycle list. Test Plan: Unit tested it and ran the similar profile. Not reproducing anymore.	2016-09-12 16:53:42 -07:00
Islam AbdelRahman	1cca091298	Temporarily revert Prev() prefix support Summary: Temporarily revert commits for supporting prefix Prev() to unblock MyRocks and RocksDB release These are the commits reverted - `6a14d55bd9` - `b18f9c9eac` - `db74b1a219` - `2482d5fb45` Test Plan: make check -j64 Reviewers: sdong, lightmark Reviewed By: lightmark Subscribers: andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D63789	2016-09-08 14:45:32 -07:00
Injun Song	ce1be2ce37	Fix build error on Windows (AppVeyor) (#1315 ) Add 'cf_options' to source list and db_imple.cc fix casting	2016-09-06 08:41:43 -07:00
Aaron Gao	2482d5fb45	support Prev() in prefix seek mode Summary: As title, make sure Prev() works as expected with Next() when the current iter->key() in the range of the same prefix in prefix seek mode Test Plan: make all check -j64 (add prefix_test with PrefixSeekModePrev test case) Reviewers: andrewkr, sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: yoshinorim, andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D61419	2016-08-29 20:55:39 -07:00
sdong	dade61ac26	Mitigate regression bug of options.max_successive_merges hit during DB Recovery Summary: After `1b8a2e8fdd`, DB Pointer is passed to WriteBatchInternal::InsertInto() while DB recovery. This can cause deadlock if options.max_successive_merges hits. In that case DB::Get() will be called. Get() will try to acquire the DB mutex, which is already held by the DB::Open(), causing a deadlock condition. This commit mitigates the problem by not passing the DB pointer unless 2PC is allowed. Test Plan: Add a new test and run it. Reviewers: IslamAbdelRahman, andrewkr, kradhakrishnan, horuff Reviewed By: kradhakrishnan Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D62625	2016-08-25 17:30:34 -07:00
Justin Gibbs	b2ce59537c	Persist data during user initiated shutdown Summary: Move the manual memtable flush for databases containing data that has bypassed the WAL from DBImpl's destructor to CancleAllBackgroundWork(). CancelAllBackgroundWork() is a publicly exposed API which allows async operations performed by background threads to be disabled on a database. In effect, this places the database into a "shutdown" state in advance of calling the database object's destructor. No compactions or flushing of SST files can occur once a call to this API completes. When writes are issued to a database with WriteOptions::disableWAL set to true, DBImpl::has_unpersisted_data_ is set so that memtables can be flushed when the database object is destroyed. If CancelAllBackgroundWork() has been called prior to DBImpl's destructor, this flush operation is not possible and is skipped, causing unnecessary loss of data. Since CancelAllBackgroundWork() is already invoked by DBImpl's destructor in order to perform the thread join portion of its cleanup processing, moving the manual memtable flush to CancelAllBackgroundWork() ensures data is persisted regardless of client behavior. Test Plan: Write an amount of data that will not cause a memtable flush to a rocksdb database with all writes marked with WriteOptions::disableWAL. Properly "close" the database. Reopen database and verify that the data was persisted. Reviewers: IslamAbdelRahman, yiwu, yoshinorim, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D62277	2016-08-25 12:24:22 -07:00
sdong	56dd034115	read_options.background_purge_on_iterator_cleanup to cover forward iterator and log file closing too. Summary: With read_options.background_purge_on_iterator_cleanup=true, File deletion and closing can still happen in forward iterator, or WAL file closing. Cover those cases too. Test Plan: I am adding unit tests. Reviewers: andrewkr, IslamAbdelRahman, yiwu Reviewed By: yiwu Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D61503	2016-08-10 13:16:41 -07:00
Yi Wu	ee027fc19f	Ignore write stall triggers when auto-compaction is disabled Summary: My understanding is that the purpose of write stall triggers are to wait for auto-compaction to catch up. Without auto-compaction, we don't need to stall writes. Also with this diff, flush/compaction conditions are recalculated on dynamic option change. Previously the conditions are recalculate only when write stall options are changed. Test Plan: See the new test. Removed two tests that are no longer valid. Reviewers: IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D61437	2016-08-02 21:55:26 -07:00
sdong	2a6d0cde72	Ignore stale logs while restarting DBs Summary: Stale log files can be deleted out of order. This can happen for various reasons. One of the reason is that no data is ever inserted to a column family and we have an optimization to update its log number, but not all the old log files are cleaned up (the case shown in the unit tests added). It can also happen when we simply delete multiple log files out of order. This causes data corruption because we simply increase seqID after processing the next row and we may end up with writing data with smaller seqID than what is already flushed to memtables. In DB recovery, for the oldest files we are replaying, if there it contains no data for any column family, we ignore the sequence IDs in the file. Test Plan: Add two unit tests that fail without the fix. Reviewers: IslamAbdelRahman, igor, yiwu Reviewed By: yiwu Subscribers: hermanlee4, yoshinorim, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60891	2016-07-25 11:47:31 -07:00
sdong	d5a51d4de3	Need to make sure log file synced before flushing memtable of one column family Summary: Multiput atomiciy is broken across multiple column families if we don't sync WAL before flushing one column family. The WAL file may contain a write batch containing writes to a key to the CF to be flushed and a key to other CF. If we don't sync WAL before flushing, if machine crashes after flushing, the write batch will only be partial recovered. Data to other CFs are lost. Test Plan: Add a new unit test which will fail without the diff. Reviewers: yhchiang, IslamAbdelRahman, igor, yiwu Reviewed By: yiwu Subscribers: yiwu, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60915	2016-07-21 16:29:06 -07:00
Aaron Gao	dda6c72ac8	Add DestroyColumnFamilyHandle(ColumnFamilyHandle) to db.h Summary: add DestroyColumnFamilyHandle(ColumnFamilyHandle) to close column family instead of deleting cfh* User should call this to close a cf and then we can detect the deletion in this function. Test Plan: make all check -j64 Reviewers: andrewkr, yiwu, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D60765	2016-07-13 17:59:25 -07:00
Yi Wu	6ea41f8527	Fix deadlock when trying update options when write stalls Summary: When write stalls because of auto compaction is disabled, or stop write trigger is reached, user may change these two options to unblock writes. Unfortunately we had issue where the write thread will block the attempt to persist the options, thus creating a deadlock. This diff fix the issue and add two test cases to detect such deadlock. Test Plan: Run unit tests. Also, revert db_impl.cc to master (but don't revert `DBImpl::BackgroundCompaction:Finish` sync point) and run db_options_test. Both tests should hit deadlock. Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D60627	2016-07-12 15:30:38 -07:00
Aaron Gao	8e6b38d895	update DB::AddFile to ingest list of sst files Summary: DB::AddFile(std::string file_path) API that allow them to ingest an SST file created using SstFileWriter We want to update this interface to be able to accept a list of files that will be ingested, DB::AddFile(std::vector<std::string> file_path_list). Test Plan: Add test case `AddExternalSstFileList` in `DBSSTTest`. To make sure: 1. files key ranges are not overlapping with each other 2. each file key range dont overlap with the DB key range 3. make sure no snapshots are held Reviewers: andrewkr, sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D58587	2016-07-11 10:43:12 -07:00
sdong	a00bf1b3cf	Add More Logging to track total_log_size Summary: We saw instances where total_log_size is off the real value, but I'm not able to reproduce it. Add more logging to help debugging when it happens again. Test Plan: Run the unit test and see the logging. Reviewers: andrewkr, yhchiang, igor, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60081	2016-07-06 14:29:18 -07:00
sdong	32df9733d1	Add options.write_buffer_manager: control total memtable size across DB instances Summary: Add option write_buffer_manager to help users control total memory spent on memtables across multiple DB instances. Test Plan: Add a new unit test. Reviewers: yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: adela, benj, sumeet, muthu, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59925	2016-07-05 18:11:25 -07:00
omegaga	a45ee83181	Fix a bug that accesses invalid address in iterator cleanup function Summary: Reported in T11889874. When registering the cleanup function we should copy the option so that we can still access it if ReadOptions is deleted. Test Plan: Add a unit test to reproduce this bug. Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60087	2016-07-05 11:57:14 -07:00
omegaga	c4e19b77e8	Add a read option to enable background purge when cleaning up iterators Summary: Add a read option `background_purge_on_iterator_cleanup` to avoid deleting files in foreground when destroying iterators. Instead, a job is scheduled in high priority queue and would be executed in a separate background thread. Test Plan: Add a variant of PurgeObsoleteFileTest. Turn on background purge option in the new test, and use sleeping task to ensure files are deleted in background. Reviewers: IslamAbdelRahman, sdong Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59499	2016-06-21 18:41:23 -07:00
Islam AbdelRahman	fa813f7478	Update DB::AddFile() to ingest the file to the lowest possible level Summary: DB::AddFile() right now always add the ingested file to L0 update the logic to add the file to the lowest possible level Test Plan: unit tests Reviewers: jkedgar, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D59637	2016-06-21 17:57:59 -07:00
sdong	7b79238b65	Deprectate filter_deletes Summary: filter_deltes is not a frequently used feature. Remove it. Test Plan: Run all test suites. Reviewers: igor, yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59427	2016-06-17 10:30:47 -07:00
Islam AbdelRahman	30a24f2d3d	Add InternalStats and logging for AddFile() Summary: We dont report the bytes that we ingested from AddFile which make the write amplification numbers incorrect Update InternalStats and add logging for AddFile() Test Plan: Make sure the code compile and existing tests pass Reviewers: lightmark, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59763	2016-06-16 16:21:41 -07:00
Yi Wu	bc8af90e8c	add option to not flush memtable on open() Summary: Add option to not flush memtable on open() In case the option is enabled, don't delete existing log files by not updating log numbers to MANIFEST. Will still flush if we need to (e.g. memtable full in the middle). In that case we also flush final memtable. If wal_recovery_mode = kPointInTimeRecovery, do not halt immediately after encounter corruption. Instead, check if seq id of next log file is last_log_sequence + 1. In that case we continue recovery. Test Plan: See unit test. Reviewers: dhruba, horuff, sdong Reviewed By: sdong Subscribers: benj, yhchiang, andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57813	2016-06-13 11:34:16 -07:00
Wanning Jiang	56887f6cb8	Backup Options Summary: Backup options file to private directory Test Plan: backupable_db_test.cc, BackupOptions Modify DB options by calling OpenDB for 3 times. Check the latest options file is in the right place. Also check no redundent files are backuped. Reviewers: andrewkr Reviewed By: andrewkr Subscribers: leveldb, dhruba, andrewkr Differential Revision: https://reviews.facebook.net/D59373	2016-06-09 19:03:10 -07:00
Andrew Kryczka	842958651f	Fix race condition in SwitchMemtable Summary: MemTableList::current_ could be written by background flush thread and simultaneously read in the user thread (NumNotFlushed() is used in SwitchMemtable()). Use the lock to prevent this case. Found the error from tsan. Related: D58833 Test Plan: $ OPT=-g COMPILE_WITH_TSAN=1 make -j64 db_test $ TEST_TMPDIR=/dev/shm/rocksdb ./db_test --gtest_filter=DBTest.RepeatedWritesToSameKey Reviewers: lightmark, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D59139	2016-06-02 17:11:45 -07:00
PraveenSinghRao	3a276b0cbe	Add a callback for when memtable is moved to immutable (#1137 ) * Create a callback for memtable becoming immutable Create a callback for memtable becoming immutable Create a callback for memtable becoming immutable moved notification outside the lock Move sealed notification to unlocked portion of SwitchMemtable * fix lite build	2016-06-02 11:57:31 -07:00
Mike Kolupaev	936973d145	Small tweaks to logging to track the number of immutable memtables Summary: We see some write stalls because of number of unflushed memtables. With existing logging I couldn't figure out what's happening exactly. See internal task t11446054 for details if interested. This diff adds: - logging of memtable creation at info level; I wanted it on multiple occasions for different reasons; also include number of immutable memtables, - logging of number of remaining immutable memtables after a flush. Test Plan: ran tests Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58833	2016-06-01 11:11:33 -07:00
Reid Horuff	5d85fdb2c5	add missing lock	2016-05-31 12:26:48 -07:00
Ashish Shenoy	99765ed855	Clean up the ComputeCompactionScore() API Summary: Make CompactionOptionsFIFO a part of mutable_cf_options Test Plan: UT Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, lgalanis, dhruba Differential Revision: https://reviews.facebook.net/D58653	2016-05-23 15:55:29 -07:00
Sage Weil	11f329bd40	db/db_impl: restrict WALRecoveryMode when using recycled log files kPointInTimeRecovery is indistinguishable from kTolerateCorruptedTailRecords in recycle mode since we define the "end" of the log as the first corrupt record we encounter. kAbsoluteConsistency doesn't make sense because even a clean shutdown leaves old junk at the end of the log file. Signed-off-by: Sage Weil <sage@redhat.com>	2016-05-22 22:00:15 -07:00
Aaron Orenstein	2073cf3775	Eliminate use of 'using namespace std'. Also remove a number of ADL references to std functions. Summary: Reduce use of argument-dependent name lookup in RocksDB. Test Plan: 'make check' passed. Reviewers: andrewkr Reviewed By: andrewkr Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58203	2016-05-20 07:42:18 -07:00
Islam AbdelRahman	c70a9335de	Fix mutex unlock issue between scheduled compaction and ReleaseCompactionFiles() Summary: NotifyOnCompactionCompleted can unlock the mutex. That mean that we can schedule a background compaction that will start before we ReleaseCompactionFiles(). Test Plan: added unittest existing unittest Reviewers: yhchiang, sdong Reviewed By: sdong Subscribers: yoshinorim, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58065	2016-05-18 14:56:30 -07:00
Aaron Gao	43afd72bee	[rocksdb] make more options dynamic Summary: make more ColumnFamilyOptions dynamic: - compression - soft_pending_compaction_bytes_limit - hard_pending_compaction_bytes_limit - min_partial_merge_operands - report_bg_io_stats - paranoid_file_checks Test Plan: Add sanity check in `db_test.cc` for all above options except for soft_pending_compaction_bytes_limit and hard_pending_compaction_bytes_limit. All passed. Reviewers: andrewkr, sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57519	2016-05-17 13:11:56 -07:00
Islam AbdelRahman	f6aedb62c0	Fix Transaction memory leak Summary: - Make sure we clean up recovered_transactions_ on DBImpl destructor - delete leaked txns and env in TransactionTest Test Plan: Run transaction_test under valgrind Reviewers: sdong, andrewkr, yhchiang, horuff Reviewed By: horuff Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58263	2016-05-16 16:32:55 -07:00
Reid Horuff	a400336398	TransactionLogIterator sequence gap fix Summary: DBTestXactLogIterator.TransactionLogIterator was failing due the sequence gaps. This was caused by an off-by-one error when calculating the new sequence number after recovering from logs. Test Plan: db_log_iter_test Reviewers: andrewkr Subscribers: andrewkr, hermanlee4, dhruba, IslamAbdelRahman Differential Revision: https://reviews.facebook.net/D58053	2016-05-12 13:54:08 -07:00
Reid Horuff	c27061dae7	[rocksdb] 2PC double recovery bug fix Summary: 1. prepare() 2. crash 3. recover 4. commit() 5. crash 6. data is lost This is due to the transaction data still only residing in the WAL but because the logs were flushed on the first recovery the data is ignored on the second recovery. We must scan all logs found on recovery and only ignore redundant data at the time of replay. It is not possible to know which logs still contain relevant data at time of recovery. We cannot simply ignore a log because all of the non-2pc data it contains has already been written to L0. The changes made to MemTableInserter are to ensure that prepared sections are still recovered even if all of the non-2pc data in that log has already been flushed to L0. Test Plan: Provided test. Reviewers: sdong Subscribers: andrewkr, hermanlee4, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57729	2016-05-10 14:06:07 -07:00
Reid Horuff	a657ee9a9c	[rocksdb] Recovery path sequence miscount fix Summary: Consider the following WAL with 4 batch entries prefixed with their sequence at time of memtable insert. [1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(a)] [1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(b)] [4: COMMIT(a)] [7: COMMIT(b)] The first two batches do not consume any sequence numbers so are both prefixed with seq=1. For 2pc commit, memtable insertion takes place before COMMIT batch is written to WAL. We can see that sequence number consumption takes place between WAL entries giving us the seemingly sparse sequence prefix for WAL entries. This is a valid WAL. Because with 2PC markers one WriteBatch points to another batch containing its inserts a writebatch can consume more or less sequence numbers than the number of sequence consuming entries that it contains. We can see that, given the entries in the WAL, 6 sequence ids were consumed. Yet on recovery the maximum sequence consumed would be 7 + 3 (the number of sequence numbers consumed by COMMIT(b)) So, now upon recovery we must track the actual consumption of sequence numbers. In the provided scenario there will be no sequence gaps, but it is possible to produce a sequence gap. This should not be a problem though. correct? Test Plan: provided test. Reviewers: sdong Subscribers: andrewkr, leveldb, dhruba, hermanlee4 Differential Revision: https://reviews.facebook.net/D57645	2016-05-10 14:06:07 -07:00
Reid Horuff	1b8a2e8fdd	[rocksdb] Memtable Log Referencing and Prepared Batch Recovery Summary: This diff is built on top of WriteBatch modification: https://reviews.facebook.net/D54093 and adds the required functionality to rocksdb core necessary for rocksdb to support 2PC. modfication of DBImpl::WriteImpl() - added two arguments uint64_t log_used = nullptr, uint64_t log_ref = 0; - log_used is an output argument which will return the log number which the incoming batch was inserted into, 0 if no WAL insert took place. - log_ref is a supplied log_number which all memtables inserted into will reference after the batch insert takes place. This number will reside in 'FindMinPrepLogReferencedByMemTable()' until all Memtables insertinto have flushed. - Recovery/writepath is now aware of prepared batches and commit and rollback markers. Test Plan: There is currently no test on this diff. All testing of this functionality takes place in the Transaction layer/diff but I will add some testing. Reviewers: IslamAbdelRahman, sdong Subscribers: leveldb, santoshb, andrewkr, vasilep, dhruba, hermanlee4 Differential Revision: https://reviews.facebook.net/D56919	2016-05-10 14:06:07 -07:00
Islam AbdelRahman	4b31723433	Add bottommost_compression option Summary: Add a new option that can be used to set a specific compression algorithm for bottommost level. This option will only affect levels larger than base level. I have also updated CompactionJobInfo to include the compression algorithm used in compaction Test Plan: added new unittest existing unittests Reviewers: andrewkr, yhchiang, sdong Reviewed By: sdong Subscribers: lightmark, andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D57669	2016-05-09 15:57:19 -07:00
Yi Wu	a92049e3e7	Added EventListener::OnTableFileCreationStarted() callback Summary: Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status. Test Plan: unit test. Reviewers: dhruba, yhchiang, ott, sdong Reviewed By: sdong Subscribers: sdong, kradhakrishnan, IslamAbdelRahman, andrewkr, yhchiang, leveldb, ott, dhruba Differential Revision: https://reviews.facebook.net/D56337	2016-04-29 11:35:00 -07:00
sdong	992a8f83b7	Not enable jemalloc status printing if USE_CLANG=1 Summary: Warning is printed out with USE_CLANG=1 when including jemalloc.h. Disable it in that case. Test Plan: Run db_bench with USE_CLANG=1 and not. Make sure they can all build and jemalloc status is printed out in the case where USE_CLANG is not set. Reviewers: andrewkr, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57399	2016-04-28 16:16:14 -07:00
Islam AbdelRahman	0850bc5147	Fix build on machines without jemalloc Summary: It looks like we mistakenly enable JEMALLOC even if it's not available on the machine, that's why travis is failing Test Plan: check on my devserver check on my mac Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57345	2016-04-27 18:25:19 -07:00
Sergey Makarenko	1c80dfab24	Print memory allocation counters Summary: Introduced option to dump malloc statistics using new option flag. Added new command line option to db_bench tool to enable this funtionality. Also extended build to support environments with/without jemalloc. Test Plan: 1) Build rocksdb using `make` command. Launch the following command `./db_bench --benchmarks=fillrandom --dump_malloc_stats=true --num=10000000` end verified that jemalloc dump is present in LOG file. 2) Build rocksdb using `DISABLE_JEMALLOC=1 make db_bench -j32` and ran the same db_bench tool and found the following message in LOG file: "Please compile with jemalloc to enable malloc dump". 3) Also built rocksdb using `make` command on MacOS to verify behavior in non-FB environment. Also to debug build configuration change temporary changed AM_DEFAULT_VERBOSITY = 1 in Makefile to see compiler and build tools output. For case 1) -DROCKSDB_JEMALLOC was present in compiler command line. For both 2) and 3) this flag was not present. Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57321	2016-04-27 16:23:33 -07:00
sdong	ac0e54b4c6	CompactedDB should not be used if there is outstanding WAL files Summary: CompactedDB skips memtable. So we shouldn't use compacted DB if there is outstanding WAL files. Test Plan: Change to options.max_open_files = -1 perf context test to create a compacted DB, which we shouldn't do. Reviewers: yhchiang, kradhakrishnan, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57057	2016-04-26 14:22:07 -07:00
Yueh-Hsuan Chiang	24110ce90c	Correct Statistics FLUSH_WRITE_BYTES Summary: In https://reviews.facebook.net/D56271, we fixed an issue where we consider flush as compaction. However, that makes us mistakenly count FLUSH_WRITE_BYTES twice (one in flush_job and one in db_impl.) This patch removes the one incremented in db_impl. Test Plan: db_test Reviewers: yiwu, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57111	2016-04-25 12:01:01 -07:00
sdong	4b6833aec1	Rename options.compaction_measure_io_stats to options.report_bg_io_stats and include flush too. Summary: It is useful to print out IO stats in flush jobs too. Extend options.compaction_measure_io_stats to flush jobs and raname it. Test Plan: Try db_bench and see the stats are printed out. Reviewers: yhchiang Reviewed By: yhchiang Subscribers: kradhakrishnan, yiwu, IslamAbdelRahman, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56769	2016-04-15 10:22:18 -07:00
Jay Edgar	2448f80375	Make sure that if use_mmap_reads is on use_os_buffer is also on Summary: The code assumes that if use_mmap_reads is on then use_os_buffer is also on. This make sense as by using memory mapped files for reading you are expecting the OS to cache what it needs. Add code to make sure the user does not turn off use_os_buffer when they turn on use_mmap_reads Test Plan: New test: DBTest.MMapAndBufferOptions Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56397	2016-04-08 14:30:15 -07:00
Andrew Kryczka	2391ef7214	Embed column family name in SST file Summary: Added the column family name to the properties block. This property is omitted only if the property is unavailable, such as when RepairDB() writes SST files. In a next diff, I will change RepairDB to use this new property for deciding to which column family an existing SST file belongs. If this property is missing, it will add it to the "unknown" column family (same as its existing behavior). Test Plan: New unit test: $ ./db_table_properties_test --gtest_filter=DBTablePropertiesTest.GetColumnFamilyNameProperty Reviewers: IslamAbdelRahman, yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D55605	2016-04-06 23:10:32 -07:00
Marton Trencseni	9b51987521	Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes. Summary: When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache. What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention. Test Plan: 'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK. I didn't run the Java tests, I don't have Java set up on my devserver. Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56133	2016-04-01 10:42:39 -07:00
zensan	78711524b7	In all the places where log records are read, there was a check that record.size() should not be less than 12. This "magic number" seems to be the WriteBatch header (8 byte sequence and 4 byte count). Replaced all the places where "12" was used by WriteBatchInternal::kHeader.	2016-03-30 23:05:22 +05:30
Praveen Rao	583157f710	Avoid overloaded virtual function	2016-03-22 17:10:31 -07:00
Praveen Rao	4f1c74a46e	merge from master	2016-03-18 12:48:01 -07:00
Praveen Rao	f8c2189307	Publish log numbers for column family to wal_filter, and provide log number in the record callback	2016-03-18 12:32:15 -07:00
Baris Yazici	e8e6cf0173	fix: handle_fatal_signal (sig=6) in std::vector<std::string, std::allocator<std::string> >::_M_range_check \| c++/4.8.2/bits/stl_vector.h:794 #174 Summary: Fix for https://github.com/facebook/mysql-5.6/issues/174 When there is no old files to purge, vector.at(i) function was crashing if (old_info_log_file_count != 0 && old_info_log_file_count >= db_options_.keep_log_file_num) { std::sort(old_info_log_files.begin(), old_info_log_files.end()); size_t end = old_info_log_file_count - db_options_.keep_log_file_num; for (unsigned int i = 0; i <= end; i++) { std::string& to_delete = old_info_log_files.at(i); Added check to old_info_log_file_count be non zero. Test Plan: run existing tests Reviewers: gunnarku, vasilep, sdong, yhchiang Reviewed By: yhchiang Subscribers: andrewkr, webscalesql-eng, dhruba Differential Revision: https://reviews.facebook.net/D55245	2016-03-11 11:11:45 -08:00
Andrew Kryczka	d9620239d2	Cleanup stale manifests outside of full purge Summary: - Keep track of obsolete manifests in VersionSet - Updated FindObsoleteFiles() to put obsolete manifests in the JobContext for later use by PurgeObsoleteFiles() - Added test case that verifies a stale manifest is deleted by a non-full purge Test Plan: $ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation Reviewers: IslamAbdelRahman, yoshinorim, sdong Reviewed By: sdong Subscribers: andrewkr, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D55269	2016-03-10 18:16:21 -08:00
Yueh-Hsuan Chiang	765597fa78	Update compaction score right after CompactFiles forms a compaction Summary: This is a follow-up patch of https://reviews.facebook.net/D54891. As the information about files being compacted will also be used when making compaction decision, it is necessary to update the compaction score when a compaction plan has been made but not yet execute. This patch adds a missing call to update the compaction score in CompactFiles(). Test Plan: compact_files_test Reviewers: sdong, IslamAbdelRahman, kradhakrishnan, yiwu, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D55227	2016-03-10 14:34:28 -08:00
Yueh-Hsuan Chiang	a7d4eb2f34	Fix a bug where flush does not happen when a manual compaction is running Summary: Currently, when rocksdb tries to run manual compaction to refit data into a level, there's a ReFitLevel() process that requires no bg work is currently running. When RocksDB plans to ReFitLevel(), it will do the following: 1. pause scheduling new bg work. 2. wait until all bg work finished 3. do the ReFitLevel() 4. unpause scheduling new bg work. However, as it pause scheduling new bg work at step one and waiting for all bg work finished in step 2, RocksDB will stop flushing until all bg work is done (which could take a long time.) This patch fix this issue by changing the way ReFitLevel() pause the background work: 1. pause scheduling compaction. 2. wait until all bg work finished. 3. pause scheduling flush 4. do ReFitLevel() 5. unpause both flush and compaction. The major difference is that. We only pause scheduling compaction in step 1 and wait for all bg work finished in step 2. This prevent flush being blocked for a long time. Although there's a very rare case that ReFitLevel() might be in starvation in step 2, but it's less likely the case as flush typically finish very fast. Test Plan: existing test. Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D55029	2016-03-04 14:24:52 -08:00
Islam AbdelRahman	dfe96c72c3	Fix WriteLevel0TableForRecovery file delete protection Summary: The call to ``` CaptureCurrentFileNumberInPendingOutputs() ``` should be before ``` versions_->NewFileNumber() ``` Right now we are not actually protecting the file from being deleted Test Plan: make check Reviewers: sdong, anthony, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D54645	2016-03-03 18:25:07 -08:00
sdong	e79ad9e184	Add Iterator Property rocksdb.iterator.version_number Summary: We want to provide a way to detect whether an iterator is stale and needs to be recreated. Add a iterator property to return version number. Test Plan: Add two unit tests for it. Reviewers: IslamAbdelRahman, yhchiang, anthony, kradhakrishnan, andrewkr Reviewed By: andrewkr Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54921	2016-03-02 16:23:59 -08:00
Islam AbdelRahman	6743135ea1	Fix DB::AddFile() issue when PurgeObsoleteFiles() is called Summary: In some situations the DB will scan all existing files in the DB path and delete the ones that are Obsolete. If this happen during adding an external sst file. this could cause the file to be deleted while we are adding it. This diff fix this issue Test Plan: unit test to reproduce the bug existing unit tests Reviewers: sdong, yhchiang, andrewkr Reviewed By: andrewkr Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D54627	2016-03-01 12:05:29 -08:00
sdong	38201b3599	Fix assert failure when DBImpl::SyncWAL() conflicts with log rolling Summary: DBImpl::SyncWAL() releases db mutex before calling DBImpl::MarkLogsSynced(), while inside DBImpl::MarkLogsSynced() we assert there is none or one outstanding log file. However, a memtable switch can happen in between and causing two or outstanding logs there, failing the assert. The diff adds a unit test that repros the issue and fix the assert so that the unit test passes. Test Plan: Run the new tests. Reviewers: anthony, kolmike, yhchiang, IslamAbdelRahman, kradhakrishnan, andrewkr Reviewed By: andrewkr Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54621	2016-02-23 11:42:15 -08:00
Mike Kolupaev	eef63ef807	Fixed CompactFiles() spuriously failing or corrupting DB Summary: We started getting two kinds of crashes since we started using `DB::CompactFiles()`: (1) `CompactFiles()` fails saying something like "/data/logdevice/4440/shard12/012302.sst: No such file or directory", and presumably makes DB read-only, (2) DB fails to open saying "Corruption: Can't access /267000.sst: IO error: /data/logdevice/4440/shard1/267000.sst: No such file or directory". AFAICT, both can be explained by background thread deleting compaction output as "obsolete" while it's being written, before it's committed to manifest. If it ends up committed to the manifest, we get (2); if compaction notices the disappearance and fails, we get (1). The internal tasks t10068021 and t10134177 have some details about the investigation that led to this. Test Plan: `make -j check`; the new test fails to reopen the DB without the fix Reviewers: yhchiang Reviewed By: yhchiang Subscribers: dhruba, sdong Differential Revision: https://reviews.facebook.net/D54561	2016-02-22 13:54:58 -08:00
Islam AbdelRahman	df9ba6df62	Introduce SstFileManager::SetMaxAllowedSpaceUsage() to cap disk space usage Summary: Introude SstFileManager::SetMaxAllowedSpaceUsage() that can be used to limit the maximum space usage allowed for RocksDB. When this limit is exceeded WriteImpl() will fail and return Status::Aborted() Test Plan: unit testing Reviewers: yhchiang, anthony, andrewkr, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D53763	2016-02-17 15:20:23 -08:00
reid horuff	5bcf952a87	Fix WriteImpl empty batch hanging issue Summary: There is an issue in DBImpl::WriteImpl where if an empty writebatch comes in and sync=true then the logs will be marked as being synced yet the sync never actually happens because there is no data in the writebatch. This causes the next incoming batch to hang while waiting for the logs to complete syncing. This fix syncs logs even if the writebatch is empty. Test Plan: DoubleEmptyBatch unit test in transaction_test. Reviewers: yoshinorim, hermanlee4, sdong, ngbronson, anthony Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54057	2016-02-16 12:21:33 -08:00
Mike Kolupaev	44371501f0	Fixed a segfault when compaction fails Summary: We've hit it today. Test Plan: `make -j check`; didn't reproduce the issue Reviewers: yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D54219	2016-02-16 11:11:16 -08:00
Baraa Hamodi	21e95811d1	Updated all copyright headers to the new format.	2016-02-09 15:12:00 -08:00
Yueh-Hsuan Chiang	4a8cbf4e31	Allows Get and MultiGet to read directly from SST files. Summary: Add kSstFileTier to ReadTier, which allows Get and MultiGet to read only directly from SST files and skip mem-tables. kSstFileTier = 0x2 // data in SST files. // Note that this ReadTier currently only supports // Get and MultiGet and does not support iterators. Test Plan: add new test in db_test. Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: igor, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D53511	2016-02-09 11:20:22 -08:00
reid horuff	6f71d3b68b	Improve perf of Pessimistic Transaction expirations (and optimistic transactions) Summary: copy from task 8196669: 1) Optimistic transactions do not support batching writes from different threads. 2) Pessimistic transactions do not support batching writes if an expiration time is set. In these 2 cases, we currently do not do any write batching in DBImpl::WriteImpl() because there is a WriteCallback that could decide at the last minute to abort the write. But we could support batching write operations with callbacks if we make sure to process the callbacks correctly. To do this, we would first need to modify write_thread.cc to stop preventing writes with callbacks from being batched together. Then we would need to change DBImpl::WriteImpl() to call all WriteCallback's in a batch, only write the batches that succeed, and correctly set the state of each batch's WriteThread::Writer. Test Plan: Added test WriteWithCallbackTest to write_callback_test.cc which creates multiple client threads and verifies that writes are batched and executed properly. Reviewers: hermanlee4, anthony, ngbronson Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D52863	2016-02-05 10:44:13 -08:00
Andrew Kryczka	284aa613a7	Eliminate duplicated property constants Summary: Before this diff, there were duplicated constants to refer to properties (user- facing API had strings and InternalStats had an enum). I noticed these were inconsistent in terms of which constants are provided, names of constants, and documentation of constants. Overall it seemed annoying/error-prone to maintain these duplicated constants. So, this diff gets rid of InternalStats's constants and replaces them with a map keyed on the user-facing constant. The value in that map contains a function pointer to get the property value, so we don't need to do string matching while holding db->mutex_. This approach has a side benefit of making many small handler functions rather than a giant switch-statement. Test Plan: db_properties_test passes, running "make commit-prereq -j32" Reviewers: sdong, yhchiang, kradhakrishnan, IslamAbdelRahman, rven, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D53253	2016-02-02 19:14:56 -08:00
Nathan Bronson	9c2cf9479b	Fix for --allow_concurrent_memtable_write with batching Summary: Concurrent memtable adds were incorrectly computing the last sequence number for a write batch group when the write batches were not solitary. This is the cause of https://github.com/facebook/mysql-5.6/issues/155 Test Plan: 1. unit tests 2. new unit test 3. parallel db_bench stress tests with batch size of 10 and asserts enabled Reviewers: igor, sdong Reviewed By: sdong Subscribers: IslamAbdelRahman, MarkCallaghan, dhruba Differential Revision: https://reviews.facebook.net/D53595	2016-02-01 20:41:57 -08:00
SherlockNoMad	37159a6448	Add histogram for value size per operation	2016-01-31 18:09:24 -08:00
Venkatesh Radhakrishnan	3b2a1ddd2e	Add options.base_background_compactions as a number of compaction threads for low compaction debt Summary: If options.base_background_compactions is given, we try to schedule number of compactions not existing this number, only when L0 files increase to certain number, or pending compaction bytes more than certain threshold, we schedule compactions based on options.max_background_compactions. The watermarks are calculated based on slowdown thresholds. Test Plan: Add new test cases in column_family_test. Adding more unit tests. Reviewers: IslamAbdelRahman, yhchiang, kradhakrishnan, rven, anthony Reviewed By: anthony Subscribers: leveldb, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D53409	2016-01-29 16:15:53 -08:00
Islam AbdelRahman	d6c838f1e1	Add SstFileManager (component tracking all SST file in DBs and control the deletion rate) Summary: Add a new class SstFileTracker that will be notified whenever a DB add/delete/move and sst file, it will also replace DeleteScheduler SstFileTracker can be used later to abort writes when we exceed a specific size Test Plan: unit tests Reviewers: rven, anthony, yhchiang, sdong Reviewed By: sdong Subscribers: igor, lovro, march, dhruba Differential Revision: https://reviews.facebook.net/D50469	2016-01-28 18:35:01 -08:00
Andrew Kryczka	167bd8856d	[directory includes cleanup] Finish removing util->db dependencies	2016-01-26 10:49:24 -08:00
Andrew Kryczka	acd7d58695	[directory includes cleanup] Remove util->db dependency for ThreadStatusUtil Summary: We can avoid the dependency by forward-declaring ColumnFamilyData and then treating it as a black box. That means callers of ThreadStatusUtil need to explicitly provide more options, even if they can be derived from the ColumnFamilyData, since ThreadStatusUtil doesn't include the definition. This is part of a series of diffs to eliminate circular dependencies between directories (e.g., db/* files depending on util/* files and vice-versa). Test Plan: $ ./db_test --gtest_filter=DBTest.GetThreadStatus $ make -j32 commit-prereq Reviewers: sdong, yhchiang, IslamAbdelRahman Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D53361	2016-01-26 10:49:16 -08:00
Andrew Kryczka	46f9cd46af	[directory includes cleanup] Move cross-function test points Summary: I split the db-specific test points out into a separate file under db/ directory. There were also a few bugs to fix in xfunc.{h,cc} that prevented it from compiling previously; see https://reviews.facebook.net/D36825. Test Plan: compilation works now, below command works, will also run "make xfunc". $ make check ROCKSDB_XFUNC_TEST='managed_new' tests-regexp='DBTest' -j32 Reviewers: sdong, yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D53343	2016-01-26 10:49:05 -08:00
Islam AbdelRahman	2fbc59a348	Disallow SstFileWriter from creating empty sst files Summary: SstFileWriter may create an sst file with no entries Right now this will fail when being ingested using DB::AddFile() saying that the keys are corrupted Test Plan: make check Reviewers: yhchiang, rven, anthony, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D52815	2016-01-25 13:47:07 -08:00
David Bernard	3f12e16f27	Make alloca.h optional	2016-01-19 06:17:31 +00:00
David Bernard	d78c6b28c4	Changes for build on solaris Makefile adjust paths for solaris build Makefile enable _GLIBCXX_USE_C99 so that std::to_string is available db_compaction_test.cc Initialise a variable to avoid a compilation error db_impl.cc Include <alloca.h> db_test.cc Include <alloca.h> Environment.java recognise solaris envrionment options_bulder.cc Make log unambiguous geodb_impl.cc Make log and floor unambiguous	2016-01-19 04:45:21 +00:00
Venkatesh Radhakrishnan	7ece10ecb6	DeleteFilesInRange: Mark files to be deleted as being compacted before applying change Summary: While running the myrocks regression suite, I found that while dropping a table soon after inserting rows into it resulted in an assertion failure in CheckConsistencyForDeletes for not finding a file which was recently added or moved. Marking the files to be deleted as being compacted before calling LogAndApplyChange fixed the assertion failures. Test Plan: DBCompactionTest.DeleteFileRange Reviewers: IslamAbdelRahman, anthony, yhchiang, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba, yoshinorim, leveldb Differential Revision: https://reviews.facebook.net/D52599	2016-01-07 14:48:45 -08:00
Reid Horuff	da032495d3	Optimize GetLatestSequenceForKey Summary: DBImpl::GetLatestSequenceForKey() can do memcpy's to load a value that will never be used. This can be optimized by changing all the Get() functions called to optionally not fetch the value (and only fetch the sequencenumber). Test Plan: optimistic_transaction_test and transaction_test Reviewers: anthony Reviewed By: anthony Subscribers: leveldb, dhruba, hermanlee4 Differential Revision: https://reviews.facebook.net/D52227	2016-01-06 13:43:22 -08:00
Venkatesh Radhakrishnan	d74c9f0a57	DeleteFilesInRange: Clean job context if no files deleted Summary: We need to clean the job context if we end up not deleting any files because no files are in the range specified. Test Plan: DBCompactionTest.DeleteFileRange Reviewers: sdong, anthony, yhchiang, kradhakrishnan, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D52467	2016-01-04 10:55:31 -08:00
Nathan Bronson	ac16663bd6	use -Werror=missing-field-initializers, to closer match MyRocks build Summary: myrocks seems to build rocksdb using -Wmissing-field-initializers (and treats warnings as errors). This diff adds that flag to the rocksdb build, and fixes the compilation failures that result. I have not checked for any other differences in the build flags for rocksdb build as part of myrocks. Test Plan: make check Reviewers: sdong, rven Reviewed By: rven Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D52443	2015-12-30 14:56:18 -08:00
Venkatesh Radhakrishnan	63ddb783db	Delete files in given key range Summary: This is an initial diff for providing the ability to delete files which are completely within a given range of keys. Test Plan: DBCompactionTest.DeleteRange Reviewers: IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: yoshinorim, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D52293	2015-12-29 13:22:13 -08:00
Siying Dong	22c0ed8a5f	Disable Visual Studio Warning C4351 Currently Windows build is broken because of Warning C4351. Disable the warning before figuring out the right way to fix it.	2015-12-28 15:06:34 -08:00
sdong	11672df19a	Fix CLANG errors introduced by `7d87f02799` Summary: Fix some CLANG errors introduced in `7d87f02799` Test Plan: Build with both of CLANG and gcc Reviewers: rven, yhchiang, kradhakrishnan, anthony, IslamAbdelRahman, ngbronson Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D52329	2015-12-28 10:00:58 -08:00
Nathan Bronson	7d87f02799	support for concurrent adds to memtable Summary: This diff adds support for concurrent adds to the skiplist memtable implementations. Memory allocation is made thread-safe by the addition of a spinlock, with small per-core buffers to avoid contention. Concurrent memtable writes are made via an additional method and don't impose a performance overhead on the non-concurrent case, so parallelism can be selected on a per-batch basis. Write thread synchronization is an increasing bottleneck for higher levels of concurrency, so this diff adds --enable_write_thread_adaptive_yield (default off). This feature causes threads joining a write batch group to spin for a short time (default 100 usec) using sched_yield, rather than going to sleep on a mutex. If the timing of the yield calls indicates that another thread has actually run during the yield then spinning is avoided. This option improves performance for concurrent situations even without parallel adds, although it has the potential to increase CPU usage (and the heuristic adaptation is not yet mature). Parallel writes are not currently compatible with inplace updates, update callbacks, or delete filtering. Enable it with --allow_concurrent_memtable_write (and --enable_write_thread_adaptive_yield). Parallel memtable writes are performance neutral when there is no actual parallelism, and in my experiments (SSD server-class Linux and varying contention and key sizes for fillrandom) they are always a performance win when there is more than one thread. Statistics are updated earlier in the write path, dropping the number of DB mutex acquisitions from 2 to 1 for almost all cases. This diff was motivated and inspired by Yahoo's cLSM work. It is more conservative than cLSM: RocksDB's write batch group leader role is preserved (along with all of the existing flush and write throttling logic) and concurrent writers are blocked until all memtable insertions have completed and the sequence number has been advanced, to preserve linearizability. My test config is "db_bench -benchmarks=fillrandom -threads=$T -batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000 --block_size=16384 --allow_concurrent_memtable_write" on a two-socket Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive. With 1 thread I get ~440Kops/sec. Peak performance for 1 socket (numactl -N1) is slightly more than 1Mops/sec, at 16 threads. Peak performance across both sockets happens at 30 threads, and is ~900Kops/sec, although with fewer threads there is less performance loss when the system has background work. Test Plan: 1. concurrent stress tests for InlineSkipList and DynamicBloom 2. make clean; make check 3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench 4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench 5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench 6. make clean; OPT=-DROCKSDB_LITE make check 7. verify no perf regressions when disabled Reviewers: igor, sdong Reviewed By: sdong Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba Differential Revision: https://reviews.facebook.net/D50589	2015-12-25 11:03:40 -08:00
Siying Dong	298ba27ae2	Merge pull request #846 from yuslepukhin/enble_c4244_lossofdata Enable MS compiler warning c4244.	2015-12-23 22:59:42 -08:00
Islam AbdelRahman	d005c66faf	Report compaction reason in CompactionListener Summary: Add CompactionReason to CompactionJobInfo This will allow users to understand why compaction started which will help options tuning Test Plan: added new tests make check -j64 Reviewers: yhchiang, anthony, kradhakrishnan, sdong, rven Reviewed By: rven Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D51975	2015-12-22 11:37:19 -08:00
Alex Yang	33e09c0e19	add call to install superversion and schedule work in enableautocompactions Summary: This patch fixes https://github.com/facebook/mysql-5.6/issues/121 There is a recent change in rocksdb to disable auto compactions on startup: https://reviews.facebook.net/D51147. However, there is a small timing window where a column family needs to be compacted and schedules a compaction, but the scheduled compaction fails when it checks the disable_auto_compactions setting. The expectation is once the application is ready, it will call EnableAutoCompactions() to allow new compactions to go through. However, if the Column family is stalled because L0 is full, and no writes can go through, it is possible the column family may never have a new compaction request get scheduled. EnableAutoCompaction() should probably schedule an new flush and compaction event when it resets disable_auto_compaction. Using InstallSuperVersionAndScheduleWork, we call SchedulePendingFlush, SchedulePendingCompaction, as well as MaybeScheduleFlushOrcompaction on all the column families to avoid the situation above. This is still a first pass for feedback. Could also just call SchedePendingFlush and SchedulePendingCompaction directly. Test Plan: Run on Asan build cd _build-5.6-ASan/ && ./mysql-test/mtr --mem --big --testcase-timeout=36000 --suite-timeout=12000 --parallel=16 --suite=rocksdb,rocksdb_rpl,rocksdb_sys_vars --mysqld=--default-storage-engine=rocksdb --mysqld=--skip-innodb --mysqld=--default-tmp-storage-engine=MyISAM --mysqld=--rocksdb rocksdb_rpl.rpl_rocksdb_stress_crash --repeat=1000 Ensure that it no longer hangs during the test. Reviewers: hermanlee4, yhchiang, anthony Reviewed By: anthony Subscribers: leveldb, yhchiang, dhruba Differential Revision: https://reviews.facebook.net/D51747	2015-12-21 10:06:49 -08:00
Venkatesh Radhakrishnan	7b12ae97d4	Add signalall after removing item from manual_compaction deque Summary: When there are waiting manual compactions, we need to signal them after removing the current manual compaction from the deque. Test Plan: ColumnFamilytTest.SameCFManualManualCommaction Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D52119	2015-12-17 16:59:00 -08:00
Islam AbdelRahman	aececc209e	Introduce ReadOptions::pin_data (support zero copy for keys) Summary: This patch update the Iterator API to introduce new functions that allow users to keep the Slices returned by key() valid as long as the Iterator is not deleted ReadOptions::pin_data : If true keep loaded blocks in memory as long as the iterator is not deleted Iterator::IsKeyPinned() : If true, this mean that the Slice returned by key() is valid as long as the iterator is not deleted Also add a new option BlockBasedTableOptions::use_delta_encoding to allow users to disable delta_encoding if needed. Benchmark results (using https://phabricator.fb.com/P20083553) ``` // $ du -h /home/tec/local/normal.4K.Snappy/db10077 // 6.1G /home/tec/local/normal.4K.Snappy/db10077 // $ du -h /home/tec/local/zero.8K.LZ4/db10077 // 6.4G /home/tec/local/zero.8K.LZ4/db10077 // Benchmarks for shard db10077 // _build/opt/rocks/benchmark/rocks_copy_benchmark \ // --normal_db_path="/home/tec/local/normal.4K.Snappy/db10077" \ // --zero_db_path="/home/tec/local/zero.8K.LZ4/db10077" // First run // ============================================================================ // rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s // ============================================================================ // BM_StringCopy 1.73s 576.97m // BM_StringPiece 103.74% 1.67s 598.55m // ============================================================================ // Match rate : 1000000 / 1000000 // Second run // ============================================================================ // rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s // ============================================================================ // BM_StringCopy 611.99ms 1.63 // BM_StringPiece 203.76% 300.35ms 3.33 // ============================================================================ // Match rate : 1000000 / 1000000 ``` Test Plan: Unit tests Reviewers: sdong, igor, anthony, yhchiang, rven Reviewed By: rven Subscribers: dhruba, lovro, adsharma Differential Revision: https://reviews.facebook.net/D48999	2015-12-16 12:08:30 -08:00
Gunnar Kudrjavets	97265f5f14	Fix minor bugs in delete operator, snprintf, and size_t usage Summary: List of changes: 1) Fix the snprintf() usage in cases where wrong variable was used to determine the output buffer size. 2) Remove unnecessary checks before calling delete operator. 3) Increase code correctness by using size_t type when getting vector's size. 4) Unify the coding style by removing namespace::std usage at the top of the file to confirm to the majority usage. 5) Fix various lint errors pointed out by 'arc lint'. Test Plan: Code review and build: git diff make clean make -j 32 commit-prereq arc lint Reviewers: kradhakrishnan, sdong, rven, anthony, yhchiang, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D51849	2015-12-15 15:26:20 -08:00
Venkatesh Radhakrishnan	030215bf01	Running manual compactions in parallel with other automatic or manual compactions in restricted cases Summary: This diff provides a framework for doing manual compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to BackgroundCompactions, so that RunManualCompactions can be reentrant. Parallelism is controlled by the two routines ConflictingManualCompaction to allow/disallow new parallel/manual compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs. I will be adding more tests later. Test Plan: Rocksdb regression + new tests + valgrind Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong Reviewed By: sdong Subscribers: yoshinorim, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D47973	2015-12-14 11:20:34 -08:00
Dmitri Smirnov	236fe21c92	Enable MS compiler warning c4244. Mostly due to the fact that there are differences in sizes of int,long on 64 bit systems vs GNU.	2015-12-11 16:47:34 -08:00
agiardullo	3bfd3d39a3	Use SST files for Transaction conflict detection Summary: Currently, transactions can fail even if there is no actual write conflict. This is due to relying on only the memtables to check for write-conflicts. Users have to tune memtable settings to try to avoid this, but it's hard to figure out exactly how to tune these settings. With this diff, TransactionDB will use both memtables and SST files to determine if there are any write conflicts. This relies on the fact that BlockBasedTable stores sequence numbers for all writes that happen after any open snapshot. Also, D50295 is needed to prevent SingleDelete from disappearing writes (the TODOs in this test code will be fixed once the other diff is approved and merged). Note that Optimistic transactions will still rely on tuning memtable settings as we do not want to read from SST while on the write thread. Also, memtable settings can still be used to reduce how often TransactionDB needs to read SST files. Test Plan: unit tests, db bench Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb, yoshinorim Differential Revision: https://reviews.facebook.net/D50475	2015-12-11 12:34:11 -08:00
agiardullo	9e44629061	Change SingleDelete to support conflict checking Summary: For Transactions, we want to start using the SST files to do write conflict checking. To do this, we need to make sure that compaction never removes all writes if an earlier snapshot exists. So I had to change the way we process SingleDeletes to sometimes leave a SingleDelete behind when we encounter a Put followed by a SingleDelete. See the comments in this diff for a more detailed explanation. Test Plan: added more unit tests Reviewers: rven, igor, kradhakrishnan, IslamAbdelRahman, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D50295	2015-12-10 11:35:38 -08:00
Yueh-Hsuan Chiang	774b80e99e	Resubmit the fix for a race condition in persisting options Summary: This patch fix a race condition in persisting options which will cause a crash when: * Thread A obtain cf options and start to persist options based on that cf options. * Thread B kicks in and finish DropColumnFamily and delete cf_handle. * Thread A wakes up and tries to finish the persisting options and crashes. Test Plan: Add a test in column_family_test that can reproduce the crash Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D51717	2015-12-08 17:01:02 -08:00
agiardullo	e5c5f23814	Support marking snapshots for write-conflict checking - Take 2 Summary: D51183 was reverted due to breaking the LITE build. This diff is the same as D51183 but with a fix for the LITE BUILD(D51693) Test Plan: run all unit tests Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D51711	2015-12-08 16:47:31 -08:00
sdong	1d63c3d610	Revert "Support marking snapshots for write-conflict checking" This reverts commit `ec704aafdc` for it broke RocksDB LITE build.	2015-12-08 09:27:17 -08:00
agiardullo	ec704aafdc	Support marking snapshots for write-conflict checking Summary: D50475 enables using SST files for transaction write-conflict checking. In order for this to work, we need to make sure not to compact out SingleDeletes when there is an earlier transaction snapshot(D50295). If there is a long-held snapshot, this could reduce the benefit of the SingleDelete optimization. This diff allows Transactions to mark snapshots as being used for write-conflict checking. Then, during compaction, we will be able to optimize SingleDeletes better in the future. This diff adds a flag to SnapshotImpl which is used by Transactions. This diff also passes the earliest write-conflict snapshot's sequence number to CompactionIterator. This diff does not actually change Compaction (after this diff is pushed, D50295 will be able to use this information). Test Plan: no behavior change, ran existing tests Reviewers: rven, kradhakrishnan, yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D51183	2015-12-07 19:40:51 -08:00
sdong	f307036bde	Revert "Fix a race condition in persisting options" This reverts commit `2fa3ed5180`. It breaks RocksDB lite build	2015-12-07 17:09:12 -08:00
Yueh-Hsuan Chiang	2fa3ed5180	Fix a race condition in persisting options Summary: This patch fix a race condition in persisting options which will cause a crash when: * Thread A obtain cf options and start to persist options based on that cf options. * Thread B kicks in and finish DropColumnFamily and delete cf_handle. * Thread A wakes up and tries to finish the persisting options and crashes. Test Plan: Add a test in column_family_test that can reproduce the crash Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D51609	2015-12-07 15:25:12 -08:00
Alex Yang	e8180f9901	added public api to schedule flush/compaction, code to prevent race with db::open Summary: Fixes T8781168. Added a new function EnableAutoCompactions in db.h to be publicly avialable. This allows compaction to be re-enabled after disabling it via SetOptions Refactored code to set the dbptr earlier on in TransactionDB::Open and DB::Open Temporarily disable auto_compaction in TransactionDB::Open until dbptr is set to prevent race condition. Test Plan: Ran make all check verified fix on myrocks side: was able to reproduce the seg fault with ../tools/mysqltest.sh --mem --force rocksdb.drop_table method was to manually sleep the thread after DB::Open but before TransactionDB ptr was assigned in transaction_db_impl.cc: DB::Open(db_options, dbname, column_families_copy, handles, &db); clock_t goal = (60000 * 10) + clock(); while (goal > clock()); ...dbptr(aka rdb) gets assigned below verified my changes fixed the issue. Also added unit test 'ToggleAutoCompaction' in transaction_test.cc Reviewers: hermanlee4, anthony Reviewed By: anthony Subscribers: alex, dhruba Differential Revision: https://reviews.facebook.net/D51147	2015-12-03 22:59:44 -08:00
sdong	db320b1b82	DB to only flush the column family with the largest memtable while option.db_write_buffer_size is hit Summary: When option.db_write_buffer_size is hit, we currently flush all column families. Move to flush the column family with the largest active memt table instead. In this way, we can avoid too many small files in some cases. Test Plan: Modify test DBTest.SharedWriteBuffer to work with the updated behavior Reviewers: kradhakrishnan, yhchiang, rven, anthony, IslamAbdelRahman, igor Reviewed By: igor Subscribers: march, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D51291	2015-11-30 13:36:57 -08:00
Reid Horuff	3381e2c3e7	Handle multiple calls to DBImpl::PauseBackgroundWork() and DBImpl::ContinueBackgroundWork() Summary: Handle multiple calls to DBImpl::PauseBackgroundWork() and DBImpl::ContinueBackgroundWork() Test Plan: rocksdb.information_schema handles this case. Reviewers: igor Reviewed By: igor Subscribers: hermanlee4, jkedgar, dhruba Differential Revision: https://reviews.facebook.net/D50781	2015-11-16 14:20:18 -08:00
Venkatesh Radhakrishnan	2ae4d7d708	Make sure that CompactFiles does not run two parallel Level 0 compactions Summary: Since level 0 files can overlap, two level 0 compactions cannot run in parallel. Compact files needs to check this before running a compaction. Test Plan: CompactFilesTest.L0ConflictsFiles Reviewers: igor, IslamAbdelRahman, anthony, sdong, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D50079	2015-11-13 12:01:00 -08:00
Nathan Bronson	6ce42dd075	Don't merge WriteBatch-es if WAL is disabled Summary: There's no need for WriteImpl to flatten the write batch group into a single WriteBatch if the WAL is disabled. This diff moves the flattening into the WAL step, and skips flattening entirely if it isn't needed. It's good for about 5% speedup on a multi-threaded workload with no WAL. This diff also adds clarifying comments about the chance for partial failure of WriteBatchInternal::InsertInto, and always sets bg_error_ if the memtable state diverges from the logged state or if a WriteBatch succeeds only partially. Benchmark for speedup: db_bench -benchmarks=fillrandom -threads=16 -batch_size=1 -memtablerep=skip_list -value_size=0 --num=200000 -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000 Test Plan: asserts + make check Reviewers: sdong, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D50583	2015-11-12 10:50:38 -08:00
Yueh-Hsuan Chiang	e114f0abb8	Enable RocksDB to persist Options file. Summary: This patch allows rocksdb to persist options into a file on DB::Open, SetOptions, and Create / Drop ColumnFamily. Options files are created under the same directory as the rocksdb instance. In addition, this patch also adds a fail_if_missing_options_file in DBOptions that makes any function call return non-ok status when it is not able to persist options properly. // If true, then DB::Open / CreateColumnFamily / DropColumnFamily // / SetOptions will fail if options file is not detected or properly // persisted. // // DEFAULT: false bool fail_if_missing_options_file; Options file names are formatted as OPTIONS-<number>, and RocksDB will always keep the latest two options files. Test Plan: Add options_file_test. options_test column_family_test Reviewers: igor, IslamAbdelRahman, sdong, anthony Reviewed By: anthony Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48285	2015-11-10 22:58:01 -08:00
Venkatesh Radhakrishnan	9d50afc3b9	Prefix-based iterating only shows keys in prefix Summary: MyRocks testing found an issue that while iterating over keys that are outside the prefix, sometimes wrong results were seen for keys outside the prefix. We now tighten the range of keys seen with a new read option called prefix_seen_at_start. This remembers the starting prefix and then compares it on a Next for equality of prefix. If they are from a different prefix, it sets valid to false. Test Plan: PrefixTest.PrefixValid Reviewers: IslamAbdelRahman, sdong, yhchiang, anthony Reviewed By: anthony Subscribers: spetrunia, hermanlee4, yoshinorim, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D50211	2015-11-05 13:24:05 -08:00
Yueh-Hsuan Chiang	3ecbab0040	Add GetAggregatedIntProperty(): returns the aggregated value from all CFs Summary: This patch adds GetAggregatedIntProperty() that returns the aggregated value from all CFs Test Plan: Added a test in db_test Reviewers: igor, sdong, anthony, IslamAbdelRahman, rven Reviewed By: rven Subscribers: rven, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D49497	2015-11-03 15:54:18 -08:00
Islam AbdelRahman	ff4499e297	Update DB::AddFile() to have less restrictions Summary: Update DB::AddFile() restrictions to be - Key range in loaded table file don't overlap with existing keys or tombstones in DB. - No other writes happen during AddFile call. The updated AddFile() will verify that the file key range don't overlap with any keys or tombstones in the DB, and then add the file to L0 Test Plan: unit tests Reviewers: igor, rven, anthony, kradhakrishnan, sdong Reviewed By: sdong Subscribers: adsharma, ameyag, dhruba Differential Revision: https://reviews.facebook.net/D49233	2015-10-30 16:38:10 -07:00
Islam AbdelRahman	2872e0c8c2	Clean and expose CreateLoggerFromOptions Summary: CreateLoggerFromOptions have some parameters like db_log_dir and env, these parameters are redundant since they already exist in DBOptions this patch remove the redundant parameters and expose CreateLoggerFromOptions to users Test Plan: make check Reviewers: igor, anthony, yhchiang, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba, hermanlee4 Differential Revision: https://reviews.facebook.net/D49713	2015-10-29 18:07:37 -07:00
sdong	296c3a1f94	"make format" in some recent commits Summary: Run "make format" for some recent commits. Test Plan: Build and run tests Reviewers: IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D49707	2015-10-29 17:11:14 -07:00
Herman Lee	0d720dfc17	Use the correct variable when fetching table properties. Summary: An uninitialized parameter was being passed into the call to fetch the table properties during the compaction notification callbacks. Test Plan: Build it with myrocks and verify unit test passed. Run unit tests. Reviewers: rven, yhchiang, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D49635	2015-10-28 16:28:11 -07:00
Praveen Rao	4ce117c4d5	Merge branch 'master' into wal_filter	2015-10-26 19:03:34 -07:00
Praveen Rao	32cdec634e	Fail recovery if filter provides more records than original and corresponding unit-test, fix naming conventions	2015-10-26 18:11:18 -07:00
Siying Dong	138876a62c	Merge pull request #746 from ceph/wip-recycle Add Options.recycle_log_file_num for Recycling WAL Files	2015-10-26 15:01:28 -07:00
Praveen Rao	2938c5c137	merge upstream changes	2015-10-19 15:21:33 -07:00
Praveen Rao	0c59691dde	Handle multiple batches in single log record - allow app to return a new batch + allow app to return corrupted record status	2015-10-19 13:27:40 -07:00
Alexey Maykov	f18acd8875	Fixed the clang compilation failure Summary: As above. Test Plan: USE_CLANG=1 make check -j Reviewers: igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48981	2015-10-19 10:38:50 -07:00
Sage Weil	9c33f64d19	log_reader: pass in WALRecoveryMode instead of bool report_eof_inconsistency Soon our behavior will depend on more than just whther we are in kAbsoluteConsistency or not. Signed-off-by: Sage Weil <sage@redhat.com>	2015-10-18 21:24:32 -04:00
Sage Weil	3ac13c99d1	log_reader: pass log_number and optional info_log to ctor We will need the log number to validate the recycle-style CRCs. The log is helpful for debugging, but optional, as not all callers have it. Signed-off-by: Sage Weil <sage@redhat.com>	2015-10-18 21:24:32 -04:00
Sage Weil	5830c699f2	log_writer: pass log number and whether recycling is enabled to ctor When we recycle log files, we need to mix the log number into the CRC for each record. Note that for logs that don't get recycled (like the manifest), we always pass a log_number of 0 and false. Signed-off-by: Sage Weil <sage@redhat.com>	2015-10-18 21:24:32 -04:00
Sage Weil	666376150c	db_impl: recycle log files If log recycling is enabled, put old WAL files on a recycle queue instead of deleting them. When we need a new log file, take a recycled file off the list if one is available. Signed-off-by: Sage Weil <sage@redhat.com>	2015-10-18 21:24:32 -04:00
Sage Weil	d666225a0a	db_impl: disable recycle_log_files if WAL archive is enabled We can't recycle the files if they are being archived. Signed-off-by: Sage Weil <sage@redhat.com>	2015-10-18 21:21:24 -04:00
Alexey Maykov	e1a09a7703	Implementation for GetPropertiesOfTablesInRange Summary: In MyRocks, it is sometimes important to get propeties only for the subset of the database. This diff implements the API in RocksDB. Test Plan: ran the GetPropertiesOfTablesInRange Reviewers: rven, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48651	2015-10-17 13:34:43 -07:00
Yueh-Hsuan Chiang	ad471453e8	Allow GetProperty to report the number of currently running flushes / compactions. Summary: Add rocksdb.num-running-compactions and rocksdb.num-running-flushes to GetIntProperty() that reports the number of currently running compactions / flushes. Test Plan: augmented existing tests in db_test Reviewers: igor, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D48693	2015-10-17 00:16:36 -07:00
sdong	277dea78f0	Add more kill points Summary: Add kill points in: 1. after creating a file 2. before writing a manifest record 3. before syncing manifest 4. before creating a new current file 5. after creating a new current file Test Plan: Run all current tests. Reviewers: yhchiang, igor, anthony, IslamAbdelRahman, rven, kradhakrishnan Reviewed By: kradhakrishnan Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D48855	2015-10-16 14:35:12 -07:00
Venkatesh Radhakrishnan	a98fbacfa0	Moving memtable related files from util to a new directory memtable Summary: We are cleaning up dependencies. This diff takes a first step at moving memtable files to their own directory called memtable. In future diffs, we will move other memtable files from db to memtable. Test Plan: make check Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D48915	2015-10-16 14:10:33 -07:00
Islam AbdelRahman	f55d3009c0	Make db_test_util compile under ROCKSDB_LITE Summary: db_test_util is used in multiple test files but it dont compile under ROCKSDB_LITE Test Plan: make check make static_lib OPT=-DROCKSDB_LITE make db_wal_test Reviewers: igor, yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48579	2015-10-13 17:33:23 -07:00
sdong	35ad531be3	Seperate InternalIterator from Iterator Summary: Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type. This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's. At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it. Test Plan: Run all existing tests. Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven Reviewed By: rven Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D48549	2015-10-13 15:32:13 -07:00
Praveen Rao	cc4d13e0a8	Put wal_filter under #ifndef ROCKSDB_LITE	2015-10-13 11:10:14 -07:00
Praveen Rao	eb24178553	merge from master	2015-10-12 17:24:21 -07:00
Islam AbdelRahman	c64ae05b1c	Move TEST_NewInternalIterator to NewInternalIterator Summary: Long time ago we add InternalDumpCommand to ldb_tool https://reviews.facebook.net/D11517 This command is using TEST_NewInternalIterator although it's not a test. This patch move TEST_NewInternalIterator outside of db_impl_debug.cc Test Plan: make check make static_lib Reviewers: yhchiang, igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48561	2015-10-12 17:22:37 -07:00
Praveen Rao	59a0c219bb	Adding log filter to inspect and filter log records on recovery	2015-10-12 17:03:03 -07:00
Alexey Maykov	3d07b815f6	Passing table properties to compaction callback Summary: It would be nice to have and access to table properties in compaction callbacks. In MyRocks project, it will make possible to update optimizer statistics online. Test Plan: ran the unit test. Ran myrocks with the new way of collecting stats. Reviewers: igor, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48267	2015-10-09 18:10:55 -07:00
sdong	776bd8d5eb	Pass column family ID to table property collector Summary: Pass column family ID through TablePropertiesCollectorFactory::CreateTablePropertiesCollector() so that users can identify which column family this file is for and handle it differently. Test Plan: Add unit test scenarios in tests related to table properties collectors to verify the information passed in is correct. Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: yoshinorim, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D48411	2015-10-09 14:36:51 -07:00
dyniusz	0267502655	Support for LevelDB SST with .ldb suffix Summary: Handle SST files with both ".sst" and ".ldb" suffix. This enables user to migrate from leveldb to rocksdb. Test Plan: Added unit test with DB operating on SSTs with names schema. See db/dc_test.cc:SSTsWithLdbSuffixHandling for details Reviewers: yhchiang, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D48003	2015-10-06 17:46:22 -07:00
Igor Canadi	115427ef63	Add APIs PauseBackgroundWork() and ContinueBackgroundWork() Summary: To support a new MongoDB capability, we need to make sure that we don't do any IO for a short period of time. For background, see: * https://jira.mongodb.org/browse/SERVER-20704 * https://jira.mongodb.org/browse/SERVER-18899 To implement that, I add a new API calls PauseBackgroundWork() and ContinueBackgroundWork() which reuse the capability we already have in place for RefitLevel() function. Test Plan: Added a new test in db_test. Made sure that test fails when PauseBackgroundWork() is commented out. Reviewers: IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D47901	2015-10-02 13:17:34 -07:00
Islam AbdelRahman	f03b5c987b	Add experimental DB::AddFile() to plug sst files into empty DB Summary: This is an initial version of bulk load feature This diff allow us to create sst files, and then bulk load them later, right now the restrictions for loading an sst file are (1) Memtables are empty (2) Added sst files have sequence number = 0, and existing values in database have sequence number = 0 (3) Added sst files values are not overlapping Test Plan: unit testing Reviewers: igor, ott, sdong Reviewed By: sdong Subscribers: leveldb, ott, dhruba Differential Revision: https://reviews.facebook.net/D39081	2015-09-23 12:42:43 -07:00
sdong	d0c31641d2	Internal stats WAL file synced to match meaning of the stats of the same name Summary: https://reviews.facebook.net/D23343 changed WAL sync bytes to extra fsync. This change does the same for internal stats. Test Plan: Run all existing unit tests and verify results in db_bench. Reviewers: anthony, rven, igor, MarkCallaghan, kradhakrishnan, yhchiang Reviewed By: yhchiang Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D47349	2015-09-22 14:23:11 -07:00
Andres Noetzli	014fd55adc	Support for SingleDelete() Summary: This patch fixes #7460559. It introduces SingleDelete as a new database operation. This operation can be used to delete keys that were never overwritten (no put following another put of the same key). If an overwritten key is single deleted the behavior is undefined. Single deletion of a non-existent key has no effect but multiple consecutive single deletions are not allowed (see limitations). In contrast to the conventional Delete() operation, the deletion entry is removed along with the value when the two are lined up in a compaction. Note: The semantics are similar to @igor's prototype that allowed to have this behavior on the granularity of a column family ( https://reviews.facebook.net/D42093 ). This new patch, however, is more aggressive when it comes to removing tombstones: It removes the SingleDelete together with the value whenever there is no snapshot between them while the older patch only did this when the sequence number of the deletion was older than the earliest snapshot. Most of the complex additions are in the Compaction Iterator, all other changes should be relatively straightforward. The patch also includes basic support for single deletions in db_stress and db_bench. Limitations: - Not compatible with cuckoo hash tables - Single deletions cannot be used in combination with merges and normal deletions on the same key (other keys are not affected by this) - Consecutive single deletions are currently not allowed (and older version of this patch supported this so it could be resurrected if needed) Test Plan: make all check Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor Reviewed By: igor Subscribers: maykov, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D43179	2015-09-17 11:42:56 -07:00
Venkatesh Radhakrishnan	51e1c11254	Do not flag error if file to be deleted does not exist Summary: Some users have observed errors in the log file when the log file or sst file is already deleted. Test Plan: Make sure that the errors do not appear for already deleted files. Reviewers: sdong Reviewed By: sdong Subscribers: anthony, kradhakrishnan, yhchiang, rven, igor, IslamAbdelRahman, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D47115	2015-09-17 10:21:34 -07:00
sdong	9aca7cd6d8	DB::Open() to flush info log after printing DB pointer Summary: Now DB::Open() flushes info log before printing DB pointer, so it may not show up if no activity after DB open. Move log flushing from after printing options to printing DB pointer. Test Plan: make commit-prereq Reviewers: igor, IslamAbdelRahman, yhchiang, kradhakrishnan, anthony, rven Reviewed By: rven Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D47121	2015-09-16 16:33:39 -07:00
Yueh-Hsuan Chiang	f21c7415a7	Change the log level of DB start-up log from Warn to Header. Summary: Change the log level of DB start-up log from Warn to Header. Test Plan: db_bench and observe the LOG header Reviewers: igor, anthony, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D47067	2015-09-16 11:31:45 -07:00
Alexey Maykov	3ebf11ed16	Adding the increment for a counter for a number of WAL syncs Summary: This will unblock the corresponding change in MyRocks Test Plan: ran rocksdb.write_sync test Reviewers: sdong, kolmike Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D46911	2015-09-16 11:00:49 -07:00
sdong	f3170b6f6c	DBImpl::FindObsoleteFiles() shouldn't release mutex between getting min_pending_output and scanning files Summary: Releasing mutex between getting min_pending_output and scanning files may cause min_pending_output to be max but some non-final files are found in file scanning, ending up with deleting wrong files. As a recent regression, mutex can be released while waiting for log sync. We move it to after file scanning. Test Plan: Run all existing tests. Don't think it is easy to write a unit test. Maybe we should find a way to assert lock not released so that we can have some test verification for similar cases. Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, kolmike, rven Reviewed By: rven Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D46899	2015-09-14 23:39:30 -07:00
krad	1126644082	Relaxing consistency detection to include errors while inserting to memtable as WAL recovery error. Summary: The current code, considers data to be consistent if the record checksum passes. We do have customer issues where the record checksum passed but the data was incomprehensible. There is no way to get out of this error case since all WAL recovery model will consider this error as unrelated to WAL. Relaxing the definition and including errors while inserting to memtable as WAL errors and handing them as per the recovery level. Test Plan: Used customer dump to verify the fix for different level. The db opens for kSkipAnyCorruptedRecords and kPointInTimeRecovery, but fails for kAbsoluteConsistency and kTolerateCorruptedTailRecords. Reviewers: sdon igor CC: leveldb@ Task ID: #7918721 Blame Rev:	2015-09-10 12:56:17 -07:00
Igor Canadi	ac9bcb55ce	Set max_open_files based on ulimit Summary: We should never set max_open_files to be bigger than the system's ulimit. Otherwise we will get "Too many open files" errors. See an example in this Travis run: https://travis-ci.org/facebook/rocksdb/jobs/79591566 Test Plan: make check I will also verify that max_max_open_files is reasonable. Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D46551	2015-09-10 10:49:28 -07:00
Andres Noetzli	3c9cef1eed	Unified maps with Comparator for sorting, other cleanup Summary: This diff is a collection of cleanups that were initially part of D43179. Additionally it adds a unified way of defining key-value maps that use a Comparator for sorting (this was previously implemented in four different places). Test Plan: make clean check all Reviewers: rven, anthony, yhchiang, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D45993	2015-09-02 13:58:22 -07:00
Andres Noetzli	effd9dd1e1	Fix deadlock in WAL sync Summary: MarkLogsSynced() was doing `logs_.erase(it++);`. The standard is saying: ``` all iterators and references are invalidated, unless the erased members are at an end (front or back) of the deque (in which case only iterators and references to the erased members are invalidated) ``` Because `it` is an iterator to the first element of the container, it is invalidated, only one iteration is executed and `log.getting_synced = false;` is not being done, so `while (logs_.front().getting_synced)` in `WriteImpl()` is not terminating. Test Plan: make db_bench && ./db_bench --benchmarks=fillsync Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, yhchiang, sdong, tnovak Reviewed By: tnovak Subscribers: kolmike, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D45807	2015-08-28 18:06:32 -07:00
Igor Canadi	5f4166c90e	ReadaheadRandomAccessFile -- userspace readahead Summary: ReadaheadRandomAccessFile acts as a transparent layer on top of RandomAccessFile. When a Read() request is issued, it issues a much bigger request to the OS and caches the result. When a new request comes in and we already have the data cached, it doesn't have to issue any requests to the OS. We add ReadaheadRandomAccessFile layer only when file is read during compactions. D45105 was incorrectly closed by Phabricator because I committed it to a separate branch (not master), so I'm resubmitting the diff. Test Plan: make check Reviewers: MarkCallaghan, sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D45123	2015-08-26 15:25:59 -07:00
Andres Notzli	09d982f9e0	Fix compact_files_example Summary: See task #7983654. The example was triggering an assert in compaction job because the compaction was not marked as manual. With this patch, CompactionPicker::FormCompaction() marks compactions as manual. This patch also fixes a couple of typos, adds optimistic_transaction_example to .gitignore and librocksdb as a dependency for examples. Adding librocksdb as a dependency makes sure that the examples are built with the latest changes in librocksdb. Test Plan: make clean && cd examples && make all && ./compact_files_example Reviewers: rven, sdong, anthony, igor, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D45117	2015-08-25 12:29:44 -07:00
Andres Noetzli	2050832974	Fixing race condition in DBTest.DynamicMemtableOptions Summary: This patch fixes a race condition in DBTEst.DynamicMemtableOptions. In rare cases, it was possible that the main thread would fill up both memtables before the flush job acquired its work. Then, the flush job was flushing both memtables together, producing only one L0 file while the test expected two. Now, the test waits for flushes to finish earlier, to make sure that the memtables are flushed in separate flush jobs. Test Plan: Insert "usleep(10000);" after "IOSTATS_SET_THREAD_POOL_ID(Env::Priority::HIGH);" in BGWorkFlush() to make the issue more likely. Then test with: make db_test && time while ./db_test --gtest_filter=*DynamicMemtableOptions; do true; done Reviewers: rven, sdong, yhchiang, anthony, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D45429	2015-08-24 17:04:18 -07:00
Igor Canadi	4ab26c5ad1	Smarter purging during flush Summary: Currently, we only purge duplicate keys and deletions during flush if `earliest_seqno_in_memtable <= newest_snapshot`. This means that the newest snapshot happened before we first created the memtable. This is almost never true for MyRocks and MongoRocks. This patch makes purging during flush able to understand snapshots. The main logic is copied from compaction_job.cc, although the logic over there is much more complicated and extensive. However, we should try to merge the common functionality at some point. I need this patch to implement no_overwrite_i_promise functionality for flush. We'll also need this to support SingleDelete() during Flush(). @yoshinorim requested the feature. Test Plan: make check I had to adjust some unit tests to understand this new behavior Reviewers: yhchiang, yoshinorim, anthony, sdong, noetzli Reviewed By: noetzli Subscribers: yoshinorim, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42087	2015-08-24 11:11:12 -07:00
Islam AbdelRahman	3fd70b05b8	Rate limit deletes issued by DestroyDB Summary: Update DestroyDB so that all SST files in the first path id go through DeleteScheduler instead of being deleted immediately Test Plan: added a unittest Reviewers: igor, yhchiang, anthony, kradhakrishnan, rven, sdong Reviewed By: sdong Subscribers: jeanxu2012, dhruba Differential Revision: https://reviews.facebook.net/D44955	2015-08-19 15:02:17 -07:00
Ari Ekmekji	f0da6977a3	[Parallel L0-L1 Compaction Prep]: Giving Subcompactions Their Own State Summary: In prepration for running multiple threads at the same time during a compaction job, this patch assigns each subcompaction its own state (instead of sharing the one global CompactionState). Each subcompaction then uses this state to update its statistics, keep track of its snapshots, etc. during the course of execution. Then at the end of all the executions the statistics are aggregated across the subcompactions so that the final result is the same as if only one larger compaction had run. Test Plan: ./db_test ./db_compaction_test ./compaction_job_test Reviewers: sdong, anthony, igor, noetzli, yhchiang Reviewed By: yhchiang Subscribers: MarkCallaghan, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D43239	2015-08-18 11:06:23 -07:00
sdong	72613657f0	Measure file read latency histogram per level Summary: In internal stats, remember read latency histogram, if statistics is enabled. It can be retrieved from DB::GetProperty() with "rocksdb.dbstats" property, if it is enabled. Test Plan: Manually run db_bench and prints out "rocksdb.dbstats" by hand and make sure it prints out as expected Reviewers: igor, IslamAbdelRahman, rven, kradhakrishnan, anthony, yhchiang Reviewed By: yhchiang Subscribers: MarkCallaghan, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D44193	2015-08-14 17:32:42 -07:00
Nathan Bronson	b7198c3afe	reduce db mutex contention for write batch groups Summary: This diff allows a Writer to join the next write batch group without acquiring any locks. Waiting is performed via a per-Writer mutex, so all of the non-leader writers never need to acquire the db mutex. It is now possible to join a write batch group after the leader has been chosen but before the batch has been constructed. This diff doesn't increase parallelism, but reduces synchronization overheads. For some CPU-bound workloads (no WAL, RAM-sized working set) this can substantially reduce contention on the db mutex in a multi-threaded environment. With T=8 N=500000 in a CPU-bound scenario (see the test plan) this is good for a 33% perf win. Not all scenarios see such a win, but none show a loss. This code is slightly faster even for the single-threaded case (about 2% for the CPU-bound scenario below). Test Plan: 1. unit tests 2. COMPILE_WITH_TSAN=1 make check 3. stress high-contention scenarios with db_bench -benchmarks=fillrandom -threads=$T -batch_size=1 -memtablerep=skip_list -value_size=0 --num=$N -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000 Reviewers: sdong, igor, rven, ljin, yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D43887	2015-08-14 10:55:43 -07:00
sdong	603b6da8b8	Add options.compaction_measure_io_stats to print write I/O stats in compactions Summary: Add options.compaction_measure_io_stats to print out / pass to listener accumulated time spent on write calls. Example outputs in info logs: 2015/08/12-16:27:59.463944 7fd428bff700 (Original Log Time 2015/08/12-16:27:59.463922) EVENT_LOG_v1 {"time_micros": 1439422079463897, "job": 6, "event": "compaction_finished", "output_level": 1, "num_output_files": 4, "total_output_size": 6900525, "num_input_records": 111483, "num_output_records": 106877, "file_write_nanos": 15663206, "file_range_sync_nanos": 649588, "file_fsync_nanos": 349614797, "file_prepare_write_nanos": 1505812, "lsm_state": [2, 4, 0, 0, 0, 0, 0]} Add two more counters in iostats_context. Also add a parameter of db_bench. Test Plan: Add a unit test. Also manually verify LOG outputs in db_bench Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D44115	2015-08-13 16:52:26 -07:00
agiardullo	0db807ec28	Transaction error statuses Summary: Based on feedback from spetrunia, we should better differentiate error statuses for transaction failures. https://github.com/MySQLOnRocksDB/mysql-5.6/issues/86#issuecomment-124605954 Test Plan: unit tests Reviewers: rven, kradhakrishnan, spetrunia, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D43323	2015-08-11 17:52:56 -07:00
agiardullo	c2f2cb0214	Pessimistic Transactions Summary: Initial implementation of Pessimistic Transactions. This diff contains the api changes discussed in D38913. This diff is pretty large, so let me know if people would prefer to meet up to discuss it. MyRocks folks: please take a look at the API in include/rocksdb/utilities/transaction[_db].h and let me know if you have any issues. Also, you'll notice a couple of TODOs in the implementation of RollbackToSavePoint(). After chatting with Siying, I'm going to send out a separate diff for an alternate implementation of this feature that implements the rollback inside of WriteBatch/WriteBatchWithIndex. We can then decide which route is preferable. Next, I'm planning on doing some perf testing and then integrating this diff into MongoRocks for further testing. Test Plan: Unit tests, db_bench parallel testing. Reviewers: igor, rven, sdong, yhchiang, yoshinorim Reviewed By: sdong Subscribers: hermanlee4, maykov, spetrunia, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D40869	2015-08-11 17:52:23 -07:00
sdong	6a4aaadcd7	Avoid type unique_ptr in LogWriterNumber::writer for Windows build break Summary: Visual Studio complains about deque<LogWriterNumber> because LogWriterNumber is non-copyable for its unique_ptr member writer. Move away from it, and do explit free. It is less safe but I can't think of a better way to unblock it. Test Plan: valgrind check test Reviewers: anthony, IslamAbdelRahman, kolmike, rven, yhchiang Reviewed By: yhchiang Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D43647	2015-08-06 10:52:41 -07:00
Mike Kolupaev	e06cf1a098	[wal changes 3/3] method in DB to sync WAL without blocking writers Summary: Subj. We really need this feature. Previous diff D40899 has most of the changes to make this possible, this diff just adds the method. Test Plan: `make check`, the new test fails without this diff; ran with ASAN, TSAN and valgrind. Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, tnovak, yhchiang, sdong Reviewed By: sdong Subscribers: MarkCallaghan, maykov, hermanlee4, yoshinorim, tnovak, dhruba Differential Revision: https://reviews.facebook.net/D40905	2015-08-05 06:06:39 -07:00
Islam AbdelRahman	c45a57b41e	Support delete rate limiting Summary: Introduce DeleteScheduler that allow enforcing a rate limit on file deletion Instead of deleting files immediately, files are moved to trash directory and deleted in a background thread that apply sleep penalty between deletes if needed. I have updated PurgeObsoleteFiles and PurgeObsoleteWALFiles to use the delete_scheduler instead of env_->DeleteFile Test Plan: added delete_scheduler_test existing unit tests Reviewers: kradhakrishnan, anthony, rven, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D43221	2015-08-04 20:45:27 -07:00
Poornima Chozhiyath Raman	1bdfcef7bf	Fix when output level is 0 of universal compaction with trivial move Summary: Fix for universal compaction with trivial move, when the ouput level is 0. The tests where failing. Fixed by allowing normal compaction when output level is 0. Test Plan: modified test cases run successfully. Reviewers: sdong, yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: anthony, kradhakrishnan, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D42933	2015-07-27 14:25:57 -07:00
Mike Kolupaev	fe09a6dae3	[wal changes 2/3] write with sync=true syncs previous unsynced wals to prevent illegal data loss Summary: I'll just copy internal task summary here: " This sequence will cause data loss in the middle after an sync write: non-sync write key 1 flush triggered, not yet scheduled sync write key 2 system crash After rebooting, users might see key 2 but not key 1, which violates the API of sync write. This can be reproduced using unit test FaultInjectionTest::DISABLED_WriteOptionSyncTest. One way to fix it is for a sync write, if there is outstanding unsynced log files, we need to syc them too. " This diff should be considered together with the next diff D40905; in isolation this fix probably could be a little simpler. Test Plan: `make check`; added a test for that (DBTest.SyncingPreviousLogs) before noticing FaultInjectionTest.WriteOptionSyncTest (keeping both since mine asserts a bit more); both tests fail without this diff; for D40905 stacked on top of this diff, ran tests with ASAN, TSAN and valgrind Reviewers: rven, yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40899	2015-07-22 03:28:08 -07:00
agiardullo	064294081b	Improved FileExists API Summary: Add new CheckFileExists method. Considered changing the FileExists api but didn't want to break anyone's builds. Test Plan: unit tests Reviewers: yhchiang, igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42003	2015-07-20 17:20:40 -07:00
sdong	6e9fbeb27c	Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future. Test Plan: Run all existing unit tests. Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D42321	2015-07-17 16:58:18 -07:00
Igor Canadi	35ca59364c	Don't let flushes preempt compactions Summary: When we first started, max_background_flushes was 0 by default and compaction thread was executing flushes (since there was no flush thread). Then, we switched the default max_background_flushes to 1. However, we still support the case where there is no flush thread and flushes are done in compaction. This is making our code a bit more complicated. By not supporting this use-case we can make our code simpler. We have a special case that when you set max_background_flushes to 0, we schedule the flush to execute on the compaction thread. Test Plan: make check (there might be some unit tests that depend on this behavior) Reviewers: IslamAbdelRahman, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D41931	2015-07-17 12:02:52 -07:00
Igor Canadi	a96fcd09b7	Deprecate CompactionFilterV2 Summary: It has been around for a while and it looks like it never found any uses in the wild. It's also complicating our compaction_job code quite a bit. We're deprecating it in 3.13, but will put it back in 3.14 if we actually find users that need this feature. Test Plan: make check Reviewers: noetzli, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42405	2015-07-17 18:59:11 +02:00
sdong	6c0c8dee7b	Fix data loss after DB recovery by not allowing flush/compaction to be scheduled until DB opened Summary: Previous run may leave some SST files with higher file numbers than manifest indicates. Compaction or flush may start to run while DB::Open() is still going on. SST file garbage collection may happen interleaving with compaction or flush, and overwrite files generated by compaction of flushes after they are generated. This might cause data loss. This possibility of interleaving is recently introduced. Fix it by not allowing compaction or flush to be scheduled before DB::Open() finishes. Test Plan: Add a unit test. This verification will have a chance to fail without the fix but doesn't fix without the fix. Reviewers: kradhakrishnan, anthony, yhchiang, IslamAbdelRahman, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42399	2015-07-16 10:57:41 -07:00
Poornima Chozhiyath Raman	beb19ad0dd	Fixing delete files in Trivial move of universal compaction Summary: Trvial move in universal compaction was failing when trying to move files from levels other than 0. This was because the DeleteFile while trivially moving, was only deleting files of level 0 which caused duplication of same file in different levels. This is fixed by passing the right level as argument in the call of DeleteFile while doing trivial move. Test Plan: ./db_test ran successfully with the new test cases. Reviewers: sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D42135	2015-07-15 12:28:22 -07:00
Igor Canadi	5aea98ddd8	Deprecate WriteOptions::timeout_hint_us Summary: In one of our recent meetings, we discussed deprecating features that are not being actively used. One of those features, at least within Facebook, is timeout_hint. The feature is really nicely implemented, but if nobody needs it, we should remove it from our code-base (until we get a valid use-case). Some arguments: * Less code == better icache hit rate, smaller builds, simpler code * The motivation for adding timeout_hint_us was to work-around RocksDB's stall issue. However, we're currently addressing the stall issue itself (see @sdong's recent work on stall write_rate), so we should never see sharp lock-ups in the future. * Nobody is using the feature within Facebook's code-base. Googling for `timeout_hint_us` also doesn't yield any users. Test Plan: make check Reviewers: anthony, kradhakrishnan, sdong, yhchiang Reviewed By: yhchiang Subscribers: sdong, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D41937	2015-07-14 09:35:48 +02:00
sdong	f9728640f3	"make format" against last 10 commits Summary: This helps Windows port to format their changes, as discussed. Might have formatted some other codes too becasue last 10 commits include more. Test Plan: Build it. Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D41961	2015-07-13 13:50:18 -07:00
sdong	5fd11853cb	Print Fast CRC32 support information in DB LOG Summary: Print whether fast CRC32 is supported in DB info LOG Test Plan: Run db_bench and see it prints out correctly. Reviewers: yhchiang, anthony, kradhakrishnan, igor Reviewed By: igor Subscribers: MarkCallaghan, yoshinorim, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D41733	2015-07-10 17:59:36 -07:00
Dmitri Smirnov	c903ccc4c2	Merge from github/master	2015-07-09 18:01:08 -07:00
Poornima Chozhiyath Raman	4bed00a44b	Fix function name format according to google style Summary: Change the naming style of getter and setters according to Google C++ style in compaction.h file Test Plan: Compilation success Reviewers: sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D41265	2015-07-08 15:21:10 -07:00
Poornima Chozhiyath Raman	c0b23dd5b0	Enabling trivial move in universal compaction Summary: This change enables trivial move if all the input files are non onverlapping while doing Universal Compaction. Test Plan: ./compaction_picker_test and db_test ran successfully with the new testcases. Reviewers: sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D40875	2015-07-07 14:18:55 -07:00
Yueh-Hsuan Chiang	4ce5be4255	fixed leaking log::Writers Summary: Fixes valgrind errors in column_family_test. Test Plan: `make check`, `make valgrind_check` Reviewers: igor, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D41181	2015-07-07 12:10:10 -07:00
Mike Kolupaev	218487d8dc	[wal changes 1/3] fixed unbounded wal growth in some workloads Summary: This fixes the following scenario we've hit: - we reached max_total_wal_size, created a new wal and scheduled flushing all memtables corresponding to the old one, - before the last of these flushes started its column family was dropped; the last background flush call was a no-op; no one removed the old wal from alive_logs_, - hours have passed and no flushes happened even though lots of data was written; data is written to different column families, compactions are disabled; old column families are dropped before memtable grows big enough to trigger a flush; the old wal still sits in alive_logs_ preventing max_total_wal_size limit from kicking in, - a few more hours pass and we run out disk space because of one huge .log file. Test Plan: `make check`; backported the new test, checked that it fails without this diff Reviewers: igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40893	2015-07-02 14:27:00 -07:00
Dmitri Smirnov	9dbde7277c	Merge remote-tracking branch 'origin' into ms_win_port	2015-07-02 11:34:22 -07:00
Dmitri Smirnov	18285c1e2f	Windows Port from Microsoft Summary: Make RocksDb build and run on Windows to be functionally complete and performant. All existing test cases run with no regressions. Performance numbers are in the pull-request. Test plan: make all of the existing unit tests pass, obtain perf numbers. Co-authored-by: Praveen Rao praveensinghrao@outlook.com Co-authored-by: Sherlock Huang baihan.huang@gmail.com Co-authored-by: Alex Zinoviev alexander.zinoviev@me.com Co-authored-by: Dmitri Smirnov dmitrism@microsoft.com	2015-07-01 16:13:56 -07:00
Venkatesh Radhakrishnan	c9cd404bcd	Make flush check for shutdown Summary: Fixes task 7156865 where a compaction causes a hang in flush memtable if CancelAllBackgroundWork was called prior to it. Stack trace is in : https://phabricator.fb.com/P19848829 We end up waiting for a flush which will never happen because there are no background threads. Test Plan: PreShutdownFlush Reviewers: sdong, igor Reviewed By: sdong, igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D40617	2015-06-25 14:43:25 -07:00
Islam AbdelRahman	674b1181cf	Bottommost level compaction option Summary: Replace force_bottommost_level_compaction in CompactRangeOption with an option that allow the user to (always skip, always compact, compact if compaction filter is present) the bottommost level for level based compaction. Test Plan: make check Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40527	2015-06-23 13:32:40 -07:00
Giuseppe Ottaviano	782a1590f9	Implement a table-level row cache Summary: Implementation of a table-level row cache. It only caches point queries done through the `DB::Get` interface, queries done through the `Iterator` interface will completely skip the cache. Supports snapshots and merge operations. Test Plan: Ran `make valgrind_check commit-prereq` Reviewers: igor, philipp, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D39849	2015-06-23 10:25:45 -07:00
krad	de85e4cadf	Introduce WAL recovery consistency levels Summary: The "one size fits all" approach with WAL recovery will only introduce inconvenience for our varied clients as we go forward. The current recovery is a bit heuristic. We introduce the following levels of consistency while replaying the WAL. 1. RecoverAfterRestart (kTolerateCorruptedTailRecords) This mocks the current recovery mode. 2. RecoverAfterCleanShutdown (kAbsoluteConsistency) This is ideal for unit test and cases where the store is shutdown cleanly. We tolerate no corruption or incomplete writes. 3. RecoverPointInTime (kPointInTimeRecovery) This is ideal when using devices with controller cache or file systems which can loose data on restart. We recover upto the point were is no corruption or incomplete write. 4. RecoverAfterDisaster (kSkipAnyCorruptRecord) This is ideal mode to recover data. We tolerate corruption and incomplete writes, and we hop over those sections that we cannot make sense of salvaging as many records as possible. Test Plan: (1) Run added unit test to cover all levels. (2) Run make check. Reviewers: leveldb, sdong, igor Subscribers: yoshinorim, dhruba Differential Revision: https://reviews.facebook.net/D38487	2015-06-22 15:28:12 -07:00
Islam AbdelRahman	530534fceb	Fix trivial move merge Summary: Fixing bad merge Test Plan: make -j64 check (this is not enough to verify the fix) Reviewers: igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40521	2015-06-22 15:20:30 -07:00
Igor Canadi	760e9a94de	Fail DB::Open() when the requested compression is not available Summary: Currently RocksDB silently ignores this issue and doesn't compress the data. Based on discussion, we agree that this is pretty bad because it can cause confusion for our users. This patch fails DB::Open() if we don't support the compression that is specified in the options. Test Plan: make check with LZ4 not present. If Snappy is not present all tests will just fail because Snappy is our default library. We should make Snappy the requirement, since without it our default DB::Open() fails. Reviewers: sdong, MarkCallaghan, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39687	2015-06-18 14:55:05 -07:00
Islam AbdelRahman	4eabbdb7ec	Skip bottommost level compaction if possible Summary: This is https://reviews.facebook.net/D39999 but after introducing an option to force compaction the bottom most level Changes in this patch - Introduce force_bottommost_level_compaction to CompactRangeOptions that force compacting bottommost level during compaction - Skip bottommost level compaction if we dont have a compaction filter and force_bottommost_level_compaction options is not set Although tests pass on my machine but I suspect that there maybe some tests that I am not aware of that should use force_bottommost_level_compaction to pass in a deterministic way Test Plan: make check adding new tests Reviewers: igor, sdong, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40059	2015-06-18 11:03:31 -07:00
Yueh-Hsuan Chiang	bb1c74ce18	Fixed a bug of CompactionStats in multi-level universal compaction case Summary: Universal compaction can involves in multiple levels. However, the current implementation of bytes_readn and bytes_readnp1 (and some other stats with postfix `n` and `np1`) assumes compaction can only have two levels. This patch fixes this bug and redefines bytes_readn and bytes_readnp1: * bytes_readnp1: the number of bytes read in the compaction output level. * bytes_readn: the total number of bytes read minus bytes_readnp1 Test Plan: Add a test in compaction_job_stats_test Reviewers: igor, sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D40239	2015-06-17 23:40:34 -07:00
Islam AbdelRahman	12e030a992	Use CompactRangeOptions for CompactRange Summary: This diff update DB::CompactRange to use RangeCompactionOptions instead of using multiple parameters Old CompactRange is still available but deprecated Test Plan: make all check make rocksdbjava USE_CLANG=1 make all OPT=-DROCKSDB_LITE make release Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40209	2015-06-17 14:36:14 -07:00
Igor Canadi	25d600569d	Clean up InstallSuperVersion Summary: We go to great lengths to make sure MaybeScheduleFlushOrCompaction() is called outside of write thread. But anyway, it's still called in the mutex, so it's not that much cheaper. This diff removes the "optimization" and cleans up the code a bit. Test Plan: make check Reviewers: rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D40113	2015-06-17 12:37:59 -07:00
sdong	40f562e747	Allow GetApproximateSize() to include mem table size if it is skip list memtable Summary: Add an option in GetApproximateSize() so that the result will include estimated sizes in mem tables. To implement it, implement an estimated count from the beginning to a key in skip list. The approach is to count to find the entry, how many Next() is issued from each level, and sum them with a weight that is <branching factor> ^ <level>. Test Plan: Add a test case Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D40119	2015-06-16 18:13:23 -07:00
Islam AbdelRahman	cccd2199a6	Revert skip bottommost compaction Summary: Reverting this diff https://reviews.facebook.net/D39999 Will add an option to force bottom most level compaction and then re submit it Test Plan: make check Reviewers: igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40041	2015-06-12 10:43:33 -07:00
Islam AbdelRahman	20f2b54252	Skip bottom most level compaction if no compaction filter Summary: If we don't have a compaction filter then we can skip compacting the bottom most level Test Plan: make check added unit tests Reviewers: yhchiang, sdong, igor Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D39999	2015-06-12 09:56:08 -07:00
sdong	7842920be5	Slow down writes by bytes written Summary: We slow down data into the database to the rate of options.delayed_write_rate (a new option) with this patch. The thread synchronization approach I take is to still synchronize write controller by DB mutex and GetDelay() is inside DB mutex. Try to minimize the frequency of getting time in GetDelay(). I verified it through db_bench and it seems to work hard_rate_limit is deprecated. options.delayed_write_rate is still not dynamically changeable. Need to work on it as a follow-up. Test Plan: Add new unit tests in db_test Reviewers: yhchiang, rven, kradhakrishnan, anthony, MarkCallaghan, igor Reviewed By: igor Subscribers: ikabiljo, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D36351	2015-06-11 20:42:18 -07:00
Islam AbdelRahman	d6ce0f7c61	Add largest sequence to FlushJobInfo Summary: Adding largest sequence number to FlushJobInfo and passing flushed file metadata to NotifyOnFlushCompleted which include alot of other values that we may want to expose in FlushJobInfo Test Plan: make check Reviewers: igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D39927	2015-06-11 15:22:22 -07:00
Yueh-Hsuan Chiang	3eddd1abe9	Add Env::GetThreadID(), which returns the ID of the current thread. Summary: Add Env::GetThreadID(), which returns the ID of the current thread. In addition, make GetThreadList() and InfoLog use same unique ID for the same thread. Test Plan: db_test listener_test Reviewers: igor, rven, IslamAbdelRahman, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D39735	2015-06-11 14:18:02 -07:00
Islam AbdelRahman	73faa3d41d	Handling edge cases for ReFitLevel Summary: Right now the level we pass to ReFitLevel is the maximum level with files (before compaction), there are multiple cases where this maximum level have changed after compaction - all files where in L0 (now maximum level is L1) - using kCompactionStyleUniversal (now maximum level in the last level) - level_compaction_dynamic_level_bytes ?? We can handle each of these cases individually, but I felt it's safer to calculate max_level_with_files again if we want to do a ReFitLevel Test Plan: adding some tests make -j64 check Reviewers: igor, sdong Reviewed By: sdong Subscribers: ott, dhruba Differential Revision: https://reviews.facebook.net/D39663	2015-06-11 14:15:52 -07:00
Venkatesh Radhakrishnan	406a5682eb	Fix hang when closing a DB after doing loads with WAL disabled. Summary: There is a hang during DB close in the following scenario: a) a load with WAL disabled was done, b) CancelAllBackgroundWork was called, c) DB Close was called This was because in that we will wait for a flush but we cannot do a background flush because we have called CancelAllBackgroundWork which marks the DB as shutting downn. Test Plan: Added DBTest FlushOnDestroy Reviewers: sdong Reviewed By: sdong Subscribers: yoshinorim, hermanlee4, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39747	2015-06-09 10:39:49 -07:00
sdong	d8c8f08c12	GetSnapshot() and ReleaseSnapshot() to move new and free out of DB mutex Summary: We currently issue malloc and free inside DB mutex in GetSnapshot() and ReleaseSnapshot(). Move them out. Test Plan: Go through all tests make valgrind_check Reviewers: yhchiang, rven, IslamAbdelRahman, anthony, igor Reviewed By: igor Subscribers: maykov, hermanlee4, MarkCallaghan, yoshinorim, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D39753	2015-06-08 21:57:02 -07:00

... 3 4 5 6 7 ...

1191 Commits