Commit Graph

836 Commits

Author SHA1 Message Date
Lei Jin 1e4a45aac8 remove cfd->options() in DBImpl::NotifyOnFlushCompleted
Summary: We should not reference cfd->options() directly!

Test Plan: make release

Reviewers: sdong, rven, igor, yhchiang

Reviewed By: igor, yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D29061
2014-11-18 10:19:48 -08:00
Igor Canadi 84af2ff8d3 Clean job context in DeleteFile 2014-11-14 16:20:24 -08:00
Igor Canadi 5c04acda08 Explicitly clean JobContext
Summary: This way we can guarantee that old MemTables get destructed before DBImpl gets destructed, which might be useful if we want to make them depend on state from DBImpl.

Test Plan: make check with asserts in JobContext's destructor

Reviewers: ljin, sdong, yhchiang, rven, jonahcohen

Reviewed By: jonahcohen

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D28959
2014-11-14 15:43:10 -08:00
Yueh-Hsuan Chiang 4161de92a3 Fix SIGSEGV
Summary: As a short-term fix, let's go back to previous way of calculating NeedsCompaction(). SIGSEGV happens because NeedsCompaction() can happen before super_version (and thus MutableCFOptions) is initialized.

Test Plan: make check

Reviewers: ljin, sdong, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D28875
2014-11-13 15:21:04 -08:00
Igor Canadi 772bc97f13 No CompactFiles in ROCKSDB_LITE
Summary: It adds lots of code.

Test Plan: compile for iOS, compile for mac. works.

Reviewers: rven, sdong, ljin, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D28857
2014-11-13 16:45:33 -05:00
Yueh-Hsuan Chiang 1d1a64f58a Move NeedsCompaction() from VersionStorageInfo to CompactionPicker
Summary:
Move NeedsCompaction() from VersionStorageInfo to CompactionPicker
to allow different compaction strategies to have their own way to
determine whether doing compaction is necessary.

When compaction style is set to kCompactionStyleNone, then
NeedsCompaction() will always return false.

Test Plan:
export ROCKSDB_TESTS=Compact
./db_test

Reviewers: ljin, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D28719
2014-11-13 13:41:43 -08:00
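As a rough illustration of commit 1d1a64f58a, the sketch below puts the "is a compaction needed?" decision behind a virtual method on the picker, so each strategy can answer it differently and a "none" strategy can always answer false. All type names and the level-0 trigger rule are assumptions made for this example, not the actual RocksDB classes.

```cpp
#include <vector>

// Hypothetical stand-in for the LSM state a picker inspects.
struct StorageInfoSketch {
  std::vector<int> files_per_level;  // file count at each level
};

// Each compaction strategy decides for itself whether work is pending.
class CompactionPickerSketch {
 public:
  virtual ~CompactionPickerSketch() = default;
  virtual bool NeedsCompaction(const StorageInfoSketch& info) const = 0;
};

// Illustrative level-style rule: compact once level 0 holds at least
// `trigger` files.
class LevelPickerSketch : public CompactionPickerSketch {
 public:
  explicit LevelPickerSketch(int trigger) : trigger_(trigger) {}
  bool NeedsCompaction(const StorageInfoSketch& info) const override {
    return !info.files_per_level.empty() &&
           info.files_per_level[0] >= trigger_;
  }

 private:
  int trigger_;
};

// Counterpart of kCompactionStyleNone: never asks for a compaction.
class NullPickerSketch : public CompactionPickerSketch {
 public:
  bool NeedsCompaction(const StorageInfoSketch&) const override {
    return false;
  }
};
```

With this shape, the caller only holds a CompactionPickerSketch pointer and does not need to know which strategy is configured.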
Igor Canadi 25f273027b Fix iOS compile with -Wshorten-64-to-32
Summary: So iOS size_t is 32-bit, so we need to static_cast<size_t> any uint64_t :(

Test Plan: TARGET_OS=IOS make static_lib

Reviewers: dhruba, ljin, yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D28743
2014-11-13 14:39:30 -05:00
Igor Canadi 767777c2bd Turn on -Wshorten-64-to-32 and fix all the errors
Summary:
We need to turn on -Wshorten-64-to-32 for mobile. See D1671432 (internal phabricator) for details.

This diff turns on the warning flag and fixes all the errors. There were also some interesting errors that I might call bugs, especially in plain table. Going forward, I think it makes sense to have this flag turned on and be very very careful when converting 64-bit to 32-bit variables.

Test Plan: compiles

Reviewers: ljin, rven, yhchiang, sdong

Reviewed By: yhchiang

Subscribers: bobbaldwin, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D28689
2014-11-11 16:47:22 -05:00
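The -Wshorten-64-to-32 commits above (767777c2bd, 25f273027b) are about making every 64-bit to 32-bit conversion explicit. A minimal sketch of one careful way to do that follows; CheckedNarrow32 is a hypothetical helper invented for this example and is not something the commits introduce (they use plain static_cast).

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical helper: narrow a 64-bit value to 32 bits, refusing to
// truncate silently when the value does not fit.
inline uint32_t CheckedNarrow32(uint64_t v) {
  if (v > std::numeric_limits<uint32_t>::max()) {
    throw std::overflow_error("value does not fit in 32 bits");
  }
  // The explicit cast documents the narrowing and satisfies the warning.
  return static_cast<uint32_t>(v);
}
```

A bare static_cast silences the warning just as well; the range check is an extra safeguard for values that could plausibly exceed 32 bits.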
Igor Canadi 113796c493 Fix NewFileNumber()
Summary: I mistakenly changed the behavior to ++next_file_number_ instead of next_file_number_++, as it should have been: 344edbb044/db/version_set.h (L539)

Test Plan: none. not sure if this would break anything. It's just different behavior, so I'd rather not risk

Reviewers: ljin, rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D28557
2014-11-11 06:58:47 -08:00
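The difference that commit 113796c493 restores can be seen in a small sketch. The struct and method names below are assumptions for illustration; only next_file_number_ comes from the commit message.

```cpp
#include <cstdint>

struct FileNumberSketch {
  uint64_t next_file_number_ = 2;

  // Post-increment: hands out the current value, then advances.
  uint64_t NewFileNumberPost() { return next_file_number_++; }

  // Pre-increment: advances first, so the value that was "next" before
  // the call is never handed out.
  uint64_t NewFileNumberPre() { return ++next_file_number_; }
};
```

Starting from 2, the post-increment variant returns 2 and the pre-increment variant returns 3; both leave the counter at 3.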
Igor Canadi 4a3bd2bad2 Optimize usage of Status in CompactionJob
Summary: Based on @ljin feedback

Test Plan: compiles

Reviewers: ljin, yhchiang, sdong

Reviewed By: sdong

Subscribers: ljin, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D28515
2014-11-10 11:57:58 -08:00
Igor Canadi e3d3567b5b Get rid of mutex in CompactionJob's state
Summary: Based on @sdong's feedback in the diff, we shouldn't keep db_mutex in CompactionJob's state. This diff removes db_mutex from CompactionJob state, by making next_file_number_ atomic. That way we only need to pass the lock to InstallCompactionResults() because of LogAndApply()

Test Plan: make check

Reviewers: ljin, yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: sdong, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D28491
2014-11-07 15:44:12 -08:00
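Commit e3d3567b5b drops the mutex by making next_file_number_ atomic. A minimal sketch of that idea, assuming a hypothetical wrapper class (the real member lives in VersionSet and its exact form is not shown in this log):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical allocator: hands out unique file numbers without a lock.
class FileNumberAllocatorSketch {
 public:
  explicit FileNumberAllocatorSketch(uint64_t first) : next_(first) {}

  // fetch_add returns the value held before the increment, so
  // concurrent callers each receive a distinct number.
  uint64_t NewFileNumber() { return next_.fetch_add(1); }

  uint64_t current() const { return next_.load(); }

 private:
  std::atomic<uint64_t> next_;
};
```

Because the increment is a single atomic read-modify-write, callers such as a compaction job no longer need to hold a database-wide mutex just to obtain a number.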
Yueh-Hsuan Chiang b8b3903429 Fixed compile error in db/db_impl.cc
Summary:
Fixed compile error in db/db_impl.cc

Test Plan:
make
2014-11-07 15:13:01 -08:00
Yueh-Hsuan Chiang 28c82ff1b3 CompactFiles, EventListener and GetDatabaseMetaData
Summary:
This diff adds three sets of APIs to RocksDB.

= GetColumnFamilyMetaData =
* This API allows users to obtain the current state of a RocksDB instance on one column family.
* See GetColumnFamilyMetaData in include/rocksdb/db.h

= EventListener =
* A virtual class that allows users to implement a set of
  call-back functions which will be called when specific
  events of a RocksDB instance happen.
* To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners

= CompactFiles =
* CompactFiles API inputs a set of file numbers and an output level, and RocksDB
  will try to compact those files into the specified level.

= Example =
* Example code can be found in example/compact_files_example.cc, which implements
  a simple external compactor using EventListener, GetColumnFamilyMetaData, and
  CompactFiles API.

Test Plan:
listener_test
compactor_test
example/compact_files_example
export ROCKSDB_TESTS=CompactFiles
db_test
export ROCKSDB_TESTS=MetaData
db_test

Reviewers: ljin, igor, rven, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D24705
2014-11-07 14:45:18 -08:00
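The EventListener idea from commit 28c82ff1b3 (a class with overridable callbacks, registered through a listeners list in the options) can be sketched without RocksDB itself. Every name below is a stand-in chosen for this example; the real interface is declared in the RocksDB headers and may differ in signatures.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical listener base: callbacks default to doing nothing, so a
// subclass overrides only the events it cares about.
class EventListenerSketch {
 public:
  virtual ~EventListenerSketch() = default;
  virtual void OnFlushCompleted(const std::string& /*file_path*/) {}
};

// Example subclass that records every flushed file it is told about.
class RecordingListener : public EventListenerSketch {
 public:
  void OnFlushCompleted(const std::string& file_path) override {
    flushed_files.push_back(file_path);
  }
  std::vector<std::string> flushed_files;
};

// Hypothetical options struct holding the registered listeners.
struct OptionsSketch {
  std::vector<std::shared_ptr<EventListenerSketch>> listeners;
};

// What the database side would do once a flush finishes.
void NotifyFlushCompleted(const OptionsSketch& options,
                          const std::string& file_path) {
  for (const auto& listener : options.listeners) {
    listener->OnFlushCompleted(file_path);
  }
}
```

An external compactor, as in the example mentioned in the commit, would react inside such a callback by inspecting metadata and requesting a compaction.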
Igor Canadi 53af5d877d Redesign pending_outputs_
Summary:
Here's a prototype of redesigning pending_outputs_. This way, we don't have to expose pending_outputs_ to other classes (CompactionJob, FlushJob, MemtableList). DBImpl takes care of it.

Still have to write some comments, but should be good enough to start the discussion.

Test Plan: make check, will also run stress test

Reviewers: ljin, sdong, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D28353
2014-11-07 11:50:34 -08:00
Lei Jin fd24ae9d05 SetOptions() to return status and also add it to StackableDB
Summary: as title

Test Plan: ./db_test

Reviewers: sdong, yhchiang, rven, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D28269
2014-11-04 16:23:05 -08:00
Lei Jin b1267750fb fix the asan check
Summary: as title

Test Plan: ran it

Reviewers: yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D28311
2014-11-04 15:58:14 -08:00
Yueh-Hsuan Chiang 469d474ba0 Apply InfoLogLevel to the logs in db/db_impl.cc
Summary: Apply InfoLogLevel to the logs in db/db_impl.cc

Test Plan:
db_test
db_bench

Reviewers: ljin, sdong, igor

Reviewed By: igor

Subscribers: leveldb, MarkCallaghan, dhruba

Differential Revision: https://reviews.facebook.net/D28233
2014-11-04 10:28:08 -08:00
sdong ac6afaf9ef Enforce naming convention of getters in version_set.h
Summary: Enforce the accessor naming convention in functions in version_set.h

Test Plan: make all check

Reviewers: ljin, yhchiang, rven, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D28143
2014-11-04 09:59:05 -08:00
sdong 09899f0b51 DB::Open() to automatically increase thread pool size if it is smaller than max number of parallel compactions or flushes
Summary:
With the patch, thread pool size will be automatically increased if DB's options ask for more parallelism of compactions or flushes.

Too many users have been confused by the API. Change it to make it harder for users to make mistakes

Test Plan: Add two unit tests to cover the function.

Reviewers: yhchiang, rven, igor, MarkCallaghan, ljin

Reviewed By: ljin

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D27555
2014-11-03 17:22:34 -08:00
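The behaviour described in commit 09899f0b51 amounts to a grow-only adjustment at open time. A minimal sketch under that reading follows; ThreadPoolSketch and its method are hypothetical names, not the Env interface.

```cpp
// Hypothetical thread pool that can be grown but is never shrunk here.
class ThreadPoolSketch {
 public:
  explicit ThreadPoolSketch(int size) : size_(size) {}

  // Raise the pool size to `required` if it is currently smaller.
  // A pool the user already sized larger is left untouched.
  void IncThreadsIfNeeded(int required) {
    if (required > size_) {
      size_ = required;
    }
  }

  int size() const { return size_; }

 private:
  int size_;
};
```

At open time the database would call this once with the configured maximum number of parallel compactions and once with the maximum number of parallel flushes, each against the matching pool.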
Igor Canadi 74eb4fbe93 CompactionJob
Summary:
Long awaited CompactionJob class! Move most compaction-related things from DBImpl to CompactionJob, making CompactionJob easier to test and understand.

Currently this is just replicating exactly the same functionality with as little change as possible. As future work, we should:
1. Add CompactionJob tests (I think I'll do that tomorrow)
2. Reduce CompactionJob's state that it inherits from DBImpl
3. Figure out how to do yielding to flush better. Currently I implemented a callback as we agreed yesterday, but I don't think it's a good long term solution.

This reduces db_impl.cc from 5000+ LOC to 3400!

Test Plan: make check, will add CompactionJob-specific tests, probably also move some tests from db_test to compaction_job_test

Reviewers: rven, yhchiang, sdong, ljin

Reviewed By: ljin

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D27957
2014-10-31 16:31:25 -07:00
Igor Canadi 9f7fc3ac45 Turn on -Wshadow
Summary:
...and fix all the errors :)

Jim suggested turning on -Wshadow because it helped him fix a number of critical bugs in fbcode. I think it's a good idea to be -Wshadow clean.

Test Plan: compiles

Reviewers: yhchiang, rven, sdong, ljin

Reviewed By: ljin

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D27711
2014-10-31 11:59:54 -07:00
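To show the kind of bug -Wshadow (commit 9f7fc3ac45) is meant to surface, here is a small made-up example; it is not taken from the RocksDB sources. A local variable reuses the name of a member, so the configured limit is silently ignored.

```cpp
struct LimitSketch {
  int max_files = 4;

  // Buggy: the local `max_files` shadows the member, so the configured
  // limit of 4 is never consulted. A compiler run with -Wshadow would
  // typically report this declaration.
  int ClampShadowed(int n) const {
    int max_files = 100;
    return n > max_files ? max_files : n;
  }

  // Fixed: distinct local names keep the member visible.
  int Clamp(int n) const {
    const int hard_cap = 100;
    const int limit = max_files < hard_cap ? max_files : hard_cap;
    return n > limit ? limit : n;
  }
};
```

Both functions compile without error, which is why a warning flag is the practical way to catch the first one.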
sdong 4d2ba38b65 Make VersionBuilder unit testable
Summary:
Rename Version::Builder to VersionBuilder and expose its definition to a header.
Make VersionBuilder not reference Version or ColumnFamilyData, only working with VersionStorageInfo.
Add version_builder_test which has a simple test.

Test Plan: make all check

Reviewers: rven, yhchiang, igor, ljin

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D27969
2014-10-31 10:44:06 -07:00
Igor Canadi 635905481d WalManager
Summary: Decoupling code that deals with archived log files outside of DBImpl. That will make this code easier to reason about and test. It will also make the code easier to improve, because an improver doesn't have to understand DBImpl code in entirety.

Test Plan: added test

Reviewers: ljin, yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D27873
2014-10-29 17:43:37 -07:00
sdong 76d1c28e82 Make CompactionPicker more easily tested
Summary:
Make compaction picker easier to test.
The basic idea is to separate a minimum subcomponent of Version into VersionStorageInfo, which is responsible only for the LSM tree. A stub VersionStorageInfo can then be easily created and passed into compaction picker so that we can check the outputs.

It now passes most tests. Still two things need to be done:
(1) deal with the FIFO compaction's file size.
(2) write an example test to make sure the interface can do the job.

Add a compaction_picker_test to make sure compaction picker codes can be easily unit tested.

Test Plan:
Pass all unit tests and compaction_picker_test

Reviewers: yhchiang, rven, igor, ljin

Reviewed By: ljin

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D27639
2014-10-29 15:16:53 -07:00
Yueh-Hsuan Chiang 3772a3d09d Fix the bug where compaction does not fail when RocksDB can't create a new file.
Summary:
This diff makes three changes.

1. Fix the bug where compaction does not fail when RocksDB can't create a new file.
2. Previously, when NewWritableFiles() failed in OpenCompactionOutputFiles(), the file that failed to be created was still included as a compaction output. This patch also fixes this bug.
3. Allow VersionEdit::EncodeTo() to return Status and add basic check.

Test Plan:
./version_edit_test
export ROCKSDB_TESTS=FileCreationRandomFailure
./db_test

Reviewers: ljin, sdong, nkg-, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D25581
2014-10-28 14:27:26 -07:00
Igor Canadi a39e931e50 FlushProcess
Summary:
Abstract out FlushProcess and take it out of DBImpl.
This also includes taking DeletionState outside of DBImpl.

Currently this diff is only doing the refactoring. Future work includes:
1. Decoupling flush_process.cc, make it depend on less state
2. Write flush_process_test, which will mock out everything that FlushProcess depends on and test it in isolation

Test Plan: make check

Reviewers: rven, yhchiang, sdong, ljin

Reviewed By: ljin

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D27561
2014-10-28 11:54:33 -07:00
Igor Canadi 48842ab316 Deprecate AtomicPointer
Summary: RocksDB already depends on C++11, so we might as well use all the goodness that C++11 provides. This means that we don't need AtomicPointer anymore. The fewer things in port/, the easier it will be to port to other platforms.

Test Plan: make check + careful visual review verifying that NoBarrier methods got memory_order_relaxed, while Acquire/Release methods got memory_order_acquire and memory_order_release

Reviewers: rven, yhchiang, ljin, sdong

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D27543
2014-10-27 14:50:21 -07:00
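The mapping checked in the test plan of commit 48842ab316 can be written out as a sketch. The wrapper below assumes the LevelDB-style method names (NoBarrier_Load, Acquire_Load and so on); the commit itself replaces such a wrapper with direct std::atomic use, so the class is shown only to make the correspondence visible.

```cpp
#include <atomic>

// Hypothetical wrapper showing which std::memory_order each of the old
// AtomicPointer-style operations corresponds to.
class AtomicPointerSketch {
 public:
  explicit AtomicPointerSketch(void* p = nullptr) : rep_(p) {}

  // No ordering guarantees beyond atomicity of the access itself.
  void* NoBarrier_Load() const {
    return rep_.load(std::memory_order_relaxed);
  }
  void NoBarrier_Store(void* v) {
    rep_.store(v, std::memory_order_relaxed);
  }

  // Acquire load pairs with a release store: writes made before the
  // store become visible to the thread that observes the stored value.
  void* Acquire_Load() const {
    return rep_.load(std::memory_order_acquire);
  }
  void Release_Store(void* v) {
    rep_.store(v, std::memory_order_release);
  }

 private:
  std::atomic<void*> rep_;
};
```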
Lei Jin f1841985e4 dynamic inplace_update options
Summary:
Make inplace_update_support and inplace_update_num_locks dynamic.
inplace_callback becomes immutable
We are almost free of references to cfd->options() in db_impl

Test Plan: unit test

Reviewers: igor, yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D25293
2014-10-27 12:10:13 -07:00
Lei Jin 122f98e0b9 dynamic max_mem_compact_level
Summary: as title

Test Plan: unit test

Reviewers: sdong, yhchiang, rven, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D25347
2014-10-23 15:37:14 -07:00
Lei Jin 574028679b dynamic max_sequential_skip_in_iterations
Summary:
This is not a critical option. Making it dynamic so that we can remove
more references to cfd->options()

Test Plan: unit test

Reviewers: yhchiang, sdong, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D24957
2014-10-23 15:34:21 -07:00
sdong d755e53b87 Printing number of keys in DB Stats
Summary: It is useful to print out number of keys in DB Stats

Test Plan:
./db_bench --benchmarks fillrandom --num 1000000 -threads 16 -batch_size=16

and watch the outputs in LOG files

Reviewers: MarkCallaghan, ljin, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D24513
2014-10-22 18:41:33 -07:00
Igor Canadi 6398e6a6a5 Fix DeleteFile() + enable deleting oldest files in level 0
Summary:
DeleteFile() call was broken for non-default column family. This fixes it. We might need this feature for mongo.

I also introduced a possibility of deleting oldest file in level 0.

Test Plan: added unit test to deletefile_test

Reviewers: ljin, yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D24909
2014-10-21 11:23:06 -07:00
Lei Jin 2dd9bfe3a8 Sanitize block-based table index type and check prefix_extractor
Summary:
Respond to issue reported
https://www.facebook.com/groups/rocksdb.dev/permalink/651090261656158/
Change the Sanitize signature to take both DBOptions and CFOptions

Test Plan: unit test

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D25041
2014-10-17 21:18:36 -07:00
Lei Jin d6c8dba727 Log MutableCFOptions in SetOptions
Summary: as title

Test Plan: make release

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D24903
2014-10-16 17:22:28 -07:00
Lei Jin 065a67c4f0 dynamic disable_auto_compactions
Summary: Add more tests as well

Test Plan: unit test

Reviewers: igor, sdong, yhchiang

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D24747
2014-10-16 17:14:17 -07:00
Lei Jin dc50a1a593 make max_write_buffer_number dynamic
Summary: as title

Test Plan: unit test

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D24729
2014-10-16 16:57:59 -07:00
Igor Canadi ca250d71a1 Move logging out of mutex
Summary: As title

Test Plan: compiles

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D24897
2014-10-15 10:56:50 -07:00
Igor Canadi cc6c883f59 Stop stopping writes on bg_error_
Summary: This might have caused https://github.com/facebook/rocksdb/issues/345. If we're stopping writes and bg_error comes along, we will never unblock the write.

Test Plan: compiles

Reviewers: ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D24807
2014-10-13 14:25:55 -07:00
Igor Canadi f78b832e5d Log RocksDB version
Summary: This will be much easier than reviewing git sha's we currently have in our LOGs

Test Plan: none

Reviewers: sdong, yhchiang, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D24591
2014-10-07 10:40:57 -07:00
Yueh-Hsuan Chiang 56dfd363fd Fix a check in database shutdown or Column family drop during flush.
Summary:
Fix a check in database shutdown or Column family drop during flush.

Special thanks to Maurice Barnum who spots the problem :)

Test Plan: db_test

Reviewers: ljin, igor, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D24273
2014-10-03 00:25:27 -07:00
sdong 8ea232b9e3 Add number of records dropped in compaction summary
Summary:
Add two stats to compaction summary:
1. Total input records from previous level
2. Total number of records dropped after compaction

Test Plan: See outputs of printing when running locally

Reviewers: ljin, igor, MarkCallaghan

Reviewed By: MarkCallaghan

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D24411
2014-10-02 17:54:25 -07:00
sdong f4086a88b4 perf_context.get_from_output_files_time is set for MultiGet() and ReadOnly DB too.
Summary: perf_context.get_from_output_files_time is currently only set in a writable DB's DB::Get(). Extend it to MultiGet() and read-only DB.

Test Plan:
make all check
Fix perf_context_test and extend it to cover MultiGet(), as well as read-only DB. Run it and watch the results

Reviewers: ljin, yhchiang, igor

Reviewed By: igor

Subscribers: rven, leveldb

Differential Revision: https://reviews.facebook.net/D24207
2014-10-02 17:02:50 -07:00
Lei Jin 5ec53f3edf make compaction related options changeable
Summary:
make compaction related options changeable. Most of changes are tedious,
following the same convention: grabs MutableCFOptions at the beginning
of compaction under mutex, then pass it throughout the job and register
it in SuperVersion at the end.

Test Plan: make all check

Reviewers: igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23349
2014-10-01 16:19:16 -07:00
Danny Al-Gaaf 0fd8bbca53 db/db_impl.cc: reduce scope of prefix_initialized
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
2014-09-30 23:30:33 +02:00
Danny Al-Gaaf 33580fa39a db/db_impl.cc: fix object handling, remove double lines
Fix for:

[db/db_impl.cc:4039]: (error) Instance of 'StopWatch' object is
 destroyed immediately.
[db/db_impl.cc:4042]: (error) Instance of 'StopWatch' object is
 destroyed immediately.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
2014-09-30 23:30:32 +02:00
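The cppcheck error fixed in commit 33580fa39a ("object is destroyed immediately") comes from constructing a scope timer without giving it a name. A minimal sketch of the pitfall, using a hypothetical timer type that counts live instances so the lifetime can be observed:

```cpp
// Hypothetical scope timer; `live` tracks how many instances exist.
struct ScopeTimerSketch {
  static int live;
  ScopeTimerSketch() { ++live; }
  ~ScopeTimerSketch() { --live; }
};
int ScopeTimerSketch::live = 0;

// Buggy: the unnamed temporary is destroyed at the end of its own
// statement, so nothing after it is actually being timed.
int LiveAfterTemporary() {
  ScopeTimerSketch{};
  return ScopeTimerSketch::live;  // no timer is alive here
}

// Fixed: a named object lives until the end of the enclosing scope.
int LiveAfterNamed() {
  ScopeTimerSketch timer;
  return ScopeTimerSketch::live;  // the timer is still alive here
}
```

The fix is only a variable name, which is why this mistake is easy to miss in review and worth leaving to static analysis.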
Mark Callaghan 1f963305a8 Print MB per second compaction throughput separately for reads and writes
Summary:
From this line there used to be one column (MB/sec) that includes reads and writes. This change splits it and for real workloads the rd and wr rates might not match when keys are dropped.
2014/09/29-17:31:01.213162 7f929fbff700 (Original Log Time 2014/09/29-17:31:01.180025) [default] compacted to: files[2 5 0 0 0 0 0], MB/sec: 14.0 rd, 14.0 wr, level 1, files in(4, 0) out(5) MB in(8.5, 0.0) out(8.5), read-write-amplify(2.0) write-amplify(1.0) OK

Test Plan:
make check, grepped LOG

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: igor

Differential Revision: https://reviews.facebook.net/D24237
2014-09-29 17:51:40 -07:00
Igor Canadi f7375f39fd Fix double deletes
Summary: While debugging a client's compaction issues, I noticed a bunch of delete bugs: P16329995. MakeTableName returns sst file with "/" prefix. We also need "/" prefix when we get the files through GetChildren(), so that we can properly dedup the files.

Test Plan: none

Reviewers: sdong, yhchiang, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23457
2014-09-25 11:08:16 -07:00
Igor Canadi 21ddcf6e4f Remove allow_thread_local
Summary: See https://reviews.facebook.net/D19365

Test Plan: compiles

Reviewers: sdong, yhchiang, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23907
2014-09-24 13:12:16 -07:00
Lei Jin 5e6aee4325 dont create backup_input if compaction filter v2 is not used
Summary:
Compaction creates backup_input iterator even though it is only needed
when compaction filter v2 is enabled

Test Plan: make all check

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23769
2014-09-22 10:36:53 -07:00
Venkatesh Radhakrishnan f44594743f RocksDB: Format uint64 using PRIu64 in db_impl.cc
Summary: Use PRIu64 to format uint64 in a portable manner

Test Plan: Run "make all check"

Reviewers: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23595
2014-09-18 22:19:41 -07:00
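For commit f44594743f, the portable way to print a uint64_t with printf-style functions is the PRIu64 macro from <cinttypes>, which expands to the correct length modifier for the platform. A minimal sketch; FormatFileNumber and its message text are made up for this example.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Format a 64-bit number without assuming that uint64_t is
// `unsigned long` or `unsigned long long` on the target platform.
std::string FormatFileNumber(uint64_t number) {
  char buf[64];
  // PRIu64 is a string literal, so it concatenates with the format.
  std::snprintf(buf, sizeof(buf), "file #%" PRIu64, number);
  return std::string(buf);
}
```

Hard-coding %lu or %llu instead compiles cleanly on some platforms and triggers format warnings on others.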
Igor Canadi 2fb1fea30f Fix synchronization issues 2014-09-18 10:42:54 -07:00
Lei Jin a062e1f2c4 SetOptions() for memtable related options
Summary: as title

Test Plan:
make all check
I will think of a way to set up a stress test for this

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23055
2014-09-17 12:49:13 -07:00
Igor Canadi dee91c259d WriteThread
Summary: This diff just moves the write thread control out of the DBImpl. I will need this as I will control column family data concurrency by only accessing some data in the write thread. That way, we won't have to lock our accesses to column family hash table (mappings from IDs to CFDs).

Test Plan: make check

Reviewers: sdong, yhchiang, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23301
2014-09-12 16:23:58 -07:00
Igor Canadi 540a257f2c Fix WAL synced
Summary: Uhm...

Test Plan: nope

Reviewers: sdong, yhchiang, tnovak, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23343
2014-09-12 16:15:29 -07:00
Igor Canadi 9c0e66ce98 Don't run background jobs (flush, compactions) when bg_error_ is set
Summary:
If bg_error_ is set, that means that we mark DB read only. However, current behavior still continues the flushes and compactions, even though bg_error_ is set.

On the other hand, if bg_error_ is set, we will return Status::OK() from CompactRange(), although the compaction didn't actually succeed.

This is clearly not desired behavior. I found this when I was debugging t5132159, although I'm pretty sure these aren't related.

Also, when we're shutting down, it's dangerous to exit RunManualCompaction(), since that will destruct ManualCompaction object. Background compaction job might still hold a reference to manual_compaction_ and this will lead to undefined behavior. I changed the behavior so that we only exit RunManualCompaction when manual compaction job is marked done.

Test Plan: make check

Reviewers: sdong, ljin, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23223
2014-09-11 16:24:16 -07:00
Igor Canadi 3d9e6f7759 Push model for flushing memtables
Summary:
When memtable is full it calls the registered callback. That callback then registers column family as needing the flush. Every write checks if there are some column families that need to be flushed. This completely eliminates the need for MakeRoomForWrite() function and simplifies our Write code-path.

There is some complexity with the concurrency when the column family is dropped. I made it a bit less complex by dropping the column family from the write thread in https://reviews.facebook.net/D22965. Let me know if you want to discuss this.

Test Plan: make check works. I'll also run db_stress with creating and dropping column families for a while.

Reviewers: yhchiang, sdong, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23067
2014-09-10 18:46:09 -07:00
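The push model from commit 3d9e6f7759 can be sketched as two small pieces: a memtable that calls a registered callback once it fills up, and a scheduler that the write path drains. The class names, the byte-count capacity rule and the integer column family IDs are all assumptions for illustration.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical memtable: reports itself once when it becomes full.
class MemTableSketch {
 public:
  MemTableSketch(int cf_id, std::size_t capacity,
                 std::function<void(int)> on_full)
      : cf_id_(cf_id), capacity_(capacity), on_full_(std::move(on_full)) {}

  void Add(std::size_t bytes) {
    used_ += bytes;
    if (!reported_ && used_ >= capacity_) {
      reported_ = true;  // push the notification exactly once
      on_full_(cf_id_);
    }
  }

 private:
  int cf_id_;
  std::size_t capacity_;
  std::function<void(int)> on_full_;
  std::size_t used_ = 0;
  bool reported_ = false;
};

// Hypothetical scheduler: collects column families that need a flush.
class FlushSchedulerSketch {
 public:
  void ScheduleFlush(int cf_id) { pending_.push_back(cf_id); }
  bool Empty() const { return pending_.empty(); }

  // Called from the write path; returns and clears the pending list.
  std::vector<int> TakePending() {
    std::vector<int> out;
    out.swap(pending_);
    return out;
  }

 private:
  std::vector<int> pending_;
};
```

Each write then only checks whether the scheduler is empty, instead of inspecting every column family's memtable.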
sdong 06d986252a Always pass MergeContext as pointer, not reference
Summary: To follow the coding convention and make sure when passing reference as a parameter it is also const, pass MergeContext as a pointer to mem tables.

Test Plan: make all check

Reviewers: ljin, igor

Reviewed By: igor

Subscribers: leveldb, dhruba, yhchiang

Differential Revision: https://reviews.facebook.net/D23085
2014-09-09 11:37:32 -07:00
Stanislau Hlebik d343c3fe46 Improve db recovery
Summary: Avoid creating unnecessary sst files while db opening

Test Plan: make all check

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: zagfox, yhchiang, ljin, leveldb

Differential Revision: https://reviews.facebook.net/D20661
2014-09-09 11:18:50 -07:00
Lei Jin 52311463e9 MemTableOptions
Summary: removed reference to options in WriteBatch and DBImpl::Get()

Test Plan: make all check

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23049
2014-09-08 18:46:52 -07:00
Lei Jin 659d2d50c3 move compaction_filter to immutable_options
Summary:
all shared_ptrs are in immutable_options now. This will also make
options assignment a little cheaper

Test Plan: make release

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23001
2014-09-08 15:09:25 -07:00
Lei Jin 048560a642 reduce references to cfd->options() in DBImpl
Summary:
I found it is almost impossible to get rid of this function in a single
batch. I will take a step by step approach

Test Plan: make release

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22995
2014-09-08 15:04:34 -07:00
sdong 011241bb99 DB::Flush() Do not wait for background threads when there is nothing in mem table
Summary:
When we have multiple column families, users can issue Flush() on every column family to make sure everything is flushed, even if some of them might be empty. By skipping the wait for the empty cases, this can be greatly sped up.

Still wait for people's comments before writing unit tests for it.

Test Plan: Will write a unit test to make sure it is correct.

Reviewers: ljin, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D22953
2014-09-08 13:40:42 -07:00
Igor Canadi a2bb7c3c33 Push- instead of pull-model for managing Write stalls
Summary:
Introducing WriteController, which is a source of truth about per-DB write delays. Let's define a DB epoch as a period where there are no flushes and compactions (i.e. new epoch is started when flush or compaction finishes). Each epoch can either:
* proceed with all writes without delay
* delay all writes by fixed time
* stop all writes

The three modes are recomputed at each epoch change (flush, compaction), rather than on every write (which is currently the case).

When we have a lot of column families, our current pull behavior adds a big overhead, since we need to loop over every column family for every write. With new push model, overhead on Write code-path is minimal.

This is just the start. Next step is to also take care of stalls introduced by slow memtable flushes. The final goal is to eliminate function MakeRoomForWrite(), which currently needs to be called for every column family by every write.

Test Plan: make check for now. I'll add some unit tests later. Also, perf test.

Reviewers: dhruba, yhchiang, MarkCallaghan, sdong, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22791
2014-09-08 11:20:25 -07:00
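The three per-epoch modes from commit a2bb7c3c33 can be sketched as a cached decision that is recomputed only when a flush or compaction finishes. The class below, its level-0 file thresholds and its parameter names are assumptions for illustration, not the real WriteController interface.

```cpp
#include <cstdint>

enum class WriteMode { kNormal, kDelayed, kStopped };

struct WriteDecision {
  WriteMode mode;
  uint64_t delay_micros;  // only meaningful for kDelayed
};

// Hypothetical controller: the decision is pushed at epoch changes and
// merely read on each write.
class WriteControllerSketch {
 public:
  // Called when a flush or compaction finishes (a new epoch starts).
  void OnEpochChange(int l0_files, int slowdown_trigger, int stop_trigger,
                     uint64_t delay_micros) {
    if (l0_files >= stop_trigger) {
      current_ = {WriteMode::kStopped, 0};
    } else if (l0_files >= slowdown_trigger) {
      current_ = {WriteMode::kDelayed, delay_micros};
    } else {
      current_ = {WriteMode::kNormal, 0};
    }
  }

  // Called on every write: no per-column-family loop, just a read.
  WriteDecision Current() const { return current_; }

 private:
  WriteDecision current_{WriteMode::kNormal, 0};
};
```

The cost of deciding moves from the write path, which runs constantly, to epoch changes, which are comparatively rare.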
Igor Canadi 9f1c80b556 Drop column family from write thread
Summary: If we drop column family only from (single) write thread, we can be sure that nobody will drop the column family while we're writing (and our mutex is released). This greatly simplifies my patch that's getting rid of MakeRoomForWrite().

Test Plan: make check, but also running stress test

Reviewers: ljin, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22965
2014-09-05 15:20:05 -07:00
Lei Jin c9e419ccb6 rename options_ to db_options_ in DBImpl to avoid confusion
Summary: as title

Test Plan: make release

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22935
2014-09-05 11:48:17 -07:00
liuhuahang bb6ae0f80c fix more compile warnings
N/A

Change-Id: I5b6f9c70aea7d3f3489328834fed323d41106d9f
Signed-off-by: liuhuahang <liuhuahang@zerus.co>
2014-09-05 14:14:37 +08:00
Stanislau Hlebik 45a5e3ede0 Remove path with arena==nullptr from NewInternalIterator
Summary:
Simplify code by removing the code path which does not use Arena
from NewInternalIterator

Test Plan:
make all check
make valgrind_check

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22395
2014-09-04 17:40:41 -07:00
Lei Jin 5665e5e285 introduce ImmutableOptions
Summary:
As a preparation to support updating some options dynamically, I'd like
to first introduce ImmutableOptions, which is a subset of Options that
cannot be changed during the course of a DB lifetime without restart.

ColumnFamily will keep both Options and ImmutableOptions. Any component
below ColumnFamily should only take ImmutableOptions in their
constructor. Other options should be taken from APIs, which will be
allowed to adjust dynamically.

I am yet to make changes to memtable and other related classes to take
ImmutableOptions in their ctor. That can be done in a separate diff as
this one is already pretty big.

Test Plan: make all check

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D22545
2014-09-04 16:18:36 -07:00
Raghav Pisolkar e0b99d4f5d created a new ReadOptions parameter 'iterate_upper_bound' 2014-09-04 11:00:16 -07:00
Lei Jin 9b58c73c7c call SanitizeDBOptionsByCFOptions() in the right place
Summary: It only covers Open() with default column family right now

Test Plan: make release

Reviewers: igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22467
2014-09-02 14:42:23 -07:00
Igor Canadi a84234a61b Ignore missing column families
Summary:
Before this diff, whenever we Write to non-existing column family, Write() would fail.

This diff adds an option to not fail a Write() when WriteBatch points to non-existing column family. MongoDB said this would be useful for them, since they might have a transaction updating an index that was dropped by another thread. This way, they don't have to worry about checking if all indexes are alive on every write. They don't care if they lose writes to dropped index.
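
The behavior described above can be sketched on a reduced model. This is only an illustration, not the RocksDB implementation: `Put`, `ColumnFamilies`, `ApplyBatch`, and the `ignore_missing` flag are hypothetical names, and validating the whole batch before applying it is an assumption made for the sketch.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified model: a "DB" maps a column family ID to a
// key/value map, and a batch is a list of puts.
struct Put {
  uint32_t column_family_id;
  std::string key;
  std::string value;
};

using ColumnFamilies =
    std::map<uint32_t, std::map<std::string, std::string>>;

// Applies the batch. Returns false if a put targets a missing column
// family and ignore_missing is false. With ignore_missing set, such puts
// are dropped and the rest of the batch is still applied.
bool ApplyBatch(ColumnFamilies* db, const std::vector<Put>& batch,
                bool ignore_missing) {
  // Validate first so that a failing batch leaves the DB untouched.
  if (!ignore_missing) {
    for (const Put& put : batch) {
      if (db->find(put.column_family_id) == db->end()) {
        return false;
      }
    }
  }
  for (const Put& put : batch) {
    auto it = db->find(put.column_family_id);
    if (it == db->end()) {
      continue;  // dropped column family: this write is lost by design
    }
    it->second[put.key] = put.value;
  }
  return true;
}
```

With the flag set, a transaction that touches a dropped index still commits its remaining writes, which is the trade-off the summary describes.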

Test Plan: added a small unit test

Reviewers: sdong, yhchiang, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22143
2014-09-02 13:29:05 -07:00
Igor Canadi 7f19bb93c6 Merge pull request #242 from tdfischer/perf-timer-destructors
Refactor PerfStepTimer to automatically stop on destruct
2014-09-02 13:06:40 -07:00
Feng Zhu 8438a19360 fix dropping column family bug
Summary: 1. db/db_impl.cc:2324 (DBImpl::BackgroundCompaction) should not raise bg_error_ when column family is dropped during compaction.

Test Plan: 1. db_stress

Reviewers: ljin, yhchiang, dhruba, igor, sdong

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22653
2014-09-02 12:25:58 -07:00
Torrie Fischer 6614a48418 Refactor PerfStepTimer to stop on destruct
This eliminates the need to remember to call PERF_TIMER_STOP when a section has
been timed. This allows more useful design with the perf timers and enables
possible return value optimizations. Simplistic example:

class Foo {
  public:
    explicit Foo(int v = 0) : m_v(v) {}
  private:
    int m_v;
};

Foo makeFrobbedFoo(int *err)
{
  *err = 0;
  return Foo();
}

Foo bar(int *err)
{
  // The timer stops when the guard goes out of scope.
  PERF_TIMER_GUARD(some_timer);

  return makeFrobbedFoo(err);
}

int main(int argc, char *argv[])
{
  int err;

  Foo f = bar(&err);

  if (err)
    return -1;
  return 0;
}

After bar() is called, perf_context.some_timer would be incremented as if
Stop(&perf_context.some_timer) was called at the end, and the compiler is still
able to produce optimizations on the return value from makeFrobbedFoo() through
to main().
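
The stop-on-destruct idea can also be shown without the RocksDB macros. The following is a minimal sketch, not the actual PerfStepTimer: `PerfCounters`, `ScopedTimer`, and `TimedWork` are hypothetical names, and the use of `std::chrono::steady_clock` is an assumption.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for a perf context; not RocksDB's real structure.
struct PerfCounters {
  uint64_t some_timer_nanos = 0;  // accumulated elapsed time
  uint64_t some_timer_stops = 0;  // number of times a guard has stopped
};

// Starts timing on construction and adds the elapsed time to the metric
// on destruction, so no explicit stop call is needed on any return path.
class ScopedTimer {
 public:
  explicit ScopedTimer(PerfCounters* counters)
      : counters_(counters), start_(std::chrono::steady_clock::now()) {}

  ~ScopedTimer() {
    auto elapsed = std::chrono::steady_clock::now() - start_;
    counters_->some_timer_nanos += static_cast<uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed)
            .count());
    counters_->some_timer_stops += 1;
  }

  ScopedTimer(const ScopedTimer&) = delete;
  ScopedTimer& operator=(const ScopedTimer&) = delete;

 private:
  PerfCounters* counters_;
  std::chrono::steady_clock::time_point start_;
};

// The guard is destroyed after the return value is computed, whichever
// return statement is taken.
int TimedWork(PerfCounters* counters, bool early_exit) {
  ScopedTimer guard(counters);
  if (early_exit) {
    return -1;
  }
  return 42;
}
```

Because the destructor runs on every exit path, an early return cannot leave the timer running.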
2014-09-02 12:04:22 -07:00
Igor Canadi 7dcadb1d37 Don't let flush preempt compaction in certain cases
Summary:
I have an application configured with 16 background threads. Write rates are high. L0->L1 compaction is very slow and limits the concurrency of the system. While it's happening, the other 15 threads are idle. However, when a flush is needed, the one thread busy with L0->L1 performs the flush instead of any of the 15 idle threads.

This diff prevents that. If there are threads that are idle, we don't let flush preempt compaction.
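
The rule can be stated as a small predicate. This is a sketch under assumptions: the function and parameter names are hypothetical, and the real scheduling logic in DBImpl considers more state than shown here.

```cpp
// Hypothetical decision helper. Returns true only when a flush is pending
// and no background thread is idle, i.e. the only way to run the flush
// promptly is for the compaction thread to yield to it.
bool ShouldCompactionYieldToFlush(bool flush_pending, int total_bg_threads,
                                  int busy_bg_threads) {
  if (!flush_pending) {
    return false;
  }
  int idle_threads = total_bg_threads - busy_bg_threads;
  return idle_threads <= 0;  // idle threads exist: let one of them flush
}
```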

Test Plan: Will run stress test

Reviewers: ljin, sdong, yhchiang

Reviewed By: sdong, yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D22299
2014-09-02 08:34:54 -07:00
Nik Bougalis f09329cb01 Fix candidate file comparison when using path ids 2014-08-31 00:54:15 -07:00
Lei Jin 722d80c374 reduce recordTick overhead in compaction loop
Summary: It is too expensive to bump the ticker for every key/value pair

Test Plan: make release

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22527
2014-08-29 09:51:09 -07:00
Igor Canadi d977e55596 Don't let other compactions run when manual compaction runs
Summary:
Based on discussions from t4982833. This is just a short-term fix, I plan to revamp manual compaction process as part of t4982812.

Also, I think we should schedule automatic compactions at the very end of manual compactions, not when we're done with one level. I made that change as part of this diff. Let me know if you disagree.

Test Plan: make check for now

Reviewers: sdong, tnovak, yhchiang, ljin

Reviewed By: yhchiang

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22401
2014-08-28 13:06:28 -04:00
Igor Canadi d5bd6c772b Fix ios compile
Summary: No __thread for ios.

Test Plan: compile works for ios now

Reviewers: ljin, dhruba

Reviewed By: dhruba

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22491
2014-08-28 12:46:05 -04:00
Stanislau Hlebik 9dcb75b6d9 Add is-file-deletions-enabled property
Summary:
Add property 'rocksdb.is-file-deletions-enabled'
	 which equals disable_delete_obsolete_files_

Test Plan: make all check

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22119
2014-08-26 16:26:29 -07:00
Lei Jin 23861857c4 ReadOptions.total_order_seek to allow total order seek for block-based table when hash index is enabled
Summary: as title

Test Plan: table_test

Reviewers: igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22239
2014-08-25 16:14:30 -07:00
Lei Jin 384400128f move block based table related options to BlockBasedTableOptions
Summary:
I will move compression related options in a separate diff since this
diff is already pretty lengthy.
I guess I will also need to change JNI accordingly :(

Test Plan: make all check

Reviewers: yhchiang, igor, sdong

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D21915
2014-08-25 14:22:05 -07:00
Yueh-Hsuan Chiang 63a2215c63 Improve Options sanitization and add MmapReadRequired() to TableFactory
Summary:
Currently, PlainTable must use mmap_reads.  When PlainTable is used but
allow_mmap_reads is not set, rocksdb will fail in flush.

This diff improves Options sanitization and adds MmapReadRequired() to
TableFactory.

Test Plan:
export ROCKSDB_TESTS=PlainTableOptionsSanitizeTest
make db_test -j32
./db_test

Reviewers: sdong, ljin

Reviewed By: ljin

Subscribers: you, leveldb

Differential Revision: https://reviews.facebook.net/D21939
2014-08-20 15:53:39 -07:00
sdong 10720a5587 Revert the unintended change that DestroyDB() doesn't clean up info logs.
Summary: A previous change triggered a change by mistake: DestroyDB() will keep info logs under DB directory. Revert the unintended change.

Test Plan: Add a unit test case to verify it.

Reviewers: ljin, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22209
2014-08-20 12:07:32 -07:00
Igor Canadi c8ecfaedd0 Merge pull request #230 from cockroachdb/spencerkimball/send-user-keys-to-v2-filter
Pass parsed user key to prefix extractor in V2 compaction
2014-08-18 11:09:30 -04:00
sdong 58b0f9d890 Support purging logs from separate log directory
Summary:
1. Support purging info logs from a path separate from the DB path. Refactor the code that generates info log prefixes so that it can be called both when generating new files and when scanning the log directory.
2. Fix the bug of not scanning multiple DB paths (should only impact multiple DB paths)

Test Plan:
Add unit test for generating and parsing info log files
Add end-to-end test in db_test

Reviewers: yhchiang, ljin

Reviewed By: ljin

Subscribers: leveldb, igor, dhruba

Differential Revision: https://reviews.facebook.net/D21801
2014-08-14 13:22:50 -07:00
Feng Zhu 5e642403a9 log db path info before open
Summary: 1. write db MANIFEST, CURRENT, IDENTITY, sst files, log files to log before open

Test Plan: run db and check LOG file

Reviewers: ljin, yhchiang, igor, dhruba, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D21459
2014-08-13 13:45:13 -07:00
sdong 48081777f3 Revert "Include candidate files under options.db_log_dir in FindObsoleteFiles()"
This reverts commit 54153ab07a.
2014-08-12 18:14:27 -07:00
Lei Jin 218857b3f5 remove tailing_iter.h/cc
Summary: as title

Test Plan:
make all check
ran db_bench and saw seek stats at the end

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D21651
2014-08-12 17:13:15 -07:00
Lei Jin 5d0074c471 set bytes_per_sync to 1MB if rate limiter is enabled
Summary: as title

Test Plan: make all check

Reviewers: igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D21201
2014-08-12 16:42:18 -07:00
Spencer Kimball 3fcf7b26b9 Pass parsed user key to prefix extractor in V2 compaction
Previously, the prefix extractor was being supplied with the RocksDB
key instead of a parsed user key. This makes correct interpretation
by the calling application fragile or impossible.
2014-08-12 18:48:28 -04:00
Stanislau Hlebik 2fa643466d Add scope guard
Summary: Small change: replace mutex_.Lock/mutex_.Unlock() with scope guard
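
As an illustration of why a scope guard is preferable to paired Lock()/Unlock() calls, here is a minimal sketch using the standard library. It is not the code from this diff: `GuardedCounter` is a hypothetical class, and `std::mutex` with `std::lock_guard` stands in for RocksDB's own mutex wrappers.

```cpp
#include <mutex>

// Hypothetical counter protected by a mutex; names are illustrative only.
class GuardedCounter {
 public:
  // Returns false without incrementing once the limit is reached.
  bool IncrementIfBelow(int limit) {
    std::lock_guard<std::mutex> lock(mutex_);  // unlocks on every return
    if (value_ >= limit) {
      return false;  // early return: no manual Unlock() to forget
    }
    ++value_;
    return true;
  }

  int value() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return value_;
  }

 private:
  mutable std::mutex mutex_;
  int value_ = 0;
};
```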

Test Plan: make all check

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D21609
2014-08-12 12:13:13 -07:00
Stanislau Hlebik 06a52bda64 Flush only one column family
Summary:
Currently DBImpl::Flush() triggers flushes in all column families.
Instead we need to trigger just the column family specified.

Test Plan: make all check

Reviewers: igor, ljin, yhchiang, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D20841
2014-08-11 22:10:32 -07:00
miguelportilla 93e6b5e9d9 Changes to support unity build:
* Script for building the unity.cc file via Makefile
* Unity executable Makefile target for testing builds
* Source code changes to fix compilation of unity build
2014-08-11 13:22:47 -04:00
sdong 54153ab07a Include candidate files under options.db_log_dir in FindObsoleteFiles()
Summary: In FindObsoleteFiles(), we don't scan db_log_dir. Add it.

Test Plan: make all check

Reviewers: ljin, igor, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, yhchiang

Differential Revision: https://reviews.facebook.net/D21429
2014-08-08 17:37:03 -07:00
sdong 4632239d13 Need to schedule compactions when manual compaction finishes
Summary: If an outstanding compaction is scheduled when a manual compaction is triggered, the manual compaction will preempt it. At the end of the manual compaction, we should try to schedule compactions to make sure the preempted ones are not skipped.

Test Plan: make all check

Reviewers: yhchiang, ljin

Reviewed By: ljin

Subscribers: leveldb, dhruba, igor

Differential Revision: https://reviews.facebook.net/D21321
2014-08-08 12:28:36 -07:00
Igor Canadi 5e0868147d Fix SIGSEGV in travis
Summary:
Travis build was failing a lot. For example see https://travis-ci.org/facebook/rocksdb/builds/31425845

This fixes it.

Also, please don't put any code after SignalAll :)

Test Plan: no more SIGSEGV

Reviewers: yhchiang, sdong, ljin

Reviewed By: ljin

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D21417
2014-08-08 10:24:00 -07:00
Igor Canadi f8d6a2981f Merge pull request #224 from cockroachdb/spencerkimball/compaction-filter-v2-c-bindings
Add support for C bindings to the compaction V2 filter mechanism.
2014-08-07 14:10:54 -04:00
sdong 7abe9655d3 Fix valgrind failure caused by recent check-in.
Summary: Initialize un-initialized parameters

Test Plan: run the failed test (c_test)

Reviewers: yhchiang, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D21249
2014-08-06 17:45:47 -07:00
Spencer Kimball 38e8b727a8 Fix typo, add missing inclusion of state void* in invocation of
create_compaction_filter_v2_.
2014-08-06 18:42:15 -04:00
Spencer Kimball c1f588af71 Add support for C bindings to the compaction V2 filter mechanism.
Test Plan: make c_test && ./c_test

Some fixes after merge.
2014-08-06 15:55:48 -04:00
sdong 1242bfcad7 Add DB property "rocksdb.estimate-table-readers-mem"
Summary:
Add a DB Property "rocksdb.estimate-table-readers-mem" to return estimated memory usage by all loaded table readers, other than allocated from block cache.

Refactor the property codes to allow getting property from a version, with DB mutex not acquired.

Test Plan: Add several checks of this new property in existing codes for various cases.

Reviewers: yhchiang, ljin

Reviewed By: ljin

Subscribers: xjin, igor, leveldb

Differential Revision: https://reviews.facebook.net/D20733
2014-08-06 11:39:46 -07:00
Feng Zhu 1129921e9b logging_when_create_and_delete_manifest
Summary:
  1. logging when create and delete manifest file
  2. fix formatting in table/format.cc

Test Plan:
  make all check
  run db_bench, track the LOG file.

Reviewers: ljin, yhchiang, igor, yufei.zhu, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D21009
2014-08-04 11:25:42 -07:00
Igor Canadi e4c3673923 Never CompactRange to level 0 in level compaction
Summary: I was bit by this when developing SpatialDB. In case all files are at level 0, CompactRange() will output the compacted files to level 0. This is not ideal, since read amp. is much better at level 1 and higher.

Test Plan: Compacted data in SpatialDB, read manifest using ldb, verified that files are now at level 1 instead of 0.

Reviewers: sdong, ljin, yhchiang, dhruba

Reviewed By: dhruba

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D20901
2014-08-01 06:41:48 -07:00
Yueh-Hsuan Chiang 49ee5a4ac4 Fixed the crash when merge_operator is not properly set after reopen.
Summary:
Fixed the crash when merge_operator is not properly set after reopen
and added two test cases for this.

Test Plan:
make merge_test
./merge_test

Reviewers: igor, ljin, sdong

Reviewed By: sdong

Subscribers: benj, mvikjord, leveldb

Differential Revision: https://reviews.facebook.net/D20793
2014-07-30 17:24:36 -07:00
sdong f04356e660 Add DB::GetIntProperty() to return integer properties as integers
Summary: We have quite a few properties that are integers and we are adding more. Add a function to return them directly as an integer instead of a string.

Test Plan: Add several unit test checks

Reviewers: yhchiang, igor, dhruba, haobo, ljin

Reviewed By: ljin

Subscribers: yoshinorim, leveldb

Differential Revision: https://reviews.facebook.net/D20637
2014-07-28 16:55:57 -07:00
Lei Jin 7e8bb71dd0 InternalStats to take cfd on constructor
Summary:
It has one-to-one relationship with CFD. Take a pointer to CFD on
constructor to avoid passing cfd through member functions.

Test Plan: make

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D20565
2014-07-28 12:27:08 -07:00
Lei Jin 1bd3431f7c Change StopWatch interface
Summary: So that we can avoid calling NowSecs() in MakeRoomForWrite twice

Test Plan: make all check

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D20529
2014-07-28 12:22:37 -07:00
Lei Jin f6ca226c17 make statistics forward-able
Summary:
Make StatisticsImpl able to forward stats to a provided statistics
implementation. The main purpose is to allow us to collect internal
stats in the future even when the user supplies a custom statistics
implementation. It avoids instrumenting 2 sets of stats collection code.
One immediate use case is the tuning advisor, which needs to collect some
internal stats that users may not be interested in.
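
The forwarding idea can be sketched as a decorator. This is a minimal illustration and not RocksDB's Statistics API: `TickerSink`, `CountingSink`, and `ForwardingSink` are hypothetical names, and string ticker names are an assumption made to keep the example short.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical minimal interface for recording ticker counts.
class TickerSink {
 public:
  virtual ~TickerSink() = default;
  virtual void RecordTick(const std::string& ticker, uint64_t count) = 0;
};

// Keeps its own per-ticker totals.
class CountingSink : public TickerSink {
 public:
  void RecordTick(const std::string& ticker, uint64_t count) override {
    counts_[ticker] += count;
  }

  uint64_t Get(const std::string& ticker) const {
    auto it = counts_.find(ticker);
    return it == counts_.end() ? 0 : it->second;
  }

 private:
  std::map<std::string, uint64_t> counts_;
};

// Records internally and also forwards each tick to a user-supplied sink,
// so the instrumented code calls RecordTick only once.
class ForwardingSink : public CountingSink {
 public:
  explicit ForwardingSink(TickerSink* user_sink) : user_sink_(user_sink) {}

  void RecordTick(const std::string& ticker, uint64_t count) override {
    CountingSink::RecordTick(ticker, count);
    if (user_sink_ != nullptr) {
      user_sink_->RecordTick(ticker, count);
    }
  }

 private:
  TickerSink* user_sink_;  // not owned; may be null
};
```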

Test Plan:
ran db_bench and see stats show up at the end of run
Will run make all check since some tests rely on statistics

Reviewers: yhchiang, sdong, igor

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D20145
2014-07-28 12:10:49 -07:00
Lei Jin 40fa8a4cd5 make statistics forward-able
Summary:
Make StatisticsImpl able to forward stats to a provided statistics
implementation. The main purpose is to allow us to collect internal
stats in the future even when the user supplies a custom statistics
implementation. It avoids instrumenting 2 sets of stats collection code.
One immediate use case is the tuning advisor, which needs to collect some
internal stats that users may not be interested in.

Test Plan:
ran db_bench and see stats show up at the end of run
Will run make all check since some tests rely on statistics

Reviewers: yhchiang, sdong, igor

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D20145
2014-07-28 12:05:36 -07:00
sdong f6b7e1ed1a Allow user to specify DB path of output file of manual compaction
Summary: Add a parameter path_id to DB::CompactRange() to indicate where the output file should be placed.

Test Plan: add a unit test

Reviewers: yhchiang, ljin

Reviewed By: ljin

Subscribers: xjin, igor, dhruba, MarkCallaghan, leveldb

Differential Revision: https://reviews.facebook.net/D20085
2014-07-21 19:06:00 -07:00
Lei Jin f6f1533c6f make internal stats independent of statistics
Summary:
also make it aware of column family
output from db_bench

```
** Compaction Stats [default] **
Level Files Size(MB) Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s)  Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt)  Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0    14      956   0.9      0.0     0.0      0.0       2.7      2.7    0.0   0.0      0.0    111.6        0         0         0         0         24        40    0.612      75.20     492387    0.15
  L1    21     2001   2.0      5.7     2.0      3.7       5.3      1.6    5.4   2.6     71.2     65.7       31        43        55        12         82         2   41.242      43.72      41183    1.06
  L2   217    18974   1.9     16.5     2.0     14.4      15.1      0.7   15.6   7.4     70.1     64.3       17       182       185         3        241        16   15.052       0.00          0    0.00
  L3  1641   188245   1.8      9.1     1.1      8.0       8.5      0.5   15.4   7.4     61.3     57.2        9        75        76         1        152         9   16.887       0.00          0    0.00
  L4  4447   449025   0.4     13.4     4.8      8.6       9.1      0.5    4.7   1.9     77.8     52.7       38        79       100        21        176        38    4.639       0.00          0    0.00
 Sum  6340   659201   0.0     44.7    10.0     34.7      40.6      6.0   32.0  15.2     67.7     61.6       95       379       416        37        676       105    6.439     118.91     533570    0.22
 Int     0        0   0.0      1.2     0.4      0.8       1.3      0.5    5.2   2.7     59.1     65.6        3         7         9         2         20        10    2.003       0.00          0    0.00
Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown
Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown

** DB Stats **
Uptime(secs): 202.1 total, 13.5 interval
Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB
Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written
Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB
Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written
```

Test Plan: ran it

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 12:57:29 -07:00
Yueh-Hsuan Chiang 3178510153 Allow class Compaction to handle input files from multiple levels.
Summary:
Allow class Compaction to handle input files from multiple levels.
This diff is a subset of https://reviews.facebook.net/D19263 where
only db/compaction.cc and db/compaction.h are changed.

Test Plan:
make db_test
export ROCKSDB_TESTS=Compaction
./db_test

Reviewers: igor, sdong, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19923
2014-07-17 14:36:41 -07:00
Feng Zhu 87895c62db fix bug in LOG for flush memtable
Summary:
  One line change to fix a bug in the LOG when flush memtable

Test Plan:
  NONE

Reviewers: sdong

Reviewed By: sdong

Differential Revision: https://reviews.facebook.net/D20049
2014-07-16 16:56:49 -07:00
sdong 0abaed2e08 Support multiple DB directories in universal compaction style
Summary:
This patch adds a target size parameter in options.db_paths, and universal compaction will use it to determine which DB path to place a new file in.
Level-style stays the same.

Test Plan: Add new unit tests

Reviewers: ljin, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, dhruba, igor, leveldb

Differential Revision: https://reviews.facebook.net/D19869
2014-07-15 12:06:28 -07:00
Igor Canadi 20c056306b Remove stats logger
Summary: Browsing through the code, looks like StatsLogger is not used at all!

Test Plan: compiles

Reviewers: ljin, sdong, yhchiang, dhruba

Reviewed By: dhruba

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D19827
2014-07-15 09:16:32 -04:00
Lei Jin 46f0f6ddd5 improve InternalStats output
Summary: as title

Test Plan:
sample output:
Level Files Size(MB) Score Read(GB)  Rn(GB) Rnp1(GB) Write(BG) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s)  Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt)  Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0    15     1024   1.0      0.0     0.0      0.0       8.2      8.2    0.0   0.0      0.0    111.4        0         0         1         1         75       123    0.612     295.94    1939238    0.15
  L1    23     2118   2.1     20.9     8.3     12.7      20.0      7.3    5.0   2.4     73.2     69.9      124       141       208        67        293         8   36.582      17.05      16100    1.06
  L2   162    15333   1.5     47.0     7.1     40.0      42.6      2.6   12.7   6.0     67.9     61.5       62       457       482        25        709        55   12.898       0.00          0    0.00
  L3   985   108065   1.1     37.8     4.0     33.9      36.9      3.0   18.8   9.3     60.1     58.5       41       338       363        25        645        31   20.812       0.00          0    0.00
  L4  2788   356033   0.3      0.0     0.0      0.0       0.0      0.0    0.0   0.0      0.0      0.0        0         0         0         0          0         0    0.000       0.00          0    0.00
 Sum  3973   482572   0.0    105.8    19.3     86.5     107.7     21.2   11.1   5.6     62.9     64.0      227       936      1054       118       1723       217    7.938     312.99    1955338    0.16

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19707
2014-07-11 15:03:30 -07:00
Feng Zhu 178fd6f9db use FileLevel in LevelFileNumIterator
Summary:
  Use FileLevel in LevelFileNumIterator, thus use new version of findFile.
  Old version of findFile function is deleted.
  Write a function in version_set.cc to generate FileLevel from files_.
  Add GenerateFileLevelTest in version_set_test.cc

Test Plan:
  make all check

Reviewers: ljin, haobo, yhchiang, sdong

Reviewed By: sdong

Subscribers: igor, dhruba

Differential Revision: https://reviews.facebook.net/D19659
2014-07-11 12:52:41 -07:00
Lei Jin 534357ca3a integrate rate limiter into rocksdb
Summary:
Add option and plugin rate limiter for PosixWritableFile. The rate
limiter only applies to flush and compaction. WAL and MANIFEST are
excluded from this enforcement.
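
The exemption rule can be illustrated with a reduced byte-budget model. This is a sketch only, not the rate limiter added by this diff: `FileKind`, `ByteBudget`, and the explicit `Refill()` call (in place of a timer) are assumptions chosen to keep the example deterministic.

```cpp
#include <algorithm>
#include <cstdint>

enum class FileKind { kTableFile, kWal, kManifest };

// Hypothetical per-period byte budget. Table files written by flush and
// compaction draw from the budget; WAL and MANIFEST writes bypass it.
class ByteBudget {
 public:
  explicit ByteBudget(int64_t bytes_per_period)
      : bytes_per_period_(bytes_per_period), available_(bytes_per_period) {}

  // Called once per period; a real limiter would drive this from a clock.
  void Refill() { available_ = bytes_per_period_; }

  // Returns how many of the requested bytes may be written right now.
  int64_t Request(FileKind kind, int64_t bytes) {
    if (kind != FileKind::kTableFile) {
      return bytes;  // WAL and MANIFEST are never throttled
    }
    int64_t granted = std::min(bytes, available_);
    available_ -= granted;
    return granted;
  }

 private:
  int64_t bytes_per_period_;
  int64_t available_;
};
```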

Test Plan: db_test

Reviewers: igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19425
2014-07-08 12:31:49 -07:00
Yueh-Hsuan Chiang d33657a4a5 Fixed a warning in release mode.
Summary: Removed a variable that is only used in assertion check.

Test Plan: make release

Reviewers: ljin, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19455
2014-07-03 17:19:17 -07:00
Yueh-Hsuan Chiang 90a6aca48e Finer report I/O stats about Flush and Compaction.
Summary:
This diff allows the I/O stats about Flush and Compaction to be reported
in a more accurate way.  Instead of measuring the size of a file, it
measures I/O cost on a per-read / per-write basis.

Test Plan: make all check

Reviewers: sdong, igor, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19383
2014-07-03 16:28:03 -07:00
Yueh-Hsuan Chiang d4d338de33 Add timeout_hint_us to WriteOptions and introduce Status::TimeOut.
Summary:
This diff adds timeout_hint_us to WriteOptions.  If it's non-zero, then
1) writes associated with these options MAY be aborted when they have been
  waiting for longer than the specified time.  If a write is aborted,
  it will return Status::TimeOut.
2) the stall time of the associated write caused by flush or compaction
  will be limited by timeout_hint_us.

The default value of timeout_hint_us is 0 (i.e., OFF.)

The statistics of timeout writes will be recorded in WRITE_TIMEDOUT.
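
The waiting behavior can be sketched with a condition variable. This is a minimal illustration under assumptions, not the DBImpl write path: `WriteGate`, `WriteStatus`, and `WaitForRoom` are hypothetical names, and a single "stalled" flag stands in for the real flush and compaction conditions.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

enum class WriteStatus { kOk, kTimedOut };

// Hypothetical write gate: writers wait here while the DB is stalled.
class WriteGate {
 public:
  void SetStalled(bool stalled) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stalled_ = stalled;
    }
    cv_.notify_all();
  }

  // timeout_hint_us == 0 means the timeout is off: wait until unstalled.
  WriteStatus WaitForRoom(uint64_t timeout_hint_us) {
    std::unique_lock<std::mutex> lock(mutex_);
    auto not_stalled = [this] { return !stalled_; };
    if (timeout_hint_us == 0) {
      cv_.wait(lock, not_stalled);
      return WriteStatus::kOk;
    }
    // wait_for returns the predicate's value when it stops waiting, so
    // false here means the stall outlasted the hint.
    bool ok = cv_.wait_for(
        lock, std::chrono::microseconds(timeout_hint_us), not_stalled);
    return ok ? WriteStatus::kOk : WriteStatus::kTimedOut;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool stalled_ = false;
};
```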

Test Plan:
export ROCKSDB_TESTS=WriteTimeoutAndDelayTest
make db_test
./db_test

Reviewers: igor, ljin, haobo, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18837
2014-07-03 15:47:02 -07:00
sdong 2459f7ec4e Support Multiple DB paths (without having an interface to expose to users)
Summary:
In this patch, we allow RocksDB to support multiple DB paths internally.
No user interface is supported yet so this patch is silent to users.

Test Plan: make all check

Reviewers: igor, haobo, ljin, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18921
2014-07-02 21:14:44 -07:00
Igor Canadi f146cab261 Centralize compression decision to compaction picker
Summary:
Before this diff, we're deciding enable_compression in CompactionPicker and then we're deciding final compression type in DBImpl. This is kind of confusing.

After the diff, the final compression type will be decided in CompactionPicker.

The reason for this is that I want CompactFiles() to specify output compression type, so that people can mix and match compression styles in their compaction algorithms. This diff makes it much easier to do that.

Test Plan: make check

Reviewers: dhruba, haobo, sdong, yhchiang, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19137
2014-07-02 20:40:57 +02:00
sdong dd337bc0b2 In logging format, use PRIu64 instead of casting
Summary: Code cleanup. Since we are already using __STDC_FORMAT_MACROS when printing uint64_t, change the other places as well. Only logging is changed.

Test Plan: make all check

Reviewers: ljin

Reviewed By: ljin

Subscribers: dhruba, yhchiang, haobo, leveldb

Differential Revision: https://reviews.facebook.net/D19113
2014-06-27 16:34:15 -07:00
Stanislau Hlebik a3594867ba Cache some conditions for DBImpl::MakeRoomForWrite
Summary:
Task 4580155. Some conditions in DBImpl::MakeRoomForWrite can be cached in
ColumnFamilyData, because their values can change only during compaction,
when adding a new memtable, and/or when recalculating the compaction score.

These conditions are:

cfd->imm()->size() ==  cfd->options()->max_write_buffer_number - 1
cfd->current()->NumLevelFiles(0) >=  cfd->options()->level0_stop_writes_trigger
cfd->options()->soft_rate_limit > 0.0 &&
    (score = cfd->current()->MaxCompactionScore()) >  cfd->options()->soft_rate_limit
cfd->options()->hard_rate_limit > 1.0 &&
    (score = cfd->current()->MaxCompactionScore()) >  cfd->options()->hard_rate_limit
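
The caching idea can be sketched for the first two conditions. This is an illustration only, not the ColumnFamilyData code: `StallInputs` and `CachedStallConditions` are hypothetical names, and the default values are placeholders.

```cpp
// Hypothetical snapshot of the values the conditions depend on.
struct StallInputs {
  int immutable_memtables = 0;
  int max_write_buffer_number = 2;
  int level0_files = 0;
  int level0_stop_writes_trigger = 24;
};

// The conditions are recomputed only when their inputs can change, and
// the write path reads the cached flags instead of re-evaluating them.
class CachedStallConditions {
 public:
  // Call after events that can change the inputs, such as a compaction
  // finishing or a new memtable being added.
  void Recalculate(const StallInputs& in) {
    wait_for_memtable_flush_ =
        in.immutable_memtables == in.max_write_buffer_number - 1;
    wait_for_level0_ = in.level0_files >= in.level0_stop_writes_trigger;
  }

  bool wait_for_memtable_flush() const { return wait_for_memtable_flush_; }
  bool wait_for_level0() const { return wait_for_level0_; }

 private:
  bool wait_for_memtable_flush_ = false;
  bool wait_for_level0_ = false;
};
```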

P.S.
As it's my first diff, Siying suggested adding everybody as a reviewer
for this diff. Sorry if I forgot someone or added someone by mistake.

Test Plan: make all check

Reviewers: haobo, xjin, dhruba, yhchiang, zagfox, ljin, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19311
2014-06-26 16:45:27 -07:00
Igor Canadi d4a8423334 Remove seek compaction
Summary:
As discussed in our internal group, we don't get much use of seek compaction at the moment, while it's making code more complicated and slower in some cases.

This diff removes seek compaction and (hopefully) all code that was introduced to support seek compaction.

There is one test case that relied on didIO information. I'll try to find another way to implement it.

Test Plan: make check

Reviewers: sdong, haobo, yhchiang, ljin, dhruba

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19161
2014-06-20 10:23:02 +02:00
Yueh-Hsuan Chiang 4f5ccfd179 Fixed a potential write hang
Summary:
Currently, when something goes wrong in DB::Write() while the write queue
contains more than one element, the current design seems to forget to clean up
the queue and wake up all the writers, which can potentially make rocksdb
hang on writes.

Test Plan: make all check

Reviewers: sdong, ljin, igor, haobo

Reviewed By: haobo

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19167
2014-06-19 14:53:03 -07:00
Lei Jin c4e90c79ed bug fix: iteration over ColumnFamilySet needs to be under mutex
Summary: asan_crash_test is failing on segfault

Test Plan: running asan_crash_test

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19149
2014-06-19 09:31:14 -07:00
sdong cadc1adffa Refactor: group metadata needed to open an SST file to a separate copyable struct
Summary:
We added multiple fields to FileMetaData recently and are planning to add more.
This refactoring separates out the minimum information needed to access the file. This object is copyable (FileMetaData is not copyable because of the ref counter). I hope this refactoring can enable further improvements:

(1) use it to design a more efficient data structure to speed up read queries.
(2) in the future, when we add information about the storage level, we can easily do the encoding instead of enlarging this structure, which might expand the memory working set for file metadata.

The definition is the same as the current EncodedFileMetaData used in the two level iterator, so the logic in the two level iterator is now easier to understand.

Test Plan: make all check

Reviewers: haobo, igor, ljin

Reviewed By: ljin

Subscribers: leveldb, dhruba, yhchiang

Differential Revision: https://reviews.facebook.net/D18933
2014-06-16 16:10:52 -07:00
Igor Canadi a0191c9dfe Create Missing Column Families
Summary: Provide a convenience option to create column families if they are missing from the DB. Task #4460490

Test Plan: added unit test. also, stress test for some time

Reviewers: sdong, haobo, dhruba, ljin, yhchiang

Reviewed By: yhchiang

Subscribers: yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D18951
2014-06-06 18:04:56 -07:00
Igor Canadi 99d3eed2fd Write Fast-path for single column family
Summary: We have a perf regression of Write() even with one column family. Make fast path for single column family to avoid the perf regression. See task #4455480

Test Plan: make check

Reviewers: sdong, ljin

Reviewed By: sdong, ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D18963
2014-06-06 17:26:23 -07:00
Igor Canadi 5d870717ae Correctly preallocate files in universal compaction
Summary: In universal compaction, MaxFileSizeForLevel is ULLONG_MAX. We've been preallocating files to ULLONG_MAX size all this time :)

Test Plan: make check

Reviewers: dhruba, haobo, ljin, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D18915
2014-06-05 13:19:35 -07:00
Igor Canadi fd27001072 Fix compile errors on Mac
Summary: https://phabricator.fb.com/P11372644

Test Plan: compiles

Reviewers: sdong, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D18873
2014-06-03 12:28:58 -07:00
sdong df9069d23f In DB::NewIterator(), try to allocate the whole iterator tree in an arena
Summary:
In this patch, try to allocate the whole iterator tree, starting from DBIter, from an arena:
1. ArenaWrappedDBIter is created to serve as the entry point of an iterator tree, with an arena in it.
2. Add an option to create the following iterators from an arena: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and the two-level iterator.
3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to the mem table list and the version set, which add iterators to it.

Limitations:
(1) Only DB::NewIterator() without tailing uses the arena. Other cases, including read-only DB and compactions, still allocate with malloc.
(2) The two-level iterator itself is allocated in the arena, but the iterators inside it are not.

Test Plan: make all check

Reviewers: ljin, haobo

Reviewed By: haobo

Subscribers: leveldb, dhruba, yhchiang, igor

Differential Revision: https://reviews.facebook.net/D18513
2014-06-02 17:44:57 -07:00
Igor Canadi 91ddd587cc Only signal cond variable if need to
Summary:
At the end of BackgroundCallCompaction(), we call SignalAll(), even though we don't need to. If compaction hasn't done anything and there's another compaction running, there is no need to signal on the condition variable. Doing so creates a tight feedback loop which results in log files like:

   wait for memtable flush
   compaction nothing to do
   wait for memtable flush
   compaction nothing to do

This change eliminates that

Test Plan:
make check
Also:

    icanadi@dev1440 ~ $ grep "nothing to do" /fast-rocksdb-tmp/rocksdb_test/column_family_test/LOG | wc -l
    7435
    icanadi@dev1440 ~ $ grep "nothing to do" /fast-rocksdb-tmp/rocksdb_test/column_family_test/LOG | wc -l
    372

First version is before the change, second version is after the change.

Reviewers: dhruba, ljin, haobo, yhchiang, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D18855
2014-06-02 17:23:55 -07:00
Igor Canadi 8cb7ad83c3 Flush stale column families less aggressively
Summary:
We've seen some production issues where a column family is detected as stale, although there is only one column family in the system. This is a quick fix that:
1) doesn't flush stale column families if there's only one of them
2) uses 4 as a coefficient instead of 2 for determining when a column family is stale. This makes flushing less aggressive, while still keeping a nice dynamic flushing of very stale CFs.

Test Plan: make check

Reviewers: dhruba, haobo, ljin, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D18861
2014-06-02 15:33:54 -07:00
Lei Jin 388d2054c7 forward iterator
Summary:
Forward iterator puts everything together in a flat structure instead of
a hierarchy of nested iterators. This should simplify the code and
provide better performance. It also enables more optimization since all
information is accessible in one place.
Initial evaluation shows about 6% improvement.

Test Plan: db_test and db_bench

Reviewers: dhruba, igor, tnovak, sdong, haobo

Reviewed By: haobo

Subscribers: sdong, leveldb

Differential Revision: https://reviews.facebook.net/D18795
2014-05-30 14:31:55 -07:00
Igor Canadi 6de6a06631 FIFO compaction style
Summary:
Introducing new compaction style -- FIFO.

FIFO compaction style has write amplification of 1 (+1 for WAL) and it deletes the oldest files when the total DB size exceeds pre-configured values.

FIFO compaction style is suited for storing high-frequency event logs.
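The deletion rule described above can be sketched as follows. This is a minimal illustration, not the picker added by this diff: `SstFile` and `PickFifoDeletions` are hypothetical names, and the file list is assumed to be ordered oldest to newest.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical stand-in for an SST file's metadata; not a RocksDB type.
struct SstFile {
  uint64_t number;  // assumed: a larger number means a newer file
  uint64_t size;    // size in bytes
};

// Returns the numbers of the files to delete, oldest first, so that the
// remaining total size is at most max_total_size.
// `files` must be ordered from oldest to newest.
std::vector<uint64_t> PickFifoDeletions(const std::deque<SstFile>& files,
                                        uint64_t max_total_size) {
  uint64_t total = 0;
  for (const SstFile& f : files) total += f.size;

  std::vector<uint64_t> to_delete;
  for (const SstFile& f : files) {
    if (total <= max_total_size) break;  // back under the limit
    assert(total >= f.size);
    to_delete.push_back(f.number);       // drop the oldest remaining file
    total -= f.size;
  }
  return to_delete;
}
```

Because files are only ever dropped and never rewritten, each byte is written once by the flush, which matches the write amplification of 1 stated above.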

Test Plan: Added a unit test

Reviewers: dhruba, haobo, sdong

Reviewed By: dhruba

Subscribers: alberts, leveldb

Differential Revision: https://reviews.facebook.net/D18765
2014-05-21 11:43:35 -07:00
Yueh-Hsuan Chiang 1c7799d8aa Fixed a file-not-found issue when a log file is moved to archive.
Summary:
Fixed a file-not-found issue when a log file is moved to archive
by doing a missing retry.

Test Plan:
make db_test
export ROCKSDB_TEST=TransactionLogIteratorRace
./db_test

Reviewers: sdong, haobo

Reviewed By: sdong

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D18669
2014-05-12 17:50:21 -07:00
sdong 9efbd85ac9 fsync directory after creating current file in NewDB()
Summary: One of our users reported current file corruption. The machine was rebooted during the time. This is the only thing I can think of which could cause current file corruption. Just add this paranoid check.

Test Plan: make all check

Reviewers: haobo, igor

Reviewed By: haobo

CC: yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18495
2014-05-06 17:51:33 -07:00
Igor Canadi 16f1aa7b2d Fix signed/unsigned compare 2014-04-30 14:38:01 -04:00
Igor Canadi df70047669 Flush stale column families
Summary:
Added a new option `max_total_wal_size`. Once the total WAL size goes over that, we make an attempt to flush all column families that still have data in the earliest WAL file.

By default, I calculate `max_total_wal_size` dynamically, which should be good enough for non-advanced customers.
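A rough sketch of the selection step, under stated assumptions: each column family is assumed to track the number of the oldest WAL file that still holds its unflushed data, and `ColumnFamilyState` and `PickColumnFamiliesToFlush` are hypothetical names rather than the code in this diff.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-column-family bookkeeping; not a RocksDB type.
struct ColumnFamilyState {
  std::string name;
  uint64_t oldest_live_wal;  // oldest WAL number holding unflushed data
};

// Once the total WAL size exceeds the limit, returns the names of the column
// families that still reference the earliest WAL file. Flushing them lets
// that WAL file be released.
std::vector<std::string> PickColumnFamiliesToFlush(
    const std::vector<ColumnFamilyState>& cfs, uint64_t total_wal_size,
    uint64_t max_total_wal_size) {
  std::vector<std::string> result;
  if (total_wal_size <= max_total_wal_size || cfs.empty()) return result;

  uint64_t earliest = cfs.front().oldest_live_wal;
  for (const ColumnFamilyState& cf : cfs) {
    earliest = std::min(earliest, cf.oldest_live_wal);
  }
  for (const ColumnFamilyState& cf : cfs) {
    if (cf.oldest_live_wal == earliest) result.push_back(cf.name);
  }
  return result;
}
```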

Test Plan: Added a test

Reviewers: dhruba, haobo, sdong, ljin, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18345
2014-04-30 14:33:40 -04:00
Yueh-Hsuan Chiang 9d9d2965cb Add a new mem-table representation based on cuckoo hash.
Summary:
= Major Changes =
* Add a new mem-table representation, HashCuckooRep, which is based on cuckoo hashing.
  Cuckoo hashing uses multiple hash functions.  This allows each key to have multiple
  possible locations in the mem-table.

  - Put: When inserting a key, it tries to find a vacant slot among the key's
    possible locations and stores the key there.  If none of its possible
    locations is available, it kicks out a victim key and
    stores the new key at that location.  The kicked-out victim key is then
    stored at a vacant slot among its own possible locations, or kicks out
    another victim.  In this diff, the kick-out path (known as the
    cuckoo path) is found using BFS, which guarantees that it is the shortest.

  - Get: Simply tries all possible locations of a key --- this guarantees
    worst-case constant time complexity.

  - Time complexity: O(1) for Get, and average O(1) for Put if the
    fullness of the mem-table is below 80%.

  - Two hash functions are used by default; the number of hash functions used
    by the cuckoo hash may increase dynamically if it fails to find a
    short-enough kick-out path.

  - Currently, HashCuckooRep does not support iteration and snapshots,
    as our current main purpose is to optimize point access.
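The Put and Get behaviour above can be illustrated with a much smaller structure. This sketch is not HashCuckooRep: `TinyCuckooSet` is a hypothetical class that stores bare integer keys, always uses two hash functions, and displaces victims greedily instead of searching for the shortest cuckoo path with BFS. If the displacement chain gets too long, the insert is undone and reported as failed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical toy cuckoo set; greedy kick-out, NOT the BFS used in the diff.
class TinyCuckooSet {
 public:
  explicit TinyCuckooSet(size_t buckets)
      : slots_(buckets == 0 ? 1 : buckets) {}

  // Get: probe every candidate location, i.e. a constant number of lookups.
  bool Contains(uint64_t key) const {
    for (int i = 0; i < kNumHashes; ++i) {
      const Slot& s = slots_[Bucket(key, i)];
      if (s.used && s.key == key) return true;
    }
    return false;
  }

  // Put: take a vacant candidate slot if there is one, otherwise evict an
  // occupant and try to re-place it. Returns false (leaving the table
  // unchanged) if no vacancy is found within kMaxKicks evictions.
  bool Insert(uint64_t key) {
    if (Contains(key)) return true;
    std::vector<size_t> path;       // slots we evicted from, in order
    uint64_t cur = key;
    size_t avoid = slots_.size();   // sentinel: nothing to avoid yet
    for (int kick = 0; kick <= kMaxKicks; ++kick) {
      for (int i = 0; i < kNumHashes; ++i) {
        Slot& s = slots_[Bucket(cur, i)];
        if (!s.used) {
          s.used = true;
          s.key = cur;
          return true;
        }
      }
      if (kick == kMaxKicks) break;
      // Both candidates are taken: evict one, preferring not to bounce
      // straight back into the slot the current key was just kicked out of.
      size_t victim = Bucket(cur, 0);
      if (victim == avoid) victim = Bucket(cur, 1);
      std::swap(cur, slots_[victim].key);
      path.push_back(victim);
      avoid = victim;
    }
    // Give up: replay the swaps in reverse so no stored key is lost.
    for (std::vector<size_t>::reverse_iterator it = path.rbegin();
         it != path.rend(); ++it) {
      std::swap(cur, slots_[*it].key);
    }
    return false;
  }

 private:
  struct Slot {
    bool used = false;
    uint64_t key = 0;
  };
  static const int kNumHashes = 2;
  static const int kMaxKicks = 32;

  // Two cheap integer mixers selected by i; the constants are arbitrary.
  size_t Bucket(uint64_t key, int i) const {
    uint64_t h = key + 0x9e3779b97f4a7c15ULL * static_cast<uint64_t>(i + 1);
    h ^= h >> 33;
    h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33;
    return static_cast<size_t>(h % slots_.size());
  }

  std::vector<Slot> slots_;
};
```

A production table would typically grow or rehash when an insert fails; the sketch only reports the failure.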

= Minor Changes =
* Add IsSnapshotSupported() to DB to indicate whether the current DB
  supports snapshots.  If it returns false, then DB::GetSnapshot() will
  always return nullptr.

Test Plan:
Run existing tests.  Will develop a test specifically for cuckoo hash in
the next diff.

Reviewers: sdong, haobo

Reviewed By: sdong

CC: leveldb, dhruba, igor

Differential Revision: https://reviews.facebook.net/D16155
2014-04-29 17:13:46 -07:00
Igor Canadi dd9eb7a7d5 Cache result of ReadFirstRecord()
Summary:
ReadFirstRecord() reads the actual log file from disk on every call. This diff introduces a cache layer on top of ReadFirstRecord(), which should significantly speed up repeated calls to GetUpdatesSince().

I also cleaned up some stuff, but the whole TransactionLogIterator could use some refactoring, especially if we see increased usage.
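The caching idea can be sketched as a small memoizing wrapper. This assumes the cached value is the sequence number of a log file's first record, keyed by log number; `FirstRecordCache` and its `Loader` callback are hypothetical names, and the real cache in this diff may differ (for example in locking and invalidation).

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical cache in front of an expensive "read first record" call.
// Not thread-safe; a caller would need its own synchronization.
class FirstRecordCache {
 public:
  // Loader reads the log file from disk and returns the sequence number of
  // its first record (an assumption made for this sketch).
  typedef std::function<uint64_t(uint64_t log_number)> Loader;

  explicit FirstRecordCache(Loader loader) : loader_(std::move(loader)) {}

  uint64_t Get(uint64_t log_number) {
    std::unordered_map<uint64_t, uint64_t>::iterator it =
        cache_.find(log_number);
    if (it != cache_.end()) return it->second;  // hit: no disk read
    uint64_t seq = loader_(log_number);         // miss: read once
    cache_.emplace(log_number, seq);
    return seq;
  }

  // Forget an entry, e.g. when the log file is deleted.
  void Erase(uint64_t log_number) { cache_.erase(log_number); }

 private:
  Loader loader_;
  std::unordered_map<uint64_t, uint64_t> cache_;
};
```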

Test Plan: make check

Reviewers: haobo, sdong, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18387
2014-04-29 13:27:58 -04:00
Lei Jin 3995e801ab kill ReadOptions.prefix and .prefix_seek
Summary:
also add an override option total_order_iteration if you want to use full
iterator with prefix_extractor

Test Plan: make all check

Reviewers: igor, haobo, sdong, yhchiang

Reviewed By: haobo

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D17805
2014-04-25 12:21:34 -07:00
Igor Canadi 8ce5492623 Delete superversion and log outside of mutex
Summary: As summary. Add two autovectors that get filled up in MakeRoomForWrite and they get deleted outside of mutex

Test Plan: make check

Reviewers: dhruba, haobo, ljin, sdong

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18249
2014-04-25 14:58:02 -04:00
Igor Canadi ad3cd39ccd Column family logging
Summary:
Now that we have column families involved, we need to add extra context to every log message. They now start with "[column family name] log message"

Also added some logging that I think would be useful, like level summary after every flush (I often needed that when going through the logs).

Test Plan: make check + ran db_bench to confirm I'm happy with log output

Reviewers: dhruba, haobo, ljin, yhchiang, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18303
2014-04-25 09:51:16 -04:00
sdong fa430bfd04 Minimize accessing multiple objects in Version::Get()
Summary:
One of our profilings shows that Version::Get() sometimes is slow when getting pointer of user comparators or other global objects. In this patch:
(1) we keep pointers to immutable objects in Version to avoid accessing them through option objects or cfd objects
(2) table_reader is directly cached in FileMetaData so that the table cache doesn't have to go through the handle first to fetch it
(3) If level 0 has fewer than 3 files, skip the filtering logic based on SST tables' key range. Smallest and largest keys are stored in separate memory locations, which can cause cache misses

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: igor, yhchiang, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D17739
2014-04-17 14:14:00 -07:00
Igor Canadi 1803ed2ccb Fix Mac OS compile 2014-04-15 16:31:49 -07:00
sdong 0f40fe4bc7 When creating a new DB, fail it when wal_dir contains existing log files
Summary: The current behavior when creating a new DB is that, if there are existing log files, we go ahead and replay them on top of the empty DB. This is a behavior that no user would expect. With this patch, we fail the creation if a user creates a DB with existing log files.
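A minimal sketch of such a check, assuming WAL files are recognised by an all-digit name followed by `.log` and that the caller has already listed the children of wal_dir. `IsWalFileName` and `CanCreateNewDb` are hypothetical helpers, not the functions changed by this patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// True for names such as "000012.log": one or more digits, then ".log".
bool IsWalFileName(const std::string& name) {
  const std::string suffix = ".log";
  if (name.size() <= suffix.size()) return false;
  if (name.compare(name.size() - suffix.size(), suffix.size(), suffix) != 0) {
    return false;
  }
  return std::all_of(name.begin(), name.end() - suffix.size(),
                     [](unsigned char c) { return std::isdigit(c) != 0; });
}

// Creating a brand-new DB is only allowed when wal_dir holds no WAL files;
// otherwise stale logs would be replayed on top of the empty DB.
bool CanCreateNewDb(const std::vector<std::string>& wal_dir_children) {
  return std::none_of(wal_dir_children.begin(), wal_dir_children.end(),
                      IsWalFileName);
}
```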

Test Plan: make all check

Reviewers: haobo, igor, ljin

Reviewed By: haobo

CC: nkg-, yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D17817
2014-04-15 14:01:57 -07:00
Igor Canadi c166615850 Fix compile issues introduced by RocksDBLite 2014-04-15 13:51:07 -07:00
Igor Canadi 588bca2020 RocksDBLite
Summary:
Introducing RocksDBLite! Removes all the non-essential features and reduces the binary size. This effort should help our adoption on mobile.

Binary size when compiling for IOS (`TARGET_OS=IOS m static_lib`) is down to 9MB from 15MB (without stripping)

Test Plan: compiles :)

Reviewers: dhruba, haobo, ljin, sdong, yhchiang

Reviewed By: yhchiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17835
2014-04-15 13:39:26 -07:00
Igor Canadi dbe0f327ca Set log_empty to false even when options.sync is off [fix tests] 2014-04-15 10:28:34 -07:00
Igor Canadi e6acb874cd Don't roll empty logs
Summary:
With multiple column families, especially when manual Flush is executed, we might roll the log file, although the current log file is empty (no data has been written to the log).

After the diff, we won't create new log file if current is empty.

Next, I will write an algorithm that will flush column families that reference old log files (i.e., that weren't flushed in a while)

Test Plan: Added a unit test. Confirmed that the unit test fails on master

Reviewers: dhruba, haobo, ljin, sdong

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17631
2014-04-15 09:57:25 -07:00
Lei Jin 82b37a18bd thread local for tailing iterator
Summary:
replace the super version acquisition in the tailing iterator with thread
local

Test Plan: will post results

Reviewers: igor, haobo, sdong, yhchiang, dhruba

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17757
2014-04-14 10:48:01 -07:00
Lei Jin 539dd207df using thread local SuperVersion for NewIterator
Summary:
Similar to GetImpl(), use SuperVersion from thread local instead of acquiring the mutex.
I don't expect this change to make a dent in NewIterator() performance
because the bottleneck seems to be in the rest of the API

Test Plan:
make asan_check
will post perf numbers

Reviewers: haobo, igor, sdong, dhruba, yhchiang

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17643
2014-04-14 09:34:59 -07:00
Igor Canadi de41357a18 Don't dump rocksdb version on IOS 2014-04-11 10:19:58 -07:00
Igor Canadi ddef6841b3 Renamed InfoLogLevel::DEBUG to InfoLogLevel::DEBUG_LEVEL
Summary: XCode for some reason injects `#define DEBUG 1` into our code, which makes compile fail because we use `DEBUG` keyword for other stuff. This diff fixes the issue by renaming `DEBUG` to `DEBUG_LEVEL`.
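The collision can be reproduced in a few lines. This is an illustration only: the enum below is a cut-down stand-in for InfoLogLevel, the `#define` simulates what the commit message says Xcode injects, and `LevelName` is a hypothetical helper.

```cpp
#include <cassert>

// Simulates the macro that, per the commit message, Xcode injects.
#define DEBUG 1

// With the macro above, an enumerator spelled DEBUG would be rewritten by the
// preprocessor before the compiler sees it:
//   enum InfoLogLevel { DEBUG = 0, INFO };  // becomes { 1 = 0, INFO }: error
// A suffixed name is left alone by the macro.
enum InfoLogLevel { DEBUG_LEVEL = 0, INFO_LEVEL, WARN_LEVEL };

// Hypothetical helper, only here to exercise the enum. Macros are not
// expanded inside string literals, so "DEBUG" below is safe.
inline const char* LevelName(InfoLogLevel level) {
  switch (level) {
    case DEBUG_LEVEL: return "DEBUG";
    case INFO_LEVEL:  return "INFO";
    case WARN_LEVEL:  return "WARN";
  }
  return "UNKNOWN";
}
```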

Test Plan: compiles

Reviewers: dhruba, haobo, sdong, yhchiang, ljin

Reviewed By: yhchiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17709
2014-04-10 15:27:42 -07:00
sdong df2a8b6a1a Polish IterKey and use it in DBImpl::ProcessKeyValueCompaction()
Summary:
1. Polish IterKey a little bit.
2. Use it for the local variable current_user_key in DBImpl::ProcessKeyValueCompaction(). Our profile shows that DBImpl::ProcessKeyValueCompaction() spends about 14% of its cost in std::string (the base includes reading and writing data but excludes compaction filtering), which is higher than it should be. There are two std::string variables used in DBImpl::ProcessKeyValueCompaction(), compaction_filter_value and current_user_key, and it's hard to distinguish the two.

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: igor, yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D17613
2014-04-09 20:50:58 -07:00
Igor Canadi b947fdc89d Column family support for DB::OpenForReadOnly()
Summary: When opening DB in read-only mode, client can choose to only specify a subset of column families ("default" column family can't be omitted, though)

Test Plan: added a unit test in column_family_test

Reviewers: haobo, sdong, ljin, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17565
2014-04-09 09:56:17 -07:00
Igor Canadi 5b345b76cb Remove env_ from MergingIterator
Summary: env_ is not used. Compiling for iOS complains.

Test Plan: compiles now

Reviewers: ljin, haobo, sdong, dhruba

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17589
2014-04-08 13:40:42 -07:00
Igor Canadi beeee9dccc Small speedup of CompactionFilterV2
Summary: ToString() is expensive. Profiling shows that most compaction threads are stuck in jemalloc, allocating a new string. This will help out a little.

Test Plan: make check

Reviewers: haobo, danguo

Reviewed By: danguo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17583
2014-04-08 11:06:39 -07:00
Lei Jin 92c1eb0291 macros for perf_context
Summary: This will allow us to disable them completely for iOS or for better performance

Test Plan: will run make all check

Reviewers: igor, haobo, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17511
2014-04-08 10:58:07 -07:00
Igor Canadi 664559fe2d Small final fixes before merge 2014-04-07 15:38:53 -07:00
Igor Canadi d1e2bce42d CallFlushDuringCompaction 2014-04-07 15:03:15 -07:00
Igor Canadi b42ceb9598 Simplify cleanup of dead (refcount == 0) column families 2014-04-07 14:31:02 -07:00
Igor Canadi e48348d196 Make flush part of compaction process
This will enable user to use only 1 background thread.
2014-04-07 13:53:08 -07:00
Igor Canadi 2a0917b28e Merge branch 'master' into columnfamilies 2014-04-07 13:04:25 -07:00
Igor Canadi 751e4b1a35 Fix wal_dir sanitizing 2014-04-07 11:36:03 -07:00
Igor Canadi 3d2fe844ab Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/memtable_list.cc
	db/version_set.cc
2014-04-07 11:31:11 -07:00
Igor Canadi 7efdd9ef4d Options::wal_dir shouldn't end in '/'
Summary: If a client specifies wal_dir with trailing '/', we will fail in deleting obsolete log files. See task #4083746
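One way to sanitize the option is to strip the trailing separator before the path is used. The helper below is a sketch with a hypothetical name (`StripTrailingSlash`); whether the actual fix removes one slash or all trailing slashes is not stated in this message, so removing all of them is an assumption.

```cpp
#include <cassert>
#include <string>

// Removes trailing '/' characters from a directory path. A lone "/" is kept
// so that the filesystem root stays a valid path.
std::string StripTrailingSlash(std::string dir) {
  while (dir.size() > 1 && dir.back() == '/') {
    dir.pop_back();
  }
  return dir;
}
```

With the path normalized this way, a file name built as `wal_dir + "/" + name` no longer contains a double slash and compares equal to names built elsewhere.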

Test Plan: make check

Reviewers: haobo, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17535
2014-04-07 10:25:38 -07:00
sdong ea0198fe9a Create log::Writer out of DB Mutex
Summary: Our measurements show that the log::Writer constructor can sometimes take hundreds of milliseconds. It's unclear why, but simply move it out of the DB mutex.

Test Plan: make all check

Reviewers: haobo, ljin, igor

Reviewed By: haobo

CC: nkg-, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D17487
2014-04-04 15:46:28 -07:00
sdong 99c756f0fe Flush Buffered Info Logs Before Doing Compaction (one line change)
Summary: Flushing log buffer earlier to avoid confusion of time holding the locks.

Test Plan: Should be safe as long as several related db test passes

Reviewers: haobo, igor, ljin

Reviewed By: igor

CC: nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D17493
2014-04-04 10:58:30 -07:00
sdong b9767d0e09 Move several more logging inside DB mutex to log buffer
Summary: Move several common log statements that are still inside the DB mutex to the log buffer.

Test Plan: make all check

Reviewers: haobo, igor, ljin, nkg-

Reviewed By: nkg-

CC: nkg-, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D17439
2014-04-03 10:47:18 -07:00
Haobo Xu 48bc0c6ad3 [RocksDB] Fix a race condition in GetSortedWalFiles
Summary: This patch fixes a race condition where a log file is moved to the archived dir in the middle of GetSortedWalFiles. Without the fix, the log file would be missed in the result, which leads to a gap in the transaction log iterator. A test utility, SyncPoint, is added to help reproduce the race condition.

Test Plan: TransactionLogIteratorRace; make check

Reviewers: dhruba, ljin

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17121
2014-04-02 22:12:29 -07:00
sdong 158845ba9a Move a info logging out of DB Mutex
Summary: As we know, logging can be slow, or even hang, on some file systems. Move one more log statement out of the DB mutex.

Test Plan: make all check

Reviewers: haobo, igor, ljin

Reviewed By: igor

CC: yhchiang, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D17427
2014-04-02 16:48:32 -07:00
sdong 4af1954fd6 Compaction Filter V1 to use old context struct to keep backward compatible
Summary: The previous change D15087 changed existing compaction filter, which makes the commonly used class not backward compatible. Revert the older interface. Use a new interface for V2 instead.

Test Plan: make all check

Reviewers: haobo, yhchiang, igor

CC: danguo, dhruba, ljin, igor, leveldb

Differential Revision: https://reviews.facebook.net/D17223
2014-04-02 14:57:51 -07:00
Igor Canadi ddbd1ece88 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	db/internal_stats.cc
	db/internal_stats.h
	db/version_edit.cc
	db/version_edit.h
	db/version_set.cc
	include/rocksdb/options.h
	util/options.cc
2014-03-31 13:39:24 -07:00
Igor Canadi 577556d5f9 Don't store version number in MANIFEST
Summary: Talked to <insert internal project name> folks and they found it really scary that they won't be able to roll back once they upgrade to 2.8. We should fix this.

Test Plan: make check

Reviewers: haobo, ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17343
2014-03-31 11:33:09 -07:00
Haobo Xu a92194e5b2 [RocksDB] Add db property "rocksdb.cur-size-active-mem-table"
Summary: as title

Test Plan: db_test

Reviewers: sdong

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17217
2014-03-27 15:14:04 -07:00
Igor Canadi 1c9f8f0884 Fix valgrind issues
Summary:
NewFixedPrefixTransform is leaked in default options. Broken by b47812fba6

Also included in the diff some code cleanup

Test Plan:
valgrind env_test
also make check

Reviewers: haobo, danguo, yhchiang

Reviewed By: danguo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17211
2014-03-27 08:22:59 -07:00
Igor Canadi 954679bb0f AssertHeld() should do things
Summary:
AssertHeld() was a no-op before. Now it does things.

Also, this change caught a bad bug in SuperVersion::Init(). The method is calling db->mutex.AssertHeld(), but db variable is not initialized yet! I also fixed that issue.

Test Plan: make check

Reviewers: dhruba, haobo, ljin, sdong, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17193
2014-03-26 11:24:52 -07:00
Igor Canadi e86d7dffd7 Merge branch 'master' into columnfamilies 2014-03-25 15:24:02 -07:00
Danny Guo d9ca83df28 [rocksdb] make init prefix more robust
Summary:
Currently, if a client uses kNULLString as the prefix, it will confuse
compaction filter v2. This diff adds a bool to indicate whether the prefix
has been initialized. I also added a unit test to cover this case and
make sure the new code path is hit.

Test Plan: db_test

Reviewers: igor, haobo

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17151
2014-03-25 11:59:40 -07:00
Igor Canadi e8168382c4 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	include/rocksdb/options.h
	util/options.cc
2014-03-25 11:09:40 -07:00
Danny Guo b47812fba6 [rocksdb] new CompactionFilterV2 API
Summary:
This diff adds a new CompactionFilterV2 API that rolls up the
decisions of kv pairs during compactions. These kv pairs must share the
same key prefix. They are buffered inside the db.

    typedef std::vector<Slice> SliceVector;
    virtual std::vector<bool> Filter(int level,
                                 const SliceVector& keys,
                                 const SliceVector& existing_values,
                                 std::vector<std::string>* new_values,
                                 std::vector<bool>* values_changed
                                 ) const = 0;

Application can override the Filter() function to operate
on the buffered kv pairs. More details in the inline documentation.
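An override of the signature quoted above might look like the sketch below. It is written against stand-ins so it compiles on its own: `Slice` here is a minimal placeholder for rocksdb::Slice, `CompactionFilterV2Like` is a hypothetical base class that only mirrors the quoted signature, and the meaning of the returned vector (true means "drop this pair") is an assumption that should be checked against the inline documentation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Minimal placeholder for rocksdb::Slice; not the real class.
struct Slice {
  std::string data;
  bool empty() const { return data.empty(); }
};
typedef std::vector<Slice> SliceVector;

// Hypothetical base class mirroring the Filter() signature in the message.
class CompactionFilterV2Like {
 public:
  virtual ~CompactionFilterV2Like() {}
  virtual std::vector<bool> Filter(int level,
                                   const SliceVector& keys,
                                   const SliceVector& existing_values,
                                   std::vector<std::string>* new_values,
                                   std::vector<bool>* values_changed
                                   ) const = 0;
};

// Example policy: drop every pair in the batch whose value is empty and
// leave all other pairs unchanged.
class DropEmptyValues : public CompactionFilterV2Like {
 public:
  std::vector<bool> Filter(int /*level*/,
                           const SliceVector& keys,
                           const SliceVector& existing_values,
                           std::vector<std::string>* new_values,
                           std::vector<bool>* values_changed
                           ) const override {
    assert(keys.size() == existing_values.size());
    std::vector<bool> to_delete(keys.size(), false);
    new_values->assign(keys.size(), std::string());
    values_changed->assign(keys.size(), false);  // nothing is rewritten
    for (size_t i = 0; i < keys.size(); ++i) {
      if (existing_values[i].empty()) to_delete[i] = true;
    }
    return to_delete;
  }
};
```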

Test Plan:
make check. Added unit tests to make sure Keep, Delete,
Change all work.

Reviewers: haobo

CCs: leveldb

Differential Revision: https://reviews.facebook.net/D15087
2014-03-24 20:47:53 -07:00
Yueh-Hsuan Chiang cda4006e87 Enhance partial merge to support multiple arguments
Summary:
* PartialMerge api now takes a list of operands instead of two operands.
* Add min_partial_merge_operands to Options, indicating the minimum
  number of operands to trigger partial merge.
* This diff is based on Schalk's previous diff (D14601), but it also
  includes necessary changes such as updating the pure C api for
  partial merge.

Test Plan:
* make check all
* develop tests for cases where partial merge takes more than two
  operands.

TODOs (from Schalk):
* Add test with min_partial_merge_operands > 2.
* Perform benchmarks to measure the performance improvements (can probably
  use results of task #2837810.)
* Add description of problem to doc/index.html.
* Change wiki pages to reflect the interface changes.

Reviewers: haobo, igor, vamsi

Reviewed By: haobo

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D16815
2014-03-24 17:57:13 -07:00
Igor Canadi ac328a86b9 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_test.cc
2014-03-20 14:41:37 -07:00
Igor Canadi e67241f0b9 Sanity check on Open
Summary:
Every time a client opens a DB, we do a sanity check that:
* checks the existence of all the necessary files
* verifies that file sizes are correct

Some of the code was stolen from https://reviews.facebook.net/D16935

Test Plan: added a unit test

Reviewers: dhruba, haobo, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17097
2014-03-20 14:18:29 -07:00
Igor Canadi 8ea3cb621e If paranoid_checks -- Mark DB read-only on any IOError
Summary:
Whenever we get an IOError from GetImpl() or NewIterator(), we should immediately mark the DB read-only. The same check already exists in Write() and Compaction().

This should help with clients that are somehow missing a file.

Test Plan: make check

Reviewers: dhruba, haobo, sdong, ljin

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17061
2014-03-20 13:10:02 -07:00
Igor Canadi e20fa3f8a4 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/internal_stats.cc
	db/internal_stats.h
	db/version_set.cc
2014-03-19 17:22:20 -07:00
sdong 71e6a34271 Add a DB property to indicate number of background errors encountered
Summary: Add a property to calculate number of background errors encountered to help users build their monitoring

Test Plan: Add a unit test. make all check

Reviewers: haobo, igor, dhruba

Reviewed By: igor

CC: ljin, nkg-, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D16959
2014-03-18 14:28:30 -07:00
Kai Liu 1ec72b37b1 Several easy-to-add properties related to compaction and flushes
Summary: To partly address the request @nkg- raised, add three easy-to-add properties to compactions and flushes.

Test Plan: run unit tests and add a new unit test to cover new properties.

Reviewers: haobo, dhruba

Reviewed By: dhruba

CC: nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D13677
2014-03-18 14:00:09 -07:00
Igor Canadi 3055a15b29 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/version_edit.cc
	db/version_edit.h
	db/version_set.cc
2014-03-18 13:24:27 -07:00
Lei Jin 63cef90078 disable the log_number check in Recover()
Summary:
There is a chance that an old MANIFEST is corrupted in 2.7 but just not noticed.
This check would fail them. Change it to log instead of returning a
Corruption status.

Test Plan: make

Reviewers: haobo, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16923
2014-03-18 12:46:29 -07:00
Igor Canadi f26cb0f093 Optimize fallocation
Summary:
Based on my recent findings (posted in our internal group), if we use fallocate without KEEP_SIZE flag, we get superior performance of fdatasync() in append-only workloads.

This diff provides an option for user to not use KEEP_SIZE flag, thus optimizing his sync performance by up to 2x-3x.

At one point we also just called posix_fallocate instead of fallocate, which isn't very fast: http://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html (tl;dr it manually writes out zero bytes to allocate storage). This diff also fixes that, by first calling fallocate and then posix_fallocate if fallocate is not supported.

Test Plan: make check

Reviewers: dhruba, sdong, haobo, ljin

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16761
2014-03-17 21:52:14 -07:00
Igor Canadi ae25742af9 Fix race condition in manifest roll
Summary:
When the manifest is getting rolled the following happens:
1) manifest_file_number_ is assigned to a new manifest number (even though the old one is still current)
2) mutex is unlocked
3) SetCurrentFile() creates temporary file manifest_file_number_.dbtmp
4) SetCurrentFile() renames manifest_file_number_.dbtmp to CURRENT
5) mutex is locked

If FindObsoleteFiles happens between (3) and (4) it will:
1) Delete manifest_file_number_.dbtmp (because it's not in pending_outputs_)
2) Delete old manifest (because the manifest_file_number_ already points to a new one)

I introduce the concept of prev_manifest_file_number_ that will avoid the race condition.
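The effect on the obsolete-file decision can be sketched as below. This is only an illustration of the idea: `ManifestState` and `ManifestIsObsolete` are hypothetical names, and the sketch assumes that manifest numbers only grow and that prev_manifest_file_number equals manifest_file_number whenever no roll is in progress.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of the two counters taken under the mutex.
struct ManifestState {
  uint64_t manifest_file_number;       // may already name the new manifest
  uint64_t prev_manifest_file_number;  // manifest that CURRENT still names
};

// A manifest may be deleted only if it is older than the one CURRENT still
// points to. During a roll this keeps both the old and the new manifest.
bool ManifestIsObsolete(uint64_t file_number, const ManifestState& state) {
  assert(state.prev_manifest_file_number <= state.manifest_file_number);
  return file_number < state.prev_manifest_file_number;
}
```

Without the second counter, a check against manifest_file_number alone would treat the old manifest as obsolete between steps (3) and (4), which is the window described above.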

However, we should discuss the future of MANIFEST file rolling. We found some race conditions with it last week and who knows how many more are there. Nobody is using it in production because we don't trust the implementation. Should we even support it?

Test Plan: make check

Reviewers: ljin, dhruba, haobo, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16929
2014-03-17 21:50:15 -07:00
Igor Canadi db234133a9 [CF] WriteBatch to take in ColumnFamilyHandle
Summary: Client doesn't need to know anything about ColumnFamily ID. By making WriteBatch take ColumnFamilyHandle as a parameter, we can eliminate method GetID() from ColumnFamilyHandle

Test Plan: column_family_test

Reviewers: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16887
2014-03-14 11:30:14 -07:00
Igor Canadi e1f56e12cf Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	tools/db_stress.cc
2014-03-13 13:21:20 -07:00
sdong 5aa81f04fa Fix extra compaction tasks scheduled after D16767 in some cases
Summary:
With D16767, there is a case where compaction tasks are scheduled infinitely:
(1) no flush thread is configured and there is more than 1 compaction thread
(2) a flush is being run by one compaction thread
(3) the SST files are in a state where versions_->current()->NeedsCompaction() generates a false positive (returns true although there is no work to be done)
In that case, an infinite loop is formed.

This patch would fix it.

Test Plan: make all check

Reviewers: haobo, igor, ljin

Reviewed By: igor

CC: dhruba, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D16863
2014-03-13 13:06:08 -07:00
Kai Liu 11da8bc5df A heuristic way to check if a memtable is full
Summary:
This is based on https://reviews.facebook.net/D15027. It's not finished but I would like to give a prototype to avoid arena over-allocation while making better use of the already allocated memory blocks.

Instead of checking the approximate memtable size, we take a deeper look at the arena, which incorporates the essential idea that @sdong suggests: flush when the arena has allocated its last block and that block is "almost full"

Test Plan: N/A

Reviewers: haobo, sdong

Reviewed By: sdong

CC: leveldb, sdong

Differential Revision: https://reviews.facebook.net/D15051
2014-03-12 16:40:14 -07:00
Igor Canadi 25c8a1a20f More bug fixed introduced by code cleanup 2014-03-12 12:28:23 -07:00
Igor Canadi b5d6ad69fc Bug fixes introduced by code cleanup 2014-03-12 11:10:26 -07:00
Igor Canadi dff9214165 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	tools/db_stress.cc
2014-03-12 10:17:41 -07:00
Igor Canadi fb2346fc1f [CF] Code cleanup part 1
Summary:
I'm cleaning up some code preparing for the big diff review tomorrow. This is the first part of the cleanup.

Changes are mostly cosmetic. The goal is to decrease amount of code difference between columnfamilies and master branch.

This diff also fixes race condition when dropping column family.

Test Plan: Ran db_stress with variety of parameters

Reviewers: dhruba, haobo

Differential Revision: https://reviews.facebook.net/D16833
2014-03-12 09:56:53 -07:00
Igor Canadi 45ad75db80 Correct version of D16821 2014-03-12 09:38:59 -07:00
sdong bd45633b71 Fix data race against logging data structure because of LogBuffer
Summary:
@igor pointed out that there is a potential data race because of the way we use the newly introduced LogBuffer. After "bg_compaction_scheduled_--" or "bg_flush_scheduled_--", they can both become 0. As soon as the lock is released after that, DBImpl's destructor can go ahead and destroy all the state inside DB, including the info_log object held in a shared pointer of the options object it keeps. At that point it is no longer safe to continue using the info logger to write the delayed logs.

With the patch, lock is released temporarily for log buffer to be flushed before "bg_compaction_scheduled_--" or "bg_flush_scheduled_--". In order to make sure we don't miss any pending flush or compaction, a new flag bg_schedule_needed_ is added, which is set to be true if there is a pending flush or compaction but not scheduled because of the max thread limit. If the flag is set to be true, the scheduling function will be called before compaction or flush thread finishes.
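
The ordering described above can be illustrated with a small sketch. Everything here is an assumption made for illustration: `SchedulerSketch`, its fields and `FinishJob` are hypothetical stand-ins, not the real DBImpl members, and a plain vector stands in for the info logger (which is assumed to be thread-safe in the real code).

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for the scheduling state kept by DBImpl.
struct SchedulerSketch {
  std::mutex mu;                      // plays the role of the DB mutex
  int bg_scheduled = 0;               // background jobs currently scheduled
  int max_threads = 1;                // background thread limit
  int pending_jobs = 0;               // flushes/compactions waiting to run
  bool bg_schedule_needed = false;    // a job was skipped due to the limit
  std::vector<std::string> info_log;  // stand-in for the info logger

  // Caller must hold mu.
  void MaybeSchedule() {
    bg_schedule_needed = false;
    while (pending_jobs > 0) {
      if (bg_scheduled >= max_threads) {
        bg_schedule_needed = true;  // remember the job we could not start
        return;
      }
      --pending_jobs;
      ++bg_scheduled;
    }
  }

  // Tail of a background job. `lock` must own mu on entry and owns it again
  // on return; `buffered` holds log lines collected while mu was held.
  void FinishJob(std::unique_lock<std::mutex>& lock,
                 std::vector<std::string>* buffered) {
    assert(lock.owns_lock());
    lock.unlock();  // write the delayed logs without holding the mutex
    info_log.insert(info_log.end(), buffered->begin(), buffered->end());
    buffered->clear();
    lock.lock();
    --bg_scheduled;  // only now may the counter reach zero
    if (bg_schedule_needed) {
      MaybeSchedule();  // pick up work that hit the thread limit earlier
    }
  }
};
```

Because the counter is decremented only after the buffered logs have been written, a destructor that waits for the counter to reach zero cannot tear down the logger while delayed logs are still pending.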

Thanks @igor for this finding!

Test Plan: make all check

Reviewers: haobo, igor

Reviewed By: haobo

CC: dhruba, ljin, yhchiang, igor, leveldb

Differential Revision: https://reviews.facebook.net/D16767
2014-03-11 16:09:53 -07:00
Igor Canadi 457c78eb89 [CF] db_stress for column families
Summary:
I have had this diff for a while to test the column families implementation. Last night, I ran it successfully for 10 hours with the command:

   time ./db_stress --threads=30 --ops_per_thread=200000000 --max_key=5000 --column_families=20 --clear_column_family_one_in=3000000 --verify_before_write=1  --reopen=50 --max_background_compactions=10 --max_background_flushes=10 --db=/tmp/db_stress

It is ready to be committed :)

Test Plan: Ran it for 10 hours

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16797
2014-03-11 12:06:12 -07:00
sdong 6c66bc08d9 Temp Fix of LogBuffer flushing
Summary: Temporary fix for log buffer flushing: flush the buffer inside the lock. This keeps trunk clean until we find a permanent fix.

Test Plan: make all check

Reviewers: haobo, igor

Reviewed By: igor

CC: ljin, leveldb, yhchiang

Differential Revision: https://reviews.facebook.net/D16791
2014-03-11 11:37:40 -07:00
Igor Canadi cb9802168f Add a comment after SignalAll()
Summary: Having code after SignalAll has already caused 2 bugs. Let's make sure this doesn't happen again.

Test Plan: no test

Reviewers: sdong, dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16785
2014-03-11 11:27:19 -07:00
Igor Canadi 9634ba42ac Merge branch 'master' into columnfamilies
Conflicts:
	db/compaction_picker.cc
	db/db_impl.cc
	db/db_impl.h
	db/tailing_iter.cc
	db/version_set.h
	include/rocksdb/options.h
	util/options.cc
2014-03-10 17:26:09 -07:00
Igor Canadi d5de22dc09 Call PurgeObsoleteFiles() only when HaveSomethingToDelete()
Summary: as title

Test Plan: fixed the build failure http://ci-builds.fb.com/job/rocksdb_build/987/console

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16743
2014-03-10 15:42:14 -07:00
Haobo Xu a91aed615a [RocksDB] Minor cleanup of PurgeObsoleteFiles
Summary: as title. also made info log output of file deletion a bit more descriptive.

Test Plan: make check; db_bench and look at LOG output

Reviewers: igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16731
2014-03-10 14:13:38 -07:00
Lei Jin 8d007b4aaf Consolidate SliceTransform object ownership
Summary:
(1) Fix SanitizeOptions() to also check HashLinkList. The current
dynamic case just happens to work because the 2 classes have the same
layout.
(2) Do not delete SliceTransform object in HashSkipListFactory and
HashLinkListFactory destructor. Reason: SanitizeOptions() enforces
prefix_extractor and SliceTransform to be the same object when
Hash**Factory is used. This makes the behavior strange: when
Hash**Factory is used, prefix_extractor will be released by RocksDB. If
other memtable factory is used, prefix_extractor should be released by
user.

Test Plan: db_bench && make asan_check

Reviewers: haobo, igor, sdong

Reviewed By: igor

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D16587
2014-03-10 12:56:46 -07:00
Haobo Xu 9e0e6aa7f6 [RocksDB] make sure KSVObsolete does not get accessed as a valid pointer.
Summary: KSVObsolete is no longer nullptr and needs to be checked explicitly. Also did some minor code cleanup and added a stat counter to track superversion cleanups incurred in the foreground.

Test Plan: make check

Reviewers: ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16701
2014-03-10 12:55:25 -07:00
Haobo Xu 66da467983 [RocksDB] LogBuffer Cleanup
Summary: Moved LogBuffer class to an internal header. Removed some unnecessary indirection. Enabled log buffer for BackgroundCallFlush. Forced log buffer flush right after Unlock to improve time ordering of info log.

Test Plan: make check; db_bench compare LOG output

Reviewers: sdong

Reviewed By: sdong

CC: leveldb, igor

Differential Revision: https://reviews.facebook.net/D16707
2014-03-10 11:05:44 -07:00
Igor Canadi d4f2c610d3 Ignore dropped column families -- don't flush or compact them 2014-03-07 18:43:21 -08:00
Igor Canadi 1e0d47276c Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
2014-03-07 16:59:47 -08:00
Igor Canadi 9f15092ebd [CF] NewIterators
Summary: Adding the last missing function -- NewIterators(). Pretty simple implementation

Test Plan: added a unit test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16689
2014-03-07 16:15:25 -08:00
Lei Jin e5fa4944fc use CAS when returning SuperVersion to ThreadLocal
Summary:
Add a check at the end of GetImpl to release SuperVersion if it becomes
obsolete. Also do Scrape() inside InstallSuperVersion so it happens more
frequently.
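
A compare-and-swap return path of this kind could be sketched as follows. The slot layout, the two marker objects and the function names (`AcquireSketch`, `ReleaseSketch`, `ScrapeSketch`) are assumptions for illustration only; the real thread-local storage and SuperVersion reference counting are not shown.

```cpp
#include <atomic>
#include <cassert>

struct SuperVersionSketch { int version = 0; };  // hypothetical placeholder

// Two distinct sentinel addresses (assumed representation of the markers).
inline SuperVersionSketch* InUseMarker() {
  static SuperVersionSketch m;
  return &m;
}
inline SuperVersionSketch* ObsoleteMarker() {
  static SuperVersionSketch m;
  return &m;
}

using SlotSketch = std::atomic<SuperVersionSketch*>;

// Reader: take the cached pointer and leave an "in use" marker behind.
// Returns nullptr if nothing usable was cached (caller takes the slow path).
inline SuperVersionSketch* AcquireSketch(SlotSketch& slot) {
  SuperVersionSketch* sv = slot.exchange(InUseMarker());
  return sv == ObsoleteMarker() ? nullptr : sv;
}

// Reader: try to put the pointer back. The CAS succeeds only if the slot
// still holds the "in use" marker; on failure the caller must release sv.
inline bool ReleaseSketch(SlotSketch& slot, SuperVersionSketch* sv) {
  assert(sv != nullptr);
  SuperVersionSketch* expected = InUseMarker();
  return slot.compare_exchange_strong(expected, sv);
}

// Installer: mark the slot obsolete. Returns a cached pointer that the
// caller must release, or nullptr if the slot held only a marker.
inline SuperVersionSketch* ScrapeSketch(SlotSketch& slot) {
  SuperVersionSketch* old = slot.exchange(ObsoleteMarker());
  return (old == InUseMarker() || old == ObsoleteMarker()) ? nullptr : old;
}
```

If a scrape happens while a reader holds the pointer, the reader's compare-and-swap fails and the reader drops its reference instead of caching a stale SuperVersion.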

Test Plan:
make all check
running asan_check now

Reviewers: igor, haobo, sdong, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16641
2014-03-07 14:43:22 -08:00
Igor Canadi eec8695206 Delete local sv when destroying DB from stress test
Summary: Not deleting the local SV caused a crash test issue: http://ci-builds.fb.com/job/rocksdb_asan_crash_test/83/console

Test Plan: ran unit tests

Reviewers: ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16635
2014-03-06 18:15:26 -08:00
Igor Canadi 80a207fc90 Merge branch 'master' into columnfamilies
Conflicts:
	db/compaction_picker.cc
	db/compaction_picker.h
	db/db_impl.cc
	db/version_set.cc
	db/version_set.h
	include/rocksdb/options.h
	util/options.cc
2014-03-05 16:59:22 -08:00
sdong ecb1ffa2a8 Buffer info logs when picking compactions and write them out after releasing the mutex
Summary: While the background thread is picking compactions, it writes out multiple info logs, especially for universal compaction. This creates a risk of waiting on log writes while holding the mutex, which is bad. To remove this risk, write all those info logs to a buffer and flush it after releasing the mutex.
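
A minimal sketch of this buffering pattern, assuming a hypothetical `LogBufferSketch` class and a plain vector standing in for the info log (neither is the real RocksDB interface):

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical buffer that collects log lines cheaply in memory.
class LogBufferSketch {
 public:
  void Add(std::string line) { lines_.push_back(std::move(line)); }

  // Writes all buffered lines to `sink`. Intended to be called only after
  // the DB mutex has been released, since the real sink may block on I/O.
  void FlushTo(std::vector<std::string>* sink) {
    assert(sink != nullptr);
    for (std::string& line : lines_) {
      sink->push_back(std::move(line));
    }
    lines_.clear();
  }

 private:
  std::vector<std::string> lines_;
};

// Illustrative caller: log into the buffer under the mutex, flush after.
inline void PickCompactionSketch(std::mutex& db_mutex,
                                 std::vector<std::string>* info_log) {
  LogBufferSketch buffer;
  {
    std::lock_guard<std::mutex> guard(db_mutex);
    buffer.Add("considering candidate files");  // no I/O while locked
    buffer.Add("picked files for compaction");
  }
  buffer.FlushTo(info_log);  // slow log writes happen outside the mutex
}
```

The trade-off is that log lines appear slightly later than the events they describe, in exchange for never blocking other threads on log I/O.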

Test Plan:
make all check
check the log lines while running some tests that trigger compactions.

Reviewers: haobo, igor, dhruba

Reviewed By: dhruba

CC: i.am.jin.lei, dhruba, yhchiang, leveldb, nkg-

Differential Revision: https://reviews.facebook.net/D16515
2014-03-05 15:36:32 -08:00
Igor Canadi a329dd1b25 Fix TEST_Destroy_DBImpl() to work with column families 2014-03-05 12:27:39 -08:00
Igor Canadi 0738ae6dc9 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
2014-03-05 12:25:05 -08:00
Igor Canadi 8ca30bd51b Merge pull request #47 from mlin/kCompactionStopStyleSimilarSize
An initial implementation of kCompactionStopStyleSimilarSize for universal compaction
2014-03-05 10:35:30 -08:00
sdong e8ecca9e86 CleanupIteratorState() only to initialize DeletionState when super version cleanup needed
Summary:
Two changes:
1. DeletionState is only constructed when cleaning up is needed
2. Fix the deletion state construction bug. A change was made in a previous patch: https://reviews.facebook.net/rROCKSDB774ed89c2405ee058086b099cbc8b29e243739cc#71a34e2e However, it somehow got lost when merging.

Test Plan: make all check

Reviewers: kailiu, haobo, igor

Reviewed By: igor

CC: igor, dhruba, i.am.jin.lei, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D16233
2014-03-04 20:58:20 -08:00
Igor Canadi e21d5b8bbc [CF] Flush all memtables on column family drop
Summary: When column family is dropped, we want to delete all WALs that refer to it. To do that, we need to make them obsolete by flushing all the memtables

Test Plan: column_family_test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16557
2014-03-04 17:21:30 -08:00
Igor Canadi 335b207974 [CF] Delete SuperVersion in a special function
Summary: Added a function DeleteSuperVersion that can be called in the DBImpl destructor before PurgeObsoleteFiles. That way, PurgeObsoleteFiles will be able to delete all files held by alive super versions.

Test Plan: column_family_test with valgrind

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16545
2014-03-04 09:35:44 -08:00
Igor Canadi 9d0577a6be Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/transaction_log_impl.cc
	db/transaction_log_impl.h
	include/rocksdb/options.h
	util/env.cc
	util/options.cc
2014-03-03 18:29:03 -08:00
Igor Canadi f9b2f0ad79 [CF] Fix CF bugs in WriteBatch
Summary:
This diff fixes two bugs:
* Increase sequence number even if WriteBatch fails. This is important because WriteBatches in WAL logs have implicitly increasing sequence numbers, even if one update in a write batch fails. This caused some writes to get lost in my CF stress testing.
* Tolerate 'invalid column family' errors on recovery. When a column family is dropped, processing WAL logs can encounter WriteBatches that still refer to the dropped column family. During recovery, we want to ignore those errors. In the client's Write() code path, however, we want to return the failure to the client if it is trying to add data to an invalid column family.
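
Both behaviors can be shown together in a small sketch. The types and the function `ApplyBatchSketch` are hypothetical, and the choice to keep applying later updates after a failed one is an assumption of this sketch, not something the summary states.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical flattened view of one update inside a write batch.
struct UpdateSketch {
  std::uint32_t cf_id;
  std::string key;
  std::string value;
};

using StoreSketch =
    std::map<std::pair<std::uint32_t, std::string>, std::string>;

// Applies a batch. Returns false if an update named an unknown column family
// and we are not in recovery mode. The sequence number advances once per
// update whether or not that update was applied.
inline bool ApplyBatchSketch(const std::vector<UpdateSketch>& batch,
                             const std::set<std::uint32_t>& live_cfs,
                             bool in_recovery, std::uint64_t* sequence,
                             StoreSketch* store) {
  assert(sequence != nullptr && store != nullptr);
  bool ok = true;
  for (const UpdateSketch& u : batch) {
    if (live_cfs.count(u.cf_id) != 0) {
      (*store)[std::make_pair(u.cf_id, u.key)] = u.value;
    } else if (!in_recovery) {
      ok = false;  // client write path: report the invalid column family
    }              // recovery path: the column family was dropped, ignore
    ++*sequence;   // keep sequence numbers aligned with the WAL contents
  }
  return ok;
}
```

If the sequence number stopped advancing at the failed update, later updates replayed from the WAL would be assigned different sequence numbers than they had when first written.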

Test Plan: db_stress's verification works now

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16533
2014-03-03 17:07:46 -08:00
Igor Canadi 8ea21a778b [CF] Rethink LogAndApply for column families
Summary:
I thought I might get away with as few changes to LogAndApply() as possible. It turns out this is not the case.

This diff introduces different behavior of LogAndApply() for three cases:
1. column family add
2. column family drop
3. no-column family manipulation

(1) and (2) don't support group commit yet.

There were a lot of problems with the old version of LogAndApply, detected by db_stress. The biggest was non-atomicity of manifest writes and metadata changes (i.e. if a column family add is in the manifest, it also has to be in the in-memory data structure).

Test Plan: db_stress

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16491
2014-02-28 14:46:48 -08:00
Igor Canadi 58ca641d53 Make Log::Reader more robust
Summary:
This diff does two things:
(1) Log::Reader does not report a corruption when the last record in a log or manifest file is truncated (meaning that log writer died in the middle of the write). Inherited the code from LevelDB: https://code.google.com/p/leveldb/source/detail?r=269fc6ca9416129248db5ca57050cd5d39d177c8#
(2) Turn off mmap writes for all writes to log and manifest files

(2) is necessary because if we use mmap writes, the last record is not truncated, but is actually filled with zeros, making checksum fail. It is hard to recover from checksum failing.
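
The distinction in (1) between a truncated tail and real corruption can be sketched with a toy record format. The format here (one length byte, one checksum byte, then the payload) and all names are assumptions for illustration; the real log format and its CRC are different.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class ReadStatusSketch { kOk, kCorruption };

// Toy checksum: sum of payload bytes modulo 256 (not the real CRC).
inline unsigned char ChecksumSketch(const std::string& payload) {
  unsigned sum = 0;
  for (unsigned char c : payload) sum += c;
  return static_cast<unsigned char>(sum & 0xFF);
}

// Appends [length][checksum][payload] to the in-memory "file".
inline void AppendRecordSketch(std::string* file, const std::string& payload) {
  assert(payload.size() < 256);
  file->push_back(static_cast<char>(payload.size()));
  file->push_back(static_cast<char>(ChecksumSketch(payload)));
  file->append(payload);
}

// Reads all complete records. A record cut short at the end of the file is
// treated as end-of-file; a checksum mismatch is reported as corruption.
inline ReadStatusSketch ReadAllSketch(const std::string& file,
                                      std::vector<std::string>* out) {
  std::size_t pos = 0;
  while (pos < file.size()) {
    if (file.size() - pos < 2) return ReadStatusSketch::kOk;  // cut header
    const std::size_t len = static_cast<unsigned char>(file[pos]);
    const unsigned char expected = static_cast<unsigned char>(file[pos + 1]);
    if (file.size() - pos - 2 < len) {
      return ReadStatusSketch::kOk;  // writer died mid-record: not an error
    }
    const std::string payload = file.substr(pos + 2, len);
    if (ChecksumSketch(payload) != expected) {
      return ReadStatusSketch::kCorruption;
    }
    out->push_back(payload);
    pos += 2 + len;
  }
  return ReadStatusSketch::kOk;
}
```

This also hints at why (2) matters: a tail that is present but filled with zeros is no longer short, so it reaches the checksum comparison instead of the truncation branch.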

Test Plan:
Added unit tests from LevelDB
Actually recovered a "corrupted" MANIFEST file.

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16119
2014-02-28 13:19:47 -08:00
Yueh-Hsuan Chiang a77527f2af Add ReadOptions to TransactionLogIterator.
Summary:
Add an optional input parameter ReadOptions to DB::GetUpdatesSince(),
which allows the verification of checksums to be disabled by setting
ReadOptions::verify_checksums to false.

Test Plan: Tests are done off-line and will not be included in the regular unit test.

Reviewers: igor

Reviewed By: igor

CC: leveldb, xjin, dhruba

Differential Revision: https://reviews.facebook.net/D16305
2014-02-28 11:50:36 -08:00
Igor Canadi f6a257b6a1 Set dropped column family before persisting in the manifest 2014-02-28 11:49:32 -08:00
Igor Canadi 510f84b686 [CF] CreateColumnFamily fix
Summary:
This fixes a few bugs with CreateColumnFamily
* We first have to LogAndApply and then call VersionSet::CreateColumnFamily. Otherwise, WriteSnapshot might be invoked, writing out column family add inside of LogAndApply, even though it's not really committed
* Fix LogAndApplyHelper() to not apply log number to column_family_data, which in the case of column family add is just a dummy (default) column family
* Create SuperVersion when creating column family

Test Plan: column_family_test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16443
2014-02-28 10:40:52 -08:00
Igor Canadi 206b38f31c SetLogNumber in CreateColumnFamily 2014-02-27 16:53:45 -08:00
Igor Canadi b41a3bc4da [CF] Change flow of CreateColumnFamily
Summary:
Previously, we first wrote to the manifest and then created internal data structure.
Now, we first create internal data structure. That way, we can write out internal comparator to the manifest

Test Plan: column_family_test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16425
2014-02-27 16:49:49 -08:00
Lei Jin ad0c3747cb cache SuperVersion in thread local storage to avoid mutex lock
Summary: as title

Test Plan:
asan_check
will post results later

Reviewers: haobo, igor, dhruba, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16257
2014-02-27 11:38:55 -08:00
Igor Canadi 4c42201204 [CF] Test fixes and speedup 2014-02-26 17:34:39 -08:00
Igor Canadi 343c32be7b [CF] DifferentMergeOperators and DifferentCompactionStyles tests
Summary:
Two new column family tests:
* DifferentMergeOperators -- three column families, one without merge operator, one with add operator and one with append operator. verify that operations work as expected.
* DifferentCompactionStyles -- three column families, two with level compactions and one with universal compaction. trigger the compactions and verify they work as expected.

Test Plan: nope

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16377
2014-02-26 16:05:24 -08:00
Igor Canadi 6e7cae7711 [CF] More tests
Summary: New unit tests for column families

Test Plan: this is a test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16359
2014-02-26 14:16:23 -08:00
Igor Canadi 9bce2b2a84 [CF] Fix lint errors in CF code
Summary: Big CF diff uncovered some lint errors. This diff fixes some of them. Not much to see here

Test Plan: make check

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16347
2014-02-26 10:10:00 -08:00
Igor Canadi 8b7ab9951c [CF] Handle failure in WriteBatch::Handler
Summary:
* Add ColumnFamilyHandle::GetID() function. Client needs to know column family's ID to be able to construct WriteBatch
* Handle WriteBatch::Handler failure gracefully. Since WriteBatch is not a very smart function (it takes raw CF id), client can add data to WriteBatch for column family that doesn't exist. In that case, we need to gracefully return failure status from DB::Write(). To do that, I added a return Status to WriteBatch functions PutCF, DeleteCF and MergeCF.

Test Plan: Added test to column_family_test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16323
2014-02-26 10:10:00 -08:00
Igor Canadi 8895526308 Merge branch 'master' into columnfamilies 2014-02-25 17:04:48 -08:00
Igor Canadi 5ad7ee03ea [CF] Log deletion in column families
Summary:
* Added unit test that verifies that obsolete files are deleted.
* Advance log number for empty column family when cutting log file.
* MinLogNumber() bug fix! (caught by the new unit test)

Test Plan: unit test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16311
2014-02-25 16:54:41 -08:00
Igor Canadi 4209516359 Schedule flush when waiting on flush
Summary:
This will also help with avoiding the deadlock. If a flush failed and we're waiting for a memtable to be flushed, we should schedule a new flush and hope the new one succeeds.

If paranoid_checks = false, Wait() will still hang on ENOSPC, but at least it will automatically continue when the space frees up. Current behavior both hangs and deadlocks.

Also, I renamed some 'compaction' to 'flush'. 'compaction' was the LevelDB way of saying things.

Test Plan: make check

Reviewers: dhruba, haobo, ljin

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16281
2014-02-25 12:04:14 -08:00
Igor Canadi b69e7d99d5 [CF] Better handling of memtable logs
Summary: DBImpl now keeps a list of alive_log_files_. On every FindObsoleteFiles, it deletes all alive log files whose numbers are smaller than versions_->MinLogNumber().
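
The pruning step could be sketched as below. The function name and the use of a deque are assumptions for illustration, as is the premise that log numbers are appended in increasing order so the oldest log is always at the front.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Removes from `alive` every log number below `min_log_number` and returns
// the removed numbers so the caller can delete the corresponding files.
// Assumes `alive` is sorted in increasing order.
inline std::vector<std::uint64_t> CollectObsoleteLogsSketch(
    std::deque<std::uint64_t>* alive, std::uint64_t min_log_number) {
  assert(alive != nullptr);
  std::vector<std::uint64_t> obsolete;
  while (!alive->empty() && alive->front() < min_log_number) {
    obsolete.push_back(alive->front());
    alive->pop_front();
  }
  return obsolete;
}
```

With several column families, the minimum log number is the smallest log still needed by any of them, so a single column family that has not been flushed keeps older logs alive for all.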

Test Plan:
make check passes
no specific unit tests yet, will add

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16293
2014-02-25 09:55:13 -08:00
Igor Canadi d39da4b578 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
2014-02-24 17:09:05 -08:00