Commit Graph

472 Commits

Author SHA1 Message Date
Naman Gupta 8454cfe569 Add read/modify/write functionality to Put() api
Summary: The application can set a callback function, which is applied on the previous value. And calculates the new value. This new value can be set, either inplace, if the previous value existed in memtable, and new value is smaller than previous value. Otherwise the new value is added normally.

Test Plan: fbmake. Added unit tests. All unit tests pass.

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: sdong, kailiu, xinyaohu, sumeet, leveldb

Differential Revision: https://reviews.facebook.net/D14745
2014-01-14 07:55:16 -08:00
Igor Canadi 00065d0d5d Fix merge test
Copy max_successive_merges from options to column family options
2014-01-13 13:27:47 -08:00
Igor Canadi 151f9e144f Merge branch 'master' into columnfamilies 2014-01-13 09:09:51 -08:00
Schalk-Willem Kruger a09ee1069d Improve RocksDB "get" performance by computing merge result in memtable
Summary:
Added an option (max_successive_merges) that can be used to specify the
maximum number of successive merge operations on a key in the memtable.
This can be used to improve performance of the "get" operation. If many
successive merge operations are performed on a key, the performance of "get"
operations on the key deteriorates, as the value has to be computed for each
"get" operation by applying all the successive merge operations.

FB Task ID: #3428853

Test Plan:
make all check
db_bench --benchmarks=readrandommergerandom
counter_stress_test

Reviewers: haobo, vamsi, dhruba, sdong

Reviewed By: haobo

CC: zshao

Differential Revision: https://reviews.facebook.net/D14991
2014-01-10 17:33:56 -08:00
Siying Dong 5b5ab0c1a8 [Performance Branch] Fix memory leak in HashLinkListRep.GetIterator()
Summary: Full list constructed for full iterator can be leaked. This was a bug introduced when I copy the full iterator codes from hash skip list to hash link list. This patch fixes it.

Test Plan: Run valgrind test against db_test and make sure the memory leak is fixed

Reviewers: kailiu, haobo

Reviewed By: kailiu

CC: igor, leveldb

Differential Revision: https://reviews.facebook.net/D15093
2014-01-10 12:12:28 -08:00
Yancey afdd2d1a46 fix compile warning 2014-01-10 17:56:35 +08:00
Siying Dong 237a3da677 StopWatch not to get time if it is created for statistics and it is disabled
Summary: Currently, even if statistics is not enabled, StopWatch only for the stats still gets the time of the day, which is wasteful. This patch adds a new option to StopWatch to disable this get in this case.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14703

Conflicts:
	db/db_impl.cc
2014-01-09 17:39:48 -08:00
Siying Dong 424a524ac9 [Performance Branch] A Hashed Linked List Based Mem Table
Summary:
Implement a mem table, in which keys are hashed based on prefixes. In each bucket, entries are organized in a sorted linked list. It has the same thread safety guarantee as skip list.

The motivation is to optimize memory usage for the case that prefix hashing is primary way of seeking to the entry. Compared to hash skip list implementation, this implementation is more memory efficient, but inside each bucket, search is always linear. The target scenario is that there are only very limited number of records in each hash bucket.

Test Plan: Add a test case in db_test

Reviewers: haobo, kailiu, dhruba

Reviewed By: haobo

CC: igor, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D14979
2014-01-09 16:19:11 -08:00
Siying Dong 5575316350 StopWatch not to get time if it is created for statistics and it is disabled
Summary: Currently, even if statistics is not enabled, StopWatch only for the stats still gets the time of the day, which is wasteful. This patch adds a new option to StopWatch to disable this get in this case.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14703
2014-01-08 16:05:36 -08:00
kailiu 12b6d2b839 Separate the aligned and unaligned memory allocation
Summary: Use two vectors for different types of memory allocation.

Test Plan: run all unit tests.

Reviewers: haobo, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15027
2014-01-08 15:11:42 -08:00
Igor Canadi 72918efffe [column families] Implement DB::OpenWithColumnFamilies()
Summary:
In addition to implementing OpenWithColumnFamilies, this diff also includes some minor changes:
* Changed all column family names from Slice() to std::string. The performance of column family name handling is not critical, and it's more convenient and cleaner to have names as std::strings
* Implemented ColumnFamilyOptions(const Options&) and DBOptions(const Options&)
* Added ColumnFamilyOptions to VersionSet::ColumnFamilyData. ColumnFamilyOptions are specified on OpenWithColumnFamilies() and CreateColumnFamily()

I will keep the diff in the Phabricator for a day or two and will push to the branch then. Feel free to comment even after the diff has been pushed.

Test Plan: Added a simple unit test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15033
2014-01-07 11:05:50 -08:00
Igor Canadi 17a222670b Merge branch 'master' into performance 2014-01-07 11:04:21 -08:00
Mike Lin 4c75e21c20 Eliminate stdout message when launching a posix thread.
This seems out of place as it's the only time RocksDB prints to stdout in the
normal course of operations. Thread IDs can still be retrieved from the LOG
file: cut -d ' ' -f2 LOG | sort | uniq | egrep -x '[0-9a-f]+'
2014-01-07 10:44:02 -08:00
Igor Canadi fff5c7e817 Merge branch 'master' into columnfamilies 2014-01-06 13:31:41 -08:00
kailiu 7e70ff63d6 Fix issue #57 2014-01-06 11:11:19 -08:00
Kai Liu 774ed89c24 Replace vector with autovector
Summary: this diff only replace the cases when we need to frequently create vector with small amount of entries. This diff doesn't aim to improve performance of a specific area, but more like a small scale test for the autovector and see how it works in real life.

Test Plan:
make check

I also ran the performance tests, however there is no performance gain/loss. All performance numbers are pretty much the same before/after the change.

Reviewers: dhruba, haobo, sdong, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14985
2014-01-02 16:43:35 -08:00
Igor Canadi 6de1b5b83e Merge branch 'master' into columnfamilies 2014-01-02 04:18:07 -08:00
kailiu f1cec73a76 Merge branch 'master' into performance
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	db/memtable.cc
	db/version_set.cc
	include/rocksdb/statistics.h
2013-12-27 12:23:17 -08:00
Siying Dong 18df47b79a Avoid malloc in NotFound key status if no message is given.
Summary:
In some places we have NotFound status created with empty message, but it doesn't avoid a malloc. With this patch, the malloc is avoided for that case.

The motivation of it is that I found in db_bench readrandom test when all keys are not existing, about 4% of the total running time is spent on malloc of Status, plus a similar amount of CPU spent on free of them, which is not necessary.

Test Plan: make all check

Reviewers: dhruba, haobo, igor

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14691
2013-12-26 16:23:10 -08:00
Kai Liu b40c052bfa Fix all the comparison issue in fb dev servers 2013-12-26 16:13:49 -08:00
kailiu 113a08c929 Fix [-Werror=sign-compare] in autovector_test 2013-12-26 15:47:07 -08:00
kailiu c01676e46d Implement autovector
Summary:
A vector that leverages pre-allocated stack-based array to achieve better
performance for array with small amount of items.

Test Plan:
Added tests for both correctness and performance

Here is the performance benchmark between vector and autovector

Please note that in the test "Creation and Insertion Test", the test case were designed with the motivation described below:

* no element inserted: internal array of std::vector may not really get
  initialize.
* one element inserted: internal array of std::vector must have
  initialized.
* kSize elements inserted. This shows the most time we'll spend if we
  keep everything in stack.
* 2 * kSize elements inserted. The internal vector of
  autovector must have been initialized.

Note: kSize is the capacity of autovector

  =====================================================
  Creation and Insertion Test
  =====================================================
  created 100000 vectors:
  	each was inserted with 0 elements
  	total time elapsed: 128000 (ns)
  created 100000 autovectors:
  	each was inserted with 0 elements
  	total time elapsed: 3641000 (ns)
  created 100000 VectorWithReserveSizes:
  	each was inserted with 0 elements
  	total time elapsed: 9896000 (ns)
  -----------------------------------
  created 100000 vectors:
  	each was inserted with 1 elements
  	total time elapsed: 11089000 (ns)
  created 100000 autovectors:
  	each was inserted with 1 elements
  	total time elapsed: 5008000 (ns)
  created 100000 VectorWithReserveSizes:
  	each was inserted with 1 elements
  	total time elapsed: 24271000 (ns)
  -----------------------------------
  created 100000 vectors:
  	each was inserted with 4 elements
  	total time elapsed: 39369000 (ns)
  created 100000 autovectors:
  	each was inserted with 4 elements
  	total time elapsed: 10121000 (ns)
  created 100000 VectorWithReserveSizes:
  	each was inserted with 4 elements
  	total time elapsed: 28473000 (ns)
  -----------------------------------
  created 100000 vectors:
  	each was inserted with 8 elements
  	total time elapsed: 75013000 (ns)
  created 100000 autovectors:
  	each was inserted with 8 elements
  	total time elapsed: 18237000 (ns)
  created 100000 VectorWithReserveSizes:
  	each was inserted with 8 elements
  	total time elapsed: 42464000 (ns)
  -----------------------------------
  created 100000 vectors:
  	each was inserted with 16 elements
  	total time elapsed: 102319000 (ns)
  created 100000 autovectors:
  	each was inserted with 16 elements
  	total time elapsed: 76724000 (ns)
  created 100000 VectorWithReserveSizes:
  	each was inserted with 16 elements
  	total time elapsed: 68285000 (ns)
  -----------------------------------
  =====================================================
  Sequence Access Test
  =====================================================
  performed 100000 sequence access against vector
  	size: 4
  	total time elapsed: 198000 (ns)
  performed 100000 sequence access against autovector
  	size: 4
  	total time elapsed: 306000 (ns)
  -----------------------------------
  performed 100000 sequence access against vector
  	size: 8
  	total time elapsed: 565000 (ns)
  performed 100000 sequence access against autovector
  	size: 8
  	total time elapsed: 512000 (ns)
  -----------------------------------
  performed 100000 sequence access against vector
  	size: 16
  	total time elapsed: 1076000 (ns)
  performed 100000 sequence access against autovector
  	size: 16
  	total time elapsed: 1070000 (ns)
  -----------------------------------

Reviewers: dhruba, haobo, sdong, chip

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14655
2013-12-26 15:03:47 -08:00
Kai Liu 5643ae1a3f Merge pull request #32 from jamesgolick/master
Only try to use fallocate if it's actually present on the system.
2013-12-26 14:20:22 -08:00
Siying Dong abaf26266d [RocksDB] [Performance Branch] Some Changes to PlainTable format
Summary:
Some changes to PlainTable format:
(1) support variable key length
(2) use user defined slice transformer to extract prefixes
(3) Run some test cases against PlainTable in db_test and table_test

Test Plan: test db_test

Reviewers: haobo, kailiu

CC: dhruba, igor, leveldb, nkg-

Differential Revision: https://reviews.facebook.net/D14457
2013-12-20 12:08:35 -08:00
Igor Canadi 9385a5247e [RocksDB] [Column Family] Interface proposal
Summary:
<This diff is for Column Family branch>

Sharing some of the work I've done so far. This diff compiles and passes the tests.

The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.

Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]

Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.

Please provide feedback.

Test Plan: make check works, the code is backward compatible

Reviewers: dhruba, haobo, sdong, kailiu, emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14445
2013-12-18 13:08:22 -08:00
kailiu 5f5e5fc2e9 Revert `atomic_size_t usage`
Summary:
By disassemble the function, we found that the atomic variables do invoke the `lock` that locks the memory bus.
As a tradeoff, we protect the GetUsage by mutex and leave usage_ as plain size_t.

Test Plan: passed `cache_test`

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14667
2013-12-13 17:03:19 -08:00
Haobo Xu 5090316f0d [RocksDB] [Performance Branch] Trivia build fix
Summary: make release complains signed unsigned comparison.

Test Plan: make release

Reviewers: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14661
2013-12-13 14:21:59 -08:00
kailiu b660e2d468 Expose usage info for the cache
Summary: This diff will help us to figure out the memory usage for the cache part.

Test Plan: added a new memory usage test for cache

Reviewers: haobo, sdong, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14559
2013-12-13 12:53:45 -08:00
James Golick c28dd2a891 oops - missed a spot 2013-12-11 11:18:00 -08:00
Igor Canadi e8d40c31b3 [RocksDB perf] Cache speedup
Summary:
I have ran a get benchmark where all the data is in the cache and observed that most of the time is spent on waiting for lock in LRUCache.

This is an effort to optimize LRUCache.

Test Plan:
The data was loaded with fillseq. Then, I ran a benchmark:

    /db_bench --db=/tmp/rocksdb_stat_bench --num=1000000 --benchmarks=readrandom --statistics=1 --use_existing_db=1 --threads=16 --disable_seek_compaction=1 --cache_size=20000000000 --cache_numshardbits=8 --table_cache_numshardbits=8

I ran the benchmark three times. Here are the results:
AFTER THE PATCH: 798072, 803998, 811807
BEFORE THE PATCH: 782008, 815593, 763017

Reviewers: dhruba, haobo, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14571
2013-12-11 08:33:29 -08:00
Haobo Xu 3c02c363b3 [RocksDB] [Performance Branch] Added dynamic bloom, to be used for memable non-existing key filtering
Summary: as title

Test Plan: dynamic_bloom_test

Reviewers: dhruba, sdong, kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14385
2013-12-11 00:15:14 -08:00
James Golick 43c386b72e only try to use fallocate if it's actually present on the system 2013-12-10 22:34:19 -08:00
kailiu c79e595471 Make Cache::GetCapacity constant
Summary: This will allow us to access constant via `DB::GetOptions().table_cache.GetCapacity()` or `DB::GetOptions().block_cache.GetCapacity()` since GetOptions() is also constant method.
2013-12-10 17:34:35 -08:00
Igor Canadi 19f5463d3f Don't LogFlush() in foreground threads
Summary: So fflush() takes a lock which is heavyweight. I added flush_pending_, but more importantly, I removed LogFlush() from foreground threads.

Test Plan: ./db_test

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14535
2013-12-10 10:57:46 -08:00
Igor Canadi fb9fce4fc3 [RocksDB] BackupableDB
Summary:
In this diff I present you BackupableDB v1. You can easily use it to backup your DB and it will do incremental snapshots for you.
Let's first describe how you would use BackupableDB. It's inheriting StackableDB interface so you can easily construct it with your DB object -- it will add a method RollTheSnapshot() to the DB object. When you call RollTheSnapshot(), current snapshot of the DB will be stored in the backup dir. To restore, you can just call RestoreDBFromBackup() on a BackupableDB (which is a static method) and it will restore all files from the backup dir. In the next version, it will even support automatic backuping every X minutes.

There are multiple things you can configure:
1. backup_env and db_env can be different, which is awesome because then you can easily backup to HDFS or wherever you feel like.
2. sync - if true, it *guarantees* backup consistency on machine reboot
3. number of snapshots to keep - this will keep last N snapshots around if you want, for some reason, be able to restore from an earlier snapshot. All the backuping is done in incremental fashion - if we already have 00010.sst, we will not copy it again. *IMPORTANT* -- This is based on assumption that 00010.sst never changes - two files named 00010.sst from the same DB will always be exactly the same. Is this true? I always copy manifest, current and log files.
4. You can decide if you want to flush the memtables before you backup, or you're fine with backing up the log files -- either way, you get a complete and consistent view of the database at a time of backup.
5. More things you can find in BackupableDBOptions

Here is the directory structure I use:

   backup_dir/CURRENT_SNAPSHOT - just 4 bytes holding the latest snapshot
               0, 1, 2, ... - files containing serialized version of each snapshot - containing a list of files
               files/*.sst - sst files shared between snapshots - if one snapshot references 00010.sst and another one needs to backup it from the DB, it will just reference the same file
               files/ 0/, 1/, 2/, ... - snapshot directories containing private snapshot files - current, manifest and log files

All the files are ref counted and deleted immediatelly when they get out of scope.

Some other stuff in this diff:
1. Added GetEnv() method to the DB. Discussed with @haobo and we agreed that it seems right thing to do.
2. Fixed StackableDB interface. The way it was set up before, I was not able to implement BackupableDB.

Test Plan:
I have a unittest, but please don't look at this yet. I just hacked it up to help me with debugging. I will write a lot of good tests and update the diff.

Also, `make asan_check`

Reviewers: dhruba, haobo, emayanke

Reviewed By: dhruba

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D14295
2013-12-09 14:06:52 -08:00
Igor Canadi 9644e0e0c7 Print stack trace on assertion failure
Summary:
This will help me a lot! When we hit an assertion in unittest, we get the whole stack trace now.

Also, changed stack trace a bit, we now include actual demangled C++ class::function symbols!

Test Plan: Added ASSERT_TRUE(false) to a test, observed a stack trace

Reviewers: haobo, dhruba, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14499
2013-12-06 17:11:09 -08:00
kailiu c7707f24c2 Refine the statistics 2013-12-06 16:51:35 -08:00
kailiu 551e9428ce Merge branch 'master' into performance 2013-12-06 14:15:42 -08:00
kailiu e1d92dfd2e Fix a bunch of mac compilation issues in performance branch 2013-12-04 23:00:33 -08:00
Vamsi Ponnekanti fa88cbc71e [Log dumper broken when merge operator is in log]
Summary: $title

Test Plan:
on my dev box

Revert Plan: OK

Task ID: #

Reviewers: emayanke, dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14451
2013-12-04 16:22:54 -08:00
Mark Callaghan 97aa401e2f Add compression options to db_bench
Summary:
This adds 2 options for compression to db_bench:
* universal_compression_size_percent
* compression_level - to set zlib compression level
It also logs compression_size_percent at startup in LOG

Task ID: #

Blame Rev:

Test Plan:
make check, run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14439
2013-12-03 14:28:48 -08:00
Igor Canadi eb12e47e0e Killing Transform Rep
Summary:
Let's get rid of TransformRep and it's children. We have confirmed that HashSkipListRep works better with multifeed, so there is no benefit to keeping this around.

This diff is mostly just deleting references to obsoleted functions. I also have a diff for fbcode that we'll need to push when we switch to new release.

I had to expose HashSkipListRepFactory in the client header files because db_impl.cc needs access to GetTransform() function for SanitizeOptions.

Test Plan: make check

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14397
2013-12-03 12:42:15 -08:00
lovro 930cb0b9ee Clarify CompactionFilter thread safety requirements
Summary: Documenting our discussion

Test Plan: make

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: igor

Differential Revision: https://reviews.facebook.net/D14403
2013-12-02 16:41:43 -08:00
lovro 45a2f2d8d3 Fix build without glibc
Summary: The preprocessor does not follow normal rules of && evaluation, tries to evaluate __GLIBC_PREREQ(2, 12) even though the defined() check fails.  This breaks the build if __GLIBC_PREREQ is absent.

Test Plan: Try adding #undef __GLIBC_PREREQ above the offending line, build no longer breaks

Reviewed By: igor

Blame Rev: 4c81383628
2013-12-01 11:32:54 -08:00
Kai Liu 1966b63137 Merge branch 'master' into perf 2013-11-27 11:47:40 -08:00
lovro 4c81383628 Set background thread name with pthread_setname_np()
Summary: Makes it easier to monitor performance with top

Test Plan: ./manual_compaction_test with `top -H` running.  Previously was two `manual_compacti`, now one shows `rocksdb:bg0`.

Reviewers: igor, dhruba

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14367
2013-11-27 11:28:06 -08:00
Haobo Xu 4e6463ea44 [RocksDB][Performance Branch] Make height and branching factor configurable for skiplist implementation
Summary: As title. Especially, HashSkipListRepFactory will be able to specify a relatively small height, to reduce the memory overhead of one skiplist per bucket.

Test Plan: make check and test it on leaf4

Reviewers: dhruba, sdong, kailiu

CC: reconnect.grayhat, leveldb

Differential Revision: https://reviews.facebook.net/D14307
2013-11-26 21:59:36 -08:00
Siying Dong 8aac46d686 [RocksDB Performance Branch] Fix a regression bug of munmap
Summary:
Fix a stupid bug I just introduced in b59d4d5a50, which I didn't even mean to include.
GCC might remove the munmap.

Test Plan: Run it and make sure munmap succeeds

Reviewers: haobo, kailiu

Reviewed By: kailiu

CC: dhruba, reconnect.grayhat, leveldb

Differential Revision: https://reviews.facebook.net/D14361
2013-11-26 14:05:37 -08:00
Haobo Xu 5b825d6964 [RocksDB] Use raw pointer instead of shared pointer when passing Statistics object internally
Summary: liveness of the statistics object is already ensured by the shared pointer in DB options. There's no reason to pass again shared pointer among internal functions. Raw pointer is sufficient and efficient.

Test Plan: make check

Reviewers: dhruba, MarkCallaghan, igor

Reviewed By: dhruba

CC: leveldb, reconnect.grayhat

Differential Revision: https://reviews.facebook.net/D14289
2013-11-25 10:38:15 -08:00
Siying Dong 3e35aa6412 Revert "Allow users to profile a query and see bottleneck of the query"
This reverts commit 3d8ac31d71.
2013-11-21 17:40:39 -08:00
Siying Dong b135d01e7b Allow users to profile a query and see bottleneck of the query
Summary:
Provide a framework to profile a query in detail to figure out latency bottleneck. Currently, in Get(), Put() and iterators, 2-3 simple timing is used. We can easily add more profile counters to the framework later.

Test Plan: Enable this profiling in seveal existing tests.

Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14001

Conflicts:
	table/merger.cc
2013-11-21 17:39:19 -08:00
Siying Dong 3d8ac31d71 Allow users to profile a query and see bottleneck of the query
Summary:
Provide a framework to profile a query in detail to figure out latency bottleneck. Currently, in Get(), Put() and iterators, 2-3 simple timing is used. We can easily add more profile counters to the framework later.

Test Plan: Enable this profiling in seveal existing tests.

Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14001
2013-11-21 16:29:57 -08:00
Siying Dong 58e1956d50 [Only for Performance Branch] A Hacky patch to lazily generate memtable key for prefix-hashed memtables.
Summary:
For prefix mem tables, encoding mem table key may be unnecessary if the prefix doesn't have any key. This patch is a little bit hacky but I want to try out the performance gain of removing this lazy initialization.

In longer term, we might want to revisit the way we abstract mem tables implementations.

Test Plan: make all check

Reviewers: haobo, igor, kailiu

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14265
2013-11-20 20:49:23 -08:00
Siying Dong b59d4d5a50 A Simple Plain Table
Summary:
A Simple plain table format. No block structure. When creating the table reader, scanning the full table to create indexes.

Test Plan:Add unit test

Reviewers:haobo,dhruba,kailiu

CC:

Task ID: #

Blame Rev:
2013-11-20 18:44:22 -08:00
kailiu 6eb5649800 Move flush_block_policy from Options to TableFactory
Summary:
Previously we introduce a `flush_block_policy_factory` in Options, however, that options is strongly releated to Table based tables.
It will make more sense to move it to block based table's own factory class.

Test Plan: make check to pass existing tests

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14211
2013-11-19 22:00:48 -08:00
kailiu 1415f8820d Improve the "table stats"
Summary:
The primary motivation of the changes is to make it easier to figure out the inside of the tables.

* rename "table stats" to "table properties" since now we have more than "integers" to store in the property block.
* Add filter block size to the basic table properties.
* Whenever a table is built, we'll log the table properties (the sample output is in Test Plan).
* Make an api to expose deleted keys.

Test Plan:
Passed all existing test. and the sample output of table stats:

    ==================================================================
        Basic Properties
    ------------------------------------------------------------------
                  # data blocks: 1
                      # entries: 1

                   raw key size: 9
           raw average key size: 9
                 raw value size: 9
         raw average value size: 0

                data block size: 25
               index block size: 27
              filter block size: 18
         (estimated) table size: 70

                  filter policy: rocksdb.BuiltinBloomFilter
    ==================================================================
        User collected properties: InternalKeyPropertiesCollector
    ------------------------------------------------------------------
                    kDeletedKeys: 1
    ==================================================================

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14187
2013-11-19 16:29:42 -08:00
Igor Canadi f611aba559 Move the compiler back to 4.8.1 + more small fixes
Summary:
1. Moved the compiler back to 4.8.1 and uses Centos 5.2 binaries if OS is Centos 5.2.

2. Fixes this issue: https://github.com/facebook/rocksdb/issues/7

3. We use lot of c++11 features, so we can't pretend we can compile without them. Makes it a first class dependency.

4. Fix blob_store_test, which failes on Ubuntu with "too many files opened" error

5. Removed dependency on port/port_chromium.h, which does not even exist on our system

Test Plan: make clean; make check

Reviewers: dhruba, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14145
2013-11-18 11:40:16 -08:00
kailiu 97d8e573a6 make util/env_posix.cc work under mac
Summary: This diff invoves some more complicated issues in the posix environment.

Test Plan: works under mac os. will need to verify dev box.

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14061
2013-11-16 23:44:39 -08:00
Pascal Borreli 443e04e62d Fixed typos 2013-11-16 11:21:34 +00:00
Kai Liu 80bb81c6fe Add the correct table_factory for tables in table_tests 2013-11-12 23:54:31 -08:00
Kai Liu 22e1b04deb Quick fix for a string format
Summary:

Fix one more string format issue that throws warning in mac
2013-11-12 21:22:32 -08:00
Kai Liu 35460ccb53 Fix the string format issue
Summary:

mac and our dev server has totally differnt definition of uint64_t, therefore fixing the warning in mac has actually made code in linux uncompileable.

Test Plan:

make clean && make -j32
2013-11-12 21:05:39 -08:00
kailiu 21587760b9 Fixing the warning messages captured under mac os # Consider using `git commit -m 'One line title' && arc diff`. # You will save time by running lint and unit in the background.
Summary: The work to make sure mac os compiles rocksdb is not completed yet. But at least we can start cleaning some warnings captured only by g++ from mac os..

Test Plan: ran make in mac os

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14049
2013-11-12 20:05:28 -08:00
Igor Canadi 9bc4a26f56 Small changes in Deleting obsolete files
Summary:
@haobo's suggestions from https://reviews.facebook.net/D13827

Renaming some variables, deprecating purge_log_after_flush, changing for loop into auto for loop.

I have not implemented deleting objects outside of mutex yet because it would require a big code change - we would delete object in db_impl, which currently does not know anything about object because it's defined in version_edit.h (FileMetaData). We should do it at some point, though.

Test Plan: Ran deletefile_test

Reviewers: haobo

Reviewed By: haobo

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D14025
2013-11-12 11:53:26 -08:00
lovro 8a46ecd357 WriteBatch::Put() overload that gathers key and value from arrays of slices
Summary: In our project, when writing to the database, we want to form the value as the concatenation of a small header and a larger payload.  It's a shame to have to copy the payload just so we can give RocksDB API a linear view of the value.  Since RocksDB makes a copy internally, it's easy to support gather writes.

Test Plan: write_batch_test, new test case

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13947
2013-11-08 16:34:32 -08:00
Igor Canadi 1510339e52 Speed up FindObsoleteFiles
Summary:
Here's one solution we discussed on speeding up FindObsoleteFiles. Keep a set of all files in DBImpl and update the set every time we create a file. I probably missed few other spots where we create a file.

It might speed things up a bit, but makes code uglier. I don't really like it.

Much better approach would be to abstract all file handling to a separate class. Think of it as layer between DBImpl and Env. Having a separate class deal with file namings and deletion would benefit both code cleanliness (especially with huge DBImpl) and speed things up. It will take a huge effort to do this, though.

Let's discuss offline today.

Test Plan: Ran ./db_stress, verified that files are getting deleted

Reviewers: dhruba, haobo, kailiu, emayanke

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D13827
2013-11-08 15:23:46 -08:00
Igor Canadi 8b3379dc0a Implementing DynamicIterator for TransformRepNoLock
Summary: What @haobo done with TransformRep, now in TransformRepNoLock. Similar implementation, except that I made DynamicIterator a subclass of Iterator which makes me have less iterator initializations.

Test Plan: ./prefix_test. Seeing huge savings vs. TransformRep again!

Reviewers: dhruba, haobo, sdong, kailiu

Reviewed By: haobo

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D13953
2013-11-08 00:31:09 -08:00
Kai Liu fd075d6edd Provide mechanism to configure when to flush the block
Summary: Allow block based table to configure the way flushing the blocks. This feature will allow us to add support for prefix-aligned block.

Test Plan: make check

Reviewers: dhruba, haobo, sdong, igor

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13875
2013-11-07 21:27:21 -08:00
Igor Canadi 444cf88a56 Flush the log outside of lock
Summary:
Added a new call LogFlush() that flushes the log contents to the OS buffers. We never call it with lock held.

We call it once for every Read/Write and often in compaction/flush process so the frequency should not be a problem.

Test Plan: db_test

Reviewers: dhruba, haobo, kailiu, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13935
2013-11-07 11:31:56 -08:00
Haobo Xu fd2044883a [RocksDB] Generalize prefix-aware iterator to be used for more than one Seek
Summary: Added a prefix_seek flag in ReadOptions to indicate that Seek is prefix aware(might not return data with different prefix), and also not bound to a specific prefix. Multiple Seeks and range scans can be invoked on the same iterator. If a specific prefix is specified, this flag will be ignored. Just a quick prototype that works for PrefixHashRep, the new lockless memtable could be easily extended with this support too.

Test Plan: test it on Leaf

Reviewers: dhruba, kailiu, sdong, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13929
2013-11-06 20:45:49 -08:00
shamdor c2be2cba04 WAL log retention policy based on archive size.
Summary:
Archive cleaning will still happen every WAL_ttl seconds
but archived logs will be deleted only if archive size
is greater then a WAL_size_limit value.
Empty archived logs will be deleted evety WAL_ttl.

Test Plan:
1. Unit tests pass.
2. Benchmark.

Reviewers: emayanke, dhruba, haobo, sdong, kailiu, igor

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13869
2013-11-06 18:46:28 -08:00
Igor Canadi 95a8213f75 Log flush every 0 seconds
Summary: We have to be able to catch last few log outputs before a crash

Test Plan: no

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13917
2013-11-06 14:19:46 -08:00
Igor Canadi be96f2498e TransformRep - use array instead of unordered_map
Summary:
I'm sending this diff together with https://reviews.facebook.net/D13881 because it didn't allow me to send only the array one.

Here I also replaced unordered_map with just an array of shared_ptrs. This elminated all the locks.

I will run the new benchmark and post the results here.

Test Plan: db_test

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13893
2013-11-06 11:55:43 -08:00
Dhruba Borthakur b4ad5e89ae Implement a compressed block cache.
Summary:
Rocksdb can now support a uncompressed block cache, or a compressed
block cache or both. Lookups first look for a block in the
uncompressed cache, if it is not found only then it is looked up
in the compressed cache. If it is found in the compressed cache,
then it is uncompressed and inserted into the uncompressed cache.

It is possible that the same block resides in the compressed cache
as well as the uncompressed cache at the same time. Both caches
have their own individual LRU policy.

Test Plan: Unit test case attached.

Reviewers: kailiu, sdong, haobo, leveldb

Reviewed By: haobo

CC: xjin, haobo

Differential Revision: https://reviews.facebook.net/D12675
2013-11-01 14:31:35 -07:00
Piyush Garg 1e4375d2ef Task #3071144 Enhance ldb (db dump tool for leveldb) to report row counters for each row type
Summary: Added an option --count_delim=<char> which takes the given character as delimiter ('.' by default) and reports count of each row type found in the db

Test Plan:
1. Created test in file (for DBDumperCommand) rocksdb/tools/ldb_test.py which puts various key value pair in db and checks the output using dump --count_delim ,--count_delim="." and --count_delim=",".
2. Created test in file (for InternalDumperCommand) rocksdb/tools/ldb_test.py which puts various key value pair in db and checks the output using dump --count_delim ,--count_delim="." and --count_delim=",".
3. Manually created a database with several keys of several type and verified by running the command
   ./ldb db=<path> dump --count_delim="<char>"
   ./ldb db=<path> idump --count_delim="<char>"

Reviewers: vamsi, dhruba, emayanke, kailiu

Reviewed By: vamsi

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13815
2013-11-01 13:59:14 -07:00
Igor Canadi b572e81f94 Flush Log every 5 seconds
Summary: This might help with p99 performance, but does not solve the real problem. More discussion on #2947135

Test Plan: make check

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13809
2013-10-31 15:36:40 -07:00
Naman Gupta b4fab3be2a Merge branch 'master' of github.com:facebook/rocksdb into inplace 2013-10-31 11:51:03 -07:00
Naman Gupta fe25070242 In-place updates for equal keys and similar sized values
Summary:
Currently for each put, a fresh memory is allocated, and a new entry is added to the memtable with a new sequence number irrespective of whether the key already exists in the memtable. This diff is an attempt to update the value inplace for existing keys. It currently handles a very simple case:
1. Key already exists in the current memtable. Does not inplace update values in immutable memtable or snapshot
2. Latest value type is a 'put' ie kTypeValue
3. New value size is less than existing value, to avoid reallocating memory

TODO: For a put of an existing key, deallocate memory take by values, for other value types till a kTypeValue is found, ie. remove kTypeMerge.
TODO: Update the transaction log, to allow consistent reload of the memtable.

Test Plan: Added a unit test verifying the inplace update. But some other unit tests broken due to invalid sequence number checks. WIll fix them next.

Reviewers: xinyaohu, sumeet, haobo, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12423

Automatic commit by arc
2013-10-31 11:27:12 -07:00
Siying Dong d4eec30ed0 Make "Table" pluggable
Summary: This patch makes Table and TableBuilder a abstract class and make all the implementation of the current table into BlockedBasedTable and BlockedBasedTable Builder.

Test Plan: Make db_test.cc to work with block based table. Add a new test simple_table_db_test.cc where a different simple table format is implemented.

Reviewers: dhruba, haobo, kailiu, emayanke, vamsi

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13521
2013-10-28 17:54:09 -07:00
Kai Liu 994575c134 Support user-defined table stats collector
Summary:
1. Added a new option that support user-defined table stats collection.
2. Added a deleted key stats collector in `utilities`

Test Plan:
Added a unit test for newly added code.
Also ran make check to make sure other tests are not broken.

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13491
2013-10-28 15:45:14 -07:00
Igor Canadi cb8a7302e4 Implement max_size in BlobStore
Summary:
I added max_size option in blobstore. Since we now know the maximum number of buckets we'll ever use, we can allocate an array of buckets and access its elements without use of any locks! Common case Get doesn't lock anything now.

Benchmarks on 16KB block size show no impact on speed, though.

Test Plan: unittests + benchmark

Reviewers: dhruba, haobo, kailiu

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13641
2013-10-23 14:38:52 -07:00
Haobo Xu 2fb361ad98 [RocksDB] Add perf_context.wal_write_time to track time spent on writing the recovery log.
Summary: as title

Test Plan: make check; ./perf_context_test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13629
2013-10-23 13:38:39 -07:00
Igor Canadi 30f1b97a06 Enable blobs to be fragmented
Summary:
I have implemented a FreeList version that supports fragmented blob chunks. Each block gets allocated and freed in FIFO order. Since the idea for the blocks to be big, we will not take a big hit of non-sequential IO. Free list is also faster, taking only O(k) size in both free and allocate instead of O(N) as before.

See more info on the task: https://our.intern.facebook.com/intern/tasks/?t=2990558

Also, I'm taking Slice instead of const char * and size in Put function.

Test Plan: unittests

Reviewers: haobo, kailiu, dhruba, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13569
2013-10-22 17:44:00 -07:00
Mayank Agarwal 9b50106f9a Dbid feature
Summary:
Create a new type of file on startup if it doesn't already exist called DBID.
This will store a unique number generated from boost library's uuid header file.
The use-case is to identify the case of a db losing all its data and coming back up either empty or from an image(backup/live replica's recovery)
the key point to note is that DBID is not stored in a backup or db snapshot
It's preferable to use Boost for uuid because:
1) A non-standard way of generating uuid is not good
2) /proc/sys/kernel/random/uuid generates a uuid but only on linux environments and the solution would not be clean
3) c++ doesn't have any direct way to get a uuid
4) Boost is a very good library that was already having linkage in rocksdb from third-party
Note: I had to update the TOOLCHAIN_REV in build files to get latest verison of boost from third-party as the older version had a bug.
I had to put Wno-uninitialized in Makefile because boost-1.51 has an unitialized variable and rocksdb would not comiple otherwise. Latet open-source for boost is 1.54 but is not there in third-party. I have notified the concerned people in fbcode about it.
@kailiu : While releasing to third-party, an additional dependency will need to be created for boost in TARGETS file. I can help identify.

Test Plan:
Expand db_test to test 2 cases
1) Restarting db with Id file present - verify that no change to Id
2)Restarting db with Id file deleted - verify that a different Id is there after reopen
Also run make all check

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13587
2013-10-22 12:23:34 -07:00
Igor Canadi c674b42d52 Rephrasing the comment
Summary: Per @haobo's request, rephrasing the comment for allocate

Test Plan: It's a comment!

Reviewers: haobo, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13575
2013-10-21 10:23:56 -07:00
Igor Canadi bcc8557901 tmpfs does not support fallocate
Summary: This caused Siying's unit test to fail.

Test Plan: Unittest

Reviewers: dhruba, kailiu, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13539
2013-10-17 22:15:57 -07:00
Vamsi Ponnekanti 6731997f64 [ldb compact is not allowing ttl flag]
Summary: Allow ttl flag

Test Plan:
tested on my database that has merge operations and ttl

Revert Plan: OK

Task ID: #3038186

Reviewers: emayanke, dhruba, haobo

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13503
2013-10-16 18:06:54 -07:00
Dhruba Borthakur 9cd221094c Add appropriate LICENSE and Copyright message.
Summary:
Add appropriate LICENSE and Copyright message.

Test Plan:
make check

Reviewers:

CC:

Task ID: #

Blame Rev:
2013-10-16 17:48:41 -07:00
Igor Canadi fc4616d898 External Value Store
Summary:
Developing a capability for storing values on external backing file(s).

This is just a highly unoptimized first pass - supports:
1) Allocating some portion of external file to be used to store value
2) Freeing the range, enabling it to be reused by other values

As next steps, I plan to:
1) Create some kind of stress testing. Once I can measure stuff, I can focus on optimizing.
2) Optimize locking.
3) Optimize freelist data structure. Currently we have O(n) for both freeing and allocation.
4) Figure out how to do recovery.

Test Plan: Created a unit test.

Reviewers: dhruba, haobo, kailiu

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13389
2013-10-16 17:33:49 -07:00
Dhruba Borthakur b825df81e2 Fix error in previous commit of 'ftruncate' to 'fallocate'.
Summary:
Fix error in previous commit of 'ftruncate' to 'fallocate'.

Test Plan:

Reviewers:

CC:

Task ID: #

Blame Rev:
2013-10-15 13:57:29 -07:00
Dhruba Borthakur 8457b74c26 Fix Unit test when run on tmpfs
Summary:
tmpfs might not support fallocate(). Fix unit test so that this
does not cause a unit test to fail.

Test Plan: ./env_test

Reviewers: emayanke, igor, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13455
2013-10-15 12:03:09 -07:00
Mayank Agarwal da2fd001a6 Fix rocksdb->levledb BytewiseComparator and inverted order of error in db/version_set.cc
Summary:
This is needed to make existing dbs be able to open and also because BytewiseComparator was not changed since leveldb.
The inverted order in the error message caused confusion prebiously

Test Plan: make; open existing db

Reviewers: leveldb, dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D13449
2013-10-14 18:16:54 -07:00
sdong f8509653ba LRUCache to try to clean entries not referenced first.
Summary:
With this patch, when LRUCache.Insert() is called and the cache is full, it will first try to free up entries whose reference counter is 1 (would become 0 after remo\
ving from the cache). We do it in two passes, in the first pass, we only try to release those unreferenced entries. If we cannot free enough space after traversing t\
he first remove_scan_cnt_ entries, we start from the beginning again and remove those entries being used.

Test Plan: add two unit tests to cover the codes

Reviewers: dhruba, haobo, emayanke

Reviewed By: emayanke

CC: leveldb, emayanke, xjin

Differential Revision: https://reviews.facebook.net/D13377
2013-10-11 09:26:21 -07:00
Siying Dong 40a1e31fa5 Minor: Fix a lint error in cache_test.cc
Summary:
As title. Fix an lint error:

Lint: CppLint Error
Single-argument constructor 'Value(int v)' may inadvertently be used as a type conversion constructor. Prefix the function with the 'explicit' keyword to avoid this, or add an /* implicit */ comment to suppress this warning.

Test Plan: N/A

Reviewers: emayanke, haobo, dhruba

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13401
2013-10-10 13:47:25 -07:00
Igor Canadi d2ca2bd183 Fixing build failure
Summary: virtual NewRandomRWFile is not implemented on EnvHdfs, causing build failure.

Test Plan: make clean; make all check

Reviewers: dhruba, haobo, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13383
2013-10-10 01:01:16 -07:00
Igor Canadi d0beadd456 Env class that can randomly read and write
Summary: I have implemented basic simple use case that I need for External Value Store I'm working on. There is a potential for making this prettier by refactoring/combining WritableFile and RandomAccessFile, avoiding some copypasta. However, I decided to implement just the basic functionality, so I can continue working on the other diff.

Test Plan: Added a unittest

Reviewers: dhruba, haobo, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13365
2013-10-10 00:03:08 -07:00
Naman Gupta cbf4a06427 Add option for storing transaction logs in a separate dir
Summary: In some cases, you might not want to store the data log (write ahead log) files in the same dir as the sst files. An example use case is leaf, which stores sst files in tmpfs. And would like to save the log files in a separate dir (disk) to save memory.

Test Plan: make all. Ran db_test test. A few test failing. P2785018. If you guys don't see an obvious problem with the code, maybe somebody from the rocksdb team could help me debug the issue here. Running this on leaf worked well. I could see logs stored on disk, and deleted appropriately after compactions. Obviously this is only one set of options. The unit tests cover different options. Seems like I'm missing some edge cases.

Reviewers: dhruba, haobo, leveldb

CC: xinyaohu, sumeet

Differential Revision: https://reviews.facebook.net/D13239
2013-10-08 17:40:27 -07:00
Igor Canadi fa46ddb41f Move delete and free outside of crtical section
Summary: Split Unref into two parts -> cheap and expensive. Try to call expensive Unref outside of critical section to decrease lock contention.

Test Plan: unittests

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb, kailiu

Differential Revision: https://reviews.facebook.net/D13299
2013-10-07 15:37:40 -07:00
Dhruba Borthakur 4463b11cad Migrate names of properties from 'leveldb' prefix to 'rocksdb' prefix.
Summary: Migrate names of properties from 'leveldb' prefix to 'rocksdb' prefix.

Test Plan: make check

Reviewers: emayanke, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13311
2013-10-06 00:14:26 -07:00
Dhruba Borthakur a143ef9b38 Change namespace from leveldb to rocksdb
Summary:
Change namespace from leveldb to rocksdb. This allows a single
application to link in open-source leveldb code as well as
rocksdb code into the same process.

Test Plan: compile rocksdb

Reviewers: emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13287
2013-10-04 11:59:26 -07:00
Haobo Xu fa798e9e28 [Rocksdb] Submit mem table flush job in a different thread pool
Summary: As title. This is just a quick hack and not ready for commit. fails a lot of unit test. I will test/debug it directly in ViewState shadow .

Test Plan: Try it in shadow test.

Reviewers: dhruba, xjin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12933
2013-10-03 14:37:19 -07:00
Haobo Xu 71046971f0 [RocksDB] Added perf counters to track skipped internal keys during iteration
Summary: as title. unit test not polished. this is for a quick live test

Test Plan: live

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13221
2013-10-02 10:48:41 -07:00
Haobo Xu 22bb7c754b [RocksDB] print the name of options.memtable_factory in LOG so we know
Summary: as title

Test Plan: make check

Reviewers: dhruba, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13179
2013-09-28 20:57:29 -07:00
Dhruba Borthakur f1a60e5c3e The vector rep implementation was segfaulting because of incorrect initialization of vector.
Summary:
The constructor for Vector memtable has a parameter called 'count'
that specifies the capacity of the vector to be reserved at allocation
time. It was incorrectly used to initialize the size of the vector.

Test Plan: Enhanced db_test.

Reviewers: haobo, xjin, emayanke

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13083
2013-09-25 11:33:52 -07:00
Dhruba Borthakur 87d6eb2f6b Implement apis in the Environment to clear out pages in the OS cache.
Summary:
Added a new api to the Environment that allows clearing out not-needed
pages from the OS cache. This will be helpful when the compressed
block cache replaces the OS cache.

Test Plan: EnvPosixTest.InvalidateCache

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13041
2013-09-23 22:05:03 -07:00
Dhruba Borthakur 5e9f3a9aa7 Better locking in vectorrep that increases throughput to match speed of storage.
Summary:
There is a use-case where we want to insert data into rocksdb as
fast as possible. Vector rep is used for this purpose.

The background flush thread needs to flush the vectorrep to
storage. It acquires the dblock then sorts the vector, releases
the dblock and then writes the sorted vector to storage. This is
suboptimal because the lock is held during the sort, which
prevents new writes for occuring.

This patch moves the sorting of the vector rep to outside the
db mutex. Performance is now as fastas the underlying storage
system. If you are doing buffered writes to rocksdb files, then
you can observe throughput upwards of 200 MB/sec writes.

This is an early draft and not yet ready to be reviewed.

Test Plan:
make check

Task ID: #

Blame Rev:

Reviewers: haobo

Reviewed By: haobo

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D12987
2013-09-19 21:48:10 -07:00
Rajat Goel 11c65021fb Revert "Minor fixes found while trying to compile it using clang on Mac OS X"
This reverts commit 5f2c136c32.
2013-09-15 23:01:26 -07:00
Rajat Goel 5f2c136c32 Minor fixes found while trying to compile it using clang on Mac OS X 2013-09-15 22:06:14 -07:00
Haobo Xu 8866448001 [RocksDB] fix build env_test
Summary: move the TwoPools test to the end of thread related tests. Otherwise, the SetBackgroundThreads call would increase the Low pool size and affect the result of other tests.

Test Plan: make env_test; ./env_test

Reviewers: dhruba, emayanke, xjin

Reviewed By: xjin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12939
2013-09-13 21:13:20 -07:00
Dhruba Borthakur 4012ca1c7b Added a parameter to limit the maximum space amplification for universal compaction.
Summary:
Added a new field called max_size_amplification_ratio in the
CompactionOptionsUniversal structure. This determines the maximum
percentage overhead of space amplification.

The size amplification is defined to be the ratio between the size of
the oldest file to the sum of the sizes of all other files. If the
size amplification exceeds the specified value, then min_merge_width
and max_merge_width are ignored and a full compaction of all files is done.
A value of 10 means that the size a database that stores 100 bytes
of user data could occupy 110 bytes of physical storage.

Test Plan: Unit test DBTest.UniversalCompactionSpaceAmplification added.

Reviewers: haobo, emayanke, xjin

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12825
2013-09-13 16:27:18 -07:00
Haobo Xu 1565dab809 [RocksDB] Enhance Env to support two thread pools LOW and HIGH
Summary:
this is the ground work for separating memtable flush jobs to their own thread pool.
Both SetBackgroundThreads and Schedule take a third parameter Priority to indicate which thread pool they are working on. The names LOW and HIGH are just identifiers for two different thread pools, and does not indicate real difference in 'priority'. We can set number of threads in the pools independently.
The thread pool implementation is refactored.

Test Plan: make check

Reviewers: dhruba, emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12885
2013-09-12 16:15:36 -07:00
Haobo Xu 0e422308aa [RocksDB] Remove Log file immediately after memtable flush
Summary: As title. The DB log file life cycle is tied up with the memtable it backs. Once the memtable is flushed to sst and committed, we should be able to delete the log file, without holding the mutex. This is part of the bigger change to avoid FindObsoleteFiles at runtime. It deals with log files. sst files will be dealt with later.

Test Plan: make check; db_bench

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11709
2013-09-12 11:54:44 -07:00
Haobo Xu f2f4c8072f [RocksDB] Added nano second stopwatch and new perf counters to track block read cost
Summary: The pupose of this diff is to expose per user-call level precise timing of block read, so that we can answer questions like: a Get() costs me 100ms, is that somehow related to loading blocks from file system, or sth else? We will answer that with EXACTLY how many blocks have been read, how much time was spent on transfering the bytes from os, how much time was spent on checksum verification and how much time was spent on block decompression, just for that one Get. A nano second stopwatch was introduced to track time with higher precision. The cost/precision of the stopwatch is also measured in unit-test. On my dev box, retrieving one time instance costs about 30ns, on average. The deviation of timing results is good enough to track 100ns-1us level events. And the overhead could be safely ignored for 100us level events (10000 instances/s), for example, a viewstate thrift call.

Test Plan: perf_context_test, also testing with viewstate shadow traffic.

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb, xjin

Differential Revision: https://reviews.facebook.net/D12351
2013-09-07 21:14:54 -07:00
Dhruba Borthakur 197034e4c3 An iterator may automatically invoke reseeks.
Summary:
An iterator invokes reseek if the number of sequential skips over the
same userkey exceeds a configured number. This makes iter->Next()
faster (bacause of fewer key compares) if a large number of
adjacent internal keys in a table (sst or memtable) have the
same userkey.

Test Plan: Unit test DBTest.IterReseek.

Reviewers: emayanke, haobo, xjin

Reviewed By: xjin

CC: leveldb, xjin

Differential Revision: https://reviews.facebook.net/D11865
2013-09-06 11:50:53 -07:00
Xing Jin 42c109cc2e New ldb command to convert compaction style
Summary:
Add new command "change_compaction_style" to ldb tool. For
universal->level, it shows "nothing to do". For level->universal, it
compacts all files into a single one and moves the file to level 0.

Also add check for number of files at level 1+ when opening db with
universal compaction style.

Test Plan:
'make all check'. New unit test for internal convertion function. Also manully test various
cmd like:

./ldb change_compaction_style --old_compaction_style=0
--new_compaction_style=1 --db=/tmp/leveldbtest-3088/db_test

Reviewers: haobo, dhruba

Reviewed By: haobo

CC: vamsi, emayanke

Differential Revision: https://reviews.facebook.net/D12603
2013-09-04 13:13:08 -07:00
Jim Paton 5c3b254ef2 Fix memory leak
Summary: There is a memory leak because TransformRepFactory does not delete its SliceTransform pointer. This patch adds a delete to the destructor.

Test Plan:
make check
make valgrind_check

Reviewers: dhruba, emayanke, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12513
2013-08-23 15:39:49 -07:00
Dhruba Borthakur 1186192ed1 Replace include/leveldb with include/rocksdb.
Summary: Replace include/leveldb with include/rocksdb.

Test Plan:
make clean; make check
make clean; make release

Differential Revision: https://reviews.facebook.net/D12489
2013-08-23 10:51:00 -07:00
Jim Paton 74781a0c49 Add three new MemTableRep's
Summary:
This patch adds three new MemTableRep's: UnsortedRep, PrefixHashRep, and VectorRep.

UnsortedRep stores keys in an std::unordered_map of std::sets. When an iterator is requested, it dumps the keys into an std::set and iterates over that.

VectorRep stores keys in an std::vector. When an iterator is requested, it creates a copy of the vector and sorts it using std::sort. The iterator accesses that new vector.

PrefixHashRep stores keys in an unordered_map mapping prefixes to ordered sets.

I also added one API change. I added a function MemTableRep::MarkImmutable. This function is called when the rep is added to the immutable list. It doesn't do anything yet, but it seems like that could be useful. In particular, for the vectorrep, it means we could elide the extra copy and just sort in place. The only reason I haven't done that yet is because the use of the ArenaAllocator complicates things (I can elaborate on this if needed).

Test Plan:
make -j32 check
./db_stress --memtablerep=vector
./db_stress --memtablerep=unsorted
./db_stress --memtablerep=prefixhash --prefix_size=10

Reviewers: dhruba, haobo, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12117
2013-08-22 23:10:02 -07:00
Tyler Harter a8f47a4006 Add options to dump.
Summary: added options to Dump() I missed in D12027.  I also ran a script to look for other missing options and found a couple which I added.  Should we also print anything for "PrepareForBulkLoad", "memtable_factory", and "statistics"?  Or should we leave those alone since it's not easy to print useful info for those?

Test Plan: run anything and look at LOG file to make sure these are printed now.

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12219
2013-08-14 09:06:10 -07:00
Tyler Harter f5f1842282 Prefix filters for scans (v4)
Summary: Similar to v2 (db and table code understands prefixes), but use ReadOptions as in v3.  Also, make the CreateFilter code faster and cleaner.

Test Plan: make db_test; export LEVELDB_TESTS=PrefixScan; ./db_test

Reviewers: dhruba

Reviewed By: dhruba

CC: haobo, emayanke

Differential Revision: https://reviews.facebook.net/D12027
2013-08-13 14:04:56 -07:00
sumeet 3b81df34bd Separate compaction filter for each compaction
Summary:
If we have same compaction filter for each compaction,
application cannot know about the different compaction processes.
Later on, we can put in more details in compaction filter for the
application to consume and use it according to its needs. For e.g. In
the universal compaction, we have a compaction process involving all the
files while others don't involve all the files. Applications may want to
collect some stats only when during full compaction.

Test Plan: run existing unit tests

Reviewers: haobo, dhruba

Reviewed By: dhruba

CC: xinyaohu, leveldb

Differential Revision: https://reviews.facebook.net/D12057
2013-08-13 10:56:20 -07:00
Dhruba Borthakur 03bd4461ad Merge branch 'performance' of github.com:facebook/rocksdb into performance 2013-08-12 10:31:21 -07:00
Dhruba Borthakur f3967a5132 Merge remote-tracking branch 'origin' into performance 2013-08-12 09:58:50 -07:00
Haobo Xu 3a3b1c3e6c [RocksDB] Improve manifest dump to print internal keys in hex for version edits.
Summary: Currently, VersionEdit::DebugString always display internal keys in the original ascii format. This could cause manifest dump to be truncated if internal keys contain special charactors (like null). Also added an option --input_key_hex for ldb idump to indicate that the passed in user keys are in hex.

Test Plan: run ldb manifest_dump

Reviewers: dhruba, emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12111
2013-08-08 16:19:01 -07:00
Xing Jin 17b8f786a3 Fix unit tests/bugs for universal compaction (first step)
Summary:
This is the first step to fix unit tests and bugs for universal
compactiion. I added universal compaction option to ChangeOptions(), and
fixed all unit tests calling ChangeOptions(). Some of these tests
obviously assume more than 1 level and check file number/values in level
1 or above levels. I set kSkipUniversalCompaction for these tests.

The major bug I found is manual compaction with universal compaction never stops. I have put a fix for
it.

I have also set universal compaction as the default compaction and found
at least 20+ unit tests failing. I haven't looked into the details. The
next step is to check all unit tests without calling ChangeOptions().

Test Plan: make all check

Reviewers: dhruba, haobo

Differential Revision: https://reviews.facebook.net/D12051
2013-08-07 14:05:44 -07:00
Dhruba Borthakur f5fa26b6a9 Merge branch 'performance' of github.com:facebook/rocksdb into performance
Conflicts:
	db/builder.cc
	db/db_impl.cc
	db/version_set.cc
	include/leveldb/statistics.h
2013-08-07 11:58:06 -07:00
Mayank Agarwal 1d7b4765c3 Expose base db object from ttl wrapper
Summary: rocksdb replicaiton will need this when writing value+TS from master to slave 'as is'

Test Plan: make

Reviewers: dhruba, vamsi, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11919
2013-08-05 18:44:14 -07:00
Jim Paton 1036537c94 Add soft and hard rate limit support
Summary:
This diff adds support for both soft and hard rate limiting. The following changes are included:

1) Options.rate_limit is renamed to Options.hard_rate_limit.
2) Options.rate_limit_delay_milliseconds is renamed to Options.rate_limit_delay_max_milliseconds.
3) Options.soft_rate_limit is added.
4) If the maximum compaction score is > hard_rate_limit and rate_limit_delay_max_milliseconds == 0, then writes are delayed by 1 ms at a time until the max compaction score falls below hard_rate_limit.
5) If the max compaction score is > soft_rate_limit but <= hard_rate_limit, then writes are delayed by 0-1 ms depending on how close we are to hard_rate_limit.
6) Users can disable 4 by setting hard_rate_limit = 0. They can add a limit to the maximum amount of time waited by setting rate_limit_delay_max_milliseconds > 0. Thus, the old behavior can be preserved by setting soft_rate_limit = 0, which is the default.

Test Plan:
make -j32 check
./db_stress

Reviewers: dhruba, haobo, MarkCallaghan

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12003
2013-08-05 15:43:49 -07:00
Dhruba Borthakur 711a30cb30 Merge branch 'master' into performance
Conflicts:
	include/leveldb/options.h
	include/leveldb/statistics.h
	util/options.cc
2013-08-02 10:22:08 -07:00
Xing Jin 0f0a24e298 Make arena block size configurable
Summary:
Add an option for arena block size, default value 4096 bytes. Arena will allocate blocks with such size.

I am not sure about passing parameter to skiplist in the new virtualized framework, though I talked to Jim a bit. So add Jim as reviewer.

Test Plan:
new unit test, I am running db_test.

For passing paramter from configured option to Arena, I tried tests like:

  TEST(DBTest, Arena_Option) {
  std::string dbname = test::TmpDir() + "/db_arena_option_test";
  DestroyDB(dbname, Options());

  DB* db = nullptr;
  Options opts;
  opts.create_if_missing = true;
  opts.arena_block_size = 1000000; // tested 99, 999999
  Status s = DB::Open(opts, dbname, &db);
  db->Put(WriteOptions(), "a", "123");
  }

and printed some debug info. The results look good. Any suggestion for such a unit-test?

Reviewers: haobo, dhruba, emayanke, jpaton

Reviewed By: dhruba

CC: leveldb, zshao

Differential Revision: https://reviews.facebook.net/D11799
2013-07-31 12:42:23 -07:00
Jim Paton 52d7ecfc78 Virtualize SkipList Interface
Summary: This diff virtualizes the skiplist interface so that users can provide their own implementation of a backing store for MemTables. Eventually, the backing store will be responsible for its own synchronization, allowing users (and us) to experiment with different lockless implementations.

Test Plan:
make clean
make -j32 check
./db_stress

Reviewers: dhruba, emayanke, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11739
2013-07-23 14:42:27 -07:00
Mayank Agarwal bf66c10b13 Use KeyMayExist for WriteBatch-Deletes
Summary:
Introduced KeyMayExist checking during writebatch-delete and removed from Outer Delete API because it uses writebatch-delete.
Added code to skip getting Table from disk if not already present in table_cache.
Some renaming of variables.
Introduced KeyMayExistImpl which allows checking since specified sequence number in GetImpl useful to check partially written writebatch.
Changed KeyMayExist to not be pure virtual and provided a default implementation.
Expanded unit-tests in db_test to check appropriately.
Ran db_stress for 1 hour with ./db_stress --max_key=100000 --ops_per_thread=10000000 --delpercent=50 --filter_deletes=1 --statistics=1.

Test Plan: db_stress;make check

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb, xjin

Differential Revision: https://reviews.facebook.net/D11745
2013-07-23 13:36:50 -07:00
Dhruba Borthakur 9357a53a7d Fix merge problems with options.
Summary:
Fix merge problems with options.

Test Plan:

Reviewers:

CC:

Task ID: #

Blame Rev:
2013-07-17 15:08:56 -07:00
Dhruba Borthakur 4a745a5666 Merge branch 'master' into performance
Conflicts:
	db/version_set.cc
	include/leveldb/options.h
	util/options.cc
2013-07-17 15:05:57 -07:00
Mayank Agarwal 2a986919d6 Make rocksdb-deletes faster using bloom filter
Summary:
Wrote a new function in db_impl.c-CheckKeyMayExist that calls Get but with a new parameter turned on which makes Get return false only if bloom filters can guarantee that key is not in database. Delete calls this function and if the option- deletes_use_filter is turned on and CheckKeyMayExist returns false, the delete will be dropped saving:
1. Put of delete type
2. Space in the db,and
3. Compaction time

Test Plan:
make all check;
will run db_stress and db_bench and enhance unit-test once the basic design gets approved

Reviewers: dhruba, haobo, vamsi

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11607
2013-07-11 12:11:11 -07:00
Haobo Xu a8d5f8dde2 [RocksDB] Remove old readahead options
Summary: As title.

Test Plan: make check; db_bench

Reviewers: dhruba, MarkCallaghan

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11643
2013-07-09 11:22:33 -07:00
Dhruba Borthakur 116ec527f2 Renamed 'hybrid_compaction' tp be "Universal Compaction'.
Summary:
All the universal compaction parameters are encapsulated in
a new file universal_compaction.h

Test Plan:
make check
2013-07-03 15:47:53 -07:00
Haobo Xu 92ca816a60 [RocksDB] Support internal key/value dump for ldb
Summary: This diff added a command 'idump' to ldb tool, which dumps the internal key/value pairs. It could be useful for diagnosis and estimating the per user key 'overhead'. Also cleaned up the ldb code a bit where I touched.

Test Plan: make check; ldb idump

Reviewers: emayanke, sheki, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11517
2013-07-03 10:41:31 -07:00
Dhruba Borthakur 47c4191fe8 Reduce write amplification by merging files in L0 back into L0
Summary:
There is a new option called hybrid_mode which, when switched on,
causes HBase style compactions.  Files from L0 are
compacted back into L0. This meat of this compaction algorithm
is in PickCompactionHybrid().

All files reside in L0. That means all files have overlapping
keys. Each file has a time-bound, i.e. each file contains a
range of keys that were inserted around the same time. The
start-seqno and the end-seqno refers to the timeframe when
these keys were inserted.  Files that have contiguous seqno
are compacted together into a larger file. All files are
ordered from most recent to the oldest.

The current compaction algorithm starts to look for
candidate files starting from the most recent file. It continues to
add more files to the same compaction run as long as the
sum of the files chosen till now is smaller than the next
candidate file size. This logic needs to be debated
and validated.

The above logic should reduce write amplification to a
large extent... will publish numbers shortly.

Test Plan: dbstress runs for 6 hours with no data corruption (tested so far).

Differential Revision: https://reviews.facebook.net/D11289
2013-06-30 20:07:04 -07:00
Mayank Agarwal b858da709a Simplify bucketing logic in ldb-ttl
Summary: [start_time, end_time) is waht I'm following for the buckets and the whole time-range. Also cleaned up some code in db_ttl.* Not correcting the spacing/indenting convention for util/ldb_cmd.cc in this diff.

Test Plan: python ldb_test.py, make ttl_test, Run mcrocksdb-backup tool, Run the ldb tool on 2 mcrocksdb production backups form sigmafio033.prn1

Reviewers: vamsi, haobo

Reviewed By: vamsi

Differential Revision: https://reviews.facebook.net/D11433
2013-06-21 09:49:24 -07:00
Mayank Agarwal 61f1baaedf Introducing timeranged scan, timeranged dump in ldb. Also the ability to count in time-batches during Dump
Summary:
Scan and Dump commands in ldb use iterator. We need to also print timestamp for ttl databases for debugging. For this I create a TtlIterator class pointer in these functions and assign it the value of Iterator pointer which actually points to t TtlIterator object, and access the new function ValueWithTS which can return TS also. Buckets feature for dump command: gives a count of different key-values in the specified time-range distributed across the time-range partitioned according to bucket-size. start_time and end_time are specified in unixtimestamp and bucket in seconds on the user-commandline
Have commented out 3 ines from ldb_test.py so that the test does not break right now. It breaks because timestamp is also printed now and I have to look at wildcards in python to compare properly.

Test Plan: python tools/ldb_test.py

Reviewers: vamsi, dhruba, haobo, sheki

Reviewed By: vamsi

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11403
2013-06-19 18:45:13 -07:00
Haobo Xu 96be2c4ee0 [RocksDB] Add mmap_read option for db_stress
Summary: as title, also removed an incorrect assertion

Test Plan: make check; db_stress --mmap_read=1; db_stress --mmap_read=0

Reviewers: dhruba, emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11367
2013-06-19 10:28:32 -07:00
Abhishek Kona 5ef6bb8c37 [rocksdb][refactor] statistic printing code to one place
Summary: $title

Test Plan: db_bench --statistics=1

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11373
2013-06-18 20:28:41 -07:00
Haobo Xu 3cc1af2062 [RocksDB] Option for incremental sync
Summary: This diff added an option to control the incremenal sync frequency. db_bench has a new flag bytes_per_sync for easy tuning exercise.

Test Plan: make check; db_bench

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11295
2013-06-18 15:00:32 -07:00
Dhruba Borthakur 6acbe0fc45 Compact multiple memtables before flushing to storage.
Summary:
Merge multiple multiple memtables in memory before writing it
out to a file in L0.

There is a new config parameter min_write_buffer_number_to_merge
that specifies the number of write buffers that should be merged
together to a single file in storage. The system will not flush
wrte buffers to storage unless at least these many buffers have
accumulated in memory.
The default value of this new parameter is 1, which means that
a write buffer will be immediately flushed to disk as soon it is
ready.

Test Plan: make check

Differential Revision: https://reviews.facebook.net/D11241
2013-06-18 14:28:04 -07:00
Abhishek Kona 00124683de [rocksdb] do not trim range for level0 in manual compaction
Summary:
https://code.google.com/p/leveldb/issues/detail?can=1&q=178&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary&id=178

Ported the solution as is to RocksDB.

Test Plan: moved the unit test as manual_compaction_test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11331
2013-06-17 13:58:17 -07:00
Abhishek Kona bff718d81c [Rocksdb] Implement filluniquerandom
Summary:
Use a bit set to keep track of which random number is generated.
        Currently only supports single-threaded. All our perf tests are run with threads=1
        Copied over bitset implementation from common/datastructures

Test Plan: printed the generated keys, and verified all keys were present.

Reviewers: MarkCallaghan, haobo, dhruba

Reviewed By: MarkCallaghan

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11247
2013-06-14 16:17:56 -07:00
Haobo Xu 778e179046 [RocksDB] Sync file to disk incrementally
Summary:
During compaction, we sync the output files after they are fully written out. This causes unnecessary blocking of the compaction thread and burstiness of the write traffic.
This diff simply asks the OS to sync data incrementally as they are written, on the background. The hope is that, at the final sync, most of the data are already on disk and we would block less on the sync call. Thus, each compaction runs faster and we could use fewer number of compaction threads to saturate IO.
In addition, the write traffic will be smoothed out, hopefully reducing the IO P99 latency too.

Some quick tests show 10~20% improvement in per thread compaction throughput. Combined with posix advice on compaction read, just 5 threads are enough to almost saturate the udb flash bandwidth for 800 bytes write only benchmark.
What's more promising is that, with saturated IO, iostat shows average wait time is actually smoother and much smaller.
For the write only test 800bytes test:
Before the change:  await  occillate between 10ms and 3ms
After the change: await ranges 1-3ms

Will test against read-modify-write workload too, see if high read latency P99 could be resolved.

Will introduce a parameter to control the sync interval in a follow up diff after cleaning up EnvOptions.

Test Plan: make check; db_bench; db_stress

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11115
2013-06-12 12:53:59 -07:00
Haobo Xu bdf1085944 [RocksDB] cleanup EnvOptions
Summary:
This diff simplifies EnvOptions by treating it as POD, similar to Options.
- virtual functions are removed and member fields are accessed directly.
- StorageOptions is removed.
- Options.allow_readahead and Options.allow_readahead_compactions are deprecated.
- Unused global variables are removed: useOsBuffer, useFsReadAhead, useMmapRead, useMmapWrite

Test Plan: make check; db_stress

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11175
2013-06-12 11:17:19 -07:00
Haobo Xu 043573b24f [RocksDB] Include 64bit random number generator
Summary: As title.

Test Plan: make check;

Reviewers: chip, MarkCallaghan

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11061
2013-06-04 13:52:27 -07:00
Haobo Xu d897d33bf1 [RocksDB] Introduce Fast Mutex option
Summary:
This diff adds an option to specify whether PTHREAD_MUTEX_ADAPTIVE_NP will be enabled for the rocksdb single big kernel lock. db_bench also have this option now.
Quickly tested 8 thread cpu bound 100 byte random read.
No fast mutex: ~750k/s ops
With fast mutex: ~880k/s ops

Test Plan: make check; db_bench; db_stress

Reviewers: dhruba

CC: MarkCallaghan, leveldb

Differential Revision: https://reviews.facebook.net/D11031
2013-06-01 23:11:34 -07:00
Haobo Xu ab8d2f6ab2 [RocksDB] [Performance] Allow different posix advice to be applied to the same table file
Summary:
Current posix advice implementation ties up the access pattern hint with the creation of a file.
It is not possible to apply different advice for different access (random get vs compaction read),
without keeping two open files for the same table. This patch extended the RandomeAccessFile interface
to accept new access hint at anytime. Particularly, we are able to set different access hint on the same
table file based on when/how the file is used.
Two options are added to set the access hint, after the file is first opened and after the file is being
compacted.

Test Plan: make check; db_stress; db_bench

Reviewers: dhruba

Reviewed By: dhruba

CC: MarkCallaghan, leveldb

Differential Revision: https://reviews.facebook.net/D10905
2013-05-30 19:08:44 -07:00
heyongqiang 4c47d8f345 add block deviation option to terminate a block before it exceeds block_size
Summary: a new option block_size_deviation is added.

Test Plan: run db_test and db_bench

Reviewers: dhruba, haobo

Reviewed By: haobo

Differential Revision: https://reviews.facebook.net/D10821
2013-05-24 16:21:52 -07:00
heyongqiang 4b29651206 add block deviation option to terminate a block before it exceeds block_size
Summary: a new option block_size_deviation is added.

Test Plan: run db_test and db_bench

Reviewers: dhruba, haobo

Reviewed By: haobo

Differential Revision: https://reviews.facebook.net/D10821
2013-05-24 15:52:49 -07:00
Haobo Xu 0e879c93de [RocksDB] dump leveldb.stats periodically in LOG file.
Summary:
Added an option stats_dump_period_sec to dump leveldb.stats to LOG periodically for diagnosis.
By defauly, it's set to a very big number 3600 (1 hour).

Test Plan: make check;

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb, zshao

Differential Revision: https://reviews.facebook.net/D10761
2013-05-23 16:56:59 -07:00
Dhruba Borthakur 898f79372f Fix valgrind errors introduced by https://reviews.facebook.net/D10863
Summary:
The valgrind errors were in the unit tests where we change the
number of levels of a database using internal methods.

Test Plan:
valgrind ./reduce_levels_test
valgrind ./db_test

Reviewers: emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10893
2013-05-23 12:14:41 -07:00
Vamsi Ponnekanti 760dd4750f [Kill randomly at various points in source code for testing]
Summary:
This is initial version. A few ways in which this could
be extended in the future are:
(a) Killing from more places in source code
(b) Hashing stack and using that hash in determining whether to crash.
    This is to avoid crashing more often at source lines that are executed
    more often.
(c) Raising exceptions or returning errors instead of killing

Test Plan:
This whole thing is for testing.

Here is part of output:

python2.7 tools/db_crashtest2.py -d 600
Running db_stress

db_stress retncode -15 output LevelDB version     : 1.5
Number of threads   : 32
Ops per thread      : 10000000
Read percentage     : 50
Write-buffer-size   : 4194304
Delete percentage   : 30
Max key             : 1000
Ratio #ops/#keys    : 320000
Num times DB reopens: 0
Batches/snapshots   : 1
Purge redundant %   : 50
Num keys per lock   : 4
Compression         : snappy
------------------------------------------------
No lock creation because test_batches_snapshots set
2013/04/26-17:55:17  Starting database operations
Created bg thread 0x7fc1f07ff700
... finished 60000 ops
Running db_stress

db_stress retncode -15 output LevelDB version     : 1.5
Number of threads   : 32
Ops per thread      : 10000000
Read percentage     : 50
Write-buffer-size   : 4194304
Delete percentage   : 30
Max key             : 1000
Ratio #ops/#keys    : 320000
Num times DB reopens: 0
Batches/snapshots   : 1
Purge redundant %   : 50
Num keys per lock   : 4
Compression         : snappy
------------------------------------------------
Created bg thread 0x7ff0137ff700
No lock creation because test_batches_snapshots set
2013/04/26-17:56:15  Starting database operations
... finished 90000 ops

Revert Plan: OK

Task ID: #2252691

Reviewers: dhruba, emayanke

Reviewed By: emayanke

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D10581
2013-05-21 18:21:49 -07:00
Haobo Xu 87d0af15d8 [RocksDB] Introduce an option to skip log error on recovery
Summary:
Currently, with paranoid_check on, DB::Open will fail on any log read error on recovery.
If client is ok with losing most recent updates, we could simply skip those errors.
However, it's important to introduce an additional flag, so that paranoid_check can
still guard against more serious problems.

Test Plan: make check; db_stress

Reviewers: dhruba, emayanke

Reviewed By: emayanke

CC: leveldb, emayanke

Differential Revision: https://reviews.facebook.net/D10869
2013-05-21 14:30:36 -07:00
Dhruba Borthakur d1aaaf718c Ability to set different size fanout multipliers for every level.
Summary:
There is an existing field Options.max_bytes_for_level_multiplier that
sets the multiplier for the size of each level in the database.

This patch introduces the ability to set different multipliers
for every level in the database. The size of a level is determined
by using both max_bytes_for_level_multiplier as well as the
per-level fanout.

size of level[i] = size of level[i-1] * max_bytes_for_level_multiplier
                   * fanout[i-1]

The default value of fanout is 1, so that it is backward compatible.

Test Plan: make check

Reviewers: haobo, emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10863
2013-05-21 13:50:20 -07:00
Haobo Xu 839f6db77e [RocksDB] Fix PosixLogger and AutoRollLogger thread safety
Summary:
PosixLogger and AutoRollLogger do not seem to be thread safe.
For PosixLogger, log_size_ is not atomically updated.
For AutoRollLogger, the underlying logger_ might be deleted by
one thread while still being accessed by another.

Test Plan: make check

Reviewers: kailiu, dhruba, heyongqiang

Reviewed By: kailiu

CC: leveldb, zshao, sheki

Differential Revision: https://reviews.facebook.net/D9699
2013-05-21 11:39:44 -07:00
Abhishek Kona e1174306c5 [RocksDB] Simplify StopWatch implementation
Summary:
Make stop watch a simple implementation, instead of subclass of a virtual class
Allocate stop watches off the stack instead of heap.
Code is more terse now.

Test Plan: make all check, db_bench with --statistics=1

Reviewers: haobo, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10809
2013-05-17 10:55:34 -07:00
Abhishek Kona 446151cd20 [Rocksdb] Remove unused double apis to record into histograms
Summary: Statistics.h and histogram.h had double based api's to record values. Remove them as they are not used anywhere

Test Plan: make all check

Reviewers: haobo, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10815
2013-05-16 10:40:30 -07:00
Mayank Agarwal 8a48410f09 Enhance the ldb tool to support ttl databases
Summary: ldb works with raw data from the database and needs to be aware of ttl-database to work with it meaningfully. '-ttl' option now tells it that. Also added onto the ldb_test.py test. This option may be specified alongwith put, get, scan or dump. There is no support to provide a ttl-value and it uses default forever because there is no use-case for this currently.

Test Plan: make ldb_test; python tools/ldb_test.py

Reviewers: dhruba, sheki, haobo, vamsi

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10797
2013-05-15 12:10:00 -07:00
Haobo Xu 4ca3c67bd3 [RocksDB] Cleanup compaction filter to use a class interface, instead of function pointer and additional context pointer.
Summary:
This diff replaces compaction_filter_args and CompactionFilter with a single compaction_filter parameter. It gives CompactionFilter better encapsulation and a similar look to Comparator and MergeOpertor, which improves consistency of the overall interface.
The change is not backward compatible. Nevertheless, the two references in fbcode are not in production yet.

Test Plan: make check

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb, zshao

Differential Revision: https://reviews.facebook.net/D10773
2013-05-13 14:06:10 -07:00
Haobo Xu 05e8854085 [Rocksdb] Support Merge operation in rocksdb
Summary:
This diff introduces a new Merge operation into rocksdb.
The purpose of this review is mostly getting feedback from the team (everyone please) on the design.

Please focus on the four files under include/leveldb/, as they spell the client visible interface change.
include/leveldb/db.h
include/leveldb/merge_operator.h
include/leveldb/options.h
include/leveldb/write_batch.h

Please go over local/my_test.cc carefully, as it is a concerete use case.

Please also review the impelmentation files to see if the straw man implementation makes sense.

Note that, the diff does pass all make check and truly supports forward iterator over db and a version
of Get that's based on iterator.

Future work:
- Integration with compaction
- A raw Get implementation

I am working on a wiki that explains the design and implementation choices, but coding comes
just naturally and I think it might be a good idea to share the code earlier. The code is
heavily commented.

Test Plan: run all local tests

Reviewers: dhruba, heyongqiang

Reviewed By: dhruba

CC: leveldb, zshao, sheki, emayanke, MarkCallaghan

Differential Revision: https://reviews.facebook.net/D9651
2013-05-03 16:59:02 -07:00
Kai Liu 958b9c80e1 Avoid global static initialization in Env::Default()
Summary:
Mark's task description from #2316777

Env::Default() comes from util/env_posix.cc

This is a static global.

static PosixEnv default_env;

Env* Env::Default() {
  return &default_env;
}

-----

These globals assume default_env was initialized first. I don't think that is safe or correct to do (http://stackoverflow.com/questions/1005685/c-static-initialization-order)

const string AutoRollLoggerTest::kTestDir(
test::TmpDir() + "/db_log_test");
const string AutoRollLoggerTest::kLogFile(
test::TmpDir() + "/db_log_test/LOG");
Env* AutoRollLoggerTest::env = Env::Default();

Test Plan:
run make clean && make && make check
But how can I know if it works in Ubuntu?

Reviewers: MarkCallaghan, chip

Reviewed By: chip

CC: leveldb, dhruba, haobo

Differential Revision: https://reviews.facebook.net/D10491
2013-04-22 18:10:28 -07:00
Dhruba Borthakur 3cb7bf8170 Initialize parameters in the constructor.
Summary:
RocksDB doesn't build on Ubuntu VM .. shoudl be fixed with this patch.

g++ --version
g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3

util/env_posix.cc:68:24: sorry, unimplemented: non-static data member initializers
util/env_posix.cc:68:24: error: ISO C++ forbids in-class initialization of non-const static member ‘use_os_buffer’
util/env_posix.cc:113:24: sorry, unimplemented: non-static data member initializers
util/env_posix.cc:113:24: error: ISO C++ forbids in-class initialization of non-const static member ‘use_os_buffer

Test Plan: make check

Reviewers: sheki, leveldb

Reviewed By: sheki

Differential Revision: https://reviews.facebook.net/D10461
2013-04-22 14:41:45 -07:00
Mark Callaghan b1ff9ac9c5 Add --writes_per_second rate limit, print p99.99 in histogram
Summary:
Adds the --writes_per_second rate limit for the readwhilewriting test.
The purpose is to optionally avoid saturating storage with writes & compaction
and test read response time when some writes are being done.

Changes the histogram code to also print the p99.99 value

Task ID: #

Blame Rev:

Test Plan:
make check, ran db_bench with it

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10305
2013-04-20 10:26:51 -07:00
Haobo Xu e0b60923ee [RocksDB] fix build
Summary: forgot to include signal_test.cc

Test Plan: make check

Reviewers: sheki

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10281
2013-04-20 10:26:51 -07:00
Haobo Xu 1255dcd446 [RocksDB] Add stacktrace signal handler
Summary:
This diff provides the ability to print out a stacktrace when the process receives certain signals.
Currently, we enable this for the following signals (program error related):
SIGILL SIGSEGV SIGBUS SIGABRT
Application simply #include "util/stack_trace.h" and call leveldb::InstallStackTraceHandler() during initialization, if signal handler is needed. It's not done automatically when openning db, because it's the application(process)'s responsibility to install signal handler and some applications might already have their own (like fbcode).

Sample output:
Received signal 11 (Segmentation fault)
#0  0x408ff0 ./signal_test() [0x408ff0] /home/haobo/rocksdb/util/signal_test.cc:4
#1  0x40827d ./signal_test() [0x40827d] /home/haobo/rocksdb/util/signal_test.cc:24
#2  0x7f8bb183172e /usr/local/fbcode/gcc-4.7.1-glibc-2.14.1/lib/libc.so.6(__libc_start_main+0x10e) [0x7f8bb183172e] ??:0
#3  0x408ebc ./signal_test() [0x408ebc] /home/engshare/third-party/src/glibc/glibc-2.14.1/glibc-2.14.1/csu/../sysdeps/x86_64/elf/start.S:113
Segmentation fault (core dumped)

For each frame, we print the raw pointer, the symbol provided by backtrace_symbols (still not good enough), and the source file/line. Note that address translation is done by directly shell out to addr2line. ??:0 means addr2line fails to do the translation. Hacky, but I think it's good for now.

Test Plan: signal_test.cc

Reviewers: dhruba, MarkCallaghan

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10173
2013-04-20 10:26:50 -07:00
Haobo Xu a29fc171a6 [RocksDB] posix_logger does not compile on non-linux platform
Summary: As title. Found out this when testing stack_trace.cc portability.

Test Plan: make check; manual test 'non-linux' build by forcing OS_LINUX2

Reviewers: dhruba, heyongqiang

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10263
2013-04-15 19:18:51 -07:00
Abhishek Kona dae7379050 [RocksDB] Expose LDB functioanality as a library call - clients can build their own LDB binary with additional options
Summary: Primarily a refactor. Introduced LDBTool interface to which customers can plug in their options and this will create their own version of ldb tool.

Test Plan: made ldb tool and tried it.

Reviewers: dhruba, heyongqiang

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10191
2013-04-11 20:21:49 -07:00
Mayank Agarwal 6594fef7ef Exit and Join the background compaction threads while running rocksdb tests
Summary:
The background compaction threads are never exitted and therefore caused
memory-leaks while running rpcksdb tests. Have changed the PosixEnv destructor to exit and join them and changed the tests likewise
The memory leaked has reduced from 320 bytes to 64 bytes in all the tests. The 64
bytes is relating to
pthread_exit, but still have to figure out why. The stack-trace right now with
table_test.cc = 64 bytes in 1 blocks are possibly lost in loss record 4 of 5
   at 0x475D8C: malloc (jemalloc.c:914)
   by 0x400D69E: _dl_map_object_deps (dl-deps.c:505)
   by 0x4013393: dl_open_worker (dl-open.c:263)
   by 0x400F015: _dl_catch_error (dl-error.c:178)
   by 0x4013B2B: _dl_open (dl-open.c:569)
   by 0x5D3E913: do_dlopen (dl-libc.c:86)
   by 0x400F015: _dl_catch_error (dl-error.c:178)
   by 0x5D3E9D6: __libc_dlopen_mode (dl-libc.c:47)
   by 0x5048BF3: pthread_cancel_init (unwind-forcedunwind.c:53)
   by 0x5048DC9: _Unwind_ForcedUnwind (unwind-forcedunwind.c:126)
   by 0x5046D9F: __pthread_unwind (unwind.c:130)
   by 0x50413A4: pthread_exit (pthreadP.h:289)

Test Plan: make all check

Reviewers: dhruba, sheki, haobo

Reviewed By: dhruba

CC: leveldb, chip

Differential Revision: https://reviews.facebook.net/D9573
2013-04-10 14:50:25 -07:00
heyongqiang e21ba94a69 Set FD_CLOEXEC after each file open
Summary: as subject. This is causing problem in adsconv. Ideally, this flags should be set in open. But that is only supported in Linux kernel ≥2.6.23 and glibc ≥2.7.

Test Plan:
db_test

run db_test

Reviewers: dhruba, MarkCallaghan, haobo

Reviewed By: dhruba

CC: leveldb, chip

Differential Revision: https://reviews.facebook.net/D10089
2013-04-10 14:44:06 -07:00
Mayank Agarwal adb4e4509b Fixing delete in env_posix.cc
Summary: Was deleting incorrectly. Should delete the whole array.

Test Plan: make;valgrind stops complaining about Mismatched free/delete

Reviewers: dhruba, sheki

Reviewed By: sheki

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D10059
2013-04-09 11:49:35 -07:00
Haobo Xu a77bc3d14c [RocksDB] Fix LRUCache Eviction problem
Summary:
1. The stock LRUCache nukes itself whenever the working set (the total number of entries not released by client at a certain time) is bigger than the cache capacity.
See https://our.dev.facebook.com/intern/tasks/?t=2252281
2. There's a bug in shard calculation leading to segmentation fault when only one shard is needed.

Test Plan: make check

Reviewers: dhruba, heyongqiang

Reviewed By: heyongqiang

CC: leveldb, zshao, sheki

Differential Revision: https://reviews.facebook.net/D9927
2013-04-04 11:22:50 -07:00
Haobo Xu d815082159 [RocksDB] env_posix cleanup
Summary:
1. SetBackgroundThreads was not thread safe
2. queue_size_ does not seem necessary
3. moved condition signal after shared state change. Even though the original
   order is in practice ok (because the mutex is still held), it looks fishy
   and non-intuitive.

Test Plan: make check

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb, zshao

Differential Revision: https://reviews.facebook.net/D9825
2013-04-02 11:36:51 -07:00
Abhishek Kona 7fdd5f5b33 Use non-mmapd files for Write-Ahead Files
Summary:
Use non mmapd files for Write-Ahead log.
Earlier use of MMaped files. made the log iterator read ahead and miss records.
Now the reader and writer will point to the same physical location.

There is no perf regression :
./db_bench --benchmarks=fillseq --db=/dev/shm/mmap_test --num=$(million 20) --use_existing_db=0 --threads=2
with This diff :
fillseq      :      10.756 micros/op 185281 ops/sec;   20.5 MB/s
without this dif :
fillseq      :      11.085 micros/op 179676 ops/sec;   19.9 MB/s

Test Plan: unit test included

Reviewers: dhruba, heyongqiang

Reviewed By: heyongqiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9741
2013-03-28 13:13:35 -07:00
Abhishek Kona 63f216ee0a memory manage statistics
Summary:
Earlier Statistics object was a raw pointer. This meant the user had to clear up
the Statistics object after creating the database. In most use cases the database is created in a function and the statistics pointer is out of scope. Hence the statistics object would never be deleted.
Now Using a shared_ptr to manage this.

Want this in before the next release.

Test Plan: make all check.

Reviewers: dhruba, emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9735
2013-03-27 11:27:39 -07:00
Simon Marlow a8bf8fe504 Integrate the manifest_dump command with ldb
Summary:
Syntax:

   manifest_dump [--verbose] --num=<manifest_num>

e.g.

$ ./ldb --db=/home/smarlow/tmp/testdb manifest_dump --num=12
manifest_file_number 13 next_file_number 14 last_sequence 3 log_number
11  prev_log_number 0
--- level 0 --- version# 0 ---
 6:116['a1' @ 1 : 1 .. 'a1' @ 1 : 1]
 10:130['a3' @ 2 : 1 .. 'a4' @ 3 : 1]
--- level 1 --- version# 0 ---
--- level 2 --- version# 0 ---
--- level 3 --- version# 0 ---
--- level 4 --- version# 0 ---
--- level 5 --- version# 0 ---
--- level 6 --- version# 0 ---

Test Plan: - Tested on an example DB (see output in summary)

Reviewers: sheki, dhruba

Reviewed By: sheki

CC: leveldb, heyongqiang

Differential Revision: https://reviews.facebook.net/D9609
2013-03-22 09:17:30 -07:00
Mayank Agarwal 38d54832f7 Initialize variable in constructor for PosixEnv::checkedDiskForMmap_
Summary: This caused compilation problems on some gcc platforms during the third-partyrelease

Test Plan: make

Reviewers: sheki

Reviewed By: sheki

Differential Revision: https://reviews.facebook.net/D9627
2013-03-21 11:26:50 -07:00
Dhruba Borthakur ad96563b79 Ability to configure bufferedio-reads, filesystem-readaheads and mmap-read-write per database.
Summary:
This patch allows an application to specify whether to use bufferedio,
reads-via-mmaps and writes-via-mmaps per database. Earlier, there
was a global static variable that was used to configure this functionality.

The default setting remains the same (and is backward compatible):
 1. use bufferedio
 2. do not use mmaps for reads
 3. use mmap for writes
 4. use readaheads for reads needed for compaction

I also added a parameter to db_bench to be able to explicitly specify
whether to do readaheads for compactions or not.

Test Plan: make check

Reviewers: sheki, heyongqiang, MarkCallaghan

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9429
2013-03-20 23:14:03 -07:00
Mayank Agarwal a6f4275403 Removing boost from ldb_cmd.cc
Summary: Getting rid of boost in our github codebase which caused problems on third-party

Test Plan: make ldb; python tools/ldb_test.py

Reviewers: sheki, dhruba

Reviewed By: sheki

Differential Revision: https://reviews.facebook.net/D9543
2013-03-20 11:19:12 -07:00
Mayank Agarwal 48abc06049 Using return value of fwrite in posix_logger.h
Summary: Was causing error(warning) in third-party saying unused result

Test Plan: make

Reviewers: sheki, dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D9447
2013-03-19 21:33:01 -07:00
Mayank Agarwal 487168cdcf Fixed sign-comparison in rocksdb code-base and fixed Makefile
Summary: Makefile had options to ignore sign-comparisons and unused-parameters, which should be there. Also fixed the specific errors in the code-base

Test Plan: make

Reviewers: chip, dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9531
2013-03-19 14:35:23 -07:00
Mayank Agarwal f04cc368f7 Fixing a careless mistake in ldb
Summary: negation of the condition checked currently had to be checkd actually

Test Plan: make ldb; python ldb_test.py

Reviewers: sheki, dhruba

Reviewed By: sheki

Differential Revision: https://reviews.facebook.net/D9459
2013-03-15 13:59:11 -07:00
Mayank Agarwal a78fb5e8bc Doing away with boost in ldb_cmd.h
Summary: boost functions cause complications while deploying to third-party

Test Plan: make

Reviewers: sheki, dhruba

Reviewed By: sheki

Differential Revision: https://reviews.facebook.net/D9441
2013-03-14 18:16:46 -07:00
Abhishek Kona 1ba5abca97 Use posix_fallocate as default.
Summary:
Ftruncate does not throw an error on disk-full. This causes Sig-bus in
the case where the database tries to issue a Put call on a full-disk.

Use posix_fallocate for allocation instead of truncate.
Add a check to use MMaped files only on ext4, xfs and tempfs, as
posix_fallocate is very slow on ext3 and older.

Test Plan: make all check

Reviewers: dhruba, chip

Reviewed By: dhruba

CC: adsharma, leveldb

Differential Revision: https://reviews.facebook.net/D9291
2013-03-13 13:50:26 -07:00
Mayank Agarwal 5b278b53ae Fix valgrind errors in rocksdb tests: auto_roll_logger_test, reduce_levels_test
Summary: Fix for memory leaks in rocksdb tests. Also modified the variable NUM_FAILED_TESTS to print the actual number of failed tests.

Test Plan: make <test>; valgrind --leak-check=full ./<test>

Reviewers: sheki, dhruba

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9333
2013-03-12 16:03:16 -07:00
Dhruba Borthakur 469724be7f Add appropriate parameters to make bulk-load go faster.
Summary:
1. Create only 2 levels so that manual compactions are fast.
2. Set target file size to a large value

Test Plan: make clean check

Reviewers: kailiu, zshao

Reviewed By: zshao

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9231
2013-03-08 10:52:16 -08:00
Zheng Shao 7b43500794 [RocksDB] Add bulk_load option to Options and ldb
Summary:
Add a shortcut function to make it easier for people
to efficiently bulk_load data into RocksDB.

Test Plan:
Tried ldb with "--bulk_load" and "--bulk_load --compact" and verified the outcome.
Needs to consult the team on how to test this automatically.

Reviewers: sheki, dhruba, emayanke, heyongqiang

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8907
2013-03-05 00:34:53 -08:00
Mark Callaghan 993543d1be Add rate_delay_limit_milliseconds
Summary:
This adds the rate_delay_limit_milliseconds option to make the delay
configurable in MakeRoomForWrite when the max compaction score is too high.
This delay is called the Ln slowdown. This change also counts the Ln slowdown
per level to make it possible to see where the stalls occur.

From IO-bound performance testing, the Level N stalls occur:
* with compression -> at the largest uncompressed level. This makes sense
                      because compaction for compressed levels is much
                      slower. When Lx is uncompressed and Lx+1 is compressed
                      then files pile up at Lx because the (Lx,Lx+1)->Lx+1
                      compaction process is the first to be slowed by
                      compression.
* without compression -> at level 1

Task ID: #1832108

Blame Rev:

Test Plan:
run with real data, added test

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D9045
2013-03-04 07:41:15 -08:00
Dhruba Borthakur 806e264350 Ability for rocksdb to compact when flushing the in-memory memtable to a file in L0.
Summary:
Rocks accumulates recent writes and deletes in the in-memory memtable.
When the memtable is full, it writes the contents on the memtable to
a file in L0.

This patch removes redundant records at the time of the flush. If there
are multiple versions of the same key in the memtable, then only the
most recent one is dumped into the output file. The purging of
redundant records occur only if the most recent snapshot is earlier
than the earliest record in the memtable.

Should we switch on this feature by default or should we keep this feature
turned off in the default settings?

Test Plan: Added test case to db_test.cc

Reviewers: sheki, vamsi, emayanke, heyongqiang

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8991
2013-03-04 00:01:47 -08:00
Abhishek Kona c41f1e995c Codemod NULL to nullptr
Summary:
scripted NULL to nullptr in
* include/leveldb/
* db/
* table/
* util/

Test Plan: make all check

Reviewers: dhruba, emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9003
2013-02-28 18:04:58 -08:00
Abhishek Kona 9bf91c74b8 ldb waldump to print the keys along with other stats + NULL to nullptr in ldb_cmd.cc
Summary: LDB tool to print the deleted/put keys in hex in the wal file.

Test Plan: run ldb on a  db to check if output was satisfactory

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8691
2013-02-20 11:01:37 -08:00
amayank b2c50f1c3f Fix for the weird behaviour encountered by ldb Get where it could read only the second-latest value
Summary:
Changed the Get and Scan options with openForReadOnly mode to have access to the memtable.
Changed the visibility of NewInternalIterator in db_impl from private to protected so that
the derived class db_impl_read_only can call that in its NewIterator function for the
scan case. The previous approach which changed the default for flush_on_destroy_ from false to true
caused many problems in the unit tests due to empty sst files that it created. All
unit tests pass now.

Test Plan: make clean; make all check; ldb put and get and scans

Reviewers: dhruba, heyongqiang, sheki

Reviewed By: dhruba

CC: kosievdmerwe, zshao, dilipj, kailiu

Differential Revision: https://reviews.facebook.net/D8697
2013-02-20 10:45:52 -08:00
Abhishek Kona fe10200ddc Introduce histogram in statistics.h
Summary:
* Introduce is histogram in statistics.h
* stop watch to measure time.
* introduce two timers as a poc.
Replaced NULL with nullptr to fight some lint errors
Should be useful for google.

Test Plan:
ran db_bench and check stats.
make all check

Reviewers: dhruba, heyongqiang

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8637
2013-02-20 10:43:32 -08:00
Kai Liu 45f0030458 Fix the "IO error" in auto_roll_logger_test
Summary:

I missed InitTestDb() in one of my tess. InitTestDb() initializes the test directory, without which the test will throw IO error.

This problem didn't occur before because I've already run the tests before so the test directory is already there.

Test Plan:

Reviewers: dhruba

CC:

Task ID: #

Blame Rev:
2013-02-19 00:13:22 -08:00
amayank f3901e0647 Revert "Fix for the weird behaviour encountered by ldb Get where it could read only the second-latest value"
This reverts commit 4c696ed001.
2013-02-18 22:32:27 -08:00
amayank 4c696ed001 Fix for the weird behaviour encountered by ldb Get where it could read only the second-latest value
Summary:
flush_on_destroy has a default value of false and the memtable is flushed
in the dbimpl-destructor only when that is set to true. Because we want the memtable to be flushed everytime that
the destructor is called(db is closed) and the cases where we work with the memtable only are very less
it is a good idea to give this a default value of true. Thus the put from ldb
wil have its data flushed to disk in the destructor and the next Get will be able to
read it when opened with OpenForReadOnly. The reason that ldb could read the latest value when
the db was opened in the normal Open mode is that the Get from normal Open first reads
the memtable and directly finds the latest value written there and the Get from OpenForReadOnly
doesn't have access to the memtable (which is correct because all its Put/Modify) are disabled

Test Plan: make all; ldb put and get and scans

Reviewers: dhruba, heyongqiang, sheki

Reviewed By: heyongqiang

CC: kosievdmerwe, zshao, dilipj, kailiu

Differential Revision: https://reviews.facebook.net/D8631
2013-02-15 16:56:06 -08:00