rocksdb

mirror of https://github.com/facebook/rocksdb.git synced 2024-11-29 18:33:58 +00:00

Author	SHA1	Message	Date
Kai Liu	ef602f6275	Misc cleanup on performance branch Summary: Did some trivial stuffs: * Add more comments; * fix compiler's warning messages (uninitialized variables). * etc Test Plan: make check	2014-01-17 14:26:29 -08:00
kailiu	1304d8c8ce	Merge branch 'master' into performance Conflicts: Makefile db/db_impl.cc db/db_impl.h db/db_test.cc db/memtable.cc db/memtable.h db/version_edit.h db/version_set.cc include/rocksdb/options.h util/hash_skiplist_rep.cc util/options.cc	2014-01-15 23:12:31 -08:00
kailiu	eae1804f29	Remove the unnecessary use of shared_ptr Summary: shared_ptr is slower than unique_ptr (which literally comes with no performance cost compare with raw pointers). In memtable and memtable rep, we use shared_ptr when we'd actually should use unique_ptr. According to igor's previous work, we are likely to make quite some performance gain from this diff. Test Plan: make check Reviewers: dhruba, igor, sdong, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15213	2014-01-15 18:22:01 -08:00
Igor Canadi	65a8a52b54	Decrease reliance on VersionSet::NumberLevels() Summary: With column families VersionSet will not have a constant number of levels (each CF can have different options), so we'll need to eliminate call to VersionSet::NumberLevels() This diff decreases number of callsites, but we're not there yet. It associates number of levels with Version (each version is associated with single CF) instead of VersionSet. I have also slightly changed how VersionSet keeps track of manifest size. This diff also modifies constructor of Compaction such that it takes input_version and automatically Ref()s it. Before this was done outside of constructor. In next diffs I will continue to decrease number of callsites of VersionSet::NumberLevels() and also references to current_ Test Plan: make check Reviewers: haobo, dhruba, kailiu, sdong Reviewed By: sdong Differential Revision: https://reviews.facebook.net/D15171	2014-01-15 16:15:43 -08:00
Kai Liu	cd535c2280	Optimize MayContainHash() Summary: In latest leaf's, MayContainHash() consistently consumes 5%~7% CPU usage. I checked the code and did an experiment with/without inlining this method. In release mode, with `1024 * 1024 * 256` bits and `1024 * 512` entries, both call 2^30 MayContainHash() with distinctive parameters. As the result showed, this patch reduced the running time from 9.127 sec to 7.891 sec. Test Plan: make check Reviewers: sdong, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15177	2014-01-14 22:03:57 -08:00
Igor Canadi	d9cd7a063f	Fix CompactRange to apply filter to every key Summary: When doing CompactRange(), we should first flush the memtable and then calculate max_level_with_files. Also, we want to compact all the levels that have files, including level `max_level_with_files`. This patch fixed the unit test. Test Plan: Added a failing unit test and a fix, so it's not failing anymore. Reviewers: dhruba, haobo, sdong Reviewed By: haobo CC: leveldb, xjin Differential Revision: https://reviews.facebook.net/D14421	2014-01-14 16:19:09 -08:00
Naman Gupta	8454cfe569	Add read/modify/write functionality to Put() api Summary: The application can set a callback function, which is applied on the previous value. And calculates the new value. This new value can be set, either inplace, if the previous value existed in memtable, and new value is smaller than previous value. Otherwise the new value is added normally. Test Plan: fbmake. Added unit tests. All unit tests pass. Reviewers: dhruba, haobo Reviewed By: haobo CC: sdong, kailiu, xinyaohu, sumeet, leveldb Differential Revision: https://reviews.facebook.net/D14745	2014-01-14 07:55:16 -08:00
Schalk-Willem Kruger	a09ee1069d	Improve RocksDB "get" performance by computing merge result in memtable Summary: Added an option (max_successive_merges) that can be used to specify the maximum number of successive merge operations on a key in the memtable. This can be used to improve performance of the "get" operation. If many successive merge operations are performed on a key, the performance of "get" operations on the key deteriorates, as the value has to be computed for each "get" operation by applying all the successive merge operations. FB Task ID: #3428853 Test Plan: make all check db_bench --benchmarks=readrandommergerandom counter_stress_test Reviewers: haobo, vamsi, dhruba, sdong Reviewed By: haobo CC: zshao Differential Revision: https://reviews.facebook.net/D14991	2014-01-10 17:33:56 -08:00
Siying Dong	5b5ab0c1a8	[Performance Branch] Fix memory leak in HashLinkListRep.GetIterator() Summary: Full list constructed for full iterator can be leaked. This was a bug introduced when I copy the full iterator codes from hash skip list to hash link list. This patch fixes it. Test Plan: Run valgrind test against db_test and make sure the memory leak is fixed Reviewers: kailiu, haobo Reviewed By: kailiu CC: igor, leveldb Differential Revision: https://reviews.facebook.net/D15093	2014-01-10 12:12:28 -08:00
Yancey	afdd2d1a46	fix compile warning	2014-01-10 17:56:35 +08:00
Siying Dong	237a3da677	StopWatch not to get time if it is created for statistics and it is disabled Summary: Currently, even if statistics is not enabled, StopWatch only for the stats still gets the time of the day, which is wasteful. This patch adds a new option to StopWatch to disable this get in this case. Test Plan: make all check Reviewers: dhruba, haobo, igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14703 Conflicts: db/db_impl.cc	2014-01-09 17:39:48 -08:00
Siying Dong	424a524ac9	[Performance Branch] A Hashed Linked List Based Mem Table Summary: Implement a mem table, in which keys are hashed based on prefixes. In each bucket, entries are organized in a sorted linked list. It has the same thread safety guarantee as skip list. The motivation is to optimize memory usage for the case that prefix hashing is primary way of seeking to the entry. Compared to hash skip list implementation, this implementation is more memory efficient, but inside each bucket, search is always linear. The target scenario is that there are only very limited number of records in each hash bucket. Test Plan: Add a test case in db_test Reviewers: haobo, kailiu, dhruba Reviewed By: haobo CC: igor, nkg-, leveldb Differential Revision: https://reviews.facebook.net/D14979	2014-01-09 16:19:11 -08:00
Siying Dong	5575316350	StopWatch not to get time if it is created for statistics and it is disabled Summary: Currently, even if statistics is not enabled, StopWatch only for the stats still gets the time of the day, which is wasteful. This patch adds a new option to StopWatch to disable this get in this case. Test Plan: make all check Reviewers: dhruba, haobo, igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14703	2014-01-08 16:05:36 -08:00
kailiu	12b6d2b839	Separate the aligned and unaligned memory allocation Summary: Use two vectors for different types of memory allocation. Test Plan: run all unit tests. Reviewers: haobo, sdong Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15027	2014-01-08 15:11:42 -08:00
Igor Canadi	17a222670b	Merge branch 'master' into performance	2014-01-07 11:04:21 -08:00
kailiu	7e70ff63d6	Fix issue #57	2014-01-06 11:11:19 -08:00
Kai Liu	774ed89c24	Replace vector with autovector Summary: this diff only replace the cases when we need to frequently create vector with small amount of entries. This diff doesn't aim to improve performance of a specific area, but more like a small scale test for the autovector and see how it works in real life. Test Plan: make check I also ran the performance tests, however there is no performance gain/loss. All performance numbers are pretty much the same before/after the change. Reviewers: dhruba, haobo, sdong, igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14985	2014-01-02 16:43:35 -08:00
kailiu	f1cec73a76	Merge branch 'master' into performance Conflicts: db/db_impl.cc db/db_test.cc db/memtable.cc db/version_set.cc include/rocksdb/statistics.h	2013-12-27 12:23:17 -08:00
Siying Dong	18df47b79a	Avoid malloc in NotFound key status if no message is given. Summary: In some places we have NotFound status created with empty message, but it doesn't avoid a malloc. With this patch, the malloc is avoided for that case. The motivation of it is that I found in db_bench readrandom test when all keys are not existing, about 4% of the total running time is spent on malloc of Status, plus a similar amount of CPU spent on free of them, which is not necessary. Test Plan: make all check Reviewers: dhruba, haobo, igor Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D14691	2013-12-26 16:23:10 -08:00
Kai Liu	b40c052bfa	Fix all the comparison issue in fb dev servers	2013-12-26 16:13:49 -08:00
kailiu	113a08c929	Fix [-Werror=sign-compare] in autovector_test	2013-12-26 15:47:07 -08:00
kailiu	c01676e46d	Implement autovector Summary: A vector that leverages pre-allocated stack-based array to achieve better performance for array with small amount of items. Test Plan: Added tests for both correctness and performance Here is the performance benchmark between vector and autovector Please note that in the test "Creation and Insertion Test", the test case were designed with the motivation described below: * no element inserted: internal array of std::vector may not really get initialize. * one element inserted: internal array of std::vector must have initialized. * kSize elements inserted. This shows the most time we'll spend if we keep everything in stack. * 2 * kSize elements inserted. The internal vector of autovector must have been initialized. Note: kSize is the capacity of autovector ===================================================== Creation and Insertion Test ===================================================== created 100000 vectors: each was inserted with 0 elements total time elapsed: 128000 (ns) created 100000 autovectors: each was inserted with 0 elements total time elapsed: 3641000 (ns) created 100000 VectorWithReserveSizes: each was inserted with 0 elements total time elapsed: 9896000 (ns) ----------------------------------- created 100000 vectors: each was inserted with 1 elements total time elapsed: 11089000 (ns) created 100000 autovectors: each was inserted with 1 elements total time elapsed: 5008000 (ns) created 100000 VectorWithReserveSizes: each was inserted with 1 elements total time elapsed: 24271000 (ns) ----------------------------------- created 100000 vectors: each was inserted with 4 elements total time elapsed: 39369000 (ns) created 100000 autovectors: each was inserted with 4 elements total time elapsed: 10121000 (ns) created 100000 VectorWithReserveSizes: each was inserted with 4 elements total time elapsed: 28473000 (ns) ----------------------------------- created 100000 vectors: each was inserted with 8 elements total time elapsed: 75013000 (ns) created 100000 autovectors: each was inserted with 8 elements total time elapsed: 18237000 (ns) created 100000 VectorWithReserveSizes: each was inserted with 8 elements total time elapsed: 42464000 (ns) ----------------------------------- created 100000 vectors: each was inserted with 16 elements total time elapsed: 102319000 (ns) created 100000 autovectors: each was inserted with 16 elements total time elapsed: 76724000 (ns) created 100000 VectorWithReserveSizes: each was inserted with 16 elements total time elapsed: 68285000 (ns) ----------------------------------- ===================================================== Sequence Access Test ===================================================== performed 100000 sequence access against vector size: 4 total time elapsed: 198000 (ns) performed 100000 sequence access against autovector size: 4 total time elapsed: 306000 (ns) ----------------------------------- performed 100000 sequence access against vector size: 8 total time elapsed: 565000 (ns) performed 100000 sequence access against autovector size: 8 total time elapsed: 512000 (ns) ----------------------------------- performed 100000 sequence access against vector size: 16 total time elapsed: 1076000 (ns) performed 100000 sequence access against autovector size: 16 total time elapsed: 1070000 (ns) ----------------------------------- Reviewers: dhruba, haobo, sdong, chip Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14655	2013-12-26 15:03:47 -08:00
Kai Liu	5643ae1a3f	Merge pull request #32 from jamesgolick/master Only try to use fallocate if it's actually present on the system.	2013-12-26 14:20:22 -08:00
Siying Dong	abaf26266d	[RocksDB] [Performance Branch] Some Changes to PlainTable format Summary: Some changes to PlainTable format: (1) support variable key length (2) use user defined slice transformer to extract prefixes (3) Run some test cases against PlainTable in db_test and table_test Test Plan: test db_test Reviewers: haobo, kailiu CC: dhruba, igor, leveldb, nkg- Differential Revision: https://reviews.facebook.net/D14457	2013-12-20 12:08:35 -08:00
kailiu	5f5e5fc2e9	Revert `atomic_size_t usage` Summary: By disassemble the function, we found that the atomic variables do invoke the `lock` that locks the memory bus. As a tradeoff, we protect the GetUsage by mutex and leave usage_ as plain size_t. Test Plan: passed `cache_test` Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14667	2013-12-13 17:03:19 -08:00
Haobo Xu	5090316f0d	[RocksDB] [Performance Branch] Trivia build fix Summary: make release complains signed unsigned comparison. Test Plan: make release Reviewers: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D14661	2013-12-13 14:21:59 -08:00
kailiu	b660e2d468	Expose usage info for the cache Summary: This diff will help us to figure out the memory usage for the cache part. Test Plan: added a new memory usage test for cache Reviewers: haobo, sdong, dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14559	2013-12-13 12:53:45 -08:00
James Golick	c28dd2a891	oops - missed a spot	2013-12-11 11:18:00 -08:00
Igor Canadi	e8d40c31b3	[RocksDB perf] Cache speedup Summary: I have ran a get benchmark where all the data is in the cache and observed that most of the time is spent on waiting for lock in LRUCache. This is an effort to optimize LRUCache. Test Plan: The data was loaded with fillseq. Then, I ran a benchmark: /db_bench --db=/tmp/rocksdb_stat_bench --num=1000000 --benchmarks=readrandom --statistics=1 --use_existing_db=1 --threads=16 --disable_seek_compaction=1 --cache_size=20000000000 --cache_numshardbits=8 --table_cache_numshardbits=8 I ran the benchmark three times. Here are the results: AFTER THE PATCH: 798072, 803998, 811807 BEFORE THE PATCH: 782008, 815593, 763017 Reviewers: dhruba, haobo, kailiu Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D14571	2013-12-11 08:33:29 -08:00
Haobo Xu	3c02c363b3	[RocksDB] [Performance Branch] Added dynamic bloom, to be used for memable non-existing key filtering Summary: as title Test Plan: dynamic_bloom_test Reviewers: dhruba, sdong, kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D14385	2013-12-11 00:15:14 -08:00
James Golick	43c386b72e	only try to use fallocate if it's actually present on the system	2013-12-10 22:34:19 -08:00
kailiu	c79e595471	Make Cache::GetCapacity constant Summary: This will allow us to access constant via `DB::GetOptions().table_cache.GetCapacity()` or `DB::GetOptions().block_cache.GetCapacity()` since GetOptions() is also constant method.	2013-12-10 17:34:35 -08:00
Igor Canadi	19f5463d3f	Don't LogFlush() in foreground threads Summary: So fflush() takes a lock which is heavyweight. I added flush_pending_, but more importantly, I removed LogFlush() from foreground threads. Test Plan: ./db_test Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14535	2013-12-10 10:57:46 -08:00
Igor Canadi	fb9fce4fc3	[RocksDB] BackupableDB Summary: In this diff I present you BackupableDB v1. You can easily use it to backup your DB and it will do incremental snapshots for you. Let's first describe how you would use BackupableDB. It's inheriting StackableDB interface so you can easily construct it with your DB object -- it will add a method RollTheSnapshot() to the DB object. When you call RollTheSnapshot(), current snapshot of the DB will be stored in the backup dir. To restore, you can just call RestoreDBFromBackup() on a BackupableDB (which is a static method) and it will restore all files from the backup dir. In the next version, it will even support automatic backuping every X minutes. There are multiple things you can configure: 1. backup_env and db_env can be different, which is awesome because then you can easily backup to HDFS or wherever you feel like. 2. sync - if true, it guarantees backup consistency on machine reboot 3. number of snapshots to keep - this will keep last N snapshots around if you want, for some reason, be able to restore from an earlier snapshot. All the backuping is done in incremental fashion - if we already have 00010.sst, we will not copy it again. IMPORTANT -- This is based on assumption that 00010.sst never changes - two files named 00010.sst from the same DB will always be exactly the same. Is this true? I always copy manifest, current and log files. 4. You can decide if you want to flush the memtables before you backup, or you're fine with backing up the log files -- either way, you get a complete and consistent view of the database at a time of backup. 5. More things you can find in BackupableDBOptions Here is the directory structure I use: backup_dir/CURRENT_SNAPSHOT - just 4 bytes holding the latest snapshot 0, 1, 2, ... - files containing serialized version of each snapshot - containing a list of files files/*.sst - sst files shared between snapshots - if one snapshot references 00010.sst and another one needs to backup it from the DB, it will just reference the same file files/ 0/, 1/, 2/, ... - snapshot directories containing private snapshot files - current, manifest and log files All the files are ref counted and deleted immediatelly when they get out of scope. Some other stuff in this diff: 1. Added GetEnv() method to the DB. Discussed with @haobo and we agreed that it seems right thing to do. 2. Fixed StackableDB interface. The way it was set up before, I was not able to implement BackupableDB. Test Plan: I have a unittest, but please don't look at this yet. I just hacked it up to help me with debugging. I will write a lot of good tests and update the diff. Also, `make asan_check` Reviewers: dhruba, haobo, emayanke Reviewed By: dhruba CC: leveldb, haobo Differential Revision: https://reviews.facebook.net/D14295	2013-12-09 14:06:52 -08:00
Igor Canadi	9644e0e0c7	Print stack trace on assertion failure Summary: This will help me a lot! When we hit an assertion in unittest, we get the whole stack trace now. Also, changed stack trace a bit, we now include actual demangled C++ class::function symbols! Test Plan: Added ASSERT_TRUE(false) to a test, observed a stack trace Reviewers: haobo, dhruba, kailiu Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D14499	2013-12-06 17:11:09 -08:00
kailiu	c7707f24c2	Refine the statistics	2013-12-06 16:51:35 -08:00
kailiu	551e9428ce	Merge branch 'master' into performance	2013-12-06 14:15:42 -08:00
kailiu	e1d92dfd2e	Fix a bunch of mac compilation issues in performance branch	2013-12-04 23:00:33 -08:00
Vamsi Ponnekanti	fa88cbc71e	[Log dumper broken when merge operator is in log] Summary: $title Test Plan: on my dev box Revert Plan: OK Task ID: # Reviewers: emayanke, dhruba, haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D14451	2013-12-04 16:22:54 -08:00
Mark Callaghan	97aa401e2f	Add compression options to db_bench Summary: This adds 2 options for compression to db_bench: * universal_compression_size_percent * compression_level - to set zlib compression level It also logs compression_size_percent at startup in LOG Task ID: # Blame Rev: Test Plan: make check, run db_bench Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14439	2013-12-03 14:28:48 -08:00
Igor Canadi	eb12e47e0e	Killing Transform Rep Summary: Let's get rid of TransformRep and it's children. We have confirmed that HashSkipListRep works better with multifeed, so there is no benefit to keeping this around. This diff is mostly just deleting references to obsoleted functions. I also have a diff for fbcode that we'll need to push when we switch to new release. I had to expose HashSkipListRepFactory in the client header files because db_impl.cc needs access to GetTransform() function for SanitizeOptions. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14397	2013-12-03 12:42:15 -08:00
lovro	930cb0b9ee	Clarify CompactionFilter thread safety requirements Summary: Documenting our discussion Test Plan: make Reviewers: dhruba, haobo Reviewed By: dhruba CC: igor Differential Revision: https://reviews.facebook.net/D14403	2013-12-02 16:41:43 -08:00
lovro	45a2f2d8d3	Fix build without glibc Summary: The preprocessor does not follow normal rules of && evaluation, tries to evaluate __GLIBC_PREREQ(2, 12) even though the defined() check fails. This breaks the build if __GLIBC_PREREQ is absent. Test Plan: Try adding #undef __GLIBC_PREREQ above the offending line, build no longer breaks Reviewed By: igor Blame Rev: `4c81383628`	2013-12-01 11:32:54 -08:00
Kai Liu	1966b63137	Merge branch 'master' into perf	2013-11-27 11:47:40 -08:00
lovro	4c81383628	Set background thread name with pthread_setname_np() Summary: Makes it easier to monitor performance with top Test Plan: ./manual_compaction_test with `top -H` running. Previously was two `manual_compacti`, now one shows `rocksdb:bg0`. Reviewers: igor, dhruba Reviewed By: igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14367	2013-11-27 11:28:06 -08:00
Haobo Xu	4e6463ea44	[RocksDB][Performance Branch] Make height and branching factor configurable for skiplist implementation Summary: As title. Especially, HashSkipListRepFactory will be able to specify a relatively small height, to reduce the memory overhead of one skiplist per bucket. Test Plan: make check and test it on leaf4 Reviewers: dhruba, sdong, kailiu CC: reconnect.grayhat, leveldb Differential Revision: https://reviews.facebook.net/D14307	2013-11-26 21:59:36 -08:00
Siying Dong	8aac46d686	[RocksDB Performance Branch] Fix a regression bug of munmap Summary: Fix a stupid bug I just introduced in `b59d4d5a50`, which I didn't even mean to include. GCC might remove the munmap. Test Plan: Run it and make sure munmap succeeds Reviewers: haobo, kailiu Reviewed By: kailiu CC: dhruba, reconnect.grayhat, leveldb Differential Revision: https://reviews.facebook.net/D14361	2013-11-26 14:05:37 -08:00
Haobo Xu	5b825d6964	[RocksDB] Use raw pointer instead of shared pointer when passing Statistics object internally Summary: liveness of the statistics object is already ensured by the shared pointer in DB options. There's no reason to pass again shared pointer among internal functions. Raw pointer is sufficient and efficient. Test Plan: make check Reviewers: dhruba, MarkCallaghan, igor Reviewed By: dhruba CC: leveldb, reconnect.grayhat Differential Revision: https://reviews.facebook.net/D14289	2013-11-25 10:38:15 -08:00
Siying Dong	3e35aa6412	Revert "Allow users to profile a query and see bottleneck of the query" This reverts commit `3d8ac31d71`.	2013-11-21 17:40:39 -08:00
Siying Dong	b135d01e7b	Allow users to profile a query and see bottleneck of the query Summary: Provide a framework to profile a query in detail to figure out latency bottleneck. Currently, in Get(), Put() and iterators, 2-3 simple timing is used. We can easily add more profile counters to the framework later. Test Plan: Enable this profiling in seveal existing tests. Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14001 Conflicts: table/merger.cc	2013-11-21 17:39:19 -08:00

1 2 3 4 5 ...

321 commits