rocksdb

Commit Graph

Author	SHA1	Message	Date
Feng Zhu	8f09d53fd1	remove malloc when create data and index iterator in Get Summary: Define Block::Iter to be an independent class to be used by block_based_table_reader When creating data and index iterator, update an existing iterator rather than new one Thus malloc and free could be reduced Benchmark, Base: commit `76286ee67e` commands: --db=/dev/shm/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --write_buffer_size=134217728 --max_write_buffer_number=2 --target_file_size_base=33554432 --max_bytes_for_level_base=1073741824 --verify_checksum=false --max_background_compactions=4 --use_plain_table=0 --memtablerep=prefix_hash --open_files=-1 --mmap_read=1 --mmap_write=0 --bloom_bits=10 --bloom_locality=1 --memtable_bloom_bits=500000 --compression_type=lz4 --num=2621440 --use_hash_search=1 --block_size=1024 --block_restart_interval=1 --use_existing_db=1 --threads=1 --benchmarks=readrandom —disable_auto_compactions=1 malloc: 3.30% -> 1.42% free: 3.59%->1.61% Test Plan: make all check run db_stress valgrind ./db_test ./table_test Reviewers: ljin, yhchiang, dhruba, igor, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D20655	2014-07-30 16:34:35 -07:00
Radheshyam Balasundaram	91c01485d1	Minor changes to CuckooTableBuilder Summary: - Copy the key and value to in-memory hash table during Add operation. Also modified cuckoo_table_reader_test to use this. - Store only the user_key in in-memory hash table if it is last level file. - Handle Carryover while chosing unused key in Finish() method in case unused key was never found before Finish() call. Test Plan: cuckoo_table_reader_test --enable_perf cuckoo_table_builder_test valgrind_check asan_check Reviewers: sdong, yhchiang, igor, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D20715	2014-07-28 17:14:25 -07:00
Lei Jin	40fa8a4cd5	make statistics forward-able Summary: Make StatisticsImpl being able to forward stats to provided statistics implementation. The main purpose is to allow us to collect internal stats in the future even when user supplies custom statistics implementation. It avoids intrumenting 2 sets of stats collection code. One immediate use case is tuning advisor, which needs to collect some internal stats, users may not be interested. Test Plan: ran db_bench and see stats show up at the end of run Will run make all check since some tests rely on statistics Reviewers: yhchiang, sdong, igor Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D20145	2014-07-28 12:05:36 -07:00
sdong	4a8f0c957c	Block::Iter::PrefixSeek() to have an extra check to filter out some false matches Summary: In block based table's hash index checking, when looking for a key that doesn't exist, there is a high chance that a false block is returned because of hash bucket conflicts. In this revision, another check is done to filter out some of those cases: comparing previous key of the block boundary to see whether the target block is what we are looking for. In a favored test setting (bloom filter disabled, 8 L0 files), I saw about 80% improvements. In a non-favored test setting (bloom filter enabled, files are all in L1, files are all cached), I see the performance penalty is less than 3%. Test Plan: make all check Reviewers: haobo, ljin Reviewed By: ljin Subscribers: wuj, leveldb, zagfox, yhchiang Differential Revision: https://reviews.facebook.net/D20595	2014-07-25 17:27:57 -07:00
Radheshyam Balasundaram	62f9b071ff	Implementation of CuckooTableReader Summary: Contains: - Implementation of TableReader based on Cuckoo Hashing - Unittests for CuckooTableReader - Performance test for TableReader Test Plan: make cuckoo_table_reader_test ./cuckoo_table_reader_test make valgrind_check make asan_check Reviewers: yhchiang, sdong, igor, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D20511	2014-07-25 16:37:32 -07:00
Radheshyam Balasundaram	07a7d870b8	Addressing TODOs in CuckooTableBuilder Summary: Contains the following changes in CuckooTableBuilder: - Take an extra parameter in constructor to identify last level file. - Implement a better way to identify if a bucket has been inserted into the tree already during BFS search. - Minor typos Test Plan: make cuckoo_table_builder ./cuckoo_table_builder make valgrind_check Reviewers: sdong, igor, yhchiang, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D20445	2014-07-24 10:07:41 -07:00
Feng Zhu	da9274574f	Use IterKey instead of string in Block::Iter to reduce malloc Summary: Modify a functioin TrimAppend in dbformat.h: IterKey. Write a test for it in dbformat_test Use IterKey in block::Iter to replace std::string to reduce malloc. Evaluate it using perf record. malloc: 4.26% -> 2.91% free: 3.61% -> 3.08% Test Plan: make all check ./valgrind db_test dbformat_test Reviewers: ljin, haobo, yhchiang, dhruba, igor, sdong Reviewed By: sdong Differential Revision: https://reviews.facebook.net/D20433	2014-07-23 12:31:11 -07:00
Igor Canadi	2d3d63597a	Fix signed-unsigned compare error	2014-07-22 15:35:07 -04:00
Radheshyam Balasundaram	f6272e3055	Fixing memory leaks in cuckoo_table_builder_test Summary: Fixes some memory leaks in cuckoo_builder_test.cc. This also fixed broken valgrind_check tests Test Plan: make valgrind_check ./cuckoo_builder_test Currently running make check all. I shall update once it is done. Reviewers: ljin, sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D20385	2014-07-22 09:49:04 -07:00
Yueh-Hsuan Chiang	bbe2e91d00	Fixed a compile error of cuckoo_table_builder. Summary: Fixed the following compile error. ./table/cuckoo_table_builder.h:72:22: error: private field 'key_length_' is not used [-Werror,-Wunused-private-field] const unsigned int key_length_; ^ 1 error generated. Test Plan: make Reviewers: sdong, ljin, radheshyamb, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D20349	2014-07-21 15:17:09 -07:00
Radheshyam Balasundaram	cf3da899b0	Adding a new SST table builder based on Cuckoo Hashing Summary: Cuckoo Hashing based SST table builder. Contains: - Cuckoo Hashing logic and file storage logic. - Unit tests for logic Test Plan: make cuckoo_table_builder_test ./cuckoo_table_builder_test make check all Reviewers: yhchiang, igor, sdong, ljin Reviewed By: ljin Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D19545	2014-07-21 13:26:09 -07:00
Stanislau Hlebik	c1a90b0848	Fix db_bench Summary: Adding check for zero size index Test Plan: ./build_tools/regression_build_test.sh Reviewers: yhchiang, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D20259	2014-07-21 10:31:33 -07:00
Chilledheart	54f4e2f188	Fix clang compiler warnings	2014-07-20 22:57:20 +08:00
Stanislau Hlebik	9d70cce047	Adding option to save PlainTable index and bloom filter in SST file. Summary: Adding option to save PlainTable index and bloom filter in SST file. If there is no bloom block and/or index block, PlainTableReader builds new ones. Otherwise PlainTableReader just use these blocks. Test Plan: make all check Reviewers: sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19527	2014-07-18 16:58:13 -07:00
Stanislau Hlebik	92d73cbe78	Add PlainTableOptions Summary: Since we have a lot of options for PlainTable, add a struct PlainTableOptions to manage them Test Plan: make all check Reviewers: sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D20175	2014-07-18 00:08:38 -07:00
Igor Canadi	1614284eff	Fix compressed cache	2014-07-16 06:45:49 -07:00
Stanislau Hlebik	30c81e7717	Removing NewTotalOrderPlainTableFactory Summary: Seems like NewTotalOrderPlainTableFactory is useless and is semantically incorrect. Total order mode indicator is prefix_extractor == nullptr, but NewTotalOrderPlainTableFactory doesn't set it to be nullptr. That's why some tests in plain_table_db_tests is incorrect. Test Plan: make all check Reviewers: sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19587	2014-07-10 11:32:04 -07:00
Lei Jin	8d9a46fcd1	initialize decoded_internal_key_valid Summary: ReadInternalKey() will assign correct value anyway. Initialize it to true to suppress compiler error reported https://github.com/facebook/rocksdb/issues/186 Test Plan: I cannot reproduce it but this is obvious Reviewers: sdong, yhchiang Reviewed By: yhchiang Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19467	2014-07-03 23:13:08 -07:00
Feng Zhu	5656367416	use arena to allocate memtable's bloomfilter and hashskiplist's buckets_ Summary: Bloomfilter and hashskiplist's buckets_ allocated by memtable's arena DynamicBloom: pass arena via constructor, allocate space in SetTotalBits HashSkipListRep: allocate space of buckets_ using arena. do not delete it in deconstructor because arena would take care of it. Several test files are changed. Test Plan: make all check Reviewers: ljin, haobo, yhchiang, sdong Reviewed By: sdong Subscribers: igor, dhruba Differential Revision: https://reviews.facebook.net/D19335	2014-06-30 15:54:31 -07:00
Igor Canadi	9fe87b17aa	Fix compile	2014-06-20 10:36:48 +02:00
Igor Canadi	d4a8423334	Remove seek compaction Summary: As discussed in our internal group, we don't get much use of seek compaction at the moment, while it's making code more complicated and slower in some cases. This diff removes seek compaction and (hopefully) all code that was introduced to support seek compaction. There is one test case that relied on didIO information. I'll try to find another way to implement it. Test Plan: make check Reviewers: sdong, haobo, yhchiang, ljin, dhruba Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19161	2014-06-20 10:23:02 +02:00
Haobo Xu	7a9dd5f214	[RocksDB] Make block based table hash index more adaptive Summary: Currently, RocksDB returns error if a db written with prefix hash index, is later opened without providing a prefix extractor. This is uncessarily harsh. Without a prefix extractor, we could always fallback to the normal binary index. Test Plan: unit test, also manually veried LOG that fallback did occur. Reviewers: sdong, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19191	2014-06-19 16:40:32 -07:00
Lei Jin	1ec2d1c69d	fix make shared_lib compilation error Summary: s/class ParsedInternalKey/struct ParsedInternalKey Test Plan: make shared_lib Reviewers: igor, yhchiang, sdong, haobo Reviewed By: haobo Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19173	2014-06-19 10:12:26 -07:00
Haobo Xu	167738256f	[RocksDB] Fix unit test Summary: fix a bug in D19047, which caused DBTest.RecoverDuringMemtableCompaction to fail. Test Plan: unit test Reviewers: sdong, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19155	2014-06-19 01:37:21 -07:00
sdong	edd47c5104	PlainTable to encode to avoid to rewrite prefix when it is the same as the previous key Summary: Add a encoding feature of PlainTable to encode PlainTable's keys to save some bytes for the same prefixes. The data format is documented in table/plain_table_factory.h Test Plan: Add unit test coverage in plain_table_db_test Reviewers: yhchiang, igor, dhruba, ljin, haobo Reviewed By: haobo Subscribers: nkg-, leveldb Differential Revision: https://reviews.facebook.net/D18735	2014-06-18 20:41:52 -07:00
Haobo Xu	0f0076ed5a	[RocksDB] Reduce memory footprint of the blockbased table hash index. Summary: Currently, the in-memory hash index of blockbased table uses a precise hash map to track the prefix to block range mapping. In some use cases, especially when prefix itself is big, the memory overhead becomes a problem. This diff introduces a fixed hash bucket array that does not store the prefix and allows prefix collision, which is similar to the plaintable hash index, in order to reduce the memory consumption. Just a quick draft, still testing and refining. Test Plan: unit test and shadow testing Reviewers: dhruba, kailiu, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19047	2014-06-18 18:16:07 -07:00
Igor Canadi	3525aac9e5	Change order of parameters in adaptive table factory Summary: This is minor, but if we put the writing talbe factory as the third parameter, when we add a new table format, we'll have a situation: 1) block based factory 2) plain table factory 3) output factory 4) new format factory I think it makes more sense to have output as the first parameter. Also, fixed a NewAdaptiveTableFactory() call in unit test Test Plan: unit test Reviewers: sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19119	2014-06-18 07:04:37 +02:00
sdong	200e4b4a72	Add a table factory that can read DB with both of PlainTable and BlockBasedTable in it Summary: The new table factory is used if users want to convert a DB from one table format to the other. A user can use this table to open a DB written using one table format and write new files to another table format. Test Plan: add a unit test Reviewers: haobo, igor Reviewed By: igor Subscribers: dhruba, ljin, yhchiang, leveldb Differential Revision: https://reviews.facebook.net/D19017	2014-06-17 11:49:22 -07:00
Lei Jin	c83b085770	prefetch bloom filter data block for L0 files Summary: as title Test Plan: db_bench the initial result is very promising. I will post results of complete runs Reviewers: dhruba, haobo, sdong, igor Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D18867	2014-06-12 10:06:18 -07:00
sdong	88a1691a1e	BlockBasedTable::PrefixMayMatch() to bloom setting to the beginning of the function Summary: In BlockBasedTable::PrefixMayMatch() we calculate prefix even if bloom is not config. Move the check before Test Plan: make all check Reviewers: igor, ljin Reviewed By: ljin Subscribers: wuj, leveldb, haobo, yhchiang, dhruba Differential Revision: https://reviews.facebook.net/D18993	2014-06-10 11:14:22 -07:00
sdong	80f409ea37	Clean PlainTableReader's variables for better data locality Summary: Clean PlainTableReader's data structures: (1) inline bloom_ (in order to do this, change DynamicBloom to allow lazy initialization) (2) remove some variables only used when initialization from the class (3) put variables not used in normal read code paths to the end of the class and reference prefix_extractor directly (4) make Options a reference. Test Plan: make all check Reviewers: haobo, ljin Reviewed By: ljin Subscribers: igor, yhchiang, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D18891	2014-06-09 13:53:39 -07:00
Igor Canadi	f43c8262c2	Don't compress block bigger than 2GB Summary: This is a temporary solution to a issue that we have with compression libraries. See task #4453446. Test Plan: make check doesn't complain :) Reviewers: haobo, ljin, yhchiang, dhruba, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D18975	2014-06-09 12:26:09 -07:00
sdong	df9069d23f	In DB::NewIterator(), try to allocate the whole iterator tree in an arena Summary: In this patch, try to allocate the whole iterator tree starting from DBIter from an arena 1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it. 2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator. 3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it. Limitations: (1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc (2) Two level iterator itself is allocated in arena, but not iterators inside it. Test Plan: make all check Reviewers: ljin, haobo Reviewed By: haobo Subscribers: leveldb, dhruba, yhchiang, igor Differential Revision: https://reviews.facebook.net/D18513	2014-06-02 17:44:57 -07:00
Lei Jin	388d2054c7	forward iterator Summary: Forward iterator puts everything together in a flat structure instead of a hierarchy of nested iterators. this should simplify the code and provide better performance. It also enables more optimization since all information are accessiable in one place. Init evaluation shows about 6% improvement Test Plan: db_test and db_bench Reviewers: dhruba, igor, tnovak, sdong, haobo Reviewed By: haobo Subscribers: sdong, leveldb Differential Revision: https://reviews.facebook.net/D18795	2014-05-30 14:31:55 -07:00
Kai Liu	0b3d03d026	Materialize the hash index Summary: Materialize the hash index to avoid the soaring cpu/flash usage when initializing the database. Test Plan: existing unit tests passed Reviewers: sdong, haobo Reviewed By: sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D18339	2014-05-15 14:09:03 -07:00
Igor Canadi	26f5dd9a5a	TablePropertiesCollectorFactory Summary: This diff addresses task #4296714 and rethinks how users provide us with TablePropertiesCollectors as part of Options. Here's description of task #4296714: I'm debugging #4295529 and noticed that our count of user properties kDeletedKeys is wrong. We're sharing one single InternalKeyPropertiesCollector with all Table Builders. In LOG Files, we're outputting number of kDeletedKeys as connected with a single table, while it's actually the total count of deleted keys since creation of the DB. For example, this table has 3155 entries and 1391828 deleted keys. The problem with current approach that we call methods on a single TablePropertiesCollector for all the tables we create. Even worse, we could do it from multiple threads at the same time and TablePropertiesCollector has no way of knowing which table we're calling it for. Good part: Looks like nobody inside Facebook is using Options::table_properties_collectors. This means we should be able to painfully change the API. In this change, I introduce TablePropertiesCollectorFactory. For every table we create, we call `CreateTablePropertiesCollector`, which creates a TablePropertiesCollector for a single table. We then use it sequentially from a single thread, which means it doesn't have to be thread-safe. Test Plan: Added a test in table_properties_collector_test that fails on master (build two tables, assert that kDeletedKeys count is correct for the second one). Also, all other tests Reviewers: sdong, dhruba, haobo, kailiu Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D18579	2014-05-13 12:30:55 -07:00
Igor Canadi	fec4269966	Fix more gflag namespace issues	2014-05-09 08:41:02 -07:00
sdong	ddd41146c4	MergingIterator uses autovector instead of vector Summary: Use autovector in MergingIterator so that if there are 4 or less child iterators in it, iterator wrappers are inline, which is more likely to be cache friendly. Based on one test run with a shadow traffic of one product, it reduces CPU of MergingIterator::Seek() by half. Test Plan: make all check Reviewers: haobo, yhchiang, igor, dhruba Reviewed By: igor CC: leveldb Differential Revision: https://reviews.facebook.net/D18531	2014-05-08 15:01:20 -07:00
Igor Canadi	b5616dafd1	Fix iOS compile	2014-05-07 17:48:31 -07:00
sdong	3a171dcb51	Pass logger to memtable rep and TLB page allocation error logged to info logs Summary: TLB page allocation errors are now logged to info logs, instead of stderr. In order to do that, mem table rep's factory functions take a info logger now. Test Plan: make all check Reviewers: haobo, igor, yhchiang Reviewed By: yhchiang CC: leveldb, yhchiang, dhruba Differential Revision: https://reviews.facebook.net/D18471	2014-05-05 16:43:37 -07:00
sdong	9b17558311	PlainTableFactory::PlainTableFactory() to have huge TLB turned off by default Summary: PlainTableFactory::PlainTableFactory() now has Huge TLB page feature turned on by default. Although it is not a public API (which we always turn the feature off now), our unit tests, like db_test sometimes uses it directly, which causes wrong coverage of codes. This patch fix it to allow unit tests to run with the correct setting Test Plan: Run db_test and make sure this feature is not on any more. Reviewers: igor, haobo Reviewed By: igor CC: yhchiang, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D18483	2014-05-05 11:05:54 -07:00
sdong	4a7c747064	Revert "Revert "Allow allocating dynamic bloom, plain table indexes and hash linked list from huge page TLB"" And make the default 0 for hash linked list memtable This reverts commit `d69dc64be7`.	2014-05-04 13:56:29 -07:00
Igor Canadi	d69dc64be7	Revert "Allow allocating dynamic bloom, plain table indexes and hash linked list from huge page TLB" This reverts commit `7dafa3a1d7`.	2014-05-04 08:37:09 -07:00
Igor Canadi	d28ed6931f	fix release build	2014-05-01 12:42:06 -07:00
Igor Canadi	d29e48bb2e	fix compile warning	2014-05-01 14:12:35 -04:00
Igor Canadi	0afc8bc29a	xxHash Summary: Originally: https://github.com/facebook/rocksdb/pull/87/files I'm taking over to apply some finishing touches Test Plan: will add tests Reviewers: dhruba, haobo, sdong, yhchiang, ljin Reviewed By: yhchiang CC: leveldb Differential Revision: https://reviews.facebook.net/D18315	2014-05-01 14:09:32 -04:00
sdong	7dafa3a1d7	Allow allocating dynamic bloom, plain table indexes and hash linked list from huge page TLB Summary: Add an option to allocate a piece of memory from huge page TLB. Add options to trigger it in dynamic bloom, plain table indexes andhash linked list hash table. Test Plan: make all check Reviewers: haobo, ljin Reviewed By: haobo CC: nkg-, dhruba, leveldb, igor, yhchiang Differential Revision: https://reviews.facebook.net/D18357	2014-04-30 11:02:26 -07:00
Igor Canadi	c489499a2b	Fix OSX compile	2014-04-26 17:15:43 -04:00
Lei Jin	ccaca59bee	avoid calling FindFile twice in TwoLevelIterator for PlainTable Summary: this is to reclaim the regression introduced in https://reviews.facebook.net/D17853 Test Plan: make all check Reviewers: igor, haobo, sdong, dhruba, yhchiang Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D17985	2014-04-25 12:23:07 -07:00
Lei Jin	d642c60bdc	Check PrefixMayMatch on Seek() Summary: As a follow-up diff for https://reviews.facebook.net/D17805, add optimization to check PrefixMayMatch on Seek() Test Plan: make all check Reviewers: igor, haobo, sdong, yhchiang, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D17853	2014-04-25 12:22:23 -07:00
Lei Jin	3995e801ab	kill ReadOptions.prefix and .prefix_seek Summary: also add an override option total_order_iteration if you want to use full iterator with prefix_extractor Test Plan: make all check Reviewers: igor, haobo, sdong, yhchiang Reviewed By: haobo CC: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D17805	2014-04-25 12:21:34 -07:00
sdong	86a0133d05	PlainTableReader to expose index size to users Summary: This is a temp solution to expose index sizes to users from PlainTableReader before we persistent them to files. In this patch, the memory consumption of indexes used by PlainTableReader will be reported as two user defined properties, so that users can monitor them. Test Plan: Add a unit test. make all check` Reviewers: haobo, ljin Reviewed By: haobo CC: nkg-, yhchiang, igor, ljin, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D18195	2014-04-22 19:29:05 -07:00
Yueh-Hsuan Chiang	af6ad113a8	Fix SIGFAULT when running sst_dump on v2.6 db Summary: Fix the sigfault when running sst_dump on v2.6 db. Test Plan: git checkout `bba6595b1f` make clean make db_bench ./db_bench --db=/tmp/some/db --benchmarks=fillseq arc patch D18039 make clean make sst_dump ./sst_dump --file=/tmp/some/db --command=check Reviewers: igor, haobo, sdong Reviewed By: sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D18039	2014-04-21 17:49:47 -07:00
Igor Canadi	8dc34364d2	Rename "benchmark" back to "bench". Also, make `benchharness.cc` not compiled into rocksdb library.	2014-04-21 13:12:15 -07:00
Pratyush Seth	ff1b5df4c6	Added benchmark functionality on the lines of folly/Benchmark.h Summary: Added benchmark functionality on the lines of folly/Benchmark.h Test Plan: Added unit tests Reviewers: igor, haobo, sdong, ljin, yhchiang, dhruba Reviewed By: igor CC: leveldb Differential Revision: https://reviews.facebook.net/D17973	2014-04-21 12:29:55 -07:00
sdong	27d3bc184e	Use a different approach to make sure BlockBasedTableReader can use hash index on older files Summary: A recent commit `e37dd216f9` makes sure hash index can be used when reading existing files. This patch uses another way to achieve the approach: (1) Currently, always writing kBinarySearch to files, despite of BlockBasedTableOptions.IndexType setting. (2) When reading a file, read out the field, and make sure it is kBinarySearch, while always use index type by users. The reason for doing it is, to reserve kHashSearch property on disk to future. If now we write out binary index for both of kHashSearch and kBinarySearch. We have to use a new flag in the future for hash index on disk, otherwise compatibility would break. Also, we want the real index type and type shown in properties block to be consistent. Test Plan: make all check Reviewers: haobo, kailiu Reviewed By: kailiu CC: igor, ljin, yhchiang, xjin, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D18009	2014-04-18 14:09:21 -07:00
Kai Liu	e37dd216f9	Index type doesn't have to be persisted Summary: With the recent changes, there is no need to check the property block about the index block type. If user want to use it, they don't really need any disk format change; everything happens in the fly. Also another team encountered an error while reading the index type from properties. Test Plan: ran all the tests Reviewers: sdong CC: Task ID: # Blame Rev:	2014-04-17 11:08:12 -07:00
sdong	5cef458a2c	RocksDB 2.8 to be able to read files generated by 2.6 Summary: From 2.6 to 2.7, property block name is renamed from rocksdb.stats to rocksdb.properties. Older properties were not able to be loaded. In 2.8, we seem to have added some logic that uses property block without checking null pointers, which create segment faults. In this patch, we fix it by: (1) try rocksdb.stats if rocksdb.properties is not found (2) add some null checking before consuming rep->table_properties Test Plan: make sure a file generated in 2.7 couldn't be opened now can be opened. Reviewers: haobo, igor, yhchiang Reviewed By: igor CC: ljin, xjin, dhruba, kailiu, leveldb Differential Revision: https://reviews.facebook.net/D17961	2014-04-17 09:51:43 -07:00
Igor Canadi	588bca2020	RocksDBLite Summary: Introducing RocksDBLite! Removes all the non-essential features and reduces the binary size. This effort should help our adoption on mobile. Binary size when compiling for IOS (`TARGET_OS=IOS m static_lib`) is down to 9MB from 15MB (without stripping) Test Plan: compiles :) Reviewers: dhruba, haobo, ljin, sdong, yhchiang Reviewed By: yhchiang CC: leveldb Differential Revision: https://reviews.facebook.net/D17835	2014-04-15 13:39:26 -07:00
Kai Liu	e23e73e67c	Use shorten index key for hash-index Summary: I was wrong about the "index builder", right now since we create index by scanning both whole table and index, there is not need to preserve the whole key as the index key. I switch back to original way index which is both space efficient and able to supprot in-fly construction of hash index. IN this patch, I made minimal change since I'm not sure if we still need the "pluggable index builder", under current circumstance it is of no use and kind of over-engineered. But I'm not sure if we can still exploit its usefulness in the future; otherwise I think I can just burn them with great vengeance. Test Plan: unit tests Reviewers: sdong, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D17745	2014-04-10 22:45:25 -07:00
Kai Liu	75b59d5146	Enable hash index for block-based table Summary: Based on previous patches, this diff eventually provides the end-to-end mechanism for users to specify the hash-index. Test Plan: Wrote several new unit tests. Reviewers: sdong, haobo, dhruba Reviewed By: sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D16539	2014-04-10 14:19:43 -07:00
Igor Canadi	4daea66343	Turn on -Wmissing-prototypes Summary: Compiling for iOS has by default turned on -Wmissing-prototypes, which causes rocksdb to fail compiling. This diff turns on -Wmissing-prototypes in our compile options and cleans up all functions with missing prototypes. Test Plan: compiles Reviewers: dhruba, haobo, ljin, sdong Reviewed By: ljin CC: leveldb Differential Revision: https://reviews.facebook.net/D17649	2014-04-09 21:17:14 -07:00
sdong	e9ed28f9c8	PlainTableBuilder::Add() to use local char array instead of reused std::string as tmp buffer Summary: Our profile shows that in one of the applications, 5% of the CPU costs of PlainTableBuilder::Add() are spent on std::string stacks. By this simple change, we avoid this global reusable string. Also, we avoid another call of file appending, which probably gives another 2%. Test Plan: make all check Reviewers: haobo, ljin Reviewed By: haobo CC: igor, yhchiang, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D17601	2014-04-09 10:17:32 -07:00
Igor Canadi	34455deb06	Fix Mac OS compile issues	2014-04-08 14:05:53 -07:00
Igor Canadi	5b345b76cb	Remove env_ from MergingIterator Summary: env_ is not used. Compiling for iOS complains. Test Plan: compiles now Reviewers: ljin, haobo, sdong, dhruba Reviewed By: ljin CC: leveldb Differential Revision: https://reviews.facebook.net/D17589	2014-04-08 13:40:42 -07:00
Lei Jin	92c1eb0291	macros for perf_context Summary: This will allow us to disable them completely for iOS or for better performance Test Plan: will run make all check Reviewers: igor, haobo, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D17511	2014-04-08 10:58:07 -07:00
sdong	5e2db3b434	PlainTableIterator not to store copied key in std::string Summary: Move PlainTableIterator's copied key from std::string local buffer to avoid paying the extra costs in std::string related to sharing. Reuse the same buffer class in DbIter. Move the class to dbformat.h. This patch improves iterator performance significantly. Running this benchmark: ./table_reader_bench --num_keys2=17 --iterator --plain_table --time_unit=nanosecond The average latency is improved to about 750 nanoseconds from 1100 nanoseconds. Test Plan: Add a unit test. make all check Reviewers: haobo, ljin Reviewed By: haobo CC: igor, yhchiang, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D17547	2014-04-07 19:06:09 -07:00
Igor Canadi	3d2fe844ab	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_impl.h db/memtable_list.cc db/version_set.cc	2014-04-07 11:31:11 -07:00
Igor Canadi	bcd1f15b60	Remove -Wno-unused-const-variable	2014-04-04 16:15:47 -07:00
Igor Canadi	8555ce2dec	Merge branch 'master' into columnfamilies	2014-04-02 10:48:05 -07:00
sdong	d50619a559	PlainTableIterator::Seek() shouldn't check bloom filter in total order mode Summary: In total order mode, iterator's seek() shouldn't check total order. Also some cleaning up about checking null for shared pointers. I don't know the behavior before it. This bug was reported by @igor. Test Plan: test plain_table_db_test Reviewers: ljin, haobo, igor Reviewed By: igor CC: yhchiang, dhruba, igor, leveldb Differential Revision: https://reviews.facebook.net/D17391	2014-04-01 15:05:16 -07:00
Igor Canadi	ddbd1ece88	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_test.cc db/internal_stats.cc db/internal_stats.h db/version_edit.cc db/version_edit.h db/version_set.cc include/rocksdb/options.h util/options.cc	2014-03-31 13:39:24 -07:00
Lei Jin	0d755fff14	cache friendly blocked bloomfilter Summary: By constraining the probes within cache line(s), we can improve the cache miss rate thus performance. This probably only makes sense for in-memory workload so defaults the option to off. Numbers and comparision can be found in wiki: https://our.intern.facebook.com/intern/wiki/index.php/Ljin/rocksdb_perf/2014_03_17#Bloom_Filter_Study Test Plan: benchmarked this change substantially. Will run make all check as well Reviewers: haobo, igor, dhruba, sdong, yhchiang Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D17133	2014-03-28 09:21:20 -07:00
Igor Canadi	e20fa3f8a4	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/internal_stats.cc db/internal_stats.h db/version_set.cc	2014-03-19 17:22:20 -07:00
Kai Liu	69f6cf431d	Fix two bugs in talbe format Summary: Previous code had two bugs: * didn't initialize the table_magic_number_ explicitly -- as a result a random junk number is stored for table_magic_number_, making HasInitializedMagicNumber() always return true. * if condition is inconrrect in set_table_magic_number(), and the return value is not checked. I replace if-else by a stronger requirement enforced by assert(). Test Plan: Previous sst_dump failed to work. After the fix, things back to normal. Reviewers: yhchiang CC: haobo, sdong, igor, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D17055	2014-03-19 16:18:33 -07:00
Igor Canadi	fb2346fc1f	[CF] Code cleanup part 1 Summary: I'm cleaning up some code preparing for the big diff review tomorrow. This is the first part of the cleanup. Changes are mostly cosmetic. The goal is to decrease amount of code difference between columnfamilies and master branch. This diff also fixes race condition when dropping column family. Test Plan: Ran db_stress with variety of parameters Reviewers: dhruba, haobo Differential Revision: https://reviews.facebook.net/D16833	2014-03-12 09:56:53 -07:00
Igor Canadi	9634ba42ac	Merge branch 'master' into columnfamilies Conflicts: db/compaction_picker.cc db/db_impl.cc db/db_impl.h db/tailing_iter.cc db/version_set.h include/rocksdb/options.h util/options.cc	2014-03-10 17:26:09 -07:00
Lei Jin	8d007b4aaf	Consolidate SliceTransform object ownership Summary: (1) Fix SanitizeOptions() to also check HashLinkList. The current dynamic case just happens to work because the 2 classes have the same layout. (2) Do not delete SliceTransform object in HashSkipListFactory and HashLinkListFactory destructor. Reason: SanitizeOptions() enforces prefix_extractor and SliceTransform to be the same object when HashFactory is used. This makes the behavior strange: when HashFactory is used, prefix_extractor will be released by RocksDB. If other memtable factory is used, prefix_extractor should be released by user. Test Plan: db_bench && make asan_check Reviewers: haobo, igor, sdong Reviewed By: igor CC: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D16587	2014-03-10 12:56:46 -07:00
Igor Canadi	1e0d47276c	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_impl.h	2014-03-07 16:59:47 -08:00
Kai Liu	566f18e6ad	More precise calculation of sub_index_size Summary: Previous we did rough estimation of subindex size, which in worst case may result in array reallocation. This patch aims to get the exact size and avoid any reallocation. Test Plan: make all check Reviewers: sdong, dhruba, haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16125	2014-03-06 17:30:46 -08:00
Igor Canadi	0738ae6dc9	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc	2014-03-05 12:25:05 -08:00
kailiu	a01bda0997	Fix a buggy assert Summary: The assert was pointless since if if prefix is the same as the whole key, assertion will surely fail. Reason behind is when performing the internal key comparison, if user keys are the same, key with smaller transaction id wins. Test Plan: make -j32 && make check Reviewers: sdong, dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16551	2014-03-04 18:08:23 -08:00
Igor Canadi	fa34697237	Merge branch 'master' into columnfamilies	2014-03-04 09:39:14 -08:00
kailiu	a1d56e73ec	Uncomment the unit tests in table test	2014-03-03 22:19:49 -08:00
kailiu	906f3dca72	Add a hash-index component for block Summary: this is the key component extracted from diff: https://reviews.facebook.net/D14271 I separate it to a dedicated patch to make the review easier. Test Plan: added a unit test and passed it. Reviewers: haobo, sdong, dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D16245	2014-03-03 21:11:49 -08:00
Igor Canadi	9d0577a6be	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_impl.h db/transaction_log_impl.cc db/transaction_log_impl.h include/rocksdb/options.h util/env.cc util/options.cc	2014-03-03 18:29:03 -08:00
sdong	f0ee2356af	Fix issue with iterator operations in this order: Prev(), Seek(), Prev() Summary: Due to a bad merge of D14163 and D14001 before checking in D14001, "direction_ = kForward;" in MergeIterator::Seek() was deleted my mistake (in commit `b135d01e7b` ). It will generate wrong results or assert failure after the sequence of Prev() (or SeekToLast()), Seek() and Prev(). Fix it Test Plan: make all check Reviewers: igor, haobo, dhruba Reviewed By: igor CC: yhchiang, i.am.jin.lei, ljin, leveldb Differential Revision: https://reviews.facebook.net/D16527	2014-03-03 17:50:27 -08:00
kailiu	1aeafeccac	Make the Create() function comform the convention Summary: Moved "Return multiple values" a more conventional way.	2014-03-01 23:43:03 -08:00
Kai Liu	16d4e45c12	Fix the memory leak in table index Summary: BinarySearchIndex didn't use unique_ptr to guard the block object nor delete it in destructor, leading to valgrind failure for "definite memory leak". Test Plan: re-ran the failed valgrind test cases	2014-03-01 11:50:35 -08:00
Kai Liu	ff151132b3	Fix the unit test failure in devbox Summary: My last diff was developed in MacOS but in devserver environment error occurs. I dug into the problem and found the way we calcuate approximate data size is pretty out-of-date. We can use table properties to get more accurate results. Test Plan: ran ./table_test and passed Reviewers: igor, dhruba, haobo, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D16509	2014-02-28 20:40:05 -08:00
kailiu	74939a9e13	Make the block-based table's index pluggable Summary: This patch introduced a new table options that allows us to make block-based table's index pluggable. To support that new features: * Code has been refacotred to be more flexible and supports this option well. * More documentation is added for the existing obsecure functionalities. * Big surgeon on DataBlockReader(), where the logic was really convoluted. * Other small code cleanups. The pluggablility will mostly affect development of internal modules and won't change frequently, as a result I intentionally avoid heavy-weight patterns (like factory) and try to make it simple. Test Plan: make all check Reviewers: haobo, sdong Reviewed By: sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D16395	2014-02-28 18:19:07 -08:00
kailiu	bf86af5174	Remove the terrible hack in for flush_block_policy_factory Summary: Previous code is too convoluted and I must be drunk for letting such code to be written without a second thought. Thanks to the discussion with @sdong, I added the `Options` when generating the flusher, thus avoiding the tricks. Just FYI: I resisted to add Options in flush_block_policy.h since I wanted to avoid cyclic dependencies: FlushBlockPolicy dpends on Options and Options also depends FlushBlockPolicy... While I appreciate my effort to prevent it, the old design turns out creating more troubles than it tried to avoid. Test Plan: ran ./table_test Reviewers: sdong Reviewed By: sdong CC: sdong, leveldb Differential Revision: https://reviews.facebook.net/D16503	2014-02-28 16:39:27 -08:00
kailiu	444cafc28c	Fix inconsistent code format Summary: Found some function follows camel style. When naming funciton, we have two styles: Trivially expose internal data in readonly mode: `all_lower_case()` Regular function: `CapitalizeFirstLetter()` I renames these functions. Test Plan: make -j32 Reviewers: haobo, sdong, dhruba, igor CC: leveldb Differential Revision: https://reviews.facebook.net/D16383	2014-02-26 18:56:39 -08:00
sdong	a04dbf6e49	PlainTable::Next() should pass the error message from ReadKey() Summary: PlainTable::Next() should pass the error message from ReadKey(). Now it would return a wrong error message. Also improve the messages of status when failing to read Test Plan: make all check Reviewers: ljin, kailiu, haobo Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D16365	2014-02-26 15:12:44 -08:00
Igor Canadi	d39da4b578	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc	2014-02-24 17:09:05 -08:00
Kai Liu	2b205b35d8	Disable putting filter block to block cache Summary: This bug caused server crash issues because the filter block is too big and kept purging out of cache. Test Plan: Wrote a new unit tests to make sure it works. Reviewers: dhruba, haobo, igor, sdong Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16221	2014-02-19 15:38:57 -08:00
Igor Canadi	76c048183c	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_test.cc include/rocksdb/db.h	2014-02-14 16:46:03 -08:00
kailiu	63690625cd	Expose the table properties to application Summary: Provide a public API for users to access the table properties for each SSTable. Test Plan: Added a unit tests to test the function correctness under differnet conditions. Reviewers: haobo, dhruba, sdong Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16083	2014-02-13 16:28:21 -08:00
Kai Liu	b2e7ee8b41	Followup code refactor on plain table Summary: Fixed most comments in https://reviews.facebook.net/D15429. Still have some remaining comments left. Test Plan: make all check Reviewers: sdong, haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15885	2014-02-13 15:27:59 -08:00
Kai Liu	59cffe02c4	Benchmark table reader wiht nanoseconds Summary: nanosecnods gave us better view of the performance, especially when some operations are fast so that micro seconds may only reveal less informative results. Test Plan: sample output: ./table_reader_bench --plain_table --time_unit=nanosecond ======================================================================================================= InMemoryTableSimpleBenchmark: PlainTable num_key1: 4096 num_key2: 512 non_empty ======================================================================================================= Histogram (unit: nanosecond): Count: 6291456 Average: 475.3867 StdDev: 556.05 Min: 135.0000 Median: 400.1817 Max: 33370.0000 Percentiles: P50: 400.18 P75: 530.02 P99: 887.73 P99.9: 8843.26 P99.99: 9941.21 ------------------------------------------------------ [ 120, 140 ) 2 0.000% 0.000% [ 140, 160 ) 452 0.007% 0.007% [ 160, 180 ) 13683 0.217% 0.225% [ 180, 200 ) 54353 0.864% 1.089% [ 200, 250 ) 101004 1.605% 2.694% [ 250, 300 ) 729791 11.600% 14.294% ## [ 300, 350 ) 616070 9.792% 24.086% ## [ 350, 400 ) 1628021 25.877% 49.963% ##### [ 400, 450 ) 647220 10.287% 60.250% ## [ 450, 500 ) 577206 9.174% 69.424% ## [ 500, 600 ) 1168585 18.574% 87.999% #### [ 600, 700 ) 506875 8.057% 96.055% ## [ 700, 800 ) 147878 2.350% 98.406% [ 800, 900 ) 42633 0.678% 99.083% [ 900, 1000 ) 16304 0.259% 99.342% [ 1000, 1200 ) 7811 0.124% 99.466% [ 1200, 1400 ) 1453 0.023% 99.490% [ 1400, 1600 ) 307 0.005% 99.494% [ 1600, 1800 ) 81 0.001% 99.496% [ 1800, 2000 ) 18 0.000% 99.496% [ 2000, 2500 ) 8 0.000% 99.496% [ 2500, 3000 ) 6 0.000% 99.496% [ 3500, 4000 ) 3 0.000% 99.496% [ 4000, 4500 ) 116 0.002% 99.498% [ 4500, 5000 ) 1144 0.018% 99.516% [ 5000, 6000 ) 1087 0.017% 99.534% [ 6000, 7000 ) 2403 0.038% 99.572% [ 7000, 8000 ) 9840 0.156% 99.728% [ 8000, 9000 ) 12820 0.204% 99.932% [ 9000, 10000 ) 3881 0.062% 99.994% [ 10000, 12000 ) 135 0.002% 99.996% [ 12000, 14000 ) 159 0.003% 99.998% [ 14000, 16000 ) 58 0.001% 99.999% [ 16000, 18000 ) 30 0.000% 100.000% [ 18000, 20000 ) 14 0.000% 100.000% [ 20000, 25000 ) 2 0.000% 100.000% [ 25000, 30000 ) 2 0.000% 100.000% [ 30000, 35000 ) 1 0.000% 100.000% Reviewers: haobo, dhruba, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D16113	2014-02-13 13:57:36 -08:00
sdong	b5140a0361	Fix table_reader_bench and add it to "make" Summary: Fix table_reader_bench after some interface changes. Add it to make to avoid future breaking Test Plan: make table_reader_bench and run it with different options. Reviewers: kailiu, haobo Reviewed By: haobo CC: igor, leveldb Differential Revision: https://reviews.facebook.net/D16107	2014-02-12 18:31:02 -08:00
Siying Dong	f3ae3d07cc	Add more black-box tests for PlainTable and explicitly support total order mode Summary: 1. Add some more implementation-aware tests for PlainTable 2. move from a hard-coded one index per 16 rows in one prefix to a configurable number. Also, make hash table ratio = 0 means binary search only. Also fixes some divide 0 risks. 3. Explicitly support total order (only use binary search) 4. some code cleaning up. Test Plan: make all check Reviewers: haobo, kailiu Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16023	2014-02-12 17:37:22 -08:00
Igor Canadi	ccdb93e775	Merge branch 'master' into columnfamilies Conflicts: db/db_impl.cc db/db_impl.h db/memtable_list.cc db/memtable_list.h db/version_set.cc db/version_set.h	2014-02-12 14:01:30 -08:00
Igor Canadi	b06840aa7d	[CF] Rethinking ColumnFamilyHandle and fix to dropping column families Summary: The change to the public behavior: * When opening a DB or creating new column family client gets a ColumnFamilyHandle. * As long as column family handle is alive, client can do whatever he wants with it, even drop it * Dropped column family can still be read from (using the column family handle) * Added a new call CloseColumnFamily(). Client has to close all column families that he has opened before deleting the DB * As soon as column family is closed, any calls to DB using that column family handle will fail (also any outstanding calls) Internally: * Ref-counting ColumnFamilyData * New thread-safety for ColumnFamilySet * Dropped column families are now completely dropped and their memory cleaned-up Test Plan: added some tests to column_family_test Reviewers: dhruba, haobo, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D16101	2014-02-12 13:47:09 -08:00
kailiu	e6b3e3b4db	Support prefix seek in UserCollectedProperties Summary: We'll need the prefix seek support for property aggregation. Test Plan: make all check Reviewers: haobo, sdong, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15963	2014-02-12 13:14:59 -08:00
kailiu	aa734ce9ab	Fix a member variables initialization order issue Summary: In MacOS, I got issue with `Footer`'s default constructor, which initialized the magic number with some random number instead of 0. With investigation, I found we forgot to make the kInvalidTableMagicNumber to be static. As a result, kInvalidTableMagicNumber was assgined to `table_magic_number_` before it is initialized (which will be populated with random number). Test Plan: passed current unit tests; also passed the unit tests for the incoming diff which used the default footer. Reviewers: yhchiang CC: leveldb Differential Revision: https://reviews.facebook.net/D16077	2014-02-11 14:16:46 -08:00
kailiu	745c181e20	Quick fix for table_test failure Summary: * Fixed the compression state array size bug. * Temporarily disable running `DoCompressionTest()` against bzip, which will fail the test. Test Plan: make && ./table_test Reviewers: igor CC: leveldb Differential Revision: https://reviews.facebook.net/D16065	2014-02-10 17:05:14 -08:00
Igor Canadi	8e634d3ea4	Merge pull request #74 from alberts/lz4 Support for LZ4 compression.	2014-02-10 15:46:56 -08:00
Albert Strasheim	df2f92214a	Support for LZ4 compression.	2014-02-08 14:15:51 -08:00
Kai Liu	9a270f3f6d	Fix the valgrind error in table test.	2014-02-07 22:43:58 -08:00
Kai Liu	b8ea5e36b3	Fix incompatible compilation in Linux server	2014-02-07 19:47:48 -08:00
kailiu	161ab42a8a	Make table properties shareable Summary: We are going to expose properties of all tables to end users through "some" db interface. However, current design doesn't naturally fit for this need, which is because: 1. If a table presents in table cache, we cannot simply return the reference to its table properties, because the table may be destroy after compaction (and we don't want to hold the ref of the version). 2. Copy table properties is OK, but it's slow. Thus in this diff, I change the table reader's interface to return a shared pointer (for const table properties), instead a const refernce. Test Plan: `make check` passed Reviewers: haobo, sdong, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15999	2014-02-07 19:26:49 -08:00
Igor Canadi	2b8c44639a	Merge branch 'master' into columnfamilies	2014-02-07 11:42:56 -08:00
Yueh-Hsuan Chiang	3ce8d9a988	Add support for plain table format to sst_dump. Summary: This diff enables the command line tool `sst_dump` to work for sst files under plain table format. Changes include: * In tools/sst_dump.cc: - add support for plain table format - display prefix_extractor information when --show_properties is on * In table/format.cc - Now the table magic number of a Footer can be later initialized via ReadFooterFromFile(). * In table/meta_bocks: - add function ReadTableMagicNumber() that reads the magic number of the specified file. Minor fixes: - remove a duplicate #include in table/table_test.cc - fix a commentary typo in include/rocksdb/memtablerep.h - fix lint errors. Test Plan: Runs sst_dump with both block-based and plain-table format files with different arguments, specifically those with --show-properties and --from. * sample output: https://reviews.facebook.net/P261 Reviewers: kailiu, sdong, xjin CC: leveldb Differential Revision: https://reviews.facebook.net/D15903	2014-02-07 11:15:00 -08:00
Igor Canadi	0143abdbb0	Merge branch 'master' into columnfamilies Conflicts: HISTORY.md db/db_impl.cc db/db_impl.h db/db_iter.cc db/db_test.cc db/dbformat.h db/memtable.cc db/memtable_list.cc db/memtable_list.h db/table_cache.cc db/table_cache.h db/version_edit.h db/version_set.cc db/version_set.h db/write_batch.cc db/write_batch_test.cc include/rocksdb/options.h util/options.cc	2014-02-06 15:58:20 -08:00
Igor Canadi	8fa8a708ef	[CF] Propagate correct options to WriteBatch::InsertInto Summary: WriteBatch can have multiple column families in one batch. Every column family has different options. So we have to add a way for write batch to get options for an arbitrary column family. This required a bit more acrobatics since lots of interfaces had to be changed. Test Plan: make check Reviewers: dhruba, haobo, sdong, kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15957	2014-02-06 10:23:31 -08:00
kailiu	84f8185fc0	Merge branch 'master' into performance Conflicts: HISTORY.md db/db_impl.cc db/memtable.cc	2014-02-05 21:21:00 -08:00
Igor Canadi	328ac7ee02	Merge branch 'master' into columnfamilies	2014-02-05 12:00:33 -08:00
kailiu	d43ebd8c65	Put table factory back to public api Summary: Previous I am too ambitious to hide every detail about table factory to internal api. However, we cannot pass the compilatoin for external users since we use table factory as the shared_ptr, which requires the definition of table factory's destructor. Test Plan: make check; Reviewers: sdong, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15861	2014-02-03 19:51:20 -08:00
Igor Canadi	2966d764cd	Fix some 32-bit compile errors Summary: RocksDB doesn't compile on 32-bit architecture apparently. This is attempt to fix some of 32-bit errors. They are reported here: https://gist.github.com/paxos/8789697 Test Plan: RocksDB still compiles on 64-bit :) Reviewers: kailiu Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15825	2014-02-03 13:48:30 -08:00
Siying Dong	d169b67680	[Performance Branch] PlainTable to encode rows with seqID 0, value type using 1 internal byte. Summary: In PlainTable, use one single byte to represent 8 bytes of internal bytes, if seqID = 0 and it is value type (which should be common for bottom most files). It is to save 7 bytes for uncompressed cases. Test Plan: make all check Reviewers: haobo, dhruba, kailiu Reviewed By: haobo CC: igor, leveldb Differential Revision: https://reviews.facebook.net/D15489	2014-02-03 12:19:30 -08:00
kailiu	4f6cb17bdb	First phase API clean up Summary: Addressed all the issues in https://reviews.facebook.net/D15447. Now most table-related modules are hidden from user land. Test Plan: make check Reviewers: sdong, haobo, dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D15525	2014-02-03 00:30:43 -08:00
kailiu	4e0298f23c	Clean up arena API Summary: Easy thing goes first. This patch moves arena to internal dir; based on which, the coming patch will deal with memtable_rep. Test Plan: make check Reviewers: haobo, sdong, dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D15615	2014-01-30 22:10:10 -08:00
kailiu	a5e220f5ef	Merge branch 'master' into performance Conflicts: Makefile db/db_impl.cc db/db_test.cc db/memtable_list.cc db/memtable_list.h table/block_based_table_reader.cc table/table_test.cc util/cache.cc util/coding.cc	2014-01-28 10:35:55 -08:00
Igor Canadi	eb055609e4	[column families] Move memtable and immutable memtable list to column family data Summary: All memtables and immutable memtables are moved from DBImpl to ColumnFamilyData. For now, they are all referenced from default column family in DBImpl. It shouldn't be hard to get them from custom column family. Test Plan: make check Reviewers: dhruba, kailiu, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D15459	2014-01-27 13:37:14 -08:00
Kai Liu	4b51dffcf8	Some refactorings on plain table Summary: Plain table has been working well and this is just a nit-picking patch, which is generated during my coding reading. No real functional changes. only some changes regarding: * Improve some comments from the perspective a "new" code reader. * Change some magic number to constant, which can help us to parameterize them in the future. * Did some style, naming, C++ convention changes. * Fix warnings from new "arc lint" Test Plan: make check Reviewers: sdong, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D15429	2014-01-24 21:28:10 -08:00
Kai Liu	0ab766132b	Re-org the table tests Summary: We'll divide the table tests into 3 buckets, plain table test, block-based table test and general table feature test. This diff does no real change and only does the rename and reorg. Test Plan: run table_test Reviewers: sdong, haobo, igor, dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D15417	2014-01-24 12:46:17 -08:00
Kai Liu	7d991be400	Some small refactorings on table_test Summary: Just revise some hard-to-read or unnecessarily verbose code. Test Plan: make check	2014-01-24 11:10:32 -08:00
kailiu	66dc033af3	Temporarily disable caching index/filter blocks Summary: Mixing index/filter blocks with data blocks resulted in some known issues. To make sure in next release our users won't be affected, we added a new option in BlockBasedTableFactory::TableOption to conceal this functionality for now. This patch also introduced a BlockBasedTableReader::OpenOptions, which avoids the "infinite" growth of parameters in BlockBasedTableReader::Open(). Test Plan: make check Reviewers: haobo, sdong, igor, dhruba Reviewed By: igor CC: leveldb, tnovak Differential Revision: https://reviews.facebook.net/D15327	2014-01-24 10:57:15 -08:00
Kai Liu	054c5dda8c	Merge branch 'master' into performance Conflicts: db/db_impl.cc db/db_test.cc db/memtable.cc db/version_set.cc include/rocksdb/statistics.h util/statistics_imp.h	2014-01-23 16:32:49 -08:00
Kai Liu	ef602f6275	Misc cleanup on performance branch Summary: Did some trivial stuffs: * Add more comments; * fix compiler's warning messages (uninitialized variables). * etc Test Plan: make check	2014-01-17 14:26:29 -08:00
Igor Canadi	83681bf9ef	Statistics code cleanup Summary: I'm separating code-cleanup part of https://reviews.facebook.net/D14517. This will make D14517 easier to understand and this diff easier to review. Test Plan: make check Reviewers: haobo, kailiu, sdong, dhruba, tnovak Reviewed By: tnovak CC: leveldb Differential Revision: https://reviews.facebook.net/D15099	2014-01-17 12:46:06 -08:00
kailiu	1304d8c8ce	Merge branch 'master' into performance Conflicts: Makefile db/db_impl.cc db/db_impl.h db/db_test.cc db/memtable.cc db/memtable.h db/version_edit.h db/version_set.cc include/rocksdb/options.h util/hash_skiplist_rep.cc util/options.cc	2014-01-15 23:12:31 -08:00
Igor Canadi	7f3e417f59	Fix memtable construction in tests	2014-01-14 15:36:12 -08:00
Kai Liu	8c4eb71b5d	Fix one more valgrind error in table_test	2014-01-03 18:27:33 -08:00
Kai Liu	5e7d5629c7	Fix the valgrind issues	2014-01-03 11:48:31 -08:00
Siying Dong	abaf26266d	[RocksDB] [Performance Branch] Some Changes to PlainTable format Summary: Some changes to PlainTable format: (1) support variable key length (2) use user defined slice transformer to extract prefixes (3) Run some test cases against PlainTable in db_test and table_test Test Plan: test db_test Reviewers: haobo, kailiu CC: dhruba, igor, leveldb, nkg- Differential Revision: https://reviews.facebook.net/D14457	2013-12-20 12:08:35 -08:00
Siying Dong	28c24de8be	[RocksDB Peformance Branch] A bug in PlainTable format Summary: A bug to fix. IT's already fixed in D14457, but want to check it in sooner to unblock tests Test Plan: plain_table_db_test Reviewers: nkg-, haobo Reviewed By: nkg- CC: kailiu, leveldb Differential Revision: https://reviews.facebook.net/D14673	2013-12-13 21:51:16 -08:00
Kai Liu	2e9efcd6d8	Add the property block for the plain table Summary: This is the last diff that adds the property block to plain table. The format resembles that of the block-based table: https://github.com/facebook/rocksdb/wiki/Rocksdb-table-format [data block] [meta block 1: stats block] [meta block 2: future extended block] ... [meta block K: future extended block] (we may add more meta blocks in the future) [metaindex block] [index block: we only have the placeholder here, we can add persistent index block in the future] [Footer: contains magic number, handle to metaindex block and index block] <end_of_file> Test Plan: extended existing property block test. Reviewers: haobo, sdong, dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14523	2013-12-13 17:18:14 -08:00
Siying Dong	9718c790ec	[Performance Branch] Fix a bug of PlainTable when building indexes Summary: PlainTable now has a bug of the ordering of indexes for the prefixes in the same bucket. I thought std::map guaranteed key order but it didn't, probably because I didn't use it properly. But seems to me that we don't need to make extra sorting as input prefixes are already sorted. Found by problem by running leaf4 against plain table. Replace the map with a vector. It should performs better too. After the fix, leaf4 unit tests are passing. Test Plan: run plain_table_db_test Also going to run db_test with plain table in the uncommitted branch. Reviewers: haobo, kailiu Reviewed By: haobo CC: nkg-, leveldb Differential Revision: https://reviews.facebook.net/D14649	2013-12-12 22:22:35 -08:00
kailiu	551e9428ce	Merge branch 'master' into performance	2013-12-06 14:15:42 -08:00
kailiu	b1d2de4a40	Fix #26 by putting the implementation of CreateDBStatistics() to a cc file	2013-12-05 22:29:03 -08:00
kailiu	90729f8b23	Extract metaindex block from block-based table Summary: This change will allow other table to reuse the code for meta blocks. Test Plan: all existing unit tests passed Reviewers: dhruba, haobo, sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D14475	2013-12-05 16:34:16 -08:00
kailiu	e1d92dfd2e	Fix a bunch of mac compilation issues in performance branch	2013-12-04 23:00:33 -08:00
Kai Liu	219b35be6a	Generalize footer reading from file Summary: Generalizing this process will help us to re-use the code for plain table Test Plan: ran ./table_test	2013-12-04 16:35:48 -08:00
Kai Liu	5dec7acd91	Introducing the concept of NULL block handle	2013-12-04 15:45:14 -08:00
Kai Liu	3a0e98d558	Parameterize table magic number Summary: As we are having different types of tables and they all might share the same structure in block-based table: [metaindex block] [index block] [Footer] To be able to identify differnt types of tables, we need to parameterize the "magic number" in the `Footer`. Test Plan: make check	2013-12-04 15:17:40 -08:00
Siying Dong	f040e536e4	[RocksDB Performance Branch] A more customized index in PlainTableReader Summary: PlainTableReader to use a more customized hash table. This patch assumes the SST file is smaller than 2GB: (1) Every bucket uses 32-bit integer (2) no key is stored in bucket (3) use the first bit of the bucket value to distinguish it points to the file offset or a second level index. This index schema fits the use case that most of prefixes have very small number of keys Test Plan: plain_table_db_test Reviewers: haobo, kailiu, dhruba Reviewed By: haobo CC: nkg-, leveldb Differential Revision: https://reviews.facebook.net/D14343	2013-12-04 13:43:45 -08:00
Igor Canadi	043fc14c3e	Get rid of some shared_ptrs Summary: I went through all remaining shared_ptrs and removed the ones that I found not-necessary. Only GenerateCachePrefix() is called fairly often, so don't expect much perf wins. The ones that are left are accessed infrequently and I think we're fine with keeping them. Test Plan: make asan_check Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14427	2013-12-03 11:17:58 -08:00
Dhruba Borthakur	96bc3ec297	Memtables should be deleted appropriately in the unit test. Summary: Memtables should be deleted appropriately in the unit test. Test Plan: Reviewers: CC: Task ID: # Blame Rev:	2013-12-01 21:23:44 -08:00
Kai Liu	1966b63137	Merge branch 'master' into perf	2013-11-27 11:47:40 -08:00
Haobo Xu	5b825d6964	[RocksDB] Use raw pointer instead of shared pointer when passing Statistics object internally Summary: liveness of the statistics object is already ensured by the shared pointer in DB options. There's no reason to pass again shared pointer among internal functions. Raw pointer is sufficient and efficient. Test Plan: make check Reviewers: dhruba, MarkCallaghan, igor Reviewed By: dhruba CC: leveldb, reconnect.grayhat Differential Revision: https://reviews.facebook.net/D14289	2013-11-25 10:38:15 -08:00
Siying Dong	dfa1460d88	[For Performance Branch] Bloom filter in PlainTableIterator::Seek() - Update 1 Summary: Address @haobo's comments in D14277 Test Plan: ./indexed_table_db_test Reviewers: haobo CC: Task ID: # Blame Rev:	2013-11-21 23:33:45 -08:00
Siying Dong	718488abc5	Add BloomFilter to PlainTableIterator::Seek() Summary: This patch adds a simple bloom filter in PlainTableIterator::Seek() Test Plan: N/A Reviewers: CC: Task ID: # Blame Rev:	2013-11-21 22:26:39 -08:00
kailiu	0c93df912e	Improve the readability of the TableProperties::ToString()	2013-11-21 17:54:23 -08:00
Siying Dong	3e35aa6412	Revert "Allow users to profile a query and see bottleneck of the query" This reverts commit `3d8ac31d71`.	2013-11-21 17:40:39 -08:00
Siying Dong	b135d01e7b	Allow users to profile a query and see bottleneck of the query Summary: Provide a framework to profile a query in detail to figure out latency bottleneck. Currently, in Get(), Put() and iterators, 2-3 simple timing is used. We can easily add more profile counters to the framework later. Test Plan: Enable this profiling in seveal existing tests. Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14001 Conflicts: table/merger.cc	2013-11-21 17:39:19 -08:00
Siying Dong	3d8ac31d71	Allow users to profile a query and see bottleneck of the query Summary: Provide a framework to profile a query in detail to figure out latency bottleneck. Currently, in Get(), Put() and iterators, 2-3 simple timing is used. We can easily add more profile counters to the framework later. Test Plan: Enable this profiling in seveal existing tests. Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14001	2013-11-21 16:29:57 -08:00
kailiu	7b10fe9fac	Fix a memory leak happened in table_test	2013-11-20 20:49:23 -08:00
Siying Dong	b59d4d5a50	A Simple Plain Table Summary: A Simple plain table format. No block structure. When creating the table reader, scanning the full table to create indexes. Test Plan:Add unit test Reviewers:haobo,dhruba,kailiu CC: Task ID: # Blame Rev:	2013-11-20 18:44:22 -08:00
kailiu	56589ab81f	Add TableOptions for BlockBasedTableFactory We are having more and more options to specify for this table so it makes sense to have a TableOptions for future extension.	2013-11-20 18:42:12 -08:00
Siying Dong	15b31b57df	MergingIterator.Seek() to lazily initialize MinHeap Summary: For the use cases that prefix filtering is enabled, initializing heaps when doing MergingIterator.Seek() might introduce non-negligible costs. This patch makes it lazily done. Test Plan: make all check Reviewers: haobo,dhruba,kailiu CC: Task ID: # Blame Rev:	2013-11-20 17:29:35 -08:00
kailiu	1c8b819be2	Fix a memory leak happened in table_test	2013-11-20 13:45:32 -08:00
kailiu	6eb5649800	Move flush_block_policy from Options to TableFactory Summary: Previously we introduce a `flush_block_policy_factory` in Options, however, that options is strongly releated to Table based tables. It will make more sense to move it to block based table's own factory class. Test Plan: make check to pass existing tests Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14211	2013-11-19 22:00:48 -08:00
Igor Canadi	469a9f32a7	Fix two nasty use-after-free-bugs Summary: These bugs were caught by ASAN crash test. 1. The first one, in table/filter_block.cc is very nasty. We first reference entries_ and store the reference to Slice prev. Then, we call entries_.append(), which can change the reference. The Slice prev now points to junk. 2. The second one is a bug in a test, so it's not very serious. Once we set read_opts.prefix, we never clear it, so some other function might still reference it. Test Plan: asan crash test now runs more than 5 mins. Before, it failed immediately. I will run the full one, but the full one takes quite some time (5 hours) Reviewers: dhruba, haobo, kailiu Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14223	2013-11-19 21:01:48 -08:00
kailiu	1415f8820d	Improve the "table stats" Summary: The primary motivation of the changes is to make it easier to figure out the inside of the tables. * rename "table stats" to "table properties" since now we have more than "integers" to store in the property block. * Add filter block size to the basic table properties. * Whenever a table is built, we'll log the table properties (the sample output is in Test Plan). * Make an api to expose deleted keys. Test Plan: Passed all existing test. and the sample output of table stats: ================================================================== Basic Properties ------------------------------------------------------------------ # data blocks: 1 # entries: 1 raw key size: 9 raw average key size: 9 raw value size: 9 raw average value size: 0 data block size: 25 index block size: 27 filter block size: 18 (estimated) table size: 70 filter policy: rocksdb.BuiltinBloomFilter ================================================================== User collected properties: InternalKeyPropertiesCollector ------------------------------------------------------------------ kDeletedKeys: 1 ================================================================== Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14187	2013-11-19 16:29:42 -08:00
kailiu	97d8e573a6	make util/env_posix.cc work under mac Summary: This diff invoves some more complicated issues in the posix environment. Test Plan: works under mac os. will need to verify dev box. Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14061	2013-11-16 23:44:39 -08:00
Kai Liu	7604e2f70c	Make the options in table_builder/block_builder less misleading Summary: By original design, the regular `block options` and index `block options` in table_builder is mutable. We can use ChangeOptions to change the options directly. However, with my last change, `BlockBuilder` no longer hold the reference to the index_block_options -- as a result, any changes made after the creation of index block builder will be of no effect. But still the code is very error-prone and developers can easily fall into the trap without aware of it. To avoid this problem from happening in the future, I deleted the `ChangeOptions` and the `index_block_options`, as well as many other changes to make it less misleading. Test Plan: make make check make release Reviewers: dhruba, haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13707	2013-11-16 23:21:15 -08:00
Siying Dong	55baa3d955	Add an option to table_reader_bench to access the table from DB And Iterating non-existing prefix case. Summary: This patch adds an option to table_reader_bench that queries run against DB level (which has one table). It is useful if user wants to see the extra costs DB level introduces. Test Plan: Run the benchmark with and without the new parameter Reviewers: haobo, dhruba, kailiu Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D13863	2013-11-15 22:23:12 -08:00
Igor Canadi	e0ad0f26b8	Fix bloom filters Summary: https://reviews.facebook.net/D13167 broke bloom filters. If filter is not in cache, we want to return true (safe thing). Am I right? Test Plan: when benchmarking https://reviews.facebook.net/D14031 I got different results when using bloom filters vs. when not using them. This fixed the issue. I will also be putting this change to the other diff, but that one will probably be in review for longer time. Reviewers: kailiu, dhruba, haobo Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D14085	2013-11-14 14:05:15 -08:00
Kai Liu	80bb81c6fe	Add the correct table_factory for tables in table_tests	2013-11-12 23:54:31 -08:00
Kai Liu	88ba331c1a	Add the index/filter block cache Summary: This diff leverage the existing block cache and extend it to cache index/filter block. Test Plan: Added new tests in db_test and table_test The correctness is checked by: 1. make check 2. make valgrind_check Performance is test by: 1. 10 times of build_tools/regression_build_test.sh on two versions of rocksdb before/after the code change. Test results suggests no significant difference between them. For the two key operatons `overwrite` and `readrandom`, the average iops are both 20k and ~260k, with very small variance). 2. db_stress. Reviewers: dhruba Reviewed By: dhruba CC: leveldb, haobo, xjin Differential Revision: https://reviews.facebook.net/D13167	2013-11-12 22:46:51 -08:00
kailiu	21587760b9	Fixing the warning messages captured under mac os # Consider using `git commit -m 'One line title' && arc diff`. # You will save time by running lint and unit in the background. Summary: The work to make sure mac os compiles rocksdb is not completed yet. But at least we can start cleaning some warnings captured only by g++ from mac os.. Test Plan: ran make in mac os Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14049	2013-11-12 20:05:28 -08:00
Kai Liu	0ef628537c	Don't not suggest flushing data when data block is still empty Summary: This diff fix the bug when the Options::block_size is too small.	2013-11-11 21:05:16 -08:00
Kai Liu	551ecfa416	Move down the time consuming tests in table_test Summary: it helps us to better check the tests we really care. Test Plan: make	2013-11-10 01:17:32 -08:00
Kai Liu	fd075d6edd	Provide mechanism to configure when to flush the block Summary: Allow block based table to configure the way flushing the blocks. This feature will allow us to add support for prefix-aligned block. Test Plan: make check Reviewers: dhruba, haobo, sdong, igor Reviewed By: sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D13875	2013-11-07 21:27:21 -08:00
Dhruba Borthakur	292c2b3357	Fix stress test failure when using mmap-reads. Summary: The mmap-read file->Read() does not use the scratch buffer to read in file-contents. Test Plan: ./db_stress --test_batches_snapshots=1 --ops_per_thread=100000000 --threads=32 --write_buffer_size=4194304 --destroy_db_initially=0 --reopen=0 --readpercent=45 --prefixpercent=5 --writepercent=35 --delpercent=5 --iterpercent=10 --db=/tmp/dhruba --max_key=100000000 --disable_seek_compaction=0 --mmap_read=1 --block_size=16384 --cache_size=1048576 --open_files=500000 --verify_checksum=1 --sync=1 --disable_wal=0 --disable_data_sync=0 --target_file_size_base=2097152 --target_file_size_multiplier=2 --max_write_buffer_number=3 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --filter_deletes=0 Reviewers: haobo, kailiu Reviewed By: kailiu CC: leveldb, kailiu, emayanke Differential Revision: https://reviews.facebook.net/D13923	2013-11-06 15:40:26 -08:00
Igor Canadi	36409e0016	Fix slow no-io iterator Summary: This fixes #3130525. Dhruba's suggestion and Tnovak's implementation :) The issue was with SkipEmptyDataBlocksForward(), but I also changed SkipEmptyDataBlocksBackward(). Is that OK? Test Plan: Run the logdevice test Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13911	2013-11-06 14:11:52 -08:00
Dhruba Borthakur	b4ad5e89ae	Implement a compressed block cache. Summary: Rocksdb can now support a uncompressed block cache, or a compressed block cache or both. Lookups first look for a block in the uncompressed cache, if it is not found only then it is looked up in the compressed cache. If it is found in the compressed cache, then it is uncompressed and inserted into the uncompressed cache. It is possible that the same block resides in the compressed cache as well as the uncompressed cache at the same time. Both caches have their own individual LRU policy. Test Plan: Unit test case attached. Reviewers: kailiu, sdong, haobo, leveldb Reviewed By: haobo CC: xjin, haobo Differential Revision: https://reviews.facebook.net/D12675	2013-11-01 14:31:35 -07:00
Siying Dong	82b7e37f6e	Fix a bug of table_reader_bench Summary: Iterator benchmark case is timed incorrectly. Fix it Test Plan: Run the benchmark Reviewers: haobo, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13845	2013-10-31 15:26:06 -07:00
Siying Dong	7caadf2e52	A very simple benchmark to measure Table implemenation's Get() And Iterator performance Summary: It is a very simple benchmark to measure a Table implementation's Get() and iterator performance if all the data is in memory. Test Plan: N/A Reviewers: dhruba, haobo, kailiu Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13743	2013-10-31 13:38:54 -07:00
Naman Gupta	b4fab3be2a	Merge branch 'master' of github.com:facebook/rocksdb into inplace	2013-10-31 11:51:03 -07:00
Naman Gupta	fe25070242	In-place updates for equal keys and similar sized values Summary: Currently for each put, a fresh memory is allocated, and a new entry is added to the memtable with a new sequence number irrespective of whether the key already exists in the memtable. This diff is an attempt to update the value inplace for existing keys. It currently handles a very simple case: 1. Key already exists in the current memtable. Does not inplace update values in immutable memtable or snapshot 2. Latest value type is a 'put' ie kTypeValue 3. New value size is less than existing value, to avoid reallocating memory TODO: For a put of an existing key, deallocate memory take by values, for other value types till a kTypeValue is found, ie. remove kTypeMerge. TODO: Update the transaction log, to allow consistent reload of the memtable. Test Plan: Added a unit test verifying the inplace update. But some other unit tests broken due to invalid sequence number checks. WIll fix them next. Reviewers: xinyaohu, sumeet, haobo, dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12423 Automatic commit by arc	2013-10-31 11:27:12 -07:00
Siying Dong	f03b2df010	Follow-up Cleaning-up After D13521 Summary: This patch is to address @haobo's comments on D13521: 1. rename Table to be TableReader and make its factory function to be GetTableReader 2. move the compression type selection logic out of TableBuilder but to compaction logic 3. more accurate comments 4. Move stat name constants into BlockBasedTable implementation. 5. remove some uncleaned codes in simple_table_db_test Test Plan: pass test suites. Reviewers: haobo, dhruba, kailiu Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13785	2013-10-30 10:52:33 -07:00
Kai Liu	79d8dad331	Change a typo in method signature	2013-10-28 21:23:17 -07:00
Siying Dong	d4eec30ed0	Make "Table" pluggable Summary: This patch makes Table and TableBuilder a abstract class and make all the implementation of the current table into BlockedBasedTable and BlockedBasedTable Builder. Test Plan: Make db_test.cc to work with block based table. Add a new test simple_table_db_test.cc where a different simple table format is implemented. Reviewers: dhruba, haobo, kailiu, emayanke, vamsi Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13521	2013-10-28 17:54:09 -07:00
Kai Liu	994575c134	Support user-defined table stats collector Summary: 1. Added a new option that support user-defined table stats collection. 2. Added a deleted key stats collector in `utilities` Test Plan: Added a unit test for newly added code. Also ran make check to make sure other tests are not broken. Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13491	2013-10-28 15:45:14 -07:00
Kai Liu	1ca86f0391	Fix a bug that index block's restart_block_interval is not 1 Summary: This bug may affect the seek performance. Test Plan: make make check Also gdb into some index block builder to make sure the restart_block_interval is `1`.	2013-10-28 10:51:34 -07:00
Kai Liu	43ee5e2b3a	Fix the valgrind error in newly added unittests for table stats Summary: Previous the newly added test called NewBloomFilter without releasing it at the end of the test, which resulted in memory leak and was detected by valgrind. Test Plan: Ran valgrind test.	2013-10-20 22:02:05 -07:00
Siying Dong	9edda37027	Universal Compaction to Have a Size Percentage Threshold To Decide Whether to Compress Summary: This patch adds a option for universal compaction to allow us to only compress output files if the files compacted previously did not yet reach a specified ratio, to save CPU costs in some cases. Compression is always skipped for flushing. This is because the size information is not easy to evaluate for flushing case. We can improve it later. Test Plan: add test DBTest.UniversalCompactionCompressRatio1 and DBTest.UniversalCompactionCompressRatio12 Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13467	2013-10-17 13:33:39 -07:00
Kai Liu	aac44226a0	Add bloom filter to predefined table stats Summary: As title. Test Plan: Updated the unit tests to make sure new statistic is correctly written/read. Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13497	2013-10-17 11:43:06 -07:00
Dhruba Borthakur	9cd221094c	Add appropriate LICENSE and Copyright message. Summary: Add appropriate LICENSE and Copyright message. Test Plan: make check Reviewers: CC: Task ID: # Blame Rev:	2013-10-16 17:48:41 -07:00
Kai Liu	86ef6c3f74	Add statistics to sst file Summary: So far we only have key/value pairs as well as bloom filter stored in the sst file. It will be great if we are able to store more metadata about this table itself, for example, the entry size, bloom filter name, etc. This diff is the first step of this effort. It allows table to keep the basic statistics mentioned in http://fburl.com/14995441, as well as allowing writing user-collected stats to stats block. After this diff, we will figure out the interface of how to allow user to collect their interested statistics. Test Plan: 1. Added several unit tests. 2. Ran `make check` to ensure it doesn't break other tests. Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13419	2013-10-14 15:56:13 -07:00
Kai Liu	1f8ade6bd6	Fix a bug in table builder Summary: In talbe.cc, when reading the metablock, it uses BytewiseComparator(); However in table_builder.cc, we use r->options.comparator. After tracing the creation of r->options.comparator, I found this comparator is an InternalKeyComparator, which wraps the user defined comparator(details can be found in DBImpl::SanitizeOptions(). I encountered this problem when adding metadata about "bloom filter" before. With different comparator, we may fail to do the binary sort. Current code works well since there is only one entry in meta block. Test Plan: make all check I've also tested this change in https://reviews.facebook.net/D8283 before. Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13335	2013-10-07 18:53:19 -07:00
Dhruba Borthakur	4463b11cad	Migrate names of properties from 'leveldb' prefix to 'rocksdb' prefix. Summary: Migrate names of properties from 'leveldb' prefix to 'rocksdb' prefix. Test Plan: make check Reviewers: emayanke, haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13311	2013-10-06 00:14:26 -07:00
Dhruba Borthakur	a143ef9b38	Change namespace from leveldb to rocksdb Summary: Change namespace from leveldb to rocksdb. This allows a single application to link in open-source leveldb code as well as rocksdb code into the same process. Test Plan: compile rocksdb Reviewers: emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D13287	2013-10-04 11:59:26 -07:00
Haobo Xu	f2f4c8072f	[RocksDB] Added nano second stopwatch and new perf counters to track block read cost Summary: The pupose of this diff is to expose per user-call level precise timing of block read, so that we can answer questions like: a Get() costs me 100ms, is that somehow related to loading blocks from file system, or sth else? We will answer that with EXACTLY how many blocks have been read, how much time was spent on transfering the bytes from os, how much time was spent on checksum verification and how much time was spent on block decompression, just for that one Get. A nano second stopwatch was introduced to track time with higher precision. The cost/precision of the stopwatch is also measured in unit-test. On my dev box, retrieving one time instance costs about 30ns, on average. The deviation of timing results is good enough to track 100ns-1us level events. And the overhead could be safely ignored for 100us level events (10000 instances/s), for example, a viewstate thrift call. Test Plan: perf_context_test, also testing with viewstate shadow traffic. Reviewers: dhruba Reviewed By: dhruba CC: leveldb, xjin Differential Revision: https://reviews.facebook.net/D12351	2013-09-07 21:14:54 -07:00
Mayank Agarwal	352f0636ef	Fix memory leak in table.cc Summary: In InternalGet, BlockReader returns an Iterator which is legitimately freed at the end of the 'else' scope. BUT there is a break statement in between and must be freed there too! The best solution would be to move to unique_ptr and let it handle. Changed it to a unique_ptr. Test Plan: valgrind ./db_test;make all check Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12681	2013-09-02 22:13:29 -07:00
Dhruba Borthakur	fc0c399d2e	Introduced a new flag non_blocking_io in ReadOptions. Summary: If ReadOptions.non_blocking_io is set to true, then KeyMayExists and Iterators will return data that is cached in RAM. If the Iterator needs to do IO from storage to serve the data, then the Iterator.status() will return Status::IsRetry(). Test Plan: Enhanced unit test DBTest.KeyMayExist to detect if there were are IOs issues from storage. Added DBTest.NonBlockingIteration to verify nonblocking Iterations. Reviewers: emayanke, haobo Reviewed By: haobo CC: leveldb Maniphest Tasks: T63 Differential Revision: https://reviews.facebook.net/D12531	2013-08-28 10:49:14 -07:00
Mayank Agarwal	dad2731729	Fix bug in KeyMayExist Summary: In KeyMayExist.db_test we do a Flush which causes sst file to be written and added as open file in TableCache, but block cache for the file is not populated. So value_found should have been false where it was true and KeyMayExist.db_test should not have passed earlier. But it passed because BlockReader in table/table.cc takes 2 default arguments at the end called for_compaction and no_io. Although I passed no_io=true from InternalGet to BlockReader, but it understood for_compaction=true and defaulted no_io to false. This is a bug and although will be removed by Dhruba's new patch to incorporate no_io in readoptions, I'm submitting this patch to fix this bug independently of that patch. Test Plan: make all check Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12537	2013-08-26 08:45:58 -07:00
Tyler Harter	4504c99030	Internal/user key bug fix. Summary: Fix code so that the filter_block layer only assumes keys are internal when prefix_extractor is set. Test Plan: ./filter_block_test Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12501	2013-08-23 14:49:57 -07:00
Dhruba Borthakur	1186192ed1	Replace include/leveldb with include/rocksdb. Summary: Replace include/leveldb with include/rocksdb. Test Plan: make clean; make check make clean; make release Differential Revision: https://reviews.facebook.net/D12489	2013-08-23 10:51:00 -07:00
Jim Paton	74781a0c49	Add three new MemTableRep's Summary: This patch adds three new MemTableRep's: UnsortedRep, PrefixHashRep, and VectorRep. UnsortedRep stores keys in an std::unordered_map of std::sets. When an iterator is requested, it dumps the keys into an std::set and iterates over that. VectorRep stores keys in an std::vector. When an iterator is requested, it creates a copy of the vector and sorts it using std::sort. The iterator accesses that new vector. PrefixHashRep stores keys in an unordered_map mapping prefixes to ordered sets. I also added one API change. I added a function MemTableRep::MarkImmutable. This function is called when the rep is added to the immutable list. It doesn't do anything yet, but it seems like that could be useful. In particular, for the vectorrep, it means we could elide the extra copy and just sort in place. The only reason I haven't done that yet is because the use of the ArenaAllocator complicates things (I can elaborate on this if needed). Test Plan: make -j32 check ./db_stress --memtablerep=vector ./db_stress --memtablerep=unsorted ./db_stress --memtablerep=prefixhash --prefix_size=10 Reviewers: dhruba, haobo, emayanke Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12117	2013-08-22 23:10:02 -07:00
Tyler Harter	94cf218720	Revert "Prefix scan: db_bench and bug fixes" This reverts commit `c2bd8f4824`.	2013-08-22 18:01:11 -07:00
Tyler Harter	c2bd8f4824	Prefix scan: db_bench and bug fixes Summary: If use_prefix_filters is set and read_range>1, then the random seeks will set a the prefix filter to be the prefix of the key which was randomly selected as the target. Still need to add statistics (perhaps in a separate diff). Test Plan: ./db_bench --benchmarks=fillseq,prefixscanrandom --num=10000000 --statistics=1 --use_prefix_blooms=1 --use_prefix_api=1 --bloom_bits=10 Reviewers: dhruba Reviewed By: dhruba CC: leveldb, haobo Differential Revision: https://reviews.facebook.net/D12273	2013-08-22 16:06:50 -07:00
Haobo Xu	f9e2decf7c	[RocksDB] Minor iterator cleanup Summary: Was going through the iterator related code, did some cleanup along the way. Basically replaced array with vector and adopted range based loop where applicable. Test Plan: make check; make valgrind_check Reviewers: dhruba, emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D12435	2013-08-21 16:54:48 -07:00
Tyler Harter	7612d496ff	Add prefix scans to db_stress (and bug fix in prefix scan) Summary: Added support for prefix scans. Test Plan: ./db_stress --max_key=4096 --ops_per_thread=10000 Reviewers: dhruba, vamsi Reviewed By: vamsi CC: leveldb Differential Revision: https://reviews.facebook.net/D12267	2013-08-14 16:58:36 -07:00
Tyler Harter	f5f1842282	Prefix filters for scans (v4) Summary: Similar to v2 (db and table code understands prefixes), but use ReadOptions as in v3. Also, make the CreateFilter code faster and cleaner. Test Plan: make db_test; export LEVELDB_TESTS=PrefixScan; ./db_test Reviewers: dhruba Reviewed By: dhruba CC: haobo, emayanke Differential Revision: https://reviews.facebook.net/D12027	2013-08-13 14:04:56 -07:00
Deon Nicholas	c2d7826ced	[RocksDB] [MergeOperator] The new Merge Interface! Uses merge sequences. Summary: Here are the major changes to the Merge Interface. It has been expanded to handle cases where the MergeOperator is not associative. It does so by stacking up merge operations while scanning through the key history (i.e.: during Get() or Compaction), until a valid Put/Delete/end-of-history is encountered; it then applies all of the merge operations in the correct sequence starting with the base/sentinel value. I have also introduced an "AssociativeMerge" function which allows the user to take advantage of associative merge operations (such as in the case of counters). The implementation will always attempt to merge the operations/operands themselves together when they are encountered, and will resort to the "stacking" method if and only if the "associative-merge" fails. This implementation is conjectured to allow MergeOperator to handle the general case, while still providing the user with the ability to take advantage of certain efficiencies in their own merge-operator / data-structure. NOTE: This is a preliminary diff. This must still go through a lot of review, revision, and testing. Feedback welcome! Test Plan: -This is a preliminary diff. I have only just begun testing/debugging it. -I will be testing this with the existing MergeOperator use-cases and unit-tests (counters, string-append, and redis-lists) -I will be "desk-checking" and walking through the code with the help gdb. -I will find a way of stress-testing the new interface / implementation using db_bench, db_test, merge_test, and/or db_stress. -I will ensure that my tests cover all cases: Get-Memtable, Get-Immutable-Memtable, Get-from-Disk, Iterator-Range-Scan, Flush-Memtable-to-L0, Compaction-L0-L1, Compaction-Ln-L(n+1), Put/Delete found, Put/Delete not-found, end-of-history, end-of-file, etc. -A lot of feedback from the reviewers. Reviewers: haobo, dhruba, zshao, emayanke Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11499	2013-08-05 20:14:32 -07:00
Mayank Agarwal	59d0b02f8b	Expand KeyMayExist to return the proper value if it can be found in memory and also check block_cache Summary: Removed KeyMayExistImpl because KeyMayExist demanded Get like semantics now. Removed no_io from memtable and imm because we need the proper value now and shouldn't just stop when we see Merge in memtable. Added checks to block_cache. Updated documentation and unit-test Test Plan: make all check;db_stress for 1 hour Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11853	2013-08-01 09:07:46 -07:00
Jim Paton	52d7ecfc78	Virtualize SkipList Interface Summary: This diff virtualizes the skiplist interface so that users can provide their own implementation of a backing store for MemTables. Eventually, the backing store will be responsible for its own synchronization, allowing users (and us) to experiment with different lockless implementations. Test Plan: make clean make -j32 check ./db_stress Reviewers: dhruba, emayanke, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11739	2013-07-23 14:42:27 -07:00
Mayank Agarwal	bf66c10b13	Use KeyMayExist for WriteBatch-Deletes Summary: Introduced KeyMayExist checking during writebatch-delete and removed from Outer Delete API because it uses writebatch-delete. Added code to skip getting Table from disk if not already present in table_cache. Some renaming of variables. Introduced KeyMayExistImpl which allows checking since specified sequence number in GetImpl useful to check partially written writebatch. Changed KeyMayExist to not be pure virtual and provided a default implementation. Expanded unit-tests in db_test to check appropriately. Ran db_stress for 1 hour with ./db_stress --max_key=100000 --ops_per_thread=10000000 --delpercent=50 --filter_deletes=1 --statistics=1. Test Plan: db_stress;make check Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb, xjin Differential Revision: https://reviews.facebook.net/D11745	2013-07-23 13:36:50 -07:00
Mayank Agarwal	2a986919d6	Make rocksdb-deletes faster using bloom filter Summary: Wrote a new function in db_impl.c-CheckKeyMayExist that calls Get but with a new parameter turned on which makes Get return false only if bloom filters can guarantee that key is not in database. Delete calls this function and if the option- deletes_use_filter is turned on and CheckKeyMayExist returns false, the delete will be dropped saving: 1. Put of delete type 2. Space in the db,and 3. Compaction time Test Plan: make all check; will run db_stress and db_bench and enhance unit-test once the basic design gets approved Reviewers: dhruba, haobo, vamsi Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11607	2013-07-11 12:11:11 -07:00
Abhishek Kona	39ee47fbf4	[Rocksdb] Record WriteBlock Times into a histogram Summary: Add a histogram to track WriteBlock times Test Plan: db_bench and print Reviewers: haobo, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11319	2013-06-17 10:11:10 -07:00
Abhishek Kona	7a5f71d19a	[Rocksdb] measure table open io in a histogram Summary: Table is setup for compaction using Table::SetupForCompaction. So read block calls can be differentiated b/w Gets/Compaction. Use this and measure times. Test Plan: db_bench --statistics=1 Reviewers: dhruba, haobo Reviewed By: haobo CC: leveldb, MarkCallaghan Differential Revision: https://reviews.facebook.net/D11217	2013-06-13 17:25:09 -07:00
Haobo Xu	bdf1085944	[RocksDB] cleanup EnvOptions Summary: This diff simplifies EnvOptions by treating it as POD, similar to Options. - virtual functions are removed and member fields are accessed directly. - StorageOptions is removed. - Options.allow_readahead and Options.allow_readahead_compactions are deprecated. - Unused global variables are removed: useOsBuffer, useFsReadAhead, useMmapRead, useMmapWrite Test Plan: make check; db_stress Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11175	2013-06-12 11:17:19 -07:00
Haobo Xu	ab8d2f6ab2	[RocksDB] [Performance] Allow different posix advice to be applied to the same table file Summary: Current posix advice implementation ties up the access pattern hint with the creation of a file. It is not possible to apply different advice for different access (random get vs compaction read), without keeping two open files for the same table. This patch extended the RandomeAccessFile interface to accept new access hint at anytime. Particularly, we are able to set different access hint on the same table file based on when/how the file is used. Two options are added to set the access hint, after the file is first opened and after the file is being compacted. Test Plan: make check; db_stress; db_bench Reviewers: dhruba Reviewed By: dhruba CC: MarkCallaghan, leveldb Differential Revision: https://reviews.facebook.net/D10905	2013-05-30 19:08:44 -07:00
heyongqiang	4b29651206	add block deviation option to terminate a block before it exceeds block_size Summary: a new option block_size_deviation is added. Test Plan: run db_test and db_bench Reviewers: dhruba, haobo Reviewed By: haobo Differential Revision: https://reviews.facebook.net/D10821	2013-05-24 15:52:49 -07:00
Haobo Xu	05e8854085	[Rocksdb] Support Merge operation in rocksdb Summary: This diff introduces a new Merge operation into rocksdb. The purpose of this review is mostly getting feedback from the team (everyone please) on the design. Please focus on the four files under include/leveldb/, as they spell the client visible interface change. include/leveldb/db.h include/leveldb/merge_operator.h include/leveldb/options.h include/leveldb/write_batch.h Please go over local/my_test.cc carefully, as it is a concerete use case. Please also review the impelmentation files to see if the straw man implementation makes sense. Note that, the diff does pass all make check and truly supports forward iterator over db and a version of Get that's based on iterator. Future work: - Integration with compaction - A raw Get implementation I am working on a wiki that explains the design and implementation choices, but coding comes just naturally and I think it might be a good idea to share the code earlier. The code is heavily commented. Test Plan: run all local tests Reviewers: dhruba, heyongqiang Reviewed By: dhruba CC: leveldb, zshao, sheki, emayanke, MarkCallaghan Differential Revision: https://reviews.facebook.net/D9651	2013-05-03 16:59:02 -07:00
Haobo Xu	49fbd5531b	[RocksDB] Refactor table.cc to reduce code duplication and improve readability. Summary: In table.cc, the code section that reads in BlockContent and then put it into a Block, appears at least 4 times. This is too much duplication. BlockReader is much shorter after the change and reads way better. D10077 attempted that for index block read. This is a complete cleanup. Test Plan: make check; ./db_stress Reviewers: dhruba, sheki Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D10527	2013-04-29 09:43:36 -07:00
Haobo Xu	eb6d139666	[RocksDB] Move table.h to table/ Summary: - don't see a point exposing table.h to the public. - fixed make clean to remove also *.d files. Test Plan: make check; db_stress Reviewers: dhruba, heyongqiang Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D10479	2013-04-22 16:07:56 -07:00
Abhishek Kona	63f216ee0a	memory manage statistics Summary: Earlier Statistics object was a raw pointer. This meant the user had to clear up the Statistics object after creating the database. In most use cases the database is created in a function and the statistics pointer is out of scope. Hence the statistics object would never be deleted. Now Using a shared_ptr to manage this. Want this in before the next release. Test Plan: make all check. Reviewers: dhruba, emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D9735	2013-03-27 11:27:39 -07:00
Dhruba Borthakur	ad96563b79	Ability to configure bufferedio-reads, filesystem-readaheads and mmap-read-write per database. Summary: This patch allows an application to specify whether to use bufferedio, reads-via-mmaps and writes-via-mmaps per database. Earlier, there was a global static variable that was used to configure this functionality. The default setting remains the same (and is backward compatible): 1. use bufferedio 2. do not use mmaps for reads 3. use mmap for writes 4. use readaheads for reads needed for compaction I also added a parameter to db_bench to be able to explicitly specify whether to do readaheads for compactions or not. Test Plan: make check Reviewers: sheki, heyongqiang, MarkCallaghan Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D9429	2013-03-20 23:14:03 -07:00
Abhishek Kona	c41f1e995c	Codemod NULL to nullptr Summary: scripted NULL to nullptr in * include/leveldb/ * db/ * table/ * util/ Test Plan: make all check Reviewers: dhruba, emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D9003	2013-02-28 18:04:58 -08:00
Kosie van der Merwe	4dcc0c89f4	Fixed cache key for block cache Summary: Added function to `RandomAccessFile` to generate an unique ID for that file. Currently only `PosixRandomAccessFile` has this behaviour implemented and only on Linux. Changed how key is generated in `Table::BlockReader`. Added tests to check whether the unique ID is stable, unique and not a prefix of another unique ID. Added tests to see that `Table` uses the cache more efficiently. Test Plan: make check Reviewers: chip, vamsi, dhruba Reviewed By: chip CC: leveldb Differential Revision: https://reviews.facebook.net/D8145	2013-01-31 15:20:24 -08:00
Chip Turner	0b83a83191	Fix poor error on num_levels mismatch and few other minor improvements Summary: Previously, if you opened a db with num_levels set lower than the database, you received the unhelpful message "Corruption: VersionEdit: new-file entry." Now you get a more verbose message describing the issue. Also, fix handling of compression_levels (both the run-over-the-end issue and the memory management of it). Lastly, unique_ptr'ify a couple of minor calls. Test Plan: make check Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D8151	2013-01-25 15:37:26 -08:00
Chip Turner	2fdf91a4f8	Fix a number of object lifetime/ownership issues Summary: Replace manual memory management with std::unique_ptr in a number of places; not exhaustive, but this fixes a few leaks with file handles as well as clarifies semantics of the ownership of file handles with log classes. Test Plan: db_stress, make check Reviewers: dhruba Reviewed By: dhruba CC: zshao, leveldb, heyongqiang Differential Revision: https://reviews.facebook.net/D8043	2013-01-23 16:54:11 -08:00
Kosie van der Merwe	88b79b24f3	Fixed didIO not being set with no block_cache Summary: In `Table::BlockReader()` when there was no block cache `didIO` was not set. This didn't seem to matter as `didIO` is only used to trigger seek compactions. However, I would like it if someone else could check that is the case. Test Plan: `make check OPT="-g -O3"` Reviewers: dhruba, vamsi Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D8133	2013-01-23 12:49:10 -08:00
Chip Turner	a2dcd79c1e	Add optional clang compile mode Summary: clang is an alternate compiler based on llvm. It produces nicer error messages and finds some bugs that gcc doesn't, such as the size_t change in this file (which caused some write return values to be misinterpreted!) Clang isn't the default; to try it, do "USE_CLANG=1 make" or "export USE_CLANG=1" then make as normal Test Plan: "make check" and "USE_CLANG=1 make check" Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D7899	2013-01-15 18:48:37 -08:00
Kosie van der Merwe	4e9d9d989f	Fixed wrong assumption in Table::Open() Summary: `Table::Open()` assumes that `size` correctly describes the size of `file`, added a check that the footer is actually the right size and for good measure added assertions to `Footer::DecodeFrom()`. This was discovered by running `valgrind ./db_test` and seeing that `Footer::DecodeFrom()` was accessing uninitialized memory. Test Plan: make clean check ran `valgrind ./db_test` and saw DBTest.NoSpace no longer complains about a conditional jump being dependent on uninitialized memory. Reviewers: dhruba, vamsi, emayanke, sheki Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D7815	2013-01-09 10:44:30 -08:00
Abhishek Kona	2e1ad2c48b	Remove unnecessary asserts in table/merger.cc Summary: The asserts introduced in https://reviews.facebook.net/D7629 are wrong. The direction of iteration is changed after the function call so they assert's fail. Test Plan: make clean check Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D7827	2013-01-09 09:52:45 -08:00
Abhishek Kona	3f7af03a2d	Use a priority queue to merge files. Summary: Use a std::priority_queue in merger.cc instead of doing a o(n) search every time. Currently only the ForwardIteration uses a Priority Queue. Test Plan: make all check Reviewers: dhruba Reviewed By: dhruba CC: emayanke, zshao Differential Revision: https://reviews.facebook.net/D7629	2013-01-02 13:52:25 -08:00
Zheng Shao	7521a225d1	sst_dump: Error message should include the case that compression algorithms are not supported. Summary: It took me almost a day to debug this. :( Although I got to learn the file format as a by-product, this time could be saved if we have better error messages. Test Plan: gmake clean all; sst_dump --hex --file=000005.sst Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D7551	2012-12-20 15:29:51 -08:00
Dhruba Borthakur	551f01fb23	Unit test to test block format. Summary: This is a standalone unit test to test the format of a block. Test Plan: ./block_test Reviewers: sheki Reviewed By: sheki Differential Revision: https://reviews.facebook.net/D7533	2012-12-20 14:55:07 -08:00
Abhishek Kona	0f8e4721a5	Metrics: record compaction drop's and bloom filter effectiveness Summary: Record BloomFliter hits and drop off reasons during compaction. Test Plan: Unit tests work. Reviewers: dhruba, heyongqiang Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D6591	2012-11-09 11:38:45 -08:00
Abhishek Kona	391885c4e4	stat's collection in leveldb Summary: Prototype stat's collection. Diff is a good estimate of what the final code will look like. A few assumptions : * Used a global static instance of the statistics object. Plan to pass it to each internal function. Static allows metrics only at app level. * In the Ticker's do not do any locking. Depend on the mutex at each function of LevelDB. If we ever remove the mutex, we should change here too. The other option is use atomic objects anyways as there won't be any contention as they will be always acquired only by one thread. * The counters are dumb, increment through lifecycle. Plan to use ods etc to get last5min stat etc. Test Plan: made changes in db_bench Ran ./db_bench --statistics=1 --num=10000 --cache_size=5000 This will print the cache hit/miss stats. Reviewers: dhruba, heyongqiang Differential Revision: https://reviews.facebook.net/D6441	2012-11-08 13:55:49 -08:00
Dhruba Borthakur	aa42c66814	Fix all warnings generated by -Wall option to the compiler. Summary: The default compilation process now uses "-Wall" to compile. Fix all compilation error generated by gcc. Test Plan: make all check Reviewers: heyongqiang, emayanke, sheki Reviewed By: heyongqiang CC: MarkCallaghan Differential Revision: https://reviews.facebook.net/D6525	2012-11-06 14:07:31 -08:00
amayank	854c66b089	Make compression options configurable. These include window-bits, level and strategy for ZlibCompression Summary: Leveldb currently uses windowBits=-14 while using zlib compression.(It was earlier 15). This makes the setting configurable. Related changes here: https://reviews.facebook.net/D6105 Test Plan: make all check Reviewers: dhruba, MarkCallaghan, sheki, heyongqiang Differential Revision: https://reviews.facebook.net/D6393	2012-11-02 11:26:39 -07:00
Dhruba Borthakur	321dfdc3ae	Allow having different compression algorithms on different levels. Summary: The leveldb API is enhanced to support different compression algorithms at different levels. This adds the option min_level_to_compress to db_bench that specifies the minimum level for which compression should be done when compression is enabled. This can be used to disable compression for levels 0 and 1 which are likely to suffer from stalls because of the CPU load for memtable flushes and (L0,L1) compaction. Level 0 is special as it gets frequent memtable flushes. Level 1 is special as it frequently gets all:all file compactions between it and level 0. But all other levels could be the same. For any level N where N > 1, the rate of sequential IO for that level should be the same. The last level is the exception because it might not be full and because files from it are not read to compact with the next larger level. The same amount of time will be spent doing compaction at any level N excluding N=0, 1 or the last level. By this standard all of those levels should use the same compression. The difference is that the loss (using more disk space) from a faster compression algorithm is less significant for N=2 than for N=3. So we might be willing to trade disk space for faster write rates with no compression for L0 and L1, snappy for L2, zlib for L3. Using a faster compression algorithm for the mid levels also allows us to reclaim some cpu without trading off much loss in disk space overhead. Also note that little is to be gained by compressing levels 0 and 1. For a 4-level tree they account for 10% of the data. For a 5-level tree they account for 1% of the data. With compression enabled: * memtable flush rate is ~18MB/second * (L0,L1) compaction rate is ~30MB/second With compression enabled but min_level_to_compress=2 * memtable flush rate is ~320MB/second * (L0,L1) compaction rate is ~560MB/second This practicaly takes the same code from https://reviews.facebook.net/D6225 but makes the leveldb api more general purpose with a few additional lines of code. Test Plan: make check Differential Revision: https://reviews.facebook.net/D6261	2012-10-29 11:48:09 -07:00
Dhruba Borthakur	c1bb32e1ba	Trigger read compaction only if seeks to storage are incurred. Summary: In the current code, a Get() call can trigger compaction if it has to look at more than one file. This causes unnecessary compaction because looking at more than one file is a penalty only if the file is not yet in the cache. Also, th current code counts these files before the bloom filter check is applied. This patch counts a 'seek' only if the file fails the bloom filter check and has to read in data block(s) from the storage. This patch also counts a 'seek' if a file is not present in the file-cache, because opening a file means that its index blocks need to be read into cache. Test Plan: unit test attached. I will probably add one more unti tests. Reviewers: heyongqiang Reviewed By: heyongqiang CC: MarkCallaghan Differential Revision: https://reviews.facebook.net/D5709	2012-09-28 11:10:52 -07:00
Dhruba Borthakur	fe93631678	Clean up compiler warnings generated by -Wall option. Summary: Clean up compiler warnings generated by -Wall option. make clean all OPT=-Wall This is a pre-requisite before making a new release. Test Plan: compile and run unit tests Reviewers: heyongqiang Reviewed By: heyongqiang Differential Revision: https://reviews.facebook.net/D5019	2012-08-29 14:24:51 -07:00
heyongqiang	daa816c4a0	add bzip2 compression Summary: add bzip2 compression Test Plan: testcases in table_test Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D3909	2012-06-29 10:27:28 -07:00
heyongqiang	054a5657f8	add zlib compression Summary: add zlib compression Test Plan: Will add more testcases Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D3873	2012-06-28 16:28:57 -07:00
heyongqiang	4e4b6812ff	Make some variables configurable for each db instance Summary: Make configurable 'targetFileSize', 'targetFileSizeMultiplier', 'maxBytesForLevelBase', 'maxBytesForLevelMultiplier', 'expandedCompactionFactor', 'maxGrandParentOverlapFactor' Test Plan: N/A Reviewers: dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D3801	2012-06-27 14:36:31 -07:00
Sanjay Ghemawat	85584d497e	Added bloom filter support. In particular, we add a new FilterPolicy class. An instance of this class can be supplied in Options when opening a database. If supplied, the instance is used to generate summaries of keys (e.g., a bloom filter) which are placed in sstables. These summaries are consulted by DB::Get() so we can avoid reading sstable blocks that are guaranteed to not contain the key we are looking for. This change provides one implementation of FilterPolicy based on bloom filters. Other changes: - Updated version number to 1.4. - Some build tweaks. - C binding for CompactRange. - A few more benchmarks: deleteseq, deleterandom, readmissing, seekrandom. - Minor .gitignore update.	2012-04-17 08:36:46 -07:00
Sanjay Ghemawat	9013f13b15	use mmap on 64-bit machines to speed-up reads; small build fixes	2012-03-15 09:14:00 -07:00
Hans Wennborg	36a5f8ed7f	A number of fixes: - Replace raw slice comparison with a call to user comparator. Added test for custom comparators. - Fix end of namespace comments. - Fixed bug in picking inputs for a level-0 compaction. When finding overlapping files, the covered range may expand as files are added to the input set. We now correctly expand the range when this happens instead of continuing to use the old range. For example, suppose L0 contains files with the following ranges: F1: a .. d F2: c .. g F3: f .. j and the initial compaction target is F3. We used to search for range f..j which yielded {F2,F3}. However we now expand the range as soon as another file is added. In this case, when F2 is added, we expand the range to c..j and restart the search. That picks up file F1 as well. This change fixes a bug related to deleted keys showing up incorrectly after a compaction as described in Issue 44. (Sync with upstream @25072954)	2011-10-31 17:22:06 +00:00
gabor@google.com	60bd8015f2	Speed up Snappy uncompression, new Logger interface. - Removed one copy of an uncompressed block contents changing the signature of Snappy_Uncompress() so it uncompresses into a flat array instead of a std::string. Speeds up readrandom ~10%. - Instead of a combination of Env/WritableFile, we now have a Logger interface that can be easily overridden applications that want to supply their own logging. - Separated out the gcc and Sun Studio parts of atomic_pointer.h so we can use 'asm', 'volatile' keywords for Sun Studio. git-svn-id: https://leveldb.googlecode.com/svn/trunk@39 62dab493-f737-651d-591e-8d6aee1b9529	2011-07-21 02:40:18 +00:00
gabor@google.com	6872ace901	Sun Studio support, and fix for test related memory fixes. - LevelDB patch for Sun Studio Based on a patch submitted by Theo Schlossnagle - thanks! This fixes Issue 17. - Fix a couple of test related memory leaks. git-svn-id: https://leveldb.googlecode.com/svn/trunk@38 62dab493-f737-651d-591e-8d6aee1b9529	2011-07-19 23:36:47 +00:00
gabor@google.com	ccf0fcd5c2	A number of smaller fixes and performance improvements: - Implemented Get() directly instead of building on top of a full merging iterator stack. This speeds up the "readrandom" benchmark by up to 15-30%. - Fixed an opensource compilation problem. Added --db=<name> flag to control where the database is placed. - Automatically compact a file when we have done enough overlapping seeks to that file. - Fixed a performance bug where we would read from at least one file in a level even if none of the files overlapped the key being read. - Makefile fix for Mac OSX installations that have XCode 4 without XCode 3. - Unified the two occurrences of binary search in a file-list into one routine. - Found and fixed a bug where we would unnecessarily search the last file when looking for a key larger than all data in the level. - A fix to avoid the need for trivial move compactions and therefore gets rid of two out of five syncs in "fillseq". - Removed the MANIFEST file write when switching to a new memtable/log-file for a 10-20% improvement on fill speed on ext4. - Adding a SNAPPY setting in the Makefile for folks who have Snappy installed. Snappy compresses values and speeds up writes. git-svn-id: https://leveldb.googlecode.com/svn/trunk@32 62dab493-f737-651d-591e-8d6aee1b9529	2011-06-22 02:36:45 +00:00

... 3 4 5 6 7 ...

464 Commits