rocksdb

Commit Graph

Author	SHA1	Message	Date
Maysam Yabandeh	17e04039dd	Propagate fill_cache config to partitioned index iterator Summary: Currently the partitioned index iterator creates a new ReadOptions which ignores the fill_cache config set in the ReadOptions passed by the user. The patch propagates fill_cache from the user's ReadOptions to that of the partitioned index iterator. It also clarifies the contract of fill_cache: i) it does not apply to filters, ii) it still charges the block cache for the size of the data block, and it still pins the block if it is already in the block cache. Closes https://github.com/facebook/rocksdb/pull/3739 Differential Revision: D7678308 Pulled By: maysamyabandeh fbshipit-source-id: 53ed96424ae922e499e2d4e3580ddc3f0db893da	2018-04-20 15:13:05 -07:00
Zhongyi Xie	954b496b3f	fix memory leak in two_level_iterator Summary: this PR fixes a few failed contbuild: 1. ASAN memory leak in Block::NewIterator (table/block.cc:429). the proper destruction of first_level_iter_ and second_level_iter_ of two_level_iterator.cc is missing from the code after the refactoring in https://github.com/facebook/rocksdb/pull/3406 2. various unused param errors introduced by https://github.com/facebook/rocksdb/pull/3662 3. updated comment for `ForceReleaseCachedEntry` to emphasize the use of `force_erase` flag. Closes https://github.com/facebook/rocksdb/pull/3718 Reviewed By: maysamyabandeh Differential Revision: D7621192 Pulled By: miasantreble fbshipit-source-id: 476c94264083a0730ded957c29de7807e4f5b146	2018-04-15 17:26:26 -07:00
David Lai	3be9b36453	comment unused parameters to turn on -Wunused-parameter flag Summary: This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557. Closes https://github.com/facebook/rocksdb/pull/3662 Differential Revision: D7426121 Pulled By: Dayvedde fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352	2018-04-12 17:59:16 -07:00
Maysam Yabandeh	d2bcd7611f	Fix the memory leak with pinned partitioned filters Summary: The existing unit test did not set the level, so the check for pinned partitioned filter/index being properly released from the block cache was not properly exercised, as they only take effect in level 0. As a result a memory leak in pinned partitioned filters was hidden. The patch fixes the test as well as the bug. Closes https://github.com/facebook/rocksdb/pull/3692 Differential Revision: D7559763 Pulled By: maysamyabandeh fbshipit-source-id: 55eff274945838af983c764a7d71e8daff092e4a	2018-04-09 16:28:19 -07:00
Maysam Yabandeh	67182678a5	Stats for false positive rate of full filters Summary: Adds two stats to allow us to measure the false positive rate of full filters: - The total count of positives: rocksdb.bloom.filter.full.positive - The total count of true positives: rocksdb.bloom.filter.full.true.positive Note the term "full" in the stat name to indicate that they are meaningful in full filters. Block-based filters are to be deprecated soon and supporting them is not worth the additional cost of if-then-else branches. Closes #3680 Tested by: $ ./db_bench -benchmarks=fillrandom -db /dev/shm/rocksdb-tmpdb --num=1000000 -bloom_bits=10 $ ./db_bench -benchmarks="readwhilewriting" -db /dev/shm/rocksdb-tmpdb --statistics -bloom_bits=10 --duration=60 --num=2000000 --use_existing_db 2>&1 > /tmp/full.log $ grep filter.full /tmp/full.log rocksdb.bloom.filter.full.positive COUNT : 3628593 rocksdb.bloom.filter.full.true.positive COUNT : 3536026 which gives the false positive rate of 2.5% Closes https://github.com/facebook/rocksdb/pull/3681 Differential Revision: D7517570 Pulled By: maysamyabandeh fbshipit-source-id: 630ab1a473afdce404916d297035b6318de4c052	2018-04-05 15:58:48 -07:00
Sagar Vemuri	d687670256	Fix a leak in FilterBlockBuilder when adding prefix Summary: Our valgrind continuous test found an interesting leak which got introduced in #3614. We were adding the prefix key before saving the previous prefix start offset, due to which the previous prefix offset was always incorrect. Fixed it by saving the previous state before adding the key. Closes https://github.com/facebook/rocksdb/pull/3660 Differential Revision: D7418698 Pulled By: sagar0 fbshipit-source-id: 9933685f943cf2547ed5c553f490035a2fa785cf	2018-03-27 15:13:56 -07:00
Anand Ananthabhotla	f9f4d40f93	Align SST file data blocks to avoid spanning multiple pages Summary: Provide a block_align option in BlockBasedTableOptions to allow alignment of SST file data blocks. This will avoid higher IOPS/throughput load due to < 4KB data blocks spanning 2 4KB pages. When this option is set to true, the block alignment is set to lower of block size and 4KB. Closes https://github.com/facebook/rocksdb/pull/3502 Differential Revision: D7400897 Pulled By: anand1976 fbshipit-source-id: 04cc3bd144e88e3431a4f97604e63ad7a0f06d44	2018-03-26 20:26:10 -07:00
Dmitri Smirnov	d382ae7de6	Improve perf of random read and insert compare by suggesting inlining to the compiler Summary: Results from 2015 compiler. This improves sequential insert. Random Read results are inconclusive but I hope 2017 will do a better job at inlining. Before: fillseq : 3.638 micros/op 274866 ops/sec; 213.9 MB/s After: fillseq : 3.379 micros/op 295979 ops/sec; 230.3 MB/s Closes https://github.com/facebook/rocksdb/pull/3645 Differential Revision: D7382711 Pulled By: siying fbshipit-source-id: 092a07ffe8a6e598d1226ceff0f11b35e6c5c8e4	2018-03-23 13:26:55 -07:00
Huachao Huang	7a6353bd1c	Ignore empty filter block when data block is empty Summary: Close https://github.com/facebook/rocksdb/issues/3592 Closes https://github.com/facebook/rocksdb/pull/3614 Differential Revision: D7291706 Pulled By: ajkr fbshipit-source-id: 9dd8f40bd7716588e1e3fd6be0c2bc2766861f8c	2018-03-21 23:13:05 -07:00
Bruce Mitchener	a3a3f5497c	Fix some typos in comments and docs. Summary: Closes https://github.com/facebook/rocksdb/pull/3568 Differential Revision: D7170953 Pulled By: siying fbshipit-source-id: 9cfb8dd88b7266da920c0e0c1e10fb2c5af0641c	2018-03-08 10:27:25 -08:00
Siying Dong	b560fc9f62	Fix a block pinning regression introduced in `b555ed30a4` Summary: `b555ed30a4` introduces a regression, which causes blocks always to be pinned in block based iterators. Fix it. Closes https://github.com/facebook/rocksdb/pull/3582 Differential Revision: D7189534 Pulled By: siying fbshipit-source-id: 117dc7a03d0a0e360424db02efb366e12da2be03	2018-03-08 10:12:23 -08:00
Fosco Marotto	d518fe1da6	uint64_t and size_t changes to compile for iOS Summary: In attempting to build a static lib for use in iOS, I ran in to lots of type errors between uint64_t and size_t. This PR contains the changes I made to get `TARGET_OS=IOS make static_lib` to succeed while also getting Xcode to build successfully with the resulting `librocksdb.a` library imported. This also compiles for me on macOS and tests fine, but I'm really not sure if I made the correct decisions about where to `static_cast` and where to change types. Also up for discussion: is iOS worth supporting? Getting the static lib is just part one, we aren't providing any bridging headers or wrappers like the ObjectiveRocks project, it won't be a great experience. Closes https://github.com/facebook/rocksdb/pull/3503 Differential Revision: D7106457 Pulled By: gfosco fbshipit-source-id: 82ac2073de7e1f09b91f6b4faea91d18bd311f8e	2018-03-06 12:43:51 -08:00
Andrew Kryczka	5d68243e61	Comment out unused variables Summary: Submitting on behalf of another employee. Closes https://github.com/facebook/rocksdb/pull/3557 Differential Revision: D7146025 Pulled By: ajkr fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e	2018-03-05 13:13:41 -08:00
Igor Sugak	aba3409740	Back out "[codemod] - comment out unused parameters" Reviewed By: igorsugak fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37	2018-02-22 12:43:17 -08:00
David Lai	f4a030ce81	- comment out unused parameters Reviewed By: everiq, igorsugak Differential Revision: D7046710 fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461	2018-02-22 09:44:23 -08:00
jsteemann	4e7a182d09	Several small "fixes" Summary: - removed a few unneeded variables - fused some variable declarations and their assignments - fixed right-trimming code in string_util.cc to not underflow - simplified an assertion - moved non-nullptr check assertion before dereferencing of that pointer - pass an std::string function parameter by const reference instead of by value (avoiding potential copy) Closes https://github.com/facebook/rocksdb/pull/3507 Differential Revision: D7004679 Pulled By: sagar0 fbshipit-source-id: 52944952d9b56dfcac3bea3cd7878e315bb563c4	2018-02-15 16:57:37 -08:00
Siying Dong	b555ed30a4	Customized BlockBasedTableIterator and LevelIterator Summary: Use a customized BlockBasedTableIterator and LevelIterator to replace current implementations leveraging two-level-iterator. Hope the customized logic will make code easier to understand. As a side effect, BlockBasedTableIterator reduces the allocation for the data block iterator object, and avoids the virtual function call to it, because we can directly reference BlockIter, a final class. Similarly, LevelIterator reduces virtual function calls to the dummy iterator iterating the file metadata. It also enables further optimization. The upper bound check is also moved from index block to data block. This implementation fits this iterator better. After the change, forward iterator is slightly optimized to ensure we trim those iterators. The two-level-iterator now is only used by partitioned index, so it is simplified. Closes https://github.com/facebook/rocksdb/pull/3406 Differential Revision: D6809041 Pulled By: siying fbshipit-source-id: 7da3b9b1d3c8e9d9405302c15920af1fcaf50ffa	2018-02-12 17:12:25 -08:00
Andrew Kryczka	e78715c29a	Eliminate a memcpy for uncompressed blocks Summary: `ReadBlockFromFile` uses a stack buffer to hold small data blocks before passing them to the compression library, which outputs uncompressed data in a heap buffer. In the case of `kNoCompression` there is a `memcpy` to copy from stack buffer to heap buffer. This PR optimizes `ReadBlockFromFile` to skip the stack buffer for files whose blocks are known to be uncompressed. We determine this using the SST file property, "compression_name", if it's available. Closes https://github.com/facebook/rocksdb/pull/3472 Differential Revision: D6920848 Pulled By: ajkr fbshipit-source-id: 5c753e804efc178b9229ae5dbe6a4adc32031f07	2018-02-07 15:57:37 -08:00
Zhongyi Xie	2f29991701	split RandomizedHarnessTest more ways Summary: RandomizedHarnessTest enumerates different combinations of test type, compression type, restart interval, etc. For some combinations it takes very long to finish, causing the test to time out in test infrastructure. This PR splits the test input into smaller chunks in the hope that they will fit in the timeout window. Another possibility is to reduce `num_entries`, of course. Closes https://github.com/facebook/rocksdb/pull/3467 Differential Revision: D6910235 Pulled By: miasantreble fbshipit-source-id: 717246ee5d21a8a48ad82d4d9c04f9051a66f07f	2018-02-06 13:58:18 -08:00
Andrew Kryczka	1edac32b77	Update rocksdb.read.block.get.micros when block cache disabled Summary: Previously `ReadBlockFromFile` for data blocks was only measured when reading a block to populate block cache. This PR adds the corresponding measurements for users who disabled block cache. Closes https://github.com/facebook/rocksdb/pull/3442 Differential Revision: D6848671 Pulled By: ajkr fbshipit-source-id: bb4bbe1797fa2cc1d9a5bad44891af2b55384b41	2018-01-31 14:26:52 -08:00
Fosco Marotto	77dc069eb9	Change size_t cast in table_test Summary: Fixes this build error on master (macOS): ``` table/table_test.cc:972:27: error: implicit conversion loses integer precision: 'size_t' (aka 'unsigned long') to 'unsigned int' [-Werror,-Wshorten-64-to-32] ``` Closes https://github.com/facebook/rocksdb/pull/3434 Reviewed By: maysamyabandeh Differential Revision: D6840354 Pulled By: gfosco fbshipit-source-id: fffac6aefbbdd134ce1299453c5590aa855a5fc8	2018-01-30 11:12:51 -08:00
Zhongyi Xie	3fe0937180	Use block cache to track memory usage when ReadOptions.fill_cache=false Summary: ReadOptions.fill_cache is set in compaction inputs and can be set by users in their queries too. It tells RocksDB not to put a data block it reads into the block cache. The memory used by the data block is, however, not trackable by users. To make the system more manageable, we can charge the block to the block cache while using it, and then release it after use. Closes https://github.com/facebook/rocksdb/pull/3333 Differential Revision: D6670230 Pulled By: miasantreble fbshipit-source-id: ab848d3ed286bd081a13ee1903de357b56cbc308	2018-01-29 14:43:10 -08:00
Mark Isaacson	b8eb32f8cf	Suppress lint in old files Summary: Grandfather in super old lint issues to make a clean slate for moving forward that allows us to have stronger enforcement on new issues. Reviewed By: yiwu-arbug Differential Revision: D6821806 fbshipit-source-id: 22797d31ec58e9eb0255d3b66fedfcfcb0dc127c	2018-01-29 12:56:42 -08:00
Maysam Yabandeh	46acdc9883	Split HarnessTest_Randomized to avoid timeout Summary: Split HarnessTest_Randomized to two tests Closes https://github.com/facebook/rocksdb/pull/3424 Differential Revision: D6826006 Pulled By: maysamyabandeh fbshipit-source-id: 59c9a11c7da092206effce6e4fa3792f9c66bef2	2018-01-29 07:41:44 -08:00
Sagar Vemuri	d938226af4	Improve performance of long range scans with readahead Summary: This change improves the performance of iterators doing long range scans (e.g. big/full table scans in MyRocks) by using readahead and prefetching additional data on each disk IO. This prefetching is automatically enabled on noticing more than 2 IOs for the same table file during iteration. The readahead size starts with 8KB and is exponentially increased on each additional sequential IO, up to a max of 256 KB. This helps in cutting down the number of IOs needed to complete the range scan. Constraints: - The prefetched data is stored by the OS in page cache. So this currently works only for non direct-reads use-cases i.e applications which use page cache. (Direct-I/O support will be enabled in a later PR). - This gets currently enabled only when ReadOptions.readahead_size = 0 (which is the default value). Thanks to siying for the original idea and implementation. Benchmarks: Data fill: ``` TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes ``` Do a long range scan: Seekrandom with large number of nexts ``` TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram ``` Page cache was cleared before each experiment with the command: ``` sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" ``` ``` Before: seekrandom : 34020.945 micros/op 29 ops/sec; 32.5 MB/s (1636 of 1999 found) With this change: seekrandom : 8726.912 micros/op 114 ops/sec; 126.8 MB/s (5702 of 6999 found) ``` ~3.9X performance improvement. Also verified with strace and gdb that the readahead size is increasing as expected. ``` strace -e readahead -f -T -t -p <db_bench process pid> ``` Closes https://github.com/facebook/rocksdb/pull/3282 Differential Revision: D6586477 Pulled By: sagar0 fbshipit-source-id: 8a118a0ed4594fbb7f5b1cafb242d7a4033cb58c	2018-01-25 21:41:53 -08:00
Siying Dong	1039133f2d	BlockBasedTable::NewDataBlockIterator to always return BlockIter Summary: This is a preliminary cleanup before a major block based table iterator refactoring. BlockBasedTable::NewDataBlockIterator() will always return BlockIter. This simplifies the logic and code and enables further refactoring and optimization. Closes https://github.com/facebook/rocksdb/pull/3398 Differential Revision: D6780165 Pulled By: siying fbshipit-source-id: 273f7dc896724f682c0118fb69a359d9cc4418b4	2018-01-25 14:57:18 -08:00
Islam AbdelRahman	c615689bb5	Support skipping bloom filters for SstFileWriter Summary: Add an option for SstFileWriter to skip building bloom filters Closes https://github.com/facebook/rocksdb/pull/3360 Differential Revision: D6709120 Pulled By: IslamAbdelRahman fbshipit-source-id: 964d4bce38822a048691792f447bcfbb4b6bd809	2018-01-22 14:42:18 -08:00
Yi Wu	dc360df81e	Fix multiple build failures Summary: * Fix DBTest.CompactRangeWithEmptyBottomLevel lite build failure * Fix DBTest.AutomaticConflictsWithManualCompaction failure introduced by #3366 * Fix BlockBasedTableTest::IndexUncompressed should be disabled if snappy is disabled * Fix ASAN failure with DBBasicTest::DBClose test Closes https://github.com/facebook/rocksdb/pull/3373 Differential Revision: D6732313 Pulled By: yiwu-arbug fbshipit-source-id: 1eb9b9d9a8d795f56188fa9770db9353f6fdedc5	2018-01-16 17:30:39 -08:00
Dmitri Smirnov	b010116d82	Eliminate some redundant block reads. Summary: Re-use metadata for reading Compression Dictionary on BlockBased table open, this saves two reads from disk. This helps to our 999 percentile in 5.6.1 where prefetch buffer is not present. Closes https://github.com/facebook/rocksdb/pull/3354 Differential Revision: D6695753 Pulled By: ajkr fbshipit-source-id: bb8acd9e9e66e65b89c548ab8940570ae360333c	2018-01-10 17:11:58 -08:00
Anand Ananthabhotla	199405192d	Add a BlockBasedTableOption to turn off index block compression. Summary: Add a new bool option index_uncompressed in BlockBasedTableOptions. Closes https://github.com/facebook/rocksdb/pull/3303 Differential Revision: D6686161 Pulled By: anand1976 fbshipit-source-id: 748b46993d48a01e5f89b6bd3e41f06a59ec6054	2018-01-10 15:11:59 -08:00
Siying Dong	ccc095a016	Speed up BlockTest.BlockReadAmpBitmap Summary: BlockTest.BlockReadAmpBitmap is too slow and times out in some environments. Speed it up by: (1) improve the way the verification is done. With this it is 5 times faster (2) run fewer tests for large blocks. This cut it down by another 10 times. Now it can finish in similar time as other tests. Closes https://github.com/facebook/rocksdb/pull/3313 Differential Revision: D6643711 Pulled By: siying fbshipit-source-id: c2397d666eab5421a78ca87e1e45491e0f832a6d	2018-01-02 10:41:28 -08:00
Siying Dong	6b77c07379	NUMBER_BLOCK_COMPRESSED, etc, shouldn't be treated as timer counter Summary: NUMBER_BLOCK_DECOMPRESSED and NUMBER_BLOCK_COMPRESSED are not reported unless the stats level contain detailed timers, which is wrong. They are normal counters. Fix it. Closes https://github.com/facebook/rocksdb/pull/3263 Differential Revision: D6552519 Pulled By: siying fbshipit-source-id: 40899ccea7b2856bb39752616657c0bfd432f6f9	2017-12-14 10:27:43 -08:00
Zhongyi Xie	51c2ea0feb	Reduce heavy hitter for Get operation Summary: This PR addresses the following heavy hitters in `Get` operation by moving calls to `StatisticsImpl::recordTick` from `BlockBasedTable` to `Version::Get` - rocksdb.block.cache.bytes.write - rocksdb.block.cache.add - rocksdb.block.cache.data.miss - rocksdb.block.cache.data.bytes.insert - rocksdb.block.cache.data.add - rocksdb.block.cache.hit - rocksdb.block.cache.data.hit - rocksdb.block.cache.bytes.read The db_bench statistics before and after the change are: \|1GB block read\|Children \|Self \|Command \|Shared Object \|Symbol\| \|---\|---\|---\|---\|---\|---\| \|master: \|4.22% \|1.31% \|db_bench \|db_bench \|[.] rocksdb::StatisticsImpl::recordTick\| \|updated: \|0.51% \|0.21% \|db_bench \|db_bench \|[.] rocksdb::StatisticsImpl::recordTick\| \| \|0.14% \|0.14% \|db_bench \|db_bench \|[.] rocksdb::GetContext::record_counters\| \|1MB block read\|Children \|Self \|Command \|Shared Object \|Symbol\| \|---\|---\|---\|---\|---\|---\| \|master: \|3.48% \|1.08% \|db_bench \|db_bench \|[.] rocksdb::StatisticsImpl::recordTick\| \|updated: \|0.80% \|0.31% \|db_bench \|db_bench \|[.] rocksdb::StatisticsImpl::recordTick\| \| \|0.35% \|0.35% \|db_bench \|db_bench \|[.] rocksdb::GetContext::record_counters\| Closes https://github.com/facebook/rocksdb/pull/3172 Differential Revision: D6330532 Pulled By: miasantreble fbshipit-source-id: 2b492959e00a3db29e9437ecdcc5e48ca4ec5741	2017-12-12 21:11:33 -08:00
Yi Wu	7393ef779c	Fix BlockFetcher ASAN error Summary: Some call sites of BlockFetcher create a temporary ReadOptions and pass it to BlockFetcher. The temporary object will be gone after BlockFetcher construction but BlockFetcher keeps a reference to it, causing stack-use-after-scope. Fixing it. Closes https://github.com/facebook/rocksdb/pull/3258 Differential Revision: D6547152 Pulled By: yiwu-arbug fbshipit-source-id: 6b49e9dd46bb72307f5d8f88ea15faacff35b9bc	2017-12-12 12:12:38 -08:00
Siying Dong	a9c8d4ef15	Fix memory issue introduced by `2f1a3a4d74` Summary: Closes https://github.com/facebook/rocksdb/pull/3256 Differential Revision: D6541714 Pulled By: siying fbshipit-source-id: 40efd89b68587a9d58cfe6f4eebd771c2d9f1542	2017-12-11 18:27:28 -08:00
Siying Dong	2f1a3a4d74	Refactor ReadBlockContents() Summary: Divide ReadBlockContents() to multiple sub-functions. Maintaining the input and intermediate data in a new class BlockFetcher. I hope in general it makes the code easier to maintain. Another motivation to do it is to clearly divide the logic before file reading and after file reading. The refactor will help us evaluate how can we make I/O async in the future. Closes https://github.com/facebook/rocksdb/pull/3244 Differential Revision: D6520983 Pulled By: siying fbshipit-source-id: 338d90bc0338472d46be7a7682028dc9114b12e9	2017-12-11 15:27:32 -08:00
Prashant D	baff91c1ad	table: Fix coverity issues Summary: table/block.cc: 420 } CID 1396127 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR) 7. uninit_member: Non-static class member restart_offset_ is not initialized in this constructor nor in any functions that it calls. 421} table/block_based_table_builder.cc: CID 1418259 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR) 7. uninit_member: Non-static class member compressed_cache_key_prefix_size is not initialized in this constructor nor in any functions that it calls. table/block_based_table_reader.h: 3. uninit_member: Non-static class member index_type is not initialized in this constructor nor in any functions that it calls. CID 1396147 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR) 5. uninit_member: Non-static class member hash_index_allow_collision is not initialized in this constructor nor in any functions that it calls. 413 global_seqno(kDisableGlobalSequenceNumber) {} 414 table/cuckoo_table_reader.cc: 55 if (hash_funs == user_props.end()) { 56 status_ = Status::Corruption("Number of hash functions not found"); 5. uninit_member: Non-static class member is_last_level_ is not initialized in this constructor nor in any functions that it calls. 7. uninit_member: Non-static class member identity_as_first_hash_ is not initialized in this constructor nor in any functions that it calls. 9. uninit_member: Non-static class member use_module_hash_ is not initialized in this constructor nor in any functions that it calls. 11. uninit_member: Non-static class member num_hash_func_ is not initialized in this constructor nor in any functions that it calls. 13. uninit_member: Non-static class member key_length_ is not initialized in this constructor nor in any functions that it calls. 15. uninit_member: Non-static class member user_key_length_ is not initialized in this constructor nor in any functions that it calls. 17. uninit_member: Non-static class member value_length_ is not initialized in this constructor nor in any functions that it calls. 19. uninit_member: Non-static class member bucket_length_ is not initialized in this constructor nor in any functions that it calls. 21. uninit_member: Non-static class member cuckoo_block_size_ is not initialized in this constructor nor in any functions that it calls. 23. uninit_member: Non-static class member cuckoo_block_bytes_minus_one_ is not initialized in this constructor nor in any functions that it calls. CID 1322785 (#2 of 2): Uninitialized scalar field (UNINIT_CTOR) 25. uninit_member: Non-static class member table_size_ is not initialized in this constructor nor in any functions that it calls. 57 return; table/plain_table_index.h: 2. uninit_member: Non-static class member index_size_ is not initialized in this constructor nor in any functions that it calls. CID 1322801 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR) 4. uninit_member: Non-static class member sub_index_size_ is not initialized in this constructor nor in any functions that it calls. 128 huge_page_tlb_size_(huge_page_tlb_size) {} 129 Closes https://github.com/facebook/rocksdb/pull/3113 Differential Revision: D6505719 Pulled By: yiwu-arbug fbshipit-source-id: 38f44d8f9dfefb4c2e25d83b8df25a5201c75618	2017-12-07 11:57:36 -08:00
Andrew Kryczka	63f1c0a57d	fix gflags namespace Summary: I started adding gflags support for cmake on linux and got frustrated that I'd need to duplicate the build_detect_platform logic, which determines namespace based on attempting compilation. We can do it differently -- use the GFLAGS_NAMESPACE macro if available, and if not, that indicates it's an old gflags version without configurable namespace so we can simply hardcode "google". Closes https://github.com/facebook/rocksdb/pull/3212 Differential Revision: D6456973 Pulled By: ajkr fbshipit-source-id: 3e6d5bde3ca00d4496a120a7caf4687399f5d656	2017-12-01 10:42:05 -08:00
Andrew Kryczka	c85f8ccca3	convert null terminator in ascii dump Summary: The ASCII output is almost always useless to me as the first '\0' byte in the key or value causes it to stop printing. Since all characters are already surrounded by spaces, "\ 0" (how we display a backslash followed by a zero) and "\0" (how this PR displays a null terminator) are distinguishable. My assumption is the value of seeing all the bytes outweighs the value of the alignment we had before, where we always had one character followed by one space. Closes https://github.com/facebook/rocksdb/pull/3203 Differential Revision: D6428651 Pulled By: ajkr fbshipit-source-id: aafc978a51e9ea029cfe3e763e2bb0e1751b9ccf	2017-11-28 17:28:58 -08:00
Phani Shekhar Mantripragada	4b65cfc723	Support for block_cache num_shards and other config via option string. Summary: Problem: Option string accepts only cache_size as parameter for block_cache which is specified as "block_cache=1M". It doesn't accept other parameters like num_shards etc. Changes : 1) ParseBlockBasedTableOption in block_based_table_factory is edited to accept cache options in the format "block_cache=<cache_size>:<num_shard_bits>:<strict_capacity_limit>:<high_pri_pool_ratio>". Options other than cache_size are optional to maintain backward compatibility. The changes are valid for block_cache_compressed as well. For example, "block_cache=1M:6:true:0.5", "block_cache=1M:6:true", "block_cache=1M:6" and "block_cache=1M" are all valid option strings. 2) Corresponding unit tests are added. Closes https://github.com/facebook/rocksdb/pull/3108 Differential Revision: D6420997 Pulled By: sagar0 fbshipit-source-id: cdea8b785688d2802907974af27225ccc1c0cd43	2017-11-28 10:48:53 -08:00
Zhongyi Xie	a6c6b8b38c	Revert "No need for Restart Interval for meta blocks" Summary: See [issue 3169](https://github.com/facebook/rocksdb/issues/3169) for more information This reverts commit `593d3de371`. Closes https://github.com/facebook/rocksdb/pull/3188 Differential Revision: D6379271 Pulled By: miasantreble fbshipit-source-id: 88f9ed67ba52237ad9b6f7251db83672b62d7537	2017-11-20 16:42:03 -08:00
Maysam Yabandeh	30285ee31c	Fix calculating filter partition target size Summary: block_size_deviation is a percentage while the partition size is in bytes. The current code fails to take that into account, resulting in a very large target size for filter partitions. Closes https://github.com/facebook/rocksdb/pull/3187 Differential Revision: D6376069 Pulled By: maysamyabandeh fbshipit-source-id: 276546fc68f50e0da32c462abb46f6cf676db9b2	2017-11-20 13:26:57 -08:00
Yi Wu	84a04af9a9	TableProperty::oldest_key_time defaults to 0 Summary: We don't propagate TableProperty::oldest_key_time on compaction and just write the default value to SST files. It is more natural to default the value to 0. Also revert db_sst_test back to before #2842. Closes https://github.com/facebook/rocksdb/pull/3079 Differential Revision: D6165702 Pulled By: yiwu-arbug fbshipit-source-id: ca3ce5928d96ae79a5beb12bb7d8c640a71478a0	2017-10-27 15:00:05 -07:00
Yi Wu	66a2c44ef4	Add DB::Properties::kEstimateOldestKeyTime Summary: With FIFO compaction we would like to get the oldest data time for monitoring. The problem is we don't have timestamp for each key in the DB. As an approximation, we expose the earliest of sst file "creation_time" property. My plan is to override the property with a more accurate value with blob db, where we actually have timestamp. Closes https://github.com/facebook/rocksdb/pull/2842 Differential Revision: D5770600 Pulled By: yiwu-arbug fbshipit-source-id: 03833c8f10bbfbee62f8ea5c0d03c0cafb5d853a	2017-10-23 15:27:27 -07:00
Dmitri Smirnov	ebab2e2d42	Enable MSVC W4 with a few exceptions. Fix warnings and bugs Summary: Closes https://github.com/facebook/rocksdb/pull/3018 Differential Revision: D6079011 Pulled By: yiwu-arbug fbshipit-source-id: 988a721e7e7617967859dba71d660fc69f4dff57	2017-10-19 10:57:12 -07:00
Yi Wu	60c09f5fbb	print more table_options to info log Summary: print more table_options to info log Closes https://github.com/facebook/rocksdb/pull/3003 Differential Revision: D6054490 Pulled By: yiwu-arbug fbshipit-source-id: 8e6f96e08bdc906077b6c62ade419d7cb739110f	2017-10-13 14:42:26 -07:00
Yi Wu	31d3e41810	PinnableSlice move assignment Summary: Allow `std::move(pinnable_slice)`. Closes https://github.com/facebook/rocksdb/pull/2997 Differential Revision: D6036782 Pulled By: yiwu-arbug fbshipit-source-id: 583fb0419a97e437ff530f4305822341cd3381fa	2017-10-12 18:28:24 -07:00
Yi Wu	fb4ae4d810	fix DBImpl::NewInternalIterator super-version leak on failure Summary: Close #2955 Closes https://github.com/facebook/rocksdb/pull/2960 Differential Revision: D5962872 Pulled By: yiwu-arbug fbshipit-source-id: a6472d5c015bea3dc476c572ff5a5c90259e6059	2017-10-11 14:57:43 -07:00
Manuel Ung	88ed1f6ea6	Allow upgrades from nullptr to some merge operator Summary: Currently, RocksDB does not allow reopening a preexisting DB that had no merge operator defined with a merge operator defined. This means that if a DB ever wants to add a merge operator, there's no way to do so currently. Fix this by adding a new verification type `kByNameAllowFromNull` which will allow old values to be nullptr, and new values to be non-nullptr. Closes https://github.com/facebook/rocksdb/pull/2958 Differential Revision: D5961131 Pulled By: lth fbshipit-source-id: 06179bebd0d90db3d43690b5eb7345e2d5bab1eb	2017-10-04 09:57:23 -07:00
Yi Wu	d1cab2b64e	Add ValueType::kTypeBlobIndex Summary: Add kTypeBlobIndex value type, which will be used by blob db only, to insert a (key, blob_offset) KV pair. The purpose is to 1. Make it possible to open existing rocksdb instance as blob db. Existing value will be of kTypeIndex type, while value inserted by blob db will be of kTypeBlobIndex. 2. Make rocksdb able to detect if the db contains value written by blob db, if so return error. 3. Make it possible to have blob db optionally store value in SST file (with kTypeValue type) or as a blob value (with kTypeBlobIndex type). The root db (DBImpl) basically treats kTypeBlobIndex as a normal value on write. On Get, if is_blob is provided, return whether the value read is of kTypeBlobIndex type, or return Status::NotSupported() status if is_blob is not provided. On scan, the allow_blob flag is passed, and if the flag is true, return whether the value is of kTypeBlobIndex type via iter->IsBlob(). Changes on blob db side will be in a separate patch. Closes https://github.com/facebook/rocksdb/pull/2886 Differential Revision: D5838431 Pulled By: yiwu-arbug fbshipit-source-id: 3c5306c62bc13bb11abc03422ec5cbcea1203cca	2017-10-03 09:11:23 -07:00
Zhongyi Xie	593d3de371	No need for Restart Interval for meta blocks Summary: In SST files, restart interval helps us search in data blocks. However, some meta blocks will be read sequentially, so there's no need for restart points. Restart interval will introduce extra space in the block (https://github.com/facebook/rocksdb/blob/master/table/block_builder.cc#L80). We will see if we can remove this redundant space. (Maybe set restart interval to infinite.) Closes https://github.com/facebook/rocksdb/pull/2940 Differential Revision: D5930139 Pulled By: miasantreble fbshipit-source-id: 92b1b23c15cffa90378343ac846b713623b19c21	2017-09-29 20:26:20 -07:00
Maysam Yabandeh	ab0542f5ec	Fix for when block.cache_handle is nullptr Summary: When using with compressed cache it is possible that the status is ok but the block is not actually added to the block cache. The patch takes this case into account. Closes https://github.com/facebook/rocksdb/pull/2945 Differential Revision: D5937613 Pulled By: maysamyabandeh fbshipit-source-id: 5428cf1115e5046b3d01ab78d26cb181122af4c6	2017-09-29 07:56:55 -07:00
Sagar Vemuri	93c2b91740	Introduce conditional merge-operator invocation in point lookups Summary: For every merge operand encountered for a key in the read path we now have the ability to decide whether to look further (to retrieve more merge operands for the key) or stop and invoke the merge operator to return the value. The user needs to override `ShouldMerge()` method with a condition to terminate search when true to avail this facility. This has a couple of advantages: 1. It helps in limiting the number of merge operands that are looked at to compute a value as part of a user Get operation. 2. It allows to peek at a merge key-value to see if further merge operands need to look at. Example: Limiting the number of merge operands that are looked at: Lets say you have 10 merge operands for a key spread over various levels. If you only want RocksDB to look at the latest two merge operands instead of all 10 to compute the value, it is now possible with this PR. You can set the condition in `ShouldMerge()` to return true when the size of the operand list is 2. Look at the example implementation in the unit test. Without this PR, a Get might look at all the 10 merge operands in different levels before invoking the merge-operator. Added a new unit test. Made sure that there is no perf regression by running benchmarks. Command line to Load data: ``` TEST_TMPDIR=/dev/shm ./db_bench --benchmarks="mergerandom" --merge_operator="uint64add" --num=10000000 ... 
mergerandom : 12.861 micros/op 77757 ops/sec; 8.6 MB/s ( updates:10000000) ``` ReadRandomMergeRandom benchmark results: Command line: ``` TEST_TMPDIR=/dev/shm ./db_bench --benchmarks="readrandommergerandom" --merge_operator="uint64add" --num=10000000 ``` Base -- Without this code change (on commit `fc7476b`): ``` readrandommergerandom : 38.586 micros/op 25916 ops/sec; (reads:3001599 merges:6998401 total:10000000 hits:842235 maxlength:8) ``` With this code change: ``` readrandommergerandom : 38.653 micros/op 25870 ops/sec; (reads:3001599 merges:6998401 total:10000000 hits:842235 maxlength:8) ``` Closes https://github.com/facebook/rocksdb/pull/2923 Differential Revision: D5898239 Pulled By: sagar0 fbshipit-source-id: daefa325019f77968639a75c851d46352c2303ef	2017-09-28 15:58:49 -07:00
Andrew Kryczka	c2f6e45aa3	prevent nullptr dereference in table reader error case Summary: A user encountered segfault on the call to `CacheDependencies()`, probably because `NewIndexIterator()` failed before populating `*index_entry`. Let's avoid the call in that case. Closes https://github.com/facebook/rocksdb/pull/2939 Differential Revision: D5928611 Pulled By: ajkr fbshipit-source-id: 484be453dbb00e5e160e9c6a1bc933df7d80f574	2017-09-28 00:12:34 -07:00
Siying Dong	edcbb36944	Three code-level optimization to Iterator::Next() Summary: Three small optimizations: (1) iter_->IsKeyPinned() shouldn't be called if read_options.pin_data is not true. This may trigger function call all the way down the iterator tree. (2) reuse the iterator key object in DBIter::FindNextUserEntryInternal(). The constructor of the class has some overheads. (3) Move the switching direction logic in MergingIterator::Next() to a separate function. These three in total improves readseq performance by about 3% in my benchmark setting. Closes https://github.com/facebook/rocksdb/pull/2880 Differential Revision: D5829252 Pulled By: siying fbshipit-source-id: 991aea10c6d6c3b43769cb4db168db62954ad1e3	2017-09-14 17:57:31 -07:00
Siying Dong	64b6452e0c	Make InternalKeyComparator final and directly use it in merging iterator Summary: Merging iterator invokes InternalKeyComparator.Compare() frequently to heap merge. By making InternalKeyComparator final and merging iterator to directly use InternalKeyComparator rather than through Iterator interface, we can give compiler a choice to avoid one more virtual function call if possible. I ran readseq benchmark in memory-only use case to make sure the performance at least doesn't regress. I have to disable the final key word in debug build, as a hack test class depends on overriding the class. Closes https://github.com/facebook/rocksdb/pull/2860 Differential Revision: D5800461 Pulled By: siying fbshipit-source-id: ab876f22a09bb5c560740911412336e0e25ccb53	2017-09-11 12:04:21 -07:00
Maysam Yabandeh	f46464d383	write-prepared txn: call IsInSnapshot Summary: This patch instruments the read path to verify each read value against an optional ReadCallback class. If the value is rejected, the reader moves on to the next value. The WritePreparedTxn makes use of this feature to skip sequence numbers that are not in the read snapshot. Closes https://github.com/facebook/rocksdb/pull/2850 Differential Revision: D5787375 Pulled By: maysamyabandeh fbshipit-source-id: 49d808b3062ab35e7ae98ad388f659757794184c	2017-09-11 09:14:48 -07:00
Andrew Kryczka	7fbb9eccaf	support disabling checksum in block-based table Summary: store a zero as the checksum when disabled since it's easier to keep block trailer a fixed length. Closes https://github.com/facebook/rocksdb/pull/2781 Differential Revision: D5694702 Pulled By: ajkr fbshipit-source-id: 69cea9da415778ba2b600dfd9d0dfc8cb5188ecd	2017-08-23 19:40:47 -07:00
Andrew Kryczka	19cc66dc4f	fix clang bug in block-based table reader Summary: This is the warning that clang considers a bug and has been causing it to fail: ``` table/block_based_table_reader.cc:240:27: warning: Potential leak of memory pointed to by 'block.value' for (; biter.Valid(); biter.Next()) { ^~~~~ ``` Actually clang just doesn't have enough knowledge to statically determine it's safe. We can teach it using an assert. Closes https://github.com/facebook/rocksdb/pull/2779 Differential Revision: D5691225 Pulled By: ajkr fbshipit-source-id: 3f0d545bf44636953b30ee5243c63239e8f16d8e	2017-08-23 15:12:05 -07:00
Andrew Kryczka	234f33a3f9	allow nullptr Slice only as sentinel Summary: Allow `Slice` holding nullptr as a sentinel value but not in comparisons. This new restriction eliminates the need for the manual checks in `39ef900551`, while still conforming to glibc's `memcmp` API. Thanks siying for the idea. Users may need to migrate, so mentioned it in HISTORY.md. Closes https://github.com/facebook/rocksdb/pull/2777 Differential Revision: D5686016 Pulled By: ajkr fbshipit-source-id: 03a2ca3fd9a0ebade9d0d5686c81d59a9534f563	2017-08-23 10:56:06 -07:00
Maysam Yabandeh	1dfcdb15f9	Extend pin_l0 to filter partitions Summary: This is the continuation of https://github.com/facebook/rocksdb/pull/2661 for filter partitions. When pin_l0 is set (along with cache_xxx), then on table open the filter partitions are loaded into the cache and pinned there. Closes https://github.com/facebook/rocksdb/pull/2766 Differential Revision: D5671098 Pulled By: maysamyabandeh fbshipit-source-id: 174f24018f1d7f1129621e7380287b65b67d2115	2017-08-23 07:56:08 -07:00
Maysam Yabandeh	1efc600ddf	Preload l0 index partitions Summary: This fixes the existing logic for pinning l0 index partitions. The patch preloads the partitions into block cache and pins them if they belong to level 0 and pin_l0 is set. The drawback is that it does many small IOs when preloading all the partitions into the cache if direct io is enabled. Working on a solution for that. Closes https://github.com/facebook/rocksdb/pull/2661 Differential Revision: D5554010 Pulled By: maysamyabandeh fbshipit-source-id: 1e6f32a3524d71355c77d4138516dcfb601ca7b2	2017-08-18 10:56:20 -07:00
Siying Dong	666a005f9b	Support prefetch last 512KB with direct I/O in block based file reader Summary: Right now, if direct I/O is enabled, prefetching the last 512KB cannot be applied, except for compaction inputs or when readahead is enabled for iterators. This can create a lot of I/O for HDD cases. To solve the problem, the 512KB is prefetched in block based table if direct I/O is enabled. The prefetched buffer is passed in together with the random access file reader, so that we try to read from the buffer before reading from the file. This can be extended in the future to support flexible user iterator readahead too. Closes https://github.com/facebook/rocksdb/pull/2708 Differential Revision: D5593091 Pulled By: siying fbshipit-source-id: ee36ff6d8af11c312a2622272b21957a7b5c81e7	2017-08-11 12:16:45 -07:00
Aaron G	7848f0b24c	add VerifyChecksum() to db.h Summary: We need a tool to check any sst file corruption in the db. It will check all the sst files in current version and read all the blocks (data, meta, index) with checksum verification. If any verification fails, the function will return non-OK status. Closes https://github.com/facebook/rocksdb/pull/2498 Differential Revision: D5324269 Pulled By: lightmark fbshipit-source-id: 6f8a272008b722402a772acfc804524c9d1a483b	2017-08-09 15:58:13 -07:00
Siying Dong	21696ba502	Replace dynamic_cast<> Summary: Replace dynamic_cast<> so that users can choose to build with RTTI off, so that they can save several bytes per object and get slightly more memory available. Some nontrivial changes: 1. Add Comparator::GetRootComparator() to get around the internal comparator hack 2. Add the two experimental functions to DB 3. Add TableFactory::GetOptionString() to avoid unnecessary casting to get the option string 4. Since 3 is done, move the parsing option functions for table factory to table factory files too, to be symmetric. Closes https://github.com/facebook/rocksdb/pull/2645 Differential Revision: D5502723 Pulled By: siying fbshipit-source-id: fd13cec5601cf68a554d87bfcf056f2ffa5fbf7c	2017-07-28 16:27:16 -07:00
Andrew Kryczka	710411aea6	fix asan/valgrind for TableCache cleanup Summary: Breaking commit: `d12691b86f` In the above commit, I moved the `TableCache` cleanup logic from `Version` destructor into `PurgeObsoleteFiles`. I missed cleaning up `TableCache` entries for the current `Version` during DB destruction. This PR adds that logic to `VersionSet` destructor. One unfortunate side effect is now we're potentially deleting `TableReader`s after `column_family_set_.reset()`, which means we can't call `BlockBasedTableReader::Close` a second time as the block cache might already be destroyed. Closes https://github.com/facebook/rocksdb/pull/2662 Differential Revision: D5515108 Pulled By: ajkr fbshipit-source-id: 2cb820e19aa813e0d258d17f76b2d7b6b7ee0b18	2017-07-27 20:28:04 -07:00
Aaron Gao	8f553d3c52	remove unnecessary internal_comparator param in newIterator Summary: solved https://github.com/facebook/rocksdb/issues/2604 Closes https://github.com/facebook/rocksdb/pull/2648 Differential Revision: D5504875 Pulled By: lightmark fbshipit-source-id: c14bb62ccbdc9e7bda9cd914cae4ea0765d882ee	2017-07-27 14:30:42 -07:00
Sagar Vemuri	72502cf227	Revert "comment out unused parameters" Summary: This reverts the previous commit `1d7048c598`, which broke the build. Did a `git revert 1d7048c`. Closes https://github.com/facebook/rocksdb/pull/2627 Differential Revision: D5476473 Pulled By: sagar0 fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06	2017-07-21 18:26:26 -07:00
Victor Gao	1d7048c598	comment out unused parameters Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually. Reviewed By: igorsugak Differential Revision: D5454343 fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2	2017-07-21 14:57:44 -07:00
Siying Dong	ae28634e9f	Remove some left-over BSD headers Summary: Closes https://github.com/facebook/rocksdb/pull/2608 Differential Revision: D5444797 Pulled By: siying fbshipit-source-id: 690581d03f37822e059a16085088e8e2d8a45016	2017-07-18 11:56:57 -07:00
Sushma Devendrappa	0655b58582	enable PinnableSlice for RowCache Summary: This patch enables using PinnableSlice for RowCache. Changes include not releasing the cache handle immediately after lookup in TableCache::Get; instead, pass a Cleanable function which does Cache::ReleaseHandle. Closes https://github.com/facebook/rocksdb/pull/2492 Differential Revision: D5316216 Pulled By: maysamyabandeh fbshipit-source-id: d2a684bd7e4ba73772f762e58a82b5f4fbd5d362	2017-07-17 15:08:30 -07:00
Daniel Black	cbaab30449	table/block.h: change memset Summary: In gcc-7 the following is an error identified by -Werror=class-memaccess In file included from ./table/get_context.h:14:0, from db/version_set.cc:43: ./table/block.h: In constructor ‘rocksdb::BlockReadAmpBitmap::BlockReadAmpBitmap(size_t, size_t, rocksdb::Statistics*)’: ./table/block.h:73:53: error: ‘void* memset(void*, int, size_t)’ clearing an object of type ‘struct std::atomic<unsigned int>’ with no trivial copy-assignment; use value-initialization instead [-Werror=class-memaccess] memset(bitmap_, 0, bitmap_size * kBytesPersEntry); ^ In file included from ./db/version_set.h:23:0, from db/version_set.cc:12: /toolchain/include/c++/8.0.0/atomic:684:12: note: ‘struct std::atomic<unsigned int>’ declared here struct atomic<unsigned int> : __atomic_base<unsigned int> ^~~~~~~~~~~~~~~~~~~~ As a solution the default initializer can be applied in list context. Signed-off-by: Daniel Black <daniel.black@au.ibm.com> Closes https://github.com/facebook/rocksdb/pull/2561 Differential Revision: D5398714 Pulled By: siying fbshipit-source-id: d883fb88ec7535eee60d551038fe91f14488be36	2017-07-17 10:41:56 -07:00
Yedidya Feldblum	f1a056e005	CodeMod: Prefer ADD_FAILURE() over EXPECT_TRUE(false), et cetera Summary: CodeMod: Prefer `ADD_FAILURE()` over `EXPECT_TRUE(false)`, et cetera. The tautologically-conditioned and tautologically-contradicted boolean expectations/assertions have better alternatives: unconditional passes and failures. Reviewed By: Orvid Differential Revision: D5432398 Tags: codemod, codemod-opensource fbshipit-source-id: d16b447e8696a6feaa94b41199f5052226ef6914	2017-07-16 21:26:02 -07:00
Siying Dong	3c327ac2d0	Change RocksDB License Summary: Closes https://github.com/facebook/rocksdb/pull/2589 Differential Revision: D5431502 Pulled By: siying fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75	2017-07-15 16:11:23 -07:00
奏之章	70440f7a63	Add virtual func IsDeleteRangeSupported Summary: this modify allows third-party tables able to support delete range Closes https://github.com/facebook/rocksdb/pull/2035 Differential Revision: D5407973 Pulled By: ajkr fbshipit-source-id: 82e364b7dd5a198660788d59543f15b8f95cc418	2017-07-12 16:58:45 -07:00
Daniel Black	7a0b5de771	Gcc 7 ignored quantifiers Summary: The casting seemed to cause a problem. I think this might increase it to unsigned long. Closes https://github.com/facebook/rocksdb/pull/2562 Differential Revision: D5406842 Pulled By: siying fbshipit-source-id: 736adef31448229a58a1a48bdbe77792f36736e8	2017-07-12 09:41:15 -07:00
Maysam Yabandeh	98669b5356	init filters_in_partition_ Summary: Valgrind reports that it is not initialized. Closes https://github.com/facebook/rocksdb/pull/2541 Differential Revision: D5376084 Pulled By: maysamyabandeh fbshipit-source-id: 55c312f4f506863aa0d25ff92c8c34b57f48b860	2017-07-06 08:48:53 -07:00
Maysam Yabandeh	0013bf14ef	fix asan and valgrind leak report in test Summary: Closes https://github.com/facebook/rocksdb/pull/2537 Differential Revision: D5371433 Pulled By: maysamyabandeh fbshipit-source-id: 90d3e8bb1a8576f48b1ddf1bdbba5512b5986ba0	2017-07-05 19:11:39 -07:00
Maysam Yabandeh	f6b9d9355e	Fix clang error in PartitionedFilterBlockBuilder Summary: Closes https://github.com/facebook/rocksdb/pull/2536 Differential Revision: D5371271 Pulled By: maysamyabandeh fbshipit-source-id: f1355ac658a79c9982a24986f0925c9e24fc39d5	2017-07-05 11:16:04 -07:00
Maysam Yabandeh	45b9bb0331	Cut filter partition based on metadata_block_size Summary: Currently metadata_block_size controls only index partition size. With this patch a partition is cut after any of index or filter partitions reaches metadata_block_size. Closes https://github.com/facebook/rocksdb/pull/2452 Differential Revision: D5275651 Pulled By: maysamyabandeh fbshipit-source-id: 5057e4424b4c8902043782e6bf8c38f0c4f25160	2017-07-02 10:42:12 -07:00
Mike Kolupaev	397ab11152	Improve Status message for block checksum mismatches Summary: We've got some DBs where iterators return Status with message "Corruption: block checksum mismatch" all the time. That's not very informative. It would be much easier to investigate if the error message contained the file name - then we would know e.g. how old the corrupted file is, which would be very useful for finding the root cause. This PR adds file name, offset and other stuff to some block corruption-related status messages. It doesn't improve all the error messages, just a few that were easy to improve. I'm mostly interested in "block checksum mismatch" and "Bad table magic number" since they're the only corruption errors that I've ever seen in the wild. Closes https://github.com/facebook/rocksdb/pull/2507 Differential Revision: D5345702 Pulled By: al13n321 fbshipit-source-id: fc8023d43f1935ad927cef1b9c55481ab3cb1339	2017-06-28 21:27:01 -07:00
Sagar Vemuri	1cd45cd1b3	FIFO Compaction with TTL Summary: Introducing FIFO compactions with TTL. FIFO compaction is based on size only which makes it tricky to enable in production as use cases can have organic growth. A user requested an option to drop files based on the time of their creation instead of the total size. To address that request: - Added a new TTL option to FIFO compaction options. - Updated FIFO compaction score to take TTL into consideration. - Added a new table property, creation_time, to keep track of when the SST file is created. - Creation_time is set as below: - On Flush: Set to the time of flush. - On Compaction: Set to the max creation_time of all the files involved in the compaction. - On Repair and Recovery: Set to the time of repair/recovery. - Old files created prior to this code change will have a creation_time of 0. - FIFO compaction with TTL is enabled when ttl > 0. All files older than ttl will be deleted during compaction. i.e. `if (file.creation_time < (current_time - ttl)) then delete(file)`. This will enable cases where you might want to delete all files older than, say, 1 day. - FIFO compaction will fall back to the prior way of deleting files based on size if: - the creation_time of all files involved in compaction is 0. - the total size (of all SST files combined) does not drop below `compaction_options_fifo.max_table_files_size` even if the files older than ttl are deleted. This feature is not supported if max_open_files != -1 or with table formats other than Block-based. Test Plan: Added tests. 
Benchmark results: Base: FIFO with max size: 100MB :: ``` svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100 readwhilewriting : 1.924 micros/op 519858 ops/sec; 13.6 MB/s (1176277 of 5000000 found) ``` With TTL (a low one for testing) :: ``` svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100 --fifo_compaction_ttl=20 readwhilewriting : 1.902 micros/op 525817 ops/sec; 13.7 MB/s (1185057 of 5000000 found) ``` Example Log lines: ``` 2017/06/26-15:17:24.609249 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609177) [db/compaction_picker.cc:1471] [default] FIFO compaction: picking file 40 with creation time 1498515423 for deletion 2017/06/26-15:17:24.609255 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609234) [db/db_impl_compaction_flush.cc:1541] [default] Deleted 1 files ... 2017/06/26-15:17:25.553185 7fd5a61a5800 [DEBUG] [db/db_impl_files.cc:309] [JOB 0] Delete /dev/shm/dbbench/000040.sst type=2 #40 -- OK 2017/06/26-15:17:25.553205 7fd5a61a5800 EVENT_LOG_v1 {"time_micros": 1498515445553199, "job": 0, "event": "table_file_deletion", "file_number": 40} ``` SST Files remaining in the dbbench dir, after db_bench execution completed: ``` svemuri@dev15905 ~/rocksdb (fifo-compaction) $ ls -l /dev/shm//dbbench/*.sst -rw-r--r--. 1 svemuri users 30749887 Jun 26 15:17 /dev/shm//dbbench/000042.sst -rw-r--r--. 1 svemuri users 30768779 Jun 26 15:17 /dev/shm//dbbench/000044.sst -rw-r--r--. 1 svemuri users 30757481 Jun 26 15:17 /dev/shm//dbbench/000046.sst ``` Closes https://github.com/facebook/rocksdb/pull/2480 Differential Revision: D5305116 Pulled By: sagar0 fbshipit-source-id: 3e5cfcf5dd07ed2211b5b37492eb235b45139174	2017-06-27 17:11:48 -07:00
Siying Dong	5c97a7c066	Unit Tests for sync, range sync and file close failures Summary: Closes https://github.com/facebook/rocksdb/pull/2454 Differential Revision: D5255320 Pulled By: siying fbshipit-source-id: 0080830fa8eb5da6de25e17ba68aee91018c7913	2017-06-26 13:27:58 -07:00
Maysam Yabandeh	499ebb3ab5	Optimize for serial commits in 2PC Summary: Throughput: 46k tps in our sysbench settings (filling the details later) The idea is to have the simplest change that gives us a reasonable boost in 2PC throughput. Major design changes: 1. The WAL file internal buffer is not flushed after each write. Instead it is flushed before critical operations (WAL copy via fs) or when FlushWAL is called by MySQL. Flushing the WAL buffer is also protected via mutex_. 2. Use two sequence numbers: last seq, and last seq for write. Last seq is the last visible sequence number for reads. Last seq for write is the next sequence number that should be used to write to WAL/memtable. This allows to have a memtable write be in parallel to WAL writes. 3. BatchGroup is not used for writes. This means that we can have parallel writers which changes a major assumption in the code base. To accommodate for that i) allow only 1 WriteImpl that intends to write to memtable via mem_mutex_--which is fine since in 2PC almost all of the memtable writes come via group commit phase which is serial anyway, ii) make all the parts in the code base that assumed to be the only writer (via EnterUnbatched) to also acquire mem_mutex_, iii) stat updates are protected via a stat_mutex_. Note: the first commit has the approach figured out but is not clean. Submitting the PR anyway to get the early feedback on the approach. If we are ok with the approach I will go ahead with this updates: 0) Rebase with Yi's pipelining changes 1) Currently batching is disabled by default to make sure that it will be consistent with all unit tests. Will make this optional via a config. 2) A couple of unit tests are disabled. They need to be updated with the serial commit of 2PC taken into account. 3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires releasing mutex_ beforehand (the same way EnterUnbatched does). This needs to be cleaned up. 
Closes https://github.com/facebook/rocksdb/pull/2345 Differential Revision: D5210732 Pulled By: maysamyabandeh fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4	2017-06-24 14:11:29 -07:00
Maysam Yabandeh	0ac4afb975	Sanitize partitioning options Summary: We currently do not support partitioning filters if indexes are not partitioned. The patch makes sure that these two are consistent. Closes https://github.com/facebook/rocksdb/pull/2455 Differential Revision: D5275644 Pulled By: maysamyabandeh fbshipit-source-id: b61701ac8914c2206d06f5e33ff6f67b24406d1d	2017-06-23 18:30:01 -07:00
Maysam Yabandeh	6f4154d693	record index partition properties Summary: When Partitioning index/filter is enabled the user might need to check the index block size as well as the top-level index size via sst_dump. This patch records i) number of partitions, ii) top-level index size and make it accessible through sst_dump. The number of partitions for filters is the same as that of indexes. The top-level index for filters has a similar size to top-level index for indexes, so it is not repeated. Closes https://github.com/facebook/rocksdb/pull/2437 Differential Revision: D5224225 Pulled By: maysamyabandeh fbshipit-source-id: 5324598c75793523aef1bb7ee225a5475e95a9cb	2017-06-13 11:21:32 -07:00
Siying Dong	5582123dee	Sample number of reads per SST file Summary: We estimate the number of reads per SST file by updating the counter per file in sampled read requests. This information can later be used to trigger compactions to improve read performance. Closes https://github.com/facebook/rocksdb/pull/2417 Differential Revision: D5193528 Pulled By: siying fbshipit-source-id: b4241c5ad0eaf444b61afb53f8e6290d9f5da2df	2017-06-12 07:12:08 -07:00
Maysam Yabandeh	cc5f9339ee	Fix concurrency issue with filter_block_set_ Summary: filter_block_set_ access must also be protected with mutex. Closes https://github.com/facebook/rocksdb/pull/2413 Differential Revision: D5193159 Pulled By: maysamyabandeh fbshipit-source-id: 6987fc219d9a65c20b9c7e52151aef4b8e4882e6	2017-06-06 12:56:52 -07:00
Aaron Gao	7f6c02dda1	using ThreadLocalPtr to hide ROCKSDB_SUPPORT_THREAD_LOCAL from public… Summary: … headers https://github.com/facebook/rocksdb/pull/2199 should not reference RocksDB-specific macros (like ROCKSDB_SUPPORT_THREAD_LOCAL in this case) to public headers, `iostats_context.h` and `perf_context.h`. We shouldn't do that because users have to provide these compiler flags when building their binary with RocksDB. We should hide the thread local global variable inside our implementation and just expose a function api to retrieve these variables. It may break some users for now but good for long term. make check -j64 Closes https://github.com/facebook/rocksdb/pull/2380 Differential Revision: D5177896 Pulled By: lightmark fbshipit-source-id: 6fcdfac57f2e2dcfe60992b7385c5403f6dcb390	2017-06-02 17:26:19 -07:00
Mike Kolupaev	138b87eae4	Fix interaction between CompactionFilter::Decision::kRemoveAndSkipUnt… Summary: Fixes the following scenario: 1. Set prefix extractor. Enable bloom filters, with `whole_key_filtering = false`. Use compaction filter that sometimes returns `kRemoveAndSkipUntil`. 2. Do a compaction. 3. Compaction creates an iterator with `total_order_seek = false`, calls `SeekToFirst()` on it, then repeatedly calls `Next()`. 4. At some point compaction filter returns `kRemoveAndSkipUntil`. 5. Compaction calls `Seek(skip_until)` on the iterator. The key that it seeks to happens to have prefix that doesn't match the bloom filter. Since `total_order_seek = false`, iterator becomes invalid, and compaction thinks that it has reached the end. The rest of the compaction input is silently discarded. The fix is to make compaction iterator use `total_order_seek = true`. The implementation for PlainTable is quite awkward. I've made `kRemoveAndSkipUntil` officially incompatible with PlainTable. If you try to use them together, compaction will fail, and DB will enter read-only mode (`bg_error_`). That's not a very graceful way to communicate a misconfiguration, but the alternatives don't seem worth the implementation time and complexity. To be able to check in advance that `kRemoveAndSkipUntil` is not going to be used with PlainTable, we'd need to extend the interface of either `CompactionFilter` or `InternalIterator`. It seems unlikely that anyone will ever want to use `kRemoveAndSkipUntil` with PlainTable: PlainTable probably has very few users, and `kRemoveAndSkipUntil` has only one user so far: us (logdevice). Closes https://github.com/facebook/rocksdb/pull/2349 Differential Revision: D5110388 Pulled By: lightmark fbshipit-source-id: ec29101a99d9dcd97db33923b87f72bce56cc17a	2017-06-02 15:11:38 -07:00
Andrew Kryczka	a4d9c02511	Pass CF ID to MemTableRepFactory Summary: Some users want to monitor column family activity in their custom memtable implementations. Previously there was no way to figure out with which column family a memtable is associated. This diff: - adds an overload to MemTableRepFactory::CreateMemTableRep() that provides the CF ID. For compatibility, its default implementation calls the old overload. - updates MemTable to create MemTableRep's using the new overload. Closes https://github.com/facebook/rocksdb/pull/2346 Differential Revision: D5108061 Pulled By: ajkr fbshipit-source-id: 3a1921214a348dd8ea0f54e1cab3b71c3d46d616	2017-06-02 12:12:06 -07:00
Aaron Gao	f7bb1a0060	support merge and delete in file ingestion Summary: Previously sst_file_writer only supports kTypeValue, we need kTypeMerge and kTypeDeletion also as user requested. Closes https://github.com/facebook/rocksdb/pull/2361 Differential Revision: D5139402 Pulled By: lightmark fbshipit-source-id: 092a60756d01692539d817a3765ebfd58a8d7f88	2017-05-26 12:11:21 -07:00
Giuseppe Ottaviano	69ec8356b2	Allow SstFileWriter to use the rate limiter Summary: The default IO priority of WritableFiles is IO_TOTAL, meaning that they will bypass the rate limiter if it's passed in the options. This change allows to pass an io priority in construction, so that by setting IO_LOW or IO_HIGH the rate limit will be honored. It also fixes a minor bug: SstFileWriter's copy and move constructor are not disabled and incorrect, as any copy/move will result in a double free. Switching to unique_ptr makes the object correctly movable and non-copyable as expected. Also fix minor style inconsistencies. Closes https://github.com/facebook/rocksdb/pull/2335 Differential Revision: D5113260 Pulled By: sagar0 fbshipit-source-id: e084236e7ff0b50a56cbeceaa9fedd5e210bf9f8	2017-05-23 11:42:09 -07:00
Siying Dong	9bbba4fec1	Remove unused BlockBasedTable::compaction_optimized_ Summary: BlockBasedTable::compaction_optimized_ is never used but can cause TSAN warning. Remove it. Closes https://github.com/facebook/rocksdb/pull/2324 Differential Revision: D5085533 Pulled By: siying fbshipit-source-id: 2feefce6806d559dfb4ab2989aa3db36752fe25d	2017-05-18 06:41:23 -07:00
yizhu.sun	f5ba131bf8	Fixed some spelling mistakes Summary: Closes https://github.com/facebook/rocksdb/pull/2314 Differential Revision: D5079601 Pulled By: sagar0 fbshipit-source-id: ae5696fd735718f544435c64c3179c49b8c04349	2017-05-17 23:12:36 -07:00
hyunwoo	0ebdd70579	fixed typo Summary: fixed typo Closes https://github.com/facebook/rocksdb/pull/2312 Differential Revision: D5079631 Pulled By: sagar0 fbshipit-source-id: e4c8d1d89b244ee69e9dea1dd013227cc5241026	2017-05-17 16:41:49 -07:00
Aaron Gao	259a00eaca	unbiase readamp bitmap Summary: Consider BlockReadAmpBitmap with bytes_per_bit = 32. Suppose bytes [a, b) were used, while bytes [a-32, a) and [b+1, b+33) weren't used; more formally, the union of ranges passed to BlockReadAmpBitmap::Mark() contains [a, b) and doesn't intersect with [a-32, a) and [b+1, b+33). Then bits [floor(a/32), ceil(b/32)] will be set, and so the number of useful bytes will be estimated as (ceil(b/32) - floor(a/32)) * 32, which is on average equal to b-a+31. An extreme example: if we use 1 byte from each block, it'll be counted as 32 bytes from each block. It's easy to remove this bias by slightly changing the semantics of the bitmap. Currently each bit represents a byte range [i*32, (i+1)*32). This diff makes each bit represent a single byte: i*32 + X, where X is a random number in [0, 31] generated when the bitmap is created. So, e.g., if you read a single byte at random, with probability 31/32 it won't be counted at all, and with probability 1/32 it will be counted as 32 bytes; so, on average it's counted as 1 byte. But there is one exception: the last bit will always be set the old way.* (*) - assuming read_amp_bytes_per_bit = 32. Closes https://github.com/facebook/rocksdb/pull/2259 Differential Revision: D5035652 Pulled By: lightmark fbshipit-source-id: bd98b1b9b49fbe61f9e3781d07f624e3cbd92356	2017-05-10 01:49:52 -07:00
Islam AbdelRahman	4897eb250b	dont skip IO for filter blocks Summary: Based on my experience with linkbench, We should not skip loading bloom filter blocks when they are not available in block cache when using Iterator::Seek Actually I am not sure why this behavior existed in the first place Closes https://github.com/facebook/rocksdb/pull/2255 Differential Revision: D5010721 Pulled By: maysamyabandeh fbshipit-source-id: 0af545a06ac4baeecb248706ec34d009c2480ca4	2017-05-09 09:52:02 -07:00
Maysam Yabandeh	40af2381ec	Object lifetime in cache Summary: Any non-raw-data dependent object must be destructed before the table closes. There was a bug of not doing that for filter object. This patch fixes the bug and adds a unit test to prevent such bugs in future. Closes https://github.com/facebook/rocksdb/pull/2246 Differential Revision: D5001318 Pulled By: maysamyabandeh fbshipit-source-id: 6d8772e58765485868094b92964da82ef9730b6d	2017-05-05 23:20:01 -07:00
Aaron Gao	a30a696034	do not read next datablock if upperbound is reached Summary: Now if we have iterate_upper_bound set, we continue reading until we get a key >= upper_bound. In many cases where neighboring data blocks have a user key gap between them, our index key will be a user key in the middle to get a shorter size. For example, if we have blocks: [a b c d][f g h] Then the index key for the first block will be 'e'. Then if the upper bound is any key between 'd' and 'e', for example, d1, d2, ..., d99999999999, we don't have to read the second block, and we also know that we have finished our iteration by reaching the last key that is smaller than the upper bound. This diff can reduce RA in most cases. Closes https://github.com/facebook/rocksdb/pull/2239 Differential Revision: D4990693 Pulled By: lightmark fbshipit-source-id: ab30ea2e3c6edf3fddd5efed3c34fcf7739827ff	2017-05-05 23:20:01 -07:00
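
The early-termination check described in a30a696034 can be modeled outside RocksDB. The sketch below is a simplified Python model, not the RocksDB implementation: the function name `scan`, the list-of-lists block layout, and the index key `"i"` for the last block are assumptions made for illustration. Only the blocks `[a b c d][f g h]`, the index key `'e'`, and the bound `d1` come from the commit message. The model also assumes each index key is >= the last key of its block and < the first key of the next block.

```python
def scan(blocks, index_keys, upper_bound):
    """Collect keys < upper_bound and count how many data blocks were read.

    blocks:      list of sorted key lists (hypothetical stand-in for data blocks)
    index_keys:  one separator per block; assumed >= last key of that block
                 and < first key of the following block
    """
    result = []
    blocks_read = 0
    for block, index_key in zip(blocks, index_keys):
        blocks_read += 1
        for key in block:
            if key >= upper_bound:
                return result, blocks_read
            result.append(key)
        # Every key in later blocks is greater than this block's index key.
        # If the bound is at or below the index key, nothing further can
        # qualify, so the next block does not need to be read.
        if upper_bound <= index_key:
            return result, blocks_read
    return result, blocks_read


blocks = [["a", "b", "c", "d"], ["f", "g", "h"]]
index_keys = ["e", "i"]  # "i" is an assumed separator for the last block

keys, blocks_read = scan(blocks, index_keys, "d1")
print(keys, blocks_read)
```

With the bound `d1`, the model stops after the first block because `d1 <= e`; with a bound such as `g`, it has to open the second block as before.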

768 Commits
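
The sampling idea in 259a00eaca (unbiased read-amp bitmap) can be checked with a small model. This is a hedged Python sketch, not the RocksDB `BlockReadAmpBitmap` class: the class name `SampledReadAmpBitmap`, its methods, and the explicit `offset` parameter are hypothetical, and the commit's special case for the last bit is left out. Each bit `i` stands for the single byte `i*32 + X`, and every set bit is counted as 32 useful bytes.

```python
import random


class SampledReadAmpBitmap:
    """Toy model: bit i represents the single byte i * bytes_per_bit + offset."""

    def __init__(self, bytes_per_bit=32, offset=None):
        self.bytes_per_bit = bytes_per_bit
        # X in the commit message: chosen once, when the bitmap is created.
        # Passing it explicitly (an assumption of this sketch) makes tests
        # deterministic.
        if offset is None:
            offset = random.randrange(bytes_per_bit)
        self.offset = offset
        self.bits = set()

    def mark(self, start, end):
        """Mark the half-open byte range [start, end) as used."""
        bpb = self.bytes_per_bit
        # Smallest i with i * bpb + offset >= start (ceiling division).
        i = max(0, -(-(start - self.offset) // bpb))
        while i * bpb + self.offset < end:
            self.bits.add(i)
            i += 1

    def estimated_useful_bytes(self):
        return len(self.bits) * self.bytes_per_bit


# Reading one byte: over all 32 possible offsets, exactly one bitmap counts
# it (as 32 bytes), so the average estimate is 1 byte.
total = 0
for x in range(32):
    bitmap = SampledReadAmpBitmap(offset=x)
    bitmap.mark(100, 101)
    total += bitmap.estimated_useful_bytes()
print(total / 32)
```

Averaging over the offset is what removes the bias: a single-byte read is usually not counted at all and occasionally counted as a full 32 bytes, matching the 31/32 and 1/32 probabilities given in the commit message.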