rocksdb

Commit Graph

Author	SHA1	Message	Date
Andrew Kryczka	06dc32ef25	internal_repo_rocksdb (435146444452818992) (#12115 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12115 Reviewed By: jowlyzhang Differential Revision: D51745742 Pulled By: ajkr fbshipit-source-id: 67000d07783b413924798dd9c1751da27e119d53	2023-12-01 11:15:17 -08:00
anand76	727bad78b8	Format files under table/ by clang-format (#10852 ) Summary: Run clang-format on files under the `table` directory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10852 Reviewed By: ajkr Differential Revision: D40650732 Pulled By: anand1976 fbshipit-source-id: 2023a958e37fd6274040c5181130284600c9e0ef	2022-10-25 11:50:38 -07:00
Peter Dillinger	5bf9a7d5ee	Clarify caching behavior for index and filter partitions (#9068 ) Summary: Somewhat confusingly, index and filter partition blocks are never owned by table readers, even with cache_index_and_filter_blocks=false. They still go into block cache (possibly pinned by table reader) if there is a block cache. If no block cache, they are only loaded transiently on demand. This PR primarily clarifies the options APIs and some internal code comments. Also, this closes a hypothetical data corruption vulnerability where some but not all index partitions are pinned. I haven't been able to reproduce a case where it can happen (the failure seems to propagate to abort table open) but it's worth patching nonetheless. Fixes https://github.com/facebook/rocksdb/issues/8979 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9068 Test Plan: existing tests :-/ I could cover the new code using sync points, but then I'd have to very carefully relax my `assert(false)` Reviewed By: ajkr Differential Revision: D31898284 Pulled By: pdillinger fbshipit-source-id: f2511a7d3a36bc04b627935d8e6cfea6422f98be	2021-10-27 17:23:04 -07:00
Andrew Kryczka	75d3b6fdf0	Redesign block cache pinning API (#7520 ) Summary: The old flag-based APIs (`BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache` and `BlockBasedTableOptions::pin_top_level_index_and_filter`) were insufficient for our needs. For example, it was impossible to pin only unpartitioned meta-blocks, which could prevent block cache contention when turning on dictionary compression or during a migration to partitioned indexes/filters. It was also impossible to pin all meta-blocks in memory while having predictable memory usage via block cache. If we had continued adding flags to address these scenarios, they would have had significant overlap causing confusion. Instead, this PR deprecates the flags and starts a new API with non-overlapping options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7520 Test Plan: - new unit test - added new options to stress/crash test and ran for a while: `$ python tools/db_crashtest.py blackbox --simple --max_key=1000000 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 --interval=10 -value_size_mult=33 -column_families=1 -reopen=0` Reviewed By: pdillinger Differential Revision: D24200034 Pulled By: ajkr fbshipit-source-id: 3fa7cfc71e7960f7a867511dd6ae5834dd73b13e	2020-10-11 14:58:24 -07:00
sdong	fdf882ded2	Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433 ) Summary: When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433 Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag. Differential Revision: D19977691 fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e	2020-02-20 12:09:57 -08:00
Mike Kolupaev	b4d7209428	Add an option to put first key of each sst block in the index (#5289 ) Summary: The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes. Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it. So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks. Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files. This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289 Differential Revision: D15256423 Pulled By: al13n321 fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a	2019-06-24 20:54:04 -07:00
Siying Dong	8843129ece	Move some memory related files from util/ to memory/ (#5382 ) Summary: Move arena, allocator, and memory tools under util to a separate memory/ directory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382 Differential Revision: D15564655 Pulled By: siying fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5	2019-05-30 17:44:09 -07:00
Vijay Nadimpalli	50e470791d	Organizing rocksdb/table directory by format Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5373 Differential Revision: D15559425 Pulled By: vjnadimpalli fbshipit-source-id: 5d6d6d615582bedd96a4b879bb25d429a6de8b55	2019-05-30 14:51:11 -07:00
Michael Liu	ca89ac2ba9	Apply modernize-use-override (2nd iteration) Summary: Use C++11’s override and remove virtual where applicable. Change are automatically generated. Reviewed By: Orvid Differential Revision: D14090024 fbshipit-source-id: 1e9432e87d2657e1ff0028e15370a85d1739ba2a	2019-02-14 14:41:36 -08:00
Maysam Yabandeh	caf0f53a74	Index value delta encoding (#3983 ) Summary: Given that index value is a BlockHandle, which is basically an <offset, size> pair we can apply delta encoding on the values. The first value at each index restart interval encoded the full BlockHandle but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the detail of the encoding. This reduces the index size which helps using the block cache more efficiently. The feature is enabled with using format_version 4. The feature comes with a bit of cpu overhead which should be paid back by the higher cache hits due to smaller index block size. Results with sysbench read-only using 4k blocks and using 16 index restart interval: Format 2: 19585 rocksdb read-only range=100 Format 3: 19569 rocksdb read-only range=100 Format 4: 19352 rocksdb read-only range=100 Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983 Differential Revision: D8361343 Pulled By: maysamyabandeh fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651	2018-08-09 16:58:40 -07:00
Mike Kolupaev	8bf555f487	Change and clarify the relationship between Valid(), status() and Seek() for all iterators. Also fix some bugs Summary: Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are: If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted. * When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not. However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours). This PR changes the convention to: * If status() is not ok, Valid() always returns false. * Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.) This does sacrifice the two use cases listed above, but siying said it's ok. Overview of the changes: * A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario. * Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed. * A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator. To find the iterator types that needed changes I searched for "public .Iterator" in the code. Here's an overview of all 27 iterator types: Iterators that didn't need changes: status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator. * Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator. Iterators with changes (see inline comments for details): * DBIter - an overhaul: - It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back. - It had a few code paths silently discarding subiterator's status. The stress test caught a few. - The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example. - Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption. - It used to not reset status on seek for some types of errors. - Some simplifications and better comments. - Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer. * MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status. * ForwardIterator - changed to the new convention, also slightly simplified. * ForwardLevelIterator - fixed some bugs and simplified. * LevelIterator - simplified. * TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_. * BlockBasedTableIterator - minor changes. * BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid. * PlainTableIterator - some seeks used to not reset status. * CuckooTableIterator - tiny code cleanup. * ManagedIterator - fixed some bugs. * BaseDeltaIterator - changed to the new convention and fixed a bug. * BlobDBIterator - seeks used to not reset status. * KeyConvertingIterator - some small change. Closes https://github.com/facebook/rocksdb/pull/3810 Differential Revision: D7888019 Pulled By: al13n321 fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d	2018-05-17 02:56:56 -07:00
Zhongyi Xie	954b496b3f	fix memory leak in two_level_iterator Summary: this PR fixes a few failed contbuild: 1. ASAN memory leak in Block::NewIterator (table/block.cc:429). the proper destruction of first_level_iter_ and second_level_iter_ of two_level_iterator.cc is missing from the code after the refactoring in https://github.com/facebook/rocksdb/pull/3406 2. various unused param errors introduced by https://github.com/facebook/rocksdb/pull/3662 3. updated comment for `ForceReleaseCachedEntry` to emphasize the use of `force_erase` flag. Closes https://github.com/facebook/rocksdb/pull/3718 Reviewed By: maysamyabandeh Differential Revision: D7621192 Pulled By: miasantreble fbshipit-source-id: 476c94264083a0730ded957c29de7807e4f5b146	2018-04-15 17:26:26 -07:00
Andrew Kryczka	5d68243e61	Comment out unused variables Summary: Submitting on behalf of another employee. Closes https://github.com/facebook/rocksdb/pull/3557 Differential Revision: D7146025 Pulled By: ajkr fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e	2018-03-05 13:13:41 -08:00
Igor Sugak	aba3409740	Back out "[codemod] - comment out unused parameters" Reviewed By: igorsugak fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37	2018-02-22 12:43:17 -08:00
David Lai	f4a030ce81	- comment out unused parameters Reviewed By: everiq, igorsugak Differential Revision: D7046710 fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461	2018-02-22 09:44:23 -08:00
Siying Dong	b555ed30a4	Customized BlockBasedTableIterator and LevelIterator Summary: Use a customzied BlockBasedTableIterator and LevelIterator to replace current implementations leveraging two-level-iterator. Hope the customized logic will make code easier to understand. As a side effect, BlockBasedTableIterator reduces the allocation for the data block iterator object, and avoid the virtual function call to it, because we can directly reference BlockIter, a final class. Similarly, LevelIterator reduces virtual function call to the dummy iterator iterating the file metadata. It also enabled further optimization. The upper bound check is also moved from index block to data block. This implementation fits this iterator better. After the change, forwared iterator is slightly optimized to ensure we trim those iterators. The two-level-iterator now is only used by partitioned index, so it is simplified. Closes https://github.com/facebook/rocksdb/pull/3406 Differential Revision: D6809041 Pulled By: siying fbshipit-source-id: 7da3b9b1d3c8e9d9405302c15920af1fcaf50ffa	2018-02-12 17:12:25 -08:00
Siying Dong	3c327ac2d0	Change RocksDB License Summary: Closes https://github.com/facebook/rocksdb/pull/2589 Differential Revision: D5431502 Pulled By: siying fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75	2017-07-15 16:11:23 -07:00
Aaron Gao	a30a696034	do not read next datablock if upperbound is reached Summary: Now if we have iterate_upper_bound set, we continue read until get a key >= upper_bound. For a lot of cases that neighboring data blocks have a user key gap between them, our index key will be a user key in the middle to get a shorter size. For example, if we have blocks: [a b c d][f g h] Then the index key for the first block will be 'e'. then if upper bound is any key between 'd' and 'e', for example, d1, d2, ..., d99999999999, we don't have to read the second block and also know that we have done our iteration by reaching the last key that smaller the upper bound already. This diff can reduce RA in most cases. Closes https://github.com/facebook/rocksdb/pull/2239 Differential Revision: D4990693 Pulled By: lightmark fbshipit-source-id: ab30ea2e3c6edf3fddd5efed3c34fcf7739827ff	2017-05-05 23:20:01 -07:00
Siying Dong	d616ebea23	Add GPLv2 as an alternative license. Summary: Closes https://github.com/facebook/rocksdb/pull/2226 Differential Revision: D4967547 Pulled By: siying fbshipit-source-id: dd3b58ae1e7a106ab6bb6f37ab5c88575b125ab4	2017-04-27 18:06:12 -07:00
Aaron Gao	f517d9dd09	Add SeekForPrev() to Iterator Summary: Add new Iterator API, `SeekForPrev`: find the last key that <= target key support prefix_extractor support prefix_same_as_start support upper_bound not supported in iterators without Prev() Also add tests in db_iter_test and db_iterator_test Pass all tests Cheers! Test Plan: make all check -j64 Reviewers: andrewkr, yiwu, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64149	2016-09-27 18:20:57 -07:00
Islam AbdelRahman	b693ba68b5	Minor PinnedIteratorsManager Refactoring Summary: This diff include these simple change - Rename ReleasePinnedIterators to ReleasePinnedData - Rename PinIteratorIfNeeded to PinIterator - Use std::vector directly in PinnedIteratorsManager instead of std::unique_ptr<std::vector> - Generalize PinnedIteratorsManager by adding PinPtr which can pin any pointer Test Plan: existing tests Reviewers: sdong, yiwu, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D61305	2016-08-11 11:54:17 -07:00
Islam AbdelRahman	68a8e6b8fa	Introduce FullMergeV2 (eliminate memcpy from merge operators) Summary: This diff update the code to pin the merge operator operands while the merge operation is done, so that we can eliminate the memcpy cost, to do that we need a new public API for FullMerge that replace the std::deque<std::string> with std::vector<Slice> This diff is stacked on top of D56493 and D56511 In this diff we - Update FullMergeV2 arguments to be encapsulated in MergeOperationInput and MergeOperationOutput which will make it easier to add new arguments in the future - Replace std::deque<std::string> with std::vector<Slice> to pass operands - Replace MergeContext std::deque with std::vector (based on a simple benchmark I ran https://gist.github.com/IslamAbdelRahman/78fc86c9ab9f52b1df791e58943fb187) - Allow FullMergeV2 output to be an existing operand ``` [Everything in Memtable \| 10K operands \| 10 KB each \| 1 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=10000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000 [FullMergeV2] readseq : 0.607 micros/op 1648235 ops/sec; 16121.2 MB/s readseq : 0.478 micros/op 2091546 ops/sec; 20457.2 MB/s readseq : 0.252 micros/op 3972081 ops/sec; 38850.5 MB/s readseq : 0.237 micros/op 4218328 ops/sec; 41259.0 MB/s readseq : 0.247 micros/op 4043927 ops/sec; 39553.2 MB/s [master] readseq : 3.935 micros/op 254140 ops/sec; 2485.7 MB/s readseq : 3.722 micros/op 268657 ops/sec; 2627.7 MB/s readseq : 3.149 micros/op 317605 ops/sec; 3106.5 MB/s readseq : 3.125 micros/op 320024 ops/sec; 3130.1 MB/s readseq : 4.075 micros/op 245374 ops/sec; 2400.0 MB/s ``` ``` [Everything in Memtable \| 10K operands \| 10 KB each \| 10 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=1000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000 [FullMergeV2] readseq : 3.472 micros/op 288018 ops/sec; 2817.1 MB/s readseq : 2.304 micros/op 434027 ops/sec; 4245.2 MB/s readseq : 1.163 micros/op 859845 ops/sec; 8410.0 MB/s readseq : 1.192 micros/op 838926 ops/sec; 8205.4 MB/s readseq : 1.250 micros/op 800000 ops/sec; 7824.7 MB/s [master] readseq : 24.025 micros/op 41623 ops/sec; 407.1 MB/s readseq : 18.489 micros/op 54086 ops/sec; 529.0 MB/s readseq : 18.693 micros/op 53495 ops/sec; 523.2 MB/s readseq : 23.621 micros/op 42335 ops/sec; 414.1 MB/s readseq : 18.775 micros/op 53262 ops/sec; 521.0 MB/s ``` ``` [Everything in Block cache \| 10K operands \| 10 KB each \| 1 operand per key] [FullMergeV2] $ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions readseq : 14.741 micros/op 67837 ops/sec; 663.5 MB/s readseq : 1.029 micros/op 971446 ops/sec; 9501.6 MB/s readseq : 0.974 micros/op 1026229 ops/sec; 10037.4 MB/s readseq : 0.965 micros/op 1036080 ops/sec; 10133.8 MB/s readseq : 0.943 micros/op 1060657 ops/sec; 10374.2 MB/s [master] readseq : 16.735 micros/op 59755 ops/sec; 584.5 MB/s readseq : 3.029 micros/op 330151 ops/sec; 3229.2 MB/s readseq : 3.136 micros/op 318883 ops/sec; 3119.0 MB/s readseq : 3.065 micros/op 326245 ops/sec; 3191.0 MB/s readseq : 3.014 micros/op 331813 ops/sec; 3245.4 MB/s ``` ``` [Everything in Block cache \| 10K operands \| 10 KB each \| 10 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10-operands-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions [FullMergeV2] readseq : 24.325 micros/op 41109 ops/sec; 402.1 MB/s readseq : 1.470 micros/op 680272 ops/sec; 6653.7 MB/s readseq : 1.231 micros/op 812347 ops/sec; 7945.5 MB/s readseq : 1.091 micros/op 916590 ops/sec; 8965.1 MB/s readseq : 1.109 micros/op 901713 ops/sec; 8819.6 MB/s [master] readseq : 27.257 micros/op 36687 ops/sec; 358.8 MB/s readseq : 4.443 micros/op 225073 ops/sec; 2201.4 MB/s readseq : 5.830 micros/op 171526 ops/sec; 1677.7 MB/s readseq : 4.173 micros/op 239635 ops/sec; 2343.8 MB/s readseq : 4.150 micros/op 240963 ops/sec; 2356.8 MB/s ``` Test Plan: COMPILE_WITH_ASAN=1 make check -j64 Reviewers: yhchiang, andrewkr, sdong Reviewed By: sdong Subscribers: lovro, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57075	2016-07-20 09:49:03 -07:00
Islam AbdelRahman	d719b095dc	Introduce PinnedIteratorsManager (Reduce PinData() overhead / Refactor PinData) Summary: While trying to reuse PinData() / ReleasePinnedData() .. to optimize away some memcpys I realized that there is a significant overhead for using PinData() / ReleasePinnedData if they were called many times. This diff refactor the pinning logic by introducing PinnedIteratorsManager a centralized component that will be created once and will be notified whenever we need to Pin an Iterator. This implementation have much less overhead than the original implementation Test Plan: make check -j64 COMPILE_WITH_ASAN=1 make check -j64 Reviewers: yhchiang, sdong, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56493	2016-04-26 12:41:07 -07:00
Baraa Hamodi	21e95811d1	Updated all copyright headers to the new format.	2016-02-09 15:12:00 -08:00
Islam AbdelRahman	aececc209e	Introduce ReadOptions::pin_data (support zero copy for keys) Summary: This patch update the Iterator API to introduce new functions that allow users to keep the Slices returned by key() valid as long as the Iterator is not deleted ReadOptions::pin_data : If true keep loaded blocks in memory as long as the iterator is not deleted Iterator::IsKeyPinned() : If true, this mean that the Slice returned by key() is valid as long as the iterator is not deleted Also add a new option BlockBasedTableOptions::use_delta_encoding to allow users to disable delta_encoding if needed. Benchmark results (using https://phabricator.fb.com/P20083553) ``` // $ du -h /home/tec/local/normal.4K.Snappy/db10077 // 6.1G /home/tec/local/normal.4K.Snappy/db10077 // $ du -h /home/tec/local/zero.8K.LZ4/db10077 // 6.4G /home/tec/local/zero.8K.LZ4/db10077 // Benchmarks for shard db10077 // _build/opt/rocks/benchmark/rocks_copy_benchmark \ // --normal_db_path="/home/tec/local/normal.4K.Snappy/db10077" \ // --zero_db_path="/home/tec/local/zero.8K.LZ4/db10077" // First run // ============================================================================ // rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s // ============================================================================ // BM_StringCopy 1.73s 576.97m // BM_StringPiece 103.74% 1.67s 598.55m // ============================================================================ // Match rate : 1000000 / 1000000 // Second run // ============================================================================ // rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s // ============================================================================ // BM_StringCopy 611.99ms 1.63 // BM_StringPiece 203.76% 300.35ms 3.33 // ============================================================================ // Match rate : 1000000 / 1000000 ``` Test Plan: Unit tests Reviewers: sdong, igor, anthony, yhchiang, rven Reviewed By: rven Subscribers: dhruba, lovro, adsharma Differential Revision: https://reviews.facebook.net/D48999	2015-12-16 12:08:30 -08:00
sdong	35ad531be3	Seperate InternalIterator from Iterator Summary: Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type. This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's. At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it. Test Plan: Run all existing tests. Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven Reviewed By: rven Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D48549	2015-10-13 15:32:13 -07:00
sdong	05e2831966	Allocate LevelFileIteratorState and LevelFileNumIterator from DB iterator's arena Summary: Try to allocate LevelFileIteratorState and LevelFileNumIterator from DB iterator's arena, instead of calling malloc and free. Test Plan: valgrind check Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D40929	2015-06-30 17:30:38 -07:00
Igor Sugak	62247ffa3b	rocksdb: Add missing override Summary: When using latest clang (3.6 or 3.7/trunck) rocksdb is failing with many errors. Almost all of them are missing override errors. This diff adds missing override keyword. No manual changes. Prerequisites: bear and clang 3.5 build with extra tools ```lang=bash % USE_CLANG=1 bear make all # generate a compilation database http://clang.llvm.org/docs/JSONCompilationDatabase.html % clang-modernize -p . -include . -add-override % make format ``` Test Plan: Make sure all tests are passing. ```lang=bash % #Use default fb code clang. % make check ``` Verify less error and no missing override errors. ```lang=bash % # Have trunk clang present in path. % ROCKSDB_NO_FBCODE=1 CC=clang CXX=clang++ make ``` Reviewers: igor, kradhakrishnan, rven, meyering, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D34077	2015-02-26 11:28:41 -08:00
Tomislav Novak	0f9c43ea36	ForwardIterator: reset incomplete iterators on Seek() Summary: When reading from kBlockCacheTier, ForwardIterator's internal child iterators may end up in the incomplete state (read was unable to complete without doing disk I/O). `ForwardIterator::status()` will correctly report that; however, the iterator may be stuck in that state until all sub-iterators are rebuilt: * `NeedToSeekImmutable()` may return false even if some sub-iterators are incomplete * one of the child iterators may be an empty iterator without any state other that the kIncomplete status (created using `NewErrorIterator()`); seeking on any such iterator has no effect -- we need to construct it again Akin to rebuilding iterators after a superversion bump, this diff makes forward iterator reset all incomplete child iterators when `Seek()` or `Next()` are called. Test Plan: TEST_TMPDIR=/dev/shm/rocksdbtest ROCKSDB_TESTS=TailingIterator ./db_test Reviewers: igor, sdong, ljin Reviewed By: ljin Subscribers: lovro, march, leveldb Differential Revision: https://reviews.facebook.net/D22575	2014-08-29 16:21:29 -07:00
sdong	df9069d23f	In DB::NewIterator(), try to allocate the whole iterator tree in an arena Summary: In this patch, try to allocate the whole iterator tree starting from DBIter from an arena 1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it. 2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator. 3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it. Limitations: (1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc (2) Two level iterator itself is allocated in arena, but not iterators inside it. Test Plan: make all check Reviewers: ljin, haobo Reviewed By: haobo Subscribers: leveldb, dhruba, yhchiang, igor Differential Revision: https://reviews.facebook.net/D18513	2014-06-02 17:44:57 -07:00
Lei Jin	ccaca59bee	avoid calling FindFile twice in TwoLevelIterator for PlainTable Summary: this is to reclaim the regression introduced in https://reviews.facebook.net/D17853 Test Plan: make all check Reviewers: igor, haobo, sdong, dhruba, yhchiang Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D17985	2014-04-25 12:23:07 -07:00
Lei Jin	d642c60bdc	Check PrefixMayMatch on Seek() Summary: As a follow-up diff for https://reviews.facebook.net/D17805, add optimization to check PrefixMayMatch on Seek() Test Plan: make all check Reviewers: igor, haobo, sdong, yhchiang, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D17853	2014-04-25 12:22:23 -07:00
Siying Dong	d169b67680	[Performance Branch] PlainTable to encode rows with seqID 0, value type using 1 internal byte. Summary: In PlainTable, use one single byte to represent 8 bytes of internal bytes, if seqID = 0 and it is value type (which should be common for bottom most files). It is to save 7 bytes for uncompressed cases. Test Plan: make all check Reviewers: haobo, dhruba, kailiu Reviewed By: haobo CC: igor, leveldb Differential Revision: https://reviews.facebook.net/D15489	2014-02-03 12:19:30 -08:00
Igor Canadi	36409e0016	Fix slow no-io iterator Summary: This fixes #3130525. Dhruba's suggestion and Tnovak's implementation :) The issue was with SkipEmptyDataBlocksForward(), but I also changed SkipEmptyDataBlocksBackward(). Is that OK? Test Plan: Run the logdevice test Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13911	2013-11-06 14:11:52 -08:00
Siying Dong	d4eec30ed0	Make "Table" pluggable Summary: This patch makes Table and TableBuilder a abstract class and make all the implementation of the current table into BlockedBasedTable and BlockedBasedTable Builder. Test Plan: Make db_test.cc to work with block based table. Add a new test simple_table_db_test.cc where a different simple table format is implemented. Reviewers: dhruba, haobo, kailiu, emayanke, vamsi Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13521	2013-10-28 17:54:09 -07:00
Dhruba Borthakur	9cd221094c	Add appropriate LICENSE and Copyright message. Summary: Add appropriate LICENSE and Copyright message. Test Plan: make check Reviewers: CC: Task ID: # Blame Rev:	2013-10-16 17:48:41 -07:00
Dhruba Borthakur	a143ef9b38	Change namespace from leveldb to rocksdb Summary: Change namespace from leveldb to rocksdb. This allows a single application to link in open-source leveldb code as well as rocksdb code into the same process. Test Plan: compile rocksdb Reviewers: emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D13287	2013-10-04 11:59:26 -07:00
Haobo Xu	ab8d2f6ab2	[RocksDB] [Performance] Allow different posix advice to be applied to the same table file Summary: Current posix advice implementation ties up the access pattern hint with the creation of a file. It is not possible to apply different advice for different access (random get vs compaction read), without keeping two open files for the same table. This patch extended the RandomeAccessFile interface to accept new access hint at anytime. Particularly, we are able to set different access hint on the same table file based on when/how the file is used. Two options are added to set the access hint, after the file is first opened and after the file is being compacted. Test Plan: make check; db_stress; db_bench Reviewers: dhruba Reviewed By: dhruba CC: MarkCallaghan, leveldb Differential Revision: https://reviews.facebook.net/D10905	2013-05-30 19:08:44 -07:00
Haobo Xu	eb6d139666	[RocksDB] Move table.h to table/ Summary: - don't see a point exposing table.h to the public. - fixed make clean to remove also *.d files. Test Plan: make check; db_stress Reviewers: dhruba, heyongqiang Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D10479	2013-04-22 16:07:56 -07:00
Dhruba Borthakur	ad96563b79	Ability to configure bufferedio-reads, filesystem-readaheads and mmap-read-write per database. Summary: This patch allows an application to specify whether to use bufferedio, reads-via-mmaps and writes-via-mmaps per database. Earlier, there was a global static variable that was used to configure this functionality. The default setting remains the same (and is backward compatible): 1. use bufferedio 2. do not use mmaps for reads 3. use mmap for writes 4. use readaheads for reads needed for compaction I also added a parameter to db_bench to be able to explicitly specify whether to do readaheads for compactions or not. Test Plan: make check Reviewers: sheki, heyongqiang, MarkCallaghan Reviewed By: sheki CC: leveldb Differential Revision: https://reviews.facebook.net/D9429	2013-03-20 23:14:03 -07:00
Abhishek Kona	c41f1e995c	Codemod NULL to nullptr Summary: scripted NULL to nullptr in * include/leveldb/ * db/ * table/ * util/ Test Plan: make all check Reviewers: dhruba, emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D9003	2013-02-28 18:04:58 -08:00
Hans Wennborg	36a5f8ed7f	A number of fixes: - Replace raw slice comparison with a call to user comparator. Added test for custom comparators. - Fix end of namespace comments. - Fixed bug in picking inputs for a level-0 compaction. When finding overlapping files, the covered range may expand as files are added to the input set. We now correctly expand the range when this happens instead of continuing to use the old range. For example, suppose L0 contains files with the following ranges: F1: a .. d F2: c .. g F3: f .. j and the initial compaction target is F3. We used to search for range f..j which yielded {F2,F3}. However we now expand the range as soon as another file is added. In this case, when F2 is added, we expand the range to c..j and restart the search. That picks up file F1 as well. This change fixes a bug related to deleted keys showing up incorrectly after a compaction as described in Issue 44. (Sync with upstream @25072954)	2011-10-31 17:22:06 +00:00
dgrogan@chromium.org	69c6d38342	reverting disastrous MOE commit, returning to r21 git-svn-id: https://leveldb.googlecode.com/svn/trunk@23 62dab493-f737-651d-591e-8d6aee1b9529	2011-04-19 23:11:15 +00:00
dgrogan@chromium.org	b743906eea	Revision created by MOE tool push_codebase. MOE_MIGRATION= git-svn-id: https://leveldb.googlecode.com/svn/trunk@22 62dab493-f737-651d-591e-8d6aee1b9529	2011-04-19 23:01:25 +00:00
dgrogan@chromium.org	b409afe968	chmod a-x git-svn-id: https://leveldb.googlecode.com/svn/trunk@21 62dab493-f737-651d-591e-8d6aee1b9529	2011-04-18 23:15:58 +00:00
dgrogan@chromium.org	f779e7a5d8	@20602303. Default file permission is now 755. git-svn-id: https://leveldb.googlecode.com/svn/trunk@20 62dab493-f737-651d-591e-8d6aee1b9529	2011-04-12 19:38:58 +00:00
jorlow@chromium.org	4671a695fc	Move include files into a leveldb subdir. git-svn-id: https://leveldb.googlecode.com/svn/trunk@18 62dab493-f737-651d-591e-8d6aee1b9529	2011-03-30 18:35:40 +00:00
jorlow@chromium.org	f67e15e50f	Initial checkin. git-svn-id: https://leveldb.googlecode.com/svn/trunk@2 62dab493-f737-651d-591e-8d6aee1b9529	2011-03-18 22:37:00 +00:00

48 Commits