rocksdb

Commit Graph

Author	SHA1	Message	Date
Shi Feng	01e6badbb6	Introduce CPU timers for iterator seek and next (#5076 ) Summary: Introduce CPU timers for iterator seek and next operations. Seek counter includes SeekToFirst, SeekToLast and SeekForPrev, w/ the caveat that SeekToLast timer doesn't include some post processing time if upper bound is defined. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5076 Differential Revision: D14525218 Pulled By: fredfsh fbshipit-source-id: 03ba25df3b22b06c072621e4de0eacfa1445f0d9	2019-03-26 16:32:13 -07:00
Maysam Yabandeh	c84fad7a19	Reorder DBIter fields to reduce memory usage (#5078 ) Summary: The patch reorders DBIter fields to put 1-byte fields together and let the compiler optimize the memory usage by using less 64-bit allocations for bools and enums. This might have a negative side effect of putting the variables that are accessed together into different cache lines and hence increasing the cache misses. Not sure what benchmark would verify that thought. I ran simple, single-threaded seekrandom benchmarks but the variance in the results is too much to be conclusive. ./db_bench --benchmarks=fillrandom --use_existing_db=0 --num=1000000 --db=/dev/shm/dbbench ./db_bench --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5078 Differential Revision: D14562676 Pulled By: maysamyabandeh fbshipit-source-id: 2284655d46e079b6e9a860e94be5defb6f482167	2019-03-21 09:55:09 -07:00
Michael Liu	ca89ac2ba9	Apply modernize-use-override (2nd iteration) Summary: Use C++11’s override and remove virtual where applicable. Change are automatically generated. Reviewed By: Orvid Differential Revision: D14090024 fbshipit-source-id: 1e9432e87d2657e1ff0028e15370a85d1739ba2a	2019-02-14 14:41:36 -08:00
Abhishek Madan	81b6b09f6b	Remove v1 RangeDelAggregator (#4778 ) Summary: Now that v2 is fully functional, the v1 aggregator is removed. The v2 aggregator has been renamed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4778 Differential Revision: D13495930 Pulled By: abhimadan fbshipit-source-id: 9d69500a60a283e79b6c4fa938fc68a8aa4d40d6	2018-12-17 17:33:46 -08:00
Abhishek Madan	abf931afa6	Add compaction logic to RangeDelAggregatorV2 (#4758 ) Summary: RangeDelAggregatorV2 now supports ShouldDelete calls on snapshot stripes and creation of range tombstone compaction iterators. RangeDelAggregator is no longer used on any non-test code path, and will be removed in a future commit. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4758 Differential Revision: D13439254 Pulled By: abhimadan fbshipit-source-id: fe105bcf8e3d4a2df37a622d5510843cd71b0401	2018-12-17 13:20:51 -08:00
Abhishek Madan	8fe1e06ca0	Clean up FragmentedRangeTombstoneList (#4692 ) Summary: Removed `one_time_use` flag, which removed the need for some tests, and changed all `NewRangeTombstoneIterator` methods to return `FragmentedRangeTombstoneIterators`. These changes also led to removing `RangeDelAggregatorV2::AddUnfragmentedTombstones` and one of the `MemTableListVersion::AddRangeTombstoneIterators` methods. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4692 Differential Revision: D13106570 Pulled By: abhimadan fbshipit-source-id: cbab5432d7fc2d9cdfd8d9d40361a1bffaa8f845	2018-11-28 15:29:02 -08:00
Abhishek Madan	457f77b9ff	Introduce RangeDelAggregatorV2 (#4649 ) Summary: The old RangeDelAggregator did expensive pre-processing work to create a collapsed, binary-searchable representation of range tombstones. With FragmentedRangeTombstoneIterator, much of this work is now unnecessary. RangeDelAggregatorV2 takes advantage of this by seeking in each iterator to find a covering tombstone in ShouldDelete, while doing minimal work in AddTombstones. The old RangeDelAggregator is still used during flush/compaction for now, though RangeDelAggregatorV2 will support those uses in a future PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4649 Differential Revision: D13146964 Pulled By: abhimadan fbshipit-source-id: be29a4c020fc440500c137216fcc1cf529571eb3	2018-11-21 10:56:45 -08:00
Soli Como	5945e16dfc	Divide `NO_ITERATORS` into two counters `NO_ITERATOR_CREATED` and `NO_ITERATOR_DELETE` (#4498 ) Summary: Currently, `Statistics` can record tick by `recordTick()` whose second parameter is an `uint64_t`. That means tick can only increase. If we want to reduce tick, we have to work around like `RecordTick(statistics_, NO_ITERATORS, uint64_t(-1));`. That's kind of a hack. So, this PR divide `NO_ITERATORS` into two counters `NO_ITERATOR_CREATED` and `NO_ITERATOR_DELETE`, making the counters increase only. Fixes #3013 . Pull Request resolved: https://github.com/facebook/rocksdb/pull/4498 Differential Revision: D10395010 Pulled By: sagar0 fbshipit-source-id: cfb523b22a37411c794b4e9da090f1ae30293db2	2018-11-13 11:46:32 -08:00
Andrew Kryczka	7e56072290	Fix merge operand reappearing when covered by DeleteRange (#4481 ) Summary: Even during `DBIter::Prev()`, there is a case where we need to use `RangeDelPositioningMode::kForwardTraversal`. In particular, when we hit too many internal keys for a single user key, we use seek to find the newest internal key. If it's a merge operand, we then scan forwards, collecting the merge operands. This forward scan should be using `RangeDelPositioningMode::kForwardTraversal`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4481 Differential Revision: D10319507 Pulled By: ajkr fbshipit-source-id: b5ce7352461f3a7696b28a5136ae0076f2bde51f	2018-10-10 18:16:12 -07:00
jsteemann	517d3b8b77	fix typo in error message, twice (#4457 ) Summary: Fixes a typo in error messages returned by Iterator::GetProperty(...) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4457 Differential Revision: D10281965 Pulled By: sagar0 fbshipit-source-id: 1cd3c665f467ef06cdfd9f482692e6f8568f3d22	2018-10-09 17:07:27 -07:00
Zhichao Cao	6d75319d95	Add tracing function of Seek() and SeekForPrev() to trace_replay (#4228 ) Summary: In the current trace_and replay, Get an WriteBatch are traced. This pull request track down the Seek() and SeekForPrev() to the trace file. <target_key, timestamp, column_family_id> are write to the file. Replay of Iterator is not supported in the current implementation. Tested with trace_analyzer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4228 Differential Revision: D9201381 Pulled By: zhichao-cao fbshipit-source-id: 6f9cc9cb3c20260af741bee065ec35c5c96354ab	2018-08-10 17:57:40 -07:00
Nikhil Benesch	5f3088d565	Range deletion performance improvements + cleanup (#4014 ) Summary: This fixes the same performance issue that #3992 fixes but with much more invasive cleanup. I'm more excited about this PR because it paves the way for fixing another problem we uncovered at Cockroach where range deletion tombstones can cause massive compactions. For example, suppose L4 contains deletions from [a, c) and [x, z) and no other keys, and L5 is entirely empty. L6, however, is full of data. When compacting L4 -> L5, we'll end up with one file that spans, massively, from [a, z). When we go to compact L5 -> L6, we'll have to rewrite all of L6! If, instead of range deletions in L4, we had keys a, b, x, y, and z, RocksDB would have been smart enough to create two files in L5: one for a and b and another for x, y, and z. With the changes in this PR, it will be possible to adjust the compaction logic to split tombstones/start new output files when they would span too many files in the grandparent level. ajkr please take a look when you have a minute! Pull Request resolved: https://github.com/facebook/rocksdb/pull/4014 Differential Revision: D8773253 Pulled By: ajkr fbshipit-source-id: ec62fa85f648fdebe1380b83ed997f9baec35677	2018-07-12 14:42:39 -07:00
Manuel Ung	a16e00b7b9	WriteUnPrepared Txn: Disable seek to snapshot optimization (#3955 ) Summary: This is implemented by extending ReadCallback with another function `MaxUnpreparedSequenceNumber` which returns the largest visible sequence number for the current transaction, if there is uncommitted data written to DB. Otherwise, it returns zero, indicating no uncommitted data. There are the places where reads had to be modified. - Get and Seek/Next was just updated to seek to max(snapshot_seq, MaxUnpreparedSequenceNumber()) instead, and iterate until a key was visible. - Prev did not need need updates since it did not use the Seek to sequence number optimization. Assuming that locks were held when writing unprepared keys, and ValidateSnapshot runs, there should only be committed keys and unprepared keys of the current transaction, all of which are visible. Prev will simply iterate to get the last visible key. - Reseeking to skip keys optimization was also disabled for write unprepared, since it's possible to hit the max_skip condition even while reseeking. There needs to be some way to resolve infinite looping in this case. Closes https://github.com/facebook/rocksdb/pull/3955 Differential Revision: D8286688 Pulled By: lth fbshipit-source-id: 25e42f47fdeb5f7accea0f4fd350ef35198caafe	2018-06-27 12:23:07 -07:00
Zhongyi Xie	c3ebc75843	Move prefix_extractor to MutableCFOptions Summary: Currently it is not possible to change bloom filter config without restart the db, which is causing a lot of operational complexity for users. This PR aims to make it possible to dynamically change bloom filter config. Closes https://github.com/facebook/rocksdb/pull/3601 Differential Revision: D7253114 Pulled By: miasantreble fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c	2018-05-21 14:43:11 -07:00
Mike Kolupaev	8bf555f487	Change and clarify the relationship between Valid(), status() and Seek() for all iterators. Also fix some bugs Summary: Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are: If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted. * When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not. However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours). This PR changes the convention to: * If status() is not ok, Valid() always returns false. * Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.) This does sacrifice the two use cases listed above, but siying said it's ok. Overview of the changes: * A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario. * Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed. * A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator. To find the iterator types that needed changes I searched for "public .Iterator" in the code. Here's an overview of all 27 iterator types: Iterators that didn't need changes: status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator. * Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator. Iterators with changes (see inline comments for details): * DBIter - an overhaul: - It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back. - It had a few code paths silently discarding subiterator's status. The stress test caught a few. - The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example. - Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption. - It used to not reset status on seek for some types of errors. - Some simplifications and better comments. - Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer. * MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status. * ForwardIterator - changed to the new convention, also slightly simplified. * ForwardLevelIterator - fixed some bugs and simplified. * LevelIterator - simplified. * TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_. * BlockBasedTableIterator - minor changes. * BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid. * PlainTableIterator - some seeks used to not reset status. * CuckooTableIterator - tiny code cleanup. * ManagedIterator - fixed some bugs. * BaseDeltaIterator - changed to the new convention and fixed a bug. * BlobDBIterator - seeks used to not reset status. * KeyConvertingIterator - some small change. Closes https://github.com/facebook/rocksdb/pull/3810 Differential Revision: D7888019 Pulled By: al13n321 fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d	2018-05-17 02:56:56 -07:00
Sagar Vemuri	8ada876dfe	Add rocksdb.iterator.internal-key property Summary: Added a new iterator property: `rocksdb.iterator.internal-key` to get the internal-key (converted to user key) at which the iterator stopped. Closes https://github.com/facebook/rocksdb/pull/3525 Differential Revision: D7033694 Pulled By: sagar0 fbshipit-source-id: d51e6c00f5e9d766c6276ef79774b81c6c5216f8	2018-02-20 19:12:09 -08:00
Mike Kolupaev	cb5b8f2090	Fix use-after-free in tailing iterator with merge operator Summary: ForwardIterator::SVCleanup() sometimes didn't pin superversion when it was supposed to. See the added test for the scenario. Here's the ASAN output of the added test without the fix (using `COMPILE_WITH_ASAN=1 make`): https://pastebin.com/9rD0Ywws Closes https://github.com/facebook/rocksdb/pull/3415 Differential Revision: D6817414 Pulled By: al13n321 fbshipit-source-id: bc80c44ea78a3a1fa885dfa448a26111f91afb24	2018-02-02 21:26:28 -08:00
Yi Wu	c7226428dd	WritePrepared Txn: Fix DBIterator and add test Summary: In DBIter, Prev() calls FindValueForCurrentKey() to search the current value backward. If it finds that there are too many stale value being skipped, it falls back to FindValueForCurrentKeyUsingSeek(), seeking directly to the key with snapshot sequence. After introducing read_callback, however, the key it seeks to might not be visible, according to read_callback. It thus needs to keep searching forward until the first visible value. Closes https://github.com/facebook/rocksdb/pull/3382 Differential Revision: D6756148 Pulled By: yiwu-arbug fbshipit-source-id: 064e39b1eec5e083af1c10142600f26d1d2697be	2018-01-23 16:57:11 -08:00
Yi Wu	06149429d9	WritePrepared Txn: Return NotSupported on iterator refresh Summary: A proper implementation of Iterator::Refresh() for WritePreparedTxnDB would require release and acquire another snapshot. Since MyRocks don't make use of Iterator::Refresh(), we just simply mark it as not supported. Closes https://github.com/facebook/rocksdb/pull/3290 Differential Revision: D6599931 Pulled By: yiwu-arbug fbshipit-source-id: 4e1632d967316431424f6e458254ecf9a97567cf	2017-12-18 22:29:30 -08:00
Yi Wu	20995c5729	Make iterator invalid on Merge error Summary: Since #1665, on merge error, iterator will be set to corrupted status, but it doesn't invalidate the iterator. Fixing it. Closes https://github.com/facebook/rocksdb/pull/3226 Differential Revision: D6499094 Pulled By: yiwu-arbug fbshipit-source-id: 80222930f949e31f90a6feaa37ddc3529b510d2c	2017-12-06 11:56:39 -08:00
Maysam Yabandeh	18dcf7f98d	WritePrepared Txn: PreReleaseCallback Summary: Add PreReleaseCallback to be called at the end of WriteImpl but before publishing the sequence number. The callback is used in WritePrepareTxn to i) update the commit map, ii) update the last published sequence number in the 2nd write queue. It also ensures that all the commits will go to the 2nd queue. These changes will ensure that the commit map is updated before the sequence number is published and used by reading snapshots. If we use two write queues, the snapshots will use the seq number published by the 2nd queue. If we use one write queue (the default, the snapshots will use the last seq number in the memtable, which also indicates the last published seq number. Closes https://github.com/facebook/rocksdb/pull/3205 Differential Revision: D6438959 Pulled By: maysamyabandeh fbshipit-source-id: f8b6c434e94bc5f5ab9cb696879d4c23e2577ab9	2017-11-30 23:50:45 -08:00
zhangjinpeng1987	ffacaaa3ea	fix Seek with lower_bound Summary: When Seek a key less than `lower_bound`, should return `lower_bound`. ajkr PTAL Closes https://github.com/facebook/rocksdb/pull/3199 Differential Revision: D6421126 Pulled By: ajkr fbshipit-source-id: a06c825830573e0040630704f6bcb3f7f48626f7	2017-11-29 22:56:29 -08:00
anand1976	d394a6bb48	Add a ticker stat for number of keys skipped during iteration Summary: This diff adds a new ticker stat, NUMBER_ITER_SKIP, to count the number of internal keys skipped during iteration. Keys can be skipped due to deletes, or lower sequence number, or higher sequence number than the one requested. Also, fix the issue when StatisticsData is naturally aligned on cacheline boundary, padding becomes a zero size array, which the Windows compiler doesn't like. So add a cacheline worth of padding in that case to keep it happy. We cannot conditionally add padding as gcc doesn't allow using sizeof in preprocessor directives. Closes https://github.com/facebook/rocksdb/pull/3177 Differential Revision: D6353897 Pulled By: anand1976 fbshipit-source-id: 441d5a09af9c4e22e7355242dfc0c7b27aa0a6c2	2017-11-20 21:26:37 -08:00
Maysam Yabandeh	2515266725	WritePrepared Txn: Refactoring TrackKeys Summary: This patch clarifies and refactors the logic around tracked keys in transactions. Closes https://github.com/facebook/rocksdb/pull/3140 Differential Revision: D6290258 Pulled By: maysamyabandeh fbshipit-source-id: 03b50646264cbcc550813c060b180fc7451a55c1	2017-11-11 13:14:20 -08:00
Mikhail Antonov	7fe3b32896	Added support for differential snapshots Summary: The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2). This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages. From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff". This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR. For now, what's done here according to initial discussions: Preserving deletes: - We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion. - I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum. - Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum. Iterator changes: - couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum. TableCache changes: - I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span. What's left: - Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type. Closes https://github.com/facebook/rocksdb/pull/2999 Differential Revision: D6175602 Pulled By: mikhail-antonov fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900	2017-11-01 18:56:43 -07:00
Prashant D	47166baeac	Fix coverity uninitialized fields warnings Pulled By: ajkr Differential Revision: D6170448 fbshipit-source-id: 5fd6d1608fc0df27c94d9f5059315ce7f79b8f5c	2017-10-26 21:11:50 -07:00
Andrew Kryczka	95667383db	implement lower bound for iterators Summary: - for `SeekToFirst()`, just convert it to a regular `Seek()` if lower bound is specified - for operations that iterate backwards over user keys (`SeekForPrev`, `SeekToLast`, `Prev`), change `PrevInternal` to check whether user key went below lower bound every time the user key changes -- same approach we use to ensure we stay within a prefix when `prefix_same_as_start=true`. Closes https://github.com/facebook/rocksdb/pull/3074 Differential Revision: D6158654 Pulled By: ajkr fbshipit-source-id: cb0e3a922e2650d2cd4d1c6e1c0f1e8b729ff518	2017-10-26 17:27:42 -07:00
Islam AbdelRahman	addfe1ef4b	Fix tombstone scans in SeekForPrev outside prefix Summary: When doing a Seek() or SeekForPrev() we should stop the moment we see a key with a different prefix as start if ReadOptions:: prefix_same_as_start was set to true Right now we don't stop if we encounter a tombstone outside the prefix while executing SeekForPrev() Closes https://github.com/facebook/rocksdb/pull/3067 Differential Revision: D6149638 Pulled By: IslamAbdelRahman fbshipit-source-id: 7f659862d2bf552d3c9104a360c79439ceba2f18	2017-10-25 15:12:00 -07:00
Dmitri Smirnov	ebab2e2d42	Enable MSVC W4 with a few exceptions. Fix warnings and bugs Summary: Closes https://github.com/facebook/rocksdb/pull/3018 Differential Revision: D6079011 Pulled By: yiwu-arbug fbshipit-source-id: 988a721e7e7617967859dba71d660fc69f4dff57	2017-10-19 10:57:12 -07:00
Yi Wu	8c392a31d7	WritePrepared Txn: Iterator Summary: On iterator create, take a snapshot, create a ReadCallback and pass the ReadCallback to the underlying DBIter to check if key is committed. Closes https://github.com/facebook/rocksdb/pull/2981 Differential Revision: D6001471 Pulled By: yiwu-arbug fbshipit-source-id: 3565c4cdaf25370ba47008b0e0cb65b31dfe79fe	2017-10-09 17:15:28 -07:00
Yi Wu	d1cab2b64e	Add ValueType::kTypeBlobIndex Summary: Add kTypeBlobIndex value type, which will be used by blob db only, to insert a (key, blob_offset) KV pair. The purpose is to 1. Make it possible to open existing rocksdb instance as blob db. Existing value will be of kTypeIndex type, while value inserted by blob db will be of kTypeBlobIndex. 2. Make rocksdb able to detect if the db contains value written by blob db, if so return error. 3. Make it possible to have blob db optionally store value in SST file (with kTypeValue type) or as a blob value (with kTypeBlobIndex type). The root db (DBImpl) basically pretended kTypeBlobIndex are normal value on write. On Get if is_blob is provided, return whether the value read is of kTypeBlobIndex type, or return Status::NotSupported() status if is_blob is not provided. On scan allow_blob flag is pass and if the flag is true, return wether the value is of kTypeBlobIndex type via iter->IsBlob(). Changes on blob db side will be in a separate patch. Closes https://github.com/facebook/rocksdb/pull/2886 Differential Revision: D5838431 Pulled By: yiwu-arbug fbshipit-source-id: 3c5306c62bc13bb11abc03422ec5cbcea1203cca	2017-10-03 09:11:23 -07:00
Siying Dong	edcbb36944	Three code-level optimization to Iterator::Next() Summary: Three small optimizations: (1) iter_->IsKeyPinned() shouldn't be called if read_options.pin_data is not true. This may trigger function call all the way down the iterator tree. (2) reuse the iterator key object in DBIter::FindNextUserEntryInternal(). The constructor of the class has some overheads. (3) Move the switching direction logic in MergingIterator::Next() to a separate function. These three in total improves readseq performance by about 3% in my benchmark setting. Closes https://github.com/facebook/rocksdb/pull/2880 Differential Revision: D5829252 Pulled By: siying fbshipit-source-id: 991aea10c6d6c3b43769cb4db168db62954ad1e3	2017-09-14 17:57:31 -07:00
Siying Dong	2dd22e5449	Make DBIter class final Summary: DBIter is referenced in ArenaWrappedDBIter, which is a simple wrapper. If DBIter is final, some virtual function call can be avoided. Some functions can even be inlined, like DBIter.value() to ArenaWrappedDBIter.value() and DBIter.key() to ArenaWrappedDBIter.key(). The performance gain is hard to measure. I just ran the memory-only benchmark for readseq and saw it didn't regress. There shouldn't be any harm doing it. Just give compiler more choices. Closes https://github.com/facebook/rocksdb/pull/2859 Differential Revision: D5799888 Pulled By: siying fbshipit-source-id: 829788f91310c40282dcfb7e412e6ef489931143	2017-09-11 12:04:21 -07:00
Andrew Kryczka	ed0a4c93ef	perf_context measure user bytes read Summary: With this PR, we can measure read-amp for queries where perf_context is enabled as follows: ``` SetPerfLevel(kEnableCount); Get(1, "foo"); double read_amp = static_cast<double>(get_perf_context()->block_read_byte / get_perf_context()->get_read_bytes); SetPerfLevel(kDisable); ``` Our internal infra enables perf_context for a sampling of queries. So we'll be able to compute the read-amp for the sample set, which can give us a good estimate of read-amp. Closes https://github.com/facebook/rocksdb/pull/2749 Differential Revision: D5647240 Pulled By: ajkr fbshipit-source-id: ad73550b06990cf040cc4528fa885360f308ec12	2017-08-18 11:43:33 -07:00
Siying Dong	e67b35c076	Add Iterator::Refresh() Summary: Add and implement Iterator::Refresh(). When this function is called, if the super version doesn't change, update the sequence number of the iterator to the latest one and invalidate the iterator. If the super version changed, recreated the whole iterator. This can help users reuse the iterator more easily. Closes https://github.com/facebook/rocksdb/pull/2621 Differential Revision: D5464500 Pulled By: siying fbshipit-source-id: f548bd35e85c1efca2ea69273802f6704eba6ba9	2017-07-24 10:54:37 -07:00
Siying Dong	3c327ac2d0	Change RocksDB License Summary: Closes https://github.com/facebook/rocksdb/pull/2589 Differential Revision: D5431502 Pulled By: siying fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75	2017-07-15 16:11:23 -07:00
Siying Dong	18c63af6ef	Make "make analyze" happy Summary: "make analyze" is reporting some errors. It's complicated to look but it seems to me that they are all false positive. Anyway, I think cleaning them up is a good idea. Some of the changes are hacky but I don't know a better way. Closes https://github.com/facebook/rocksdb/pull/2508 Differential Revision: D5341710 Pulled By: siying fbshipit-source-id: 6070e430e0e41a080ef441e05e8ec827d45efab6	2017-06-28 15:42:27 -07:00
Siying Dong	51ac91f586	Histogram of number of merge operands Summary: Add a histogram in statistics to help users understand how many merge operands they merge. Closes https://github.com/facebook/rocksdb/pull/2373 Differential Revision: D5139983 Pulled By: siying fbshipit-source-id: 61b9ba8ca83f358530a4833d68f0103b56a0e182	2017-05-31 07:41:44 -07:00
Sagar Vemuri	7d8207f1f2	Fix errors in clang-analyzer builds Summary: Fix build error in db_iter.cc when running clang-analyzer. ``` CC db/db_iter.o db/db_iter.cc:938:21: error: no matching constructor for initialization of 'rocksdb::ParsedInternalKey' ParsedInternalKey ikey(Slice(), 0, 0); ^ ~~~~~~~~~~~~~ ./db/dbformat.h:84:3: note: candidate constructor not viable: no known conversion from 'int' to 'rocksdb::ValueType' for 3rd argument ParsedInternalKey(const Slice& u, const SequenceNumber& seq, ValueType t) ^ ./db/dbformat.h:78:8: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 3 were provided struct ParsedInternalKey { ^ ./db/dbformat.h:78:8: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 3 were provided ./db/dbformat.h:83:3: note: candidate constructor not viable: requires 0 arguments, but 3 were provided ParsedInternalKey() { } // Intentionally left uninitialized (for speed) ^ 1 error generated. ``` Closes https://github.com/facebook/rocksdb/pull/2354 Differential Revision: D5115751 Pulled By: sagar0 fbshipit-source-id: b0e386d4e935e4725b07761c3ca5f7a8cbde3692	2017-05-23 15:11:42 -07:00
Yi Wu	d746aead1a	Suppress clang-analyzer false positive Summary: Fixing two types of clang-analyzer false positives: * db is deleted and then reopen, and clang-analyzer thinks we are reusing the pointer after it has been deleted. Adding asserts to hint clang-analyzer the pointer is recreated. * ParsedInternalKey is (intentionally) uninitialized. Initialize the struct only when clang-analyzer is running. Closes https://github.com/facebook/rocksdb/pull/2334 Differential Revision: D5093801 Pulled By: yiwu-arbug fbshipit-source-id: f51355382098eb3da5ab9f64e094c6d03e6bdf7d	2017-05-19 10:56:28 -07:00
Siying Dong	d616ebea23	Add GPLv2 as an alternative license. Summary: Closes https://github.com/facebook/rocksdb/pull/2226 Differential Revision: D4967547 Pulled By: siying fbshipit-source-id: dd3b58ae1e7a106ab6bb6f37ab5c88575b125ab4	2017-04-27 18:06:12 -07:00
Sagar Vemuri	7124268a09	Reduce the number of params needed to construct DBIter Summary: DBIter, and in-turn NewDBIterator and NewArenaWrappedDBIterator, take a bunch of params. They can be reduced by passing in ReadOptions directly instead of passing in every new param separately. It also seems much cleaner as a bunch of the params towards the end seem to be optional. (Recently I introduced max_skippable_internal_keys, which added one more to the already huge count). Idea courtesy IslamAbdelRahman Closes https://github.com/facebook/rocksdb/pull/2116 Differential Revision: D4857128 Pulled By: sagar0 fbshipit-source-id: 7d239df094b94bd9ea79d145cdf825478ac037a8	2017-04-10 11:14:14 -07:00
Siying Dong	d2dce5611a	Move some files under util/ to separate dirs Summary: Move some files under util/ to new directories env/, monitoring/ options/ and cache/ Closes https://github.com/facebook/rocksdb/pull/2090 Differential Revision: D4833681 Pulled By: siying fbshipit-source-id: 2fd8bef	2017-04-05 19:09:16 -07:00
Aaron Gao	90cfd46458	update IterKey that can get user key and internal key explicitly Summary: to void future bug that caused by the mix of userkey/internalkey Closes https://github.com/facebook/rocksdb/pull/2084 Differential Revision: D4825889 Pulled By: lightmark fbshipit-source-id: 28411db	2017-04-04 14:24:20 -07:00
Siying Dong	6ef8c620d3	Move auto_roll_logger and filename out of db/ Summary: It is confusing to have auto_roll_logger to stay under db/, which has nothing to do with database. Move filename together as it is a dependency. Closes https://github.com/facebook/rocksdb/pull/2080 Differential Revision: D4821141 Pulled By: siying fbshipit-source-id: ca7d768	2017-04-03 18:39:14 -07:00
Sagar Vemuri	c6d04f2ecf	Option to fail a request as incomplete when skipping too many internal keys Summary: Operations like Seek/Next/Prev sometimes take too long to complete when there are many internal keys to be skipped. Adding an option, max_skippable_internal_keys -- which could be used to set a threshold for the maximum number of keys that can be skipped, will help to address these cases where it is much better to fail a request (as incomplete) than to wait for a considerable time for the request to complete. This feature -- to fail an iterator seek request as incomplete, is disabled by default when max_skippable_internal_keys = 0. It is enabled only when max_skippable_internal_keys > 0. This feature is based on the discussion mentioned in the PR https://github.com/facebook/rocksdb/pull/1084. Closes https://github.com/facebook/rocksdb/pull/2000 Differential Revision: D4753223 Pulled By: sagar0 fbshipit-source-id: 1c973f7	2017-03-30 12:09:21 -07:00
Islam AbdelRahman	e19163688b	Add macros to include file name and line number during Logging Summary: current logging ``` 2017/03/14-14:20:30.393432 7fedde9f5700 (Original Log Time 2017/03/14-14:20:30.393414) [default] Level summary: base level 1 max bytes base 268435456 files[1 0 0 0 0 0 0] max score 0.25 2017/03/14-14:20:30.393438 7fedde9f5700 [JOB 2] Try to delete WAL files size 61417909, prev total WAL file size 73820858, number of live WAL files 2. 2017/03/14-14:20:30.393464 7fedde9f5700 [DEBUG] [JOB 2] Delete /dev/shm/old_logging//MANIFEST-000001 type=3 #1 -- OK 2017/03/14-14:20:30.393472 7fedde9f5700 [DEBUG] [JOB 2] Delete /dev/shm/old_logging//000003.log type=0 #3 -- OK 2017/03/14-14:20:31.427103 7fedd49f1700 [default] New memtable created with log file: #9. Immutable memtables: 0. 2017/03/14-14:20:31.427179 7fedde9f5700 [JOB 3] Syncing log #6 2017/03/14-14:20:31.427190 7fedde9f5700 (Original Log Time 2017/03/14-14:20:31.427170) Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots allowed 1, compaction slots scheduled 1 2017/03/14-14:20:31. Closes https://github.com/facebook/rocksdb/pull/1990 Differential Revision: D4708695 Pulled By: IslamAbdelRahman fbshipit-source-id: cb8968f	2017-03-15 19:39:12 -07:00
Aaron Gao	12ba00ea65	Reset DBIter::saved_key_ with proper user key anywhere before pass to DBIter::FindNextUserEntry Summary: fix db_iter bug introduced by [facebook#1413](https://github.com/facebook/rocksdb/pull/1413) Closes https://github.com/facebook/rocksdb/pull/1962 Differential Revision: D4672369 Pulled By: lightmark fbshipit-source-id: 6a22953	2017-03-08 17:24:11 -08:00
Andrew Kryczka	b104b87814	Maintain position in range deletions map Summary: When deletion-collapsing mode is enabled (i.e., for DBIter/CompactionIterator), we maintain position in the tombstone maps across calls to ShouldDelete(). Since iterators often access keys sequentially (or reverse-sequentially), scanning forward/backward from the last position can be faster than binary-searching the map for every key. - When Next() is invoked on an iterator, we use kForwardTraversal to scan forwards, if needed, until arriving at the range deletion containing the next key. - Similarly for Prev(), we use kBackwardTraversal to scan backwards in the range deletion map. - When the iterator seeks, we use kBinarySearch for repositioning - After tombstones are added or before the first ShouldDelete() invocation, the current position is set to invalid, which forces kBinarySearch to be used. - Non-iterator users (i.e., Get()) use kFullScan, which has the same behavior as before---scan the whole map for every key passed to ShouldDelete(). Closes https://github.com/facebook/rocksdb/pull/1701 Differential Revision: D4350318 Pulled By: ajkr fbshipit-source-id: 5129b76	2017-01-05 10:39:12 -08:00
Andrew Kryczka	50e305de98	Collapse range deletions Summary: Added a tombstone-collapsing mode to RangeDelAggregator, which eliminates overlap in the TombstoneMap. In this mode, we can check whether a tombstone covers a user key using upper_bound() (i.e., binary search). However, the tradeoff is the overhead to add tombstones is now higher, so at first I've only enabled it for range scans (compaction/flush/user iterators), where we expect a high number of calls to ShouldDelete() for the same tombstones. Point queries like Get() will still use the linear scan approach. Also in this diff I changed RangeDelAggregator's TombstoneMap to use multimap with user keys instead of map with internal keys. Callers sometimes provided ParsedInternalKey directly, from which it would've required string copying to derive an internal key Slice with which we could search the map. Closes https://github.com/facebook/rocksdb/pull/1614 Differential Revision: D4270397 Pulled By: ajkr fbshipit-source-id: 93092c7	2016-12-19 16:54:12 -08:00
Yi Wu	c270735861	Iterator should be in corrupted status if merge operator return false Summary: Iterator should be in corrupted status if merge operator return false. Also add test to make sure if max_successive_merges is hit during write, data will not be lost. Closes https://github.com/facebook/rocksdb/pull/1665 Differential Revision: D4322695 Pulled By: yiwu-arbug fbshipit-source-id: b327b05	2016-12-16 11:09:16 -08:00
Siying Dong	b3b875657f	Remove unused assignment in db/db_iter.cc Summary: "make analyze" complains the assignment is not useful. Remove it. Closes https://github.com/facebook/rocksdb/pull/1581 Differential Revision: D4241697 Pulled By: siying fbshipit-source-id: 178f67a	2016-11-29 09:09:14 -08:00
Mike Kolupaev	236d4c67e9	Less linear search in DBIter::Seek() when keys are overwritten a lot Summary: In one deployment we saw high latencies (presumably from slow iterator operations) and a lot of CPU time reported by perf with this stack: ``` rocksdb::MergingIterator::Next rocksdb::DBIter::FindNextUserEntryInternal rocksdb::DBIter::Seek ``` I think what's happening is: 1. we create a snapshot iterator, 2. we do lots of Put()s for the same key x; this creates lots of entries in memtable, 3. we seek the iterator to a key slightly smaller than x, 4. the seek walks over lots of entries in memtable for key x, skipping them because of high sequence numbers. CC IslamAbdelRahman Closes https://github.com/facebook/rocksdb/pull/1413 Differential Revision: D4083879 Pulled By: IslamAbdelRahman fbshipit-source-id: a83ddae	2016-11-28 10:24:11 -08:00
Andrew Kryczka	fd43ee09da	Range deletion microoptimizations Summary: - Made RangeDelAggregator's InternalKeyComparator member a reference-to-const so we don't need to copy-construct it. Also added InternalKeyComparator to ImmutableCFOptions so we don't need to construct one for each DBIter. - Made MemTable::NewRangeTombstoneIterator and the table readers' NewRangeTombstoneIterator() functions return nullptr instead of NewEmptyInternalIterator to avoid the allocation. Updated callers accordingly. Closes https://github.com/facebook/rocksdb/pull/1548 Differential Revision: D4208169 Pulled By: ajkr fbshipit-source-id: 2fd65cf	2016-11-21 12:24:13 -08:00
Andrew Kryczka	3f62215210	Lazily initialize RangeDelAggregator's map and pinning manager Summary: Since a RangeDelAggregator is created for each read request, these heap-allocating member variables were consuming significant CPU (~3% total) which slowed down request throughput. The map and pinning manager are only necessary when range deletions exist, so we can defer their initialization until the first range deletion is encountered. Currently lazy initialization is done for reads only since reads pass us a single snapshot, which is easier to store on the stack for later insertion into the map than the vector passed to us by flush or compaction. Note the Arena member variable is still expensive, I will figure out what to do with it in a subsequent diff. It cannot be lazily initialized because we currently use this arena even to allocate empty iterators, which is necessary even when no range deletions exist. Closes https://github.com/facebook/rocksdb/pull/1539 Differential Revision: D4203488 Pulled By: ajkr fbshipit-source-id: 3b36279	2016-11-18 17:09:11 -08:00
Andrew Kryczka	9e7cf3469b	DeleteRange user iterator support Summary: Note: reviewed in https://reviews.facebook.net/D65115 - DBIter maintains a range tombstone accumulator. We don't cleanup obsolete tombstones yet, so if the user seeks back and forth, the same tombstones would be added to the accumulator multiple times. - DBImpl::NewInternalIterator() (used to make DBIter's underlying iterator) adds memtable/L0 range tombstones, L1+ range tombstones are added on-demand during NewSecondaryIterator() (see D62205) - DBIter uses ShouldDelete() when advancing to check whether keys are covered by range tombstones Closes https://github.com/facebook/rocksdb/pull/1464 Differential Revision: D4131753 Pulled By: ajkr fbshipit-source-id: be86559	2016-11-04 12:09:22 -07:00
Aaron Gao	5e0d6b4cc9	fix db_stress assertion failure Summary: in rocksdb::DBIter::FindValueForCurrentKey(), last_not_merge_type could also be SingleDelete() which is omitted Test Plan: db_iter_test Reviewers: yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65187	2016-10-18 16:07:10 -07:00
Aaron Gao	21e8daced5	fix assertion failure in Prev() Summary: fix assertion failure in db_stress. It happens because of prefix seek key is larger than merge iterator key when they have the same user key Test Plan: ./db_stress --max_background_compactions=1 --max_write_buffer_number=3 --sync=0 --reopen=20 --write_buffer_size=33554432 --delpercent=5 --log2_keys_per_lock=10 --block_size=16384 --allow_concurrent_memtable_write=0 --test_batches_snapshots=0 --max_bytes_for_level_base=67108864 --progress_reports=0 --mmap_read=0 --writepercent=35 --disable_data_sync=0 --readpercent=50 --subcompactions=4 --ops_per_thread=20000000 --memtablerep=skip_list --prefix_size=0 --target_file_size_multiplier=1 --column_families=1 --threads=32 --disable_wal=0 --open_files=500000 --destroy_db_initially=0 --target_file_size_base=16777216 --nooverwritepercent=1 --iterpercent=10 --max_key=100000000 --prefixpercent=0 --use_clock_cache=false --kill_random_test=888887 --cache_size=1048576 --verify_checksum=1 Reviewers: sdong, andrewkr, yiwu, yhchiang Reviewed By: yhchiang Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65025	2016-10-13 17:36:48 -07:00
Aaron Gao	447f17127c	new Prev() prefix support using SeekForPrev() Summary: 1) The previous solution for Prev() prefix support is not clean. Since I add api SeekForPrev(), now the Prev() can be symmetric to Next(). and we do not need SeekToLast() to be called in Prev() any more. Also, Next() will Seek(prefix_seek_key_) to solve the problem of possible inconsistency between db_iter and merge_iter when there is merge_operator. And prefix_seek_key is only refreshed when change direction to forward. 2) This diff also solves the bug of Iterator::SeekToLast() with iterate_upper_bound_ with prefix extractor. add test cases for the above two cases. There are some tests for the SeekToLast() in Prev(), I will clean them later. Test Plan: make all check Reviewers: IslamAbdelRahman, andrewkr, yiwu, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D63933	2016-10-11 13:54:26 -07:00
Aaron Gao	26388247aa	delete unused variable for PrevInterval() Summary: delete unused variable Test Plan: make check Reviewers: sdong, andrewkr, IslamAbdelRahman, tianx Reviewed By: tianx Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64509	2016-09-29 13:19:58 -07:00
Aaron Gao	f517d9dd09	Add SeekForPrev() to Iterator Summary: Add new Iterator API, `SeekForPrev`: find the last key that <= target key support prefix_extractor support prefix_same_as_start support upper_bound not supported in iterators without Prev() Also add tests in db_iter_test and db_iterator_test Pass all tests Cheers! Test Plan: make all check -j64 Reviewers: andrewkr, yiwu, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64149	2016-09-27 18:20:57 -07:00
Islam AbdelRahman	1cca091298	Temporarily revert Prev() prefix support Summary: Temporarily revert commits for supporting prefix Prev() to unblock MyRocks and RocksDB release These are the commits reverted - `6a14d55bd9` - `b18f9c9eac` - `db74b1a219` - `2482d5fb45` Test Plan: make check -j64 Reviewers: sdong, lightmark Reviewed By: lightmark Subscribers: andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D63789	2016-09-08 14:45:32 -07:00
Aaron Gao	2482d5fb45	support Prev() in prefix seek mode Summary: As title, make sure Prev() works as expected with Next() when the current iter->key() in the range of the same prefix in prefix seek mode Test Plan: make all check -j64 (add prefix_test with PrefixSeekModePrev test case) Reviewers: andrewkr, sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: yoshinorim, andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D61419	2016-08-29 20:55:39 -07:00
Islam AbdelRahman	b693ba68b5	Minor PinnedIteratorsManager Refactoring Summary: This diff include these simple change - Rename ReleasePinnedIterators to ReleasePinnedData - Rename PinIteratorIfNeeded to PinIterator - Use std::vector directly in PinnedIteratorsManager instead of std::unique_ptr<std::vector> - Generalize PinnedIteratorsManager by adding PinPtr which can pin any pointer Test Plan: existing tests Reviewers: sdong, yiwu, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D61305	2016-08-11 11:54:17 -07:00
Islam AbdelRahman	68a8e6b8fa	Introduce FullMergeV2 (eliminate memcpy from merge operators) Summary: This diff update the code to pin the merge operator operands while the merge operation is done, so that we can eliminate the memcpy cost, to do that we need a new public API for FullMerge that replace the std::deque<std::string> with std::vector<Slice> This diff is stacked on top of D56493 and D56511 In this diff we - Update FullMergeV2 arguments to be encapsulated in MergeOperationInput and MergeOperationOutput which will make it easier to add new arguments in the future - Replace std::deque<std::string> with std::vector<Slice> to pass operands - Replace MergeContext std::deque with std::vector (based on a simple benchmark I ran https://gist.github.com/IslamAbdelRahman/78fc86c9ab9f52b1df791e58943fb187) - Allow FullMergeV2 output to be an existing operand ``` [Everything in Memtable \| 10K operands \| 10 KB each \| 1 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=10000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000 [FullMergeV2] readseq : 0.607 micros/op 1648235 ops/sec; 16121.2 MB/s readseq : 0.478 micros/op 2091546 ops/sec; 20457.2 MB/s readseq : 0.252 micros/op 3972081 ops/sec; 38850.5 MB/s readseq : 0.237 micros/op 4218328 ops/sec; 41259.0 MB/s readseq : 0.247 micros/op 4043927 ops/sec; 39553.2 MB/s [master] readseq : 3.935 micros/op 254140 ops/sec; 2485.7 MB/s readseq : 3.722 micros/op 268657 ops/sec; 2627.7 MB/s readseq : 3.149 micros/op 317605 ops/sec; 3106.5 MB/s readseq : 3.125 micros/op 320024 ops/sec; 3130.1 MB/s readseq : 4.075 micros/op 245374 ops/sec; 2400.0 MB/s ``` ``` [Everything in Memtable \| 10K operands \| 10 KB each \| 10 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=1000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000 [FullMergeV2] readseq : 3.472 micros/op 288018 ops/sec; 2817.1 MB/s readseq : 2.304 micros/op 434027 ops/sec; 4245.2 MB/s readseq : 1.163 micros/op 859845 ops/sec; 8410.0 MB/s readseq : 1.192 micros/op 838926 ops/sec; 8205.4 MB/s readseq : 1.250 micros/op 800000 ops/sec; 7824.7 MB/s [master] readseq : 24.025 micros/op 41623 ops/sec; 407.1 MB/s readseq : 18.489 micros/op 54086 ops/sec; 529.0 MB/s readseq : 18.693 micros/op 53495 ops/sec; 523.2 MB/s readseq : 23.621 micros/op 42335 ops/sec; 414.1 MB/s readseq : 18.775 micros/op 53262 ops/sec; 521.0 MB/s ``` ``` [Everything in Block cache \| 10K operands \| 10 KB each \| 1 operand per key] [FullMergeV2] $ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions readseq : 14.741 micros/op 67837 ops/sec; 663.5 MB/s readseq : 1.029 micros/op 971446 ops/sec; 9501.6 MB/s readseq : 0.974 micros/op 1026229 ops/sec; 10037.4 MB/s readseq : 0.965 micros/op 1036080 ops/sec; 10133.8 MB/s readseq : 0.943 micros/op 1060657 ops/sec; 10374.2 MB/s [master] readseq : 16.735 micros/op 59755 ops/sec; 584.5 MB/s readseq : 3.029 micros/op 330151 ops/sec; 3229.2 MB/s readseq : 3.136 micros/op 318883 ops/sec; 3119.0 MB/s readseq : 3.065 micros/op 326245 ops/sec; 3191.0 MB/s readseq : 3.014 micros/op 331813 ops/sec; 3245.4 MB/s ``` ``` [Everything in Block cache \| 10K operands \| 10 KB each \| 10 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10-operands-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions [FullMergeV2] readseq : 24.325 micros/op 41109 ops/sec; 402.1 MB/s readseq : 1.470 micros/op 680272 ops/sec; 6653.7 MB/s readseq : 1.231 micros/op 812347 ops/sec; 7945.5 MB/s readseq : 1.091 micros/op 916590 ops/sec; 8965.1 MB/s readseq : 1.109 micros/op 901713 ops/sec; 8819.6 MB/s [master] readseq : 27.257 micros/op 36687 ops/sec; 358.8 MB/s readseq : 4.443 micros/op 225073 ops/sec; 2201.4 MB/s readseq : 5.830 micros/op 171526 ops/sec; 1677.7 MB/s readseq : 4.173 micros/op 239635 ops/sec; 2343.8 MB/s readseq : 4.150 micros/op 240963 ops/sec; 2356.8 MB/s ``` Test Plan: COMPILE_WITH_ASAN=1 make check -j64 Reviewers: yhchiang, andrewkr, sdong Reviewed By: sdong Subscribers: lovro, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57075	2016-07-20 09:49:03 -07:00
Islam AbdelRahman	7c919deccc	Reuse TimedFullMerge instead of FullMerge + instrumentation Summary: We have alot of code duplication whenever we call FullMerge we keep duplicating the instrumentation and statistics code This is a simple diff to refactor the code to use TimedFullMerge instead of FullMerge Test Plan: COMPILE_WITH_ASAN=1 make check -j64 Reviewers: andrewkr, yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59577	2016-06-13 16:17:26 -07:00
Islam AbdelRahman	ff4b3fb5b4	Fix Iterator::Prev memory pinning bug Summary: We should not use IterKey::SetKey with copy = false except if we are pinning the iterator thru it's life time, otherwise we may release the temporarily pinned blocks and in this case the IterKey will be pointing to freed memory Test Plan: added a new test Reviewers: sdong, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57561	2016-05-03 16:50:01 -07:00
Islam AbdelRahman	6e801b0bd1	Eliminate memcpy in Iterator::Prev() by pinning blocks for keys spanning multiple blocks Summary: This diff is stacked on top of this diff https://reviews.facebook.net/D56493 The current Iterator::Prev() implementation need to copy every value since the underlying Iterator may move after reading the value. This can be optimized by making sure that the block containing the value is pinned until the Iterator move. which will improve the throughput by up to 1.5X master ``` ==> 1000000_Keys_100Byte.txt <== readreverse : 0.449 micros/op 2225887 ops/sec; 246.2 MB/s readreverse : 0.433 micros/op 2311508 ops/sec; 255.7 MB/s readreverse : 0.436 micros/op 2294335 ops/sec; 253.8 MB/s readreverse : 0.471 micros/op 2121295 ops/sec; 234.7 MB/s readreverse : 0.465 micros/op 2152227 ops/sec; 238.1 MB/s readreverse : 0.454 micros/op 2203011 ops/sec; 243.7 MB/s readreverse : 0.451 micros/op 2216095 ops/sec; 245.2 MB/s readreverse : 0.462 micros/op 2162447 ops/sec; 239.2 MB/s readreverse : 0.476 micros/op 2099151 ops/sec; 232.2 MB/s readreverse : 0.472 micros/op 2120710 ops/sec; 234.6 MB/s avg : 242.34 MB/s ==> 1000000_Keys_1KB.txt <== readreverse : 1.013 micros/op 986793 ops/sec; 978.7 MB/s readreverse : 0.942 micros/op 1061136 ops/sec; 1052.5 MB/s readreverse : 0.951 micros/op 1051901 ops/sec; 1043.3 MB/s readreverse : 0.932 micros/op 1072894 ops/sec; 1064.1 MB/s readreverse : 1.024 micros/op 976720 ops/sec; 968.7 MB/s readreverse : 0.935 micros/op 1069169 ops/sec; 1060.4 MB/s readreverse : 1.012 micros/op 988132 ops/sec; 980.1 MB/s readreverse : 0.962 micros/op 1039579 ops/sec; 1031.1 MB/s readreverse : 0.991 micros/op 1008924 ops/sec; 1000.7 MB/s readreverse : 1.004 micros/op 996144 ops/sec; 988.0 MB/s avg : 1016.76 MB/s ==> 1000000_Keys_10KB.txt <== readreverse : 4.167 micros/op 239952 ops/sec; 2346.9 MB/s readreverse : 4.070 micros/op 245713 ops/sec; 2403.3 MB/s readreverse : 4.572 micros/op 218733 ops/sec; 2139.4 MB/s readreverse : 4.497 micros/op 222388 ops/sec; 2175.2 MB/s readreverse : 4.203 micros/op 237920 ops/sec; 2327.1 MB/s readreverse : 4.206 micros/op 237756 ops/sec; 2325.5 MB/s readreverse : 4.181 micros/op 239149 ops/sec; 2339.1 MB/s readreverse : 4.157 micros/op 240552 ops/sec; 2352.8 MB/s readreverse : 4.187 micros/op 238848 ops/sec; 2336.1 MB/s readreverse : 4.106 micros/op 243575 ops/sec; 2382.4 MB/s avg : 2312.78 MB/s ==> 100000_Keys_100KB.txt <== readreverse : 41.281 micros/op 24224 ops/sec; 2366.0 MB/s readreverse : 39.722 micros/op 25175 ops/sec; 2458.9 MB/s readreverse : 40.319 micros/op 24802 ops/sec; 2422.5 MB/s readreverse : 39.762 micros/op 25149 ops/sec; 2456.4 MB/s readreverse : 40.916 micros/op 24440 ops/sec; 2387.1 MB/s readreverse : 41.188 micros/op 24278 ops/sec; 2371.4 MB/s readreverse : 40.061 micros/op 24962 ops/sec; 2438.1 MB/s readreverse : 40.221 micros/op 24862 ops/sec; 2428.4 MB/s readreverse : 40.084 micros/op 24947 ops/sec; 2436.7 MB/s readreverse : 40.655 micros/op 24597 ops/sec; 2402.4 MB/s avg : 2416.79 MB/s ==> 10000_Keys_1MB.txt <== readreverse : 298.038 micros/op 3355 ops/sec; 3355.3 MB/s readreverse : 335.001 micros/op 2985 ops/sec; 2985.1 MB/s readreverse : 286.956 micros/op 3484 ops/sec; 3484.9 MB/s readreverse : 329.954 micros/op 3030 ops/sec; 3030.8 MB/s readreverse : 306.428 micros/op 3263 ops/sec; 3263.5 MB/s readreverse : 330.749 micros/op 3023 ops/sec; 3023.5 MB/s readreverse : 328.903 micros/op 3040 ops/sec; 3040.5 MB/s readreverse : 324.853 micros/op 3078 ops/sec; 3078.4 MB/s readreverse : 320.488 micros/op 3120 ops/sec; 3120.3 MB/s readreverse : 320.536 micros/op 3119 ops/sec; 3119.8 MB/s avg : 3150.21 MB/s ``` After memcpy elimination ``` ==> 1000000_Keys_100Byte.txt <== readreverse : 0.395 micros/op 2529890 ops/sec; 279.9 MB/s readreverse : 0.368 micros/op 2715922 ops/sec; 300.5 MB/s readreverse : 0.384 micros/op 2603929 ops/sec; 288.1 MB/s readreverse : 0.375 micros/op 2663286 ops/sec; 294.6 MB/s readreverse : 0.357 micros/op 2802180 ops/sec; 310.0 MB/s readreverse : 0.363 micros/op 2757684 ops/sec; 305.1 MB/s readreverse : 0.372 micros/op 2689603 ops/sec; 297.5 MB/s readreverse : 0.379 micros/op 2638599 ops/sec; 291.9 MB/s readreverse : 0.375 micros/op 2663803 ops/sec; 294.7 MB/s readreverse : 0.375 micros/op 2665579 ops/sec; 294.9 MB/s avg: 295.72 MB/s (1.22 X) ==> 1000000_Keys_1KB.txt <== readreverse : 0.879 micros/op 1138112 ops/sec; 1128.8 MB/s readreverse : 0.842 micros/op 1187998 ops/sec; 1178.3 MB/s readreverse : 0.837 micros/op 1194915 ops/sec; 1185.1 MB/s readreverse : 0.845 micros/op 1182983 ops/sec; 1173.3 MB/s readreverse : 0.877 micros/op 1140308 ops/sec; 1131.0 MB/s readreverse : 0.849 micros/op 1177581 ops/sec; 1168.0 MB/s readreverse : 0.915 micros/op 1093284 ops/sec; 1084.3 MB/s readreverse : 0.863 micros/op 1159418 ops/sec; 1149.9 MB/s readreverse : 0.895 micros/op 1117670 ops/sec; 1108.5 MB/s readreverse : 0.852 micros/op 1174116 ops/sec; 1164.5 MB/s avg: 1147.17 MB/s (1.12 X) ==> 1000000_Keys_10KB.txt <== readreverse : 3.870 micros/op 258386 ops/sec; 2527.2 MB/s readreverse : 3.568 micros/op 280296 ops/sec; 2741.5 MB/s readreverse : 4.005 micros/op 249694 ops/sec; 2442.2 MB/s readreverse : 3.550 micros/op 281719 ops/sec; 2755.5 MB/s readreverse : 3.562 micros/op 280758 ops/sec; 2746.1 MB/s readreverse : 3.507 micros/op 285125 ops/sec; 2788.8 MB/s readreverse : 3.463 micros/op 288739 ops/sec; 2824.1 MB/s readreverse : 3.428 micros/op 291734 ops/sec; 2853.4 MB/s readreverse : 3.553 micros/op 281491 ops/sec; 2753.2 MB/s readreverse : 3.535 micros/op 282885 ops/sec; 2766.9 MB/s avg : 2719.89 MB/s (1.17 X) ==> 100000_Keys_100KB.txt <== readreverse : 22.815 micros/op 43830 ops/sec; 4281.0 MB/s readreverse : 29.957 micros/op 33381 ops/sec; 3260.4 MB/s readreverse : 25.334 micros/op 39473 ops/sec; 3855.4 MB/s readreverse : 23.037 micros/op 43409 ops/sec; 4239.8 MB/s readreverse : 27.810 micros/op 35958 ops/sec; 3512.1 MB/s readreverse : 30.327 micros/op 32973 ops/sec; 3220.6 MB/s readreverse : 29.704 micros/op 33665 ops/sec; 3288.2 MB/s readreverse : 29.423 micros/op 33987 ops/sec; 3319.6 MB/s readreverse : 23.334 micros/op 42856 ops/sec; 4185.9 MB/s readreverse : 29.969 micros/op 33368 ops/sec; 3259.1 MB/s avg : 3642.21 MB/s (1.5 X) ==> 10000_Keys_1MB.txt <== readreverse : 244.748 micros/op 4085 ops/sec; 4085.9 MB/s readreverse : 230.208 micros/op 4343 ops/sec; 4344.0 MB/s readreverse : 235.655 micros/op 4243 ops/sec; 4243.6 MB/s readreverse : 235.730 micros/op 4242 ops/sec; 4242.2 MB/s readreverse : 237.346 micros/op 4213 ops/sec; 4213.3 MB/s readreverse : 227.306 micros/op 4399 ops/sec; 4399.4 MB/s readreverse : 194.957 micros/op 5129 ops/sec; 5129.4 MB/s readreverse : 238.359 micros/op 4195 ops/sec; 4195.4 MB/s readreverse : 221.588 micros/op 4512 ops/sec; 4513.0 MB/s readreverse : 235.911 micros/op 4238 ops/sec; 4239.0 MB/s avg : 4360.52 MB/s (1.38 X) ``` Test Plan: COMPILE_WITH_ASAN=1 make check -j64 Reviewers: andrewkr, yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56511	2016-05-02 21:46:30 -07:00
Peter Mattis	c6c770a1ac	Use prefix_same_as_start to avoid iteration in FindNextUserEntryInternal. (#1102 ) This avoids excessive iteration in tombstone fields.	2016-04-28 16:48:03 -07:00
Islam AbdelRahman	d719b095dc	Introduce PinnedIteratorsManager (Reduce PinData() overhead / Refactor PinData) Summary: While trying to reuse PinData() / ReleasePinnedData() .. to optimize away some memcpys I realized that there is a significant overhead for using PinData() / ReleasePinnedData if they were called many times. This diff refactor the pinning logic by introducing PinnedIteratorsManager a centralized component that will be created once and will be notified whenever we need to Pin an Iterator. This implementation have much less overhead than the original implementation Test Plan: make check -j64 COMPILE_WITH_ASAN=1 make check -j64 Reviewers: yhchiang, sdong, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56493	2016-04-26 12:41:07 -07:00
Islam AbdelRahman	8a1a603fdb	Eliminate std::deque initialization while iterating over merge operands Summary: This patch is similar to D52563, When we iterate over a DB with merge operands we keep creating std::queue to store the operands, optimize this by reusing merge_operands_ data member Before the patch ``` ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq" --db="/dev/shm/bench_merge_memcpy_on_the_fly/" --merge_operator="put" --merge_keys=10000 --num=10000 DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] mergerandom : 3.757 micros/op 266141 ops/sec; 29.4 MB/s ( updates:10000) DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.413 micros/op 2423538 ops/sec; 268.1 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.451 micros/op 2219071 ops/sec; 245.5 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.420 micros/op 2382039 ops/sec; 263.5 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.408 micros/op 2452017 ops/sec; 271.3 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] mergerandom : 3.947 micros/op 253376 ops/sec; 28.0 MB/s ( updates:10000) DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.441 micros/op 2266473 ops/sec; 250.7 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.471 micros/op 2122033 ops/sec; 234.8 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.440 micros/op 2271407 ops/sec; 251.3 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.429 micros/op 2331471 ops/sec; 257.9 MB/s ``` with the patch ``` ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq" --db="/dev/shm/bench_merge_memcpy_on_the_fly/" --merge_operator="put" --merge_keys=10000 --num=10000 DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] mergerandom : 4.080 micros/op 245092 ops/sec; 27.1 MB/s ( updates:10000) DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.308 micros/op 3241843 ops/sec; 358.6 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.312 micros/op 3200408 ops/sec; 354.0 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.332 micros/op 3013962 ops/sec; 333.4 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.300 micros/op 3328017 ops/sec; 368.2 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] mergerandom : 3.973 micros/op 251705 ops/sec; 27.8 MB/s ( updates:10000) DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.320 micros/op 3123752 ops/sec; 345.6 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.335 micros/op 2986641 ops/sec; 330.4 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.339 micros/op 2950047 ops/sec; 326.4 MB/s DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/] readseq : 0.319 micros/op 3131565 ops/sec; 346.4 MB/s ``` Test Plan: make check -j64 Reviewers: yhchiang, andrewkr, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56031	2016-04-01 15:48:55 -07:00
Islam AbdelRahman	580fede347	Aggregate hot Iterator counters in LocalStatistics (DBIter::Next perf regression) Summary: This patch bump the counters in the frequent code path DBIter::Next() / DBIter::Prev() in a local data members and send them to Statistics when the iterator is destroyed A better solution will be to have thread_local implementation for Statistics New performance ``` readseq : 0.035 micros/op 28597881 ops/sec; 3163.7 MB/s 1,851,568,819 stalled-cycles-frontend # 31.29% frontend cycles idle [49.86%] 884,929,823 stalled-cycles-backend # 14.95% backend cycles idle [50.21%] readreverse : 0.071 micros/op 14077393 ops/sec; 1557.3 MB/s 3,239,575,993 stalled-cycles-frontend # 27.36% frontend cycles idle [49.96%] 1,558,253,983 stalled-cycles-backend # 13.16% backend cycles idle [50.14%] ``` Existing performance ``` readreverse : 0.174 micros/op 5732342 ops/sec; 634.1 MB/s 20,570,209,389 stalled-cycles-frontend # 70.71% frontend cycles idle [50.01%] 18,422,816,837 stalled-cycles-backend # 63.33% backend cycles idle [50.04%] readseq : 0.119 micros/op 8400537 ops/sec; 929.3 MB/s 15,634,225,844 stalled-cycles-frontend # 79.07% frontend cycles idle [49.96%] 14,227,427,453 stalled-cycles-backend # 71.95% backend cycles idle [50.09%] ``` Test Plan: unit tests Reviewers: yhchiang, sdong, igor Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D55107	2016-03-11 19:01:12 -08:00
sdong	294bdf9ee2	Change Property name from "rocksdb.current_version_number" to "rocksdb.current-super-version-number" Summary: I realized I again is wrong about the naming convention. Let me change it to the correct one. Test Plan: Run unit tests. Reviewers: IslamAbdelRahman, kradhakrishnan, yhchiang, andrewkr Reviewed By: andrewkr Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D55041	2016-03-04 18:15:29 -08:00
sdong	e79ad9e184	Add Iterator Property rocksdb.iterator.version_number Summary: We want to provide a way to detect whether an iterator is stale and needs to be recreated. Add a iterator property to return version number. Test Plan: Add two unit tests for it. Reviewers: IslamAbdelRahman, yhchiang, anthony, kradhakrishnan, andrewkr Reviewed By: andrewkr Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54921	2016-03-02 16:23:59 -08:00
sdong	74b660702e	Rename iterator property "rocksdb.iterator.is.key.pinned" => "rocksdb.iterator.is-key-pinned" Summary: Rename iterator property to folow property naming convention. Test Plan: Run all existing tests. Reviewers: andrewkr, anthony, yhchiang, kradhakrishnan, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54957	2016-03-01 13:47:12 -08:00
sdong	1f5954147b	Introduce Iterator::GetProperty() and replace Iterator::IsKeyPinned() Summary: Add Iterator::GetProperty(), a way for users to communicate with iterator, and turn Iterator::IsKeyPinned() with it. As a follow-up, I'll ask a property as the version number attached to the iterator Test Plan: Rerun existing tests and add a negative test case. Reviewers: yhchiang, andrewkr, kradhakrishnan, anthony, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54783	2016-02-29 14:01:31 -08:00
Peter Mattis	239aaf2fc0	Use user_comparator when comparing against iterate_upper_bound. Fixes #983.	2016-02-11 08:47:16 -05:00
Baraa Hamodi	21e95811d1	Updated all copyright headers to the new format.	2016-02-09 15:12:00 -08:00
Islam AbdelRahman	8c71eb5afc	Optimize DBIter::Prev() by reducing stack overhead Summary: It looks like we are spending significant amount of time creating std::deque<std::string> every time we do Iterator::Prev() {F921567} By using merge_operands_ as a DBIter data member w create it once and reduce this overhead and see ~30% performance improvement when using Iterator::Prev() on hot data Orignal performance ``` DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readreverse" --db="/dev/shm/bench_prev_opt/" --use_existing_db --disable_auto_compactions readreverse : 0.713 micros/op 1402219 ops/sec; 155.1 MB/s readreverse : 0.609 micros/op 1641386 ops/sec; 181.6 MB/s readreverse : 0.684 micros/op 1461150 ops/sec; 161.6 MB/s readreverse : 0.629 micros/op 1589842 ops/sec; 175.9 MB/s readreverse : 0.647 micros/op 1544530 ops/sec; 170.9 MB/s ``` After optimization ``` DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readreverse" --db="/dev/shm/bench_prev_opt/" --use_existing_db --disable_auto_compactions readreverse : 0.488 micros/op 2051189 ops/sec; 226.9 MB/s readreverse : 0.505 micros/op 1980892 ops/sec; 219.1 MB/s readreverse : 0.541 micros/op 1846971 ops/sec; 204.3 MB/s readreverse : 0.497 micros/op 2013612 ops/sec; 222.8 MB/s readreverse : 0.480 micros/op 2082665 ops/sec; 230.4 MB/s ``` Test Plan: make check -j64 Reviewers: sdong, anthony, rven, igor, yhchiang Reviewed By: yhchiang Subscribers: jkedgar, dhruba Differential Revision: https://reviews.facebook.net/D52563	2016-01-07 07:59:14 -08:00
Islam AbdelRahman	aececc209e	Introduce ReadOptions::pin_data (support zero copy for keys) Summary: This patch update the Iterator API to introduce new functions that allow users to keep the Slices returned by key() valid as long as the Iterator is not deleted ReadOptions::pin_data : If true keep loaded blocks in memory as long as the iterator is not deleted Iterator::IsKeyPinned() : If true, this mean that the Slice returned by key() is valid as long as the iterator is not deleted Also add a new option BlockBasedTableOptions::use_delta_encoding to allow users to disable delta_encoding if needed. Benchmark results (using https://phabricator.fb.com/P20083553) ``` // $ du -h /home/tec/local/normal.4K.Snappy/db10077 // 6.1G /home/tec/local/normal.4K.Snappy/db10077 // $ du -h /home/tec/local/zero.8K.LZ4/db10077 // 6.4G /home/tec/local/zero.8K.LZ4/db10077 // Benchmarks for shard db10077 // _build/opt/rocks/benchmark/rocks_copy_benchmark \ // --normal_db_path="/home/tec/local/normal.4K.Snappy/db10077" \ // --zero_db_path="/home/tec/local/zero.8K.LZ4/db10077" // First run // ============================================================================ // rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s // ============================================================================ // BM_StringCopy 1.73s 576.97m // BM_StringPiece 103.74% 1.67s 598.55m // ============================================================================ // Match rate : 1000000 / 1000000 // Second run // ============================================================================ // rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s // ============================================================================ // BM_StringCopy 611.99ms 1.63 // BM_StringPiece 203.76% 300.35ms 3.33 // ============================================================================ // Match rate : 1000000 / 1000000 ``` Test Plan: Unit tests Reviewers: sdong, igor, anthony, yhchiang, rven Reviewed By: rven Subscribers: dhruba, lovro, adsharma Differential Revision: https://reviews.facebook.net/D48999	2015-12-16 12:08:30 -08:00
sdong	459c7fba36	Revert previous behavior of internal_key_skipped_count Summary: With recent commit `33e0c93826`, db iterator skips perf context counter internal_key_skipped_count when blindly issuing internal Next(). Now increment the counter by one when issuing this Next() Test Plan: Run all existing tests Reviewers: rven, yhchiang, IslamAbdelRahman, kradhakrishnan, igor, anthony Reviewed By: anthony Subscribers: yoshinorim, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D51465	2015-11-30 21:55:05 -08:00
sdong	33e0c93826	Reduce extra key comparision in DBIter::Next() Summary: Now DBIter::Next() always compares with current key with itself first, which is unnecessary if the last key is not a merge key. I made the change and didn't see db_iter_test fails. Want to hear whether people have any idea what I miss. Test Plan: Run all unit tests Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D48279	2015-11-24 17:16:18 -08:00
Venkatesh Radhakrishnan	ae7940b628	Fix regression failure in PrefixTest.PrefixValid Summary: Use IterKey to store prefix_start_ so that it doesn't get freed Test Plan: PrefixTest.PrefixValid Reviewers: anthony, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D50289	2015-11-05 16:43:54 -08:00
Venkatesh Radhakrishnan	9d50afc3b9	Prefix-based iterating only shows keys in prefix Summary: MyRocks testing found an issue that while iterating over keys that are outside the prefix, sometimes wrong results were seen for keys outside the prefix. We now tighten the range of keys seen with a new read option called prefix_seen_at_start. This remembers the starting prefix and then compares it on a Next for equality of prefix. If they are from a different prefix, it sets valid to false. Test Plan: PrefixTest.PrefixValid Reviewers: IslamAbdelRahman, sdong, yhchiang, anthony Reviewed By: anthony Subscribers: spetrunia, hermanlee4, yoshinorim, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D50211	2015-11-05 13:24:05 -08:00
sdong	35ad531be3	Seperate InternalIterator from Iterator Summary: Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type. This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's. At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it. Test Plan: Run all existing tests. Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven Reviewed By: rven Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D48549	2015-10-13 15:32:13 -07:00
Andres Noetzli	014fd55adc	Support for SingleDelete() Summary: This patch fixes #7460559. It introduces SingleDelete as a new database operation. This operation can be used to delete keys that were never overwritten (no put following another put of the same key). If an overwritten key is single deleted the behavior is undefined. Single deletion of a non-existent key has no effect but multiple consecutive single deletions are not allowed (see limitations). In contrast to the conventional Delete() operation, the deletion entry is removed along with the value when the two are lined up in a compaction. Note: The semantics are similar to @igor's prototype that allowed to have this behavior on the granularity of a column family ( https://reviews.facebook.net/D42093 ). This new patch, however, is more aggressive when it comes to removing tombstones: It removes the SingleDelete together with the value whenever there is no snapshot between them while the older patch only did this when the sequence number of the deletion was older than the earliest snapshot. Most of the complex additions are in the Compaction Iterator, all other changes should be relatively straightforward. The patch also includes basic support for single deletions in db_stress and db_bench. Limitations: - Not compatible with cuckoo hash tables - Single deletions cannot be used in combination with merges and normal deletions on the same key (other keys are not affected by this) - Consecutive single deletions are currently not allowed (and older version of this patch supported this so it could be resurrected if needed) Test Plan: make all check Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor Reviewed By: igor Subscribers: maykov, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D43179	2015-09-17 11:42:56 -07:00
Manuel Ung	aeb4612685	Add counters for seek/next/prev Summary: There are currently no statistics on seeks, only on gets. This adds the following counters: rocksdb.number.db.seek rocksdb.number.db.next rocksdb.number.db.prev (number of calls) rocksdb.db.iterate.bytes.read (number of bytes read from key + value using seek/next/prev) rocksdb.number.keys.seek.found rocksdb.number.keys.next.found rocksdb.number.keys.prev.found (number of calls where seek/next/prev found a value) Test Plan: ./db_bench -statistics -benchmarks fillrandom,seekrandom -seek_nexts 5 ./db_bench -statistics -benchmarks fillrandom,seekrandom -seek_nexts 5 -reverse_iterator Reviewers: yhchiang, rven, kradhakrishnan, IslamAbdelRahman, MarkCallaghan, sdong, igor Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D46605	2015-09-11 11:37:44 -07:00
Andres Noetzli	6bdc484fd8	Added Equal method to Comparator interface Summary: In some cases, equality comparisons can be done more efficiently than three-way comparisons. There are quite a few places in the code where we only care about equality. This patch adds an Equal() method that defaults to using the Compare() method. Test Plan: make clean all check Reviewers: rven, anthony, yhchiang, igor, sdong Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D46233	2015-09-08 15:30:49 -07:00
sdong	d286b5df90	DBIter to out extra keys with higher sequence numbers when changing direction from forward to backward Summary: When DBIter changes iterating direction from forward to backward, it might see some much larger keys with higher sequence ID. With this commit, these rows will be actively filtered out. It should fix existing disabled tests in db_iter_test. This may not be a perfect fix, but it introduces least impact on existing codes, in order to be safe. Test Plan: Enable existing tests and make sure they pass. Add a new test DBIterWithMergeIterTest.InnerMergeIteratorDataRace8. Also run all existing tests. Reviewers: yhchiang, rven, anthony, IslamAbdelRahman, kradhakrishnan, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D45567	2015-08-26 13:01:39 -07:00
sdong	888fbdc889	Remove the contstaint that iterator upper bound needs to be within a prefix Summary: There is a check to fail the iterator if prefix extractor is specified but upper bound is out of the prefix for the seek key. Relax this constraint to allow users to set upper bound to the next prefix of the current one. Test Plan: make commit-prereq Reviewers: igor, anthony, kradhakrishnan, yhchiang, rven Reviewed By: rven Subscribers: tnovak, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D44949	2015-08-19 11:03:51 -07:00
Andres Notzli	4249f159d5	Removing duplicate code in db_bench/db_stress, fixing typos Summary: While working on single delete support for db_bench, I realized that db_bench/db_stress contain a bunch of duplicate code related to copmression and found some typos. This patch removes duplicate code, typos and a redundant #ifndef in internal_stats.cc. Test Plan: make db_stress && make db_bench && ./db_bench --benchmarks=compress,uncompress Reviewers: yhchiang, sdong, rven, anthony, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D43965	2015-08-11 11:46:15 -07:00
Andres Noetzli	d7314ba759	Fixing endless loop if seeking to end of key with seq num 0 Summary: When seeking to the last occurrence of a key with sequence number 0, db_iter ends up in an endless loop because it seeks to type kValueTypeForSeek which is larger than kTypeDeletion/kTypeValue. Added test case that triggers the behavior. Test Plan: make clean all check Reviewers: igor, rven, anthony, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D43653	2015-08-06 10:43:28 -07:00
sdong	8e01bd1144	Fix misplaced position for reversing iterator direction while current key is a merge Summary: While doing forward iterating, if current key is merge, internal iterator position is placed to the next key. If Prev() is called now, needs to do extra Prev() to recover the location. This is second attempt of fixing after reverting `ec70fea4c4`. This time shrink the fix to only merge key is the current key and avoid the reseeking logic for max_iterating skipping Test Plan: enable the two disabled tests and make sure they pass Reviewers: rven, IslamAbdelRahman, kradhakrishnan, tnovak, yhchiang Reviewed By: yhchiang Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D43557	2015-08-05 11:08:50 -07:00
Yueh-Hsuan Chiang	685582a0b4	Revert two diffs related to DBIter::FindPrevUserKey() Summary: This diff reverts the following two previous diffs related to DBIter::FindPrevUserKey(), which makes db_stress unstable. We should bake a better fix for this. * "Fix a comparison in DBIter::FindPrevUserKey()" `ec70fea4c4`. * "Fixed endless loop in DBIter::FindPrevUserKey()" `acee2b08a2`. Test Plan: db_stress Reviewers: anthony, igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D41301	2015-07-07 11:36:24 -07:00
Tomislav Novak	ec70fea4c4	Fix a comparison in DBIter::FindPrevUserKey() Summary: When seek target is a merge key (`kTypeMerge`), `DBIter::FindNextUserEntry()` advances the underlying iterator _past_ the current key (`saved_key_`); see `MergeValuesNewToOld()`. However, `FindPrevUserKey()` assumes that `iter_` points to an entry with the same user key as `saved_key_`. As a result, `it->Seek(key) && it->Prev()` can cause the iterator to be positioned at the _next_, instead of the previous, entry (new test, written by @lovro, reproduces the bug). This diff changes `FindPrevUserKey()` to also skip keys that are _greater_ than `saved_key_`. Test Plan: db_test Reviewers: igor, sdong Reviewed By: sdong Subscribers: leveldb, dhruba, lovro Differential Revision: https://reviews.facebook.net/D40791	2015-06-29 17:04:03 -07:00
Poornima Chozhiyath Raman	4fb09c6871	Updating SeekToLast with upper bound Summary: #7124486: RocksDB's Iterator.SeekToLast should seek to the last key before iterate_upper_bound if presents Test Plan: ./db_iter_test run successfully with the new testcase Reviewers: rven, yhchiang, igor, anthony, kradhakrishnan, sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D40425	2015-06-25 09:44:30 -07:00
clark.kang	6ede020dc4	fix typos	2015-04-25 18:14:27 +09:00
Anurag Indu	3d1a924ff3	Adding stats for the merge and filter operation Summary: We have addded new stats and perf_context for measuring the merge and filter operation time consumption. We have bounded all the merge operations within the GUARD statment and collected the total time for these operations in the DB. Test Plan: WIP Reviewers: rven, yhchiang, kradhakrishnan, igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D34377	2015-03-24 14:42:04 -07:00
Igor Sugak	62247ffa3b	rocksdb: Add missing override Summary: When using latest clang (3.6 or 3.7/trunck) rocksdb is failing with many errors. Almost all of them are missing override errors. This diff adds missing override keyword. No manual changes. Prerequisites: bear and clang 3.5 build with extra tools ```lang=bash % USE_CLANG=1 bear make all # generate a compilation database http://clang.llvm.org/docs/JSONCompilationDatabase.html % clang-modernize -p . -include . -add-override % make format ``` Test Plan: Make sure all tests are passing. ```lang=bash % #Use default fb code clang. % make check ``` Verify less error and no missing override errors. ```lang=bash % # Have trunk clang present in path. % ROCKSDB_NO_FBCODE=1 CC=clang CXX=clang++ make ``` Reviewers: igor, kradhakrishnan, rven, meyering, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D34077	2015-02-26 11:28:41 -08:00
Igor Sugak	62f7a1be4f	rocksdb: Fixed 'Dead assignment' and 'Dead initialization' scan-build warnings Summary: This diff contains trivial fixes for 6 scan-build warnings: db/c_test.c `db` variable is never read. Removed assignment. scan-build report: http://home.fburl.com/~sugak/latest20/report-9b77d2.html#EndPath db/db_iter.cc `skipping` local variable is assigned to false. Then in the next switch block the only "non return" case assign `skipping` to true, the rest cases don't use it and all do return. scan-build report: http://home.fburl.com/~sugak/latest20/report-13fca7.html#EndPath db/log_reader.cc In `bool Reader::SkipToInitialBlock()` `offset_in_block` local variable is assigned to 0 `if (offset_in_block > kBlockSize - 6)` and then never used. Removed the assignment and renamed it to `initial_offset_in_block` to avoid confusion. scan-build report: http://home.fburl.com/~sugak/latest20/report-a618dd.html#EndPath In `bool Reader::ReadRecord(Slice* record, std::string* scratch)` local variable `in_fragmented_record` in switch case `kFullType` block is assigned to false and then does `return` without use. In the other switch case `kFirstType` block the same `in_fragmented_record` is assigned to false, but later assigned to true without prior use. Removed assignment for both cases. scan-build reprots: http://home.fburl.com/~sugak/latest20/report-bb86b0.html#EndPath http://home.fburl.com/~sugak/latest20/report-a975be.html#EndPath table/plain_table_key_coding.cc Local variable `user_key_size` is assigned when declared. But then in both places where it is used assigned to `static_cast<uint32_t>(key.size() - 8)`. Changed to initialize the variable to the proper value in declaration. scan-build report: http://home.fburl.com/~sugak/latest20/report-9e6b86.html#EndPath tools/db_stress.cc Missing `break` in switch case block. This seems to be a bug. Added missing `break`. Test Plan: Make sure all tests are passing and scan-build does not report 'Dead assignment' and 'Dead initialization' bugs. ```lang=bash % make check % make analyze ``` Reviewers: meyering, igor, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D33795	2015-02-23 14:10:09 -08:00
Yueh-Hsuan Chiang	5f719d7202	Replace exception by setting valid_ = false in DBIter::MergeValuesNewToOld() Summary: Replace exception by setting valid_ = false in DBIter::MergeValuesNewToOld(). Test Plan: Not sure if I am right at this, but it seems we currently don't have a good way to test that code path as it requires dynamically set merge_operator = nullptr at the time while Merge() is calling. Reviewers: igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D29811	2014-12-04 11:11:11 -08:00
Igor Canadi	9f20395cd6	Turn -Wshadow back on Summary: It turns out that -Wshadow has different rules for gcc than clang. Previous commit fixed clang. This commits fixes the rest of the warnings for gcc. Test Plan: compiles Reviewers: ljin, yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28131	2014-11-06 11:14:28 -08:00
Yueh-Hsuan Chiang	2b1f23dcae	Apply InfoLogLevel to the logs in db/db_iter.cc Summary: Apply InfoLogLevel to the logs in db/db_iter.cc Test Plan: make Reviewers: igor, ljin, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D27861	2014-10-30 16:54:34 -07:00
Danny Al-Gaaf	44cca0cd8f	db/db_iter.cc: remove unused variable Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>	2014-10-01 10:49:08 +02:00
Lei Jin	048560a642	reduce references to cfd->options() in DBImpl Summary: I found it is almost impossible to get rid of this function in a single batch. I will take a step by step approach Test Plan: make release Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D22995	2014-09-08 15:04:34 -07:00
Raghav Pisolkar	e0b99d4f5d	created a new ReadOptions parameter 'iterate_upper_bound'	2014-09-04 11:00:16 -07:00
Torrie Fischer	6614a48418	Refactor PerfStepTimer to stop on destruct This eliminates the need to remember to call PERF_TIMER_STOP when a section has been timed. This allows more useful design with the perf timers and enables possible return value optimizations. Simplistic example: class Foo { public: Foo(int v) : m_v(v); private: int m_v; } Foo makeFrobbedFoo(int errno) { errno = 0; return Foo(); } Foo bar(int *errno) { PERF_TIMER_GUARD(some_timer); return makeFrobbedFoo(errno); } int main(int argc, char[] argv) { Foo f; int errno; f = bar(&errno); if (errno) return -1; return 0; } After bar() is called, perf_context.some_timer would be incremented as if Stop(&perf_context.some_timer) was called at the end, and the compiler is still able to produce optimizations on the return value from makeFrobbedFoo() through to main().	2014-09-02 12:04:22 -07:00
Lei Jin	5a5953b388	Add histogram for DB_SEEK Summary: as title Test Plan: make release Reviewers: sdong, yhchiang Reviewed By: yhchiang Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D21717	2014-08-13 15:56:37 -07:00
Stanislau Hlebik	7c88249f51	Fix db_test and DBIter Summary: Fix old issue with DBTest.Randomized with BlockBasedTableWithWholeKeyHashIndex + added printing in DBTest.Randomized. Test Plan: make all check Reviewers: zagfox, igor, ljin, yhchiang, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D21003	2014-08-08 09:44:14 -07:00
Lei Jin	40fa8a4cd5	make statistics forward-able Summary: Make StatisticsImpl being able to forward stats to provided statistics implementation. The main purpose is to allow us to collect internal stats in the future even when user supplies custom statistics implementation. It avoids intrumenting 2 sets of stats collection code. One immediate use case is tuning advisor, which needs to collect some internal stats, users may not be interested. Test Plan: ran db_bench and see stats show up at the end of run Will run make all check since some tests rely on statistics Reviewers: yhchiang, sdong, igor Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D20145	2014-07-28 12:05:36 -07:00
Stanislau Hlebik	d916593ead	Add Prev() for merge operator Summary: Implement Prev() with merge operator for DBIterator. Request from mongoDB. Task 4673663. Test Plan: make all check Reviewers: sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19743	2014-07-15 16:10:18 -07:00
Igor Canadi	fd27001072	Fix compile errors on Mac Summary: https://phabricator.fb.com/P11372644 Test Plan: compiles Reviewers: sdong, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D18873	2014-06-03 12:28:58 -07:00
sdong	df9069d23f	In DB::NewIterator(), try to allocate the whole iterator tree in an arena Summary: In this patch, try to allocate the whole iterator tree starting from DBIter from an arena 1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it. 2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator. 3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it. Limitations: (1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc (2) Two level iterator itself is allocated in arena, but not iterators inside it. Test Plan: make all check Reviewers: ljin, haobo Reviewed By: haobo Subscribers: leveldb, dhruba, yhchiang, igor Differential Revision: https://reviews.facebook.net/D18513	2014-06-02 17:44:57 -07:00
Lei Jin	388d2054c7	forward iterator Summary: Forward iterator puts everything together in a flat structure instead of a hierarchy of nested iterators. this should simplify the code and provide better performance. It also enables more optimization since all information are accessiable in one place. Init evaluation shows about 6% improvement Test Plan: db_test and db_bench Reviewers: dhruba, igor, tnovak, sdong, haobo Reviewed By: haobo Subscribers: sdong, leveldb Differential Revision: https://reviews.facebook.net/D18795	2014-05-30 14:31:55 -07:00
sdong	acd17fd002	Remove unused variable in DBIter Summary: as title Test Plan: Still compile Reviewers: haobo, igor, yhchiang Reviewed By: igor CC: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D18603	2014-05-09 13:20:35 -07:00
Lei Jin	82b37a18bd	thread local for tailing iterator Summary: replace the super version acquisision in tailing itrator with thread local Test Plan: will post results Reviewers: igor, haobo, sdong, yhchiang, dhruba Reviewed By: igor CC: leveldb Differential Revision: https://reviews.facebook.net/D17757	2014-04-14 10:48:01 -07:00
Lei Jin	92c1eb0291	macros for perf_context Summary: This will allow us to disable them completely for iOS or for better performance Test Plan: will run make all check Reviewers: igor, haobo, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D17511	2014-04-08 10:58:07 -07:00
sdong	5e2db3b434	PlainTableIterator not to store copied key in std::string Summary: Move PlainTableIterator's copied key from std::string local buffer to avoid paying the extra costs in std::string related to sharing. Reuse the same buffer class in DbIter. Move the class to dbformat.h. This patch improves iterator performance significantly. Running this benchmark: ./table_reader_bench --num_keys2=17 --iterator --plain_table --time_unit=nanosecond The average latency is improved to about 750 nanoseconds from 1100 nanoseconds. Test Plan: Add a unit test. make all check Reviewers: haobo, ljin Reviewed By: haobo CC: igor, yhchiang, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D17547	2014-04-07 19:06:09 -07:00
sdong	e0a87c4cf1	DBIter to use static allocated char array for saved_key_ (if it is not too long) Summary: DBIter now uses a std::string for saved_key. Based on some profiling, it could be more expensive than we though. Optimize it with the same technique as LookupKey -- if it is short, we copy it to a static allocated char. Otherwise, dynamically allocate memory for it. Test Plan: make all check Reviewers: haobo, ljin Reviewed By: haobo CC: dhruba, igor, yhchiang, leveldb Differential Revision: https://reviews.facebook.net/D17289	2014-04-01 16:43:11 -07:00
Yueh-Hsuan Chiang	cda4006e87	Enhance partial merge to support multiple arguments Summary: * PartialMerge api now takes a list of operands instead of two operands. * Add min_pertial_merge_operands to Options, indicating the minimum number of operands to trigger partial merge. * This diff is based on Schalk's previous diff (D14601), but it also includes necessary changes such as updating the pure C api for partial merge. Test Plan: * make check all * develop tests for cases where partial merge takes more than two operands. TODOs (from Schalk): * Add test with min_partial_merge_operands > 2. * Perform benchmarks to measure the performance improvements (can probably use results of task #2837810.) * Add description of problem to doc/index.html. * Change wiki pages to reflect the interface changes. Reviewers: haobo, igor, vamsi Reviewed By: haobo CC: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D16815	2014-03-24 17:57:13 -07:00
sdong	c61c9830d4	Fix a bug that Prev() can hang. Summary: Prev() now can hang when there is a key with more than max_skipped number of appearance internally but all of them are newer than the sequence ID to seek. Add unit tests to confirm the bug and fix it. Test Plan: make all check Reviewers: igor, haobo Reviewed By: igor CC: ljin, yhchiang, leveldb Differential Revision: https://reviews.facebook.net/D16899	2014-03-17 10:00:41 -07:00
Siying Dong	d169b67680	[Performance Branch] PlainTable to encode rows with seqID 0, value type using 1 internal byte. Summary: In PlainTable, use one single byte to represent 8 bytes of internal bytes, if seqID = 0 and it is value type (which should be common for bottom most files). It is to save 7 bytes for uncompressed cases. Test Plan: make all check Reviewers: haobo, dhruba, kailiu Reviewed By: haobo CC: igor, leveldb Differential Revision: https://reviews.facebook.net/D15489	2014-02-03 12:19:30 -08:00
kailiu	551e9428ce	Merge branch 'master' into performance	2013-12-06 14:15:42 -08:00
Igor Canadi	043fc14c3e	Get rid of some shared_ptrs Summary: I went through all remaining shared_ptrs and removed the ones that I found not-necessary. Only GenerateCachePrefix() is called fairly often, so don't expect much perf wins. The ones that are left are accessed infrequently and I think we're fine with keeping them. Test Plan: make asan_check Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14427	2013-12-03 11:17:58 -08:00
Kai Liu	1966b63137	Merge branch 'master' into perf	2013-11-27 11:47:40 -08:00
Haobo Xu	5b825d6964	[RocksDB] Use raw pointer instead of shared pointer when passing Statistics object internally Summary: liveness of the statistics object is already ensured by the shared pointer in DB options. There's no reason to pass again shared pointer among internal functions. Raw pointer is sufficient and efficient. Test Plan: make check Reviewers: dhruba, MarkCallaghan, igor Reviewed By: dhruba CC: leveldb, reconnect.grayhat Differential Revision: https://reviews.facebook.net/D14289	2013-11-25 10:38:15 -08:00
Siying Dong	3e35aa6412	Revert "Allow users to profile a query and see bottleneck of the query" This reverts commit `3d8ac31d71`.	2013-11-21 17:40:39 -08:00
Siying Dong	b135d01e7b	Allow users to profile a query and see bottleneck of the query Summary: Provide a framework to profile a query in detail to figure out latency bottleneck. Currently, in Get(), Put() and iterators, 2-3 simple timing is used. We can easily add more profile counters to the framework later. Test Plan: Enable this profiling in seveal existing tests. Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14001 Conflicts: table/merger.cc	2013-11-21 17:39:19 -08:00
Siying Dong	3d8ac31d71	Allow users to profile a query and see bottleneck of the query Summary: Provide a framework to profile a query in detail to figure out latency bottleneck. Currently, in Get(), Put() and iterators, 2-3 simple timing is used. We can easily add more profile counters to the framework later. Test Plan: Enable this profiling in seveal existing tests. Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14001	2013-11-21 16:29:57 -08:00
Siying Dong	071fb0d77b	Inline a couple of functions and put one save lazily clearing Summary: Machine several functions inline. Also, in DBIter.Seek() make value cleaning up lazily done. These are for the use case that Seek() are called lots of times but few return values. Test Plan: make all check Differential Revision: https://reviews.facebook.net/D14217	2013-11-20 17:32:57 -08:00
Siying Dong	65428b0c0a	Fix Bug: iterator.Prev() or iterator.SeekToLast() might return the first element instead of the correct one Summary: Recent patch https://reviews.facebook.net/D11865 introduced a regression bug: DBIter::FindPrevUserEntry(), which is called by DBIter::Prev() (and also implicitly if calling iterator.SeekToLast()) might do issue a seek when having skipped too many entries. If the skipped entry just before the seek() is a delete, the saved key is erased so that it seeks to the front, so Prev() would return the first element. This patch fixes the bug by not doing seek() in DBIter::FindNextUserEntry() if saved key has been erased. Test Plan: Add a test DBTest.IterPrevMaxSkip which would fail without the patch and would pass with the change. Reviewers: dhruba, xjin, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13557	2013-10-17 18:33:18 -07:00
Dhruba Borthakur	9cd221094c	Add appropriate LICENSE and Copyright message. Summary: Add appropriate LICENSE and Copyright message. Test Plan: make check Reviewers: CC: Task ID: # Blame Rev:	2013-10-16 17:48:41 -07:00
Dhruba Borthakur	a143ef9b38	Change namespace from leveldb to rocksdb Summary: Change namespace from leveldb to rocksdb. This allows a single application to link in open-source leveldb code as well as rocksdb code into the same process. Test Plan: compile rocksdb Reviewers: emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D13287	2013-10-04 11:59:26 -07:00
Haobo Xu	71046971f0	[RocksDB] Added perf counters to track skipped internal keys during iteration Summary: as title. unit test not polished. this is for a quick live test Test Plan: live Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13221	2013-10-02 10:48:41 -07:00
Dhruba Borthakur	197034e4c3	An iterator may automatically invoke reseeks. Summary: An iterator invokes reseek if the number of sequential skips over the same userkey exceeds a configured number. This makes iter->Next() faster (bacause of fewer key compares) if a large number of adjacent internal keys in a table (sst or memtable) have the same userkey. Test Plan: Unit test DBTest.IterReseek. Reviewers: emayanke, haobo, xjin Reviewed By: xjin CC: leveldb, xjin Differential Revision: https://reviews.facebook.net/D11865	2013-09-06 11:50:53 -07:00
Dhruba Borthakur	1186192ed1	Replace include/leveldb with include/rocksdb. Summary: Replace include/leveldb with include/rocksdb. Test Plan: make clean; make check make clean; make release Differential Revision: https://reviews.facebook.net/D12489	2013-08-23 10:51:00 -07:00
Deon Nicholas	b87dcae1a3	Made merge_oprator a shared_ptr; and added TTL unit tests Test Plan: - make all check; - make release; - make stringappend_test; ./stringappend_test Reviewers: haobo, emayanke Reviewed By: haobo CC: leveldb, kailiu Differential Revision: https://reviews.facebook.net/D12381	2013-08-20 13:35:28 -07:00
Deon Nicholas	e1346968d8	Merge operator fixes part 1. Summary: -Added null checks and revisions to DBIter::MergeValuesNewToOld() -Added DBIter test to stringappend_test -Major fix with Merge and TTL More plans for fixes later. Test Plan: -make clean; make stringappend_test -j 32; ./stringappend_test -make all check; Reviewers: haobo, emayanke, vamsi, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D12315	2013-08-19 11:42:47 -07:00
Jim Paton	0307c5fe3a	Implement log blobs Summary: This patch adds the ability for the user to add sequences of arbitrary data (blobs) to write batches. These blobs are saved to the log along with everything else in the write batch. You can add multiple blobs per WriteBatch and the ordering of blobs, puts, merges, and deletes are preserved. Blobs are not saves to SST files. RocksDB ignores blobs in every way except for writing them to the log. Before committing this patch, I need to add some test code. But I'm submitting it now so people can comment on the API. Test Plan: make -j32 check Reviewers: dhruba, haobo, vamsi Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12195	2013-08-14 16:32:46 -07:00
Deon Nicholas	c2d7826ced	[RocksDB] [MergeOperator] The new Merge Interface! Uses merge sequences. Summary: Here are the major changes to the Merge Interface. It has been expanded to handle cases where the MergeOperator is not associative. It does so by stacking up merge operations while scanning through the key history (i.e.: during Get() or Compaction), until a valid Put/Delete/end-of-history is encountered; it then applies all of the merge operations in the correct sequence starting with the base/sentinel value. I have also introduced an "AssociativeMerge" function which allows the user to take advantage of associative merge operations (such as in the case of counters). The implementation will always attempt to merge the operations/operands themselves together when they are encountered, and will resort to the "stacking" method if and only if the "associative-merge" fails. This implementation is conjectured to allow MergeOperator to handle the general case, while still providing the user with the ability to take advantage of certain efficiencies in their own merge-operator / data-structure. NOTE: This is a preliminary diff. This must still go through a lot of review, revision, and testing. Feedback welcome! Test Plan: -This is a preliminary diff. I have only just begun testing/debugging it. -I will be testing this with the existing MergeOperator use-cases and unit-tests (counters, string-append, and redis-lists) -I will be "desk-checking" and walking through the code with the help gdb. -I will find a way of stress-testing the new interface / implementation using db_bench, db_test, merge_test, and/or db_stress. -I will ensure that my tests cover all cases: Get-Memtable, Get-Immutable-Memtable, Get-from-Disk, Iterator-Range-Scan, Flush-Memtable-to-L0, Compaction-L0-L1, Compaction-Ln-L(n+1), Put/Delete found, Put/Delete not-found, end-of-history, end-of-file, etc. -A lot of feedback from the reviewers. Reviewers: haobo, dhruba, zshao, emayanke Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11499	2013-08-05 20:14:32 -07:00
Dhruba Borthakur	a8d807ee91	Record the number of open db iterators. Summary: Enhance the statitics to report the number of open db iterators. Test Plan: make check Reviewers: haobo, emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D10983	2013-05-29 08:47:08 -07:00
Haobo Xu	05e8854085	[Rocksdb] Support Merge operation in rocksdb Summary: This diff introduces a new Merge operation into rocksdb. The purpose of this review is mostly getting feedback from the team (everyone please) on the design. Please focus on the four files under include/leveldb/, as they spell the client visible interface change. include/leveldb/db.h include/leveldb/merge_operator.h include/leveldb/options.h include/leveldb/write_batch.h Please go over local/my_test.cc carefully, as it is a concerete use case. Please also review the impelmentation files to see if the straw man implementation makes sense. Note that, the diff does pass all make check and truly supports forward iterator over db and a version of Get that's based on iterator. Future work: - Integration with compaction - A raw Get implementation I am working on a wiki that explains the design and implementation choices, but coding comes just naturally and I think it might be a good idea to share the code earlier. The code is heavily commented. Test Plan: run all local tests Reviewers: dhruba, heyongqiang Reviewed By: dhruba CC: leveldb, zshao, sheki, emayanke, MarkCallaghan Differential Revision: https://reviews.facebook.net/D9651	2013-05-03 16:59:02 -07:00
Hans Wennborg	36a5f8ed7f	A number of fixes: - Replace raw slice comparison with a call to user comparator. Added test for custom comparators. - Fix end of namespace comments. - Fixed bug in picking inputs for a level-0 compaction. When finding overlapping files, the covered range may expand as files are added to the input set. We now correctly expand the range when this happens instead of continuing to use the old range. For example, suppose L0 contains files with the following ranges: F1: a .. d F2: c .. g F3: f .. j and the initial compaction target is F3. We used to search for range f..j which yielded {F2,F3}. However we now expand the range as soon as another file is added. In this case, when F2 is added, we expand the range to c..j and restart the search. That picks up file F1 as well. This change fixes a bug related to deleted keys showing up incorrectly after a compaction as described in Issue 44. (Sync with upstream @25072954)	2011-10-31 17:22:06 +00:00
gabor@google.com	ab323f7e1e	Bugfixes for iterator and documentation. - Fix bug in Iterator::Prev where it would return the wrong key. Fixes issues 29 and 30. - Added a tweak to testharness to allow running just some tests. - Fixing two minor documentation errors based on issues 28 and 25. - Cleanup; fix namespaces of export-to-C code. Also fix one "const char" vs "char" mismatch. git-svn-id: https://leveldb.googlecode.com/svn/trunk@48 62dab493-f737-651d-591e-8d6aee1b9529	2011-08-16 01:21:01 +00:00
dgrogan@chromium.org	ba6dac0e80	@20776309 * env_chromium.cc should not export symbols. * Fix MSVC warnings. * Removed large value support. * Fix broken reference to documentation file git-svn-id: https://leveldb.googlecode.com/svn/trunk@24 62dab493-f737-651d-591e-8d6aee1b9529	2011-04-20 22:48:11 +00:00
dgrogan@chromium.org	69c6d38342	reverting disastrous MOE commit, returning to r21 git-svn-id: https://leveldb.googlecode.com/svn/trunk@23 62dab493-f737-651d-591e-8d6aee1b9529	2011-04-19 23:11:15 +00:00
dgrogan@chromium.org	b743906eea	Revision created by MOE tool push_codebase. MOE_MIGRATION= git-svn-id: https://leveldb.googlecode.com/svn/trunk@22 62dab493-f737-651d-591e-8d6aee1b9529	2011-04-19 23:01:25 +00:00
dgrogan@chromium.org	b409afe968	chmod a-x git-svn-id: https://leveldb.googlecode.com/svn/trunk@21 62dab493-f737-651d-591e-8d6aee1b9529	2011-04-18 23:15:58 +00:00
dgrogan@chromium.org	f779e7a5d8	@20602303. Default file permission is now 755. git-svn-id: https://leveldb.googlecode.com/svn/trunk@20 62dab493-f737-651d-591e-8d6aee1b9529	2011-04-12 19:38:58 +00:00
jorlow@chromium.org	4671a695fc	Move include files into a leveldb subdir. git-svn-id: https://leveldb.googlecode.com/svn/trunk@18 62dab493-f737-651d-591e-8d6aee1b9529	2011-03-30 18:35:40 +00:00

1 2 3 4 5 ...

254 Commits