Summary:
util/ means for lower level libraries, so it's a good idea to move the files which requires knowledge to DB out. Create a file/ and move some files there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5375
Differential Revision: D15550935
Pulled By: siying
fbshipit-source-id: 61a9715dcde5386eebfb43e93f847bba1ae0d3f2
Summary:
Previously if iterator upper/lower bound presents, `DBIter` will check the bound for every key. This patch turns the check into per-file or per-data block check when applicable, by checking against either file largest/smallest key or block index key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5111
Differential Revision: D15330061
Pulled By: siying
fbshipit-source-id: 8a653fe3cd50d94d81eb2d13b087326c58ee2024
Summary:
Right now, DBIter::Next() always checks whether an entry is for the same user key as the previous entry to see whether the key should be hidden to the user. However, if previous entry's sequence number is 0, the check is not needed because 0 is the oldest possible sequence number.
We could extend it from seqnum 0 case to simply prev_seqno >= current_seqno. However, it is less robust with bug or unexpected situations, while the gain is relatively low. We can always extend it later when needed.
In a readseq benchmark with full formed LSM-tree, number of key comparisons called is reduced from 2.981 to 2.165. readseq against a fully compacted DB, no key comparison is called. Performance in this benchmark didn't show obvious improvement, which is expected because key comparisons only takes small percentage of CPU. But it may show up to be more effective if users have an expensive customized comparator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5244
Differential Revision: D15067257
Pulled By: siying
fbshipit-source-id: b7e1ef3ec4fa928cba509683d2b3246e35d270d9
Summary:
It's hard to get DBIter to directly use InternalIterator::NextAndGetResult() because the code change would be complicated. Instead, use IteratorWrapper, where Next() is already using NextAndGetResult(). Performance number is hard to measure because it is small and ther is variation. I run readseq many times, and there seems to be 1% gain.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5214
Differential Revision: D15003635
Pulled By: siying
fbshipit-source-id: 17af1965c409c2fe90cd85037fbd2c5a1364f82a
Summary:
My compiler doesn't inline DBIter::Next() to arena wrapped iterator, even if it is a direct forward. Adding this annotation makes it inlined. It might not always work but inlinging this function to arena wrapped iterator always feels like the right decision.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5217
Differential Revision: D15004086
Pulled By: siying
fbshipit-source-id: a4cffd79c6fb092669a3a90633c9aa5e494f8a66
Summary:
Several small changes for Next():
1. Reducing branching by always update local_stats_.next_count_++ even if statistics is null. This should be faster than a branching.
2. Replacing ResetInternalKeysSkippedCounter() in Next() because the valid_ check is not needed in this case.
3. iter_->Valid() should always be true for non merge case. Remove this check.
4. Adding an inline annotation. It ends up with not picked up by my compiler, but it shouldn't hurt.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5200
Differential Revision: D15000391
Pulled By: siying
fbshipit-source-id: be97f61c708968234fb8e5cf272b5c2ac07dc4dd
Summary:
In long scans, virtual function calls of Next(), Valid(), key() and value() are not trivial. By introducing NextAndGetResult(), Some of the Next(), Valid() and key() calls are consolidated into one virtual function call to reduce CPU.
Also did some inline tricks and add some "final" randomly in some functions. Even without the "final" annotation, most Next() calls are inlined with -O3, but sometimes with a final it is inlined by O2 too. It doesn't hurt to add those final annotations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5197
Differential Revision: D14945977
Pulled By: siying
fbshipit-source-id: 7003969f9a5f1d5717f0bda503b91d19ba75ed88
Summary:
WriteUnPrepared adds a virtual function, MaxUnpreparedSequenceNumber, to ReadCallback, which returns 0 unless WriteUnPrepared is enabled and the transaction has uncommitted data written to the DB. Together with snapshot sequence number, this determines the last sequence that is visible to reads.
The patch clarifies the guarantees of the GetIterator API in WriteUnPrepared transactions and make use of that to statically initialize the read callback and thus avoid the virtual call.
Furthermore it increases the minimum value for min_uncommitted from 0 to 1 as seq 0 is used only for last level keys that are committed in all snapshots.
The following benchmark shows +0.26% higher throughput in seekrandom benchmark.
Benchmark:
./db_bench --benchmarks=fillrandom --use_existing_db=0 --num=1000000 --db=/dev/shm/dbbench
./db_bench --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
seekrandom [AVG 10 runs] : 20355 ops/sec; 225.2 MB/sec
seekrandom [MEDIAN 10 runs] : 20425 ops/sec; 225.9 MB/sec
./db_bench_lessvirtual3 --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
seekrandom [AVG 10 runs] : 20409 ops/sec; 225.8 MB/sec
seekrandom [MEDIAN 10 runs] : 20487 ops/sec; 226.6 MB/sec
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5049
Differential Revision: D14366459
Pulled By: maysamyabandeh
fbshipit-source-id: ebaff8908332a5ae9af7defeadabcb624be660ef
Summary:
Currently `perf_context.user_key_comparison_count` is bump only in `InternalKeyComparator`. For places user comparator is used directly the counter is not bump. Fixing the majority of it.
Index iterator and filter code also use user comparator directly and don't bump the counter. It is not fixed in this patch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5098
Differential Revision: D14603753
Pulled By: siying
fbshipit-source-id: 1cd41035644ca9e49b97a51030a5d1e15f5f3cae
Summary:
Introduce CPU timers for iterator seek and next operations. Seek
counter includes SeekToFirst, SeekToLast and SeekForPrev, w/ the
caveat that SeekToLast timer doesn't include some post processing
time if upper bound is defined.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5076
Differential Revision: D14525218
Pulled By: fredfsh
fbshipit-source-id: 03ba25df3b22b06c072621e4de0eacfa1445f0d9
Summary:
The patch reorders DBIter fields to put 1-byte fields together and let the compiler optimize the memory usage by using less 64-bit allocations for bools and enums.
This might have a negative side effect of putting the variables that are accessed together into different cache lines and hence increasing the cache misses. Not sure what benchmark would verify that thought. I ran simple, single-threaded seekrandom benchmarks but the variance in the results is too much to be conclusive.
./db_bench --benchmarks=fillrandom --use_existing_db=0 --num=1000000 --db=/dev/shm/dbbench
./db_bench --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5078
Differential Revision: D14562676
Pulled By: maysamyabandeh
fbshipit-source-id: 2284655d46e079b6e9a860e94be5defb6f482167
Summary:
Now that v2 is fully functional, the v1 aggregator is removed.
The v2 aggregator has been renamed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4778
Differential Revision: D13495930
Pulled By: abhimadan
fbshipit-source-id: 9d69500a60a283e79b6c4fa938fc68a8aa4d40d6
Summary:
RangeDelAggregatorV2 now supports ShouldDelete calls on
snapshot stripes and creation of range tombstone compaction iterators.
RangeDelAggregator is no longer used on any non-test code path, and will
be removed in a future commit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4758
Differential Revision: D13439254
Pulled By: abhimadan
fbshipit-source-id: fe105bcf8e3d4a2df37a622d5510843cd71b0401
Summary:
Removed `one_time_use` flag, which removed the need for some
tests, and changed all `NewRangeTombstoneIterator` methods to return
`FragmentedRangeTombstoneIterators`.
These changes also led to removing `RangeDelAggregatorV2::AddUnfragmentedTombstones`
and one of the `MemTableListVersion::AddRangeTombstoneIterators` methods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4692
Differential Revision: D13106570
Pulled By: abhimadan
fbshipit-source-id: cbab5432d7fc2d9cdfd8d9d40361a1bffaa8f845
Summary:
The old RangeDelAggregator did expensive pre-processing work
to create a collapsed, binary-searchable representation of range
tombstones. With FragmentedRangeTombstoneIterator, much of this work is
now unnecessary. RangeDelAggregatorV2 takes advantage of this by seeking
in each iterator to find a covering tombstone in ShouldDelete, while
doing minimal work in AddTombstones. The old RangeDelAggregator is still
used during flush/compaction for now, though RangeDelAggregatorV2 will
support those uses in a future PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4649
Differential Revision: D13146964
Pulled By: abhimadan
fbshipit-source-id: be29a4c020fc440500c137216fcc1cf529571eb3
Summary:
Currently, `Statistics` can record tick by `recordTick()` whose second parameter is an `uint64_t`.
That means tick can only increase.
If we want to reduce tick, we have to work around like `RecordTick(statistics_, NO_ITERATORS, uint64_t(-1));`.
That's kind of a hack.
So, this PR divide `NO_ITERATORS` into two counters `NO_ITERATOR_CREATED` and `NO_ITERATOR_DELETE`, making the counters increase only.
Fixes#3013 .
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4498
Differential Revision: D10395010
Pulled By: sagar0
fbshipit-source-id: cfb523b22a37411c794b4e9da090f1ae30293db2
Summary:
Even during `DBIter::Prev()`, there is a case where we need to use `RangeDelPositioningMode::kForwardTraversal`. In particular, when we hit too many internal keys for a single user key, we use seek to find the newest internal key. If it's a merge operand, we then scan forwards, collecting the merge operands. This forward scan should be using `RangeDelPositioningMode::kForwardTraversal`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4481
Differential Revision: D10319507
Pulled By: ajkr
fbshipit-source-id: b5ce7352461f3a7696b28a5136ae0076f2bde51f
Summary:
In the current trace_and replay, Get an WriteBatch are traced. This pull request track down the Seek() and SeekForPrev() to the trace file. <target_key, timestamp, column_family_id> are write to the file.
Replay of Iterator is not supported in the current implementation.
Tested with trace_analyzer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4228
Differential Revision: D9201381
Pulled By: zhichao-cao
fbshipit-source-id: 6f9cc9cb3c20260af741bee065ec35c5c96354ab
Summary:
This fixes the same performance issue that #3992 fixes but with much more invasive cleanup.
I'm more excited about this PR because it paves the way for fixing another problem we uncovered at Cockroach where range deletion tombstones can cause massive compactions. For example, suppose L4 contains deletions from [a, c) and [x, z) and no other keys, and L5 is entirely empty. L6, however, is full of data. When compacting L4 -> L5, we'll end up with one file that spans, massively, from [a, z). When we go to compact L5 -> L6, we'll have to rewrite all of L6! If, instead of range deletions in L4, we had keys a, b, x, y, and z, RocksDB would have been smart enough to create two files in L5: one for a and b and another for x, y, and z.
With the changes in this PR, it will be possible to adjust the compaction logic to split tombstones/start new output files when they would span too many files in the grandparent level.
ajkr please take a look when you have a minute!
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4014
Differential Revision: D8773253
Pulled By: ajkr
fbshipit-source-id: ec62fa85f648fdebe1380b83ed997f9baec35677
Summary:
This is implemented by extending ReadCallback with another function `MaxUnpreparedSequenceNumber` which returns the largest visible sequence number for the current transaction, if there is uncommitted data written to DB. Otherwise, it returns zero, indicating no uncommitted data.
There are the places where reads had to be modified.
- Get and Seek/Next was just updated to seek to max(snapshot_seq, MaxUnpreparedSequenceNumber()) instead, and iterate until a key was visible.
- Prev did not need need updates since it did not use the Seek to sequence number optimization. Assuming that locks were held when writing unprepared keys, and ValidateSnapshot runs, there should only be committed keys and unprepared keys of the current transaction, all of which are visible. Prev will simply iterate to get the last visible key.
- Reseeking to skip keys optimization was also disabled for write unprepared, since it's possible to hit the max_skip condition even while reseeking. There needs to be some way to resolve infinite looping in this case.
Closes https://github.com/facebook/rocksdb/pull/3955
Differential Revision: D8286688
Pulled By: lth
fbshipit-source-id: 25e42f47fdeb5f7accea0f4fd350ef35198caafe
Summary:
Currently it is not possible to change bloom filter config without restart the db, which is causing a lot of operational complexity for users.
This PR aims to make it possible to dynamically change bloom filter config.
Closes https://github.com/facebook/rocksdb/pull/3601
Differential Revision: D7253114
Pulled By: miasantreble
fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c
Summary:
Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are:
* If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted.
* When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not.
However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours).
This PR changes the convention to:
* If status() is not ok, Valid() always returns false.
* Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.)
This does sacrifice the two use cases listed above, but siying said it's ok.
Overview of the changes:
* A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario.
* Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed.
* A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator.
To find the iterator types that needed changes I searched for "public .*Iterator" in the code. Here's an overview of all 27 iterator types:
Iterators that didn't need changes:
* status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator.
* Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator.
Iterators with changes (see inline comments for details):
* DBIter - an overhaul:
- It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back.
- It had a few code paths silently discarding subiterator's status. The stress test caught a few.
- The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example.
- Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption.
- It used to not reset status on seek for some types of errors.
- Some simplifications and better comments.
- Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer.
* MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status.
* ForwardIterator - changed to the new convention, also slightly simplified.
* ForwardLevelIterator - fixed some bugs and simplified.
* LevelIterator - simplified.
* TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_.
* BlockBasedTableIterator - minor changes.
* BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid.
* PlainTableIterator - some seeks used to not reset status.
* CuckooTableIterator - tiny code cleanup.
* ManagedIterator - fixed some bugs.
* BaseDeltaIterator - changed to the new convention and fixed a bug.
* BlobDBIterator - seeks used to not reset status.
* KeyConvertingIterator - some small change.
Closes https://github.com/facebook/rocksdb/pull/3810
Differential Revision: D7888019
Pulled By: al13n321
fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d
Summary:
Added a new iterator property: `rocksdb.iterator.internal-key` to get the internal-key (converted to user key) at which the iterator stopped.
Closes https://github.com/facebook/rocksdb/pull/3525
Differential Revision: D7033694
Pulled By: sagar0
fbshipit-source-id: d51e6c00f5e9d766c6276ef79774b81c6c5216f8
Summary:
ForwardIterator::SVCleanup() sometimes didn't pin superversion when it was supposed to. See the added test for the scenario. Here's the ASAN output of the added test without the fix (using `COMPILE_WITH_ASAN=1 make`): https://pastebin.com/9rD0Ywws
Closes https://github.com/facebook/rocksdb/pull/3415
Differential Revision: D6817414
Pulled By: al13n321
fbshipit-source-id: bc80c44ea78a3a1fa885dfa448a26111f91afb24
Summary:
In DBIter, Prev() calls FindValueForCurrentKey() to search the current value backward. If it finds that there are too many stale value being skipped, it falls back to FindValueForCurrentKeyUsingSeek(), seeking directly to the key with snapshot sequence. After introducing read_callback, however, the key it seeks to might not be visible, according to read_callback. It thus needs to keep searching forward until the first visible value.
Closes https://github.com/facebook/rocksdb/pull/3382
Differential Revision: D6756148
Pulled By: yiwu-arbug
fbshipit-source-id: 064e39b1eec5e083af1c10142600f26d1d2697be
Summary:
A proper implementation of Iterator::Refresh() for WritePreparedTxnDB would require release and acquire another snapshot. Since MyRocks don't make use of Iterator::Refresh(), we just simply mark it as not supported.
Closes https://github.com/facebook/rocksdb/pull/3290
Differential Revision: D6599931
Pulled By: yiwu-arbug
fbshipit-source-id: 4e1632d967316431424f6e458254ecf9a97567cf
Summary:
Since #1665, on merge error, iterator will be set to corrupted status, but it doesn't invalidate the iterator. Fixing it.
Closes https://github.com/facebook/rocksdb/pull/3226
Differential Revision: D6499094
Pulled By: yiwu-arbug
fbshipit-source-id: 80222930f949e31f90a6feaa37ddc3529b510d2c
Summary:
Add PreReleaseCallback to be called at the end of WriteImpl but before publishing the sequence number. The callback is used in WritePrepareTxn to i) update the commit map, ii) update the last published sequence number in the 2nd write queue. It also ensures that all the commits will go to the 2nd queue.
These changes will ensure that the commit map is updated before the sequence number is published and used by reading snapshots. If we use two write queues, the snapshots will use the seq number published by the 2nd queue. If we use one write queue (the default, the snapshots will use the last seq number in the memtable, which also indicates the last published seq number.
Closes https://github.com/facebook/rocksdb/pull/3205
Differential Revision: D6438959
Pulled By: maysamyabandeh
fbshipit-source-id: f8b6c434e94bc5f5ab9cb696879d4c23e2577ab9
Summary:
When Seek a key less than `lower_bound`, should return `lower_bound`.
ajkr PTAL
Closes https://github.com/facebook/rocksdb/pull/3199
Differential Revision: D6421126
Pulled By: ajkr
fbshipit-source-id: a06c825830573e0040630704f6bcb3f7f48626f7
Summary:
This diff adds a new ticker stat, NUMBER_ITER_SKIP, to count the
number of internal keys skipped during iteration. Keys can be skipped
due to deletes, or lower sequence number, or higher sequence number
than the one requested.
Also, fix the issue when StatisticsData is naturally aligned on cacheline boundary,
padding becomes a zero size array, which the Windows compiler doesn't
like. So add a cacheline worth of padding in that case to keep it happy.
We cannot conditionally add padding as gcc doesn't allow using sizeof
in preprocessor directives.
Closes https://github.com/facebook/rocksdb/pull/3177
Differential Revision: D6353897
Pulled By: anand1976
fbshipit-source-id: 441d5a09af9c4e22e7355242dfc0c7b27aa0a6c2
Summary:
This patch clarifies and refactors the logic around tracked keys in transactions.
Closes https://github.com/facebook/rocksdb/pull/3140
Differential Revision: D6290258
Pulled By: maysamyabandeh
fbshipit-source-id: 03b50646264cbcc550813c060b180fc7451a55c1
Summary:
The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2).
This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages.
From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff".
This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR.
For now, what's done here according to initial discussions:
Preserving deletes:
- We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion.
- I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum.
- Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum.
Iterator changes:
- couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum.
TableCache changes:
- I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span.
What's left:
- Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type.
Closes https://github.com/facebook/rocksdb/pull/2999
Differential Revision: D6175602
Pulled By: mikhail-antonov
fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900
Summary:
- for `SeekToFirst()`, just convert it to a regular `Seek()` if lower bound is specified
- for operations that iterate backwards over user keys (`SeekForPrev`, `SeekToLast`, `Prev`), change `PrevInternal` to check whether user key went below lower bound every time the user key changes -- same approach we use to ensure we stay within a prefix when `prefix_same_as_start=true`.
Closes https://github.com/facebook/rocksdb/pull/3074
Differential Revision: D6158654
Pulled By: ajkr
fbshipit-source-id: cb0e3a922e2650d2cd4d1c6e1c0f1e8b729ff518
Summary:
When doing a Seek() or SeekForPrev() we should stop the moment we see a key with a different prefix as start if ReadOptions:: prefix_same_as_start was set to true
Right now we don't stop if we encounter a tombstone outside the prefix while executing SeekForPrev()
Closes https://github.com/facebook/rocksdb/pull/3067
Differential Revision: D6149638
Pulled By: IslamAbdelRahman
fbshipit-source-id: 7f659862d2bf552d3c9104a360c79439ceba2f18
Summary:
On iterator create, take a snapshot, create a ReadCallback and pass the ReadCallback to the underlying DBIter to check if key is committed.
Closes https://github.com/facebook/rocksdb/pull/2981
Differential Revision: D6001471
Pulled By: yiwu-arbug
fbshipit-source-id: 3565c4cdaf25370ba47008b0e0cb65b31dfe79fe
Summary:
Add kTypeBlobIndex value type, which will be used by blob db only, to insert a (key, blob_offset) KV pair. The purpose is to
1. Make it possible to open existing rocksdb instance as blob db. Existing value will be of kTypeIndex type, while value inserted by blob db will be of kTypeBlobIndex.
2. Make rocksdb able to detect if the db contains value written by blob db, if so return error.
3. Make it possible to have blob db optionally store value in SST file (with kTypeValue type) or as a blob value (with kTypeBlobIndex type).
The root db (DBImpl) basically pretended kTypeBlobIndex are normal value on write. On Get if is_blob is provided, return whether the value read is of kTypeBlobIndex type, or return Status::NotSupported() status if is_blob is not provided. On scan allow_blob flag is pass and if the flag is true, return wether the value is of kTypeBlobIndex type via iter->IsBlob().
Changes on blob db side will be in a separate patch.
Closes https://github.com/facebook/rocksdb/pull/2886
Differential Revision: D5838431
Pulled By: yiwu-arbug
fbshipit-source-id: 3c5306c62bc13bb11abc03422ec5cbcea1203cca
Summary:
Three small optimizations:
(1) iter_->IsKeyPinned() shouldn't be called if read_options.pin_data is not true. This may trigger function call all the way down the iterator tree.
(2) reuse the iterator key object in DBIter::FindNextUserEntryInternal(). The constructor of the class has some overheads.
(3) Move the switching direction logic in MergingIterator::Next() to a separate function.
These three in total improves readseq performance by about 3% in my benchmark setting.
Closes https://github.com/facebook/rocksdb/pull/2880
Differential Revision: D5829252
Pulled By: siying
fbshipit-source-id: 991aea10c6d6c3b43769cb4db168db62954ad1e3
Summary:
DBIter is referenced in ArenaWrappedDBIter, which is a simple wrapper. If DBIter is final, some virtual function call can be avoided. Some functions can even be inlined, like DBIter.value() to ArenaWrappedDBIter.value() and DBIter.key() to ArenaWrappedDBIter.key(). The performance gain is hard to measure. I just ran the memory-only benchmark for readseq and saw it didn't regress. There shouldn't be any harm doing it. Just give compiler more choices.
Closes https://github.com/facebook/rocksdb/pull/2859
Differential Revision: D5799888
Pulled By: siying
fbshipit-source-id: 829788f91310c40282dcfb7e412e6ef489931143
Summary:
With this PR, we can measure read-amp for queries where perf_context is enabled as follows:
```
SetPerfLevel(kEnableCount);
Get(1, "foo");
double read_amp = static_cast<double>(get_perf_context()->block_read_byte / get_perf_context()->get_read_bytes);
SetPerfLevel(kDisable);
```
Our internal infra enables perf_context for a sampling of queries. So we'll be able to compute the read-amp for the sample set, which can give us a good estimate of read-amp.
Closes https://github.com/facebook/rocksdb/pull/2749
Differential Revision: D5647240
Pulled By: ajkr
fbshipit-source-id: ad73550b06990cf040cc4528fa885360f308ec12
Summary:
Add and implement Iterator::Refresh(). When this function is called, if the super version doesn't change, update the sequence number of the iterator to the latest one and invalidate the iterator. If the super version changed, recreated the whole iterator. This can help users reuse the iterator more easily.
Closes https://github.com/facebook/rocksdb/pull/2621
Differential Revision: D5464500
Pulled By: siying
fbshipit-source-id: f548bd35e85c1efca2ea69273802f6704eba6ba9
Summary:
"make analyze" is reporting some errors. It's complicated to look but it seems to me that they are all false positive. Anyway, I think cleaning them up is a good idea. Some of the changes are hacky but I don't know a better way.
Closes https://github.com/facebook/rocksdb/pull/2508
Differential Revision: D5341710
Pulled By: siying
fbshipit-source-id: 6070e430e0e41a080ef441e05e8ec827d45efab6
Summary:
Add a histogram in statistics to help users understand how many merge operands they merge.
Closes https://github.com/facebook/rocksdb/pull/2373
Differential Revision: D5139983
Pulled By: siying
fbshipit-source-id: 61b9ba8ca83f358530a4833d68f0103b56a0e182
Summary:
Fix build error in db_iter.cc when running clang-analyzer.
```
CC db/db_iter.o
db/db_iter.cc:938:21: error: no matching constructor for initialization of 'rocksdb::ParsedInternalKey'
ParsedInternalKey ikey(Slice(), 0, 0);
^ ~~~~~~~~~~~~~
./db/dbformat.h:84:3: note: candidate constructor not viable: no known conversion from 'int' to 'rocksdb::ValueType' for 3rd argument
ParsedInternalKey(const Slice& u, const SequenceNumber& seq, ValueType t)
^
./db/dbformat.h:78:8: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 3 were provided
struct ParsedInternalKey {
^
./db/dbformat.h:78:8: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 3 were provided
./db/dbformat.h:83:3: note: candidate constructor not viable: requires 0 arguments, but 3 were provided
ParsedInternalKey() { } // Intentionally left uninitialized (for speed)
^
1 error generated.
```
Closes https://github.com/facebook/rocksdb/pull/2354
Differential Revision: D5115751
Pulled By: sagar0
fbshipit-source-id: b0e386d4e935e4725b07761c3ca5f7a8cbde3692
Summary:
Fixing two types of clang-analyzer false positives:
* db is deleted and then reopen, and clang-analyzer thinks we are reusing the pointer after it has been deleted. Adding asserts to hint clang-analyzer the pointer is recreated.
* ParsedInternalKey is (intentionally) uninitialized. Initialize the struct only when clang-analyzer is running.
Closes https://github.com/facebook/rocksdb/pull/2334
Differential Revision: D5093801
Pulled By: yiwu-arbug
fbshipit-source-id: f51355382098eb3da5ab9f64e094c6d03e6bdf7d
Summary:
DBIter, and in-turn NewDBIterator and NewArenaWrappedDBIterator, take a bunch of params. They can be reduced by passing in ReadOptions directly instead of passing in every new param separately. It also seems much cleaner as a bunch of the params towards the end seem to be optional.
(Recently I introduced max_skippable_internal_keys, which added one more to the already huge count).
Idea courtesy IslamAbdelRahman
Closes https://github.com/facebook/rocksdb/pull/2116
Differential Revision: D4857128
Pulled By: sagar0
fbshipit-source-id: 7d239df094b94bd9ea79d145cdf825478ac037a8
Summary:
Move some files under util/ to new directories env/, monitoring/ options/ and cache/
Closes https://github.com/facebook/rocksdb/pull/2090
Differential Revision: D4833681
Pulled By: siying
fbshipit-source-id: 2fd8bef
Summary:
to void future bug that caused by the mix of userkey/internalkey
Closes https://github.com/facebook/rocksdb/pull/2084
Differential Revision: D4825889
Pulled By: lightmark
fbshipit-source-id: 28411db
Summary:
It is confusing to have auto_roll_logger to stay under db/, which has nothing to do with database. Move filename together as it is a dependency.
Closes https://github.com/facebook/rocksdb/pull/2080
Differential Revision: D4821141
Pulled By: siying
fbshipit-source-id: ca7d768
Summary:
Operations like Seek/Next/Prev sometimes take too long to complete when there are many internal keys to be skipped. Adding an option, max_skippable_internal_keys -- which could be used to set a threshold for the maximum number of keys that can be skipped, will help to address these cases where it is much better to fail a request (as incomplete) than to wait for a considerable time for the request to complete.
This feature -- to fail an iterator seek request as incomplete, is disabled by default when max_skippable_internal_keys = 0. It is enabled only when max_skippable_internal_keys > 0.
This feature is based on the discussion mentioned in the PR https://github.com/facebook/rocksdb/pull/1084.
Closes https://github.com/facebook/rocksdb/pull/2000
Differential Revision: D4753223
Pulled By: sagar0
fbshipit-source-id: 1c973f7
Summary:
When deletion-collapsing mode is enabled (i.e., for DBIter/CompactionIterator), we maintain position in the tombstone maps across calls to ShouldDelete(). Since iterators often access keys sequentially (or reverse-sequentially), scanning forward/backward from the last position can be faster than binary-searching the map for every key.
- When Next() is invoked on an iterator, we use kForwardTraversal to scan forwards, if needed, until arriving at the range deletion containing the next key.
- Similarly for Prev(), we use kBackwardTraversal to scan backwards in the range deletion map.
- When the iterator seeks, we use kBinarySearch for repositioning
- After tombstones are added or before the first ShouldDelete() invocation, the current position is set to invalid, which forces kBinarySearch to be used.
- Non-iterator users (i.e., Get()) use kFullScan, which has the same behavior as before---scan the whole map for every key passed to ShouldDelete().
Closes https://github.com/facebook/rocksdb/pull/1701
Differential Revision: D4350318
Pulled By: ajkr
fbshipit-source-id: 5129b76
Summary:
Added a tombstone-collapsing mode to RangeDelAggregator, which eliminates overlap in the TombstoneMap. In this mode, we can check whether a tombstone covers a user key using upper_bound() (i.e., binary search). However, the tradeoff is the overhead to add tombstones is now higher, so at first I've only enabled it for range scans (compaction/flush/user iterators), where we expect a high number of calls to ShouldDelete() for the same tombstones. Point queries like Get() will still use the linear scan approach.
Also in this diff I changed RangeDelAggregator's TombstoneMap to use multimap with user keys instead of map with internal keys. Callers sometimes provided ParsedInternalKey directly, from which it would've required string copying to derive an internal key Slice with which we could search the map.
Closes https://github.com/facebook/rocksdb/pull/1614
Differential Revision: D4270397
Pulled By: ajkr
fbshipit-source-id: 93092c7
Summary:
Iterator should be in corrupted status if merge operator return false.
Also add test to make sure if max_successive_merges is hit during write,
data will not be lost.
Closes https://github.com/facebook/rocksdb/pull/1665
Differential Revision: D4322695
Pulled By: yiwu-arbug
fbshipit-source-id: b327b05
Summary:
In one deployment we saw high latencies (presumably from slow iterator operations) and a lot of CPU time reported by perf with this stack:
```
rocksdb::MergingIterator::Next
rocksdb::DBIter::FindNextUserEntryInternal
rocksdb::DBIter::Seek
```
I think what's happening is:
1. we create a snapshot iterator,
2. we do lots of Put()s for the same key x; this creates lots of entries in memtable,
3. we seek the iterator to a key slightly smaller than x,
4. the seek walks over lots of entries in memtable for key x, skipping them because of high sequence numbers.
CC IslamAbdelRahman
Closes https://github.com/facebook/rocksdb/pull/1413
Differential Revision: D4083879
Pulled By: IslamAbdelRahman
fbshipit-source-id: a83ddae
Summary:
- Made RangeDelAggregator's InternalKeyComparator member a reference-to-const so we don't need to copy-construct it. Also added InternalKeyComparator to ImmutableCFOptions so we don't need to construct one for each DBIter.
- Made MemTable::NewRangeTombstoneIterator and the table readers' NewRangeTombstoneIterator() functions return nullptr instead of NewEmptyInternalIterator to avoid the allocation. Updated callers accordingly.
Closes https://github.com/facebook/rocksdb/pull/1548
Differential Revision: D4208169
Pulled By: ajkr
fbshipit-source-id: 2fd65cf
Summary:
Since a RangeDelAggregator is created for each read request, these heap-allocating member variables were consuming significant CPU (~3% total) which slowed down request throughput. The map and pinning manager are only necessary when range deletions exist, so we can defer their initialization until the first range deletion is encountered. Currently lazy initialization is done for reads only since reads pass us a single snapshot, which is easier to store on the stack for later insertion into the map than the vector passed to us by flush or compaction.
Note the Arena member variable is still expensive, I will figure out what to do with it in a subsequent diff. It cannot be lazily initialized because we currently use this arena even to allocate empty iterators, which is necessary even when no range deletions exist.
Closes https://github.com/facebook/rocksdb/pull/1539
Differential Revision: D4203488
Pulled By: ajkr
fbshipit-source-id: 3b36279
Summary:
Note: reviewed in https://reviews.facebook.net/D65115
- DBIter maintains a range tombstone accumulator. We don't cleanup obsolete tombstones yet, so if the user seeks back and forth, the same tombstones would be added to the accumulator multiple times.
- DBImpl::NewInternalIterator() (used to make DBIter's underlying iterator) adds memtable/L0 range tombstones, L1+ range tombstones are added on-demand during NewSecondaryIterator() (see D62205)
- DBIter uses ShouldDelete() when advancing to check whether keys are covered by range tombstones
Closes https://github.com/facebook/rocksdb/pull/1464
Differential Revision: D4131753
Pulled By: ajkr
fbshipit-source-id: be86559
Summary: in rocksdb::DBIter::FindValueForCurrentKey(), last_not_merge_type could also be SingleDelete() which is omitted
Test Plan: db_iter_test
Reviewers: yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D65187
Summary:
1) The previous solution for Prev() prefix support is not clean.
Since I add api SeekForPrev(), now the Prev() can be symmetric to Next().
and we do not need SeekToLast() to be called in Prev() any more.
Also, Next() will Seek(prefix_seek_key_) to solve the problem of possible inconsistency between db_iter and merge_iter when
there is merge_operator. And prefix_seek_key is only refreshed when change direction to forward.
2) This diff also solves the bug of Iterator::SeekToLast() with iterate_upper_bound_ with prefix extractor.
add test cases for the above two cases.
There are some tests for the SeekToLast() in Prev(), I will clean them later.
Test Plan: make all check
Reviewers: IslamAbdelRahman, andrewkr, yiwu, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63933
Summary:
Add new Iterator API, `SeekForPrev`: find the last key that <= target key
support prefix_extractor
support prefix_same_as_start
support upper_bound
not supported in iterators without Prev()
Also add tests in db_iter_test and db_iterator_test
Pass all tests
Cheers!
Test Plan: make all check -j64
Reviewers: andrewkr, yiwu, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64149
Summary: As title, make sure Prev() works as expected with Next() when the current iter->key() in the range of the same prefix in prefix seek mode
Test Plan: make all check -j64 (add prefix_test with PrefixSeekModePrev test case)
Reviewers: andrewkr, sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: yoshinorim, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61419
Summary:
This diff include these simple change
- Rename ReleasePinnedIterators to ReleasePinnedData
- Rename PinIteratorIfNeeded to PinIterator
- Use std::vector directly in PinnedIteratorsManager instead of std::unique_ptr<std::vector>
- Generalize PinnedIteratorsManager by adding PinPtr which can pin any pointer
Test Plan: existing tests
Reviewers: sdong, yiwu, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61305
Summary:
We have alot of code duplication whenever we call FullMerge we keep duplicating the instrumentation and statistics code
This is a simple diff to refactor the code to use TimedFullMerge instead of FullMerge
Test Plan: COMPILE_WITH_ASAN=1 make check -j64
Reviewers: andrewkr, yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59577
Summary: We should not use IterKey::SetKey with copy = false except if we are pinning the iterator thru it's life time, otherwise we may release the temporarily pinned blocks and in this case the IterKey will be pointing to freed memory
Test Plan: added a new test
Reviewers: sdong, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57561
Summary:
While trying to reuse PinData() / ReleasePinnedData() .. to optimize away some memcpys I realized that there is a significant overhead for using PinData() / ReleasePinnedData if they were called many times.
This diff refactor the pinning logic by introducing PinnedIteratorsManager a centralized component that will be created once and will be notified whenever we need to Pin an Iterator. This implementation have much less overhead than the original implementation
Test Plan:
make check -j64
COMPILE_WITH_ASAN=1 make check -j64
Reviewers: yhchiang, sdong, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56493
Summary: I realized I again is wrong about the naming convention. Let me change it to the correct one.
Test Plan: Run unit tests.
Reviewers: IslamAbdelRahman, kradhakrishnan, yhchiang, andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D55041
Summary: We want to provide a way to detect whether an iterator is stale and needs to be recreated. Add a iterator property to return version number.
Test Plan: Add two unit tests for it.
Reviewers: IslamAbdelRahman, yhchiang, anthony, kradhakrishnan, andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D54921
Summary:
Add Iterator::GetProperty(), a way for users to communicate with iterator, and turn Iterator::IsKeyPinned() with it.
As a follow-up, I'll ask a property as the version number attached to the iterator
Test Plan: Rerun existing tests and add a negative test case.
Reviewers: yhchiang, andrewkr, kradhakrishnan, anthony, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D54783
Summary:
This patch update the Iterator API to introduce new functions that allow users to keep the Slices returned by key() valid as long as the Iterator is not deleted
ReadOptions::pin_data : If true keep loaded blocks in memory as long as the iterator is not deleted
Iterator::IsKeyPinned() : If true, this mean that the Slice returned by key() is valid as long as the iterator is not deleted
Also add a new option BlockBasedTableOptions::use_delta_encoding to allow users to disable delta_encoding if needed.
Benchmark results (using https://phabricator.fb.com/P20083553)
```
// $ du -h /home/tec/local/normal.4K.Snappy/db10077
// 6.1G /home/tec/local/normal.4K.Snappy/db10077
// $ du -h /home/tec/local/zero.8K.LZ4/db10077
// 6.4G /home/tec/local/zero.8K.LZ4/db10077
// Benchmarks for shard db10077
// _build/opt/rocks/benchmark/rocks_copy_benchmark \
// --normal_db_path="/home/tec/local/normal.4K.Snappy/db10077" \
// --zero_db_path="/home/tec/local/zero.8K.LZ4/db10077"
// First run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s
// ============================================================================
// BM_StringCopy 1.73s 576.97m
// BM_StringPiece 103.74% 1.67s 598.55m
// ============================================================================
// Match rate : 1000000 / 1000000
// Second run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s
// ============================================================================
// BM_StringCopy 611.99ms 1.63
// BM_StringPiece 203.76% 300.35ms 3.33
// ============================================================================
// Match rate : 1000000 / 1000000
```
Test Plan: Unit tests
Reviewers: sdong, igor, anthony, yhchiang, rven
Reviewed By: rven
Subscribers: dhruba, lovro, adsharma
Differential Revision: https://reviews.facebook.net/D48999
Summary: With recent commit 33e0c93826, db iterator skips perf context counter internal_key_skipped_count when blindly issuing internal Next(). Now increment the counter by one when issuing this Next()
Test Plan: Run all existing tests
Reviewers: rven, yhchiang, IslamAbdelRahman, kradhakrishnan, igor, anthony
Reviewed By: anthony
Subscribers: yoshinorim, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51465
Summary: Now DBIter::Next() always compares with current key with itself first, which is unnecessary if the last key is not a merge key. I made the change and didn't see db_iter_test fails. Want to hear whether people have any idea what I miss.
Test Plan: Run all unit tests
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48279
Summary: Use IterKey to store prefix_start_ so that it doesn't get freed
Test Plan: PrefixTest.PrefixValid
Reviewers: anthony, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D50289
Summary:
MyRocks testing found an issue that while iterating over keys
that are outside the prefix, sometimes wrong results were seen for keys
outside the prefix. We now tighten the range of keys seen with a new
read option called prefix_seen_at_start. This remembers the starting
prefix and then compares it on a Next for equality of prefix. If they
are from a different prefix, it sets valid to false.
Test Plan: PrefixTest.PrefixValid
Reviewers: IslamAbdelRahman, sdong, yhchiang, anthony
Reviewed By: anthony
Subscribers: spetrunia, hermanlee4, yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D50211
Summary:
Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type.
This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's.
At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it.
Test Plan: Run all existing tests.
Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven
Reviewed By: rven
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48549
Summary:
This patch fixes#7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).
In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.
Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.
Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
this patch supported this so it could be resurrected if needed)
Test Plan: make all check
Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor
Reviewed By: igor
Subscribers: maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43179
Summary:
There are currently no statistics on seeks, only on gets. This adds the following counters:
rocksdb.number.db.seek
rocksdb.number.db.next
rocksdb.number.db.prev
(number of calls)
rocksdb.db.iterate.bytes.read
(number of bytes read from key + value using seek/next/prev)
rocksdb.number.keys.seek.found
rocksdb.number.keys.next.found
rocksdb.number.keys.prev.found
(number of calls where seek/next/prev found a value)
Test Plan:
./db_bench -statistics -benchmarks fillrandom,seekrandom -seek_nexts 5
./db_bench -statistics -benchmarks fillrandom,seekrandom -seek_nexts 5 -reverse_iterator
Reviewers: yhchiang, rven, kradhakrishnan, IslamAbdelRahman, MarkCallaghan, sdong, igor
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D46605
Summary:
In some cases, equality comparisons can be done more efficiently than three-way
comparisons. There are quite a few places in the code where we only care about
equality. This patch adds an Equal() method that defaults to using the
Compare() method.
Test Plan: make clean all check
Reviewers: rven, anthony, yhchiang, igor, sdong
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46233
Summary:
When DBIter changes iterating direction from forward to backward, it might see some much larger keys with higher sequence ID. With this commit, these rows will be actively filtered out. It should fix existing disabled tests in db_iter_test.
This may not be a perfect fix, but it introduces least impact on existing codes, in order to be safe.
Test Plan:
Enable existing tests and make sure they pass. Add a new test DBIterWithMergeIterTest.InnerMergeIteratorDataRace8.
Also run all existing tests.
Reviewers: yhchiang, rven, anthony, IslamAbdelRahman, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D45567
Summary: There is a check to fail the iterator if prefix extractor is specified but upper bound is out of the prefix for the seek key. Relax this constraint to allow users to set upper bound to the next prefix of the current one.
Test Plan: make commit-prereq
Reviewers: igor, anthony, kradhakrishnan, yhchiang, rven
Reviewed By: rven
Subscribers: tnovak, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D44949
Summary:
While working on single delete support for db_bench, I realized that
db_bench/db_stress contain a bunch of duplicate code related to
copmression and found some typos. This patch removes duplicate code,
typos and a redundant #ifndef in internal_stats.cc.
Test Plan: make db_stress && make db_bench && ./db_bench --benchmarks=compress,uncompress
Reviewers: yhchiang, sdong, rven, anthony, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43965