Commit graph

237 commits

Author SHA1 Message Date
Andrew Kryczka 9d347332fb Fix argument mismatch in BlockBasedTableBuilder (#3974)
Summary:
The sixth argument should be `key_includes_seq` bool, the seventh a `GetContext*`. We were mistakenly passing the `GetContext*` as the sixth argument and relying on the default (nullptr) for the seventh. This would make statistics inaccurate, at least.

Blame: 402b7aa0
Closes https://github.com/facebook/rocksdb/pull/3974

Differential Revision: D8344907

Pulled By: ajkr

fbshipit-source-id: 3ad865a0541d6d30f75dfc726352788118cfe12e
2018-06-12 13:57:44 -07:00
Maysam Yabandeh b73652169e Extend format 3 to partitioned index/filters (#3958)
Summary:
format_version 3 changes the format of index blocks by storing user keys instead of the internal keys, which saves 8-bytes per key. This patch extends the format to top-level indexes in partitioned index/filters.
Closes https://github.com/facebook/rocksdb/pull/3958

Differential Revision: D8294615

Pulled By: maysamyabandeh

fbshipit-source-id: 17666cc16b8076c363972e2308e31547e835f0fe
2018-06-06 16:58:16 -07:00
Zhongyi Xie f1592a06c2 run make format for PR 3838 (#3954)
Summary:
PR https://github.com/facebook/rocksdb/pull/3838 made some changes that triggers lint warnings.
Run `make format` to fix formatting as suggested by siying .
Also piggyback two changes:
1) fix singleton destruction order for windows and posix env
2) fix two clang warnings
Closes https://github.com/facebook/rocksdb/pull/3954

Differential Revision: D8272041

Pulled By: miasantreble

fbshipit-source-id: 7c4fd12bd17aac13534520de0c733328aa3c6c9f
2018-06-05 12:58:02 -07:00
Maysam Yabandeh d0c38c0c8c Extend some tests to format_version=3 (#3942)
Summary:
format_version=3 changes the format of SST index. This is however not being tested currently since tests only work with the default format_version which is currently 2. The patch extends the most related tests to also test for format_version=3.
Closes https://github.com/facebook/rocksdb/pull/3942

Differential Revision: D8238413

Pulled By: maysamyabandeh

fbshipit-source-id: 915725f55753dd8e9188e802bf471c23645ad035
2018-06-04 20:13:00 -07:00
Dmitri Smirnov f4b72d7056 Provide a way to override windows memory allocator with jemalloc for ZSTD
Summary:
Windows does not have LD_PRELOAD mechanism to override all memory allocation functions and ZSTD makes use of C-tuntime calloc. During flushes and compactions default system allocator fragments and the system slows down considerably.

For builds with jemalloc we employ an advanced ZSTD context creation API that re-directs memory allocation to jemalloc. To reduce the cost of context creation on each block we cache ZSTD context within the block based table builder while a new SST file is being built, this will help all platform builds including those w/o jemalloc. This avoids system allocator fragmentation and improves the performance.

The change does not address random reads and currently on Windows reads with ZSTD regress as compared with SNAPPY compression.
Closes https://github.com/facebook/rocksdb/pull/3838

Differential Revision: D8229794

Pulled By: miasantreble

fbshipit-source-id: 719b622ab7bf4109819bc44f45ec66f0dd3ee80d
2018-06-04 12:12:48 -07:00
Andrew Kryczka fea2b1dfb2 Copy Get() result when file reads use mmap
Summary:
For iterator reads, a `SuperVersion` is pinned to preserve a snapshot of SST files, and `Block`s are pinned to allow `key()` and `value()` to return pointers directly into a RocksDB memory region. This works for both non-mmap reads, where the block owns the memory region, and mmap reads, where the file owns the memory region.

For point reads with `PinnableSlice`, only the `Block` object is pinned. This works for non-mmap reads because the block owns the memory region, so even if the file is deleted after compaction, the memory region survives. However, for mmap reads, file deletion causes the memory region to which the `PinnableSlice` refers to be unmapped.   The result is usually a segfault upon accessing the `PinnableSlice`, although sometimes it returned wrong results (I repro'd this a bunch of times with `db_stress`).

This PR copies the value into the `PinnableSlice` when it comes from mmap'd memory. We can tell whether the `Block` owns its memory using `Block::cachable()`, which is unset when reads do not use the provided buffer as is the case with mmap file reads. When that is false we ensure the result of `Get()` is copied.

This feels like a short-term solution as ideally we'd have the `PinnableSlice` pin the mmap'd memory so we can do zero-copy reads. It seemed hard so I chose this approach to fix correctness in the meantime.
Closes https://github.com/facebook/rocksdb/pull/3881

Differential Revision: D8076288

Pulled By: ajkr

fbshipit-source-id: 31d78ec010198723522323dbc6ea325122a46b08
2018-06-01 16:57:58 -07:00
Zhongyi Xie 2a0dfaa044 fix PrefixExtractorChanged: pass raw pointer instead shared_ptr
Summary:
This should resolve the performance regression caused by the unnecessary copying of the shared_ptr.
Closes https://github.com/facebook/rocksdb/pull/3937

Differential Revision: D8232330

Pulled By: miasantreble

fbshipit-source-id: 7885bf7cd190b6f87164c52d6edd328298c13f97
2018-05-31 21:42:50 -07:00
Maysam Yabandeh 44cf84932f Fix the bug of some test scenarios being put after kEnd
Summary:
DBTestBase::OptionConfig includes the scenarios that unit tests could iterate over them by calling ChangeOptions(). Some of the options have  been mistakenly put after kEnd which makes them essentially invisible to ChangeOptions() caller. This patch fixes it except for kUniversalSubcompactions which is left as TODO since it would break some unit tests.
Closes https://github.com/facebook/rocksdb/pull/3935

Differential Revision: D8230748

Pulled By: maysamyabandeh

fbshipit-source-id: edddb8fffcd161af1809fef24798ce118f8593db
2018-05-31 19:28:00 -07:00
Maysam Yabandeh 03cda531e4 Check for rep_->table_properties being nullptr
Summary:
The very old sst formats do not have table_properties and rep_->table_properties is thus nullptr. The recent patch in https://github.com/facebook/rocksdb/pull/3894 does not check for nullptr and hence makes it backward incompatible. This patch adds the check.
Closes https://github.com/facebook/rocksdb/pull/3918

Differential Revision: D8188638

Pulled By: maysamyabandeh

fbshipit-source-id: b1d986665ecf0b4d1c442adfa8a193b97707d47b
2018-05-29 12:13:55 -07:00
Maysam Yabandeh 402b7aa07f Exclude seq from index keys
Summary:
Index blocks have the same format as data blocks. The keys therefore similarly to the keys in the data blocks are internal keys, which means that in addition to the user key it also has 8 bytes that encodes sequence number and value type. This extra 8 bytes however is not necessary in index blocks since the index keys act as an separator between two data blocks. The only exception is when the last key of a block and the first key of the next block share the same user key, in which the sequence number is required to act as a separator.
The patch excludes the sequence from index keys only if the above special case does not happen for any of the index keys. It then records that in the property block. The reader looks at the property block to see if it should expect sequence numbers in the keys of the index block.s
Closes https://github.com/facebook/rocksdb/pull/3894

Differential Revision: D8118775

Pulled By: maysamyabandeh

fbshipit-source-id: 915479f028b5799ca91671d67455ecdefbd873bd
2018-05-25 18:42:43 -07:00
Nathan VanBenschoten 8c3bf0801b Check status when reading HashIndexPrefixesMetadataBlock
Summary:
This was missed in a refactor of `ReadBlockContents` (2f1a3a4).
Closes https://github.com/facebook/rocksdb/pull/3906

Differential Revision: D8172648

Pulled By: ajkr

fbshipit-source-id: 27e453b19795fea974bfed4721105be6f3a12090
2018-05-25 17:42:51 -07:00
Zhongyi Xie 6c73a46693 Fix a backward compatibility problem with table_properties being nullptr
Summary:
Currently when ldb built from master tries to open a DB from version 2.2, there will be a segfault because table_properties didn't exist back then.
Closes https://github.com/facebook/rocksdb/pull/3890

Differential Revision: D8100914

Pulled By: miasantreble

fbshipit-source-id: b255e8aedc54695432be2e704839c857dabdd65a
2018-05-22 13:57:17 -07:00
Zhongyi Xie c3ebc75843 Move prefix_extractor to MutableCFOptions
Summary:
Currently it is not possible to change bloom filter config without restart the db, which is causing a lot of operational complexity for users.
This PR aims to make it possible to dynamically change bloom filter config.
Closes https://github.com/facebook/rocksdb/pull/3601

Differential Revision: D7253114

Pulled By: miasantreble

fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c
2018-05-21 14:43:11 -07:00
Mike Kolupaev 8bf555f487 Change and clarify the relationship between Valid(), status() and Seek*() for all iterators. Also fix some bugs
Summary:
Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are:
 * If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted.
 * When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not.

However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours).

This PR changes the convention to:
 * If status() is not ok, Valid() always returns false.
 * Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.)

This does sacrifice the two use cases listed above, but siying said it's ok.

Overview of the changes:
 * A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario.
 * Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed.
 * A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator.

To find the iterator types that needed changes I searched for "public .*Iterator" in the code. Here's an overview of all 27 iterator types:

Iterators that didn't need changes:
 * status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator.
 * Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator.

Iterators with changes (see inline comments for details):
 * DBIter - an overhaul:
    - It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back.
    - It had a few code paths silently discarding subiterator's status. The stress test caught a few.
    - The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example.
    - Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption.
    - It used to not reset status on seek for some types of errors.
    - Some simplifications and better comments.
    - Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer.
 * MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status.
 * ForwardIterator - changed to the new convention, also slightly simplified.
 * ForwardLevelIterator - fixed some bugs and simplified.
 * LevelIterator - simplified.
 * TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_.
 * BlockBasedTableIterator - minor changes.
 * BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid.
 * PlainTableIterator - some seeks used to not reset status.
 * CuckooTableIterator - tiny code cleanup.
 * ManagedIterator - fixed some bugs.
 * BaseDeltaIterator - changed to the new convention and fixed a bug.
 * BlobDBIterator - seeks used to not reset status.
 * KeyConvertingIterator - some small change.
Closes https://github.com/facebook/rocksdb/pull/3810

Differential Revision: D7888019

Pulled By: al13n321

fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d
2018-05-17 02:56:56 -07:00
Zhongyi Xie 6cab3184f5 avoid double delete on dummy record insertion failure
Summary:
When the dummy record insertion fails, there is no need to explicitly delete the block as it will be registered for cleanup regardless.
Closes https://github.com/facebook/rocksdb/pull/3688

Differential Revision: D7537741

Pulled By: miasantreble

fbshipit-source-id: fcd3a3d3d382ee8e2c7ced0a4980e683d93a16d6
2018-05-01 16:01:28 -07:00
Andrew Kryczka 7004e45489 Remove block-based table assertion for non-empty filter block
Summary:
7a6353bd1c prevents empty filter blocks from being written for SST files containing range deletions only. However the assertion this PR removes is still a problem as we could be reading from a DB generated by a RocksDB build without the 7a6353bd1c patch. So remove the assertion. We already don't do this check when `cache_index_and_filter_blocks=false`, so it should be safe.
Closes https://github.com/facebook/rocksdb/pull/3773

Differential Revision: D7769964

Pulled By: ajkr

fbshipit-source-id: 7285762446f2cd2ccf16efd7a988a106fbb0d8d3
2018-04-26 14:43:11 -07:00
Maysam Yabandeh 17e04039dd Propagate fill_cache config to partitioned index iterator
Summary:
Currently the partitioned index iterator creates a new ReadOptions which ignores the fill_cache config set to ReadOptions passed by the user. The patch propagates fill_cache from the user's ReadOptions to that of partition index iterator.
Also it clarifies the contract of fill_cache that i) it does not apply to filters, ii) it still charges block cache for the size of the data block, it still pin the block if it is already in the block cache.
Closes https://github.com/facebook/rocksdb/pull/3739

Differential Revision: D7678308

Pulled By: maysamyabandeh

fbshipit-source-id: 53ed96424ae922e499e2d4e3580ddc3f0db893da
2018-04-20 15:13:05 -07:00
Zhongyi Xie 954b496b3f fix memory leak in two_level_iterator
Summary:
this PR fixes a few failed contbuild:
1. ASAN memory leak in Block::NewIterator (table/block.cc:429). the proper destruction of first_level_iter_ and second_level_iter_ of two_level_iterator.cc is missing from the code after the refactoring in https://github.com/facebook/rocksdb/pull/3406
2. various unused param errors introduced by https://github.com/facebook/rocksdb/pull/3662
3. updated comment for `ForceReleaseCachedEntry` to emphasize the use of `force_erase` flag.
Closes https://github.com/facebook/rocksdb/pull/3718

Reviewed By: maysamyabandeh

Differential Revision: D7621192

Pulled By: miasantreble

fbshipit-source-id: 476c94264083a0730ded957c29de7807e4f5b146
2018-04-15 17:26:26 -07:00
Maysam Yabandeh 67182678a5 Stats for false positive rate of full filtesr
Summary:
Adds two stats to allow us measuring the false positive rate of full filters:
- The total count of positives: rocksdb.bloom.filter.full.positive
- The total count of true positives: rocksdb.bloom.filter.full.true.positive
Not the term "full" in the stat name to indicate that they are meaningful in full filters. block-based filters are to be deprecated soon and supporting it is not worth the the additional cost of if-then-else branches.

Closes #3680

Tested by:
$ ./db_bench -benchmarks=fillrandom  -db /dev/shm/rocksdb-tmpdb --num=1000000 -bloom_bits=10
$ ./db_bench -benchmarks="readwhilewriting"  -db /dev/shm/rocksdb-tmpdb --statistics -bloom_bits=10 --duration=60 --num=2000000 --use_existing_db 2>&1 > /tmp/full.log
$ grep filter.full /tmp/full.log
rocksdb.bloom.filter.full.positive COUNT : 3628593
rocksdb.bloom.filter.full.true.positive COUNT : 3536026
which gives the false positive rate of 2.5%
Closes https://github.com/facebook/rocksdb/pull/3681

Differential Revision: D7517570

Pulled By: maysamyabandeh

fbshipit-source-id: 630ab1a473afdce404916d297035b6318de4c052
2018-04-05 15:58:48 -07:00
Fosco Marotto d518fe1da6 uint64_t and size_t changes to compile for iOS
Summary:
In attempting to build a static lib for use in iOS, I ran in to lots of type errors between uint64_t and size_t.  This PR contains the changes I made to get `TARGET_OS=IOS make static_lib` to succeed while also getting Xcode to build successfully with the resulting `librocksdb.a` library imported.

This also compiles for me on macOS and tests fine, but I'm really not sure if I made the correct decisions about where to `static_cast` and where to change types.

Also up for discussion: is iOS worth supporting?  Getting the static lib is just part one, we aren't providing any bridging headers or wrappers like the ObjectiveRocks project, it won't be a great experience.
Closes https://github.com/facebook/rocksdb/pull/3503

Differential Revision: D7106457

Pulled By: gfosco

fbshipit-source-id: 82ac2073de7e1f09b91f6b4faea91d18bd311f8e
2018-03-06 12:43:51 -08:00
Andrew Kryczka 5d68243e61 Comment out unused variables
Summary:
Submitting on behalf of another employee.
Closes https://github.com/facebook/rocksdb/pull/3557

Differential Revision: D7146025

Pulled By: ajkr

fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e
2018-03-05 13:13:41 -08:00
Igor Sugak aba3409740 Back out "[codemod] - comment out unused parameters"
Reviewed By: igorsugak

fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37
2018-02-22 12:43:17 -08:00
David Lai f4a030ce81 - comment out unused parameters
Reviewed By: everiq, igorsugak

Differential Revision: D7046710

fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461
2018-02-22 09:44:23 -08:00
Siying Dong b555ed30a4 Customized BlockBasedTableIterator and LevelIterator
Summary:
Use a customzied BlockBasedTableIterator and LevelIterator to replace current implementations leveraging two-level-iterator. Hope the customized logic will make code easier to understand. As a side effect, BlockBasedTableIterator reduces the allocation for the data block iterator object, and avoid the virtual function call to it, because we can directly reference BlockIter, a final class. Similarly, LevelIterator reduces virtual function call to the dummy iterator iterating the file metadata. It also enabled further optimization.

The upper bound check is also moved from index block to data block. This implementation fits this iterator better. After the change, forwared iterator is slightly optimized to ensure we trim those iterators.

The two-level-iterator now is only used by partitioned index, so it is simplified.
Closes https://github.com/facebook/rocksdb/pull/3406

Differential Revision: D6809041

Pulled By: siying

fbshipit-source-id: 7da3b9b1d3c8e9d9405302c15920af1fcaf50ffa
2018-02-12 17:12:25 -08:00
Andrew Kryczka e78715c29a Eliminate a memcpy for uncompressed blocks
Summary:
`ReadBlockFromFile` uses a stack buffer to hold small data blocks before passing them to the compression library, which outputs uncompressed data in a heap buffer. In the case of `kNoCompression` there is a `memcpy` to copy from stack buffer to heap buffer.

This PR optimizes `ReadBlockFromFile` to skip the stack buffer for files whose blocks are known to be uncompressed. We determine this using the SST file property, "compression_name", if it's available.
Closes https://github.com/facebook/rocksdb/pull/3472

Differential Revision: D6920848

Pulled By: ajkr

fbshipit-source-id: 5c753e804efc178b9229ae5dbe6a4adc32031f07
2018-02-07 15:57:37 -08:00
Andrew Kryczka 1edac32b77 Update rocksdb.read.block.get.micros when block cache disabled
Summary:
Previously `ReadBlockFromFile` for data blocks was only measured when reading a block to populate block cache. This PR adds the corresponding measurements for users who disabled block cache.
Closes https://github.com/facebook/rocksdb/pull/3442

Differential Revision: D6848671

Pulled By: ajkr

fbshipit-source-id: bb4bbe1797fa2cc1d9a5bad44891af2b55384b41
2018-01-31 14:26:52 -08:00
Zhongyi Xie 3fe0937180 Use block cache to track memory usage when ReadOptions.fill_cache=false
Summary:
ReadOptions.fill_cache is set in compaction inputs and can be set by users in their queries too. It tells RocksDB not to put a data block used to block cache.

The memory used by the data block is, however, not trackable by users.

To make the system more manageable, we can cost the block to block cache while using it, and then release it after using.
Closes https://github.com/facebook/rocksdb/pull/3333

Differential Revision: D6670230

Pulled By: miasantreble

fbshipit-source-id: ab848d3ed286bd081a13ee1903de357b56cbc308
2018-01-29 14:43:10 -08:00
Sagar Vemuri d938226af4 Improve performance of long range scans with readahead
Summary:
This change improves the performance of iterators doing long range scans (e.g. big/full table scans in MyRocks) by using readahead and prefetching additional data on each disk IO. This prefetching is automatically enabled on noticing more than 2 IOs for the same table file during iteration. The readahead size starts with 8KB and is exponentially increased on each additional sequential IO, up to a max of 256 KB. This helps in cutting down the number of IOs needed to complete the range scan.

Constraints:
- The prefetched data is stored by the OS in page cache. So this currently works only for non direct-reads use-cases i.e applications which use page cache. (Direct-I/O support will be enabled in a later PR).
- This gets currently enabled only when ReadOptions.readahead_size = 0 (which is the default value).

Thanks to siying for the original idea and implementation.

**Benchmarks:**
Data fill:
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes
```
Do a long range scan: Seekrandom with large number of nexts
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram
```

Page cache was cleared before each experiment with the command:
```
sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
```
```
Before:
seekrandom   :   34020.945 micros/op 29 ops/sec;   32.5 MB/s (1636 of 1999 found)
With this change:
seekrandom   :    8726.912 micros/op 114 ops/sec;  126.8 MB/s (5702 of 6999 found)
```
~3.9X performance improvement.

Also verified with strace and gdb that the readahead size is increasing as expected.
```
strace -e readahead -f -T -t -p <db_bench process pid>
```
Closes https://github.com/facebook/rocksdb/pull/3282

Differential Revision: D6586477

Pulled By: sagar0

fbshipit-source-id: 8a118a0ed4594fbb7f5b1cafb242d7a4033cb58c
2018-01-25 21:41:53 -08:00
Siying Dong 1039133f2d BlockBasedTable::NewDataBlockIterator to always return BlockIter
Summary:
This is a pre-cleaning up before a major block based table iterator refactoring. BlockBasedTable::NewDataBlockIterator() will always return BlockIter. This simplifies the logic and code and enable further refactoring and optimization.
Closes https://github.com/facebook/rocksdb/pull/3398

Differential Revision: D6780165

Pulled By: siying

fbshipit-source-id: 273f7dc896724f682c0118fb69a359d9cc4418b4
2018-01-25 14:57:18 -08:00
Dmitri Smirnov b010116d82 Eliminate some redundant block reads.
Summary:
Re-use metadata for reading Compression Dictionary on BlockBased
  table open, this saves two reads from disk.
  This helps to our 999 percentile in 5.6.1 where prefetch buffer is  not present.
Closes https://github.com/facebook/rocksdb/pull/3354

Differential Revision: D6695753

Pulled By: ajkr

fbshipit-source-id: bb8acd9e9e66e65b89c548ab8940570ae360333c
2018-01-10 17:11:58 -08:00
Zhongyi Xie 51c2ea0feb Reduce heavy hitter for Get operation
Summary:
This PR addresses the following heavy hitters in `Get` operation by moving calls to `StatisticsImpl::recordTick` from `BlockBasedTable` to `Version::Get`

- rocksdb.block.cache.bytes.write
- rocksdb.block.cache.add
- rocksdb.block.cache.data.miss
- rocksdb.block.cache.data.bytes.insert
- rocksdb.block.cache.data.add
- rocksdb.block.cache.hit
- rocksdb.block.cache.data.hit
- rocksdb.block.cache.bytes.read

The db_bench statistics before and after the change are:

|1GB block read|Children      |Self  |Command          |Shared Object        |Symbol|
|---|---|---|---|---|---|
|master:     |4.22%     |1.31%  |db_bench  |db_bench  |[.] rocksdb::StatisticsImpl::recordTick|
|updated:    |0.51%     |0.21%  |db_bench  |db_bench  |[.] rocksdb::StatisticsImpl::recordTick|
|     	     |0.14%     |0.14%  |db_bench  |db_bench  |[.] rocksdb::GetContext::record_counters|

|1MB block read|Children      |Self  |Command          |Shared Object        |Symbol|
|---|---|---|---|---|---|
|master:    |3.48%     |1.08%  |db_bench  |db_bench  |[.] rocksdb::StatisticsImpl::recordTick|
|updated:    |0.80%     |0.31%  |db_bench  |db_bench  |[.] rocksdb::StatisticsImpl::recordTick|
|    	     |0.35%     |0.35%  |db_bench  |db_bench  |[.] rocksdb::GetContext::record_counters|
Closes https://github.com/facebook/rocksdb/pull/3172

Differential Revision: D6330532

Pulled By: miasantreble

fbshipit-source-id: 2b492959e00a3db29e9437ecdcc5e48ca4ec5741
2017-12-12 21:11:33 -08:00
Siying Dong a9c8d4ef15 Fix memory issue introduced by 2f1a3a4d74
Summary: Closes https://github.com/facebook/rocksdb/pull/3256

Differential Revision: D6541714

Pulled By: siying

fbshipit-source-id: 40efd89b68587a9d58cfe6f4eebd771c2d9f1542
2017-12-11 18:27:28 -08:00
Siying Dong 2f1a3a4d74 Refactor ReadBlockContents()
Summary:
Divide ReadBlockContents() to multiple sub-functions. Maintaining the input and intermediate data in a new class BlockFetcher.
I hope in general it makes the code easier to maintain.
Another motivation to do it is to clearly divide the logic before file reading and after file reading. The refactor will help us evaluate how can we make I/O async in the future.
Closes https://github.com/facebook/rocksdb/pull/3244

Differential Revision: D6520983

Pulled By: siying

fbshipit-source-id: 338d90bc0338472d46be7a7682028dc9114b12e9
2017-12-11 15:27:32 -08:00
Andrew Kryczka c85f8ccca3 convert null terminator in ascii dump
Summary:
The ASCII output is almost always useless to me as the first '\0' byte in the key or value causes it to stop printing. Since all characters are already surrounded by spaces, "\ 0" (how we display a backslash followed by a zero) and "\0" (how this PR displays a null terminator) are distinguishable. My assumption is the value of seeing all the bytes outweighs the value of the alignment we had before, where we always had one character followed by one space.
Closes https://github.com/facebook/rocksdb/pull/3203

Differential Revision: D6428651

Pulled By: ajkr

fbshipit-source-id: aafc978a51e9ea029cfe3e763e2bb0e1751b9ccf
2017-11-28 17:28:58 -08:00
Maysam Yabandeh ab0542f5ec Fix for when block.cache_handle is nullptr
Summary:
When using with compressed cache it is possible that the status is ok but the block is not actually added to the block cache. The patch takes this case into account.
Closes https://github.com/facebook/rocksdb/pull/2945

Differential Revision: D5937613

Pulled By: maysamyabandeh

fbshipit-source-id: 5428cf1115e5046b3d01ab78d26cb181122af4c6
2017-09-29 07:56:55 -07:00
Andrew Kryczka c2f6e45aa3 prevent nullptr dereference in table reader error case
Summary:
A user encountered segfault on the call to `CacheDependencies()`, probably because `NewIndexIterator()` failed before populating `*index_entry`. Let's avoid the call in that case.
Closes https://github.com/facebook/rocksdb/pull/2939

Differential Revision: D5928611

Pulled By: ajkr

fbshipit-source-id: 484be453dbb00e5e160e9c6a1bc933df7d80f574
2017-09-28 00:12:34 -07:00
Andrew Kryczka 19cc66dc4f fix clang bug in block-based table reader
Summary:
This is the warning that clang considers a bug and has been causing it to fail:

```
table/block_based_table_reader.cc:240:27: warning: Potential leak of memory pointed to by 'block.value'
    for (; biter.Valid(); biter.Next()) {
                          ^~~~~
```

Actually clang just doesn't have enough knowledge to statically determine it's safe. We can teach it using an assert.
Closes https://github.com/facebook/rocksdb/pull/2779

Differential Revision: D5691225

Pulled By: ajkr

fbshipit-source-id: 3f0d545bf44636953b30ee5243c63239e8f16d8e
2017-08-23 15:12:05 -07:00
Maysam Yabandeh 1dfcdb15f9 Extend pin_l0 to filter partitions
Summary:
This is the continuation of https://github.com/facebook/rocksdb/pull/2661 for filter partitions. When pin_l0 is set (along with cache_xxx), then open table open the filter partitions are loaded into the cache and pinned there.
Closes https://github.com/facebook/rocksdb/pull/2766

Differential Revision: D5671098

Pulled By: maysamyabandeh

fbshipit-source-id: 174f24018f1d7f1129621e7380287b65b67d2115
2017-08-23 07:56:08 -07:00
Maysam Yabandeh 1efc600ddf Preload l0 index partitions
Summary:
This fixes the existing logic for pinning l0 index partitions. The patch preloads the partitions into block cache and pin them if they belong to level 0 and pin_l0 is set.

The drawback is that it does many small IOs when preloading all the partitions into the cache is direct io is enabled. Working for a solution for that.
Closes https://github.com/facebook/rocksdb/pull/2661

Differential Revision: D5554010

Pulled By: maysamyabandeh

fbshipit-source-id: 1e6f32a3524d71355c77d4138516dcfb601ca7b2
2017-08-18 10:56:20 -07:00
Siying Dong 666a005f9b Support prefetch last 512KB with direct I/O in block based file reader
Summary:
Right now, if direct I/O is enabled, prefetching the last 512KB cannot be applied, except compaction inputs or readahead is enabled for iterators. This can create a lot of I/O for HDD cases. To solve the problem, the 512KB is prefetched in block based table if direct I/O is enabled. The prefetched buffer is passed in totegher with random access file reader, so that we try to read from the buffer before reading from the file. This can be extended in the future to support flexible user iterator readahead too.
Closes https://github.com/facebook/rocksdb/pull/2708

Differential Revision: D5593091

Pulled By: siying

fbshipit-source-id: ee36ff6d8af11c312a2622272b21957a7b5c81e7
2017-08-11 12:16:45 -07:00
Aaron G 7848f0b24c add VerifyChecksum() to db.h
Summary:
We need a tool to check any sst file corruption in the db.
It will check all the sst files in current version and read all the blocks (data, meta, index) with checksum verification. If any verification fails, the function will return non-OK status.
Closes https://github.com/facebook/rocksdb/pull/2498

Differential Revision: D5324269

Pulled By: lightmark

fbshipit-source-id: 6f8a272008b722402a772acfc804524c9d1a483b
2017-08-09 15:58:13 -07:00
Andrew Kryczka 710411aea6 fix asan/valgrind for TableCache cleanup
Summary:
Breaking commit: d12691b86f

In the above commit, I moved the `TableCache` cleanup logic from `Version` destructor into `PurgeObsoleteFiles`. I missed cleaning up `TableCache` entries for the current `Version` during DB destruction.

This PR adds that logic to `VersionSet` destructor. One unfortunate side effect is now we're potentially deleting `TableReader`s after `column_family_set_.reset()`, which means we can't call `BlockBasedTableReader::Close` a second time as the block cache might already be destroyed.
Closes https://github.com/facebook/rocksdb/pull/2662

Differential Revision: D5515108

Pulled By: ajkr

fbshipit-source-id: 2cb820e19aa813e0d258d17f76b2d7b6b7ee0b18
2017-07-27 20:28:04 -07:00
Aaron Gao 8f553d3c52 remove unnecessary internal_comparator param in newIterator
Summary:
solved https://github.com/facebook/rocksdb/issues/2604
Closes https://github.com/facebook/rocksdb/pull/2648

Differential Revision: D5504875

Pulled By: lightmark

fbshipit-source-id: c14bb62ccbdc9e7bda9cd914cae4ea0765d882ee
2017-07-27 14:30:42 -07:00
Sagar Vemuri 72502cf227 Revert "comment out unused parameters"
Summary:
This reverts the previous commit 1d7048c598, which broke the build.

Did a `git revert 1d7048c`.
Closes https://github.com/facebook/rocksdb/pull/2627

Differential Revision: D5476473

Pulled By: sagar0

fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06
2017-07-21 18:26:26 -07:00
Victor Gao 1d7048c598 comment out unused parameters
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.

Reviewed By: igorsugak

Differential Revision: D5454343

fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
2017-07-21 14:57:44 -07:00
Siying Dong 3c327ac2d0 Change RocksDB License
Summary: Closes https://github.com/facebook/rocksdb/pull/2589

Differential Revision: D5431502

Pulled By: siying

fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75
2017-07-15 16:11:23 -07:00
Maysam Yabandeh 6f4154d693 record index partition properties
Summary:
When Partitioning index/filter is enabled the user might need to check the index block size as well as the top-level index size via sst_dump. This patch records i) number of partitions, ii) top-level index size and make it accessible through sst_dump. The number of partitions for filters is the same as that of indexes. The top-level index for filters has a similar size to top-level index for indexes, so it is not repeated.
Closes https://github.com/facebook/rocksdb/pull/2437

Differential Revision: D5224225

Pulled By: maysamyabandeh

fbshipit-source-id: 5324598c75793523aef1bb7ee225a5475e95a9cb
2017-06-13 11:21:32 -07:00
Siying Dong 9bbba4fec1 Remoe unused BlockBasedTable::compaction_optimized_
Summary:
BlockBasedTable::compaction_optimized_ is never used but can cause TSAN warning. Remove it.
Closes https://github.com/facebook/rocksdb/pull/2324

Differential Revision: D5085533

Pulled By: siying

fbshipit-source-id: 2feefce6806d559dfb4ab2989aa3db36752fe25d
2017-05-18 06:41:23 -07:00
hyunwoo 0ebdd70579 fixed typo
Summary:
fixed typo
Closes https://github.com/facebook/rocksdb/pull/2312

Differential Revision: D5079631

Pulled By: sagar0

fbshipit-source-id: e4c8d1d89b244ee69e9dea1dd013227cc5241026
2017-05-17 16:41:49 -07:00
Islam AbdelRahman 4897eb250b dont skip IO for filter blocks
Summary:
Based on my experience with linkbench, We should not skip loading bloom filter blocks when they are not available in block cache when using Iterator::Seek

Actually I am not sure why this behavior existed in the first place
Closes https://github.com/facebook/rocksdb/pull/2255

Differential Revision: D5010721

Pulled By: maysamyabandeh

fbshipit-source-id: 0af545a06ac4baeecb248706ec34d009c2480ca4
2017-05-09 09:52:02 -07:00
Maysam Yabandeh 40af2381ec Object lifetime in cache
Summary:
Any non-raw-data dependent object must be destructed before the table
    closes. There was a bug of not doing that for filter object. This patch
    fixes the bug and adds a unit test to prevent such bugs in future.
Closes https://github.com/facebook/rocksdb/pull/2246

Differential Revision: D5001318

Pulled By: maysamyabandeh

fbshipit-source-id: 6d8772e58765485868094b92964da82ef9730b6d
2017-05-05 23:20:01 -07:00
Aaron Gao a30a696034 do not read next datablock if upperbound is reached
Summary:
Now if we have iterate_upper_bound set, we continue read until get a key >= upper_bound. For a lot of cases that neighboring data blocks have a user key gap between them, our index key will be a user key in the middle to get a shorter size. For example, if we have blocks:
[a b c d][f g h]
Then the index key for the first block will be 'e'.
then if upper bound is any key between 'd' and 'e', for example, d1, d2, ..., d99999999999, we don't have to read the second block and also know that we have done our iteration by reaching the last key that smaller the upper bound already.

This diff can reduce RA in most cases.
Closes https://github.com/facebook/rocksdb/pull/2239

Differential Revision: D4990693

Pulled By: lightmark

fbshipit-source-id: ab30ea2e3c6edf3fddd5efed3c34fcf7739827ff
2017-05-05 23:20:01 -07:00
Maysam Yabandeh 6798d1f3be Revert "Delete filter before closing the table"
Summary:
This reverts commit 89833577a8.
Closes https://github.com/facebook/rocksdb/pull/2240

Differential Revision: D4986982

Pulled By: maysamyabandeh

fbshipit-source-id: 56c4c07b7b5b7c6fe122d5c2f2199d221c8510c0
2017-05-02 13:46:39 -07:00
Maysam Yabandeh 89833577a8 Delete filter before closing the table
Summary:
Some filters such as partitioned filter have pointers to the table for which they are created. Therefore is they are stored in the block cache, the should be forcibly erased from block cache before closing the  table, which would result into deleting the object. Otherwise the destructor will be called later when the cache is lazily erasing the object, which having the parent table no longer existent it could result into undefined behavior.

Update: there will be still cases the filter is not removed from the cache since the table has not kept a pointer to the cache handle to be able to forcibly release it later. We make sure that the filter destructor does not access the table pointer to get around such cases.
Closes https://github.com/facebook/rocksdb/pull/2207

Differential Revision: D4941591

Pulled By: maysamyabandeh

fbshipit-source-id: 56fbab2a11cf447e1aa67caa30b58d7bd7ce5bbd
2017-05-01 19:19:37 -07:00
Siying Dong d616ebea23 Add GPLv2 as an alternative license.
Summary: Closes https://github.com/facebook/rocksdb/pull/2226

Differential Revision: D4967547

Pulled By: siying

fbshipit-source-id: dd3b58ae1e7a106ab6bb6f37ab5c88575b125ab4
2017-04-27 18:06:12 -07:00
Aaron Gao 6e8d6f429d readahead backwards from sst end
Summary:
prefetch some data from the end of the file for each compaction to reduce IO.
Closes https://github.com/facebook/rocksdb/pull/2149

Differential Revision: D4880576

Pulled By: lightmark

fbshipit-source-id: aa767cd1afc84c541837fbf1ad6c0d45b34d3932
2017-04-14 18:56:14 -07:00
Siying Dong d2dce5611a Move some files under util/ to separate dirs
Summary:
Move some files under util/ to new directories env/, monitoring/ options/ and cache/
Closes https://github.com/facebook/rocksdb/pull/2090

Differential Revision: D4833681

Pulled By: siying

fbshipit-source-id: 2fd8bef
2017-04-05 19:09:16 -07:00
Maysam Yabandeh 34a70859bc Fix segmentation fault caused by #1961
Summary:
Fixes #1961 which causes a segfault when filter_policy is nullptr and both
pin_l0_filter_and_index_blocks_in_cache/cache_index_and_filter_blocks
are set.
Closes https://github.com/facebook/rocksdb/pull/2029

Differential Revision: D4764862

Pulled By: maysamyabandeh

fbshipit-source-id: 05bd695
2017-03-24 17:24:11 -07:00
Maysam Yabandeh 8b0097b49b Readers for partition filter
Summary:
This is the last split of this pull request: https://github.com/facebook/rocksdb/pull/1891 which includes the reader part as well as the tests.
Closes https://github.com/facebook/rocksdb/pull/1961

Differential Revision: D4672216

Pulled By: maysamyabandeh

fbshipit-source-id: 6a2b829
2017-03-22 09:24:15 -07:00
Islam AbdelRahman e19163688b Add macros to include file name and line number during Logging
Summary:
current logging
```
2017/03/14-14:20:30.393432 7fedde9f5700 (Original Log Time 2017/03/14-14:20:30.393414) [default] Level summary: base level 1 max bytes base 268435456 files[1 0 0 0 0 0 0] max score 0.25
2017/03/14-14:20:30.393438 7fedde9f5700 [JOB 2] Try to delete WAL files size 61417909, prev total WAL file size 73820858, number of live WAL files 2.
2017/03/14-14:20:30.393464 7fedde9f5700 [DEBUG] [JOB 2] Delete /dev/shm/old_logging//MANIFEST-000001 type=3 #1 -- OK
2017/03/14-14:20:30.393472 7fedde9f5700 [DEBUG] [JOB 2] Delete /dev/shm/old_logging//000003.log type=0 #3 -- OK
2017/03/14-14:20:31.427103 7fedd49f1700 [default] New memtable created with log file: #9. Immutable memtables: 0.
2017/03/14-14:20:31.427179 7fedde9f5700 [JOB 3] Syncing log #6
2017/03/14-14:20:31.427190 7fedde9f5700 (Original Log Time 2017/03/14-14:20:31.427170) Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots allowed 1, compaction slots scheduled 1
2017/03/14-14:20:31.
Closes https://github.com/facebook/rocksdb/pull/1990

Differential Revision: D4708695

Pulled By: IslamAbdelRahman

fbshipit-source-id: cb8968f
2017-03-15 19:39:12 -07:00
Maysam Yabandeh 11526252cc Pinnableslice (2nd attempt)
Summary:
PinnableSlice

    Summary:
    Currently the point lookup values are copied to a string provided by the
    user. This incures an extra memcpy cost. This patch allows doing point lookup
    via a PinnableSlice which pins the source memory location (instead of
    copying their content) and releases them after the content is consumed
    by the user. The old API of Get(string) is translated to the new API
    underneath.

    Here is the summary for improvements:

    value 100 byte: 1.8% regular, 1.2% merge values
    value 1k byte: 11.5% regular, 7.5% merge values
    value 10k byte: 26% regular, 29.9% merge values
    The improvement for merge could be more if we extend this approach to
    pin the merge output and delay the full merge operation until the user
    actually needs it. We have put that for future work.

    PS:
    Sometimes we observe a small decrease in performance when switching from
    t5452014 to this patch but with the old Get(string) API. The d
Closes https://github.com/facebook/rocksdb/pull/1756

Differential Revision: D4391738

Pulled By: maysamyabandeh

fbshipit-source-id: 6f3edd3
2017-03-13 11:54:10 -07:00
Maysam Yabandeh a2f7a514d1 Refactoring
Summary:
This is the first split of https://github.com/facebook/rocksdb/pull/1891 and will be needed for the upcoming partitioned filter patch.
Closes https://github.com/facebook/rocksdb/pull/1949

Differential Revision: D4652152

Pulled By: maysamyabandeh

fbshipit-source-id: 9801778
2017-03-03 18:24:12 -08:00
Maysam Yabandeh 69d5262c81 Two-level Indexes
Summary:
Partition Index blocks and use a Partition-index as a 2nd level index.

The two-level index can be used by setting
BlockBasedTableOptions::kTwoLevelIndexSearch as the index type and
configuring BlockBasedTableOptions::index_per_partition

t15539501
Closes https://github.com/facebook/rocksdb/pull/1814

Differential Revision: D4473535

Pulled By: maysamyabandeh

fbshipit-source-id: bffb87e
2017-02-06 16:39:12 -08:00
Andrew Kryczka b797e42157 Dump compression dictionary meta-block
Summary:
make sst_dump print size/contents of the dictionary meta-block for easier debugging
Closes https://github.com/facebook/rocksdb/pull/1837

Differential Revision: D4506399

Pulled By: ajkr

fbshipit-source-id: b9bf668
2017-02-03 12:39:16 -08:00
Andrew Kryczka 3b35134e4b Avoid cache lookups for range deletion meta-block
Summary:
I added the Cache::Ref() function a couple weeks ago (#1761) to make this feature possible. Like other meta-blocks, rep_->range_del_entry holds a cache handle to pin the range deletion block in uncompressed block cache for the duration of the table reader's lifetime. We can reuse this cache handle to create an iterator over this meta-block without any cache lookup. Ref() is used to increment the cache handle's refcount in case the returned iterator outlives the table reader.
Closes https://github.com/facebook/rocksdb/pull/1801

Differential Revision: D4458782

Pulled By: ajkr

fbshipit-source-id: 2883f10
2017-01-26 11:24:13 -08:00
Maysam Yabandeh d0ba8ec8f9 Revert "PinnableSlice"
Summary:
This reverts commit 54d94e9c2c.

The pull request was landed by mistake.
Closes https://github.com/facebook/rocksdb/pull/1755

Differential Revision: D4391678

Pulled By: maysamyabandeh

fbshipit-source-id: 36d5149
2017-01-08 14:24:12 -08:00
Maysam Yabandeh 54d94e9c2c PinnableSlice
Summary:
Currently the point lookup values are copied to a string provided by the user.
This incures an extra memcpy cost. This patch allows doing point lookup
via a PinnableSlice which pins the source memory location (instead of
copying their content) and releases them after the content is consumed
by the user. The old API of Get(string) is translated to the new API
underneath.

 Here is the summary for improvements:
 1. value 100 byte: 1.8%  regular, 1.2% merge values
 2. value 1k   byte: 11.5% regular, 7.5% merge values
 3. value 10k byte: 26% regular,    29.9% merge values

 The improvement for merge could be more if we extend this approach to
 pin the merge output and delay the full merge operation until the user
 actually needs it. We have put that for future work.

PS:
Sometimes we observe a small decrease in performance when switching from
t5452014 to this patch but with the old Get(string) API. The difference
is a little and could be noise. More importantly it is safely
cancelled
Closes https://github.com/facebook/rocksdb/pull/1732

Differential Revision: D4374613

Pulled By: maysamyabandeh

fbshipit-source-id: a077f1a
2017-01-08 13:54:13 -08:00
Maysam Yabandeh 0712d541d1 Delegate Cleanables
Summary:
Cleanable objects will perform the registered cleanups when
they are destructed. We however rather to delay this cleaning like when
we are gathering the merge operands. Current approach is to create the
Cleanable object on heap (instead of on stack) and delay deleting it.

By allowing Cleanables to delegate their cleanups to another cleanable
object we can delay the cleaning without however the need to craete the
cleanable object on heap and keeping it around. This patch applies this
technique for the cleanups of BlockIter and shows improved performance
for some in-memory benchmarks:
+1.8% for merge worklaod, +6.4% for non-merge workload when the merge
operator is specified.
https://our.intern.facebook.com/intern/tasks?t=15168163

Non-merge benchmark:
TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks=fillrandom
--num=1000000 -value_size=100 -compression_type=none

Reading random with no merge operator specified:
TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench
--benchmarks="read
Closes https://github.com/facebook/rocksdb/pull/1711

Differential Revision: D4361163

Pulled By: maysamyabandeh

fbshipit-source-id: 9801e07
2016-12-29 15:54:19 -08:00
Andrew Kryczka fd43ee09da Range deletion microoptimizations
Summary:
- Made RangeDelAggregator's InternalKeyComparator member a reference-to-const so we don't need to copy-construct it. Also added InternalKeyComparator to ImmutableCFOptions so we don't need to construct one for each DBIter.
- Made MemTable::NewRangeTombstoneIterator and the table readers' NewRangeTombstoneIterator() functions return nullptr instead of NewEmptyInternalIterator to avoid the allocation. Updated callers accordingly.
Closes https://github.com/facebook/rocksdb/pull/1548

Differential Revision: D4208169

Pulled By: ajkr

fbshipit-source-id: 2fd65cf
2016-11-21 12:24:13 -08:00
Andrew Kryczka 327085b7b2 fix valgrind
Summary: Closes https://github.com/facebook/rocksdb/pull/1526

Differential Revision: D4191257

Pulled By: ajkr

fbshipit-source-id: d09dc76
2016-11-16 12:09:11 -08:00
Andrew Kryczka 307a4e80c8 sst_dump support for range deletion
Summary:
Change DumpTable() so we can see the range deletion meta-block.
Closes https://github.com/facebook/rocksdb/pull/1505

Differential Revision: D4172227

Pulled By: ajkr

fbshipit-source-id: ae35665
2016-11-12 09:39:23 -08:00
Andrew Kryczka 815f54afad Insert range deletion meta-block into block cache
Summary:
This handles two issues: (1) range deletion iterator sometimes outlives
the table reader that created it, in which case the block must not be destroyed
during table reader destruction; and (2) we prefer to read these range tombstone
meta-blocks from file fewer times.

- Extracted cache-populating logic from NewDataBlockIterator() into a separate function: MaybeLoadDataBlockToCache()
- Use MaybeLoadDataBlockToCache() to load range deletion meta-block and pin it through the reader's lifetime. This code reuse works since range deletion meta-block has same format as data blocks.
- Use NewDataBlockIterator() to create range deletion iterators, which uses block cache if enabled, otherwise reads the block from file. Either way, the underlying block won't disappear until after the iterator is destroyed.
Closes https://github.com/facebook/rocksdb/pull/1459

Differential Revision: D4123175

Pulled By: ajkr

fbshipit-source-id: 8f64281
2016-11-05 09:24:26 -07:00
Islam AbdelRahman b88f8e87c5 Support SST files with Global sequence numbers [reland]
Summary:
reland https://reviews.facebook.net/D62523

- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file

Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks

Test Plan: unit tests

Reviewers: sdong, yhchiang

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65211
2016-10-18 16:59:37 -07:00
Yi Wu 991b585ee0 More block cache tickers
Summary: Adding several missing block cache tickers.

Test Plan:
  make all check

Reviewers: IslamAbdelRahman, yhchiang, lightmark

Reviewed By: lightmark

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64881
2016-10-11 11:59:05 -07:00
Islam AbdelRahman d062328977 Revert "Support SST files with Global sequence numbers"
This reverts commit ab01da5437.
2016-10-07 14:05:12 -07:00
Islam AbdelRahman ab01da5437 Support SST files with Global sequence numbers
Summary:
- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file

Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks

Test Plan: unit tests

Reviewers: andrewkr, yhchiang, yiwu, sdong

Reviewed By: sdong

Subscribers: hcz, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62523
2016-10-03 16:12:39 -07:00
Islam AbdelRahman 1cca091298 Temporarily revert Prev() prefix support
Summary:
Temporarily revert commits for supporting prefix Prev() to unblock MyRocks and RocksDB release

These are the commits reverted

  - 6a14d55bd9
  - b18f9c9eac
  - db74b1a219
  - 2482d5fb45

Test Plan: make check -j64

Reviewers: sdong, lightmark

Reviewed By: lightmark

Subscribers: andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D63789
2016-09-08 14:45:32 -07:00
Aaron Gao 4590b53a4b add stats to Cache::LookUp()
Summary: basically for SimCache stats. I find most times it is hard to pass Statistics* to SimCache constructor.

Test Plan: make all check

Reviewers: andrewkr, sdong, yiwu

Reviewed By: yiwu

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62193
2016-09-01 13:50:39 -07:00
Aaron Gao b18f9c9eac add nullptr check to internal_prefix_transform
Summary: patch for D62361

Test Plan: make all check

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62883
2016-08-30 13:48:31 -07:00
Islam AbdelRahman b49b92cf28 Introduce Read amplification bitmap (read amp statistics)
Summary:
Add ReadOptions::read_amp_bytes_per_bit option which allow us to create a bitmap for every data block we read
the bitmap will contain (block_size / read_amp_bytes_per_bit) bits.

We will use this bitmap to mark which bytes have been used of the block so we can calculate the read amplification

Test Plan: added new tests

Reviewers: andrewkr, yhchiang, sdong

Reviewed By: sdong

Subscribers: yiwu, leveldb, march, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58707
2016-08-26 18:55:58 -07:00
Aaron Gao c7004840d2 store prefix_extractor_name in table
Summary:
Make sure prefix extractor name is stored in SST files and if DB is opened with a prefix extractor of a different name, prefix bloom is skipped when read the file.
Also add unit tests for that.

Test Plan:
before change:
```
Note: Google Test filter = BlockBasedTableTest.SkipPrefixBloomFilter
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from BlockBasedTableTest
[ RUN      ] BlockBasedTableTest.SkipPrefixBloomFilter
table/table_test.cc:1421: Failure
Value of: db_iter->Valid()
Actual: false
Expected: true
[  FAILED  ] BlockBasedTableTest.SkipPrefixBloomFilter (1 ms)
[----------] 1 test from BlockBasedTableTest (1 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (1 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] BlockBasedTableTest.SkipPrefixBloomFilter

1 FAILED TEST
```
after:
```
Note: Google Test filter = BlockBasedTableTest.SkipPrefixBloomFilter
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from BlockBasedTableTest
[ RUN      ] BlockBasedTableTest.SkipPrefixBloomFilter
[       OK ] BlockBasedTableTest.SkipPrefixBloomFilter (0 ms)
[----------] 1 test from BlockBasedTableTest (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (0 ms total)
[  PASSED  ] 1 test.
```

Reviewers: sdong, andrewkr, yiwu, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61215
2016-08-26 11:46:32 -07:00
Aaron Gao cec2c6436b fix data race in NewIndexIterator() in block_based_table_reader.cc
Summary: fixed data race described in https://github.com/facebook/rocksdb/issues/1267 and add regression test

Test Plan:
./table_test --gtest_filter=BlockBasedTableTest.NewIndexIteratorLeak
make all check -j64
core dump before fix. ok after fix.

Reviewers: andrewkr, sdong

Reviewed By: sdong

Subscribers: igor, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62361
2016-08-23 18:20:41 -07:00
Yi Wu 4a16c32ece Option to cache index/filter blocks with priority
Summary:
Add option to block based table to insert index/filter blocks to block cache with priority. Combined with LRUCache with high_pri_pool_ratio, we can reserved space for index/filter blocks, make them less likely to be evicted.

Depends on D61977.

Test Plan: See unit test.

Reviewers: lightmark, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, march, leveldb

Differential Revision: https://reviews.facebook.net/D62241
2016-08-23 13:44:13 -07:00
Andrew Kryczka ecf9003860 Fix bug in printing values for block-based table
Summary: value is not an InternalKey, we do not need to decode it

Test Plan:
setup:

  $ ldb put --create_if_missing=true k v
  $ ldb put --db=./tmp --create_if_missing k v
  $ ldb compact --db=./tmp

before:

  $ sst_dump --command=raw --file=./tmp/000004.sst
  ...
  terminate called after throwing an instance of 'std::length_error'

after:

  $ ./sst_dump --command=raw --file=./tmp/000004.sst
  $ cat tmp/000004_dump.txt
  ...
  ASCII  k : v
  ...

Reviewers: sdong, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62301
2016-08-22 10:27:50 -07:00
Wanning Jiang 78837f5d61 TableBuilder / TableReader support for range deletion
Summary: 1. Range Deletion Tombstone structure 2. Modify Add() in table_builder to make it usable for adding range del tombstones 3. Expose NewTombstoneIterator() API in table_reader

Test Plan: table_test.cc (now BlockBasedTableBuilder::Add() only accepts InternalKey. I make table_test only pass InternalKey to BlockBasedTableBuidler. Also test writing/reading range deletion tombstones in table_test )

Reviewers: sdong, IslamAbdelRahman, lightmark, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61473
2016-08-19 15:10:31 -07:00
Philipp Unterbrunner deda159b55 Added min/max/avg data block size output to sst_dump
Summary: Added min/max/avg data block size output to sst_dump. Output was added to the end of BlockBasedTable::DumpDataBlocks, so it appears after the data block details, at the very end of the dump file.

Test Plan:
```
./db_bench --benchmarks=fillrandom
./sst_dump --file=/tmp/rocksdbtest-xyz/dbbench/000007.sst --command=raw
tail -n 6 /tmp/rocksdbtest-xyz/dbbench/000007_dump.txt
```
```
Data Block Summary:
--------------------------------------
  # data blocks: 11336
  min data block size: 903
  max data block size: 2268
  avg data block size: 2245.363356
```

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61815
2016-08-12 16:34:11 -07:00
Islam AbdelRahman b693ba68b5 Minor PinnedIteratorsManager Refactoring
Summary:
This diff include these simple change
- Rename ReleasePinnedIterators to ReleasePinnedData
- Rename PinIteratorIfNeeded to PinIterator
- Use std::vector directly in PinnedIteratorsManager instead of std::unique_ptr<std::vector>
- Generalize PinnedIteratorsManager by adding PinPtr which can pin any pointer

Test Plan: existing tests

Reviewers: sdong, yiwu, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61305
2016-08-11 11:54:17 -07:00
omegaga d51dc96a79 Experiments on column-aware encodings
Summary:
Experiments on column-aware encodings. Supported features: 1) extract data blocks from SST file and encode with specified encodings; 2) Decode encoded data back into row format; 3) Directly extract data blocks and write in row format (without prefix encoding); 4) Get column distribution statistics for column format; 5) Dump data blocks separated by columns in human-readable format.

There is still on-going work on this diff. More refactoring is necessary.

Test Plan: Wrote tests in `column_aware_encoding_test.cc`. More tests should be added.

Reviewers: sdong

Reviewed By: sdong

Subscribers: arahut, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60027
2016-08-01 14:50:19 -07:00
omegaga e70020e4f6 Only cache level 0 indexes and filter when opening table reader
Summary: In T8216281 we decided to disable prefetching the index and filter during opening table handlers during startup (max_open_files = -1).

Test Plan: Rely on `IndexAndFilterBlocksOfNewTableAddedToCache` to guarantee L0 indexes and filters are still cached and change `PinL0IndexAndFilterBlocksTest` to make sure other levels are not cached (maybe add one more test to test we don't cache other levels?)

Reviewers: sdong, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59913
2016-07-20 11:23:31 -07:00
Islam AbdelRahman 68a8e6b8fa Introduce FullMergeV2 (eliminate memcpy from merge operators)
Summary:
This diff update the code to pin the merge operator operands while the merge operation is done, so that we can eliminate the memcpy cost, to do that we need a new public API for FullMerge that replace the std::deque<std::string> with std::vector<Slice>

This diff is stacked on top of D56493 and D56511

In this diff we
- Update FullMergeV2 arguments to be encapsulated in MergeOperationInput and MergeOperationOutput which will make it easier to add new arguments in the future
- Replace std::deque<std::string> with std::vector<Slice> to pass operands
- Replace MergeContext std::deque with std::vector (based on a simple benchmark I ran https://gist.github.com/IslamAbdelRahman/78fc86c9ab9f52b1df791e58943fb187)
- Allow FullMergeV2 output to be an existing operand

```
[Everything in Memtable | 10K operands | 10 KB each | 1 operand per key]

DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=10000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000

[FullMergeV2]
readseq      :       0.607 micros/op 1648235 ops/sec; 16121.2 MB/s
readseq      :       0.478 micros/op 2091546 ops/sec; 20457.2 MB/s
readseq      :       0.252 micros/op 3972081 ops/sec; 38850.5 MB/s
readseq      :       0.237 micros/op 4218328 ops/sec; 41259.0 MB/s
readseq      :       0.247 micros/op 4043927 ops/sec; 39553.2 MB/s

[master]
readseq      :       3.935 micros/op 254140 ops/sec; 2485.7 MB/s
readseq      :       3.722 micros/op 268657 ops/sec; 2627.7 MB/s
readseq      :       3.149 micros/op 317605 ops/sec; 3106.5 MB/s
readseq      :       3.125 micros/op 320024 ops/sec; 3130.1 MB/s
readseq      :       4.075 micros/op 245374 ops/sec; 2400.0 MB/s
```

```
[Everything in Memtable | 10K operands | 10 KB each | 10 operand per key]

DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=1000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000

[FullMergeV2]
readseq      :       3.472 micros/op 288018 ops/sec; 2817.1 MB/s
readseq      :       2.304 micros/op 434027 ops/sec; 4245.2 MB/s
readseq      :       1.163 micros/op 859845 ops/sec; 8410.0 MB/s
readseq      :       1.192 micros/op 838926 ops/sec; 8205.4 MB/s
readseq      :       1.250 micros/op 800000 ops/sec; 7824.7 MB/s

[master]
readseq      :      24.025 micros/op 41623 ops/sec;  407.1 MB/s
readseq      :      18.489 micros/op 54086 ops/sec;  529.0 MB/s
readseq      :      18.693 micros/op 53495 ops/sec;  523.2 MB/s
readseq      :      23.621 micros/op 42335 ops/sec;  414.1 MB/s
readseq      :      18.775 micros/op 53262 ops/sec;  521.0 MB/s

```

```
[Everything in Block cache | 10K operands | 10 KB each | 1 operand per key]

[FullMergeV2]
$ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions
readseq      :      14.741 micros/op 67837 ops/sec;  663.5 MB/s
readseq      :       1.029 micros/op 971446 ops/sec; 9501.6 MB/s
readseq      :       0.974 micros/op 1026229 ops/sec; 10037.4 MB/s
readseq      :       0.965 micros/op 1036080 ops/sec; 10133.8 MB/s
readseq      :       0.943 micros/op 1060657 ops/sec; 10374.2 MB/s

[master]
readseq      :      16.735 micros/op 59755 ops/sec;  584.5 MB/s
readseq      :       3.029 micros/op 330151 ops/sec; 3229.2 MB/s
readseq      :       3.136 micros/op 318883 ops/sec; 3119.0 MB/s
readseq      :       3.065 micros/op 326245 ops/sec; 3191.0 MB/s
readseq      :       3.014 micros/op 331813 ops/sec; 3245.4 MB/s
```

```
[Everything in Block cache | 10K operands | 10 KB each | 10 operand per key]

DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10-operands-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions

[FullMergeV2]
readseq      :      24.325 micros/op 41109 ops/sec;  402.1 MB/s
readseq      :       1.470 micros/op 680272 ops/sec; 6653.7 MB/s
readseq      :       1.231 micros/op 812347 ops/sec; 7945.5 MB/s
readseq      :       1.091 micros/op 916590 ops/sec; 8965.1 MB/s
readseq      :       1.109 micros/op 901713 ops/sec; 8819.6 MB/s

[master]
readseq      :      27.257 micros/op 36687 ops/sec;  358.8 MB/s
readseq      :       4.443 micros/op 225073 ops/sec; 2201.4 MB/s
readseq      :       5.830 micros/op 171526 ops/sec; 1677.7 MB/s
readseq      :       4.173 micros/op 239635 ops/sec; 2343.8 MB/s
readseq      :       4.150 micros/op 240963 ops/sec; 2356.8 MB/s
```

Test Plan: COMPILE_WITH_ASAN=1 make check -j64

Reviewers: yhchiang, andrewkr, sdong

Reviewed By: sdong

Subscribers: lovro, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57075
2016-07-20 09:49:03 -07:00
John Alexander 9430333f84 New Statistics to track Compression/Decompression (#1197)
* Added new statistics and refactored to allow ioptions to be passed around as required to access environment and statistics pointers (and, as a convenient side effect, info_log pointer).

* Prevent incrementing compression counter when compression is turned off in options.

* Prevent incrementing compression counter when compression is turned off in options.

* Added two more supported compression types to test code in db_test.cc

* Prevent incrementing compression counter when compression is turned off in options.

* Added new StatsLevel that excludes compression timing.

* Fixed casting error in coding.h

* Fixed CompressionStatsTest for new StatsLevel.

* Removed unused variable that was breaking the Linux build
2016-07-19 09:44:03 -07:00
Yi Wu 296545a2c7 Fix clang analyzer errors
Summary:
Fixing erros reported by clang static analyzer.
* Removing some unused variables.
* Adding assertions to fix false positives reported by clang analyzer.
* Adding `__clang_analyzer__` macro to suppress false positive warnings.

Test Plan:
    USE_CLANG=1 OPT=-g make analyze -j64

Reviewers: andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60549
2016-07-08 17:50:51 -07:00
sdong 5009b5326b BlockBasedTable::FullFilterKeyMayMatch() Should skip prefix bloom if full key bloom exists
Summary: Currently, if users define both of full key bloom and prefix bloom in SST files. During Get(), if full key bloom shows the key may exist, we still go ahead and check prefix bloom. This is wasteful. If bloom filter for full keys exists, we should always ignore prefix bloom in Get().

Test Plan: Run existing tests

Reviewers: yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57825
2016-06-10 16:27:56 -07:00
Aaron Gao e532877940 Add statistics field to show total size of index and filter blocks in block cache
Summary: With `table_options.cache_index_and_filter_blocks = true`, index and filter blocks are stored in block cache. Then people are curious how much of the block cache total size is used by indexes and bloom filters. It will be nice we have a way to report that. It can help people tune performance and plan for optimized hardware setting. We add several enum values for db Statistics. BLOCK_CACHE_INDEX/FILTER_BYTES_INSERT - BLOCK_CACHE_INDEX/FILTER_BYTES_ERASE = current INDEX/FILTER total block size in bytes.

Test Plan:
write a test case called `DBBlockCacheTest.IndexAndFilterBlocksStats`. The result is:
```
[gzh@dev9927.prn1 ~/local/rocksdb]  make db_block_cache_test -j64 && ./db_block_cache_test --gtest_filter=DBBlockCacheTest.IndexAndFilterBlocksStats
Makefile:101: Warning: Compiling in debug mode. Don't use the resulting binary in production
  GEN      util/build_version.cc
  make: `db_block_cache_test' is up to date.
  Note: Google Test filter = DBBlockCacheTest.IndexAndFilterBlocksStats
  [==========] Running 1 test from 1 test case.
  [----------] Global test environment set-up.
  [----------] 1 test from DBBlockCacheTest
  [ RUN      ] DBBlockCacheTest.IndexAndFilterBlocksStats
  [       OK ] DBBlockCacheTest.IndexAndFilterBlocksStats (689 ms)
  [----------] 1 test from DBBlockCacheTest (689 ms total)

  [----------] Global test environment tear-down
  [==========] 1 test from 1 test case ran. (689 ms total)
  [  PASSED  ] 1 test.
```

Reviewers: IslamAbdelRahman, andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D58677
2016-06-03 10:47:47 -07:00
sdong 1d725ca51d Deprecate BlockBasedTableOptions.hash_index_allow_collision=false.
Summary: Deprecate this one option and delete code and tests that are now superfluous.

Test Plan: all tests pass

Reviewers: igor, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: msalib, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55317
2016-05-20 17:52:27 -07:00
krad a08c8c851a Added PersistentCache abstraction
Summary:
Added a new abstraction to cache page to RocksDB designed for the read
cache use.

RocksDB current block cache is more of an object cache. For the persistent read cache
project, what we need is a page cache equivalent. This changes adds a cache
abstraction to RocksDB to cache pages called PersistentCache. PersistentCache can cache
uncompressed pages or raw pages (content as in filesystem). The user can
choose to operate PersistentCache either in  COMPRESSED or UNCOMPRESSED mode.

Blame Rev:

Test Plan: Run unit tests

Reviewers: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55707
2016-05-15 22:17:18 -07:00
sdong 7ccb8d6ef3 BlockBasedTable::Get() not to use prefix bloom if read_options.total_order_seek = true
Summary: This is to provide a way for users to skip prefix bloom in point look-up.

Test Plan: Add a new unit test scenario.

Reviewers: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57747
2016-05-06 10:16:11 -07:00
Andrew Kryczka 843d2e3137 Shared dictionary compression using reference block
Summary:
This adds a new metablock containing a shared dictionary that is used
to compress all data blocks in the SST file. The size of the shared dictionary
is configurable in CompressionOptions and defaults to 0. It's currently only
used for zlib/lz4/lz4hc, but the block will be stored in the SST regardless of
the compression type if the user chooses a nonzero dictionary size.

During compaction, computes the dictionary by randomly sampling the first
output file in each subcompaction. It pre-computes the intervals to sample
by assuming the output file will have the maximum allowable length. In case
the file is smaller, some of the pre-computed sampling intervals can be beyond
end-of-file, in which case we skip over those samples and the dictionary will
be a bit smaller. After the dictionary is generated using the first file in a
subcompaction, it is loaded into the compression library before writing each
block in each subsequent file of that subcompaction.

On the read path, gets the dictionary from the metablock, if it exists. Then,
loads that dictionary into the compression library before reading each block.

Test Plan: new unit test

Reviewers: yhchiang, IslamAbdelRahman, cyan, sdong

Reviewed By: sdong

Subscribers: andrewkr, yoshinorim, kradhakrishnan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D52287
2016-04-27 17:36:03 -07:00
sdong 535af525d6 BlockBasedTable::PrefixMayMatch() to skip index checking if we can't find a filter block.
Summary:
In the case where we can't find a filter block, there is not much benefit of doing the binary search and see whether the index key has the prefix. With the change, we blindly return true if we can't get the filter.
It also fixes missing row cases for reverse comparator with full bloom.

Test Plan: Add a test case that used to fail.

Reviewers: yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: kradhakrishnan, yiwu, hermanlee4, yoshinorim, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56697
2016-04-13 19:06:48 -07:00
sdong dff4c48ede BlockBasedTable::PrefixMayMatch: no need to find data block after full bloom checking
Summary:
Full block checking should be a good enough indication of prefix existance. No need to further check data block.
This also fixes wrong results when using prefix bloom and reverse bitwise comparator.

Test Plan: Will add a unit test.

Reviewers: yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: hermanlee4, yoshinorim, yiwu, kradhakrishnan, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56625
2016-04-12 16:25:54 -07:00