Commit Graph

8176 Commits

Author SHA1 Message Date
Sagar Vemuri 47fd574829 Log file_creation_time table property (#5232)
Summary:
Log file_creation_time table property when a new table file is created.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5232

Differential Revision: D15033069

Pulled By: sagar0

fbshipit-source-id: aaac56a4c03a8f96c338cad1b0cdb7fbfb887647
2019-04-22 15:30:07 -07:00
Andrew Kryczka 8272a6de57 Optionally wait on bytes_per_sync to smooth I/O (#5183)
Summary:
The existing implementation does not guarantee bytes reach disk every `bytes_per_sync` when writing SST files, or every `wal_bytes_per_sync` when writing WALs. This can cause confusing behavior for users who enable this feature to avoid large syncs during flush and compaction, but then end up hitting them anyways.

My understanding of the existing behavior is we used `sync_file_range` with `SYNC_FILE_RANGE_WRITE` to submit ranges for async writeback, such that we could continue processing the next range of bytes while that I/O is happening. I believe we can preserve that benefit while also limiting how far the processing can get ahead of the I/O, which prevents huge syncs from happening when the file finishes.

Consider this `sync_file_range` usage: `sync_file_range(fd_, 0, static_cast<off_t>(offset + nbytes), SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE)`. Expanding the range to start at 0 and adding the `SYNC_FILE_RANGE_WAIT_BEFORE` flag causes any pending writeback (like from a previous call to `sync_file_range`) to finish before it proceeds to submit the latest `nbytes` for writeback. The latest `nbytes` are still written back asynchronously, unless processing exceeds I/O speed, in which case the following `sync_file_range` will need to wait on it.

There is a second change in this PR to use `fdatasync` when `sync_file_range` is unavailable (determined statically) or has some known problem with the underlying filesystem (determined dynamically).

The above two changes only apply when the user enables a new option, `strict_bytes_per_sync`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5183

Differential Revision: D14953553

Pulled By: siying

fbshipit-source-id: 445c3862e019fb7b470f9c7f314fc231b62706e9
2019-04-22 11:51:39 -07:00
Mike Kolupaev df38c1ce66 Add BlockBasedTableOptions::index_shortening (#5174)
Summary:
Introduce BlockBasedTableOptions::index_shortening to give users control on which key shortening techniques to be used in building index blocks. Before this patch, both separators and successor keys where shortened in indexes. With this patch, the default is set to kShortenSeparators to only shorten the separators. Since each index block has many separators and only one successor (last key), the change should not have negative impact on index block size. However it should prevent many unnecessary block loads where due to approximation introduced by shorted successor, seek would land us to the previous block and then fix it by moving to the next one.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5174

Differential Revision: D14884185

Pulled By: al13n321

fbshipit-source-id: 1b08bc8c03edcf09b6b8c16e9a7eea08ad4dd534
2019-04-22 08:20:35 -07:00
jsteemann de76909464 refactor SavePoints (#5192)
Summary:
Savepoints are assumed to be used in a stack-wise fashion (only
the top element should be used), so they were stored by `WriteBatch`
in a member variable `save_points` using an std::stack.

Conceptually this is fine, but the implementation had a few issues:
- the `save_points_` instance variable was a plain pointer to a heap-
  allocated `SavePoints` struct. The destructor of `WriteBatch` simply
  deletes this pointer. However, the copy constructor of WriteBatch
  just copied that pointer, meaning that copying a WriteBatch with
  active savepoints will very likely have crashed before. Now a proper
  copy of the savepoints is made in the copy constructor, and not just
  a copy of the pointer
- `save_points_` was an std::stack, which defaults to `std::deque` for
  the underlying container. A deque is a bit over the top here, as we
  only need access to the most recent savepoint (i.e. stack.top()) but
  never any elements at the front. std::deque is rather expensive to
  initialize in common environments. For example, the STL implementation
  shipped with GNU g++ will perform a heap allocation of more than 500
  bytes to create an empty deque object. Although the `save_points_`
  container is created lazily by RocksDB, moving from a deque to a plain
  `std::vector` is much more memory-efficient. So `save_points_` is now
  a vector.
- `save_points_` was changed from a plain pointer to an `std::unique_ptr`,
  making ownership more explicit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5192

Differential Revision: D15024074

Pulled By: maysamyabandeh

fbshipit-source-id: 5b128786d3789cde94e46465c9e91badd07a25d7
2019-04-19 20:33:04 -07:00
Sagar Vemuri dc64c2f5cc Fix history to not include some features in 6.1 (#5224)
Summary:
Fix HISTORY.md by removing a few items from 6.1.1 history as they did not make into the 6.1.fb branch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5224

Differential Revision: D15017030

Pulled By: sagar0

fbshipit-source-id: 090724d326d29168952e06dc1a5090c03fdd739e
2019-04-19 13:00:53 -07:00
Yanqin Jin c77aab584e Force read existing data during db repair (#5209)
Summary:
Setting read_opts.total_order_seek achieves this, even with a different prefix
extractor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5209

Differential Revision: D14980388

Pulled By: riversand963

fbshipit-source-id: 16527989a3d6b3e3ae8241c894d011326429d66e
2019-04-19 11:55:13 -07:00
anand76 5265c5709e Remove a couple of non-public includes from public header file (#5219)
Summary:
Cleanup a couple of stray includes left by #5011.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5219

Differential Revision: D15007244

Pulled By: anand1976

fbshipit-source-id: 15ca1d4f977b5b60e99df3bfb8fc3db217d19bdd
2019-04-19 11:10:33 -07:00
Siying Dong 7a73adda9c Add some "inline" annotation to DBIter functions (#5217)
Summary:
My compiler doesn't inline DBIter::Next() to arena wrapped iterator, even if it is a direct forward. Adding this annotation makes it inlined. It might not always work but inlinging this function to arena wrapped iterator always feels like the right decision.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5217

Differential Revision: D15004086

Pulled By: siying

fbshipit-source-id: a4cffd79c6fb092669a3a90633c9aa5e494f8a66
2019-04-19 10:38:43 -07:00
Sagar Vemuri efa948741c Use creation_time or mtime when file_creation_time=0 (#5184)
Summary:
We found an issue in Periodic Compactions (introduced in #5166) where files were not being picked up for compactions as all the SST files created with older versions of RocksDB have `file_creation_time` as 0. (Note that `file_creation_time` is a new table property introduced in #5166).

To address this, Periodic compactions now fall back to looking at the `creation_time` table property or the file's modification time (as given by the Env) when `file_creation_time` table property is found to be 0.

Here how the file's modification time (and, in turn, the file age) is computed now:
1. Use `file_creation_time` table property if it is > 0.
1. If not, then use `creation_time` table property if it is > 0.
1. If not, then use file's mtime stat metadata given by the underlying Env.
Don't consider the file at all for compaction if the modification time cannot be correctly determined based on the above conditions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5184

Differential Revision: D14907795

Pulled By: sagar0

fbshipit-source-id: 4bb2f3631f9a3e04470c674a1d13544584e1e56c
2019-04-18 22:39:34 -07:00
Zhongyi Xie 3bdce20e2b reorganize history.md to list unreleased changes separately
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5216

Differential Revision: D15003749

Pulled By: miasantreble

fbshipit-source-id: a52c264e694cd7c55813be33ee22b4f3046b545a
2019-04-18 14:55:57 -07:00
Siying Dong d6862b3f51 Make ReadRangeDelAggregator::ShouldDelete() more inline friendly (#5202)
Summary:
Reorganize the code so that no function call into ReadRangeDelAggregator is needed if there is no tomb range stone.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5202

Differential Revision: D14968155

Pulled By: siying

fbshipit-source-id: 0bd61911293c7a27b4e1b8d57c66d0c4ad6a6a5f
2019-04-18 12:27:25 -07:00
Siying Dong 01cfea6637 Some small code changes to improve Next() (#5200)
Summary:
Several small changes for Next():
1. Reducing branching by always update local_stats_.next_count_++ even if statistics is null. This should be faster than a branching.
2. Replacing ResetInternalKeysSkippedCounter() in Next() because the valid_ check is not needed in this case.
3. iter_->Valid() should always be true for non merge case. Remove this check.
4. Adding an inline annotation. It ends up with not picked up by my compiler, but it shouldn't hurt.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5200

Differential Revision: D15000391

Pulled By: siying

fbshipit-source-id: be97f61c708968234fb8e5cf272b5c2ac07dc4dd
2019-04-18 12:18:11 -07:00
Siying Dong 992dfc7811 Introduce InternalIteratorBase::NextAndGetResult() (#5197)
Summary:
In long scans, virtual function calls of Next(), Valid(), key() and value() are not trivial. By introducing NextAndGetResult(), Some of the Next(), Valid() and key() calls are consolidated into one virtual function call to reduce CPU.
Also did some inline tricks and add some "final" randomly in some functions. Even without the "final" annotation, most Next() calls are inlined with -O3, but sometimes with a final it is inlined by O2 too. It doesn't hurt to add those final annotations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5197

Differential Revision: D14945977

Pulled By: siying

fbshipit-source-id: 7003969f9a5f1d5717f0bda503b91d19ba75ed88
2019-04-18 11:12:39 -07:00
Fosco Marotto 6c2bf9e916 Add copyright headers per FB open-source checkup tool. (#5199)
Summary:
internal task: T35568575
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5199

Differential Revision: D14962794

Pulled By: gfosco

fbshipit-source-id: 93838ede6d0235eaecff90d200faed9a8515bbbe
2019-04-18 10:55:01 -07:00
Yanqin Jin 392f6d49e5 Fix a bug in GetOverlappingInputsRangeBinarySearch (#5211)
Summary:
As title.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5211

Differential Revision: D14992018

Pulled By: riversand963

fbshipit-source-id: b5720ea4742029e2fb47ff6d9f8d9de006db4ed4
2019-04-18 09:22:16 -07:00
JiYou 5b7e09bd6f VersionSet: optmize GetOverlappingInputsRangeBinarySearch (#4987)
Summary:
`GetOverlappingInputsRangeBinarySearch` firstly use binary search
to find a index in the given range `[begin, end]`. But after find
the index, then use linear search to find the `start_index` and
`end_index`. So the search process degraded to linear time.

Here optmize the search process with below changes:

- use `std::lower_bound` and `std::upper_bound` to get
  `lg(n)` search complexity.
- use uniformed lambda for search process.
- simplify process for `within_interval` true or false.
- remove function `ExtendFileRangeWithinInterval`
  and `ExtendFileRangeOverlappingInterval`.

Signed-off-by: JiYou <jiyou09@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4987

Differential Revision: D14984192

Pulled By: riversand963

fbshipit-source-id: fae4b8e59a21b7e350718d60cdc94dd55ac81e89
2019-04-17 18:15:20 -07:00
Zhongyi Xie 248b6b551e rename variable to avoid shadowing (#5204)
Summary:
this PR fixes the following compile warning:
```
db/memtable.cc: In member function ‘virtual void rocksdb::MemTableIterator::Seek(const rocksdb::Slice&)’:
db/memtable.cc:321:22: error: declaration of ‘user_key’ shadows a member of 'this' [-Werror=shadow]
       Slice user_key(ExtractUserKey(k));
                      ^
db/memtable.cc: In member function ‘virtual void rocksdb::MemTableIterator::SeekForPrev(const rocksdb::Slice&)’:
db/memtable.cc:338:22: error: declaration of ‘user_key’ shadows a member of 'this' [-Werror=shadow]
       Slice user_key(ExtractUserKey(k));
                      ^
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5204

Differential Revision: D14970160

Pulled By: miasantreble

fbshipit-source-id: 388eb089f90c4528cc6d615dd4607fb53ceac705
2019-04-17 10:15:05 -07:00
Zhongyi Xie baa5302447 Avoid double-compacting data in bottom level in manual compactions (#5138)
Summary:
Depending on the config, manual compaction (leveled compaction style) does following compactions:
L0->L1
L1->L2
...
Ln-1 -> Ln
Ln -> Ln
The final Ln -> Ln compaction is partly unnecessary as it recompacts all the files that were just generated by the Ln-1 -> Ln. We should avoid recompacting such files. This rule should be applied to Lmax only.
Resolves issue https://github.com/facebook/rocksdb/issues/4995
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5138

Differential Revision: D14940106

Pulled By: miasantreble

fbshipit-source-id: 8d3cf5507a17e76f3333cfd4bac5256d005636e5
2019-04-16 23:32:20 -07:00
Yanqin Jin d9280ff2d2 Add back NewEmptyIterator (#5203)
Summary:
#4905 removed the implementation of `NewEmptyIterator` but kept its
declaration in the public header. This breaks some systems that depend on
RocksDB if the systems use `NewEmptyIterator`. Therefore, add it back to fix. cc maysamyabandeh please remind me if I miss anything here. Thanks
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5203

Differential Revision: D14968382

Pulled By: riversand963

fbshipit-source-id: 5fb86e99c8cfaf9f7a9473cdb1355d7558ff6e01
2019-04-16 20:28:05 -07:00
Siying Dong beb44ec3eb WriteBufferManager's dummy entry size to block cache 1MB -> 256KB (#5175)
Summary:
Dummy cache size of 1MB is too large for small block sizes. Our GetDefaultCacheShardBits() use min_shard_size = 512L * 1024L to determine number of shards, so 1MB will excceeds the size of the whole shard and make the cache excceeds the budget.
Change it to 256KB accordingly.
There shouldn't be obvious performance impact, since inserting a cache entry every 256KB of memtable inserts is still infrequently enough.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5175

Differential Revision: D14954289

Pulled By: siying

fbshipit-source-id: 2c275255c1ac3992174e06529e44c55538325c94
2019-04-16 12:03:07 -07:00
yiwu-arbug f1239d5f10 Avoid per-key upper bound check in BlockBasedTableIterator (#5142)
Summary:
This is second attempt for #5101. Original commit message:
`BlockBasedTableIterator` avoid reading next block on `Next()` if it detects the iterator will be out of bound, by checking against index key. The optimization was added in #2239, and by the time it only check the bound per block. It seems later change make it a per-key check, which introduce unnecessary key comparisons.

This patch come with two fixes:

Fix 1: To optimize checking for bounds, we need comparing the bounds with index key as well. However BlockBasedTableIterator doesn't know whether its index iterator is internally using user keys or internal keys. The patch fixes that by extending InternalIterator with a user_key() function that is overridden by In IndexBlockIter.

Fix 2: In #5101 we return `IsOutOfBound()=true` when block index key is out of bound. But the index key can be larger than smallest key of the next file on the level. That file can be within upper bound and should not be filtered out.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5142

Differential Revision: D14907113

Pulled By: siying

fbshipit-source-id: ac95775c5b4e7b700f76ab43e39f45402c98fbfb
2019-04-16 11:37:47 -07:00
Vijay Nadimpalli 71a82a0abe Consolidating WAL creation which currently has duplicate logic in db_impl_write.cc and db_impl_open.cc (#5188)
Summary:
Right now, two separate pieces of code are used to create WAL files in DBImpl::Open function of db_impl_open.cc and DBImpl::SwitchMemtable function of db_impl_write.cc. This code change simply creates 1 function called DBImpl::CreateWAL in db_impl_open.cc which is used to replace existing WAL creation logic in DBImpl::Open and DBImpl::SwitchMemtable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5188

Differential Revision: D14942832

Pulled By: vjnadimpalli

fbshipit-source-id: d49230e04c36176015c8c1b422575872f92157fb
2019-04-15 18:51:04 -07:00
Yi Zhang 3e63e553b4 Fix MultiGet ASSERT bug when passing unsorted result (#5195)
Summary:
Found this when test driving the new MultiGet. If you pass unsorted result with sorted_result = false you'll trigger the ASSERT incorrect even though we'll sort down below.

I've also added simple test cover sorted_result=true/false scenario copied from MultiGetSimple.

anand1976
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5195

Differential Revision: D14935475

Pulled By: yizhang82

fbshipit-source-id: 1d2af5e3a003847d965066a16e3b19da68acf170
2019-04-15 11:35:21 -07:00
Yi Wu b70967aac7 db_bench: support seek to non-exist prefix (#5163)
Summary:
Add `--seek_missing_prefix` flag to db_bench to allow benchmarking seeking to non-existing prefix. Usage example:
```
./db_bench --db=/dev/shm/db_bench --use_existing_db=false --benchmarks=fillrandom --num=100000000 --prefix_size=9 --keys_per_prefix=10
./db_bench --db=/dev/shm/db_bench --use_existing_db=true --benchmarks=seekrandom --disable_auto_compactions=true --num=100000000 --prefix_size=9 --keys_per_prefix=10 --reads=1000 --prefix_same_as_start=true --seek_missing_prefix=true
```
Also adding `--total_order_seek` and `--prefix_same_as_start` flags.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5163

Differential Revision: D14935724

Pulled By: riversand963

fbshipit-source-id: 7c41023f007febe373eb1589861f215432a9e18a
2019-04-15 10:54:58 -07:00
Fosco Marotto b5cad5c986 Update history and version to 6.1.1 (#5171)
Summary:
Including latest fixes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5171

Differential Revision: D14875157

Pulled By: gfosco

fbshipit-source-id: 86ec7ee3553a9b25ab71ed98966ce08a16322e2c
2019-04-15 10:49:38 -07:00
jsteemann 8295d364e2 Improve transaction lock details (#5193)
Summary:
This branch contains two small improvements:
* Create `LockMap` entries using `std::make_shared`. This saves one heap allocation per LockMap entry but also locates the control block and the LockMap object closely together in memory, which can help with caching
* Reorder the members of `TrackedTrxInfo`, so that the resulting struct uses less memory (at least on 64bit systems)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5193

Differential Revision: D14934536

Pulled By: maysamyabandeh

fbshipit-source-id: f7b49812bb4b6029eef9d131e7cd56260df5b28e
2019-04-15 10:44:03 -07:00
anand76 29111e92b4 Add bounds check in FilePickerMultiGet::PrepareNextLevel() (#5189)
Summary:
Add bounds check when looping through empty levels in FilePickerMultiGet
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5189

Differential Revision: D14925334

Pulled By: anand1976

fbshipit-source-id: 65d53247cf443153e28ce2b8b753fa51c6ae4566
2019-04-12 18:05:09 -07:00
yiwu-arbug cca141ecf8 Fix crash with memtable prefix bloom and key out of prefix extractor domain (#5190)
Summary:
Before using prefix extractor `InDomain()` should be check. All uses in memtable.cc didn't check `InDomain()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5190

Differential Revision: D14923773

Pulled By: miasantreble

fbshipit-source-id: b3ad60bcca5f3a1a2b929a6eb34b0b7ba6326f04
2019-04-12 17:07:49 -07:00
Manuel Ung d655a3aab7 Remove extraneous call to TrackKey (#5173)
Summary:
In `PessimisticTransaction::TryLock`, we were calling `TrackKey` even when assume_tracked=true, which defeats the purpose of assume_tracked. Remove this.

For keys that are already tracked, TrackKey will actually bump some counters (num_reads/num_writes) which are consumed in `TransactionBaseImpl::GetTrackedKeysSinceSavePoint`, and this is used to determine which keys were tracked since the last savepoint. I believe this functionality should still work, since I think the user should not call GetForUpdate/Put(assume_tracked=true) across savepoints, and if they do, they should not expect the Put(assume_tracked=true) to show up as a tracked key in the second savepoint.

This is another 2-3% cpu improvement.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5173

Differential Revision: D14883809

Pulled By: lth

fbshipit-source-id: 7d09f0772da422384af0519773e310c22b0cbca3
2019-04-12 16:37:12 -07:00
Maysam Yabandeh fe642cbee6 WritePrepared: fix race condition in reading batch with duplicate keys (#5147)
Summary:
When ReadOption doesn't specify a snapshot, WritePrepared::Get used kMaxSequenceNumber to avoid the cost of creating a new snapshot object (that requires sync over db_mutex). This creates a race condition if it is reading from the writes of a transaction that had duplicate keys: each instance of duplicate key is inserted with a different sequence number and depending on the ordering the ::Get might skip the newer one and read the older one that is obsolete.
The patch fixes that by using last published seq as the snapshot sequence number. It also adds a check after the read is done to ensure that the max_evicted_seq has not advanced the aforementioned seq, which is a very unlikely event. If it did, then the read is not valid since the seq is not backed by an actually snapshot to let IsInSnapshot handle that properly when an overlapping commit is evicted from commit cache.
A unit  test is added to reproduce the race condition with duplicate keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5147

Differential Revision: D14758815

Pulled By: maysamyabandeh

fbshipit-source-id: a56915657132cf6ba5e3f5ea1b5d78c803407719
2019-04-12 14:40:41 -07:00
ableegoldman 1966a7c055 Expose JavaAPI for getting the filter policy of a BlockBasedTableConfig (#5186)
Summary:
I would like to be able to read out the current Filter that has been set (or not) for a BlockBasedTableConfig. Added one public method to BlockBasedTableConfig:

public Filter filterPolicy() {
    return filterPolicy;
}
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5186

Differential Revision: D14921415

Pulled By: siying

fbshipit-source-id: 2a63c8685480197862b49fc48916c757cd6daf95
2019-04-12 14:01:36 -07:00
Siying Dong 85b2bde3dd Still implement StatisticsImpl::measureTime() (#5181)
Summary:
Since Statistics::measureTime() is deprecated, StatisticsImpl::measureTime() is not implemented. We realized that users might have a wrapped Statistics implementation in which measureTime() is implemented as forwarded to StatisticsImpl, and causes assert failure. In order to make the change less intrusive, we implement StatisticsImpl::measureTime(). We will revisit whether we need to remove it after several releases.

Also, add a test to make sure that a Statistics implementation using the old interface still works.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5181

Differential Revision: D14907089

Pulled By: siying

fbshipit-source-id: 29b6202fd04e30ed6f6adcaeb1000e87f10d1e1a
2019-04-12 11:00:35 -07:00
Yanqin Jin 3189398c00 Fix bugs detected by clang analyzer (#5185)
Summary:
as titled. False positive included, fixed anyway to make the check
pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5185

Differential Revision: D14909384

Pulled By: riversand963

fbshipit-source-id: dc5177e72b1929ccfd6175a60e2cd7bdb9bd80f3
2019-04-12 10:45:56 -07:00
vijaynadimpalli f49e12b892 Added missing table properties in log (#5168)
Summary:
When a new SST file is created via flush or compaction, we dump out the table properties, however only a few table properties are logged. The change here is to log all the table properties
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5168

Differential Revision: D14876928

Pulled By: vjnadimpalli

fbshipit-source-id: 1aca42ad00f9f650761d39e187f8beeb8700149b
2019-04-11 14:33:49 -07:00
anand76 fefd4b98c5 Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.

Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency

The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.

Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).

Batch   Sizes

1        | 2        | 4         | 8      | 16  | 32

Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074        - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14        - MultiGet (w/ batching)

Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135

Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62

Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891

dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10  ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011

Differential Revision: D14348703

Pulled By: anand1976

fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
2019-04-11 14:28:26 -07:00
Siying Dong ed9f5e21aa Change OptimizeForPointLookup() and OptimizeForSmallDb() (#5165)
Summary:
Change the behavior of OptimizeForSmallDb() so that it is less likely to go out of memory.
Change the behavior of OptimizeForPointLookup() to take advantage of the new memtable whole key filter, and move away from prefix extractor as well as hash-based indexing, as they are prone to misuse.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5165

Differential Revision: D14880709

Pulled By: siying

fbshipit-source-id: 9af30e3c9e151eceea6d6b38701a58f1f9fb692d
2019-04-11 10:45:36 -07:00
Sagar Vemuri d3d20dcdca Periodic Compactions (#5166)
Summary:
Introducing Periodic Compactions.

This feature allows all the files in a CF to be periodically compacted. It could help in catching any corruptions that could creep into the DB proactively as every file is constantly getting re-compacted.  And also, of course, it helps to cleanup data older than certain threshold.

- Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
- This works across all levels.
- The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used).
- Compaction filters, if any, are invoked as usual.
- A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).

This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166

Differential Revision: D14884441

Pulled By: sagar0

fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
2019-04-10 19:31:18 -07:00
Manuel Ung ef0fc1b461 Reduce copies of LockInfo (#5172)
Summary:
The LockInfo struct is not easy to copy because it contains std::vector. Reduce copies by using move constructor and `unordered_map::emplace`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5172

Differential Revision: D14882053

Pulled By: lth

fbshipit-source-id: 93999ec6ab1a5841fb5115abb764b6c1831a6de1
2019-04-10 15:58:58 -07:00
jsteemann 313e877285 fix reading encrypted files beyond file boundaries (#5160)
Summary:
This fix should help reading from encrypted files if the file-to-be-read
is smaller than expected. For example, when using the encrypted env and
making it read a journal file of exactly 0 bytes size, the encrypted env
code crashes with SIGSEGV in its Decrypt function, as there is no check
if the read attempts to read over the file's boundaries (as specified
originally by the `dataSize` parameter).

The most important problem this patch addresses is however that there is
no size underlow check in `CTREncryptionProvider::CreateCipherStream`:

The stream to be read will be initialized to a size of always
`prefix.size() - (2 * blockSize)`. If the prefix however is smaller than
twice the block size, this will obviously assume a _very_ large stream
and read over the bounds. The patch adds a check here as follows:

    // If the prefix is smaller than twice the block size, we would below read a
    // very large chunk of the file (and very likely read over the bounds)
    assert(prefix.size() >= 2 * blockSize);
    if (prefix.size() < 2 * blockSize) {
      return Status::Corruption("Unable to read from file " + fname + ": read attempt would read beyond file bounds");
    }

so embedders can catch the error in their release builds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5160

Differential Revision: D14834633

Pulled By: sagar0

fbshipit-source-id: 47aa39a6db8977252cede054c7eb9a663b9a3484
2019-04-08 14:57:25 -07:00
Siying Dong 0bb555630f Consolidate hash function used for non-persistent data in a new function (#5155)
Summary:
Create new function NPHash64() and GetSliceNPHash64(), which are currently
implemented using murmurhash.
Replace the current direct call of murmurhash() to use the new functions
if the hash results are not used in on-disk format.
This will make it easier to try out or switch to alternative functions
in the uses where data format compatibility doesn't need to be considered.
This part shouldn't have any performance impact.

Also, the sharded cache hash function is changed to the new format, because
it falls into this categoery. It doesn't show visible performance impact
in db_bench results. CPU showed by perf is increased from about 0.2% to 0.4%
in an extreme benchmark setting (4KB blocks, no-compression, everything
cached in block cache). We've known that the current hash function used,
our own Hash() has serious hash quality problem. It can generate a lots of
conflicts with similar input. In this use case, it means extra lock contention
for reads from the same file. This slight CPU regression is worthy to me
to counter the potential bad performance with hot keys. And hopefully this
will get further improved in the future with a better hash function.

cache_test's condition is relaxed a little bit to. The new hash is slightly
more skewed in this use case, but I manually checked the data and see
the hash results are still in a reasonable range.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5155

Differential Revision: D14834821

Pulled By: siying

fbshipit-source-id: ec9a2c0a2f8ae4b54d08b13a5c2e9cc97aa80cb5
2019-04-08 13:32:06 -07:00
Yanqin Jin de00f28132 Refactor ExternalSSTFileTest (#5129)
Summary:
remove an unnecessary function `GenerateAndAddFileIngestBehind`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5129

Differential Revision: D14686710

Pulled By: riversand963

fbshipit-source-id: 5698ae63e10f8ef76c2da753bbb07a36024ac065
2019-04-08 11:16:34 -07:00
Sergei Glushchenko 39c6c5fc1b Expose DB methods to lock and unlock the WAL (#5146)
Summary:
Expose DB methods to lock and unlock the WAL.

These methods are intended to use by MyRocks in order to obtain WAL
coordinates in consistent way.

Usage scenario is following:

MySQL has performance_schema.log_status which provides information that
enables a backup tool to copy the required log files without locking for
the duration of copy. To populate this table MySQL does following:

1. Lock the binary log. Transactions are not allowed to commit now
2. Save the binary log coordinates
3. Walk through the storage engines and lock writes on each engine. For
   InnoDB, redo log is locked. For MyRocks, WAL should be locked.
4. Ask storage engines for their coordinates. InnoDB reports its current
   LSN and checkpoint LSN. MyRocks should report active WAL files names
   and sizes.
5. Release storage engine's locks
6. Unlock binary log

Backup tool will then use this information to copy InnoDB, RocksDB and
MySQL binary logs up to specified positions to end up with consistent DB
state after restore.

Currently, RocksDB allows to obtain the list of WAL files. Only missing
bit is the method to lock the writes to WAL files.

LockWAL method must flush the WAL in order for the reported size to be
accurate (GetSortedWALFiles is using file system stat call to return the
file size), also, since backup tool is going to copy the WAL, it is
better to be flushed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5146

Differential Revision: D14815447

Pulled By: maysamyabandeh

fbshipit-source-id: eec9535a6025229ed471119f19fe7b3d8ae888a3
2019-04-06 06:40:36 -07:00
Siying Dong 479c566771 Add final annotations to some cache functions (#5156)
Summary:
cache functions heavily use virtual functions.
Add some "final" annotations to give compilers more information
to optimize. The compiler doesn't seem to take advantage of it
though. But it doesn't hurt.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5156

Differential Revision: D14814837

Pulled By: siying

fbshipit-source-id: 4423f58eafc93f7dd3c5f04b02b5c993dba2ea94
2019-04-05 16:08:01 -07:00
Harry Wong 8d1e52165d Removed const fields in copyable classes (#5095)
Summary:
This fixed the compile error in Clang-8:
```
error: explicitly defaulted copy assignment operator is implicitly deleted [-Werror,-Wdefaulted-function-deleted]
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5095

Differential Revision: D14811961

Pulled By: riversand963

fbshipit-source-id: d935d1f85a4e8694dca10033fb5af92d8777eca0
2019-04-05 15:40:30 -07:00
Levi Tamasi 59ef2ba559 Evict the uncompression dictionary from the block cache upon table close (#5150)
Summary:
The uncompression dictionary object has a Statistics pointer that might
dangle if the database closed. This patch evicts the dictionary from the
block cache when a table is closed, similarly to how index and filter
readers are handled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5150

Differential Revision: D14782422

Pulled By: ltamasi

fbshipit-source-id: 0cec9336c742c479aa92206e04521767f1aa9622
2019-04-04 16:21:12 -07:00
Mike Kolupaev 306b9adfd8 Add missing methods to EnvWrapper, and more wrappers in Env.h (#5131)
Summary:
- Some newer methods of Env weren't wrapped in EnvWrapper. Fixed.
 - Added more wrapper classes similar to WritableFileWrapper: SequentialFileWrapper, RandomAccessFileWrapper, RandomRWFileWrapper, DirectoryWrapper, LoggerWrapper.
 - Moved the code around a bit, removed some unused friendships, added some comments.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5131

Differential Revision: D14738932

Pulled By: al13n321

fbshipit-source-id: 99a9b1af28f2c629e7b7501389fa920b5ce30218
2019-04-04 14:47:41 -07:00
Adam Simpkins c06c4c01c5 Fix many bugs in log statement arguments (#5089)
Summary:
Annotate all of the logging functions to inform the compiler that these
use printf-style formatting arguments.  This allows the compiler to emit
warnings if the format arguments are incorrect.

This also fixes many problems reported now that format string checking
is enabled.  Many of these are simply mix-ups in the argument type (e.g,
int vs uint64_t), but in several cases the wrong number of arguments
were being passed in which can cause the code to crash.

The primary motivation for this was to fix the log message in
`DBImpl::SwitchMemtable()` which caused a segfault due to an extra %s
format parameter with no argument supplied.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5089

Differential Revision: D14574795

Pulled By: simpkins

fbshipit-source-id: 0921b03f0743652bf4ae21e414ff54b3bb65422a
2019-04-04 12:12:11 -07:00
datonli f0edf9d575 #5145 , rename port/dirent.h to port/port_dirent.h to avoid compile err when use port dir as header dir output (#5152)
Summary:
mv port/dirent.h to port/port_dirent.h to avoid compile err when use port dir as header dir output
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5152

Differential Revision: D14779409

Pulled By: siying

fbshipit-source-id: d4162c47c979c6e8cc6a9e601802864ab3768ecb
2019-04-04 11:38:19 -07:00
Maysam Yabandeh 75e8b6dfcf Fix race condition in IteratorWithLocalStatistics (#5149)
Summary:
The ReadCallback was shared between all threads in IteratorWithLocalStatistics. A race condition was
 hence introduced with recent changes that changes the content of ReadCallback. The patch fixes that by using a separate callback per thread.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5149

Differential Revision: D14761612

Pulled By: maysamyabandeh

fbshipit-source-id: 814a316aed046c318cb90e22379a6e32ac528949
2019-04-03 16:04:38 -07:00
Maysam Yabandeh 7441a0ecba WriteUnPrepared: fix ubsan complaint (#5148)
Summary:
Ubsna complains that in initialization of WriteUnpreparedTxnReadCallback the method of the child class is used before the parent class is constructed. The patch fixes that by making the aforementioned method static.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5148

Differential Revision: D14760098

Pulled By: maysamyabandeh

fbshipit-source-id: cf19b7c1fdb5de0a54e62c1deebe09a0fa048ded
2019-04-03 15:51:30 -07:00