Commit graph

8308 commits

Author SHA1 Message Date
Vijay Nadimpalli 24b118ad98 Combine the read-ahead logic for user reads and compaction reads (#5431)
Summary:
Currently the read-ahead logic for user reads and compaction reads go through different code paths where compaction reads create new table readers and use `ReadaheadRandomAccessFile`. This change is to unify read-ahead logic to use read-ahead in BlockBasedTableReader::InitDataBlock(). As a result of the change  `ReadAheadRandomAccessFile` class and `new_table_reader_for_compaction_inputs` option will no longer be used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5431

Test Plan:
make check

Here is the benchmarking - https://gist.github.com/vjnadimpalli/083cf423f7b6aa12dcdb14c858bc18a5

Differential Revision: D15772533

Pulled By: vjnadimpalli

fbshipit-source-id: b71dca710590471ede6fb37553388654e2e479b9
2019-06-19 14:10:46 -07:00
Simon Grätzer fe90ed7a70 Replace Corruption with TryAgain status when new tail is not visible to TransactionLogIterator (#5474)
Summary:
When tailing the WAL with TransactionLogIterator, it used to return Corruption status to indicate that the WAL has new tail that is not visible to the iterator, which is a misleading status. The patch replaces it with TryAgain which is more descriptive of a status, indicating that the user needs to create a new iterator to fetch the recent tail.
Fixes https://github.com/facebook/rocksdb/issues/5455
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5474

Differential Revision: D15898953

Pulled By: maysamyabandeh

fbshipit-source-id: 40966f6457cb539e1aeb104daeada6b0e46059fc
2019-06-19 08:10:08 -07:00
Levi Tamasi 5355e527d9 Make the 'block read count' performance counters consistent (#5484)
Summary:
The patch brings the semantics of per-block-type read performance
context counters in sync with the generic block_read_count by only
incrementing the counter if the block was actually read from the file.
It also fixes index_block_read_count, which fell victim to the
refactoring in PR https://github.com/facebook/rocksdb/issues/5298.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5484

Test Plan: Extended the unit tests.

Differential Revision: D15887431

Pulled By: ltamasi

fbshipit-source-id: a3889759d0ac5759d56625d692cd828d1b9207a6
2019-06-18 19:03:24 -07:00
haoyuhuang 2e8ad03ab3 Add more stats in the block cache trace analyzer (#5482)
Summary:
This PR adds more stats in the block cache trace analyzer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5482

Differential Revision: D15883553

Pulled By: HaoyuHuang

fbshipit-source-id: 6d440e4f657af75690420102d532d0ee1ed4e9cf
2019-06-18 18:38:42 -07:00
Vaibhav Gogte f46a2a0375 Export Cache::GetCharge (#5476)
Summary:
Exporting GetCharge to cache.hh
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5476

Differential Revision: D15881882

Pulled By: riversand963

fbshipit-source-id: 3d99084d10059b4fcaaaba240606ed50bc23351c
2019-06-18 17:35:41 -07:00
Huisheng Liu 92f631da33 replace sprintf with its safe version snprintf (#5475)
Summary:
sprintf is unsafe and has buffer overrun risk. Replace it with the safer version snprintf where buffer size is supplied to avoid overrun.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5475

Differential Revision: D15879481

Pulled By: sagar0

fbshipit-source-id: 7ae1958ffc9727fa50261dfbb98ddd74e70a72d8
2019-06-18 16:42:26 -07:00
Levi Tamasi d0c6aea192 Revert to respecting only the read_tier read option for index blocks (#5481)
Summary:
PR https://github.com/facebook/rocksdb/issues/5298 subtly changed how read options are applied to the index block
during a Get, MultiGet, or iteration. Earlier, only the read_tier option
applied to the index block read; since PR https://github.com/facebook/rocksdb/issues/5298, fill_cache and
verify_checksums also have an effect. This patch restores the earlier
behavior to prevent surprise memory increases for clients due to the
index block not being cached.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5481

Test Plan: make check

Differential Revision: D15883082

Pulled By: ltamasi

fbshipit-source-id: 9a065ec3a6db5a365cf6dd5e95190a20c5756356
2019-06-18 15:02:09 -07:00
Andrew Kryczka 220870523c Fix compilation with USE_HDFS (#5444)
Summary:
The changes in 8272a6de57 were untested with `USE_HDFS=1`. There were a couple compiler errors. This PR fixes them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5444

Test Plan:
```
$ EXTRA_LDFLAGS="-L/tmp/hadoop-3.1.2/lib/native/" EXTRA_CXXFLAGS="-I/tmp/hadoop-3.1.2/include" USE_HDFS=1 make -j12 check
```

Differential Revision: D15885009

fbshipit-source-id: 2a0a63739e0b9a2819b461ad63ce1292c4833fe2
2019-06-18 14:55:59 -07:00
Adam Retter 5dc9fbd117 Update the version of ZStd for the Rocks Java static build
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5228

Differential Revision: D15880451

Pulled By: sagar0

fbshipit-source-id: 84da6f42cac15367d95bffa5336ebd002e7c3308
2019-06-18 11:57:01 -07:00
siddontang 4bd0cf541d build on ARM64 (#5450)
Summary:
Support building RocksDB on AWS ARM64

```
uname -m
aarch64
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5450

Differential Revision: D15879851

fbshipit-source-id: a9b56520a2cd9921338305a06d7103a40a3300b8
2019-06-18 11:27:45 -07:00
Yanqin Jin f287f8dc93 Fix a bug caused by secondary not skipping the beginning of new MANIFEST (#5472)
Summary:
While the secondary is replaying after the primary, the primary may switch to a new MANIFEST. The secondary is already able to detect and follow the primary to the new MANIFEST. However, the current implementation has a bug, described as follows.
The new MANIFEST's first records have been generated by VersionSet::WriteSnapshot to describe the current state of the column families and the db as of the MANIFEST creation. Since the secondary instance has already finished recovering upon start, there is no need for the secondary to process these records. Actually, if the secondary were to replay these records, the secondary may end up adding the same SST files **again** to each column family, causing consistency checks done by VersionBuilder to fail. Therefore, we record the number of records to skip at the beginning of the new MANIFEST and ignore them.

Test plan (on dev server)
```
$make clean && make -j32 all
$./db_secondary_test
```
All existing unit tests must pass as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5472

Differential Revision: D15866771

Pulled By: riversand963

fbshipit-source-id: a1eec4837fb2ad13059398efb0f437e74fd53bed
2019-06-18 11:21:37 -07:00
Zhongyi Xie ddd088c8b9 fix rocksdb lite and clang contrun test failures (#5477)
Summary:
recent commit 671d15cbdd introduced some test failures:
```
===== Running stats_history_test
[==========] Running 9 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 9 tests from StatsHistoryTest
[ RUN      ] StatsHistoryTest.RunStatsDumpPeriodSec
monitoring/stats_history_test.cc:63: Failure
dbfull()->SetDBOptions({{"stats_dump_period_sec", "0"}})
Not implemented: Not supported in ROCKSDB LITE

db/db_options_test.cc:28:11: error: unused variable 'kMicrosInSec' [-Werror,-Wunused-const-variable]
const int kMicrosInSec = 1000000;
```
This PR fixes these failures
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5477

Differential Revision: D15871814

Pulled By: miasantreble

fbshipit-source-id: 0a7023914d2c1784d9d2d3f5bfb47310d4855394
2019-06-17 21:16:29 -07:00
haoyuhuang bcfc53b436 Block cache tracing: Fix minor bugs with downsampling and some benchmark results. (#5473)
Summary:
As the code changes for block cache tracing are almost complete, I did a benchmark to compare the performance when block cache tracing is enabled/disabled.

 With 1% downsampling ratio, the performance overhead of block cache tracing is negligible. When we trace all block accesses, the throughput drops by 6 folds with 16 threads issuing random reads and all reads are served in block cache.

Setup:
RocksDB:    version 6.2
Date:       Mon Jun 17 17:11:13 2019
CPU:        24 * Intel Core Processor (Skylake)
CPUCache:   16384 KB
Keys:       20 bytes each
Values:     100 bytes each (100 bytes after compression)
Entries:    10000000
Prefix:    20 bytes
Keys per prefix:    0
RawSize:    1144.4 MB (estimated)
FileSize:   1144.4 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: NoCompression
Compression sampling rate: 0
Memtablerep: skip_list
Perf Level: 1

I ran the readrandom workload for 1 minute. Detailed throughput results:  (ops/second)
Sample rate 0: no block cache tracing.
Sample rate 1: trace all block accesses.
Sample rate 100: trace accesses 1% blocks.
1 thread |   |   |  -- | -- | -- | --
Sample rate | 0 | 1 | 100
1 MB block cache size | 13,094 | 13,166 | 13,341
10 GB block cache size | 202,243 | 188,677 | 229,182

16 threads |   |   |  -- | -- | -- | --
Sample rate | 0 | 1 | 100
1 MB block cache size | 208,761 | 178,700 | 201,872
10 GB block cache size | 2,645,996 | 426,295 | 2,587,605
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5473

Differential Revision: D15869479

Pulled By: HaoyuHuang

fbshipit-source-id: 7ae802abe84811281a6af8649f489887cd7c4618
2019-06-17 17:59:02 -07:00
haoyuhuang 2d1dd5bce7 Support computing miss ratio curves using sim_cache. (#5449)
Summary:
This PR adds a BlockCacheTraceSimulator that reports the miss ratios given different cache configurations. A cache configuration contains "cache_name,num_shard_bits,cache_capacities". For example, "lru, 1, 1K, 2K, 4M, 4G".

When we replay the trace, we also perform lookups and inserts on the simulated caches.
In the end, it reports the miss ratio for each tuple <cache_name, num_shard_bits, cache_capacity> in a output file.

This PR also adds a main source block_cache_trace_analyzer so that we can run the analyzer in command line.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5449

Test Plan:
Added tests for block_cache_trace_analyzer.
COMPILE_WITH_ASAN=1 make check -j32.

Differential Revision: D15797073

Pulled By: HaoyuHuang

fbshipit-source-id: aef0c5c2e7938f3e8b6a10d4a6a50e6928ecf408
2019-06-17 16:41:12 -07:00
Yanqin Jin 7d8d56413d Override check consistency for DBImplSecondary (#5469)
Summary:
`DBImplSecondary` calls `CheckConsistency()` during open. In the past, `DBImplSecondary` did not override this function thus `DBImpl::CheckConsistency()` is called.
The following can happen. The secondary instance is performing consistency check which calls `GetFileSize(file_path)` but the file at `file_path` is deleted by the primary instance. `DBImpl::CheckConsistency` does not account for this and fails the consistency check. This is undesirable. The solution is that, we call `DBImpl::CheckConsistency()` first. If it passes, then we are good. If not, we give it a second chance and handles the case of file(s) being deleted.

Test plan (on dev server):
```
$make clean && make -j20 all
$./db_secondary_test
```
All other existing unit tests must pass as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5469

Differential Revision: D15861845

Pulled By: riversand963

fbshipit-source-id: 507d72392508caed3cd003bb2e2aa43f993dd597
2019-06-17 15:39:55 -07:00
Zhongyi Xie 671d15cbdd Persistent Stats: persist stats history to disk (#5046)
Summary:
This PR continues the work in https://github.com/facebook/rocksdb/pull/4748 and https://github.com/facebook/rocksdb/pull/4535 by adding a new DBOption `persist_stats_to_disk` which instructs RocksDB to persist stats history to RocksDB itself. When statistics is enabled, and  both options `stats_persist_period_sec` and `persist_stats_to_disk` are set, RocksDB will periodically write stats to a built-in column family in the following form: key -> (timestamp in microseconds)#(stats name), value -> stats value. The existing API `GetStatsHistory` will detect the current value of `persist_stats_to_disk` and either read from in-memory data structure or from the hidden column family on disk.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5046

Differential Revision: D15863138

Pulled By: miasantreble

fbshipit-source-id: bb82abdb3f2ca581aa42531734ac799f113e931b
2019-06-17 15:21:50 -07:00
Maysam Yabandeh ee294c24ed Make db_bloom_filter_test parallel (#5467)
Summary:
When run under TSAN it sometimes goes over 10m and times out. The slowest ones are `DBBloomFilterTestWithParam.BloomFilter` which we have 6 of them. Making the tests run in parallel should take care of the timeout issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5467

Differential Revision: D15856912

Pulled By: maysamyabandeh

fbshipit-source-id: 26c43c55312974c1b809c070342dee037d0219f4
2019-06-17 11:13:45 -07:00
haoyuhuang d43b4cd570 Integrate block cache tracing into db_bench (#5459)
Summary:
This PR integrates the block cache tracing into db_bench. It adds three command line arguments.
-block_cache_trace_file (Block cache trace file path.) type: string default: ""
-block_cache_trace_max_trace_file_size_in_bytes (The maximum block cache
trace file size in bytes. Block cache accesses will not be logged if the
trace file size exceeds this threshold. Default is 64 GB.) type: int64
default: 68719476736
-block_cache_trace_sampling_frequency (Block cache trace sampling
frequency, termed s. It uses spatial downsampling and samples accesses to
one out of s blocks.) type: int32 default: 1
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5459

Differential Revision: D15832031

Pulled By: HaoyuHuang

fbshipit-source-id: 0ecf2f2686557251fe741a2769b21170777efa3d
2019-06-17 11:08:21 -07:00
Adam Retter d1ae67bdb9 Switch Travis to Xenial build (#4789)
Summary:
I think this should now also run on Travis's new virtualised infrastructure which affords more memory and CPU.

We also need to think about migrating from travis-ci.org to travis-ci.com.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4789

Differential Revision: D15856272

fbshipit-source-id: 10b41d21924e8a362bc9646a63ccd1a5dfc437c6
2019-06-17 10:20:02 -07:00
haoyuhuang 7a8d7358bb Integrate block cache tracer in block based table reader. (#5441)
Summary:
This PR integrates the block cache tracer into block based table reader. The tracer will write the block cache accesses using the trace_writer. The tracer is null in this PR so that nothing will be logged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5441

Differential Revision: D15772029

Pulled By: HaoyuHuang

fbshipit-source-id: a64adb92642cd23222e0ba8b10d86bf522b42f9b
2019-06-14 17:40:31 -07:00
Sagar Vemuri f1219644ec Validate CF Options when creating a new column family (#5453)
Summary:
It seems like CF Options are not properly validated  when creating a new column family with `CreateColumnFamily` API; only a selected few checks are done. Calling `ColumnFamilyData::ValidateOptions`, which is the single source for all CFOptions validations,  will help fix this. (`ColumnFamilyData::ValidateOptions` is already called at the time of `DB::Open`).

**Test Plan:**
Added a new test: `DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions`
```
TEST_TMPDIR=/dev/shm ./db_test --gtest_filter=DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions
```
Also ran gtest-parallel to make sure the new test is not flaky.
```
TEST_TMPDIR=/dev/shm ~/gtest-parallel/gtest-parallel ./db_test --gtest_filter=DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions --repeat=10000
[10000/10000] DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions (15 ms)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5453

Differential Revision: D15816851

Pulled By: sagar0

fbshipit-source-id: 9e702b9850f5c4a7e0ef8d39e1e6f9b81e7fe1e5
2019-06-14 14:11:10 -07:00
Huisheng Liu b47cfec5d0 fix compilation error on MSVC (#5458)
Summary:
"__attribute__((__weak__))" was introduced in port\jemalloc_helper.h. It's not supported by Microsoft VS 2015, resulting in compile error. This fix adds a #if branch to work around the compile issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5458

Differential Revision: D15827285

fbshipit-source-id: 8c5f7ad31de1ac677bd96f16c4450767de834beb
2019-06-14 11:28:13 -07:00
Maysam Yabandeh 58c78358ef Set executeLocal on child lego jobs (#5456)
Summary:
This property is needed to run the child jobs on the same host and thus propagate the child job status back to the parent's.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5456

Reviewed By: yancouto

Differential Revision: D15824382

Pulled By: maysamyabandeh

fbshipit-source-id: 42f2efbedaa3a8b399281105f0ce793c1c9a6191
2019-06-14 10:38:04 -07:00
haoyuhuang 89695bfbaa Remove unused variable (#5457)
Summary:
This PR removes the unused variable that causes CLANG build to fail.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5457

Differential Revision: D15825027

Pulled By: HaoyuHuang

fbshipit-source-id: 72c847c39ca310560efcbc5938cffa6f31164068
2019-06-14 09:17:09 -07:00
haoyuhuang bb4178066d Integrate block cache tracer into db_impl (#5433)
Summary:
This PR integrates the block cache tracer class into db_impl.cc.
db_impl.cc contains a member variable of AtomicBlockCacheTraceWriter class and passes its reference to the block_based_table_reader.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5433

Differential Revision: D15728016

Pulled By: HaoyuHuang

fbshipit-source-id: 23d5659e8c82d556833dcc1a5558aac8c1f7db71
2019-06-13 15:43:10 -07:00
Levi Tamasi a3b8c76d8e Add missing check before calling PurgeObsoleteFiles in EnableFileDeletions (#5448)
Summary:
Calling PurgeObsoleteFiles with a JobContext for which HaveSomethingToDelete
is false is a precondition violation. This would trigger an assertion in debug builds;
however, in release builds with assertions disabled, this can result in the
pending_purge_obsolete_files_ counter in DBImpl underflowing, which in turn can lead
to the process hanging during database close.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5448

Differential Revision: D15792569

Pulled By: ltamasi

fbshipit-source-id: 82d92c9b4f6a9efcdc69dbb3d5a52a1ae2dd2472
2019-06-13 14:43:13 -07:00
Andrew Kryczka 2c9df9f9e5 Dynamic test whether sync_file_range returns ENOSYS (#5416)
Summary:
`sync_file_range` returns `ENOSYS` on Windows Subsystem for Linux even
when using a supposedly supported filesystem like ext4. To handle this
case we can do a dynamic check that a no-op `sync_file_range`
invocation, which is accomplished by passing zero for the `flags`
argument, succeeds.

Also I rearranged the function and comments to hopefully make it more
easily understandable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5416

Differential Revision: D15807061

fbshipit-source-id: d31d94e1f228b7850ea500e6199f8b5daf8cfbd3
2019-06-13 13:56:10 -07:00
Bin Fan ec8111c5a4 Add Alluxio to USERS.md (#5434)
Summary:
Add Alluxio's use case of RocksDB to `USERS.md` for metadata service
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5434

Differential Revision: D15766559

Pulled By: riversand963

fbshipit-source-id: b68ef851f8f92e0925c31e55296260225fdf849e
2019-06-13 12:25:26 -07:00
Patrick Zhang 5c76ba9dc4 Support rocksdbjava aarch64 build and test (#5258)
Summary:
Verified with an Ampere Computing eMAG aarch64 system.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5258

Differential Revision: D15807309

Pulled By: maysamyabandeh

fbshipit-source-id: ab85d2fd3fe40e6094430ab0eba557b1e979510d
2019-06-13 11:48:10 -07:00
Maysam Yabandeh 60f3ec2ca5 Fix appveyor compliant about passing const to thread (#5447)
Summary:
CLANG would complain if we pass const to lambda function and appveyor complains if we don't (https://github.com/facebook/rocksdb/pull/5443). The patch fixes that by using the default capture mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5447

Differential Revision: D15788722

Pulled By: maysamyabandeh

fbshipit-source-id: 47e7f49264afe31fdafe42cb8bf93da126abfca9
2019-06-12 15:06:22 -07:00
Maysam Yabandeh f9842869cf Disable pipeline writes in stress test (#5445)
Summary:
The tsan crash tests are failing with a data race compliant with pipelined write option. Temporarily disable it until its concurrency issue are fixed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5445

Differential Revision: D15783824

Pulled By: maysamyabandeh

fbshipit-source-id: 413a0c3230b86f524fc7eeea2cf8e8375406e65b
2019-06-12 11:12:36 -07:00
Maysam Yabandeh f43edff9ac Disable kPipelinedWrite in MultiThreaded (#5442)
Summary:
TSAN tests report a race condition. We temporarily exclude kPipelinedWrite from MultiThreaded until the race condition is fixed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5442

Differential Revision: D15782349

Pulled By: maysamyabandeh

fbshipit-source-id: 42b4f9b3fa9137f0675e13ad132c0a06800c1bdd
2019-06-12 10:37:40 -07:00
Maysam Yabandeh 4a285d0dd3 Remove passing const variable to thread (#5443)
Summary:
CLANG complains that passing const to thread is not necessary. The patch removes it form PreparedHeap::Concurrent test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5443

Differential Revision: D15781598

Pulled By: maysamyabandeh

fbshipit-source-id: 3aceb05d96182fa4726d6d37eed45fd3aac4c016
2019-06-12 09:45:57 -07:00
Maysam Yabandeh 773f914a40 WritePrepared: switch PreparedHeap from priority_queue to deque (#5436)
Summary:
Internally PreparedHeap is currently using a priority_queue. The rationale was the in the initial design PreparedHeap::AddPrepared could be called in arbitrary order. With the recent optimizations, we call ::AddPrepared only from the main write queue, which results into in-order insertion into PreparedHeap. The patch thus replaces the underlying priority_queue with a more efficient deque implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5436

Differential Revision: D15752147

Pulled By: maysamyabandeh

fbshipit-source-id: e6960f2b2097e13137dded1ceeff3b10b03b0aeb
2019-06-11 19:55:14 -07:00
Manuel Ung ca1aee2a19 WriteUnprepared: commit only from the 2nd queue (#5439)
Summary:
This is a port of this PR into WriteUnprepared:
https://github.com/facebook/rocksdb/pull/5014

This also reverts this test change to restore some flaky write unprepared
tests: https://github.com/facebook/rocksdb/pull/5315

Tested with:
$ gtest-parallel ./transaction_test --gtest_filter=MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/9 --repeat=128
[128/128] MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/9 (18250 ms)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5439

Differential Revision: D15761405

Pulled By: lth

fbshipit-source-id: ae2581fd942d8a5b3f9278fd6bc3c1ac0b2c964c
2019-06-11 18:01:39 -07:00
Levi Tamasi ba64a4cf52 Revert "Reduce iterator key comparison for upper/lower bound check (#5111)" (#5440)
Summary:
This reverts commit f3a7847598.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5440

Differential Revision: D15765967

Pulled By: ltamasi

fbshipit-source-id: d027fe24132e3729289cd7c01857a7eb449d9dd0
2019-06-11 16:23:41 -07:00
Yanqin Jin 7177dc46a1 Handle missing WAL in secondary mode (#5323)
Summary:
In secondary mode, it is possible that the secondary lists the primary's WAL
directory, finds a WAL and tries to open it. It is possible that the primary
deletes the WAL after secondary listing dir but before the secondary opening
it. Then the secondary will fail to open the WAL file with a PathNotFound
status. In this case, we can return OK without replaying WAL and optionally
replay more MANIFEST.

Test Plan (on my dev machine):
Without this PR, the following will fail several times out of 100 runs.
```
~/gtest-parallel/gtest-parallel -r 100 -w 16 ./db_secondary_test --gtest_filter=DBSecondaryTest.SwitchToNewManifestDuringOpen
```
With this PR, the above should always succeed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5323

Differential Revision: D15763878

Pulled By: riversand963

fbshipit-source-id: c7164fa7cb8d9001abc258b6a2dc93613e4f38ff
2019-06-11 13:08:28 -07:00
haoyuhuang 9bbccda01e First commit for block cache trace analyzer (#5425)
Summary:
This PR contains the first commit for block cache trace analyzer. It reads a block cache trace file and prints statistics of the traces.

We will extend this class to provide more functionalities.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5425

Differential Revision: D15709580

Pulled By: HaoyuHuang

fbshipit-source-id: 2f43bd2311f460ab569880819d95eeae217c20bb
2019-06-11 12:22:44 -07:00
sdong 58c4aee42e TransactionUtil::CheckKey() to skip unnecessary history (#4941)
Summary:
If a memtable definitely covers a key, there isn't a need to check older memtables.
We can skip them by checking the earliest sequence number.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4941

Differential Revision: D13932666

fbshipit-source-id: b9d52f234b8ad9dd3bf6547645cd457175a3ca9b
2019-06-11 11:46:42 -07:00
Levi Tamasi a94aef6596 Fix DBTest.DynamicMiscOptions so it passes even with Snappy disabled (#5438)
Summary:
This affects our "no compression" automated tests. Since PR #5368, DBTest.DynamicMiscOptions has been failing with:

db/db_test.cc:4889: Failure
dbfull()->SetOptions({{"compression", "kSnappyCompression"}})
Invalid argument: Compression type Snappy is not linked with the binary.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5438

Differential Revision: D15752100

Pulled By: ltamasi

fbshipit-source-id: 3f19eff7cafc03b333965be0203c5853d2a9cb71
2019-06-10 18:47:58 -07:00
Maysam Yabandeh c8c1a549f0 Avoid deadlock between mutex_ and log_write_mutex_ (#5437)
Summary:
To avoid deadlock mutex_ should never be acquired before log_write_mutex_. The patch documents that and also fixes one case in ::FlushWAL that acquires mutex_ through ::WriteStatusCheck when it already holds lock on log_write_mutex_.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5437

Differential Revision: D15749722

Pulled By: maysamyabandeh

fbshipit-source-id: f57b69c44b4b80cc6d7ddf3d3fdf4a9eb5a5a45a
2019-06-10 17:06:50 -07:00
Maysam Yabandeh b2584577fa Remove global locks from FlushScheduler (#5372)
Summary:
FlushScheduler's methods are instrumented with debug-time locks to check the scheduler state against a simple container definition. Since https://github.com/facebook/rocksdb/pull/2286 the scope of such locks are widened to the entire methods' body. The result is that the concurrency tested during testing (in debug mode) is stricter than the concurrency level manifested at runtime (in release mode).
The patch reverts this change to reduce the scope of such locks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5372

Differential Revision: D15545831

Pulled By: maysamyabandeh

fbshipit-source-id: 01d69191afb1dd807d4bdc990fc74813ae7b5426
2019-06-10 16:50:26 -07:00
Yanqin Jin 641cc8d541 Use CreateLoggerFromOptions function (#5427)
Summary:
Use `CreateLoggerFromOptions` function to reduce code duplication.

Test plan (on my machine)
```
$make clean && make -j32 db_secondary_test
$KEEP_DB=1 ./db_secondary_test
```
Verify all info logs of the secondary instance are properly logged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5427

Differential Revision: D15748922

Pulled By: riversand963

fbshipit-source-id: bad7261df1b8373efc504f141efc7871e375a311
2019-06-10 16:00:30 -07:00
haoyuhuang 5efa0d6b0d Create a BlockCacheLookupContext to enable fine-grained block cache tracing. (#5421)
Summary:
BlockCacheLookupContext only contains the caller for now.
We will trace block accesses at five places:
1. BlockBasedTable::GetFilter.
2. BlockBasedTable::GetUncompressedDict.
3. BlockBasedTable::MaybeReadAndLoadToCache. (To trace access on data, index, and range deletion block.)
4. BlockBasedTable::Get. (To trace the referenced key and whether the referenced key exists in a fetched data block.)
5. BlockBasedTable::MultiGet. (To trace the referenced key and whether the referenced key exists in a fetched data block.)

We create the context at:
1. BlockBasedTable::Get. (kUserGet)
2. BlockBasedTable::MultiGet. (kUserMGet)
3. BlockBasedTable::NewIterator. (either kUserIterator, kCompaction, or external SST ingestion calls this function.)
4. BlockBasedTable::Open. (kPrefetch)
5. Index/Filter::CacheDependencies. (kPrefetch)
6. BlockBasedTable::ApproximateOffsetOf. (kCompaction or kUserApproximateSize).

I loaded 1 million key-value pairs into the database and ran the readrandom benchmark with a single thread. I gave the block cache 10 GB to make sure all reads hit the block cache after warmup. The throughput is comparable.
Throughput of this PR: 231334 ops/s.
Throughput of the master branch: 238428 ops/s.

Experiment setup:
RocksDB:    version 6.2
Date:       Mon Jun 10 10:42:51 2019
CPU:        24 * Intel Core Processor (Skylake)
CPUCache:   16384 KB
Keys:       20 bytes each
Values:     100 bytes each (100 bytes after compression)
Entries:    1000000
Prefix:    20 bytes
Keys per prefix:    0
RawSize:    114.4 MB (estimated)
FileSize:   114.4 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: NoCompression
Compression sampling rate: 0
Memtablerep: skip_list
Perf Level: 1

Load command: ./db_bench --benchmarks="fillseq" --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000

Run command: ./db_bench --benchmarks="readrandom,stats" --use_existing_db --threads=1 --duration=120 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000 --duration=120

TODOs:
1. Create a caller for external SST file ingestion and differentiate the callers for iterator.
2. Integrate tracer to trace block cache accesses.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5421

Differential Revision: D15704258

Pulled By: HaoyuHuang

fbshipit-source-id: 4aa8a55f8cb1576ffb367bfa3186a91d8f06d93a
2019-06-10 15:33:27 -07:00
anand76 63ace8ef0e Reuse data block iterator in BlockBasedTableReader::MultiGet() (#5314)
Summary:
Instead of creating a new DataBlockIterator for every key in a MultiGet batch, reuse it if the next key is in the same block. This results in a small 1-2% cpu improvement.

TEST_TMPDIR=/dev/shm/multiget numactl -C 10  ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4

Without the change -
multireadrandom :       3.066 micros/op 326122 ops/sec; (29375968 of 29375968 found)

With the change -
multireadrandom :       3.003 micros/op 332945 ops/sec; (29983968 of 29983968 found)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5314

Differential Revision: D15742108

Pulled By: anand1976

fbshipit-source-id: 220fb0b8eea9a0d602ddeb371528f7af7936d771
2019-06-10 13:31:19 -07:00
Yanqin Jin 6ce5580882 Improve memtable earliest seqno assignment for secondary instance (#5413)
Summary:
In regular RocksDB instance, `MemTable::earliest_seqno_` is "db sequence number at the time of creation". However, we cannot use the db sequence number to set the value of `MemTable::earliest_seqno_` for secondary instance, i.e. `DBImplSecondary` due to the logic of MANIFEST and WAL replay.
When replaying the log files of the primary, the secondary instance first replays MANIFEST and updates the db sequence number if necessary. Next, the secondary replays WAL files, creates new memtables if necessary and inserts key-value pairs into memtables. The following can occur when the db has two or more column families.
Assume the db has column family "default" and "cf1". At a certain in time, both "default" and "cf1" have data in memtables.
1. Primary triggers a flush and flushes "cf1". "default" is **not** flushed.
2. Secondary replays the MANIFEST updates its db sequence number to the latest value learned from the MANIFEST.
3. Secondary starts to replay WAL that contains the writes to "default". It is possible that the write batches' sequence numbers are smaller than the db sequence number. In this case, these write batches will be skipped, and these updates will not be visible to reader until "default" is later flushed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5413

Differential Revision: D15637407

Pulled By: riversand963

fbshipit-source-id: 3de3fe35cfc6f1b9f844f3f926f0df29717b6580
2019-06-10 12:58:14 -07:00
Maysam Yabandeh c292dc8540 WritePrepared: reduce prepared_mutex_ overhead (#5420)
Summary:
The patch reduces the contention over prepared_mutex_ using these techniques:
1) Move ::RemovePrepared() to be called from the commit callback when we have two write queues.
2) Use two separate mutex for PreparedHeap, one prepared_mutex_ needed for ::RemovePrepared, and one ::push_pop_mutex() needed for ::AddPrepared(). Given that we call ::AddPrepared only from the first write queue and ::RemovePrepared mostly from the 2nd, this will result into each the two write queues not competing with each other over a single mutex. ::RemovePrepared might occasionally need to acquire ::push_pop_mutex() if ::erase() ends up with calling ::pop()
3) Acquire ::push_pop_mutex() on the first callback of the write queue and release it on the last.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5420

Differential Revision: D15741985

Pulled By: maysamyabandeh

fbshipit-source-id: 84ce8016007e88bb6e10da5760ba1f0d26347735
2019-06-10 11:53:31 -07:00
Levi Tamasi a16d0cc494 Fix build errors regarding const qualifier being ignored on cast result type (#5432)
Summary:
This affects some TSAN builds:

env/env_test.cc: In member function ‘virtual void rocksdb::EnvPosixTestWithParam_MultiRead_Test::TestBody()’:
env/env_test.cc:1126:76: error: type qualifiers ignored on cast result type [-Werror=ignored-qualifiers]
       auto data = NewAligned(kSectorSize * 8, static_cast<const char>(i + 1));
                                                                            ^
env/env_test.cc:1154:77: error: type qualifiers ignored on cast result type [-Werror=ignored-qualifiers]
       auto buf = NewAligned(kSectorSize * 8, static_cast<const char>(i*2 + 1));
                                                                             ^
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5432

Differential Revision: D15727277

Pulled By: ltamasi

fbshipit-source-id: dc0e687b123e7c4d703ccc0c16b7167e07d1c9b0
2019-06-07 19:37:41 -07:00
anand76 b703a56e5c Potential fix for stress test failure due to "SST file ahead of WAL" error (#5412)
Summary:
I'm not able to prove it, but the stress test failure may be caused by the following sequence of events -

1. Crash db_stress while writing the log file. This should result in a corrupted WAL.
2. Run db_stress with recycle_log_file_num=1. Crash during recovery immediately after writing manifest and updating the current file. The old log from the previous run is left behind, but the memtable would have been flushed during recovery and the CF log number will point to the newer log
3. Run db_stress with recycle_log_file_num=0. During recovery, the old log file will be processed and the corruption will be detected. Since the CF has moved ahead, we get the "SST file is ahead of WAL" error

Test -
1. stress_crash
2. make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5412

Differential Revision: D15699120

Pulled By: anand1976

fbshipit-source-id: 9092ce81e7c4a0b4b4e66560c23ea4812a4d9cbe
2019-06-07 15:35:47 -07:00
Levi Tamasi 0f48e56f96 Revert to checking the upper bound on a per-key basis in BlockBasedTableIterator (#5428)
Summary:
PR #5111 reduced the number of key comparisons when iterating with
upper/lower bounds; however, this caused a regression for MyRocks.
Reverting to the previous behavior in BlockBasedTableIterator as a hotfix.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5428

Differential Revision: D15721038

Pulled By: ltamasi

fbshipit-source-id: 5450106442f1763bccd17f6cfd648697f2ae8b6c
2019-06-07 15:17:05 -07:00