Commit graph

7526 commits

Author SHA1 Message Date
Abhishek Madan 7528130e38 Cache fragmented range tombstones in BlockBasedTableReader (#4493)
Summary:
This allows tombstone fragmenting to only be performed when the table is opened, and cached for subsequent accesses.

On the same DB used in #4449, running `readrandom` results in the following:
```
readrandom   :       0.983 micros/op 1017076 ops/sec;   78.3 MB/s (63103 of 100000 found)
```

Now that Get performance in the presence of range tombstones is reasonable, I also compared the performance between a DB with range tombstones, "expanded" range tombstones (several point tombstones that cover the same keys the equivalent range tombstone would cover, a common workaround for DeleteRange), and no range tombstones. The created DBs had 5 million keys each, and DeleteRange was called at regular intervals (depending on the total number of range tombstones being written) after 4.5 million Puts. The table below summarizes the results of a `readwhilewriting` benchmark (in order to provide somewhat more realistic results):
```
   Tombstones?    | avg micros/op | stddev micros/op |  avg ops/s   | stddev ops/s
----------------- | ------------- | ---------------- | ------------ | ------------
None              |        0.6186 |          0.04637 | 1,625,252.90 | 124,679.41
500 Expanded      |        0.6019 |          0.03628 | 1,666,670.40 | 101,142.65
500 Unexpanded    |        0.6435 |          0.03994 | 1,559,979.40 | 104,090.52
1k Expanded       |        0.6034 |          0.04349 | 1,665,128.10 | 125,144.57
1k Unexpanded     |        0.6261 |          0.03093 | 1,600,457.50 |  79,024.94
5k Expanded       |        0.6163 |          0.05926 | 1,636,668.80 | 154,888.85
5k Unexpanded     |        0.6402 |          0.04002 | 1,567,804.70 | 100,965.55
10k Expanded      |        0.6036 |          0.05105 | 1,667,237.70 | 142,830.36
10k Unexpanded    |        0.6128 |          0.02598 | 1,634,633.40 |  72,161.82
25k Expanded      |        0.6198 |          0.04542 | 1,620,980.50 | 116,662.93
25k Unexpanded    |        0.5478 |          0.0362  | 1,833,059.10 | 121,233.81
50k Expanded      |        0.5104 |          0.04347 | 1,973,107.90 | 184,073.49
50k Unexpanded    |        0.4528 |          0.03387 | 2,219,034.50 | 170,984.32
```

After a large enough quantity of range tombstones are written, range tombstone Gets can become faster than reading from an equivalent DB with several point tombstones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4493

Differential Revision: D10842844

Pulled By: abhimadan

fbshipit-source-id: a7d44534f8120e6aabb65779d26c6b9df954c509
2018-10-25 19:26:44 -07:00
Zhongyi Xie fe0d23059d Fix two contrun job failures (#4587)
Summary:
Currently there are two contrun test failures:
* rocksdb-contrun-lite:
> tools/db_bench_tool.cc: In function ‘int rocksdb::db_bench_tool(int, char**)’:
tools/db_bench_tool.cc:5814:5: error: ‘DumpMallocStats’ is not a member of ‘rocksdb’
     rocksdb::DumpMallocStats(&stats_string);
     ^
make: *** [tools/db_bench_tool.o] Error 1
* rocksdb-contrun-unity:
> In file included from unity.cc:44:0:
db/range_tombstone_fragmenter.cc: In member function ‘void rocksdb::FragmentedRangeTombstoneIterator::FragmentTombstones(std::unique_ptr<rocksdb::InternalIteratorBase<rocksdb::Slice> >, rocksdb::SequenceNumber)’:
db/range_tombstone_fragmenter.cc:90:14: error: reference to ‘ParsedInternalKeyComparator’ is ambiguous
   auto cmp = ParsedInternalKeyComparator(icmp_);

This PR will fix them
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4587

Differential Revision: D10846554

Pulled By: miasantreble

fbshipit-source-id: 8d3358879e105060197b1379c84aecf51b352b93
2018-10-24 20:16:45 -07:00
Yanqin Jin eb8c9918f7 Remove unused variable
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4585

Differential Revision: D10841983

Pulled By: riversand963

fbshipit-source-id: 6a7e0b40065bcfbb10a2cac0cec1e8da0750a617
2018-10-24 15:51:45 -07:00
Jigar Bhati 6ecd26af27 WriteBufferManager JNI fixes (#4579)
Summary:
1. `WriteBufferManager` should have a reference alive in Java side through `Options`/`DBOptions` otherwise, if it's GC'ed at java side, native side can seg fault.
2. native method `setWriteBufferManager()` in `DBOptions.java` doesn't have it's jni method invocation in rocksdbjni which is added in this PR
3. `DBOptionsTest.java` is referencing object of `Options`. Instead it should be testing against `DBOptions`. Seems like a copy paste error.
4. Add a getter for WriteBufferManager.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4579

Differential Revision: D10561150

Pulled By: sagar0

fbshipit-source-id: 139a15c7f051a9f77b4200215b88267b48fbc487
2018-10-24 12:40:52 -07:00
Abhishek Madan 8c78348c77 Use only "local" range tombstones during Get (#4449)
Summary:
Previously, range tombstones were accumulated from every level, which
was necessary if a range tombstone in a higher level covered a key in a lower
level. However, RangeDelAggregator::AddTombstones's complexity is based on
the number of tombstones that are currently stored in it, which is wasteful in
the Get case, where we only need to know the highest sequence number of range
tombstones that cover the key from higher levels, and compute the highest covering
sequence number at the current level. This change introduces this optimization, and
removes the use of RangeDelAggregator from the Get path.

In the benchmark results, the following command was used to initialize the database:
```
./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8
```

...and the following command was used to measure read throughput:
```
./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32
```

The filluniquerandom command was only run once, and the resulting database was used
to measure read performance before and after the PR. Both binaries were compiled with
`DEBUG_LEVEL=0`.

Readrandom results before PR:
```
readrandom   :       4.544 micros/op 220090 ops/sec;   16.9 MB/s (63103 of 100000 found)
```

Readrandom results after PR:
```
readrandom   :      11.147 micros/op 89707 ops/sec;    6.9 MB/s (63103 of 100000 found)
```

So it's actually slower right now, but this PR paves the way for future optimizations (see #4493).

----
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4449

Differential Revision: D10370575

Pulled By: abhimadan

fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d
2018-10-24 12:31:12 -07:00
Zhongyi Xie 21bf7421ca use per-level perf context for bloom filter related counters (#4581)
Summary:
PR https://github.com/facebook/rocksdb/pull/4226 introduced per-level perf context which allows breaking down perf context by levels.
This PR takes advantage of the feature to populate a few counters related to bloom filters
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4581

Differential Revision: D10518010

Pulled By: miasantreble

fbshipit-source-id: 011244561783ec860d32d5b0fa6bce6e78d70ef8
2018-10-24 12:21:38 -07:00
Simon Grätzer ad21b1af52 Set WriteCommitted txn id to commit sequence number (#4565)
Summary:
SetId and GetId are the experimental API that so far being used in WritePrepared and WriteUnPrepared transactions, where the id is assigned at the prepare time. The patch extends the API to WriteCommitted transactions, by setting the id at commit time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4565

Differential Revision: D10557862

Pulled By: maysamyabandeh

fbshipit-source-id: 2b27a140682b6185a4988fa88f8152628e0d67af
2018-10-24 12:21:38 -07:00
Sagar Vemuri abb8ecb4cd Add missing methods to WritableFileWrapper (#4584)
Summary:
`WritableFileWrapper` was missing some newer methods that were added to `WritableFile`. Without these functions, the missing wrapper methods would fallback to using the default implementations in WritableFile instead of using the corresponding implementations in, say, `PosixWritableFile` or `WinWritableFile`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4584

Differential Revision: D10559199

Pulled By: sagar0

fbshipit-source-id: 0d0f18a486aee727d5b8eebd3110a41988e27391
2018-10-24 12:19:54 -07:00
Yi Wu 0415244bfa option to print malloc stats at the end of db_bench (#4582)
Summary:
Option to print malloc stats to stdout at the end of db_bench. This is different from `--dump_malloc_stats`, which periodically print the same information to LOG file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4582

Differential Revision: D10520814

Pulled By: yiwu-arbug

fbshipit-source-id: beff5e514e414079d31092b630813f82939ffe5c
2018-10-24 11:39:05 -07:00
Neil Mayhew 43dbd4411e Adapt three unit tests with newer compiler/libraries (#4562)
Summary:
This fixes three tests that fail with relatively recent tools and libraries:

The tests are:

* `spatial_db_test`
* `table_test`
* `db_universal_compaction_test`

I'm using:

* `gcc` 7.3.0
* `glibc` 2.27
* `snappy` 1.1.7
* `gflags` 2.2.1
* `zlib` 1.2.11
* `bzip2` 1.0.6.0.1
* `lz4` 1.8.2
* `jemalloc` 5.0.1

The versions used in the Travis environment (which is two Ubuntu LTS versions behind the current one and doesn't use `lz4` or `jemalloc`) don't seem to have a problem. However, to be safe, I verified that these tests pass with and without my changes in a trusty Docker container without `lz4` and `jemalloc`.

However, I do get an unrelated set of other failures when using a trusty Docker container that uses `lz4` and `jemalloc`:

```
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/0, where GetParam() = (1, false) (1189 ms)
[ RUN      ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/1
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/1, where GetParam() = (1, true) (1246 ms)
[ RUN      ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/2
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/2, where GetParam() = (3, false) (1237 ms)
[ RUN      ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/3
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/3, where GetParam() = (3, true) (1195 ms)
[ RUN      ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/4
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/4, where GetParam() = (5, false) (1161 ms)
[ RUN      ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/5
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
  Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[  FAILED  ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/5, where GetParam() = (5, true) (1229 ms)
```

I haven't attempted to fix these since I'm not using trusty and Travis doesn't use `lz4` and `jemalloc`. However, the final commit in this PR does at least fix the compilation errors that occur when using trusty's version of `lz4`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4562

Differential Revision: D10510917

Pulled By: maysamyabandeh

fbshipit-source-id: 59534042015ec339270e5fc2f6ac4d859370d189
2018-10-24 08:17:56 -07:00
Zhongyi Xie f6b151f16d fix clang analyzer error (#4583)
Summary:
clang analyzer currently fails with the following warnings:
> db/log_reader.cc:323:9: warning: Undefined or garbage value returned to caller
        return r;
        ^~~~~~~~
db/log_reader.cc:344:11: warning: Undefined or garbage value returned to caller
          return r;
          ^~~~~~~~
db/log_reader.cc:369:11: warning: Undefined or garbage value returned to caller
          return r;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4583

Differential Revision: D10523517

Pulled By: miasantreble

fbshipit-source-id: 0cc8b8f27657b202bead148bbe7c4aa84fed095b
2018-10-23 22:14:54 -07:00
Yi Wu c7a45ca91f BlobDB: handle IO error on write (#4580)
Summary:
A fix similar to #4410 but on the write path. On IO error on `SelectBlobFile()` we didn't return error code properly, but simply a nullptr of `BlobFile`. The `AppendBlob()` method didn't have null check for the pointer and caused crash. The fix make sure we properly return error code in this case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4580

Differential Revision: D10513849

Pulled By: yiwu-arbug

fbshipit-source-id: 80bca920d1d7a3541149de981015ad83e0aa14b5
2018-10-23 15:03:45 -07:00
Yi Wu 742302a1a3 Fix compile error with aligned-new (#4576)
Summary:
In fbcode when we build with clang7++, although -faligned-new is available in compile phase, we link with an older version of libstdc++.a and it doesn't come with aligned-new support (e.g. `nm libstdc++.a | grep align_val_t` return empty). In this case the previous -faligned-new detection can pass but will end up with link error. Fixing it by only have the detection for non-fbcode build.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4576

Differential Revision: D10500008

Pulled By: yiwu-arbug

fbshipit-source-id: b375de4fbb61d2a08e54ab709441aa8e7b4b08cf
2018-10-23 10:55:41 -07:00
jsteemann d1c0d3f358 Small issues (#4564)
Summary:
Couple of very minor improvements (typos in comments, full qualification of class name, reordering members of a struct to make it smaller)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4564

Differential Revision: D10510183

Pulled By: maysamyabandeh

fbshipit-source-id: c7ddf9bfbf2db08cd31896c3fd93789d3fa68c8b
2018-10-23 10:35:57 -07:00
Maysam Yabandeh c34cc40424 Fix user comparator receiving internal key (#4575)
Summary:
There was a bug that the user comparator would receive the internal key instead of the user key. The bug was due to RangeMightExistAfterSortedRun expecting user key but receiving internal key when called in GenerateBottommostFiles. The patch augment an existing unit test to reproduce the bug and fixes it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4575

Differential Revision: D10500434

Pulled By: maysamyabandeh

fbshipit-source-id: 858346d2fd102cce9e20516d77338c112bdfe366
2018-10-23 08:14:46 -07:00
Siying Dong 7024263682 Dynamic level to adjust level multiplier when write is too heavy (#4338)
Summary:
Level compaction usually performs poorly when the writes so heavy that the level targets can't be guaranteed. With this improvement, we improve level_compaction_dynamic_level_bytes = true so that in the write heavy cases, the level multiplier can be slightly adjusted based on the size of L0.

We keep the behavior the same if number of L0 files is under 2X compaction trigger and the total size is less than options.max_bytes_for_level_base, so that unless write is so heavy that compaction cannot keep up, the behavior doesn't change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4338

Differential Revision: D9636782

Pulled By: siying

fbshipit-source-id: e27fc17a7c29c84b00064cc17536a01dacef7595
2018-10-22 10:21:47 -07:00
Yi Wu 933250e355 Fix RepeatableThreadTest::MockEnvTest hang (#4560)
Summary:
When `MockTimeEnv` is used in test to mock time methods, we cannot use `CondVar::TimedWait` because it is using real time, not the mocked time for wait timeout. On Mac the method can return immediately without awaking other waiting threads, if the real time is larger than `wait_until` (which is a mocked time). When that happen, the `wait()` method will fall into an infinite loop.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4560

Differential Revision: D10472851

Pulled By: yiwu-arbug

fbshipit-source-id: 898902546ace7db7ac509337dd8677a527209d19
2018-10-21 20:17:18 -07:00
Simon Grätzer f959e88048 Fix printf formatting on MacOS (#4533)
Summary:
On MacOS with clang the compilation of _tools/db_bench_tool.cc_ always fails because the format used in a `fprintf` call has the wrong type. This PR should hopefully fix this issue
```
tools/db_bench_tool.cc:4233:61: error: format specifies type 'unsigned long long' but the argument has type 'size_t' (aka 'unsigned long')
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4533

Differential Revision: D10471657

Pulled By: maysamyabandeh

fbshipit-source-id: f20f5f3756d3571b586c895c845d0d4d1e34a398
2018-10-19 14:46:09 -07:00
Siying Dong c17383f918 Fix WriteBatchWithIndex's SeekForPrev() (#4559)
Summary:
WriteBatchWithIndex's SeekForPrev() has a bug that we internally place the position just before the seek key rather than after. This makes the iterator to miss the result that is the same as the seek key. Fix it by position the iterator equal or smaller.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4559

Differential Revision: D10468534

Pulled By: siying

fbshipit-source-id: 2fb371ae809c561b60a1c11cef71e1c66fea1f19
2018-10-19 14:40:50 -07:00
Yanqin Jin da4aa59b4c Add read retry support to log reader (#4394)
Summary:
Current `log::Reader` does not perform retry after encountering `EOF`. In the future, we need the log reader to be able to retry tailing the log even after `EOF`.

Current implementation is simple. It does not provide more advanced retry policies. Will address this in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4394

Differential Revision: D9926508

Pulled By: riversand963

fbshipit-source-id: d86d145792a41bd64a72f642a2a08c7b7b5201e1
2018-10-19 11:53:00 -07:00
Abhishek Madan 35cd754a6d Add writes_before_delete_range flag to db_bench (#4538)
Summary:
The new flag allows tombstones to be generated after enough
keys have been written to the database, which makes it easier to ensure
that tombstones cover a lot of keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4538

Differential Revision: D10455685

Pulled By: abhimadan

fbshipit-source-id: f25d5421745a353c830dea12b79784e852056551
2018-10-18 17:19:59 -07:00
Maysam Yabandeh 0afa5b53d7 Disable GroupCommitTest in Appveyor (#4536)
Summary:
We have already disabled it on Travis since it has been too flaky. The same problem arises in Appveyor as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4536

Differential Revision: D10452240

Pulled By: maysamyabandeh

fbshipit-source-id: 728f4ecddf780097159dc0a0737d460eb5ce4f09
2018-10-18 14:21:09 -07:00
Philip Jameson 56e129da01 Remove usages of headers attribute as a string
Reviewed By: andrewjcg

Differential Revision: D10409082

fbshipit-source-id: a1432270f79c2baf2e52e3351b5f481c7398c58d
2018-10-18 13:59:08 -07:00
Huachao Huang faa2c90f7c cmake: fix FORCE_SSE42 (#4490)
Summary:
When HAVE_SSE42 is true, it should always add "-msse4.2 -mpclmul" no
matter if FORCE_SSE42 is true or not.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4490

Differential Revision: D10384256

Pulled By: yiwu-arbug

fbshipit-source-id: c82bc988f017981a0f84ddc136f36e2366c3ea8a
2018-10-18 11:22:02 -07:00
Jigar Bhati a4d9aa6b18 Plumb WriteBufferManager through JNI (#4492)
Summary:
Allow rocks java to explicitly create WriteBufferManager by plumbing it to the native code through JNI.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4492

Differential Revision: D10428506

Pulled By: sagar0

fbshipit-source-id: cd9dd8c2ef745a0303416b44e2080547bdcca1fd
2018-10-17 11:49:57 -07:00
Abhishek Madan 45f213b558 Lazily initialize RangeDelAggregator stripe map entries (#4497)
Summary:
When there are no range deletions, flush and compaction perform a binary search
on an effectively empty map every time they call ShouldDelete. This PR lazily
initializes each stripe map entry so that the binary search can be elided in
these cases.

After this PR, the total amount of time spent in compactions is 52.541331s, and the total amount of time spent in flush is 5.532608s, the former of which is a significant improvement from the results after #4495.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4497

Differential Revision: D10428610

Pulled By: abhimadan

fbshipit-source-id: 6f7e1ce3698fac3ef86d1197955e6b72e0931a0f
2018-10-17 11:47:34 -07:00
Zhongyi Xie d6ec288703 Add PerfContextByLevel to provide per level perf context information (#4226)
Summary:
Current implementation of perf context is level agnostic. Making it hard to do performance evaluation for the LSM tree. This PR adds `PerfContextByLevel` to decompose the counters by level.
This will be helpful when analyzing point and range query performance as well as tuning bloom filter
Also replaced __thread with thread_local keyword for perf_context
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4226

Differential Revision: D10369509

Pulled By: miasantreble

fbshipit-source-id: f1ced4e0de5fcebdb7f9cff36164516bc6382d82
2018-10-17 11:19:40 -07:00
anand1976 1e3845805d Properly determine a truncated CompactRange stop key (#4496)
Summary:
When a CompactRange() call for a level is truncated before the end key
is reached, because it exceeds max_compaction_bytes, we need to properly
set the compaction_end parameter to indicate the stop key. The next
CompactRange will use that as the begin key. We set it to the smallest
key of the next file in the level after expanding inputs to get a clean
cut.

Previously, we were setting it before expanding inputs. So we could end
up recompacting some files. In a pathological case, where a single key
has many entries spanning all the files in the level (possibly due to
merge operands without a partial merge operator, thus resulting in
compaction output identical to the input), this would result in
an endless loop over the same set of files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4496

Differential Revision: D10395026

Pulled By: anand1976

fbshipit-source-id: f0c2f89fee29b4b3be53b6467b53abba8e9146a9
2018-10-15 23:22:51 -07:00
Yanqin Jin e633983cf1 Add support to flush multiple CFs atomically (#4262)
Summary:
Leverage existing `FlushJob` to implement atomic flush of multiple column families.

This PR depends on other PRs and is a subset of #3752 . This PR itself is not sufficient in fulfilling atomic flush.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4262

Differential Revision: D9283109

Pulled By: riversand963

fbshipit-source-id: 65401f913e4160b0a61c0be6cd02adc15dad28ed
2018-10-15 20:01:17 -07:00
Andrew Kryczka 32b4d4ad47 Avoid per-key linear scan over snapshots in compaction (#4495)
Summary:
`CompactionIterator::snapshots_` is ordered by ascending seqnum, just like `DBImpl`'s linked list of snapshots from which it was copied. This PR exploits this ordering to make `findEarliestVisibleSnapshot` do binary search rather than linear scan. This can make flush/compaction significantly faster when many snapshots exist since that function is called on every single key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4495

Differential Revision: D10386470

Pulled By: ajkr

fbshipit-source-id: 29734991631227b6b7b677e156ac567690118a8b
2018-10-15 16:21:22 -07:00
Maysam Yabandeh 0f955f2aef Update WritePrepared blog post with latest results (#4494)
Summary:
WritePrepared is declared production ready (overdue update) and the benchmark results are also reported.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4494

Differential Revision: D10385336

Pulled By: maysamyabandeh

fbshipit-source-id: 662672ddfa286aa46af544f505b4d4b7a882d408
2018-10-15 14:01:31 -07:00
Yanqin Jin ce52274640 Replace 'string' with 'const string&' in FileOperationInfo (#4491)
Summary:
Using const string& can avoid one extra string copy. This PR addresses a recent comment made by siying  on #3933.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4491

Differential Revision: D10381211

Pulled By: riversand963

fbshipit-source-id: 27fc2d65d84bc7cd07833c77cdc47f06dcfaeb31
2018-10-15 13:46:01 -07:00
Yi Wu f60c4e5a58 Set -DROCKSDB_JEMALLOC for buck build if jemalloc presents (#4489)
Summary:
Set the macro if default allocator is jemalloc. It doesn't handle the case when allocator is specified, e.g.
```
cpp_binary(
    name="xxx"
    allocator="jemalloc", # or "malloc" or something else
    deps=["//rocksdb:rocksdb"],
)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4489

Differential Revision: D10363683

Pulled By: yiwu-arbug

fbshipit-source-id: 5da490336a8e78e0feb0900c29e8036e7ec6f12b
2018-10-15 11:41:47 -07:00
Yanqin Jin 729a617b5b Add listener to sample file io (#3933)
Summary:
We would like to collect file-system-level statistics including file name, offset, length, return code, latency, etc., which requires to add callbacks to intercept file IO function calls when RocksDB is running.
To collect file-system-level statistics, users can inherit the class `EventListener`, as in `TestFileOperationListener `. Note that `TestFileOperationListener::ShouldBeNotifiedOnFileIO()` returns true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3933

Differential Revision: D10219571

Pulled By: riversand963

fbshipit-source-id: 7acc577a2d31097766a27adb6f78eaf8b1e8ff15
2018-10-12 18:36:11 -07:00
John Calcote 9c20797136 Add UInt64AddOperator to rocksjava (#4448)
Summary:
Closes https://github.com/facebook/rocksdb/issues/4447
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4448

Differential Revision: D10351852

Pulled By: ajkr

fbshipit-source-id: 18287b5190ae0b8153ce425da9a0bdfe1af88c34
2018-10-12 17:35:47 -07:00
Yi Wu 6f8d4bdff1 Fix compile error with jemalloc (#4488)
Summary:
The "je_" prefix of jemalloc APIs presents only when the macro `JEMALLOC_NO_RENAME` from jemalloc.h presents.

With the patch I'm also adding -DROCKSDB_JEMALLOC flag in buck TARGETS.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4488

Differential Revision: D10355971

Pulled By: yiwu-arbug

fbshipit-source-id: 03a2d69790a44ac89219c7525763fa937a63d95a
2018-10-12 11:50:50 -07:00
Chinmay Kamat 6422356a27 Acquire lock on DB LOCK file before starting repair. (#4435)
Summary:
This commit adds code to acquire lock on the DB LOCK file
before starting the repair process. This will prevent
multiple processes from performing repair on the same DB
simultaneously. Fixes repair_test to work with this change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4435

Differential Revision: D10361499

Pulled By: riversand963

fbshipit-source-id: 3c512c48b7193d383b2279ccecabdb660ac1cf22
2018-10-12 10:41:54 -07:00
Wilfried Goesgens 5d809ecef7 Add compile time option to work with utf8 filename strings (#4469)
Summary:
The default behaviour of rocksdb is to use the `*A(` windows API functions.
These accept filenames in the currently configured system encoding,
be it Latin 1, utf8 or whatever.
If the Application intends to completely work with utf8 strings internally,
converting these to that codepage properly isn't even always possible.
Thus this patch adds a switch to use the `*W(` functions, which accept
UTF-16 filenames, and uses C++11 features to translate the
UTF8 containing std::string to an UTF16 containing std::wstring.

This feature is a compile time options, that can be enabled by setting `WITH_WINDOWS_UTF8_FILENAMES` to true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4469

Differential Revision: D10356011

Pulled By: yiwu-arbug

fbshipit-source-id: 27b6ae9171f209085894cdf80069e8a896642044
2018-10-11 23:24:28 -07:00
Abhishek Madan 7dd1641048 Use vector in UncollapsedRangeDelMap (#4487)
Summary:
Using `./range_del_aggregator_bench --use_collapsed=false
--num_range_tombstones=5000 --num_runs=1000`, here are the results before and
after this change:

Before:
```
=========================
Results:
=========================
AddTombstones:           1822.61 us
ShouldDelete (first):    94.5286 us
```

After:
```
=========================
Results:
=========================
AddTombstones:           199.26 us
ShouldDelete (first):    38.9344 us
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4487

Differential Revision: D10347288

Pulled By: abhimadan

fbshipit-source-id: d44efe3a166d583acfdc3ec1199e0892f34dbfb7
2018-10-11 15:29:14 -07:00
zpalmtree 46dd8b1e13 C++17 support (#4482)
Summary:
Closes https://github.com/facebook/rocksdb/issues/4462

I'm not sure if you'll be happy with `std::random_device{}`, perhaps you would want to use your rand instance instead. I didn't test to see if your rand instance supports the requirements that `std::shuffle` takes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4482

Differential Revision: D10325133

Pulled By: yiwu-arbug

fbshipit-source-id: 47b7adaf4bb2b8d64cf090ea6b1b48ef53180581
2018-10-11 10:50:04 -07:00
Young Tack Jin c648d90f8e benchmark.sh: to fix divide by zero runtime error (#4442)
Summary:
"Write (GB)" of $9 rather than "Rnp1 (GB)" of $8
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4442

Differential Revision: D10318193

Pulled By: yiwu-arbug

fbshipit-source-id: 03a7ef1938d9332e06fb3fd8490ca212f61fac6b
2018-10-10 21:03:19 -07:00
UncP 531786ebf7 DBWriteImpl: remove redundant code (#4450)
Summary:
in `WriteThread::LaunchParallelMemTableWriters`, there is `  write_group->running.store(write_group->size);
`
https://github.com/facebook/rocksdb/blob/master/db/write_thread.cc#L510
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4450

Differential Revision: D10201900

Pulled By: yiwu-arbug

fbshipit-source-id: 96c8fbbba5aff7ba8a6ceb3117a2bd7cc9b2f34b
2018-10-10 21:00:32 -07:00
Simon Grätzer ceded4535d WriteBatch::Iterate wrongly returns Status::Corruption (#4478)
Summary:
Wrong I overwrite `WriteBatch::Handler::Continue` to return _false_ at some point, I always get the `Status::Corruption` error.
I don't think this check is used correctly here: The counter in `found` cannot reflect all entries in the WriteBatch when we exit the loop early.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4478

Differential Revision: D10317416

Pulled By: yiwu-arbug

fbshipit-source-id: cccae3382805035f9b3239b66682b5fcbba6bb61
2018-10-10 20:57:27 -07:00
Andrew Kryczka 7e56072290 Fix merge operand reappearing when covered by DeleteRange (#4481)
Summary:
Even during `DBIter::Prev()`, there is a case where we need to use `RangeDelPositioningMode::kForwardTraversal`. In particular, when we hit too many internal keys for a single user key, we use seek to find the newest internal key. If it's a merge operand, we then scan forwards, collecting the merge operands. This forward scan should be using `RangeDelPositioningMode::kForwardTraversal`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4481

Differential Revision: D10319507

Pulled By: ajkr

fbshipit-source-id: b5ce7352461f3a7696b28a5136ae0076f2bde51f
2018-10-10 18:16:12 -07:00
Peter Pei 09814f2cfc support OnCompactionBegin (#4431)
Summary:
fix #4288

Add `OnCompactionBegin` support to `rocksdb::EventListener`.

Currently, we only have these three callbacks:

- OnFlushBegin
- OnFlushCompleted
- OnCompactionCompleted

As paolococchi requested in #4288 , and ajkr agreed, we should also support `OnCompactionBegin`.

This PR is a try to implement the support of `OnCompactionBegin`.

Hope it is useful to you.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4431

Differential Revision: D10055515

Pulled By: yiwu-arbug

fbshipit-source-id: 39c0f95f8e9ff1c7ca3a10787502a17f258d2334
2018-10-10 17:32:27 -07:00
Yi Wu f8c1de4c7c Update docs/Gemfile.lock to fix github warning (#4480)
Summary:
Fix security warning from github: https://nvd.nist.gov/vuln/detail/CVE-2018-17567
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4480

Reviewed By: gfosco

Differential Revision: D10316443

Pulled By: yiwu-arbug

fbshipit-source-id: 34555c6f5774d779734b664b9baa70bd4209175f
2018-10-10 15:38:16 -07:00
Andrew Kryczka faa70fc575 DeleteRange regression tests using public API (#4476)
Summary:
I wrote a couple tests using the public API to expose/prevent the bugs we talked. In particular,

- When files have overlapping endpoints and a range tombstone spans them, ensure the largest key does not reappear to readers. This was happening due to a bug that skipped writing range tombstones to an output file when their begin key exactly matched the file's largest key.
- When a tombstone spans multiple atomic compaction units, ensure newer keys do not disappear by being compacted beneath it. This happened due to a range tombstone appearing untruncated to readers when it spanned files with overlapping endpoints, even if it extended into files without overlapping endpoints (i.e., different atomic compaction units).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4476

Differential Revision: D10286001

Pulled By: ajkr

fbshipit-source-id: bb5ca51d0f90812fb37bfe1d01aec93f7eda55aa
2018-10-10 12:30:11 -07:00
Abhishek Madan 9c6fea7fe1 Update HISTORY.md, fix unity_test failure (#4479)
Summary:
Follow-up to https://github.com/facebook/rocksdb/pull/4432.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4479

Differential Revision: D10304151

Pulled By: abhimadan

fbshipit-source-id: 3608b95c324702ca26791f95cb26dae1d49efbe7
2018-10-10 12:09:56 -07:00
Zhichao Cao 7ca1a1f0d8 Fix trace_analyzer potential huge memory wasting due to no valid query analyzed (#4473)
Summary:
If the query types being analyzed do not appear in the trace, the current trace_analyzer will use 0 as the begin time, which create the time duration from 1970/01/01 to the now time. It will waste huge memory. Fixed by adding the trace_create_time to limit the duration.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4473

Differential Revision: D10246204

Pulled By: zhichao-cao

fbshipit-source-id: 42850b080b2e62f586fe73afd7737c2246d1a8c8
2018-10-10 10:00:00 -07:00
Anand Ananthabhotla 854a4be03f Handle mixed slowdown/no_slowdown writer properly (#4475)
Summary:
There is a bug when the write queue leader is blocked on a write
delay/stop, and the queue has writers with WriteOptions::no_slowdown set
to true. They are not woken up until the write stall is cleared.

The fix introduces a dummy writer inserted at the tail to indicate a
write stall and prevent further inserts into the queue, and a condition
variable that writers who can tolerate slowdown wait on before adding
themselves to the queue. The leader calls WriteThread::BeginWriteStall()
to add the dummy writer and then walk the queue to fail any writers with
no_slowdown set. Once the stall clears, the leader calls
WriteThread::EndWriteStall() to remove the dummy writer and signal the
condition variable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4475

Differential Revision: D10285827

Pulled By: anand1976

fbshipit-source-id: 747465e5e7f07a829b1fb0bc1afcd7b93f4ab1a9
2018-10-09 22:52:40 -07:00