Commit Graph

634 Commits

Author SHA1 Message Date
Alexander Zinoviev 32a6dd9a41 Add a new CPU time counter to compaction report (#4889)
Summary:
Measure CPU time consumed for a compaction and report it in the stats report
Enable NowCPUNanos() to work for MacOS
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889

Differential Revision: D13701276

Pulled By: zinoale

fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055
2019-01-29 17:24:00 -08:00
Siying Dong 5bf941966b CompactionPri = kMinOverlappingRatio also uses compensated file size (#4907)
Summary:
Right now, CompactionPri = kMinOverlappingRatio provides best write amplification, but it doesn't
prioritize files with more tombstones. We combine the two good features: make kMinOverlappingRatio
to boost files with lots of tombstones too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4907

Differential Revision: D13788774

Pulled By: siying

fbshipit-source-id: 1991cbb495fb76c8b529de69896e38d81ed9d9b3
2019-01-23 13:21:01 -08:00
Siying Dong 8641e9adf7 Non-initial file preloading should always prefetch index and filter (#4852)
Summary:
https://github.com/facebook/rocksdb/pull/3340 introduces preloading when max_open_files != -1.
It doesn't preload index and filter in non-initial file loading case. This is a little bit too
complicated to understand. We observed in one MyRocks use case where the filter is expected to be
preloaded but is not. To simplify the use case, we simply always prefetch the index and filter.
They anyway is expected to be loaded in the file verification phase anyway.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4852

Differential Revision: D13595402

Pulled By: siying

fbshipit-source-id: d4d8624eb3e849e20aeb990df2100502d85aff31
2019-01-08 12:47:34 -08:00
Andrew Kryczka 9e2c804fe6 Fix point lookup on range tombstone sentinel endpoint (#4829)
Summary:
Previously for point lookup we decided which file to look into based on user key overlap only. We also did not truncate range tombstones in the point lookup code path. These two ideas did not interact well in cases like this:

- L1 has range tombstone [a, c)#1 and point key b#2. The data is split between file1 with range [a#1,1, b#72057594037927935,15], and file2 with range [b#2, c#1].
- L1's file2 gets compacted to L2.
- User issues `Get()` for b#3.
- L1's file1 is opened and the range tombstone [a, c)#1 is found for b, while no point-key for b is found in L1.
- `Get()` assumes that the range tombstone must cover all data in that range in lower levels, so short circuits and returns `NotFound`.

The solution to this problem is to not look into files that only overlap with the point lookup at a range tombstone sentinel endpoint. In the above example, this would mean not opening L1's file1 or its tombstones during the `Get()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4829

Differential Revision: D13561355

Pulled By: ajkr

fbshipit-source-id: a13c21c816870a2f5d32a48af6dbd719a7d9d19f
2019-01-04 11:24:08 -08:00
Siying Dong f0dda35d7d Preload some files even if options.max_open_files (#3340)
Summary:
Choose to preload some files if options.max_open_files != -1. This can slightly narrow the gap of performance between options.max_open_files is -1 and a large number. To avoid a significant regression to DB reopen speed if options.max_open_files != -1. Limit the files to preload in DB open time to 16.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3340

Differential Revision: D6686945

Pulled By: siying

fbshipit-source-id: 8ec11bbdb46e3d0cdee7b6ad5897a09c5a07869f
2018-12-28 18:02:28 -08:00
Abhishek Madan 81b6b09f6b Remove v1 RangeDelAggregator (#4778)
Summary:
Now that v2 is fully functional, the v1 aggregator is removed.
The v2 aggregator has been renamed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4778

Differential Revision: D13495930

Pulled By: abhimadan

fbshipit-source-id: 9d69500a60a283e79b6c4fa938fc68a8aa4d40d6
2018-12-17 17:33:46 -08:00
Abhishek Madan abf931afa6 Add compaction logic to RangeDelAggregatorV2 (#4758)
Summary:
RangeDelAggregatorV2 now supports ShouldDelete calls on
snapshot stripes and creation of range tombstone compaction iterators.
RangeDelAggregator is no longer used on any non-test code path, and will
be removed in a future commit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4758

Differential Revision: D13439254

Pulled By: abhimadan

fbshipit-source-id: fe105bcf8e3d4a2df37a622d5510843cd71b0401
2018-12-17 13:20:51 -08:00
Yanqin Jin 4fce44fc8b Improve flushing multiple column families (#4708)
Summary:
If one column family is dropped, we should simply skip it and continue to flush
other active ones.
Currently we use Status::ShutdownInProgress to notify caller of column families
being dropped. In the future, we should consider using a different Status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4708

Differential Revision: D13378954

Pulled By: riversand963

fbshipit-source-id: 42f248cdf2d32d4c0f677cd39012694b8f1328ca
2018-12-13 15:12:40 -08:00
Yi Wu 05d9d82181 Revert "Move MemoryAllocator option from Cache to BlockBasedTableOpti… (#4697)
Summary:
…ons (#4676)"

This reverts commit b32d087dbb.

`MemoryAllocator` needs to be with `Cache`, since cache entry can
outlive DB and block based table. The cache needs to hold reference to
memory allocator when deleting cache entry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4697

Differential Revision: D13133490

Pulled By: yiwu-arbug

fbshipit-source-id: 8ef7e8a51263bfd929f892fd062665ff4ce9ce5a
2018-11-21 11:29:57 -08:00
Abhishek Madan 457f77b9ff Introduce RangeDelAggregatorV2 (#4649)
Summary:
The old RangeDelAggregator did expensive pre-processing work
to create a collapsed, binary-searchable representation of range
tombstones. With FragmentedRangeTombstoneIterator, much of this work is
now unnecessary. RangeDelAggregatorV2 takes advantage of this by seeking
in each iterator to find a covering tombstone in ShouldDelete, while
doing minimal work in AddTombstones. The old RangeDelAggregator is still
used during flush/compaction for now, though RangeDelAggregatorV2 will
support those uses in a future PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4649

Differential Revision: D13146964

Pulled By: abhimadan

fbshipit-source-id: be29a4c020fc440500c137216fcc1cf529571eb3
2018-11-21 10:56:45 -08:00
Abhishek Madan ed5aec5ba3 Fix range tombstone covering short-circuit logic (#4698)
Summary:
Since a range tombstone seen at one level will cover all keys
in the range at lower levels, there was a short-circuiting check in Get
that reported a key was not found at most one file after the range
tombstone was discovered. However, this was incorrect for merge
operands, since a deletion might only cover some merge operands,
which implies that the key should be found. This PR fixes this logic in
the Version portion of Get, and removes the logic from the MemTable
portion of Get, since the perforamnce benefit provided there is minimal.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4698

Differential Revision: D13142484

Pulled By: abhimadan

fbshipit-source-id: cbd74537c806032f2bfa564724d01a80df7c8f10
2018-11-20 13:29:22 -08:00
Andrew Kryczka 9d6d4867ab Fix uninitialized fields in file metadata (#4693)
Summary:
This is a quick fix for the uninitialized bugs in `LiveFileMetaData` and `SstFileMetaData` that were uncovered in #4686.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4693

Differential Revision: D13113189

Pulled By: ajkr

fbshipit-source-id: 18e798d031d2a59d0b55fc010c135e0126f4042d
2018-11-16 20:49:17 -08:00
Yi Wu b32d087dbb Move MemoryAllocator option from Cache to BlockBasedTableOptions (#4676)
Summary:
Per offline discussion with siying, `MemoryAllocator` and `Cache` should be decouple. The idea is that memory allocator handles memory allocation, while cache handle cache policy.

It is normal that external cache libraries pack couple the two components for better optimization. If we want to integrate with such library in the future, we can make a wrapper of the library implementing both `Cache` and `MemoryAllocator` interface.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4676

Differential Revision: D13047662

Pulled By: yiwu-arbug

fbshipit-source-id: cd42e246d80ab600b4de47d073f7d2db308ce6dd
2018-11-13 13:48:38 -08:00
QingpingWang 4f0fcb78ae Expose num entries and deletions of sst files (#4623)
Summary:
he ratio of num_deletions to num_entries of a level can be useful to determine if a manual compaction needs to be triggered on a level.
Also refer #3980
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4623

Differential Revision: D13045744

Pulled By: sagar0

fbshipit-source-id: 71f3c8e363a8ffd194ec3bb0ed0b69612231f0b3
2018-11-13 11:52:19 -08:00
Zhongyi Xie b313019326 use per-level perfcontext for DB::Get calls (#4617)
Summary:
this PR adds two more per-level perf context counters to track
* number of keys returned in Get call, break down by levels
* total processing time at each level during Get call
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4617

Differential Revision: D12898024

Pulled By: miasantreble

fbshipit-source-id: 6b84ef1c8097c0d9e97bee1a774958f56ab4a6c4
2018-11-13 10:40:49 -08:00
Sagar Vemuri dc3528077a Update all unique/shared_ptr instances to be qualified with namespace std (#4638)
Summary:
Ran the following commands to recursively change all the files under RocksDB:
```
find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
```
Running `make format` updated some formatting on the files touched.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638

Differential Revision: D12934992

Pulled By: sagar0

fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
2018-11-09 11:19:58 -08:00
Yanqin Jin d1118f6f19 Add test to check if DB can handle atomic group (#4433)
Summary:
Add unit tests to demonstrate that `VersionSet::Recover` is able to detect and handle cases in which the MANIFEST has valid atomic group, incomplete trailing atomic group, atomic group mixed with normal version edits and atomic group with incorrect size.
With this capability, RocksDB identifies non-valid groups of version edits and do not apply them, thus guaranteeing that the db is restored to a state consistent with the most recent successful atomic flush before applying WAL.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4433

Differential Revision: D10079202

Pulled By: riversand963

fbshipit-source-id: a0e0b8bf4da1cf68e044d397588c121b66c68876
2018-10-30 16:37:47 -07:00
Abhishek Madan eaaf1a6f05 Promote rocksdb.{deleted.keys,merge.operands} to main table properties (#4594)
Summary:
Since the number of range deletions are reported in
TableProperties, it is confusing to not report the number of merge
operands and point deletions as top-level properties; they are
accessible through the public API, but since they are not the "main"
properties, they do not appear in aggregated table properties, or the
string representation of table properties.

This change promotes those two property keys to
`rocksdb/table_properties.h`, adds corresponding uint64 members for
them, deprecates the old access methods `GetDeletedKeys()` and
`GetMergeOperands()` (though they are still usable for now), and removes
`InternalKeyPropertiesCollector`. The property key strings are the same
as before this change, so this should be able to read DBs written from older
versions (though I haven't tested this yet).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4594

Differential Revision: D12826893

Pulled By: abhimadan

fbshipit-source-id: 9e4e4fbdc5b0da161c89582566d184101ba8eb68
2018-10-30 15:34:27 -07:00
Abhishek Madan 7528130e38 Cache fragmented range tombstones in BlockBasedTableReader (#4493)
Summary:
This allows tombstone fragmenting to only be performed when the table is opened, and cached for subsequent accesses.

On the same DB used in #4449, running `readrandom` results in the following:
```
readrandom   :       0.983 micros/op 1017076 ops/sec;   78.3 MB/s (63103 of 100000 found)
```

Now that Get performance in the presence of range tombstones is reasonable, I also compared the performance between a DB with range tombstones, "expanded" range tombstones (several point tombstones that cover the same keys the equivalent range tombstone would cover, a common workaround for DeleteRange), and no range tombstones. The created DBs had 5 million keys each, and DeleteRange was called at regular intervals (depending on the total number of range tombstones being written) after 4.5 million Puts. The table below summarizes the results of a `readwhilewriting` benchmark (in order to provide somewhat more realistic results):
```
   Tombstones?    | avg micros/op | stddev micros/op |  avg ops/s   | stddev ops/s
----------------- | ------------- | ---------------- | ------------ | ------------
None              |        0.6186 |          0.04637 | 1,625,252.90 | 124,679.41
500 Expanded      |        0.6019 |          0.03628 | 1,666,670.40 | 101,142.65
500 Unexpanded    |        0.6435 |          0.03994 | 1,559,979.40 | 104,090.52
1k Expanded       |        0.6034 |          0.04349 | 1,665,128.10 | 125,144.57
1k Unexpanded     |        0.6261 |          0.03093 | 1,600,457.50 |  79,024.94
5k Expanded       |        0.6163 |          0.05926 | 1,636,668.80 | 154,888.85
5k Unexpanded     |        0.6402 |          0.04002 | 1,567,804.70 | 100,965.55
10k Expanded      |        0.6036 |          0.05105 | 1,667,237.70 | 142,830.36
10k Unexpanded    |        0.6128 |          0.02598 | 1,634,633.40 |  72,161.82
25k Expanded      |        0.6198 |          0.04542 | 1,620,980.50 | 116,662.93
25k Unexpanded    |        0.5478 |          0.0362  | 1,833,059.10 | 121,233.81
50k Expanded      |        0.5104 |          0.04347 | 1,973,107.90 | 184,073.49
50k Unexpanded    |        0.4528 |          0.03387 | 2,219,034.50 | 170,984.32
```

After a large enough quantity of range tombstones are written, range tombstone Gets can become faster than reading from an equivalent DB with several point tombstones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4493

Differential Revision: D10842844

Pulled By: abhimadan

fbshipit-source-id: a7d44534f8120e6aabb65779d26c6b9df954c509
2018-10-25 19:26:44 -07:00
Abhishek Madan 8c78348c77 Use only "local" range tombstones during Get (#4449)
Summary:
Previously, range tombstones were accumulated from every level, which
was necessary if a range tombstone in a higher level covered a key in a lower
level. However, RangeDelAggregator::AddTombstones's complexity is based on
the number of tombstones that are currently stored in it, which is wasteful in
the Get case, where we only need to know the highest sequence number of range
tombstones that cover the key from higher levels, and compute the highest covering
sequence number at the current level. This change introduces this optimization, and
removes the use of RangeDelAggregator from the Get path.

In the benchmark results, the following command was used to initialize the database:
```
./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8
```

...and the following command was used to measure read throughput:
```
./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32
```

The filluniquerandom command was only run once, and the resulting database was used
to measure read performance before and after the PR. Both binaries were compiled with
`DEBUG_LEVEL=0`.

Readrandom results before PR:
```
readrandom   :       4.544 micros/op 220090 ops/sec;   16.9 MB/s (63103 of 100000 found)
```

Readrandom results after PR:
```
readrandom   :      11.147 micros/op 89707 ops/sec;    6.9 MB/s (63103 of 100000 found)
```

So it's actually slower right now, but this PR paves the way for future optimizations (see #4493).

----
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4449

Differential Revision: D10370575

Pulled By: abhimadan

fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d
2018-10-24 12:31:12 -07:00
Maysam Yabandeh c34cc40424 Fix user comparator receiving internal key (#4575)
Summary:
There was a bug that the user comparator would receive the internal key instead of the user key. The bug was due to RangeMightExistAfterSortedRun expecting user key but receiving internal key when called in GenerateBottommostFiles. The patch augment an existing unit test to reproduce the bug and fixes it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4575

Differential Revision: D10500434

Pulled By: maysamyabandeh

fbshipit-source-id: 858346d2fd102cce9e20516d77338c112bdfe366
2018-10-23 08:14:46 -07:00
Siying Dong 7024263682 Dynamic level to adjust level multiplier when write is too heavy (#4338)
Summary:
Level compaction usually performs poorly when the writes so heavy that the level targets can't be guaranteed. With this improvement, we improve level_compaction_dynamic_level_bytes = true so that in the write heavy cases, the level multiplier can be slightly adjusted based on the size of L0.

We keep the behavior the same if number of L0 files is under 2X compaction trigger and the total size is less than options.max_bytes_for_level_base, so that unless write is so heavy that compaction cannot keep up, the behavior doesn't change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4338

Differential Revision: D9636782

Pulled By: siying

fbshipit-source-id: e27fc17a7c29c84b00064cc17536a01dacef7595
2018-10-22 10:21:47 -07:00
Yanqin Jin da4aa59b4c Add read retry support to log reader (#4394)
Summary:
Current `log::Reader` does not perform retry after encountering `EOF`. In the future, we need the log reader to be able to retry tailing the log even after `EOF`.

Current implementation is simple. It does not provide more advanced retry policies. Will address this in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4394

Differential Revision: D9926508

Pulled By: riversand963

fbshipit-source-id: d86d145792a41bd64a72f642a2a08c7b7b5201e1
2018-10-19 11:53:00 -07:00
anand1976 1e3845805d Properly determine a truncated CompactRange stop key (#4496)
Summary:
When a CompactRange() call for a level is truncated before the end key
is reached, because it exceeds max_compaction_bytes, we need to properly
set the compaction_end parameter to indicate the stop key. The next
CompactRange will use that as the begin key. We set it to the smallest
key of the next file in the level after expanding inputs to get a clean
cut.

Previously, we were setting it before expanding inputs. So we could end
up recompacting some files. In a pathological case, where a single key
has many entries spanning all the files in the level (possibly due to
merge operands without a partial merge operator, thus resulting in
compaction output identical to the input), this would result in
an endless loop over the same set of files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4496

Differential Revision: D10395026

Pulled By: anand1976

fbshipit-source-id: f0c2f89fee29b4b3be53b6467b53abba8e9146a9
2018-10-15 23:22:51 -07:00
Yanqin Jin e633983cf1 Add support to flush multiple CFs atomically (#4262)
Summary:
Leverage existing `FlushJob` to implement atomic flush of multiple column families.

This PR depends on other PRs and is a subset of #3752 . This PR itself is not sufficient in fulfilling atomic flush.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4262

Differential Revision: D9283109

Pulled By: riversand963

fbshipit-source-id: 65401f913e4160b0a61c0be6cd02adc15dad28ed
2018-10-15 20:01:17 -07:00
Yanqin Jin 729a617b5b Add listener to sample file io (#3933)
Summary:
We would like to collect file-system-level statistics including file name, offset, length, return code, latency, etc., which requires to add callbacks to intercept file IO function calls when RocksDB is running.
To collect file-system-level statistics, users can inherit the class `EventListener`, as in `TestFileOperationListener `. Note that `TestFileOperationListener::ShouldBeNotifiedOnFileIO()` returns true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3933

Differential Revision: D10219571

Pulled By: riversand963

fbshipit-source-id: 7acc577a2d31097766a27adb6f78eaf8b1e8ff15
2018-10-12 18:36:11 -07:00
Abhishek Madan 3a4bd36fed Truncate range tombstones by leveraging InternalKeys (#4432)
Summary:
To more accurately truncate range tombstones at SST boundaries,
we now represent them in RangeDelAggregator using InternalKeys, which
are end-key-exclusive as they were before this change.

During compaction, "atomic compaction unit boundaries" (the range of
keys contained in neighbouring and overlaping SSTs) are propagated down
to RangeDelAggregator to truncate range tombstones at those boundariies
instead. See https://github.com/facebook/rocksdb/pull/4432#discussion_r221072219 and https://github.com/facebook/rocksdb/pull/4432#discussion_r221138683
for motivating examples.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4432

Differential Revision: D10263952

Pulled By: abhimadan

fbshipit-source-id: 2fe85ff8a02b3a6a2de2edfe708012797a7bd579
2018-10-09 15:19:38 -07:00
Zhongyi Xie ce1fc5af09 fix unused param `allocator` in compression.h (#4453)
Summary:
this should fix currently failing contrun test: rocksdb-contrun-no_compression, rocksdb-contrun-tsan, rocksdb-contrun-tsan_crash
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4453

Differential Revision: D10202626

Pulled By: miasantreble

fbshipit-source-id: 850b07f14f671b5998c22d8239e2a55b2fc1e355
2018-10-04 13:24:22 -07:00
JiYou a1f6142f38 VersionSet: GetOverlappingInputs() fix overflow and optimize. (#4385)
Summary:
This fix is for `level == 0` in `GetOverlappingInputs()`:
- In `GetOverlappingInputs()`, if `level == 0`, it has potential
risk of overflow if `i == 0`.
- Optmize process when `expand = true`, the expected complexity
can be reduced to O(n).

Signed-off-by: JiYou <jiyou09@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4385

Differential Revision: D10181001

Pulled By: riversand963

fbshipit-source-id: 46eef8a1d1605c9329c164e6471cd5c5b6de16b5
2018-10-03 18:40:59 -07:00
JiYou 75ca13875c FindFile: use std::lower_bound reduce the repeated code. (#4372)
Summary:
`FindFile()` and  `FindFileInRange()` actually works as the same
of `std::lower_bound()`. Use `std::lower_bound()` to reduce the
repeated code.

- change `FindFile()` and `FindFileInRange()` to use `std::lower_bound()`

Signed-off-by: JiYou <jiyou09@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4372

Differential Revision: D9919677

Pulled By: ajkr

fbshipit-source-id: f74aaa30e2f80e410e299c5a5bca4eaf2a7a26de
2018-09-27 10:35:00 -07:00
Andrew Kryczka c94523ee56 Delete code for WAL reader to start at nonzero offset (#4362)
Summary:
The code is dead in RocksDB as `log::Reader::initial_offset_` is always zero. We should delete it so we don't have to maintain it like in #4359.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4362

Differential Revision: D9817829

Pulled By: ajkr

fbshipit-source-id: 474a2c679e5bd273b40608f3a5332931d9eefe6d
2018-09-13 17:13:03 -07:00
cngzhnp 64324e329e Support pragma once in all header files and cleanup some warnings (#4339)
Summary:
As you know, almost all compilers support "pragma once" keyword instead of using include guards. To be keep consistency between header files, all header files are edited.

Besides this, try to fix some warnings about loss of data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4339

Differential Revision: D9654990

Pulled By: ajkr

fbshipit-source-id: c2cf3d2d03a599847684bed81378c401920ca848
2018-09-05 18:13:31 -07:00
Yanqin Jin bb5dcea98e Add path to WritableFileWriter. (#4039)
Summary:
We want to sample the file I/O issued by RocksDB and report the function calls. This requires us to include the file paths otherwise it's hard to tell what has been going on.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4039

Differential Revision: D8670178

Pulled By: riversand963

fbshipit-source-id: 97ee806d1c583a2983e28e213ee764dc6ac28f7a
2018-08-23 10:12:58 -07:00
Yanqin Jin d116a1725d Update recovery code for version edits group commit. (#3945)
Summary:
During recovery, RocksDB is able to handle version edits that belong to group commits.
This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3945

Differential Revision: D8529122

Pulled By: riversand963

fbshipit-source-id: 57cb0f9cc55ecca684a837742d6626dc9c07f37e
2018-08-20 14:58:00 -07:00
Yanqin Jin 54de56844d Remove random writes from SST file ingestion (#4172)
Summary:
RocksDB used to store global_seqno in external SST files written by
SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the
`global_seqno`. Since random write is not supported in some non-POSIX compliant
file systems, external SST file ingestion is not supported on these file
systems. To address this limitation, we no longer update `global_seqno` during
file ingestion. Later RocksDB uses the MANIFEST and other information in table
properties to deduce global seqno for externally-ingested SST files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4172

Differential Revision: D8961465

Pulled By: riversand963

fbshipit-source-id: 4382ec85270a96be5bc0cf33758ca2b167b05071
2018-07-27 16:12:23 -07:00
Zhongyi Xie f95a5b2464 Avoid unnecessary big for-loop when reporting ticker stats stored in GetContext (#3490)
Summary:
Currently in `Version::Get` when reporting ticker stats stored in `GetContext`, there is a big for-loop through all `Ticker` which adds unnecessary cost to overall CPU usage. We can optimize by storing only ticker values that are used in `Get()` calls in a new struct `GetContextStats` since only a small fraction of all tickers are used in `Get()` calls. For comparison, with the new approach we only need to visit 17 values while old approach will require visiting 100+ `Ticker`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3490

Differential Revision: D6969154

Pulled By: miasantreble

fbshipit-source-id: fc27072965a3a94125a3e6883d20dafcf5b84029
2018-07-20 16:58:13 -07:00
Yanqin Jin 2736752b33 Fix a bug in MANIFEST group commit (#4157)
Summary:
PR #3944 introduces group commit of `VersionEdit` in MANIFEST. The
implementation has a bug. When updating the log file number of each column
family, we must consider only `VersionEdit`s that operate on the same column
family. Otherwise, a column family may accidentally set its log file number
higher than actual value, indicating that log files with smaller file number
will be ignored, thus causing some updates to be lost.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4157

Differential Revision: D8916650

Pulled By: riversand963

fbshipit-source-id: 8f456cf688f17bf35ad87b38e30e899aa162f201
2018-07-19 17:27:56 -07:00
Sagar Vemuri 991120fa10 Allow ttl to be changed dynamically (#4133)
Summary:
Allow ttl to be changed dynamically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4133

Differential Revision: D8845440

Pulled By: sagar0

fbshipit-source-id: c8c87ae643b3a8c4123e4c037c4645efc094a2d3
2018-07-16 14:27:53 -07:00
Peter Mattis 90fc40690a Relax VersionStorageInfo::GetOverlappingInputs check (#4050)
Summary:
Do not consider the range tombstone sentinel key as causing 2 adjacent
sstables in a level to overlap. When a range tombstone's end key is the
largest key in an sstable, the sstable's end key is so to a "sentinel"
value that is the smallest key in the next sstable with a sequence
number of kMaxSequenceNumber. This "sentinel" is guaranteed to not
overlap in internal-key space with the next sstable. Unfortunately,
GetOverlappingFiles uses user-keys to determine overlap and was thus
considering 2 adjacent sstables in a level to overlap if they were
separated by this sentinel key. This in turn would cause compactions to
be larger than necessary.

Note that this conflicts with
https://github.com/facebook/rocksdb/pull/2769 and cases
`DBRangeDelTest.CompactionTreatsSplitInputLevelDeletionAtomically` to
fail.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4050

Differential Revision: D8844423

Pulled By: ajkr

fbshipit-source-id: df3f9f1db8f4cff2bff77376b98b83c2ae1d155b
2018-07-13 17:42:38 -07:00
Zhongyi Xie b3efb1cbe0 fix clang analyzer warnings (#4072)
Summary:
clang analyze is giving the following warnings:
> db/compaction_job.cc:1178:16: warning: Called C++ object pointer is null
    } else if (meta->smallest.size() > 0) {
               ^~~~~~~~~~~~~~~~~~~~~
db/compaction_job.cc:1201:33: warning: Access to field 'marked_for_compaction' results in a dereference of a null pointer (loaded from variable 'meta')
    meta->marked_for_compaction = sub_compact->builder->NeedCompact();
    ~~~~
db/version_set.cc:2770:26: warning: Called C++ object pointer is null
        uint32_t cf_id = last_writer->cfd->GetID();
                         ^~~~~~~~~~~~~~~~~~~~~~~~~
Closes https://github.com/facebook/rocksdb/pull/4072

Differential Revision: D8685852

Pulled By: miasantreble

fbshipit-source-id: b0e2fd9dfc1cbba2317723e09886384b9b1c9085
2018-06-28 19:12:35 -07:00
Yanqin Jin 26d67e357e Support group commits of version edits (#3944)
Summary:
This PR supports the group commit of multiple version edit entries corresponding to different column families. Column family drop/creation still cannot be grouped. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752).
Closes https://github.com/facebook/rocksdb/pull/3944

Differential Revision: D8432536

Pulled By: riversand963

fbshipit-source-id: 8f11bd05193b6c0d9272d82e44b676abfac113cb
2018-06-28 12:34:39 -07:00
Maysam Yabandeh 235ab9dd32 Pin mmap files in ReadOnlyDB (#4053)
Summary:
https://github.com/facebook/rocksdb/pull/3881 fixed a bug where PinnableSlice pin mmap files which could be deleted with background compaction. This is however a non-issue for ReadOnlyDB when there is no compaction running and max_open_files is -1. This patch reenables the pinning feature for that case.
Closes https://github.com/facebook/rocksdb/pull/4053

Differential Revision: D8662546

Pulled By: maysamyabandeh

fbshipit-source-id: 402962602eb0f644e17822748332999c3af029fd
2018-06-27 17:13:34 -07:00
Yanqin Jin 524c6e6b72 Add file name info to SequentialFileReader. (#4026)
Summary:
We potentially need this information for tracing, profiling and diagnosis.
Closes https://github.com/facebook/rocksdb/pull/4026

Differential Revision: D8555214

Pulled By: riversand963

fbshipit-source-id: 4263e06c00b6d5410b46aa46eb4e358ff2161dd2
2018-06-21 08:42:24 -07:00
Zhongyi Xie 80bc35927c Should only decode restart points for uncompressed blocks (#3996)
Summary:
The Block object assumes contents are uncompressed. Block's constructor tries to read the number of restarts, but does not get an accurate number when its contents are compressed, which is causing issues like https://github.com/facebook/rocksdb/issues/3843.
This PR address this issue by skipping reconstruction of restart points when blocks are known to be compressed. Somehow the restart points can be read directly when Snappy is used and some tests (for example https://github.com/facebook/rocksdb/blob/master/db/db_block_cache_test.cc#L196) expects blocks to be fully constructed even when Snappy compression is used, so here we keep the restart point logic for Snappy.
Closes https://github.com/facebook/rocksdb/pull/3996

Differential Revision: D8416186

Pulled By: miasantreble

fbshipit-source-id: 002c0b62b9e5d89fb7736563d354ce0023c8cb28
2018-06-15 19:26:58 -07:00
Zhongyi Xie c3ebc75843 Move prefix_extractor to MutableCFOptions
Summary:
Currently it is not possible to change bloom filter config without restart the db, which is causing a lot of operational complexity for users.
This PR aims to make it possible to dynamically change bloom filter config.
Closes https://github.com/facebook/rocksdb/pull/3601

Differential Revision: D7253114

Pulled By: miasantreble

fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c
2018-05-21 14:43:11 -07:00
Mike Kolupaev 8bf555f487 Change and clarify the relationship between Valid(), status() and Seek*() for all iterators. Also fix some bugs
Summary:
Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are:
 * If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted.
 * When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not.

However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours).

This PR changes the convention to:
 * If status() is not ok, Valid() always returns false.
 * Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.)

This does sacrifice the two use cases listed above, but siying said it's ok.

Overview of the changes:
 * A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario.
 * Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed.
 * A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator.

To find the iterator types that needed changes I searched for "public .*Iterator" in the code. Here's an overview of all 27 iterator types:

Iterators that didn't need changes:
 * status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator.
 * Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator.

Iterators with changes (see inline comments for details):
 * DBIter - an overhaul:
    - It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back.
    - It had a few code paths silently discarding subiterator's status. The stress test caught a few.
    - The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example.
    - Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption.
    - It used to not reset status on seek for some types of errors.
    - Some simplifications and better comments.
    - Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer.
 * MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status.
 * ForwardIterator - changed to the new convention, also slightly simplified.
 * ForwardLevelIterator - fixed some bugs and simplified.
 * LevelIterator - simplified.
 * TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_.
 * BlockBasedTableIterator - minor changes.
 * BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid.
 * PlainTableIterator - some seeks used to not reset status.
 * CuckooTableIterator - tiny code cleanup.
 * ManagedIterator - fixed some bugs.
 * BaseDeltaIterator - changed to the new convention and fixed a bug.
 * BlobDBIterator - seeks used to not reset status.
 * KeyConvertingIterator - some small change.
Closes https://github.com/facebook/rocksdb/pull/3810

Differential Revision: D7888019

Pulled By: al13n321

fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d
2018-05-17 02:56:56 -07:00
Maysam Yabandeh 171f415b30 Rename vars to satisfy unity built
Summary:
Tested by "make unity_test"
Closes https://github.com/facebook/rocksdb/pull/3807

Differential Revision: D7882657

Pulled By: maysamyabandeh

fbshipit-source-id: 84862c18d7f2fc762bd96ad070eaeb6936e45159
2018-05-04 15:28:06 -07:00
Siying Dong d59549298f Skip deleted WALs during recovery
Summary:
This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic.

Before the commit, for 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), record this information to the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we loss with this new approach is earlier log deletion between two flushes, which does not guarantee to happen anyway because the obsolete file clean-up function is only executed after flush or compaction)

This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2.
Closes https://github.com/facebook/rocksdb/pull/3765

Differential Revision: D7747618

Pulled By: siying

fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
2018-05-03 15:43:09 -07:00
Siying Dong d5afa73789 Revert "Skip deleted WALs during recovery"
Summary:
This reverts commit 73f21a7b21.

It breaks compatibility. When created a DB using a build with this new change, opening the DB and reading the data will fail with this error:

"Corruption: Can't access /000000.sst: IO error: while stat a file for size: /tmp/xxxx/000000.sst: No such file or directory"

This is because the dummy AddFile4 entry generated by the new code will be treated as a real entry by an older build. The older build will think there is a real file with number 0, but there isn't such a file.
Closes https://github.com/facebook/rocksdb/pull/3762

Differential Revision: D7730035

Pulled By: siying

fbshipit-source-id: f2051859eff20ef1837575ecb1e1bb96b3751e77
2018-04-23 12:01:26 -07:00
David Lai 3be9b36453 comment unused parameters to turn on -Wunused-parameter flag
Summary:
This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557.
Closes https://github.com/facebook/rocksdb/pull/3662

Differential Revision: D7426121

Pulled By: Dayvedde

fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352
2018-04-12 17:59:16 -07:00
Zhongyi Xie f3a1d9e049 fix data race
Summary:
Fix a TSAN failure in `DBRangeDelTest.ValidLevelSubcompactionBoundaries`:
https://gist.github.com/miasantreble/712e04b4de2ff7f193c98b1acf07e899
Closes https://github.com/facebook/rocksdb/pull/3691

Differential Revision: D7541400

Pulled By: miasantreble

fbshipit-source-id: b0b4538980bce7febd0385e61d6e046580bcaefb
2018-04-09 12:28:28 -07:00
Phani Shekhar Mantripragada 446b32cfc3 Support for Column family specific paths.
Summary:
In this change, an option to set different paths for different column families is added.
This option is set via cf_paths setting of ColumnFamilyOptions. This option will work in a similar fashion to db_paths setting. Cf_paths is a vector of Dbpath values which contains a pair of the absolute path and target size. Multiple levels in a Column family can go to different paths if cf_paths has more than one path.
To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path.

Changes :
1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions.  This member is used to identify the path information whenever files are accessed.
2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting.
3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths.
4) Unit tests are added appropriately.
Closes https://github.com/facebook/rocksdb/pull/3102

Differential Revision: D6951697

Pulled By: ajkr

fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d
2018-04-05 19:58:20 -07:00
Sagar Vemuri 04c11b867d Level Compaction with TTL
Summary:
Level Compaction with TTL.

As of today, a file could exist in the LSM tree without going through the compaction process for a really long time if there are no updates to the data in the file's key range. For example, in certain use cases, the keys are not actually "deleted"; instead they are just set to empty values. There might not be any more writes to this "deleted" key range, and if so, such data could remain in the LSM for a really long time resulting in wasted space.

Introducing a TTL could solve this problem. Files (and, in turn, data) older than TTL will be scheduled for compaction when there is no other background work. This will make the data go through the regular compaction process and get rid of old unwanted data.
This also has the (good) side-effect of all the data in the non-bottommost level being newer than ttl, and all data in the bottommost level older than ttl. It could lead to more writes while reducing space.

This functionality can be controlled by the newly introduced column family option -- ttl.

TODO for later:
- Make ttl mutable
- Extend TTL to Universal compaction as well? (TTL is already supported in FIFO)
- Maybe deprecate CompactionOptionsFIFO.ttl in favor of this new ttl option.
Closes https://github.com/facebook/rocksdb/pull/3591

Differential Revision: D7275442

Pulled By: sagar0

fbshipit-source-id: dcba484717341200d419b0953dafcdf9eb2f0267
2018-04-02 22:14:28 -07:00
Maysam Yabandeh 73f21a7b21 Skip deleted WALs during recovery
Summary:
This patch record the deleted WAL numbers in the manifest to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic.
Closes https://github.com/facebook/rocksdb/pull/3488

Differential Revision: D6967893

Pulled By: maysamyabandeh

fbshipit-source-id: 13119feb155a08ab6d4909f437c7a750480dc8a1
2018-03-30 11:28:05 -07:00
Huachao Huang ecfca1ff59 Optimize overlap checking for external file ingestion
Summary:
If there are a lot of overlapped files in L0, creating a merging iterator for
all files in L0 to check overlap can be very slow because we need to read and
seek all files in L0. However, in that case, the ingested file is likely to
overlap with some files in L0, so if we check those files one by one, we can stop
once we encounter overlap.

Ref: https://github.com/facebook/rocksdb/issues/3540
Closes https://github.com/facebook/rocksdb/pull/3564

Differential Revision: D7196784

Pulled By: anand1976

fbshipit-source-id: 8700c1e903bd515d0fa7005b6ce9b3a3d9db2d67
2018-03-16 10:43:17 -07:00
Bruce Mitchener 0de710f5b8 Use nullptr instead of NULL / 0 more consistently.
Summary: Closes https://github.com/facebook/rocksdb/pull/3569

Differential Revision: D7170968

Pulled By: yiwu-arbug

fbshipit-source-id: 308a6b7dd358a04fd9a7de3d927bfd8abd57d348
2018-03-07 12:42:12 -08:00
Huachao Huang 9cb4856dbd Don't need to UpdateFilesByCompactionPri for kCompactionStyleNone
Summary: Closes https://github.com/facebook/rocksdb/pull/3563

Differential Revision: D7154653

Pulled By: ajkr

fbshipit-source-id: 4f32fb1b02451a934504c40be22b07fb1f2deb9c
2018-03-05 17:57:39 -08:00
Andrew Kryczka 5d68243e61 Comment out unused variables
Summary:
Submitting on behalf of another employee.
Closes https://github.com/facebook/rocksdb/pull/3557

Differential Revision: D7146025

Pulled By: ajkr

fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e
2018-03-05 13:13:41 -08:00
Yi Wu bf937cf15b Add "rocksdb.live-sst-files-size" DB property
Summary:
Add "rocksdb.live-sst-files-size" DB property which only include files of latest version. Existing "rocksdb.total-sst-files-size" include files from all versions and thus include files that's obsolete but not yet deleted. I'm going to use this new property to cap blob db sst + blob files size.
Closes https://github.com/facebook/rocksdb/pull/3548

Differential Revision: D7116939

Pulled By: yiwu-arbug

fbshipit-source-id: c6a52e45ce0f24ef78708156e1a923c1dd6bc79a
2018-03-01 18:01:10 -08:00
Igor Sugak aba3409740 Back out "[codemod] - comment out unused parameters"
Reviewed By: igorsugak

fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37
2018-02-22 12:43:17 -08:00
David Lai f4a030ce81 - comment out unused parameters
Reviewed By: everiq, igorsugak

Differential Revision: D7046710

fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461
2018-02-22 09:44:23 -08:00
Siying Dong b555ed30a4 Customized BlockBasedTableIterator and LevelIterator
Summary:
Use a customzied BlockBasedTableIterator and LevelIterator to replace current implementations leveraging two-level-iterator. Hope the customized logic will make code easier to understand. As a side effect, BlockBasedTableIterator reduces the allocation for the data block iterator object, and avoid the virtual function call to it, because we can directly reference BlockIter, a final class. Similarly, LevelIterator reduces virtual function call to the dummy iterator iterating the file metadata. It also enabled further optimization.

The upper bound check is also moved from index block to data block. This implementation fits this iterator better. After the change, forwared iterator is slightly optimized to ensure we trim those iterators.

The two-level-iterator now is only used by partitioned index, so it is simplified.
Closes https://github.com/facebook/rocksdb/pull/3406

Differential Revision: D6809041

Pulled By: siying

fbshipit-source-id: 7da3b9b1d3c8e9d9405302c15920af1fcaf50ffa
2018-02-12 17:12:25 -08:00
Tamir Duberstein cd5092e168 Suppress unused warnings
Summary:
- Use `__unused__` everywhere
- Suppress unused warnings in Release mode
    + This currently affects non-MSVC builds (e.g. mingw64).
Closes https://github.com/facebook/rocksdb/pull/3448

Differential Revision: D6885496

Pulled By: miasantreble

fbshipit-source-id: f2f6adacec940cc3851a9eee328fafbf61aad211
2018-02-02 12:27:07 -08:00
Zhongyi Xie 51c2ea0feb Reduce heavy hitter for Get operation
Summary:
This PR addresses the following heavy hitters in `Get` operation by moving calls to `StatisticsImpl::recordTick` from `BlockBasedTable` to `Version::Get`

- rocksdb.block.cache.bytes.write
- rocksdb.block.cache.add
- rocksdb.block.cache.data.miss
- rocksdb.block.cache.data.bytes.insert
- rocksdb.block.cache.data.add
- rocksdb.block.cache.hit
- rocksdb.block.cache.data.hit
- rocksdb.block.cache.bytes.read

The db_bench statistics before and after the change are:

|1GB block read|Children      |Self  |Command          |Shared Object        |Symbol|
|---|---|---|---|---|---|
|master:     |4.22%     |1.31%  |db_bench  |db_bench  |[.] rocksdb::StatisticsImpl::recordTick|
|updated:    |0.51%     |0.21%  |db_bench  |db_bench  |[.] rocksdb::StatisticsImpl::recordTick|
|     	     |0.14%     |0.14%  |db_bench  |db_bench  |[.] rocksdb::GetContext::record_counters|

|1MB block read|Children      |Self  |Command          |Shared Object        |Symbol|
|---|---|---|---|---|---|
|master:    |3.48%     |1.08%  |db_bench  |db_bench  |[.] rocksdb::StatisticsImpl::recordTick|
|updated:    |0.80%     |0.31%  |db_bench  |db_bench  |[.] rocksdb::StatisticsImpl::recordTick|
|    	     |0.35%     |0.35%  |db_bench  |db_bench  |[.] rocksdb::GetContext::record_counters|
Closes https://github.com/facebook/rocksdb/pull/3172

Differential Revision: D6330532

Pulled By: miasantreble

fbshipit-source-id: 2b492959e00a3db29e9437ecdcc5e48ca4ec5741
2017-12-12 21:11:33 -08:00
Prashant D 34aa245dd8 Fix coverity issues version, write_batch
Summary:
db/version_builder.cc:
117        base_vstorage_->InternalComparator();

CID 1351713 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
2. uninit_member: Non-static class member field level_zero_cmp_.internal_comparator is not initialized in this constructor nor in any functions that it calls.

db/version_edit.h:
145  FdWithKeyRange()
146      : fd(),
147        smallest_key(),
148        largest_key() {

CID 1418254 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
2. uninit_member: Non-static class member file_metadata is not initialized in this constructor nor in any functions that it calls.
149  }

db/version_set.cc:
120    }

CID 1322789 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
4. uninit_member: Non-static class member curr_file_level_ is not initialized in this constructor nor in any functions that it calls.
121  }

db/write_batch.cc:
 939    assert(cf_mems_);

CID 1419862 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
3. uninit_member: Non-static class member rebuilding_trx_seq_ is not initialized in this constructor nor in any functions that it calls.
 940  }
Closes https://github.com/facebook/rocksdb/pull/3092

Differential Revision: D6505666

Pulled By: yiwu-arbug

fbshipit-source-id: fd2c68948a0280772691a419d72ac7e190951d86
2017-12-07 11:57:36 -08:00
Andrew Kryczka 78d1a5ec72 Preserve overlapping file endpoint invariant
Summary:
Fix for #2833.

- In `DeleteFilesInRange`, use `GetCleanInputsWithinInterval` instead of `GetOverlappingInputs` to make sure we get a clean cut set of files to delete.
- In `GetCleanInputsWithinInterval`, support nullptr as `begin_key` or `end_key`.
- In `GetOverlappingInputsRangeBinarySearch`, move the assertion for non-empty range away from `ExtendFileRangeWithinInterval`, which should be allowed to return an empty range (via `end_index < begin_index`).
Closes https://github.com/facebook/rocksdb/pull/2843

Differential Revision: D5772387

Pulled By: ajkr

fbshipit-source-id: e554e8461823c6be82b21a9262a2da02b3957881
2017-12-06 18:56:54 -08:00
Maysam Yabandeh 18dcf7f98d WritePrepared Txn: PreReleaseCallback
Summary:
Add PreReleaseCallback to be called at the end of WriteImpl but before publishing the sequence number. The callback is used in WritePrepareTxn to i) update the commit map, ii) update the last published sequence number in the 2nd write queue. It also ensures that all the commits will go to the 2nd queue.
These changes will ensure that the commit map is updated before the sequence number is published and used by reading snapshots. If we use two write queues, the snapshots will use the seq number published by the 2nd queue. If we use one write queue (the default, the snapshots will use the last seq number in the memtable, which also indicates the last published seq number.
Closes https://github.com/facebook/rocksdb/pull/3205

Differential Revision: D6438959

Pulled By: maysamyabandeh

fbshipit-source-id: f8b6c434e94bc5f5ab9cb696879d4c23e2577ab9
2017-11-30 23:50:45 -08:00
Andrew Kryczka 1bdb44de95 optimize file ingestion checks for range deletion overlap
Summary:
Before we were checking every file in the level which was unnecessary. We can piggyback onto the code for checking point-key overlap, which already opens all the files that could possibly contain overlapping range deletions. This PR makes us check just the range deletions from those files, so no extra ones will be opened.
Closes https://github.com/facebook/rocksdb/pull/3179

Differential Revision: D6358125

Pulled By: ajkr

fbshipit-source-id: 00e200770fdb8f3cc6b1b2da232b755e4ba36279
2017-11-28 11:27:02 -08:00
Zhongyi Xie 32e31d49d1 Make DBOption compaction_readahead_size dynamic
Summary: Closes https://github.com/facebook/rocksdb/pull/3004

Differential Revision: D6056141

Pulled By: miasantreble

fbshipit-source-id: 56df1630f464fd56b07d25d38161f699e0528b7f
2017-11-16 17:57:25 -08:00
Maysam Yabandeh 857adf388f WritePrepared Txn: Refactor conf params
Summary:
Summary of changes:
- Move seq_per_batch out of Options
- Rename concurrent_prepare to two_write_queues
- Add allocate_seq_only_for_data_
Closes https://github.com/facebook/rocksdb/pull/3136

Differential Revision: D6304458

Pulled By: maysamyabandeh

fbshipit-source-id: 08e685bfa82bbc41b5b1c5eb7040a8ca6e05e58c
2017-11-10 17:28:12 -08:00
Shaohua Li 33c7d4ccd9 Make writable_file_max_buffer_size dynamic
Summary:
The DBOptions::writable_file_max_buffer_size can be changed dynamically.
Closes https://github.com/facebook/rocksdb/pull/3053

Differential Revision: D6152720

Pulled By: shligit

fbshipit-source-id: aa0c0cfcfae6a54eb17faadb148d904797c68681
2017-10-31 13:56:35 -07:00
Andrew Kryczka 9b18cc2363 single-file bottom-level compaction when snapshot released
Summary:
When snapshots are held for a long time, files may reach the bottom level containing overwritten/deleted keys. We previously had no mechanism to trigger compaction on such files. This particularly impacted DBs that write to different parts of the keyspace over time, as such files would never be naturally compacted due to second-last level files moving down. This PR introduces a mechanism for bottommost files to be recompacted upon releasing all snapshots that prevent them from dropping their deleted/overwritten keys.

- Changed `CompactionPicker` to compact files in `BottommostFilesMarkedForCompaction()`. These are the last choice when picking. Each file will be compacted alone and output to the same level in which it originated. The goal of this type of compaction is to rewrite the data excluding deleted/overwritten keys.
- Changed `ReleaseSnapshot()` to recompute the bottom files marked for compaction when the oldest existing snapshot changes, and schedule a compaction if needed. We cache the value that oldest existing snapshot needs to exceed in order for another file to be marked in `bottommost_files_mark_threshold_`, which allows us to avoid recomputing marked files for most snapshot releases.
- Changed `VersionStorageInfo` to track the list of bottommost files, which is recomputed every time the version changes by `UpdateBottommostFiles()`. The list of marked bottommost files is first computed in `ComputeBottommostFilesMarkedForCompaction()` when the version changes, but may also be recomputed when `ReleaseSnapshot()` is called.
- Extracted core logic of `Compaction::IsBottommostLevel()` into `VersionStorageInfo::RangeMightExistAfterSortedRun()` since logic to check whether a file is bottommost is now necessary outside of compaction.
Closes https://github.com/facebook/rocksdb/pull/3009

Differential Revision: D6062044

Pulled By: ajkr

fbshipit-source-id: 123d201cf140715a7d5928e8b3cb4f9cd9f7ad21
2017-10-25 16:30:37 -07:00
Dmitri Smirnov d2a65c59e1 Fix unused var warnings in Release mode
Summary:
MSVC does not support unused attribute at this time. A separate assignment line fixes the issue probably by being counted as usage for MSVC and it no longer complains about unused var.
Closes https://github.com/facebook/rocksdb/pull/3048

Differential Revision: D6126272

Pulled By: maysamyabandeh

fbshipit-source-id: 4907865db45fd75a39a15725c0695aaa17509c1f
2017-10-23 14:27:04 -07:00
Sagar Vemuri f0804db7f7 Make FIFO compaction options dynamically configurable
Summary:
ColumnFamilyOptions::compaction_options_fifo and all its sub-fields can be set dynamically now.

Some of the ways in which the fifo compaction options can be set are:
- `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024}"}})`
- `SetOptions({{"compaction_options_fifo", "{ttl=600;}"}})`
- `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024;ttl=600;}"}})`
- `SetOptions({{"compaction_options_fifo", "{max_table_files_size=51;ttl=49;allow_compaction=true;}"}})`

Most of the code has been made generic enough so that it could be reused later to make universal options (and other such nested defined-types) dynamic with very few lines of parsing/serializing code changes.
Introduced a few new functions like `ParseStruct`, `SerializeStruct` and `GetStringFromStruct`.
The duplicate code in `GetStringFromDBOptions` and `GetStringFromColumnFamilyOptions` has been moved into `GetStringFromStruct`. So they become just simple wrappers now.
Closes https://github.com/facebook/rocksdb/pull/3006

Differential Revision: D6058619

Pulled By: sagar0

fbshipit-source-id: 1e8f78b3374ca5249bb4f3be8a6d3bb4cbc52f92
2017-10-19 15:26:36 -07:00
Andrew Kryczka 70aa942153 fix file numbers after repair
Summary:
The file numbers assigned post-repair were sometimes smaller than older files' numbers due to `LogAndApply` saving the wrong next file number in the manifest.

- Mark the highest file seen during repair as used before `LogAndApply` so the correct next file number will be stored.
- Renamed `MarkFileNumberUsedDuringRecovery` to `MarkFileNumberUsed` since now it's used during repair in addition to during recovery
- Added `TEST_Current_Next_FileNo` to expose the next file number for the unit test.
Closes https://github.com/facebook/rocksdb/pull/2988

Differential Revision: D6018083

Pulled By: ajkr

fbshipit-source-id: 3f25cbf74439cb8f16dd12af90b67f9f9f75e718
2017-10-10 13:12:37 -07:00
Yi Wu d1cab2b64e Add ValueType::kTypeBlobIndex
Summary:
Add kTypeBlobIndex value type, which will be used by blob db only, to insert a (key, blob_offset) KV pair. The purpose is to
1. Make it possible to open existing rocksdb instance as blob db. Existing value will be of kTypeIndex type, while value inserted by blob db will be of kTypeBlobIndex.
2. Make rocksdb able to detect if the db contains value written by blob db, if so return error.
3. Make it possible to have blob db optionally store value in SST file (with kTypeValue type) or as a blob value (with kTypeBlobIndex type).

The root db (DBImpl) basically pretended kTypeBlobIndex are normal value on write. On Get if is_blob is provided, return whether the value read is of kTypeBlobIndex type, or return Status::NotSupported() status if is_blob is not provided. On scan allow_blob flag is pass and if the flag is true, return wether the value is of kTypeBlobIndex type via iter->IsBlob().

Changes on blob db side will be in a separate patch.
Closes https://github.com/facebook/rocksdb/pull/2886

Differential Revision: D5838431

Pulled By: yiwu-arbug

fbshipit-source-id: 3c5306c62bc13bb11abc03422ec5cbcea1203cca
2017-10-03 09:11:23 -07:00
Maysam Yabandeh 385049baf2 WritePrepared Txn: Recovery
Summary:
Recover txns from the WAL. Also added some unit tests.
Closes https://github.com/facebook/rocksdb/pull/2901

Differential Revision: D5859596

Pulled By: maysamyabandeh

fbshipit-source-id: 6424967b231388093b4effffe0a3b1b7ec8caeb0
2017-09-28 16:56:45 -07:00
Amy Xu 5785b1fcb8 Fix naming in InternalKey
Summary:
- Switched all instances of SetMinPossibleForUserKey and SetMaxPossibleForUserKey in accordance to InternalKeyComparator's comparison logic
Closes https://github.com/facebook/rocksdb/pull/2868

Differential Revision: D5804152

Pulled By: axxufb

fbshipit-source-id: 80be35e04f2e8abc35cc64abe1fecb03af24e183
2017-09-12 17:17:42 -07:00
Maysam Yabandeh f46464d383 write-prepared txn: call IsInSnapshot
Summary:
This patch instruments the read path to verify each read value against an optional ReadCallback class. If the value is rejected, the reader moves on to the next value. The WritePreparedTxn makes use of this feature to skip sequence numbers that are not in the read snapshot.
Closes https://github.com/facebook/rocksdb/pull/2850

Differential Revision: D5787375

Pulled By: maysamyabandeh

fbshipit-source-id: 49d808b3062ab35e7ae98ad388f659757794184c
2017-09-11 09:14:48 -07:00
Yi Wu 3c840d1a6d Allow DB reopen with reduced options.num_levels
Summary:
Allow user to reduce number of levels in LSM by issue a full CompactRange() and put the result in a lower level, and then reopen DB with reduced options.num_levels. Previous this will fail on reopen on when recovery replaying the previous MANIFEST and found a historical file was on a higher level than the new options.num_levels. The workaround was after CompactRange(), reopen the DB with old num_levels, which will create a new MANIFEST, and then reopen the DB again with new num_levels.

This patch relax the check of levels during recovery. It allows DB to open if there was a historical file on level > options.num_levels, but was also deleted.
Closes https://github.com/facebook/rocksdb/pull/2740

Differential Revision: D5629354

Pulled By: yiwu-arbug

fbshipit-source-id: 545903f6b36b6083e8cbaf777176aef2f488021d
2017-08-24 16:10:54 -07:00
Andrew Kryczka cc01985db0 Introduce bottom-pri thread pool for large universal compactions
Summary:
When we had a single thread pool for compactions, a thread could be busy for a long time (minutes) executing a compaction involving the bottom level. In multi-instance setups, the entire thread pool could be consumed by such bottom-level compactions. Then, top-level compactions (e.g., a few L0 files) would be blocked for a long time ("head-of-line blocking"). Such top-level compactions are critical to prevent compaction stalls as they can quickly reduce number of L0 files / sorted runs.

This diff introduces a bottom-priority queue for universal compactions including the bottom level. This alleviates the head-of-line blocking situation for fast, top-level compactions.

- Added `Env::Priority::BOTTOM` thread pool. This feature is only enabled if user explicitly configures it to have a positive number of threads.
- Changed `ThreadPoolImpl`'s default thread limit from one to zero. This change is invisible to users as we call `IncBackgroundThreadsIfNeeded` on the low-pri/high-pri pools during `DB::Open` with values of at least one. It is necessary, though, for bottom-pri to start with zero threads so the feature is disabled by default.
- Separated `ManualCompaction` into two parts in `PrepickedCompaction`. `PrepickedCompaction` is used for any compaction that's picked outside of its execution thread, either manual or automatic.
- Forward universal compactions involving last level to the bottom pool (worker thread's entry point is `BGWorkBottomCompaction`).
- Track `bg_bottom_compaction_scheduled_` so we can wait for bottom-level compactions to finish. We don't count them against the background jobs limits. So users of this feature will get an extra compaction for free.
Closes https://github.com/facebook/rocksdb/pull/2580

Differential Revision: D5422916

Pulled By: ajkr

fbshipit-source-id: a74bd11f1ea4933df3739b16808bb21fcd512333
2017-08-03 15:43:29 -07:00
Andrew Kryczka 710411aea6 fix asan/valgrind for TableCache cleanup
Summary:
Breaking commit: d12691b86f

In the above commit, I moved the `TableCache` cleanup logic from `Version` destructor into `PurgeObsoleteFiles`. I missed cleaning up `TableCache` entries for the current `Version` during DB destruction.

This PR adds that logic to `VersionSet` destructor. One unfortunate side effect is now we're potentially deleting `TableReader`s after `column_family_set_.reset()`, which means we can't call `BlockBasedTableReader::Close` a second time as the block cache might already be destroyed.
Closes https://github.com/facebook/rocksdb/pull/2662

Differential Revision: D5515108

Pulled By: ajkr

fbshipit-source-id: 2cb820e19aa813e0d258d17f76b2d7b6b7ee0b18
2017-07-27 20:28:04 -07:00
Andrew Kryczka d12691b86f move TableCache::EraseHandle outside of db mutex
Summary:
Post-compaction work holds onto db mutex for the longest time (found by tracing lock acquires/releases with LTTng and correlating timestamps with our info log). Further experimentation showed `TableCache::EraseHandle` is responsible for ~86% of time mutex is held. We can just release the handle outside the db mutex.
Closes https://github.com/facebook/rocksdb/pull/2654

Differential Revision: D5507126

Pulled By: ajkr

fbshipit-source-id: 703c01ddf2aea16bc0f9e33c08935d78aa6b781d
2017-07-27 12:14:41 -07:00
Sagar Vemuri 72502cf227 Revert "comment out unused parameters"
Summary:
This reverts the previous commit 1d7048c598, which broke the build.

Did a `git revert 1d7048c`.
Closes https://github.com/facebook/rocksdb/pull/2627

Differential Revision: D5476473

Pulled By: sagar0

fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06
2017-07-21 18:26:26 -07:00
Victor Gao 1d7048c598 comment out unused parameters
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.

Reviewed By: igorsugak

Differential Revision: D5454343

fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
2017-07-21 14:57:44 -07:00
Siying Dong 3c327ac2d0 Change RocksDB License
Summary: Closes https://github.com/facebook/rocksdb/pull/2589

Differential Revision: D5431502

Pulled By: siying

fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75
2017-07-15 16:11:23 -07:00
Mike Kolupaev 397ab11152 Improve Status message for block checksum mismatches
Summary:
We've got some DBs where iterators return Status with message "Corruption: block checksum mismatch" all the time. That's not very informative. It would be much easier to investigate if the error message contained the file name - then we would know e.g. how old the corrupted file is, which would be very useful for finding the root cause. This PR adds file name, offset and other stuff to some block corruption-related status messages.

It doesn't improve all the error messages, just a few that were easy to improve. I'm mostly interested in "block checksum mismatch" and "Bad table magic number" since they're the only corruption errors that I've ever seen in the wild.
Closes https://github.com/facebook/rocksdb/pull/2507

Differential Revision: D5345702

Pulled By: al13n321

fbshipit-source-id: fc8023d43f1935ad927cef1b9c55481ab3cb1339
2017-06-28 21:27:01 -07:00
Sagar Vemuri 1cd45cd1b3 FIFO Compaction with TTL
Summary:
Introducing FIFO compactions with TTL.

FIFO compaction is based on size only which makes it tricky to enable in production as use cases can have organic growth. A user requested an option to drop files based on the time of their creation instead of the total size.

To address that request:
- Added a new TTL option to FIFO compaction options.
- Updated FIFO compaction score to take TTL into consideration.
- Added a new table property, creation_time, to keep track of when the SST file is created.
- Creation_time is set as below:
  - On Flush: Set to the time of flush.
  - On Compaction: Set to the max creation_time of all the files involved in the compaction.
  - On Repair and Recovery: Set to the time of repair/recovery.
  - Old files created prior to this code change will have a creation_time of 0.
- FIFO compaction with TTL is enabled when ttl > 0. All files older than ttl will be deleted during compaction. i.e. `if (file.creation_time < (current_time - ttl)) then delete(file)`. This will enable cases where you might want to delete all files older than, say, 1 day.
- FIFO compaction will fall back to the prior way of deleting files based on size if:
  - the creation_time of all files involved in compaction is 0.
  - the total size (of all SST files combined) does not drop below `compaction_options_fifo.max_table_files_size` even if the files older than ttl are deleted.

This feature is not supported if max_open_files != -1 or with table formats other than Block-based.

**Test Plan:**
Added tests.

**Benchmark results:**
Base: FIFO with max size: 100MB ::
```
svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100

readwhilewriting :       1.924 micros/op 519858 ops/sec;   13.6 MB/s (1176277 of 5000000 found)
```

With TTL (a low one for testing) ::
```
svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100 --fifo_compaction_ttl=20

readwhilewriting :       1.902 micros/op 525817 ops/sec;   13.7 MB/s (1185057 of 5000000 found)
```
Example Log lines:
```
2017/06/26-15:17:24.609249 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609177) [db/compaction_picker.cc:1471] [default] FIFO compaction: picking file 40 with creation time 1498515423 for deletion
2017/06/26-15:17:24.609255 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609234) [db/db_impl_compaction_flush.cc:1541] [default] Deleted 1 files
...
2017/06/26-15:17:25.553185 7fd5a61a5800 [DEBUG] [db/db_impl_files.cc:309] [JOB 0] Delete /dev/shm/dbbench/000040.sst type=2 #40 -- OK
2017/06/26-15:17:25.553205 7fd5a61a5800 EVENT_LOG_v1 {"time_micros": 1498515445553199, "job": 0, "event": "table_file_deletion", "file_number": 40}
```

SST Files remaining in the dbbench dir, after db_bench execution completed:
```
svemuri@dev15905 ~/rocksdb (fifo-compaction)  $ ls -l /dev/shm//dbbench/*.sst
-rw-r--r--. 1 svemuri users 30749887 Jun 26 15:17 /dev/shm//dbbench/000042.sst
-rw-r--r--. 1 svemuri users 30768779 Jun 26 15:17 /dev/shm//dbbench/000044.sst
-rw-r--r--. 1 svemuri users 30757481 Jun 26 15:17 /dev/shm//dbbench/000046.sst
```
Closes https://github.com/facebook/rocksdb/pull/2480

Differential Revision: D5305116

Pulled By: sagar0

fbshipit-source-id: 3e5cfcf5dd07ed2211b5b37492eb235b45139174
2017-06-27 17:11:48 -07:00
Maysam Yabandeh 499ebb3ab5 Optimize for serial commits in 2PC
Summary:
Throughput: 46k tps in our sysbench settings (filling the details later)

The idea is to have the simplest change that gives us a reasonable boost
in 2PC throughput.

Major design changes:
1. The WAL file internal buffer is not flushed after each write. Instead
it is flushed before critical operations (WAL copy via fs) or when
FlushWAL is called by MySQL. Flushing the WAL buffer is also protected
via mutex_.
2. Use two sequence numbers: last seq, and last seq for write. Last seq
is the last visible sequence number for reads. Last seq for write is the
next sequence number that should be used to write to WAL/memtable. This
allows to have a memtable write be in parallel to WAL writes.
3. BatchGroup is not used for writes. This means that we can have
parallel writers which changes a major assumption in the code base. To
accommodate for that i) allow only 1 WriteImpl that intends to write to
memtable via mem_mutex_--which is fine since in 2PC almost all of the memtable writes
come via group commit phase which is serial anyway, ii) make all the
parts in the code base that assumed to be the only writer (via
EnterUnbatched) to also acquire mem_mutex_, iii) stat updates are
protected via a stat_mutex_.

Note: the first commit has the approach figured out but is not clean.
Submitting the PR anyway to get the early feedback on the approach. If
we are ok with the approach I will go ahead with this updates:
0) Rebase with Yi's pipelining changes
1) Currently batching is disabled by default to make sure that it will be
consistent with all unit tests. Will make this optional via a config.
2) A couple of unit tests are disabled. They need to be updated with the
serial commit of 2PC taken into account.
3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires
releasing mutex_ beforehand (the same way EnterUnbatched does). This
needs to be cleaned up.
Closes https://github.com/facebook/rocksdb/pull/2345

Differential Revision: D5210732

Pulled By: maysamyabandeh

fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4
2017-06-24 14:11:29 -07:00
Siying Dong 6837a17621 Fix Data Race Between CreateColumnFamily() and GetAggregatedIntProperty()
Summary:
CreateColumnFamily() releases DB mutex after adding column family to the set and install super version (to write option file), so if users call GetAggregatedIntProperty() in the middle, then super version will be null and the process will crash. Fix it by skipping those column families without super version installed.

Maybe we should also fix the problem of releasing the lock when reading option file, but it is more risky. so I'm doing a quick and safer fix and we can investigate it later.
Closes https://github.com/facebook/rocksdb/pull/2475

Differential Revision: D5298053

Pulled By: siying

fbshipit-source-id: 4b3c8f91c60400b163fcc6cda8a0c77723be0ef6
2017-06-22 15:56:47 -07:00
Siying Dong 5d5a28a98c Fix Clang release build broken by 5582123dee
Summary:
5582123dee broken CLANG release build because of an unexpected change. Fix it.
Closes https://github.com/facebook/rocksdb/pull/2443

Differential Revision: D5236297

Pulled By: siying

fbshipit-source-id: 1b410adf13ded149c53e8235e9ea9f3130fb5403
2017-06-13 04:56:35 -07:00
Siying Dong 5582123dee Sample number of reads per SST file
Summary:
We estimate number of reads per SST files, by updating the counter per file in sampled read requests. This information can later be used to trigger compactions to improve read performacne.
Closes https://github.com/facebook/rocksdb/pull/2417

Differential Revision: D5193528

Pulled By: siying

fbshipit-source-id: b4241c5ad0eaf444b61afb53f8e6290d9f5da2df
2017-06-12 07:12:08 -07:00
Mike Kolupaev 138b87eae4 Fix interaction between CompactionFilter::Decision::kRemoveAndSkipUnt…
Summary:
Fixes the following scenario:
 1. Set prefix extractor. Enable bloom filters, with `whole_key_filtering = false`. Use compaction filter that sometimes returns `kRemoveAndSkipUntil`.
 2. Do a compaction.
 3. Compaction creates an iterator with `total_order_seek = false`, calls `SeekToFirst()` on it, then repeatedly calls `Next()`.
 4. At some point compaction filter returns `kRemoveAndSkipUntil`.
 5. Compaction calls `Seek(skip_until)` on the iterator. The key that it seeks to happens to have prefix that doesn't match the bloom filter. Since `total_order_seek = false`, iterator becomes invalid, and compaction thinks that it has reached the end. The rest of the compaction input is silently discarded.

The fix is to make compaction iterator use `total_order_seek = true`.

The implementation for PlainTable is quite awkward. I've made `kRemoveAndSkipUntil` officially incompatible with PlainTable. If you try to use them together, compaction will fail, and DB will enter read-only mode (`bg_error_`). That's not a very graceful way to communicate a misconfiguration, but the alternatives don't seem worth the implementation time and complexity. To be able to check in advance that `kRemoveAndSkipUntil` is not going to be used with PlainTable, we'd need to extend the interface of either `CompactionFilter` or `InternalIterator`. It seems unlikely that anyone will ever want to use `kRemoveAndSkipUntil` with PlainTable: PlainTable probably has very few users, and `kRemoveAndSkipUntil` has only one user so far: us (logdevice).
Closes https://github.com/facebook/rocksdb/pull/2349

Differential Revision: D5110388

Pulled By: lightmark

fbshipit-source-id: ec29101a99d9dcd97db33923b87f72bce56cc17a
2017-06-02 15:11:38 -07:00
Andrew Kryczka 215076ef06 Fix TSAN: avoid arena mode with range deletions
Summary:
The range deletion meta-block iterators weren't getting cleaned up properly since they don't support arena allocation. I didn't implement arena support since, in the general case, each iterator is used only once and separately from all other iterators, so there should be no benefit to data locality.

Anyways, this diff fixes up #2370 by treating range deletion iterators as non-arena-allocated.
Closes https://github.com/facebook/rocksdb/pull/2399

Differential Revision: D5171119

Pulled By: ajkr

fbshipit-source-id: bef6f5c4c5905a124f4993945aed4bd86e2807d8
2017-06-01 22:26:49 -07:00
Andrew Kryczka 3a8a848a55 account for L0 size in estimated compaction bytes
Summary:
also changed the `>` in the comparison against `level0_file_num_compaction_trigger` into a `>=` since exactly `level0_file_num_compaction_trigger` can trigger a compaction from L0.
Closes https://github.com/facebook/rocksdb/pull/2179

Differential Revision: D4915772

Pulled By: ajkr

fbshipit-source-id: e38fec6253de6f9a40e61734615c6670d84038aa
2017-06-01 17:56:59 -07:00
Andrew Kryczka 9c9909bf7d Support ingest file when range deletions exist
Summary:
Previously we returned NotSupported when ingesting files into a database containing any range deletions. This diff adds the support.

- Flush if any memtable contains range deletions overlapping the to-be-ingested file
- Place to-be-ingested file before any level that contains range deletions overlapping it.
- Added support for `Version` to return iterators over range deletions in a given level. Previously, we piggybacked getting range deletions onto `Version`'s `Get()` / `AddIterator()` functions by passing them a `RangeDelAggregator*`. But file ingestion needs to get iterators over range deletions, not populate an aggregator (since the aggregator does collapsing and doesn't expose the actual ranges).
Closes https://github.com/facebook/rocksdb/pull/2370

Differential Revision: D5127648

Pulled By: ajkr

fbshipit-source-id: 816faeb9708adfa5287962bafdde717db56e3f1a
2017-05-31 13:57:19 -07:00
Siying Dong 51ac91f586 Histogram of number of merge operands
Summary:
Add a histogram in statistics to help users understand how many merge operands they merge.
Closes https://github.com/facebook/rocksdb/pull/2373

Differential Revision: D5139983

Pulled By: siying

fbshipit-source-id: 61b9ba8ca83f358530a4833d68f0103b56a0e182
2017-05-31 07:41:44 -07:00
赵星宇 d03c34497c update comment of GetNextFile
Summary: Closes https://github.com/facebook/rocksdb/pull/2377

Differential Revision: D5141274

Pulled By: lightmark

fbshipit-source-id: c237a285b73ad93488c080ea80c71a29a17f1be0
2017-05-26 15:12:13 -07:00
Aaron Gao 3e86c0f07c disable direct reads for log and manifest and add direct io to tests
Summary:
Disable direct reads for log and manifest. Direct reads should not affect sequential_file
Also add kDirectIO for option_config_ in db_test_util
Closes https://github.com/facebook/rocksdb/pull/2337

Differential Revision: D5100261

Pulled By: lightmark

fbshipit-source-id: 0ebfd13b93fa1b8f9acae514ac44f8125a05868b
2017-05-22 18:41:28 -07:00
Aaron Gao a30a696034 do not read next datablock if upperbound is reached
Summary:
Now if we have iterate_upper_bound set, we continue read until get a key >= upper_bound. For a lot of cases that neighboring data blocks have a user key gap between them, our index key will be a user key in the middle to get a shorter size. For example, if we have blocks:
[a b c d][f g h]
Then the index key for the first block will be 'e'.
then if upper bound is any key between 'd' and 'e', for example, d1, d2, ..., d99999999999, we don't have to read the second block and also know that we have done our iteration by reaching the last key that smaller the upper bound already.

This diff can reduce RA in most cases.
Closes https://github.com/facebook/rocksdb/pull/2239

Differential Revision: D4990693

Pulled By: lightmark

fbshipit-source-id: ab30ea2e3c6edf3fddd5efed3c34fcf7739827ff
2017-05-05 23:20:01 -07:00