Commit graph

8439 commits

Author SHA1 Message Date
sdong 2a9e5caffe Make FIFO compaction take default 30 days TTL by default (#5987)
Summary:
Right now, by default FIFO compaction has no TTL. We believe that a default TTL of 30 days will be better. With this patch, the default will be changed to 30 days. For FIFO compaction, Options.periodic_compaction_seconds will mean the same as Options.ttl. If Options.ttl and Options.periodic_compaction_seconds are both left at their defaults, a default 30-day TTL will be used. If both options are set, the stricter value of the two will be used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5987
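
The resolution rule in the summary can be sketched as follows. This is a minimal illustration, not the code in the patch: `kUnset` and `EffectiveFifoTtl` are hypothetical names, "stricter" is assumed to mean the smaller number of seconds, and the special meaning of 0 is ignored.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sentinel standing in for "option left at its default".
constexpr uint64_t kUnset = UINT64_MAX;
constexpr uint64_t kThirtyDaysInSeconds = 30ULL * 24 * 60 * 60;

// Returns the TTL (in seconds) that FIFO compaction would effectively use
// under the rule described in the summary.
uint64_t EffectiveFifoTtl(uint64_t ttl, uint64_t periodic_compaction_seconds) {
  if (ttl == kUnset && periodic_compaction_seconds == kUnset) {
    // Neither option was set: fall back to the 30-day default.
    return kThirtyDaysInSeconds;
  }
  // kUnset is the largest possible value, so std::min returns the explicit
  // value when only one option is set, and the smaller (stricter) value
  // when both are set.
  return std::min(ttl, periodic_compaction_seconds);
}
```

For example, `EffectiveFifoTtl(3600, 7200)` yields 3600 under these assumptions.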

Test Plan: Add an option sanitize test to cover the case.

Differential Revision: D18237935

fbshipit-source-id: a6dcea1f36c3849e13c0a69e413d73ad8eab58c9
2019-10-31 11:13:12 -07:00
Maysam Yabandeh dccaf9f03c Turn compaction asserts to runtime check (#5935)
Summary:
Compaction iterator has many assert statements that are active only during test runs. Some rare bugs that show up only at runtime could violate the assert conditions but go unnoticed, since assert statements are not compiled in release mode. Turning the assert statements into runtime checks has some pros and cons:
Pros:
- A bug that would result in incorrect data would be detected early, before the incorrect data is written to the disk.

Cons:
- Runtime overhead, which should be negligible since compaction CPU is a minority of the overall CPU usage
- The assert statements might already be violated at runtime, and turning them into runtime failures might result in reliability issues.

The patch takes a conservative step in this direction by logging the assert violations at runtime. If we see any violation reported in the logs, we investigate. Otherwise, we can go ahead and turn them into runtime errors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5935
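
A rough sketch of the "log instead of abort" idea, assuming a hypothetical `ViolationLog` collector and `CheckOrLog` helper (neither is taken from the patch): the check stays active in release builds, records the violated condition, and lets the caller decide what to do next.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the real info log.
struct ViolationLog {
  std::vector<std::string> entries;
};

// Unlike assert(), this check is always compiled in. On violation it records
// the failed expression and returns false instead of aborting the process.
inline bool CheckOrLog(bool condition, const char* expression,
                       ViolationLog* log) {
  assert(log != nullptr);
  if (!condition) {
    log->entries.push_back(std::string("assertion violated: ") + expression);
  }
  return condition;
}
```

A later, stricter version could return an error status from the same call site once the logs show that the condition never fires in practice.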

Differential Revision: D18229697

Pulled By: maysamyabandeh

fbshipit-source-id: f1890eca80ccd7cca29737f1825badb9aa8038a8
2019-10-30 13:48:38 -07:00
sdong 0337d87b42 crash_test: disable atomic flush with pipelined write (#5986)
Summary:
Recently, pipelined write was enabled even if atomic flush is enabled, which causes sanitizing failures in db_stress. Revert this change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5986

Test Plan: Run "make crash_test_with_atomic_flush" and let it run for a while to confirm that the old sanitizing error (which showed up quickly) doesn't show up.

Differential Revision: D18228278

fbshipit-source-id: 27fdf2f8e3e77068c9725a838b9bef4ab25a2553
2019-10-30 11:36:55 -07:00
sdong 15119f08e2 Add more release branches to tools/check_format_compatible.sh (#5985)
Summary:
More release branches are created. We should include them in continuous format compatibility checks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5985

Test Plan: Let's see whether it passes.

Differential Revision: D18226532

fbshipit-source-id: 75d8cad5b03ccea4ce16f00cea1f8b7893b0c0c8
2019-10-30 11:20:49 -07:00
sdong a3960fc875 Move pipeline write waiting logic into WaitForPendingWrites() (#5716)
Summary:
In pipeline writing mode, memtable switching needs to wait for memtable writing to finish to make sure that when memtables are made immutable, inserts are not going to them. This is currently done in DBImpl::SwitchMemtable(). This is done after flush_scheduler_.TakeNextColumnFamily() is called to fetch the list of column families to switch. The function flush_scheduler_.TakeNextColumnFamily() itself, however, is not thread-safe when being called together with flush_scheduler_.ScheduleFlush().
This change provides a fix, which moves the waiting logic before flush_scheduler_.TakeNextColumnFamily(). WaitForPendingWrites() is a natural place where the logic can happen.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5716

Test Plan: Run all tests with ASAN and TSAN.

Differential Revision: D18217658

fbshipit-source-id: b9c5e765c9989645bf10afda7c5c726c3f82f6c3
2019-10-29 18:16:36 -07:00
sdong f22aaf8b3f db_stress: CF Consistency check to use random CF to validate iterator results (#5983)
Summary:
Right now, in db_stress's iterator tests, we always use the same CF to validate iterator results. This commit changes it so that a randomized CF is used in Cf consistency test, where every CF should have exactly the same data. This would help catch more bugs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5983

Test Plan: Run "make crash_test_with_atomic_flush".

Differential Revision: D18217643

fbshipit-source-id: 3ac998852a0378bb59790b20c5f236f6a5d681fe
2019-10-29 18:16:35 -07:00
Sagar Vemuri 4c9aa30a62 Auto enable Periodic Compactions if a Compaction Filter is used (#5865)
Summary:
- Periodic compactions are auto-enabled if a compaction filter or a compaction filter factory is set, in Level Compaction.
- The default value of `periodic_compaction_seconds` is changed to UINT64_MAX, which lets RocksDB auto-tune periodic compactions as needed. An explicit value of 0 will still work as before, i.e., it disables periodic compactions completely. For now, on seeing a compaction filter along with a UINT64_MAX value for `periodic_compaction_seconds`, RocksDB will make SST files older than 30 days go through periodic compactions.

Some RocksDB users make use of compaction filters to control when their data can be deleted, usually with a custom TTL logic. But it is occasionally possible that the compactions get delayed by considerable time due to factors like low writes to a key range, data reaching bottom level, etc before the TTL expiry. Periodic Compactions feature was originally built to help such cases. Now periodic compactions are auto enabled by default when compaction filters or compaction filter factories are used, as it is generally helpful to all cases to collect garbage.

`periodic_compaction_seconds` is set to a large value, 30 days, in `SanitizeOptions` when RocksDB sees that a `compaction_filter` or `compaction_filter_factory` is used.

This is done only for Level Compaction style.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5865
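
The sanitization described above might look roughly like the sketch below. `SketchOptions` and `SanitizePeriodicCompaction` are hypothetical names, not the real `SanitizeOptions` code, and leaving UINT64_MAX untouched when no filter is present is an assumption, since the summary does not cover that case.

```cpp
#include <cstdint>

constexpr uint64_t kAutoTune = UINT64_MAX;  // the new default value
constexpr uint64_t kThirtyDaysInSeconds = 30ULL * 24 * 60 * 60;

// Hypothetical, trimmed-down view of the relevant options.
struct SketchOptions {
  bool level_compaction = true;
  bool has_compaction_filter = false;  // filter or filter factory is set
  uint64_t periodic_compaction_seconds = kAutoTune;
};

// Returns the value periodic_compaction_seconds would take after sanitizing.
uint64_t SanitizePeriodicCompaction(const SketchOptions& options) {
  if (options.periodic_compaction_seconds == kAutoTune &&
      options.level_compaction && options.has_compaction_filter) {
    return kThirtyDaysInSeconds;
  }
  // Explicit values, including 0 (disabled), are kept as they are.
  return options.periodic_compaction_seconds;
}
```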

Test Plan:
- Added a new test `DBCompactionTest.LevelPeriodicCompactionWithCompactionFilters` to make sure that `periodic_compaction_seconds` is set if either `compaction_filter` or `compaction_filter_factory` options are set.
- `COMPILE_WITH_ASAN=1 make check`

Differential Revision: D17659180

Pulled By: sagar0

fbshipit-source-id: 4887b9cf2e53cf2dc93a7b658c6b15e1181217ee
2019-10-29 15:05:51 -07:00
Peter Dillinger 26dc29633e filter_bench not needed for ROCKSDB_LITE (#5978)
Summary:
filter_bench is a specialized micro-benchmarking tool that
should not be needed with ROCKSDB_LITE. This should fix the LITE build.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5978

Test Plan: make LITE=1 check

Differential Revision: D18177941

Pulled By: pdillinger

fbshipit-source-id: b73a171404661e09e018bc99afcf8d4bf1e2949c
2019-10-28 14:12:36 -07:00
Vijay Nadimpalli 79018ba51b Upgrading version to 6.6.0 on Master (#5965)
Summary:
Upgrading version to 6.6.0 on Master.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5965

Differential Revision: D18119839

Pulled By: vjnadimpalli

fbshipit-source-id: 4adbcbb82b108d2f626e88c786453baad8455f4e
2019-10-28 13:14:45 -07:00
Vijay Nadimpalli 1075c376ef Fix for lite build (#5971)
Summary:
Fix for lite build
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5971

Test Plan: make J=1 -j64  LITE=1 all check

Differential Revision: D18148306

Pulled By: vjnadimpalli

fbshipit-source-id: 5b9a3edc3e73e054fee6b96e6f6e583cecc898f3
2019-10-25 18:22:24 -07:00
Peter Dillinger 3f891c40a0 More improvements to filter_bench (#5968)
Summary:
* Adds support for plain table filter. This is not critical right now, but does add a -impl flag that will be useful for new filter implementations initially targeted at block-based table (and maybe later ported to plain table)
* Better mixing of inside vs. outside queries, for more realism
* A -best_case option handy for implementation tuning inner loop
* Option for whether to include hashing time in dry run / net timings

No modifications to production code, just filter_bench.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5968

Differential Revision: D18139872

Pulled By: pdillinger

fbshipit-source-id: 5b09eba963111b48f9e0525a706e9921070990e8
2019-10-25 13:27:07 -07:00
Peter Dillinger b3dc2f3691 Update xxhash.cc to allow combined compilation (#5969)
Summary:
To fix unity_test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5969

Test Plan: make unity_test

Differential Revision: D18140426

Pulled By: pdillinger

fbshipit-source-id: d5516e6d665f57e3706b9f9b965b0c458e58ccef
2019-10-25 12:54:41 -07:00
Vijay Nadimpalli ec880436c1 API to get file_creation_time of the oldest file in the DB (#5948)
Summary:
Adding a new API to db.h that allows users to get file_creation_time of the oldest file in the DB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5948

Test Plan: Added unit test.

Differential Revision: D18056151

Pulled By: vjnadimpalli

fbshipit-source-id: 448ec9d34cb6772e1e5a62db399ace00dcbfbb5d
2019-10-25 11:53:57 -07:00
Peter Dillinger 013babc685 Clean up some filter tests and comments (#5960)
Summary:
Some filtering tests were unfriendly to new implementations of
FilterBitsBuilder because of dynamic_cast to FullFilterBitsBuilder. Most
of those have now been cleaned up, worked around, or at least changed
from crash on dynamic_cast failure to individual test failure.

Also put some clarifying comments on filter-related APIs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5960

Test Plan: make check

Differential Revision: D18121223

Pulled By: pdillinger

fbshipit-source-id: e83827d9d5d96315d96f8e25a99cd70f497d802c
2019-10-24 18:48:16 -07:00
Yanqin Jin 2309fd63bf Update column families' log number altogether after flushing during recovery (#5856)
Summary:
A bug occasionally shows up in crash test, and https://github.com/facebook/rocksdb/issues/5851 reproduces it.
The bug can surface in the following way.
1. Database has multiple column families.
2. Between one DB restart, the last log file is corrupted in the middle (not the tail)
3. During restart, DB crashes between the flushes of two column families.

Then DB will fail to be opened again with error "SST file is ahead of WALs".
Solution is to update the log number associated with each column family altogether after flushing all column families' memtables. The version edits should be written to a new MANIFEST. Only after writing all these version edits succeeds does RocksDB (atomically) point the CURRENT file to the new MANIFEST.

Test plan (on devserver):
```
$make all && make check
```
Specifically
```
$make db_test2
$./db_test2 --gtest_filter=DBTest2.CrashInRecoveryMultipleCF
```
Also checked for compatibility as follows.
Use this branch, run DBTest2.CrashInRecoveryMultipleCF and preserve the db directory.
Then checkout 5.4, build ldb, and dump the MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5856

Differential Revision: D17620818

Pulled By: riversand963

fbshipit-source-id: b52ce5969c9a8052cacec2bd805fcfb373589039
2019-10-24 18:29:30 -07:00
Peter Dillinger ca7ccbe2ea Misc hashing updates / upgrades (#5909)
Summary:
- Updated our included xxhash implementation to version 0.7.2 (== the latest dev version as of 2019-10-09).
- Using XXH_NAMESPACE (like other fb projects) to avoid potential name collisions.
- Added fastrange64, and unit tests for it and fastrange32. These are faster alternatives to hash % range.
- Use preview version of XXH3 instead of MurmurHash64A for NPHash64
-- Had to update cache_test to increase probability of passing for any given hash function.
- Use fastrange64 instead of % with uses of NPHash64
-- Had to fix WritePreparedTransactionTest.CommitOfDelayedPrepared to avoid deadlock apparently caused by new hash collision.
- Set default seed for NPHash64 because specifying a seed rarely makes sense for it.
- Removed unnecessary include xxhash.h in a popular .h file
- Rename preview version of XXH3 to XXH3p for clarity and to ease backward compatibility in case final version of XXH3 is integrated.

Relying on existing unit tests for NPHash64-related changes. Each new implementation of fastrange64 passed unit tests when manipulating my local build to select it. I haven't done any integration performance tests, but I consider the improved performance of the pieces being swapped in to be well established.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5909
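
For readers unfamiliar with the fastrange idea, here is a minimal 32-bit sketch of the general technique (multiply, then keep the high bits). The name `FastRange32Sketch` is made up for illustration and is not claimed to match the function added in this change.

```cpp
#include <cstdint>

// Maps a 32-bit hash onto [0, range) with one multiply and one shift,
// avoiding the division that `hash % range` needs. The result is
// floor(hash * range / 2^32), so it differs from the modulo result but is
// similarly well distributed when the hash bits are uniform.
inline uint32_t FastRange32Sketch(uint32_t hash, uint32_t range) {
  const uint64_t product = static_cast<uint64_t>(hash) * range;
  return static_cast<uint32_t>(product >> 32);
}
```

A 64-bit variant follows the same pattern but needs a 128-bit product, which is why it is usually implemented with compiler-specific support.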

Differential Revision: D18125196

Pulled By: pdillinger

fbshipit-source-id: f6bf83d49d20cbb2549926adf454fd035f0ecc0d
2019-10-24 17:16:46 -07:00
Peter Dillinger ec11eff3bc FilterPolicy consolidation, part 2/2 (#5966)
Summary:
The parts that are used to implement FilterPolicy /
NewBloomFilterPolicy and not used other than for the block-based table
should be consolidated under table/block_based/filter_policy*.

This change is step 2 of 2:
mv util/bloom.cc table/block_based/filter_policy.cc
This gets its own PR so that git has the best chance of following the
rename for blame purposes. Note that low-level shared implementation
details of Bloom filters remain in util/bloom_impl.h, and
util/bloom_test.cc remains where it is for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5966

Test Plan: make check

Differential Revision: D18124930

Pulled By: pdillinger

fbshipit-source-id: 823bc09025b3395f092ef46a46aa5ba92a914d84
2019-10-24 15:44:51 -07:00
Levi Tamasi f7e7b34ebe Propagate SST and blob file numbers through the EventListener interface (#5962)
Summary:
This patch adds a number of new information elements to the FlushJobInfo and
CompactionJobInfo structures that are passed to EventListeners via the
OnFlush{Begin, Completed} and OnCompaction{Begin, Completed} callbacks.
Namely, for flushes, the file numbers of the new SST and the oldest blob file it
references are propagated. For compactions, the new pieces of information are
the file number, level, and the oldest blob file referenced by each compaction
input and output file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5962

Test Plan:
Extended the EventListener unit tests with logic that checks that these information
elements are correctly propagated from the corresponding FileMetaData.

Differential Revision: D18095568

Pulled By: ltamasi

fbshipit-source-id: 6874359a6aadb53366b5fe87adcb2f9bd27a0a56
2019-10-24 14:44:15 -07:00
Peter Dillinger dd19014a7a FilterPolicy consolidation, part 1/2 (#5963)
Summary:
The parts that are used to implement FilterPolicy /
NewBloomFilterPolicy and not used other than for the block-based table
should be consolidated under table/block_based/filter_policy*. I don't
foresee sharing these APIs with e.g. the Plain Table because they don't
expose hashes for reuse in indexing.

This change is step 1 of 2:
(a) mv table/full_filter_bits_builder.h to
table/block_based/filter_policy_internal.h which I expect to expand
soon to internally reveal more implementation details for testing.
(b) consolidate eventual contents of table/block_based/filter_policy.cc
in util/bloom.cc, which has the most elaborate revision history
(see step 2 ...)

Step 2 soon to follow:
mv util/bloom.cc table/block_based/filter_policy.cc
This gets its own PR so that git has the best chance of following the
rename for blame purposes. Note that low-level shared implementation
details of Bloom filters are in util/bloom_impl.h.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5963

Test Plan: make check

Differential Revision: D18121199

Pulled By: pdillinger

fbshipit-source-id: 8f21732c3d8909777e3240e4ac3123d73140326a
2019-10-24 13:20:35 -07:00
Peter Dillinger 2837008525 Vary key size and alignment in filter_bench (#5933)
Summary:
The first version of filter_bench has selectable key size
but that size does not vary throughout a test run. This artificially
favors "branchy" hash functions like the existing BloomHash,
MurmurHash1, probably because of optimal return for branch prediction.

This change primarily varies those key sizes from -2 to +2 bytes vs.
the average selected size. We also set the default key size at 24 to
better reflect our best guess of typical key size.

But steadily random key sizes may not be realistic either. So this
change introduces a new filter_bench option:
-vary_key_size_log2_interval=n where the same key size is used 2^n
times and then changes to another size. I've set the default at 5
(32 times same size) as a compromise between deployments with
rather consistent vs. rather variable key sizes. On my Skylake
system, the performance boost to MurmurHash1 largely lies between
n=10 and n=15.

Also added -vary_key_alignment (bool, now default=true), though this
doesn't currently seem to matter in hash functions under
consideration.

This change also does a "dry run" for each testing scenario, to improve
the accuracy of those numbers, as there was more difference between
scenarios than expected. Subtracting gross test run times from dry run
times is now also embedded in the output, because these "net" times are
generally the most useful.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5933
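
The interval behavior of -vary_key_size_log2_interval can be illustrated with the sketch below. `KeySizeForOp`, its seed parameter, and the mixing constant are hypothetical choices made for this example and do not describe how filter_bench actually picks sizes.

```cpp
#include <cassert>
#include <cstdint>

// Returns the key size for the op_index-th operation. The size stays fixed
// for 2^log2_interval consecutive operations, then a new size is drawn from
// [average_size - 2, average_size + 2].
uint32_t KeySizeForOp(uint64_t op_index, uint32_t log2_interval,
                      uint32_t average_size, uint64_t seed) {
  assert(log2_interval < 64);
  assert(average_size >= 2);
  // All operations in the same run of 2^n share one epoch number.
  const uint64_t epoch = op_index >> log2_interval;
  // Cheap deterministic mixing so neighboring epochs get unrelated sizes.
  const uint64_t mixed = (epoch + seed) * 0x9E3779B97F4A7C15ULL;
  const uint32_t offset = static_cast<uint32_t>((mixed >> 32) % 5);  // 0..4
  return average_size - 2 + offset;
}
```

With n=5 and an average size of 24, operations 0 through 31 share one size between 22 and 26, and operation 32 starts a new run.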

Differential Revision: D18121683

Pulled By: pdillinger

fbshipit-source-id: 3c7efee1c5661a5fe43de555e786754ddf80dc1e
2019-10-24 13:08:30 -07:00
Dan Lambright 2509531123 Add test showing range tombstones can create excessively large compactions (#5956)
Summary:
For more information on the original problem see this [link](https://github.com/facebook/rocksdb/issues/3977).

This change adds two new tests. They are identical other than one uses range tombstones and the other does not. Each test generates sub files at L2 which overlap with keys at L3. The test that uses range tombstones generates a single file at L2. This single file will generate a very large range overlap that will in turn create an excessively large compaction.

1: T001 - T005
2:  000 -  005

In contrast, the test that uses key ranges generates 3 files at L2. As a single file is compacted at a time, those 3 files will generate less work per compaction iteration.

1:  001 - 002
1:  003 - 004
1:  005
2:  000 - 005
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5956

Differential Revision: D18071631

Pulled By: dlambrig

fbshipit-source-id: 12abae75fb3e0b022d228c6371698aa5e53385df
2019-10-24 11:08:44 -07:00
sdong 9f1e5a0b87 CfConsistencyStressTest to validate key consistent across CFs in TestGet() (#5863)
Summary:
Right now in the CF consistency stress test's TestGet(), keys are just fetched without validation. With this change, half of the time, verify that all the CFs share the same value for the same key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5863
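
The validation amounts to a simple agreement check, sketched here under the assumption that the values for one key have already been read from every column family into a vector (the helper name and data layout are hypothetical):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// values[i] is the value read for the same key from column family i.
// Returns true when every column family returned the same value.
bool AllColumnFamiliesAgree(const std::vector<std::string>& values) {
  for (std::size_t i = 1; i < values.size(); ++i) {
    if (values[i] != values[0]) {
      return false;
    }
  }
  return true;
}
```

A real check would also have to treat "not found" consistently across column families, which this sketch leaves out.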

Test Plan: Run "make crash_test_with_atomic_flush" and see tests pass. Hack the code to generate some inconsistency and observe the test fails as expected.

Differential Revision: D17934206

fbshipit-source-id: 00ba1a130391f28785737b677f80f366fb83cced
2019-10-23 16:57:16 -07:00
Peter Dillinger 6a32e3b562 Remove unused BloomFilterPolicy::hash_func_ (#5961)
Summary:
This is an internal, file-local "feature" that is not used and
potentially confusing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5961

Test Plan: make check

Differential Revision: D18099018

Pulled By: pdillinger

fbshipit-source-id: 7870627eeed09941d12538ec55d10d2e164fc716
2019-10-23 15:47:17 -07:00
Yanqin Jin b4ebda7a39 Make buckifier python3 compatible (#5922)
Summary:
Make buckifier/buckify_rocksdb.py run on both Python 3 and 2
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5922

Test Plan:
```
$python3 buckifier/buckify_rocksdb.py
$python3 buckifier/buckify_rocksdb.py '{"fake": {"extra_deps": [":test_dep", "//fakes/module:mock1"], "extra_compiler_flags": ["-DROCKSDB_LITE", "-Os"]}}'
$python2 buckifier/buckify_rocksdb.py
$python2 buckifier/buckify_rocksdb.py '{"fake": {"extra_deps": [":test_dep", "//fakes/module:mock1"], "extra_compiler_flags": ["-DROCKSDB_LITE", "-Os"]}}'
```

Differential Revision: D17920611

Pulled By: riversand963

fbshipit-source-id: cc6e2f36013a88a710d96098f6ca18cbe85e3f62
2019-10-23 13:52:27 -07:00
Zhichao Cao 0933360644 Fix the potential memory leak in trace_replay (#5955)
Summary:
In the previous PR https://github.com/facebook/rocksdb/issues/5934, an if/else-if chain in the while loop does not end with an else that frees the object referenced by ra, which might cause a memory leak (reported as a warning during compilation). Fix it by changing the last "else if" to "else".
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5955

Test Plan: pass make asan check, pass the USE_CLANG=1 TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make -j64 analyze.

Differential Revision: D18071612

Pulled By: zhichao-cao

fbshipit-source-id: 51c00023d0c97c2921507254329aed55d56e1786
2019-10-22 16:39:46 -07:00
Yanqin Jin c0abc6bbc1 Use FLAGS_env for certain operations in db_bench (#5943)
Summary:
Since we already parse env_uri from the command line and create a custom Env
accordingly, we should invoke the methods of that Env instead of using
Env::Default().

Test Plan (on devserver):
```
$make db_bench db_stress
$./db_bench -benchmarks=fillseq
$./db_stress
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5943

Differential Revision: D18018550

Pulled By: riversand963

fbshipit-source-id: 03b61329aaae0dfd914a0b902cc677f570f102e3
2019-10-22 11:43:21 -07:00
Yanqin Jin 925250f42f Include db_stress_tool in rocksdb tools lib (#5950)
Summary:
include db_stress_tool in rocksdb tools lib

Test Plan (on devserver):
```
$make db_stress
$./db_stress
$make all && make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5950

Differential Revision: D18044399

Pulled By: riversand963

fbshipit-source-id: 895585abbbdfd8b954965921dba4b1400b7af1b1
2019-10-21 19:40:35 -07:00
Vijay Nadimpalli 5677f4f775 Using clang for internal ubsan tests (#5952)
Summary:
Using clang for internal ubsan tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5952

Differential Revision: D18048810

Pulled By: vjnadimpalli

fbshipit-source-id: ae55677a1928397b067e972d0ecb4ac1b7e2c8dc
2019-10-21 19:37:00 -07:00
Peter Dillinger 27a124571f Fix memory leak on error opening PlainTable (#5951)
Summary:
Several error paths in opening of a plain table would leak memory. PR https://github.com/facebook/rocksdb/issues/5940 opened the leak to one more error path, which happens to have been (mistakenly) exercised by CuckooTableDBTest.AdaptiveTable. That test has been fixed, and the exercising of
plain table error cases (more than before) has been added as BadOptions1 and BadOptions2
to PlainTableDBTest. This effectively moved the memory leak to plain_table_db_test.

Also here is a cheap fix for the memory leak, without (yet?) changing the signature of
ReadTableProperties. This fixes ASAN on unit tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5951

Test Plan: make COMPILE_WITH_ASAN=1 check

Differential Revision: D18051940

Pulled By: pdillinger

fbshipit-source-id: e2952930c09a2b46c4f1ff09818c5090426929de
2019-10-21 16:53:06 -07:00
Zhichao Cao 7245fb5f63 Fix the potential memory leak of ReplayMultiThread (#5949)
Summary:
The pointer ra needs to be freed when the status s is not OK. In the previous PR https://github.com/facebook/rocksdb/issues/5934, ra is not freed, which might cause a memory leak. Fix this issue by moving the declaration of ra inside the while loop and freeing it as needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5949

Test Plan: pass make asan check.

Differential Revision: D18045726

Pulled By: zhichao-cao

fbshipit-source-id: d5445b7b832c8bb1dafe008bafea7bfe9eb0b1ce
2019-10-21 15:05:01 -07:00
Vijay Nadimpalli 2ce6aa5f39 Making platform 007 (gcc 7) default in build_detect_platform.sh (#5947)
Summary:
Making platform 007 (gcc 7) default in build_detect_platform.sh.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5947

Differential Revision: D18038837

Pulled By: vjnadimpalli

fbshipit-source-id: 9ac2ddaa93bf328a416faec028970e039886378e
2019-10-21 12:09:29 -07:00
sdong a0cd920026 LevelIterator to avoid gap after prefix bloom filters out a file (#5861)
Summary:
Right now, when LevelIterator::Seek() is called, when a file is filtered out by prefix bloom filter, the position is put to the beginning of the next file. This is a confusing internal interface because many keys in the levels are skipped. Avoid this behavior by checking the key of the next file against the seek key, and invalidate the whole iterator if the prefix doesn't match.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5861
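
The prefix check can be pictured with the following sketch. It assumes a fixed-length prefix and uses a hypothetical helper name; the real iterator works through the configured prefix extractor and internal keys.

```cpp
#include <cstddef>
#include <string>

// Returns true when the first key of the next file has the same fixed-length
// prefix as the seek key. When it returns false, the iterator would be
// invalidated instead of being positioned at the start of the next file.
bool NextFileSharesSeekPrefix(const std::string& seek_key,
                              const std::string& next_file_first_key,
                              std::size_t prefix_len) {
  // compare() clips both ranges to the string lengths, so short keys are safe.
  return seek_key.compare(0, prefix_len, next_file_first_key, 0,
                          prefix_len) == 0;
}
```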

Test Plan: Add a new unit test to validate the behavior; run all existing tests; run crash_test

Differential Revision: D17918213

fbshipit-source-id: f06b47d937c7cc8919001f18dcc3af5b28c9cdac
2019-10-21 11:40:57 -07:00
sdong 30e2dc02f0 Fix VerifyChecksum readahead with mmap mode (#5945)
Summary:
A recent change introduced readahead inside VerifyChecksum(). However it is not compatible with mmap mode and generated wrong checksum verification failure. Fix it by not enabling readahead in mmap mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5945

Test Plan: Add a unit test that used to fail.

Differential Revision: D18021443

fbshipit-source-id: 6f2eb600f81b26edb02222563a4006869d576bff
2019-10-21 11:38:30 -07:00
sdong 1a21afa789 Fix some dependency paths (#5946)
Summary:
Some dependency paths are not correct, so ASAN cannot run with CLANG. Fix them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5946

Test Plan: Run ASAN with CLANG

Differential Revision: D18040933

fbshipit-source-id: 1d82be9d350485cf1df1c792dad765188958641f
2019-10-21 10:41:47 -07:00
Levi Tamasi 29ccf2075c Store the filter bits reader alongside the filter block contents (#5936)
Summary:
Amongst other things, PR https://github.com/facebook/rocksdb/issues/5504 refactored the filter block readers so that
only the filter block contents are stored in the block cache (as opposed to the
earlier design where the cache stored the filter block reader itself, leading to
potentially dangling pointers and concurrency bugs). However, this change
introduced a performance hit since with the new code, the metadata fields are
re-parsed upon every access. This patch reunites the block contents with the
filter bits reader to eliminate this overhead; since this is still a self-contained
pure data object, it is safe to store it in the cache. (Note: this is similar to how
the zstd digest is handled.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5936

Test Plan:
make asan_check

filter_bench results for the old code:

```
$ ./filter_bench -quick
WARNING: Assertions are enabled; benchmarks unnecessarily slow
Building...
Build avg ns/key: 26.7153
Number of filters: 16669
Total memory (MB): 200.009
Bits/key actual: 10.0647
----------------------------
Inside queries...
  Dry run (46b) ns/op: 33.4258
  Single filter ns/op: 42.5974
  Random filter ns/op: 217.861
----------------------------
Outside queries...
  Dry run (25d) ns/op: 32.4217
  Single filter ns/op: 50.9855
  Random filter ns/op: 219.167
    Average FP rate %: 1.13993
----------------------------
Done. (For more info, run with -legend or -help.)

$ ./filter_bench -quick -use_full_block_reader
WARNING: Assertions are enabled; benchmarks unnecessarily slow
Building...
Build avg ns/key: 26.5172
Number of filters: 16669
Total memory (MB): 200.009
Bits/key actual: 10.0647
----------------------------
Inside queries...
  Dry run (46b) ns/op: 32.3556
  Single filter ns/op: 83.2239
  Random filter ns/op: 370.676
----------------------------
Outside queries...
  Dry run (25d) ns/op: 32.2265
  Single filter ns/op: 93.5651
  Random filter ns/op: 408.393
    Average FP rate %: 1.13993
----------------------------
Done. (For more info, run with -legend or -help.)
```

With the new code:

```
$ ./filter_bench -quick
WARNING: Assertions are enabled; benchmarks unnecessarily slow
Building...
Build avg ns/key: 25.4285
Number of filters: 16669
Total memory (MB): 200.009
Bits/key actual: 10.0647
----------------------------
Inside queries...
  Dry run (46b) ns/op: 31.0594
  Single filter ns/op: 43.8974
  Random filter ns/op: 226.075
----------------------------
Outside queries...
  Dry run (25d) ns/op: 31.0295
  Single filter ns/op: 50.3824
  Random filter ns/op: 226.805
    Average FP rate %: 1.13993
----------------------------
Done. (For more info, run with -legend or -help.)

$ ./filter_bench -quick -use_full_block_reader
WARNING: Assertions are enabled; benchmarks unnecessarily slow
Building...
Build avg ns/key: 26.5308
Number of filters: 16669
Total memory (MB): 200.009
Bits/key actual: 10.0647
----------------------------
Inside queries...
  Dry run (46b) ns/op: 33.2968
  Single filter ns/op: 58.6163
  Random filter ns/op: 291.434
----------------------------
Outside queries...
  Dry run (25d) ns/op: 32.1839
  Single filter ns/op: 66.9039
  Random filter ns/op: 292.828
    Average FP rate %: 1.13993
----------------------------
Done. (For more info, run with -legend or -help.)
```

Differential Revision: D17991712

Pulled By: ltamasi

fbshipit-source-id: 7ea205550217bfaaa1d5158ebd658e5832e60f29
2019-10-18 19:32:59 -07:00
Yanqin Jin c53db172a1 Fix TestIterate for HashSkipList in db_stress (#5942)
Summary:
Since SeekForPrev (used by Prev) is not supported by HashSkipList when prefix is used, we disable it when stress testing HashSkipList.

- Change the default memtablerep to skip list.
- Avoid Prev() when memtablerep is HashSkipList and prefix is used.

Test Plan (on devserver):
```
$make db_stress
$./db_stress -ops_per_thread=10000 -reopen=1 -destroy_db_initially=true -column_families=1 -threads=1 -column_families=1 -memtablerep=prefix_hash
$# or simply
$./db_stress
$./db_stress -memtablerep=prefix_hash
```
Results must print "Verification successful".
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5942

Differential Revision: D18017062

Pulled By: riversand963

fbshipit-source-id: af867e59aa9e6f533143c984d7d529febf232fd7
2019-10-18 15:49:12 -07:00
Peter Dillinger 5f8f2fda0e Refactor / clean up / optimize FullFilterBitsReader (#5941)
Summary:
FullFilterBitsReader, after being created in BloomFilterPolicy, was
responsible for decoding metadata bits. This meant that
FullFilterBitsReader::MayMatch had some metadata checks in order to
implement "always true" or "always false" functionality in the case
of inconsistent or trivial metadata. This made for ugly
mixing-of-concerns code and probably had some runtime cost. It also
didn't really support plugging in alternative filter implementations
with extensions to the existing metadata schema.

BloomFilterPolicy::GetFilterBitsReader is now (exclusively) responsible
for decoding filter metadata bits and constructing appropriate instances
deriving from FilterBitsReader. "Always false" and "always true" derived
classes allow FullFilterBitsReader not to be concerned with handling of
trivial or inconsistent metadata. This also makes for easy expansion
to alternative filter implementations in new, alternative derived
classes. This change makes calls to FilterBitsReader::MayMatch
*necessarily* virtual because there's now more than one built-in
implementation. Compared with the previous implementation's extra
'if' checks in MayMatch, there's no consistent performance difference,
measured by (an older revision of) filter_bench (differences here seem
to be within noise):

    Inside queries...
    -  Dry run (407) ns/op: 35.9996
    +  Dry run (407) ns/op: 35.2034
    -  Single filter ns/op: 47.5483
    +  Single filter ns/op: 47.4034
    -  Batched, prepared ns/op: 43.1559
    +  Batched, prepared ns/op: 42.2923
    ...
    -  Random filter ns/op: 150.697
    +  Random filter ns/op: 149.403
    ----------------------------
    Outside queries...
    -  Dry run (980) ns/op: 34.6114
    +  Dry run (980) ns/op: 34.0405
    -  Single filter ns/op: 56.8326
    +  Single filter ns/op: 55.8414
    -  Batched, prepared ns/op: 48.2346
    +  Batched, prepared ns/op: 47.5667
    -  Random filter ns/op: 155.377
    +  Random filter ns/op: 153.942
         Average FP rate %: 1.1386
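
The split described above (a factory decodes the metadata once and picks a derived reader, so the real reader's MayMatch needs no metadata checks) can be sketched roughly as follows. This is a minimal illustration, not the RocksDB code: the class names, the toy layout (filter bits followed by a single num_probes byte), the hashing, and the choice of which condition maps to "always false" versus "always true" are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <utility>

// Simplified stand-in for the reader interface (hypothetical name).
class SketchFilterReader {
 public:
  virtual ~SketchFilterReader() = default;
  virtual bool MayMatch(const std::string& key) const = 0;
};

// Chosen for trivial metadata (no filter bits at all): nothing can match.
class AlwaysFalseReader : public SketchFilterReader {
 public:
  bool MayMatch(const std::string&) const override { return false; }
};

// Chosen for inconsistent metadata: answer "maybe" for every key.
class AlwaysTrueReader : public SketchFilterReader {
 public:
  bool MayMatch(const std::string&) const override { return true; }
};

// Toy Bloom-style reader. It can assume valid metadata because the
// factory below has already filtered out the degenerate cases.
class ToyBitsReader : public SketchFilterReader {
 public:
  ToyBitsReader(std::string bits, int num_probes)
      : bits_(std::move(bits)), num_probes_(num_probes) {}

  bool MayMatch(const std::string& key) const override {
    const std::size_t total_bits = bits_.size() * 8;
    std::size_t h = std::hash<std::string>{}(key);
    for (int i = 0; i < num_probes_; ++i) {
      const std::size_t bit = h % total_bits;
      const unsigned char byte = static_cast<unsigned char>(bits_[bit / 8]);
      if ((byte & (1u << (bit % 8))) == 0) {
        return false;  // a probed bit is unset: key is definitely absent
      }
      h = h * 31 + 17;  // derive the next probe position
    }
    return true;
  }

 private:
  std::string bits_;
  int num_probes_;
};

// Factory: the only place that inspects metadata (toy layout: bits + 1 byte).
std::unique_ptr<SketchFilterReader> MakeReader(const std::string& contents) {
  if (contents.size() <= 1) {
    return std::make_unique<AlwaysFalseReader>();  // no filter bits present
  }
  const int num_probes = static_cast<unsigned char>(contents.back());
  if (num_probes == 0) {
    return std::make_unique<AlwaysTrueReader>();  // unusable metadata
  }
  return std::make_unique<ToyBitsReader>(
      contents.substr(0, contents.size() - 1), num_probes);
}
```

Every call now goes through a virtual MayMatch, which is the cost the benchmark above checks for.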

Also, the FullFilterBitsReader ctor was responsible for a surprising
amount of CPU in production, due in part to inefficient determination of
the CACHE_LINE_SIZE used to construct the filter being read. The
overwhelmingly common case (same as my CACHE_LINE_SIZE) is now
substantially optimized, as shown with filter_bench with
-new_reader_every=1 (old option - see below) (repeatable result):

    Inside queries...
    -  Dry run (453) ns/op: 118.799
    +  Dry run (453) ns/op: 105.869
    -  Single filter ns/op: 82.5831
    +  Single filter ns/op: 74.2509
    ...
    -  Random filter ns/op: 224.936
    +  Random filter ns/op: 194.833
    ----------------------------
    Outside queries...
    -  Dry run (aa1) ns/op: 118.503
    +  Dry run (aa1) ns/op: 104.925
    -  Single filter ns/op: 90.3023
    +  Single filter ns/op: 83.425
    ...
    -  Random filter ns/op: 220.455
    +  Random filter ns/op: 175.7
         Average FP rate %: 1.13886

However, PR #5936 has reclaimed or will reclaim most of this cost. After that PR, the benefit of optimizing this code path is likely negligible, but it is nonetheless clear that this change does not make performance any worse.

Also fixed inadequate check of consistency between filter data size and
num_lines. (Unit test updated.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5941

Test Plan:
previously added unit tests FullBloomTest.CorruptFilters and
FullBloomTest.RawSchema

Differential Revision: D18018353

Pulled By: pdillinger

fbshipit-source-id: 8e04c2b4a7d93223f49a237fd52ef2483929ed9c
2019-10-18 14:50:52 -07:00
Peter Dillinger fe464bca5c Fix PlainTableReader not to crash sst_dump (#5940)
Summary:
Plain table SSTs could crash sst_dump because of a bug in
PlainTableReader that can leave table_properties_ as null. Even if it
was intended not to keep the table properties in some cases, they were
leaked on the offending code path.

Steps to reproduce:

    $ db_bench --benchmarks=fillrandom --num=2000000 --use_plain_table --prefix-size=12
    $ sst_dump --file=0000xx.sst --show_properties
    from [] to []
    Process /dev/shm/dbbench/000014.sst
    Sst file format: plain table
    Raw user collected properties
    ------------------------------
    Segmentation fault (core dumped)

Also added missing unit testing of plain table full_scan_mode, and
an assertion in NewIterator to check for regression.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5940

Test Plan: new unit test, manual, make check

Differential Revision: D18018145

Pulled By: pdillinger

fbshipit-source-id: 4310c755e824c4cd6f3f86a3abc20dfa417c5e07
2019-10-18 14:44:42 -07:00
Zhichao Cao 526e3b9763 Enable trace_replay with multi-threads (#5934)
Summary:
In the current trace replay, all queries are serialized and issued by a single thread, which may not closely simulate the query pattern of the original application. This PR implements multi-threaded replay: users can set the number of threads used to replay the trace, and the queries generated from the trace records are scheduled in the thread pool's job queue.
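
As a rough illustration of scheduling replayed queries on a pool of worker threads through a shared job queue, here is a minimal standard-library sketch. It is not the replayer's actual implementation: `TraceRecord`, `ReplayWithThreads`, and the `execute` callback are hypothetical names, and the sketch ignores trace timestamps and query types.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for one decoded trace record.
struct TraceRecord {
  std::string key;
};

// Replays `records` using `num_threads` workers that pull jobs from a shared
// queue. Returns the number of records executed.
int ReplayWithThreads(const std::vector<TraceRecord>& records, int num_threads,
                      const std::function<void(const TraceRecord&)>& execute) {
  assert(num_threads > 0);
  std::mutex mu;
  std::condition_variable cv;
  std::queue<const TraceRecord*> jobs;
  bool done = false;  // set once the producer has queued every record
  std::atomic<int> executed{0};

  auto worker = [&]() {
    for (;;) {
      const TraceRecord* job = nullptr;
      {
        std::unique_lock<std::mutex> lock(mu);
        cv.wait(lock, [&] { return done || !jobs.empty(); });
        if (jobs.empty()) return;  // producer finished and queue is drained
        job = jobs.front();
        jobs.pop();
      }
      execute(*job);  // run the query outside the lock
      executed.fetch_add(1);
    }
  };

  std::vector<std::thread> pool;
  for (int i = 0; i < num_threads; ++i) pool.emplace_back(worker);

  // Producer: read records in trace order and hand them to the pool.
  for (const TraceRecord& r : records) {
    {
      std::lock_guard<std::mutex> lock(mu);
      jobs.push(&r);
    }
    cv.notify_one();
  }
  {
    std::lock_guard<std::mutex> lock(mu);
    done = true;
  }
  cv.notify_all();
  for (std::thread& t : pool) t.join();
  return executed.load();
}
```

With more than one worker, queries are dequeued in trace order but may complete out of order, which is the intended difference from the single-threaded replay.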
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5934

Test Plan: test with make check and real trace replay.

Differential Revision: D17998098

Pulled By: zhichao-cao

fbshipit-source-id: 87eecf6f7c17a9dc9d7ab29dd2af74f6f60212c8
2019-10-18 14:13:50 -07:00
Levi Tamasi 69bd8a2859 Update HISTORY.md with recent BlobDB adjacent changes
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5939

Differential Revision: D18009096

Pulled By: ltamasi

fbshipit-source-id: 032a48a302f9da38aecf4055b5a8d4e1dffd9dc7
2019-10-18 10:24:23 -07:00
Yanqin Jin e60cc0925c Expose db stress tests (#5937)
Summary:
Expose the db stress test by providing db_stress_tool.h as a public header.
This PR does the following:
- adds a new header, db_stress_tool.h, in include/rocksdb/
- renames db_stress.cc to db_stress_tool.cc
- adds a db_stress.cc which simply invokes a test function.
- updates the Makefile accordingly.

Test Plan (dev server):
```
make db_stress
./db_stress
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5937

Differential Revision: D17997647

Pulled By: riversand963

fbshipit-source-id: 1a8d9994f89ce198935566756947c518f0052410
2019-10-18 09:46:44 -07:00
Levi Tamasi fdc1cb43a6 Support decoding blob indexes in sst_dump (#5926)
Summary:
The patch adds a new command line parameter --decode_blob_index to sst_dump.
If this switch is specified, sst_dump prints blob indexes in a human readable format,
printing the blob file number, offset, size, and expiration (if applicable) for blob
references, and the blob value (and expiration) for inlined blobs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5926

Test Plan:
Used db_bench's BlobDB mode to generate SST files containing blob references with
and without expiration, as well as inlined blobs with and without expiration (note: the
latter are stored as plain values), and confirmed sst_dump correctly prints all four types
of records.

Differential Revision: D17939077

Pulled By: ltamasi

fbshipit-source-id: edc5f58fee94ba35f6699c6a042d5758f5b3963d
2019-10-17 19:36:54 -07:00
Yi Wu 1f9d7c0f54 Fix OnFlushCompleted fired before flush result write to MANIFEST (#5908)
Summary:
When there are concurrent flush jobs on the same CF, `OnFlushCompleted` can be called before the flush result is installed to the LSM. Fix the issue by passing `FlushJobInfo` through `MemTable`, so that the thread that commits the flush result can fetch the `FlushJobInfo` and fire `OnFlushCompleted` on behalf of the thread that actually wrote the SST.

Fix https://github.com/facebook/rocksdb/issues/5892
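
The ordering idea can be illustrated with a small single-threaded model. Everything here is an assumption made for the sketch rather than RocksDB code: the type names are hypothetical, locking is omitted, and the model assumes flush results are installed oldest-first, so that a newer memtable whose SST is already written waits for the older one and is then committed by whichever thread commits the older one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the info passed to OnFlushCompleted.
struct SketchFlushInfo {
  int file_number = 0;
};

// Each memtable carries the info of the flush job that wrote its SST.
struct SketchMemTable {
  bool flushed = false;  // SST written, result not necessarily installed
  SketchFlushInfo info;
};

class SketchMemTableList {
 public:
  // Registers an immutable memtable; returns its index (older = smaller).
  std::size_t Add() {
    tables_.emplace_back();
    return tables_.size() - 1;
  }

  // Called by the thread that wrote the SST: attach the info to the memtable
  // instead of firing the callback right away.
  void MarkFlushed(std::size_t idx, int file_number) {
    tables_.at(idx).flushed = true;
    tables_.at(idx).info.file_number = file_number;
  }

  // Called by the committing thread: install every leading flushed memtable
  // into the (toy) manifest, then return the infos whose callbacks may fire.
  std::vector<SketchFlushInfo> CommitCompleted() {
    std::vector<SketchFlushInfo> to_notify;
    while (committed_ < tables_.size() && tables_[committed_].flushed) {
      manifest_.push_back(tables_[committed_].info.file_number);  // install
      to_notify.push_back(tables_[committed_].info);  // notify afterwards
      ++committed_;
    }
    return to_notify;
  }

  const std::vector<int>& manifest() const { return manifest_; }

 private:
  std::vector<SketchMemTable> tables_;
  std::size_t committed_ = 0;
  std::vector<int> manifest_;
};
```

In this model the caller fires the listener only for the infos returned by `CommitCompleted`, so every notification refers to a result that is already installed.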
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5908

Test Plan: Add new test. The test will fail without the fix.

Differential Revision: D17916144

Pulled By: riversand963

fbshipit-source-id: e18df67d9533b5baee52ae3605026cdeb05cbe10
2019-10-16 10:40:23 -07:00
Maysam Yabandeh 2c9e9f2a59 Update HISTORY for SeekForPrev bug fix (#5925)
Summary:
Update history for the bug fix in https://github.com/facebook/rocksdb/pull/5907
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5925

Differential Revision: D17952605

Pulled By: maysamyabandeh

fbshipit-source-id: 609afcbb2e4087f9153822c4d11193a75a7b0e7a
2019-10-16 07:59:26 -07:00
Yanqin Jin 5ef27dea33 Fix clang analyzer error (#5924)
Summary:
Without this PR, clang analyzer complains.
```
$USE_CLANG=1 make analyze
db/compaction/compaction_job_test.cc:161:20: warning: The left operand of '==' is a garbage value
      if (key.type == kTypeBlobIndex) {
                ~~~~~~~~ ^
                1 warning generated.
```

Test Plan (on devserver)
```
$USE_CLANG=1 make analyze
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5924

Differential Revision: D17923226

Pulled By: riversand963

fbshipit-source-id: 9d1eb769b5e0de7cb3d89dc90d1cfa895db7fdc8
2019-10-14 22:14:24 -07:00
Levi Tamasi 78b28d80b0 Support non-TTL Puts for BlobDB in db_bench (#5921)
Summary:
Currently, db_bench only supports PutWithTTL operations for BlobDB but
not regular Puts. The patch adds support for regular (non-TTL) Puts and also
changes the default for blob_db_max_ttl_range to zero, which corresponds
to no TTL.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5921

Test Plan:
make check

./db_bench -benchmarks=fillrandom -statistics -stats_interval_seconds=1
-duration=90 -num=500000 -use_blob_db=1 -blob_db_file_size=1000000
-target_file_size_base=1000000 (issues Put operations with no TTL)

./db_bench -benchmarks=fillrandom -statistics -stats_interval_seconds=1
-duration=90 -num=500000 -use_blob_db=1 -blob_db_file_size=1000000
-target_file_size_base=1000000 -blob_db_max_ttl_range=86400 (issues
PutWithTTL operations with random TTLs in the [0, blob_db_max_ttl_range)
interval, as before)

Differential Revision: D17919798

Pulled By: ltamasi

fbshipit-source-id: b946c3522b836b92b4c157ffbad24f92ba2b0a16
2019-10-14 17:49:20 -07:00
Peter Dillinger 93edd51c4a bloom_test.cc: include <array> (#5920)
Summary:
Fix build failure on some platforms, reported in issue https://github.com/facebook/rocksdb/issues/5914
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5920

Test Plan: make bloom_test && ./bloom_test

Differential Revision: D17918328

Pulled By: pdillinger

fbshipit-source-id: b822004d4442de0171db2aeff433677783f7b94e
2019-10-14 15:38:31 -07:00
Levi Tamasi 5f025ea832 BlobDB GC: add SST <-> oldest blob file referenced mapping (#5903)
Summary:
This is groundwork for adding garbage collection support to BlobDB. The
patch adds logic that keeps track of the oldest blob file referred to by
each SST file. The oldest blob file is identified during flush/
compaction (similarly to how the range of keys covered by the SST is
identified), and persisted in the manifest as a custom field of the new
file edit record. Blob indexes with TTL are ignored for the purposes of
identifying the oldest blob file (since such blob files are cleaned up by the
TTL logic in BlobDB).
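
A minimal sketch of the tracking rule described above: take the minimum blob file number over the blob references written to one SST, skipping references with TTL. The names `SketchBlobRef`, `OldestBlobFileTracker`, and the `kNoBlobFile` sentinel are hypothetical, and the sketch assumes that a smaller file number means an older blob file.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sentinel meaning "this SST references no (non-TTL) blob file".
constexpr std::uint64_t kNoBlobFile = 0;

// Hypothetical decoded form of a blob reference stored in an SST.
struct SketchBlobRef {
  std::uint64_t blob_file_number;
  bool has_ttl;
};

// Accumulates the oldest referenced blob file while an SST is being built,
// in the same spirit as tracking the smallest and largest key.
class OldestBlobFileTracker {
 public:
  void Add(const SketchBlobRef& ref) {
    if (ref.has_ttl) {
      return;  // TTL blobs are cleaned up by the TTL logic, so skip them
    }
    if (oldest_ == kNoBlobFile || ref.blob_file_number < oldest_) {
      oldest_ = ref.blob_file_number;
    }
  }

  // Value that would be persisted alongside the new file's manifest record.
  std::uint64_t oldest() const { return oldest_; }

 private:
  std::uint64_t oldest_ = kNoBlobFile;
};
```

For example, after adding references to files 7 (no TTL), 3 (TTL), and 5 (no TTL), `oldest()` returns 5, because the TTL reference to file 3 is ignored.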
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5903

Test Plan:
Added new unit tests; also ran db_bench in BlobDB mode, inspected the
manifest using ldb, and confirmed (by scanning the SST files using
sst_dump) that the value of the oldest blob file number field matches
the contents of the file for each SST.

Differential Revision: D17859997

Pulled By: ltamasi

fbshipit-source-id: 21662c137c6259a6af70446faaf3a9912c550e90
2019-10-14 15:21:01 -07:00
Levi Tamasi a59dc843a4 Move blob_index.h to db/ (#5919)
Summary:
Extracted from PR https://github.com/facebook/rocksdb/issues/5903 for technical reasons.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5919

Test Plan: make check

Differential Revision: D17910132

Pulled By: ltamasi

fbshipit-source-id: 6ecbb8d6e84b2a1d1f28575ad48ac3cc65833eb5
2019-10-14 12:54:05 -07:00
Yanqin Jin 231fffd07c Add Env::SanitizeEnvOptions (#5885)
Summary:
Add Env::SanitizeEnvOptions to allow underlying environments to properly
configure env options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5885

Test Plan:
```
make check
```

Differential Revision: D17910327

Pulled By: riversand963

fbshipit-source-id: 86a1ac616e485742c35c4a9cc9f1227c529fc00f
2019-10-14 12:25:00 -07:00