Commit graph

8308 commits

Author SHA1 Message Date
Andrew Kryczka 20dd828c01 Avoid clock_gettime on pre-10.12 macOS versions (#5570)
Summary:
On older macOS like 10.10 we saw the following compiler error:

```
/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/env/env_posix.cc:845:19:
error: use of undeclared identifier 'CLOCK_THREAD_CPUTIME_ID'
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
                  ^
```

According to mac's `man clock_gettime`: "These functions first appeared in Mac
OSX 10.12". So we should not try to compile it on earlier versions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5570

Test Plan:
verified it compiles now on 10.10. Also did some investigation to
ensure it does not cause regression on macOS 10.12+, although I do not
have access to such an environment to really test.

Differential Revision: D17322629

Pulled By: riversand963

fbshipit-source-id: e0a412223854f826b4d83e6d15c3739ff4620d7d
2019-09-11 14:07:25 -07:00
tongyingrui c85c87a718 test size was wrong in 'fillbatch' benchmark (#5198)
Summary:
for fillbatch benchmar, the numEntries should be [num_] but not [num_ / 1000] because numEntries is just the total entries we want to test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5198

Differential Revision: D17274664

Pulled By: anand1976

fbshipit-source-id: f96e952babdbac63fb99d14e1254d478a10437be
2019-09-11 12:04:44 -07:00
anand76 2becafdb43 Fix Appveyor build due to signed/unsigned comparison
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5788

Test Plan: Travis CI and Appveyor should complete successfully.

Differential Revision: D17287422

Pulled By: anand1976

fbshipit-source-id: d9408b692f78be95d0088b29b33f6a8ff40ec97b
2019-09-10 14:34:37 -07:00
anand76 eb9026f09b Add a db_bench benchmark to warm up the row cache
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5707

Differential Revision: D17242698

Pulled By: anand1976

fbshipit-source-id: 5d1bfda3c9e8f56176ae391cae6c91e6262016b8
2019-09-10 11:06:36 -07:00
jsteemann 4d945c57ac do a bit less work in the normal case (#5695)
Summary:
i.e. if alive logfile is not being moved to archive while we are in GetSortedWalsOfType()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5695

Differential Revision: D17279489

Pulled By: vjnadimpalli

fbshipit-source-id: 02bcf920a75b812edba8b87c6079b4e6fd5e683c
2019-09-10 09:41:45 -07:00
Richard He 699e1b5ede Added support for SstFileReader JNI interface (#5556)
Summary:
Feature request as per https://github.com/facebook/rocksdb/issues/5538 issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5556

Differential Revision: D17219008

fbshipit-source-id: e31f18dec318416eac9dea8213bab31da96e1f3a
2019-09-09 18:12:53 -07:00
Peter Dillinger 7af6ced14b Fix block allocation bug in new DynamicBloom (#5783)
Summary:
Bug found by valgrind. New DynamicBloom wasn't allocating in
block sizes. New assertion added that probes starting in final word
would be in bounds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5783

Test Plan: ROCKSDB_VALGRIND_RUN=1 DISABLE_JEMALLOC=1 valgrind --leak-check=full ./dynamic_bloom_test

Differential Revision: D17270623

Pulled By: pdillinger

fbshipit-source-id: 1e0407504b875133a771383cd488c70f91be2b87
2019-09-09 15:26:43 -07:00
Peter Dillinger 108c619acb Add regression test for serialized Bloom filters (#5778)
Summary:
Check that we don't accidentally change the on-disk format of
existing Bloom filter implementations, including for various
CACHE_LINE_SIZE (by changing temporarily).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5778

Test Plan: thisisthetest

Differential Revision: D17269630

Pulled By: pdillinger

fbshipit-source-id: c77017662f010a77603b7d475892b1f0d5563d8b
2019-09-09 14:51:30 -07:00
Wilfried Goesgens fbab9913e2 upgrade gtest 1.7.0 => 1.8.1 for json result writing
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5332

Differential Revision: D17242232

fbshipit-source-id: c0d4646556a1335e51ac7382b986ca7f6ced7b64
2019-09-09 11:24:11 -07:00
sdong adbc25a4c8 Rename InternalDBStatsType enum names (#5779)
Summary:
When building with clang 9, warning is reported for InternalDBStatsType type names shadowed the one for statistics. Rename them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5779

Test Plan: Build with clang 9 and see it passes.

Differential Revision: D17239378

fbshipit-source-id: af28fb42066c738cd1b841f9fe21ab4671dafd18
2019-09-06 17:31:10 -07:00
houbingjian cbfa729d37 cmakelist fix, add +crypto flag when use arm crc (#5750)
Summary:
cmake list add +crypto flag when use armv8 cpu

the function crc32c_arm64 use HAVE_ARM64_CRYPTO to check if can enable arm-neon instructions :

#ifdef HAVE_ARM64_CRYPTO
  /* Crc32c Parallel computation
   *   Algorithm comes from Intel whitepaper:
   *   crc-iscsi-polynomial-crc32-instruction-paper
   *
   * Input data is divided into three equal-sized blocks
   *   Three parallel blocks (crc0, crc1, crc2) for 1024 Bytes
   *   One Block: 42(BLK_LENGTH) * 8(step length: crc32c_u64) bytes
   */

but the cmakelist not check and pass crypto flag now

I check the default Makefile has it:

ifeq (,$(shell $(CXX) -fsyntax-only -march=armv8-a+crc -xc /dev/null 2>&1))
CXXFLAGS += -march=armv8-a+crc+crypto
CFLAGS += -march=armv8-a+crc+crypto
ARMCRC_SOURCE=1
endif
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5750

Differential Revision: D17242027

fbshipit-source-id: 443c9b89755b4bc34e265205ab922db1b2e14bde
2019-09-06 17:03:21 -07:00
Maysam Yabandeh 78b8cfc7ec WriteUnPrepared: Split ReadYourOwnWriteStress to three (#5776)
Summary:
ReadYourOwnWriteStress occasionally times out on some platforms. The patch splits it to three.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5776

Differential Revision: D17231743

Pulled By: maysamyabandeh

fbshipit-source-id: d42eeaf22f61a48d50f9c404d98b1081ae8dac94
2019-09-06 15:25:26 -07:00
Manuel Ung 2208cc0196 Fix build break in TransactionBaseImpl::TrackKey (#5771)
Summary:
Fix build broken in https://github.com/facebook/rocksdb/pull/5696.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5771

Differential Revision: D17217665

Pulled By: lth

fbshipit-source-id: 7aa84a2a9b4feb7a3ab1cab174e09276430fe042
2019-09-06 10:18:04 -07:00
奏之章 533e47709c Fix WriteBatchWithIndex with MergeOperator bug (#5577)
Summary:
```
TEST_F(WriteBatchWithIndexTest, TestGetFromBatchAndDBMerge3) {
  DB* db;
  Options options;

  options.create_if_missing = true;
  std::string dbname = test::PerThreadDBPath("write_batch_with_index_test");

  options.merge_operator = MergeOperators::CreateFromStringId("stringappend");

  DestroyDB(dbname, options);
  Status s = DB::Open(options, dbname, &db);
  assert(s.ok());

  ReadOptions read_options;
  WriteOptions write_options;
  FlushOptions flush_options;
  std::string value;

  WriteBatchWithIndex batch;

  ASSERT_OK(db->Put(write_options, "A", "1"));
  ASSERT_OK(db->Flush(flush_options, db->DefaultColumnFamily()));
  ASSERT_OK(batch.Merge("A", "2"));

  ASSERT_OK(batch.GetFromBatchAndDB(db, read_options, "A", &value));
  ASSERT_EQ(value, "1,2");

  delete db;
  DestroyDB(dbname, options);
}
```
Fix ASSERT in batch.GetFromBatchAndDB()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5577

Differential Revision: D16379847

fbshipit-source-id: b1320e24ec8e71350c525083cc0a16180a63f752
2019-09-05 17:52:14 -07:00
Richard He cfc20019d1 Fixed FALLOC_FL_KEEP_SIZE undefined (#5614)
Summary:
Fix `error: ‘FALLOC_FL_KEEP_SIZE’` undeclared error in `io_posix.cc` during Vagrant build in CentOS as per issue https://github.com/facebook/rocksdb/issues/5599
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5614

Differential Revision: D17217960

fbshipit-source-id: ef736c51b16833107fd9ccc7917ed1def2a8d02c
2019-09-05 17:37:21 -07:00
Jeffrey Xiao eae9f040eb Initialized pinned_pos_ and pinned_seq_pos_ in FragmentedRangeTombstoneIterator (#5720)
Summary:
These uninitialized member variables can cause a key to not be pinned when it should be, causing erroneous behavior. For example ingesting a file with range deletion tombstones will yield an "external file have corrupted keys" on a Mac.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5720

Differential Revision: D17217673

fbshipit-source-id: cd7df7ce3ad9cf69c841c4d3dc6fd144eff9e212
2019-09-05 17:30:29 -07:00
Yi Wu 83b991922e Fix EncryptedEnv assert (#5735)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/5734. By reading the code the assert don't quite make sense to me, since `dataSize` and `fileOffset` has no correlation. But my knowledge about `EncryptedEnv` is very limited.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5735

Test Plan:
run `ENCRYPTED_ENV=1 ./db_encryption_test`

Signed-off-by: Yi Wu <yiwu@pingcap.com>

Differential Revision: D17133849

fbshipit-source-id: bb7262d308e5b2503c400b180edc252668df0ef0
2019-09-05 17:21:42 -07:00
Andrew Kryczka 43a5cdb58c remove unused #include to fix musl libc build (#5583)
Summary:
The `#include "core_local.h"` was pulling in libgcc's `posix_memalign()`
declaration. That declaration specifies `throw()` whereas musl libc's
declaration does not. This was leading to the following compiler error
when using musl libc:

```
In file included from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/port/jemalloc_helper.h:26:0,
                 from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/util/jemalloc_nodump_allocator.h:11,
                 from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/util/jemalloc_nodump_allocator.cc:6:
/go/native/x86_64-unknown-linux-musl/jemalloc/include/jemalloc/jemalloc.h:63:29: error: declaration of 'int posix_memalign(void**, size_t, size_t) throw ()' has a different exception specifier
 #  define je_posix_memalign posix_memalign
                             ^
/go/native/x86_64-unknown-linux-musl/jemalloc/include/jemalloc/jemalloc.h:63:29: note: from previous declaration 'int posix_memalign(void**, size_t, size_t)'
 #  define je_posix_memalign posix_memalign
                             ^
/go/native/x86_64-unknown-linux-musl/jemalloc/include/jemalloc/jemalloc.h:202:38: note: in expansion of macro 'je_posix_memalign'
 JEMALLOC_EXPORT int JEMALLOC_NOTHROW je_posix_memalign(void **memptr,
                                      ^~~~~~~~~~~~~~~~~
make[4]: *** [CMakeFiles/rocksdb.dir/util/jemalloc_nodump_allocator.cc.o] Error 1
```

Since `#include "core_local.h"` is not actually used, we can just remove
it. I verified that fixes the build.

There was a related PR here (https://github.com/facebook/rocksdb/issues/2188), although the problem description is
slightly different.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5583

Differential Revision: D16343227

fbshipit-source-id: 0386bc2b5fd55b2c3b5fba19382014efa52e44f8
2019-09-05 17:18:49 -07:00
HouBingjian ac97e6930f bloom test check fail on arm (#5745)
Summary:
FullFilterBitsBuilder::CalculateSpace use CACHE_LINE_SIZE which is 64@X86 but 128@ARM64
when it run bloom_test.FullVaryingLengths it failed on ARM64 server,
the assert can be fixed by change  128->CACHE_LINE_SIZE*2 as merged
ASSERT_LE(FilterSize(), (size_t)((length * 10 / 8) + CACHE_LINE_SIZE * 2 + 5)) << length;

run  bloom_test
before fix:
/root/rocksdb-master/util/bloom_test.cc:281: Failure
Expected: (FilterSize()) <= ((size_t)((length * 10 / 8) + 128 + 5)), actual: 389 vs 383
200
[  FAILED  ] FullBloomTest.FullVaryingLengths (32 ms)
[----------] 4 tests from FullBloomTest (32 ms total)

[----------] Global test environment tear-down
[==========] 7 tests from 2 test cases ran. (116 ms total)
[  PASSED  ] 6 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] FullBloomTest.FullVaryingLengths

after fix:
Filters: 37 good, 0 mediocre
[       OK ] FullBloomTest.FullVaryingLengths (90 ms)
[----------] 4 tests from FullBloomTest (90 ms total)

[----------] Global test environment tear-down
[==========] 7 tests from 2 test cases ran. (174 ms total)
[  PASSED  ] 7 tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5745

Differential Revision: D17076047

fbshipit-source-id: e7beb5d55d4855fceb2b84bc8119a6b0759de635
2019-09-05 17:03:24 -07:00
Peter Dillinger b55b2f45d0 Faster new DynamicBloom implementation (for memtable) (#5762)
Summary:
Since DynamicBloom is now only used in-memory, we're free to
change it without schema compatibility issues. The new implementation
is drawn from (with manifest permission)
303542a767/bloom_simulation_tests/foo.cc (L613)

This has several speed advantages over the prior implementation:
* Uses fastrange instead of %
* Minimum logic to determine first (and all) probed memory addresses
* (Major) Two probes per 64-bit memory fetch/write.
* Very fast and effective (murmur-like) hash expansion/re-mixing. (At
least on recent CPUs, integer multiplication is very cheap.)

While a Bloom filter with 512-bit cache locality has about a 1.15x FP
rate penalty (e.g. 0.84% to 0.97%), further restricting to two probes
per 64 bits incurs an additional 1.12x FP rate penalty (e.g. 0.97% to
1.09%). Nevertheless, the unit tests show no "mediocre" FP rate samples,
unlike the old implementation with more erratic FP rates.

Especially for the memtable, we expect speed to outweigh somewhat higher
FP rates. For example, a negative table query would have to be 1000x
slower than a BF query to justify doubling BF query time to shave 10% off
FP rate (working assumption around 1% FP rate). While that seems likely
for SSTs, my data suggests a speed factor of roughly 50x for the memtable
(vs. BF; ~1.5% lower write throughput when enabling memtable Bloom
filter, after this change).  Thus, it's probably not worth even 5% more
time in the Bloom filter to shave off 1/10th of the Bloom FP rate, or 0.1%
in absolute terms, and it's probably at least 20% slower to recoup that
much FP rate from this new implementation. Because of this, we do not see
a need for a 'locality' option that affects the MemTable Bloom filter
and have decoupled the MemTable Bloom filter from Options::bloom_locality.

Note that just 3% more memory to the Bloom filter (10.3 bits per key vs.
just 10) is able to make up for the ~12% FP rate drop in the new
implementation:

[] # Nearly "ideal" FP-wise but reasonably fast cache-local implementation
[~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_WORM64_FROM32_any.out 10000000 6 10 $RANDOM 100000000
./foo_gcc_IMPL_CACHE_WORM64_FROM32_any.out time: 3.29372 sampled_fp_rate: 0.00985956 ...

[] # Close match to this new implementation
[~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out 10000000 6 10.3 $RANDOM 100000000
./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out time: 2.10072 sampled_fp_rate: 0.00985655 ...

[] # Old locality=1 implementation
[~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_ROCKSDB_DYNAMIC_any.out 10000000 6 10 $RANDOM 100000000
./foo_gcc_IMPL_CACHE_ROCKSDB_DYNAMIC_any.out time: 3.95472 sampled_fp_rate: 0.00988943 ...

Also note the dramatic speed improvement vs. alternatives.

--

Performance unit test: DynamicBloomTest.concurrent_with_perf is updated
to report more precise timing data. (Measure running time of each
thread, not just longest running thread, etc.) Results averaged over
various sizes enabled with --enable_perf and 20 runs each; old dynamic
bloom refers to locality=1, the faster of the old:

old dynamic bloom, avg add latency = 65.6468
new dynamic bloom, avg add latency = 44.3809
old dynamic bloom, avg query latency = 50.6485
new dynamic bloom, avg query latency = 43.2186
old avg parallel add latency = 41.678
new avg parallel add latency = 24.5238
old avg parallel hit latency = 14.6322
new avg parallel hit latency = 12.3939
old avg parallel miss latency = 16.7289
new avg parallel miss latency = 12.2134

Tested on a dedicated 64-bit production machine at Facebook. Significant
improvement all around.

Despite now using std::atomic<uint64_t>, quick before-and-after test on
a 32-bit machine (Intel Atom N270, released 2008) shows no regression in
performance, in some cases modest improvement.

--

Performance integration test (synthetic): with DEBUG_LEVEL=0, used
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillrandom,readmissing,readrandom,stats --num=2000000
and optionally with -memtable_whole_key_filtering -memtable_bloom_size_ratio=0.01
300 runs each configuration.

Write throughput change by enabling memtable bloom:
Old locality=0: -3.06%
Old locality=1: -2.37%
New:            -1.50%
conclusion -> seems to substantially close the gap

Readmissing throughput change by enabling memtable bloom:
Old locality=0: +34.47%
Old locality=1: +34.80%
New:            +33.25%
conclusion -> maybe a small new penalty from FP rate

Readrandom throughput change by enabling memtable bloom:
Old locality=0: +31.54%
Old locality=1: +31.13%
New:            +30.60%
conclusion -> maybe also from FP rate (after memtable flush)

--

Another conclusion we can draw from this new implementation is that the
existing 32-bit hash function is not inherently crippling the Bloom
filter speed or accuracy, below about 5 million keys. For speed, the
implementation is essentially the same whether starting with 32-bits or
64-bits of hash; it just determines whether the first multiplication
after fastrange is a pseudorandom expansion or needed re-mix. Note that
this multiplication can occur while memory is fetching.

For accuracy, in a standard configuration, you need about 5 million
keys before you have about a 1.1x FP penalty due to using a
32-bit hash vs. 64-bit:

[~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out $((5 * 1000 * 1000 * 10)) 6 10 $RANDOM 100000000
./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out time: 2.52069 sampled_fp_rate: 0.0118267 ...
[~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_any.out $((5 * 1000 * 1000 * 10)) 6 10 $RANDOM 100000000
./foo_gcc_IMPL_CACHE_MUL64_BLOCK_any.out time: 2.43871 sampled_fp_rate: 0.0109059
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5762

Differential Revision: D17214194

Pulled By: pdillinger

fbshipit-source-id: ad9da031772e985fd6b62a0e1db8e81892520595
2019-09-05 14:59:25 -07:00
jsteemann 19e8c9b64f use c++17's try_emplace if available (#5696)
Summary:
This avoids rehashing the key in TrackKey() in case the key is not already
in the map of tracked keys, which will happen at least once per key used in a
transaction.

Additionally fix two typos.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5696

Differential Revision: D17210178

Pulled By: lth

fbshipit-source-id: 7e2c28e9e505c1d1c1535d435250cf2b191a6fdf
2019-09-05 13:59:40 -07:00
Peter Dillinger 20dec1401f Copy/split PlainTableBloomV1 from DynamicBloom (refactor) (#5767)
Summary:
DynamicBloom was being used both for memory-only and for on-disk filters, as part of the PlainTable format. To set up enhancements to the memtable Bloom filter, this splits the code into two copies and removes unused features from each copy. Adds test PlainTableDBTest.BloomSchema to ensure no accidental change to that format.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5767

Differential Revision: D17206963

Pulled By: pdillinger

fbshipit-source-id: 6cce8d55305ed0df051b4c58bdc98c8ad81d0553
2019-09-05 10:05:20 -07:00
ENDOH takanao 3f2723a81b fix checking the '-march' flag (#5766)
Summary:
Hi! guys,

I got errors on the ARM machine.

before:

```console
$ make static_lib
...
g++: error: unrecognized argument in option '-march=armv8-a+crc+crypto'
g++: note: valid arguments to '-march=' are: armv2 armv2a armv3 armv3m armv4 armv4t armv5 armv5e armv5t armv5te armv6 armv6-m armv6j armv6k armv6kz armv6s-m armv6t2 armv6z armv6zk armv7 armv7-a armv7-m armv7-r armv7e-m armv7ve armv8-a armv8-a+crc armv8.1-a armv8.1-a+crc iwmmxt iwmmxt2 native
```

Thanks!
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5766

Differential Revision: D17191117

fbshipit-source-id: 7a61e3a2a4a06f37faeb8429bd7314da54ec5868
2019-09-04 14:34:28 -07:00
Maysam Yabandeh f9fb9f1421 Add a unit test to detect infinite loops with reseek optimizations (#5727)
Summary:
Iterators reseek to the target key after iterating over max_sequential_skip_in_iterations invalid values. The logic is susceptible to an infinite loop bug, which has been present with WritePrepared Transactions up until 6.2 release. Although the bug is not present on master, the patch adds a unit test to prevent it from resurfacing again.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5727

Differential Revision: D16952759

Pulled By: maysamyabandeh

fbshipit-source-id: d0d973dddc8dfabd5a794931232aa4c862c74f51
2019-09-04 14:31:10 -07:00
Affan Dar 229e6fbe0e Adding DB::GetCurrentWalFile() API as a repliction/backup helper (#5765)
Summary:
Adding a light weight API to get last live WAL file name and size. Meant to be used as a helper for backup/restore tooling in a larger ecosystem such as MySQL with a MyRocks storage engine.

Specifically within MySQL's backup/restore mechanism, this call can be made with a write lock on the mysql db to get a transactionally consistent snapshot of the current WAL file position along with other non-rocksdb log/data files.

Without this, the alternative would be to take the aforementioned lock, scan the WAL dir for all files, find the last file and note its exact size as the rocksdb 'checkpoint'.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5765

Differential Revision: D17172717

Pulled By: affandar

fbshipit-source-id: f2fabafd4c0e6fc45f126670c8c88a9f84cb8a37
2019-09-04 12:10:17 -07:00
Yanqin Jin 38b17ecd0e Replace named comparator struct with lambda (#5768)
Summary:
Tiny code mod: replace a named comparator struct with anonymous lambda.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5768

Differential Revision: D17185141

Pulled By: riversand963

fbshipit-source-id: fabe367649931c33a39ad035dc707d2efc3ad5fc
2019-09-04 11:38:34 -07:00
git-hulk cdb6334e68 MOD: trim last space and comma in perf context and iostat context ToString()
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5755

Differential Revision: D17165190

Pulled By: riversand963

fbshipit-source-id: a3a4633961bfe019bf360f97a4c4d36464e7fa0b
2019-09-03 12:27:17 -07:00
Vijay Nadimpalli 979fbdc696 Persistent globally unique DB ID in manifest (#5725)
Summary:
Each DB has a globally unique ID. A DB can be physically copied around, or backed-up and restored, and the users should be identify the same DB. This unique ID right now is stored as plain text in file IDENTITY under the DB directory. This approach introduces at least two problems: (1) the file is not checksumed; (2) the source of truth of a DB is the manifest file, which can be copied separately from IDENTITY file, causing the DB ID to be wrong.
The goal of this PR is solve this problem by moving the  DB ID to manifest. To begin with we will write to both identity file and manifest. Write to Manifest is controlled via the flag write_dbid_to_manifest in Options and default is false.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5725

Test Plan: Added unit tests.

Differential Revision: D16963840

Pulled By: vjnadimpalli

fbshipit-source-id: 8a86a4c8c82c716003c40fd6b9d2d758030d92e9
2019-09-03 08:52:24 -07:00
Yanqin Jin 44eca41add Fix a bug in file ingestion (#5760)
Summary:
Before this PR, when the number of column families involved in a file ingestion exceeds 2, a bug in the looping logic prevents correct file number being assigned to each ingestion job.
Also skip deleting non-existing hard links during cleanup-after-failure.

Test plan (devserver)
```
$COMPILE_WITH_ASAN=1 make all
$./external_sst_file_test --gtest_filter=ExternalSSTFileTest/ExternalSSTFileTest.IngestFilesIntoMultipleColumnFamilies_*/*
$makke check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5760

Differential Revision: D17142982

Pulled By: riversand963

fbshipit-source-id: 06c1847a4e7a402647bcf28d124e70f2a0f9daf6
2019-08-30 18:29:07 -07:00
Yanqin Jin 672befea2a Fix assertion failure in FIFO compaction with TTL (#5754)
Summary:
Before this PR, the following sequence of events can cause assertion failure as shown below.
Stack trace (partial):
```
(gdb) bt
2  0x00007f59b350ad15 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x9f8390 "mark_as_compacted ? !inputs_[i][j]->being_compacted : inputs_[i][j]->being_compacted", file=file@entry=0x9e347c "db/compaction/compaction.cc", line=line@entry=395, function=function@entry=0xa21ec0 <rocksdb::Compaction::MarkFilesBeingCompacted(bool)::__PRETTY_FUNCTION__> "void rocksdb::Compaction::MarkFilesBeingCompacted(bool)") at assert.c:92
3  0x00007f59b350adc3 in __GI___assert_fail (assertion=assertion@entry=0x9f8390 "mark_as_compacted ? !inputs_[i][j]->being_compacted : inputs_[i][j]->being_compacted", file=file@entry=0x9e347c "db/compaction/compaction.cc", line=line@entry=395, function=function@entry=0xa21ec0 <rocksdb::Compaction::MarkFilesBeingCompacted(bool)::__PRETTY_FUNCTION__> "void rocksdb::Compaction::MarkFilesBeingCompacted(bool)") at assert.c:101
4  0x0000000000492ccd in rocksdb::Compaction::MarkFilesBeingCompacted (this=<optimized out>, mark_as_compacted=<optimized out>) at db/compaction/compaction.cc:394
5  0x000000000049467a in rocksdb::Compaction::Compaction (this=0x7f59af013000, vstorage=0x7f581af53030, _immutable_cf_options=..., _mutable_cf_options=..., _inputs=..., _output_level=<optimized out>, _target_file_size=0, _max_compaction_bytes=0, _output_path_id=0, _compression=<incomplete type>, _compression_opts=..., _max_subcompactions=0, _grandparents=..., _manual_compaction=false, _score=4, _deletion_compaction=true, _compaction_reason=rocksdb::CompactionReason::kFIFOTtl) at db/compaction/compaction.cc:241
6  0x00000000004af9bc in rocksdb::FIFOCompactionPicker::PickTTLCompaction (this=0x7f59b31a6900, cf_name=..., mutable_cf_options=..., vstorage=0x7f581af53030, log_buffer=log_buffer@entry=0x7f59b1bfa930) at db/compaction/compaction_picker_fifo.cc:101
7  0x00000000004b0771 in rocksdb::FIFOCompactionPicker::PickCompaction (this=0x7f59b31a6900, cf_name=..., mutable_cf_options=..., vstorage=0x7f581af53030, log_buffer=0x7f59b1bfa930) at db/compaction/compaction_picker_fifo.cc:201
8  0x00000000004838cc in rocksdb::ColumnFamilyData::PickCompaction (this=this@entry=0x7f59b31b3700, mutable_options=..., log_buffer=log_buffer@entry=0x7f59b1bfa930) at db/column_family.cc:933
9  0x00000000004f3645 in rocksdb::DBImpl::BackgroundCompaction (this=this@entry=0x7f59b3176000, made_progress=made_progress@entry=0x7f59b1bfa6bf, job_context=job_context@entry=0x7f59b1bfa760, log_buffer=log_buffer@entry=0x7f59b1bfa930, prepicked_compaction=prepicked_compaction@entry=0x0, thread_pri=rocksdb::Env::LOW) at db/db_impl/db_impl_compaction_flush.cc:2541
10 0x00000000004f5e2a in rocksdb::DBImpl::BackgroundCallCompaction (this=this@entry=0x7f59b3176000, prepicked_compaction=prepicked_compaction@entry=0x0, bg_thread_pri=bg_thread_pri@entry=rocksdb::Env::LOW) at db/db_impl/db_impl_compaction_flush.cc:2312
11 0x00000000004f648e in rocksdb::DBImpl::BGWorkCompaction (arg=<optimized out>) at db/db_impl/db_impl_compaction_flush.cc:2087
```
This can be caused by the following sequence of events.
```
Time
|      thr          bg_compact_thr1                     bg_compact_thr2
|      write
|      flush
|                   mark all l0 as being compacted
|      write
|      flush
|                   add cf to queue again
|                                                       mark all l0 as being
|                                                       compacted, fail the
|                                                       assertion
V
```
Test plan (on devserver)
Since bg_compact_thr1 and bg_compact_thr2 are two threads executing the same
code, it is difficult to use sync point dependency to
coordinate their execution. Therefore, I choose to use db_stress.
```
$TEST_TMPDIR=/dev/shm/rocksdb ./db_stress --periodic_compaction_seconds=1 --max_background_compactions=20 --format_version=2 --memtablerep=skip_list --max_write_buffer_number=3 --cache_index_and_filter_blocks=1 --reopen=20 --recycle_log_file_num=0 --acquire_snapshot_one_in=10000 --delpercent=4 --log2_keys_per_lock=22 --compaction_ttl=1 --block_size=16384 --use_multiget=1 --compact_files_one_in=1000000 --target_file_size_multiplier=2 --clear_column_family_one_in=0 --max_bytes_for_level_base=10485760 --use_full_merge_v1=1 --target_file_size_base=2097152 --checkpoint_one_in=1000000 --mmap_read=0 --compression_type=zstd --writepercent=35 --readpercent=45 --subcompactions=4 --use_merge=0 --write_buffer_size=4194304 --test_batches_snapshots=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_whitebox --use_direct_reads=0 --compact_range_one_in=1000000 --open_files=-1 --destroy_db_initially=0 --progress_reports=0 --compression_zstd_max_train_bytes=0 --snapshot_hold_ops=100000 --enable_pipelined_write=0 --nooverwritepercent=1 --compression_max_dict_bytes=0 --max_key=1000000 --prefixpercent=5 --flush_one_in=1000000 --ops_per_thread=40000 --index_block_restart_interval=7 --cache_size=1048576 --compaction_style=2 --verify_checksum=1 --delrangepercent=1 --use_direct_io_for_flush_and_compaction=0
```
This should see no assertion failure.
Last but not least,
```
$COMPILE_WITH_ASAN=1 make -j32 all
$make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5754

Differential Revision: D17109791

Pulled By: riversand963

fbshipit-source-id: 25fc46101235add158554e096540b72c324be078
2019-08-30 12:42:01 -07:00
Paul O'Shannessy 9a449865e3 Adopt Contributor Covenant
Summary:
In order to foster healthy open source communities, we're adopting the
[Contributor Covenant](https://www.contributor-covenant.org/). It has been
built by open source community members and represents a shared understanding of
what is expected from a healthy community.

Reviewed By: josephsavona, danobi, rdzhabarov

Differential Revision: D17104640

fbshipit-source-id: d210000de686c5f0d97d602b50472d5869bc6a49
2019-08-29 23:21:01 -07:00
Pratik Dhandharia a281822331 Lower the risk for users to run options.force_consistency_checks = true (#5744)
Summary:
Open-source users recently reported two occurrences of LSM-tree corruption (https://github.com/facebook/rocksdb/issues/5558 is one), which would be caught by options.force_consistency_checks = true. options.force_consistency_checks has a usability limitation because it crashes the service once inconsistency is detected. This makes the feature hard to use. Most users serve from multiple RocksDB shards per server and the impacts of crashing the service is higher than it should be.

Instead, we just pass the error back to users without killing the service, and ask them to deal with the problem accordingly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5744

Differential Revision: D17096940

Pulled By: pdhandharia

fbshipit-source-id: b6780039044e265f26ed2ad03c51f4abbe8b603c
2019-08-29 14:07:37 -07:00
anand76 1729779b85 Disable MultiGet row cache test in LITE mode (#5756)
Summary:
Row cache is not supported in LITE mode. So disable the test in that mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5756

Test Plan: make LITE=1 all check

Differential Revision: D17115684

Pulled By: anand1976

fbshipit-source-id: e6433c2e528674645cea76cdfc80ddc473708fc2
2019-08-29 12:13:28 -07:00
Jeremy Taylor c5e12ebfd2 Add Crux to USERS.md
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5718

Differential Revision: D17096939

Pulled By: riversand963

fbshipit-source-id: 4301078d3ca3d54a1c7e841eccad95379cd1570d
2019-08-29 11:29:48 -07:00
Shafreeck Sea ab0645a596 Fix comment of function NotifyCollectTableCollectorsOnFinish (#5738)
Summary:
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5738

Differential Revision: D17097075

Pulled By: riversand963

fbshipit-source-id: ed01b5f59e8eed262a49abe1f96552842d364af1
2019-08-29 10:57:01 -07:00
anand76 e10570331d Support row cache with batched MultiGet (#5706)
Summary:
This PR adds support for row cache in ```rocksdb::TableCache::MultiGet```.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5706

Test Plan:
1. Unit tests in db_basic_test
2. db_bench results with batch size of 2 (```Get``` is faster than ```MultiGet``` for single key) -
Get -
readrandom   :       3.935 micros/op 254116 ops/sec;   28.1 MB/s (22870998 of 22870999 found)
MultiGet -
multireadrandom :       3.743 micros/op 267190 ops/sec; (24047998 of 24047998 found)

Command used -
TEST_TMPDIR=/dev/shm/multiget numactl -C 10  ./db_bench -use_existing_db=true -use_existing_keys=false -benchmarks="readtorowcache,[read|multiread]random" -write_buffer_size=16777216 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -row_cache_size=4194304000 -batch_size=2 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=131072

Differential Revision: D17086297

Pulled By: anand1976

fbshipit-source-id: 85784378da913e05f1baf31ec1b4e7c9345e7f57
2019-08-28 16:11:56 -07:00
sdong 1daff8f85a crash_test to skip compaction TTL for FIFO compaction. (#5749)
Summary:
https://github.com/facebook/rocksdb/pull/5741 added compaction TTL to crash test, but it causes assertion fails for FIFO compaction. Disable this combination for now while we debug the assertion failure.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5749

Test Plan: Run crash test and observe that when compaction_style=2, compaction_ttl is always 0.

Differential Revision: D17078292

fbshipit-source-id: 446821a3b9739956094d5e4f9be1251a15b57f5d
2019-08-27 17:55:37 -07:00
Pratik Dhandharia 1b4c104a67 replace some reinterpret_cast with static_cast_with_check (#5740)
Summary:
This PR focuses on replacing some of the reinterpret_cast<DBImpl*> to static_cast_with_check<DBImpl, DB>.

Files impacted:

./db/db_impl/db_impl_compaction_flush.cc
./db/write_batch.cc
./utilities/blob_db/blob_db_impl.cc
./utilities/transactions/pessimistic_transaction_db.cc
./utilities/transactions/transaction_base.cc
./utilities/transactions/write_prepared_txn_db.cc
./utilities/transactions/write_unprepared_txn_db.cc
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5740

Differential Revision: D17055691

Pulled By: pdhandharia

fbshipit-source-id: 0f8034d1b32eade56e37d59c04b7bf236a81d8e8
2019-08-27 10:59:11 -07:00
sdong 1d6a10f52d Extend stress test to cover periodic compaction and compaction TTL (#5741)
Summary:
Covering periodic compaction and compaction TTL can help us expose potential issues. Add it there.
Randomly select value for these two options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5741

Test Plan: Run crash_test and see the perameters generated.

Differential Revision: D17059515

fbshipit-source-id: 8213974846a0b6a22fc13be705825c9054d1d097
2019-08-26 15:03:25 -07:00
Andrew Kryczka ba0967b567 Reduce severity of too many levels log message (#5742)
Summary:
This condition is now a normal occurrence during write burst so there is
no need to warn the user about it. Here is a scenario where it happens
under completely normal conditions.

* Initially we have a DB of three levels (L0, L1, and L2) that is stable, i.e., compaction scores are all less than one.
* Now a write burst comes along. At first L0 blows up a bit in size as compaction hasn't had a chance to catch up.
* As a result of the above, `base_bytes_min` also increases since it is based on L0 size as of https://github.com/facebook/rocksdb/issues/4338
* If `base_bytes_min` increased enough (i.e., to be larger than L1), then we are shown the warning that the DB has more levels than necessary.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5742

Differential Revision: D17059221

fbshipit-source-id: e4a31d6eea42089a8d273095f19653991bd91bea
2019-08-26 15:00:43 -07:00
jsteemann 62829ff751 reuse scratch buffer in transaction_log_reader (#5702)
Summary:
in order to avoid reallocations for a scratch std::string on every call to Next().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5702

Differential Revision: D16867803

fbshipit-source-id: 1391220a1b172b23336bbc71dc0c79ccf3b1c701
2019-08-26 11:26:29 -07:00
Zhongyi Xie 2f41ecfe75 Refactor trimming logic for immutable memtables (#5022)
Summary:
MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory.
We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one.
The semantics of the old option actually works both as an upper bound and lower bound. History trimming will start if number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming.
In order the mimic the behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022

Differential Revision: D14394062

Pulled By: miasantreble

fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5
2019-08-23 13:55:34 -07:00
DaiZhiwei 26293c89a6 crc32c_arm64 performance optimization (#5675)
Summary:
Crc32c Parallel computation coding optimization:
Macro unfolding removes the "for" loop and is good to decrease branch-miss in arm64 micro architecture
1024 Bytes is divided into  8(head) + 1008( 6 * 7 * 3 * 8 ) + 8(tail)  three parts
Macro unfolding 42 loops to 6 CRC32C7X24BYTESs
1 CRC32C7X24BYTES containing 7 CRC32C24BYTESs

1, crc32c_test
[==========] Running 4 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 4 tests from CRC
[ RUN      ] CRC.StandardResults
[       OK ] CRC.StandardResults (1 ms)
[ RUN      ] CRC.Values
[       OK ] CRC.Values (0 ms)
[ RUN      ] CRC.Extend
[       OK ] CRC.Extend (0 ms)
[ RUN      ] CRC.Mask
[       OK ] CRC.Mask (0 ms)
[----------] 4 tests from CRC (1 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test case ran. (1 ms total)
[  PASSED  ] 4 tests.

2, db_bench --benchmarks="crc32c"
crc32c : 0.218 micros/op 4595390 ops/sec; 17950.7 MB/s (4096 per op)

3, repeated crc32c_test case  60000 times
perf stat -e branch-miss -- ./crc32c_test
before optimization:
739,426,504      branch-miss
after optimization:
1,128,572      branch-miss
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5675

Differential Revision: D16989210

fbshipit-source-id: 7204e6069bb6ed066d49c2d1b3ac385065a98557
2019-08-23 11:04:08 -07:00
Levi Tamasi df8c307d63 Revert to storing UncompressionDicts in the cache (#5645)
Summary:
PR https://github.com/facebook/rocksdb/issues/5584 decoupled the uncompression dictionary object from the underlying block data; however, this defeats the purpose of the digested ZSTD dictionary, since the whole point
of the digest is to create it once and reuse it over and over again. This patch goes back to
storing the uncompression dictionary itself in the cache (which should be now safe to do,
since it no longer includes a Statistics pointer), while preserving the rest of the refactoring.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5645

Test Plan: make asan_check

Differential Revision: D16551864

Pulled By: ltamasi

fbshipit-source-id: 2a7e2d34bb16e70e3c816506d5afe1d842057800
2019-08-23 08:27:30 -07:00
sdong d8a27d9331 Atomic Flush Crash Test also covers the case that WAL is enabled. (#5729)
Summary:
AtomicFlushStressTest is a powerful test, but right now we only run it for atomic_flush=true + disable_wal=true. We further extend it to the case where atomic_flush=false + disable_wal = false. All the workload generation and validation can stay the same.
Atomic flush crash test is also changed to switch between the two test scenarios. It makes the name "atomic flush crash test" out of sync from what it really does. We leave it as it is to avoid troubles with continous test set-up.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5729

Test Plan: Run "CRASH_TEST_KILL_ODD=188 TEST_TMPDIR=/dev/shm/ USE_CLANG=1 make whitebox_crash_test_with_atomic_flush", observe the settings used and see it passed.

Differential Revision: D16969791

fbshipit-source-id: 56e37487000ae631e31b0100acd7bdc441c04163
2019-08-22 16:32:55 -07:00
Patrick Pei 202942b20c Fix local includes
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5722

Differential Revision: D16908380

fbshipit-source-id: 6a0e3cb2730b08d6012d3d7f31c937f01c399846
2019-08-22 16:21:47 -07:00
Maysam Yabandeh 244e6f2002 Refactor MultiGet names in BlockBasedTable (#5726)
Summary:
To improve code readability, since RetrieveBlock already calls MaybeReadBlockAndLoadToCache, we avoid name similarity of the functions that call RetrieveBlock with MaybeReadBlockAndLoadToCache. The patch thus renames MaybeLoadBlocksToCache to RetrieveMultipleBlock and deletes GetDataBlockFromCache, which contains only two lines.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5726

Differential Revision: D16962535

Pulled By: maysamyabandeh

fbshipit-source-id: 99e8946808ce4eb7857592b9003812e3004f92d6
2019-08-22 08:49:00 -07:00
anand76 9046bdc5d3 Fix MultiGet() bug when whole_key_filtering is disabled (#5665)
Summary:
The batched MultiGet() implementation was not correctly handling bloom filter lookups when whole_key_filtering is disabled. It was incorrectly skipping keys not in the prefix_extractor domain, and not calling transform for keys in domain. This PR fixes both problems by moving the domain check and transformation to the FilterBlockReader.

Tests:
Unit test (confirmed failed before the fix)
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5665

Differential Revision: D16902380

Pulled By: anand1976

fbshipit-source-id: a6be81ad68a6e37134a65246aec7a2c590eccf00
2019-08-21 10:23:23 -07:00
Maysam Yabandeh 7bc18e2727 Disable snapshot refresh feature when snap_refresh_nanos is 0 (#5724)
Summary:
The comments of snap_refresh_nanos advertise that the snapshot refresh feature will be disabled when the option is set to 0. This contract is however not honored in the code: https://github.com/facebook/rocksdb/pull/5278
The patch fixes that and also adds an assert to ensure that the feature is not used when the option  is zero.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5724

Differential Revision: D16918185

Pulled By: maysamyabandeh

fbshipit-source-id: fec167287df7d85093e087fc39c0eb243e3bbd7e
2019-08-20 11:40:07 -07:00
sdong 3552473668 Introduce IngestExternalFileOptions.verify_checksums_readahead_size (#5721)
Summary:
Recently readahead is introduced for checksum verifying. However, users cannot override the setting for the checksum verifying before external SST file ingestion. Introduce a new option for the purpose.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5721

Test Plan: Add a new unit test for it.

Differential Revision: D16906896

fbshipit-source-id: 218ec37001ddcc05411cefddbe233d15ab308476
2019-08-20 10:43:39 -07:00