rocksdb

Commit Graph

Author	SHA1	Message	Date
Peter Dillinger	bae6f58696	Basic MultiGet support for partitioned filters (#6757 ) Summary: In MultiGet, access each applicable filter partition only once per batch, rather than for each applicable key. Also, * Fix Bloom stats for MultiGet * Fix/refactor MultiGetContext::Range::KeysLeft, including * Add efficient BitsSetToOne implementation * Assert that MultiGetContext::Range does not go beyond shift range Performance test: Generate db: $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10 -partition_index_and_filters=true ... Before (middle performing run of three; note some missing Bloom stats): $ ./db_bench --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 \| egrep 'micros/op\|block.cache.filter.hit\|bloom.filter.(full\|use)\|number.multiget' multireadrandom : 26.403 micros/op 597517 ops/sec; (548427 of 671968 found) rocksdb.block.cache.filter.hit COUNT : 83443275 rocksdb.bloom.filter.useful COUNT : 0 rocksdb.bloom.filter.full.positive COUNT : 0 rocksdb.bloom.filter.full.true.positive COUNT : 7931450 rocksdb.number.multiget.get COUNT : 385984 rocksdb.number.multiget.keys.read COUNT : 12351488 rocksdb.number.multiget.bytes.read COUNT : 793145000 rocksdb.number.multiget.keys.found COUNT : 7931450 After (middle performing run of three): $ ./db_bench_new --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 \| egrep 'micros/op\|block.cache.filter.hit\|bloom.filter.(full\|use)\|number.multiget' multireadrandom : 21.024 micros/op 752963 ops/sec; (705188 of 863968 found) rocksdb.block.cache.filter.hit COUNT : 49856682 rocksdb.bloom.filter.useful COUNT : 45684579 rocksdb.bloom.filter.full.positive COUNT : 10395458 rocksdb.bloom.filter.full.true.positive COUNT : 9908456 rocksdb.number.multiget.get COUNT : 481984 rocksdb.number.multiget.keys.read COUNT : 15423488 rocksdb.number.multiget.bytes.read COUNT : 990845600 rocksdb.number.multiget.keys.found COUNT : 9908456 So that's about 25% higher throughput even for random keys Pull Request resolved: https://github.com/facebook/rocksdb/pull/6757 Test Plan: unit test included Reviewed By: anand1976 Differential Revision: D21243256 Pulled By: pdillinger fbshipit-source-id: 5644a1468d9e8c8575be02f4e04bc5d62dbbb57f	2020-04-28 14:49:34 -07:00
Yanqin Jin	d4398e08fc	Fix timestamp support for MultiGet (#6748 ) Summary: 1. Avoid nullptr dereference when passing timestamp to KeyContext creation. 2. Construct LookupKey correctly with timestamp when creating MultiGetContext. 3. Compare without timestamp when sorting KeyContexts. Fixes https://github.com/facebook/rocksdb/issues/6745 Test plan (dev server): make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6748 Reviewed By: pdillinger Differential Revision: D21258691 Pulled By: riversand963 fbshipit-source-id: 44e65b759c18b9986947783edf03be4f890bb004	2020-04-27 22:49:56 -07:00
Peter Dillinger	249eff0f30	Stats for redundant insertions into block cache (#6681 ) Summary: Since read threads do not coordinate on loading data into block cache, two threads between Lookup and Insert can end up loading and inserting the same data. This is particularly concerning with cache_index_and_filter_blocks since those are hot and more likely to be race targets if ejected from (or not pre-populated in) the cache. Particularly with moves toward disaggregated / network storage, the cost of redundant retrieval might be high, and we should at least have some hard statistics from which we can estimate impact. Example with full filter thrashing "cliff": $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10 ... $ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((130 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 \| egrep '^rocksdb.block.cache.(.add\|.redundant)' \| grep -v compress \| sort rocksdb.block.cache.add COUNT : 14181 rocksdb.block.cache.add.failures COUNT : 0 rocksdb.block.cache.add.redundant COUNT : 476 rocksdb.block.cache.data.add COUNT : 12749 rocksdb.block.cache.data.add.redundant COUNT : 18 rocksdb.block.cache.filter.add COUNT : 1003 rocksdb.block.cache.filter.add.redundant COUNT : 217 rocksdb.block.cache.index.add COUNT : 429 rocksdb.block.cache.index.add.redundant COUNT : 241 $ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((120 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 \| egrep '^rocksdb.block.cache.(.add\|.redundant)' \| grep -v compress \| sort rocksdb.block.cache.add COUNT : 1182223 rocksdb.block.cache.add.failures COUNT : 0 rocksdb.block.cache.add.redundant COUNT : 302728 rocksdb.block.cache.data.add COUNT : 31425 rocksdb.block.cache.data.add.redundant COUNT : 12 rocksdb.block.cache.filter.add COUNT : 795455 rocksdb.block.cache.filter.add.redundant COUNT : 130238 rocksdb.block.cache.index.add COUNT : 355343 rocksdb.block.cache.index.add.redundant COUNT : 172478 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6681 Test Plan: Some manual testing (above) and unit test covering key metrics is included Reviewed By: ltamasi Differential Revision: D21134113 Pulled By: pdillinger fbshipit-source-id: c11497b5f00f4ffdfe919823904e52d0a1a91d87	2020-04-27 13:20:27 -07:00
Cheng Chang	0a77617820	Disable O_DIRECT in stress test when db directory does not support direct IO (#6727 ) Summary: In crash test, the db directory might be set to /dev/shm or /tmp, in certain environments such as internal testing infrastructure, neither of these directories support direct IO, so direct IO is never enabled in crash test. This PR sets up SyncPoints in direct IO related code paths to disable O_DIRECT flag in calls to `open`, so the direct IO code paths will be executed, all direct IO related assertions will be checked, but no real direct IO request will be issued to the file system. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6727 Test Plan: export CRASH_TEST_EXT_ARGS="--use_direct_reads=1 --mmap_read=0" make -j24 crash_test Reviewed By: zhichao-cao Differential Revision: D21139250 Pulled By: cheng-chang fbshipit-source-id: db9adfe78d91aa4759835b1af91c5db7b27b62ee	2020-04-25 00:01:03 -07:00
Yanqin Jin	e04f3bce4f	Update CURRENT file after best-efforts recovery (#6746 ) Summary: After a successful recovery, the CURRENT file should be updated to point to the valid MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6746 Test Plan: make check Reviewed By: anand1976 Differential Revision: D21189876 Pulled By: riversand963 fbshipit-source-id: 7537b49988c5c425ebe9505a5cc260de351ad79b	2020-04-23 16:21:09 -07:00
Levi Tamasi	bc51e33d9c	Make sure (Shared)BlobFileMetaData are owned by shared_ptrs (#6749 ) Summary: The patch makes a couple of small cleanups to `SharedBlobFileMetaData` and `BlobFileMetaData`: * It makes the constructors private and introduces factory methods to ensure these objects are always owned by `shared_ptr`s. Note that `SharedBlobFileMetaData` has an additional factory that takes a deleter object; we can utilize this to e.g. notify `VersionSet` when a blob file becomes obsolete (which is exactly when `SharedBlobFileMetaData` is destroyed). * It disables move operations explicitly instead of relying on them being suppressed because of a user-declared destructor. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6749 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D21206947 Pulled By: ltamasi fbshipit-source-id: 9094c14cc335b3e226f883e5a0df4f87a5cdeb95	2020-04-23 13:44:29 -07:00
mrambacher	4cbc19d2a1	Add a ConfigOptions for use in comparing objects and converting to/from strings (#6389 ) Summary: The methods in convenience.h are used to compare/convert objects to/from strings. There is a mishmash of parameters in use here with more needed in the future. This PR replaces those parameters with a single structure. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6389 Reviewed By: siying Differential Revision: D21163707 Pulled By: zhichao-cao fbshipit-source-id: f807b4cc7e2b0af3871536b69546b2604dfa81bd	2020-04-21 17:38:17 -07:00
anand76	c1ccd6b6af	Implement deadline support for MultiGet (#6710 ) Summary: Initial implementation of ReadOptions.deadline for MultiGet. If the request takes longer than the deadline, the keys not yet found will be returned with Status::TimedOut(). This implementation enforces the deadline in DBImpl, which is fairly high level. Its best effort and may not check the deadline after every key lookup, but may do so after a batch of keys. In subsequent stages, we will extend this to passing a timeout down to the FileSystem. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6710 Test Plan: Add new unit tests Reviewed By: riversand963 Differential Revision: D21149158 Pulled By: anand1976 fbshipit-source-id: 9f44eecffeb40873f5034ed59a66d21f9f88879e	2020-04-21 14:51:51 -07:00
Tomas Kolda	6ee66cf8f0	Prevents Table Cache to open same files more times (#6707 ) Summary: In highly concurrent requests table cache opens same file more times which lowers purpose of max_open_files. Fixes (https://github.com/facebook/rocksdb/issues/6699) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6707 Reviewed By: ltamasi Differential Revision: D21044965 fbshipit-source-id: f6e91d90b60dad86e518b5147021da42460ee1d2	2020-04-21 13:16:31 -07:00
Akanksha Mahajan	03a1d95db0	Set max_background_flushes dynamically (#6701 ) Summary: 1. Add changes so that max_background_flushes can be set dynamically. 2. Add a testcase DBOptionsTest.SetBackgroundFlushThreads which set the max_background_flushes dynamically using SetDBOptions. TestPlan: 1. make -j64 check 2. Using new testcase DBOptionsTest.SetBackgroundFlushThreads Pull Request resolved: https://github.com/facebook/rocksdb/pull/6701 Reviewed By: ajkr Differential Revision: D21028010 Pulled By: akankshamahajan15 fbshipit-source-id: 5f949e4a8fd3c32537b637947b7ee09a69cfc7c1	2020-04-20 16:19:02 -07:00
Peter Dillinger	31da5e34c1	C++20 compatibility (#6697 ) Summary: Based on https://github.com/facebook/rocksdb/issues/6648 (CLA Signed), but heavily modified / extended: * Implicit capture of this via [=] deprecated in C++20, and [=,this] not standard before C++20 -> now using explicit capture lists * Implicit copy operator deprecated in gcc 9 -> add explicit '= default' definition * std::random_shuffle deprecated in C++17 and removed in C++20 -> migrated to a replacement in RocksDB random.h API * Add the ability to build with different std version though -DCMAKE_CXX_STANDARD=11/14/17/20 on the cmake command line * Minimal rebuild flag of MSVC is deprecated and is forbidden with /std:c++latest (C++20) * Added MSVC 2019 C++11 & MSVC 2019 C++20 in AppVeyor * Added GCC 9 C++11 & GCC9 C++20 in Travis Pull Request resolved: https://github.com/facebook/rocksdb/pull/6697 Test Plan: make check and CI Reviewed By: cheng-chang Differential Revision: D21020318 Pulled By: pdillinger fbshipit-source-id: 12311be5dbd8675a0e2c817f7ec50fa11c18ab91	2020-04-20 13:24:25 -07:00
Peter Dillinger	45d2b4efca	Fix tabs and lint-ignores (#6734 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6734 Reviewed By: cheng-chang Differential Revision: D21134556 Pulled By: pdillinger fbshipit-source-id: 3636cc1d1333137b70031f8277458781c21631fb	2020-04-20 11:39:31 -07:00
Andrew Kryczka	6717ada899	Fix CF import with overlapping SST files (#6663 ) Summary: Invariant checking should use internal key comparator rather than `sstableKeyCompare()`. The latter was intended for checking whether a compaction input file's neighboring files need to be included in the same compaction. Using it for invariant checking was leading to false positives for files with overlapping endpoints. Fixes https://github.com/facebook/rocksdb/issues/6647. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6663 Test Plan: regression test Reviewed By: zhichao-cao Differential Revision: D20910466 Pulled By: ajkr fbshipit-source-id: f0b70dad7c4096fce635cab7a36f16e14f74ae3f	2020-04-16 13:16:06 -07:00
Mike Kolupaev	e45673dece	Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621 ) Summary: Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype. Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling. It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas. Note that the deferred value loading only happens for internal iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621 Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats. Reviewed By: siying Differential Revision: D20786930 Pulled By: al13n321 fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee	2020-04-15 17:40:44 -07:00
Yanqin Jin	eeb3cf3f58	Fix release build (#6690 ) Summary: Fix release build caused by variable defined but unused. Test plan (devserver) ``` make release ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6690 Reviewed By: cheng-chang Differential Revision: D20980571 Pulled By: riversand963 fbshipit-source-id: c3f3b13f81dce4bdb19876dc2e710d5902ff8a02	2020-04-11 22:04:04 -07:00
Yanqin Jin	0c05624d50	Compaction with timestamp: input boundaries (#6645 ) Summary: Towards making compaction logic compatible with user timestamp. When computing boundaries and overlapping ranges for inputs of compaction, We need to compare SSTs by user key without timestamp. Test plan (devserver): ``` make check ``` Several individual tests: ``` ./version_set_test --gtest_filter=VersionStorageInfoTimestampTest.GetOverlappingInputs ./db_with_timestamp_compaction_test ./db_with_timestamp_basic_test ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6645 Reviewed By: ltamasi Differential Revision: D20960012 Pulled By: riversand963 fbshipit-source-id: ad377fa9eb481bf7a8a3e1824aaade48cdc653a4	2020-04-10 16:05:49 -07:00
Akanksha Mahajan	a0faff126d	Report kFilesMarkedForCompaction for delete triggered compactions (#6680 ) Summary: Summary : Set manual_compaction false in case of DeleteTriggeredCompaction object so that kFilesMarkedForComapaction can be reported. Added a DeletionTriggeredUniversalCompactionMarking test case for Deletion Triggered compaction in case of Universal Compaction. Test Plan : make check -j64 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6680 Test Plan: make check -j64 Reviewed By: anand1976 Differential Revision: D20945946 Pulled By: akankshamahajan15 fbshipit-source-id: af84e417bd7127652aaae9143c560d1ab3815d25	2020-04-10 15:30:38 -07:00
Huisheng Liu	9e89ffb776	make iterator return versions between timestamp bounds (#6544 ) Summary: (Based on Yanqin's idea) Add a new field in readoptions as lower timestamp bound for iterator. When the parameter is not supplied (nullptr), the iterator returns the latest visible version of a record. When it is supplied, the existing timestamp field is the upper bound. Together the two serves as a bounded time window. The iterator returns all versions of a record falling in the window. SeekRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks. base line (commit `e860f8840`): seekrandom : 7.836 micros/op 4082449 ops/sec; (0 of 73481999 found) This PR: seekrandom : 7.764 micros/op 4120935 ops/sec; (0 of 71303999 found) db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=seekrandom --use_existing_db=1 --num=25000000 --threads=32 --allow_concurrent_memtable_write=0 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6544 Reviewed By: ltamasi Differential Revision: D20844069 Pulled By: riversand963 fbshipit-source-id: d97f2bf38a323c8c6a68db213b2d3c694b1c1f74	2020-04-10 09:51:58 -07:00
Yi Wu	eb287c72d7	Fix wrong key being read on ingested file with global seqno and delta encoding (#6669 ) Summary: On reading an ingested SST file, `DataBlockIter` will replace seqno encoded in a key with global seqno. However, if the original seqno was part of the prefix used for the next key, the global seqno is by mistake used as part of the prefix to construct the next key, causing wrong result being returned. Although at this point it is only software error while data in the file is not corrupted, the issue can further cause compaction output out of order and corrupted result when the ingested SST participated in compaction. Fixing the issue by save the actual seqno and restore it before the key being used as prefix to construct next key. The unit test is by Little-Wallace from https://github.com/facebook/rocksdb/issues/6666. Fixing https://github.com/facebook/rocksdb/issues/6666. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6669 Test Plan: New unit test Signed-off-by: Yi Wu <yiwu@pingcap.com> Reviewed By: cheng-chang Differential Revision: D20931808 Pulled By: ajkr fbshipit-source-id: f01959c35d6a493954dca981663766c7a5a9e8ab	2020-04-08 21:22:15 -07:00
Peter Dillinger	e5f1bfc263	Fix initializer syntax for old Xcode compiler (#6662 ) Summary: Example compiler output, from OSX TEST_GROUP=3: db/flush_job_test.cc:185:7: error: suggest braces around initialization of subobject [-Werror,-Wmissing-braces] kInvalidBlobFileNumber, 5, 103, 17, 102, 101}; Apparently permitted in newer version, but worth working around. https://stackoverflow.com/questions/31555584/why-is-clang-warning-suggest-braces-around-initialization-of-subobject-wmis Pull Request resolved: https://github.com/facebook/rocksdb/pull/6662 Test Plan: CI (temporarily including OSX TEST_GROUP=3 in Travis) Reviewed By: ltamasi Differential Revision: D20901009 Pulled By: pdillinger fbshipit-source-id: 5338878613b5725e5d632c8858904de467dc4692	2020-04-07 16:00:26 -07:00
Kirill Abrosimov	3ff603171d	added new functions to c-api (#5630 ) Summary: Few functions from options added to C-api Pull Request resolved: https://github.com/facebook/rocksdb/pull/5630 Reviewed By: anand1976 Differential Revision: D20896731 Pulled By: ltamasi fbshipit-source-id: e4215a58b3c2429ec44e3f0d0381cbf86700fb14	2020-04-07 14:45:39 -07:00
Steven Fackler	f53cdab3d7	Hex encode keys in compaction flush logs (#6616 ) Summary: The raw key bytes are currently dumped directly into the log messages, which is not ideal if the keys aren't ASCII strings. Null bytes in particular can cut off bits of the message early. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6616 Reviewed By: ajkr Differential Revision: D20879218 Pulled By: anand1976 fbshipit-source-id: 825a20715fe6d8012c0163c6e7b8159f7926a1a7	2020-04-06 17:41:45 -07:00
mrambacher	259b6ec8da	Move the OptionTypeMap code closer to home (#6198 ) Summary: This is a predecessor to the Configurable PR. This change moves the OptionTypeInfo maps closer to where they will be used. When the Configurable changes are adopted, these values will become static and not associated with the OptionsHelper. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6198 Reviewed By: siying Differential Revision: D20778108 Pulled By: zhichao-cao fbshipit-source-id: a9f85fc73bc53503656e1958ecc1e764052fd1aa	2020-04-03 10:52:38 -07:00
Peter Dillinger	079e77ff9e	Revamp cache_bench to resemble a real workload (#6629 ) Summary: I suspect LRUCache could use some optimization, and to support such an effort, a good benchmarking tool is needed. The existing cache_bench was heavily skewed toward insertion and lookup misses, and did not saturate memory with other work. This change should improve those things to better resemble a real workload. (All below using clang compiler, for some consistency, but not necessarily same version and settings.) The real workload is from production MySQL on RocksDB, filtering stacks containing "LRU", "ShardedCache" or "CacheShard." Lookup inclusive: 66% Insert inclusive: 17% Release inclusive: 15% An alternate simulated workload is MySQL running a LinkBench read test: Lookup inclusive: 54% Insert inclusive: 24% Release inclusive: 21% cache_bench default settings, prior to this change: Lookup inclusive: 35.8% Insert inclusive: 63.6% Release inclusive: 0% cache_bench after this change (intended as somewhat "tighter" workload than average production, more like LinkBench): Lookup inclusive: 52% Insert inclusive: 20% Release inclusive: 26% And top exclusive stacks (portion of stack samples as filtered above): Production MySQL: LRUHandleTable::FindPointer: 25.3% rocksdb::operator==: 15.1% <-- Slice == LRUCacheShard::LRU_Remove: 13.8% ShardedCache::Lookup: 8.9% __pthread_mutex_lock: 7.1% LRUCacheShard::LRU_Insert: 6.3% MurmurHash64A: 4.8% <-- Since upgraded to XXH3p ... Old cache_bench: LRUHandleTable::FindPointer: 23.6% __pthread_mutex_lock: 15.0% __pthread_mutex_unlock_usercnt: 11.7% __lll_lock_wait: 8.6% __lll_unlock_wake: 6.8% LRUCacheShard::LRU_Insert: 6.0% ShardedCache::Lookup: 4.4% LRUCacheShard::LRU_Remove: 2.8% ... rocksdb::operator==: 0.2% <-- Slice == ... New cache_bench: LRUHandleTable::FindPointer: 22.8% __pthread_mutex_unlock_usercnt: 14.3% rocksdb::operator==: 10.5% <-- Slice == LRUCacheShard::LRU_Insert: 9.0% __pthread_mutex_lock: 5.9% LRUCacheShard::LRU_Remove: 5.0% ... ShardedCache::Lookup: 2.9% ... So there's a bit more lock contention in the benchmark than in production, but otherwise looks similar enough to me. At least it's a big improvement over the existing code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6629 Test Plan: No production code changes, ran cache_bench with ASAN Reviewed By: ltamasi Differential Revision: D20824318 Pulled By: pdillinger fbshipit-source-id: 6f8dc5891ead0f87edbed3a615ecd5289d9abe12	2020-04-03 10:26:49 -07:00
Zhichao Cao	ef088f0e93	Fix the multi-thread Manifest write dependency in error_handler_fs_test (#6637 ) Summary: In CompactionManifestWriteRetryableError in error_handler_fs_test, the manifest write of flush should pass with no fs error. After flush, fs is set to error status and the manifest write of compaction should fail due to the IO Error. Currently, the manifest write of flush is not synced with the compaction in order, which might cause manifest write fails, which will cause test failure. Fixed by adding the LoadDependency of sync-point after flush and before compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6637 Test Plan: pass error_hanlder_fs_tes. Pass make asan_check Reviewed By: anand1976 Differential Revision: D20826969 Pulled By: zhichao-cao fbshipit-source-id: fb2e702caa19bd63c82570320536b7acda870ff1	2020-04-02 18:08:46 -07:00
anand76	0709cd04ca	Fix LITE mode test failure in DBOptionsTest.ChangeCompression (#6635 ) Summary: This failure was introduced in https://github.com/facebook/rocksdb/issues/6262 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6635 Reviewed By: siying Differential Revision: D20822602 Pulled By: anand1976 fbshipit-source-id: 96b316816cce6b95b092a7fc46ea968ed6ba8809	2020-04-02 16:41:09 -07:00
sdong	d0f3894cf1	In block based table builder, make variables for estimating file size atomic (#6636 ) Summary: With https://github.com/facebook/rocksdb/issues/6262, TSAN complains about data race of some variables. Those variables are used to estimate file size and are accessed in writer and background threads. Since file size estimation doesn't have to be 100% accurate, we make some variables atomic and use relaxed memory order. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6636 Test Plan: Run all tests with TSAN. Reviewed By: anand1976 Differential Revision: D20820635 fbshipit-source-id: 1ea45ff38be15e33674ffe06b7d42fc9fe161ea5	2020-04-02 16:16:24 -07:00
Levi Tamasi	2165c3bacc	Re-persist blob file metadata when a new manifest file is created (#6630 ) Summary: Does what it says on the can. Similarly to table files, we need to re-persist the metadata of live blob files whenever a new manifest file is opened. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6630 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20802126 Pulled By: ltamasi fbshipit-source-id: 5738692d898790293bf09d66e9997369bbf89566	2020-04-02 11:53:05 -07:00
Ziyue Yang	03a781a90c	Add pipelined & parallel compression optimization (#6262 ) Summary: This PR adds support for pipelined & parallel compression optimization for `BlockBasedTableBuilder`. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set `CompressionOptions::parallel_threads` greater than 1 to enable compression parallelism. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6262 Reviewed By: ajkr Differential Revision: D20651306 fbshipit-source-id: 62125590a9c15b6d9071def9dc72589c1696a4cb	2020-04-01 16:40:18 -07:00
Yanqin Jin	b5818f87f0	Fix clang analyze error (#6622 ) Summary: As title. https://github.com/facebook/rocksdb/issues/6612 caused clang analyze to fail with the error: ``` db/compaction/compaction_picker_fifo.cc:105:39: warning: Called C++ object pointer is null cf_name.c_str(), f->fd.GetNumber(), creation_time); ^~~~~~~~~~~~~~~~~ ./logging/logging.h:59:36: note: expanded from macro 'ROCKS_LOG_BUFFER' ##__VA_ARGS__) ^~~~~~~~~~~ 1 warning generated. ``` Test Plan (devserver): USE_CLANG=1 make analyze Pull Request resolved: https://github.com/facebook/rocksdb/pull/6622 Reviewed By: ltamasi Differential Revision: D20787407 Pulled By: riversand963 fbshipit-source-id: a5de4910cc1aa0d3481a73ec114578925bfd63f7	2020-04-01 10:01:38 -07:00
Levi Tamasi	e6f86cfb36	Revert the recent cache deleter change (#6620 ) Summary: Revert "Use function objects as deleters in the block cache (https://github.com/facebook/rocksdb/issues/6545)" This reverts commit `6301dbe7a7`. Revert "Call out the cache deleter related interface change in HISTORY.md (https://github.com/facebook/rocksdb/issues/6606)" This reverts commit `3a35542f86`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6620 Test Plan: `make check` Reviewed By: zhichao-cao Differential Revision: D20773311 Pulled By: ltamasi fbshipit-source-id: 7637a761f718f323ef0e7da959462e8fb06e7a2b	2020-03-31 16:11:06 -07:00
sdong	80979f81c7	Make options.bottommost_compression, compression_opts and bottommost_compression_opts dynamically changeable. (#6615 ) Summary: These three options should be made dynamically changeable. Simply add them to MutableCFOptions and made the change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6615 Test Plan: Add a unit test to make sure that SetOptions() can change the options. Reviewed By: riversand963 Differential Revision: D20755951 fbshipit-source-id: 8165f4fd7a7a665cc7fb049698935022a5d2e7ff	2020-03-31 12:11:42 -07:00
Yanqin Jin	18cf0de640	Use flush time for the props.creation_time for FIFO compaction (#6612 ) Summary: For FIFO compaction, we use flush time instead of oldest key time as the creation time. This is to prevent FIFO compaction dropping files whose oldest key time is older than TTL but which has newer keys than TTL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6612 Test Plan: make check Reviewed By: siying Differential Revision: D20748217 Pulled By: riversand963 fbshipit-source-id: 3f7b00a847020760537cdddd12f6fe039e5bc663	2020-03-30 18:59:17 -07:00
Zhichao Cao	e8d332d97e	Use FileChecksumGenFactory for SST file checksum (#6600 ) Summary: In the current implementation, sst file checksum is calculated by a shared checksum function object, which may make some checksum function hard to be applied here such as SHA1. In this implementation, each sst file will have its own checksum generator obejct, created by FileChecksumGenFactory. User needs to implement its own FilechecksumGenerator and Factory to plugin the in checksum calculation method. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6600 Test Plan: tested with make asan_check Reviewed By: riversand963 Differential Revision: D20717670 Pulled By: zhichao-cao fbshipit-source-id: 2a74c1c280ac11a07a1980185b43b671acaa71c6	2020-03-29 15:58:46 -07:00
Cheng Chang	ee50b8d499	Be able to decrease background thread's CPU priority when creating database backup (#6602 ) Summary: When creating a database backup, the background threads will not only consume IO resources by copying files, but also consuming CPU such as by computing checksums. During peak times, the CPU consumption by the background threads might affect online queries. This PR makes it possible to decrease CPU priority of these threads when creating a new backup. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6602 Test Plan: make check Reviewed By: siying, zhichao-cao Differential Revision: D20683216 Pulled By: cheng-chang fbshipit-source-id: 9978b9ed9488e8ce135e90ca083e5b4b7221fd84	2020-03-28 19:07:25 -07:00
Zhichao Cao	4246888101	Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487 ) Summary: In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status. The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487 Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check Reviewed By: anand1976 Differential Revision: D20685017 Pulled By: zhichao-cao fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0	2020-03-27 16:04:43 -07:00
Levi Tamasi	6f62322fe4	Add blob files to VersionStorageInfo/VersionBuilder (#6597 ) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80	2020-03-26 18:51:53 -07:00
Levi Tamasi	6301dbe7a7	Use function objects as deleters in the block cache (#6545 ) Summary: As the first step of reintroducing eviction statistics for the block cache, the patch switches from using simple function pointers as deleters to function objects implementing an interface. This will enable using deleters that have state, like a smart pointer to the statistics object that is to be updated when an entry is removed from the cache. For now, the patch adds a deleter template class `SimpleDeleter`, which simply casts the `value` pointer to its original type and calls `delete` or `delete[]` on it as appropriate. Note: to prevent object lifecycle issues, deleters must outlive the cache entries referring to them; `SimpleDeleter` ensures this by using the ("leaky") Meyers singleton pattern. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6545 Test Plan: `make asan_check` Reviewed By: siying Differential Revision: D20475823 Pulled By: ltamasi fbshipit-source-id: fe354c33dd96d9bafc094605462352305449a22a	2020-03-26 16:19:58 -07:00
Mike Kolupaev	963af52f15	Fix iterator reading filter block despite read_tier == kBlockCacheTier (#6562 ) Summary: We're seeing iterators with `ReadOptions::read_tier == kBlockCacheTier` sometimes doing file reads. Stack trace: ``` rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long, rocksdb::Slice, char, bool) const rocksdb::BlockFetcher::ReadBlockContents() rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>, rocksdb::BlockType, rocksdb::GetContext, rocksdb::BlockCacheLookupContext, rocksdb::BlockContents) const rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>, rocksdb::BlockType, rocksdb::GetContext, rocksdb::BlockCacheLookupContext, bool, bool) const rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::ReadFilterBlock(rocksdb::BlockBasedTable const, rocksdb::FilePrefetchBuffer, rocksdb::ReadOptions const&, bool, rocksdb::GetContext, rocksdb::BlockCacheLookupContext, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>) rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::GetOrReadFilterBlock(bool, rocksdb::GetContext, rocksdb::BlockCacheLookupContext, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>) const rocksdb::FullFilterBlockReader::MayMatch(rocksdb::Slice const&, bool, rocksdb::GetContext, rocksdb::BlockCacheLookupContext) const rocksdb::FullFilterBlockReader::RangeMayExist(rocksdb::Slice const, rocksdb::Slice const&, rocksdb::SliceTransform const, rocksdb::Comparator const, rocksdb::Slice const, bool, bool, rocksdb::BlockCacheLookupContext) rocksdb::BlockBasedTable::PrefixMayMatch(rocksdb::Slice const&, rocksdb::ReadOptions const&, rocksdb::SliceTransform const, bool, rocksdb::BlockCacheLookupContext) const rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::SeekImpl(rocksdb::Slice const) rocksdb::ForwardIterator::SeekInternal(rocksdb::Slice const&, bool) rocksdb::DBIter::Seek(rocksdb::Slice const&) ``` `BlockBasedTableIterator::CheckPrefixMayMatch` was missing a check for `kBlockCacheTier`. This PR adds it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6562 Test Plan: deployed it to a logdevice test cluster and looked at logdevice's IO tracing. Reviewed By: siying Differential Revision: D20529368 Pulled By: al13n321 fbshipit-source-id: 65bf33964b1951464415c900336635fb20919611	2020-03-26 15:21:26 -07:00
sdong	6fd0ed4993	CompactRange() to use bottom pool when goes to bottommost level (#6593 ) Summary: In automatic compaction, if a compaction is bottommost, it goes to bottom thread pool. We should do the same for manual compaction too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6593 Test Plan: Add a unit test. See all existing tests pass. Reviewed By: ajkr Differential Revision: D20637408 fbshipit-source-id: cb03031e8f895085f7acf6d2d65e69e84c9ddef3	2020-03-24 20:24:32 -07:00
Huisheng Liu	a6ce5c823b	multiget support for timestamps (#6483 ) Summary: Add timestamp support for MultiGet(). timestamp from readoptions is honored, and timestamps can be returned along with values. MultiReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks. base line (commit `17bef7d3a`): multireadrandom : 104.173 micros/op 307167 ops/sec; (5462999 of 5462999 found) This PR: multireadrandom : 104.199 micros/op 307095 ops/sec; (5307999 of 5307999 found) .\db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=multireadrandom --use_existing_db=1 --num=25000000 --threads=32 --allow_concurrent_memtable_write=0 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6483 Reviewed By: anand1976 Differential Revision: D20498373 Pulled By: riversand963 fbshipit-source-id: 8505f22bc40fd791bc7dd05e48d7e67c91edb627	2020-03-24 11:24:09 -07:00
sdong	921cdd37e2	Fix bug that number of table loading threads is set as a boolean (#6576 ) Summary: When applying a new version in non DB open case, optimize_filters_for_hits is used for max_threads, which is clearly a bug. It is not clear what the indented value in the first place, but it value 1 makes sense here, which would create no extra threads. This bug is not expected to cause user visible problems, assuming C++ implicitly cast bool to 0 or 1. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6576 Test Plan: Run all exsiting test. Reviewed By: ajkr Differential Revision: D20602467 fbshipit-source-id: 40b2cd8619aba09ae9242b36c415464db3c9b737	2020-03-24 10:17:40 -07:00
anand76	a9d168cfd7	Simplify migration to FileSystem API (#6552 ) Summary: The current Env/FileSystem API separation has a couple of issues - 1. It requires the user to specify 2 options - ```Options::env``` and ```Options::file_system``` - which means they have to make code changes to benefit from the new APIs. Furthermore, there is a risk of accessing the same APIs in two different ways, through Env in the old way and through FileSystem in the new way. The two may not always match, for example, if env is ```PosixEnv``` and FileSystem is a custom implementation. Any stray RocksDB calls to env will use the ```PosixEnv``` implementation rather than the file_system implementation. 2. There needs to be a simple way for the FileSystem developer to instantiate an Env for backward compatibility purposes. This PR solves the above issues and simplifies the migration in the following ways - 1. Embed a shared_ptr to the ```FileSystem``` in the ```Env```, and remove ```Options::file_system``` as a configurable option. This way, no code changes will be required in application code to benefit from the new API. The default Env constructor uses a ```LegacyFileSystemWrapper``` as the embedded ```FileSystem```. 1a. - This also makes it more robust by ensuring that even if RocksDB has some stray calls to Env APIs rather than FileSystem, they will go through the same object and thus there is no risk of getting out of sync. 2. Provide a ```NewCompositeEnv()``` API that can be used to construct a PosixEnv with a custom FileSystem implementation. This eliminates an indirection to call Env APIs, and relieves the FileSystem developer of the burden of having to implement wrappers for the Env APIs. 3. Add a couple of missing FileSystem APIs - ```SanitizeEnvOptions()``` and ```NewLogger()``` Tests: 1. New unit tests 2. make check and make asan_check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6552 Reviewed By: riversand963 Differential Revision: D20592038 Pulled By: anand1976 fbshipit-source-id: c3801ad4153f96d21d5a3ae26c92ba454d1bf1f7	2020-03-23 21:54:21 -07:00
Zhichao Cao	d300d10962	Fix the MultiGet testing failure in Circleci (#6578 ) Summary: The MultiGet test in db_basic_test fails in CircleCI vs2019. The reason is that even Snappy compression is enabled, the first compression type is still kNoCompression. This PR checks the list and ensure that only when compression is enable and the compression type is valid, compression will be enabled. Such that, it will not fail the combined read test in MultiGet. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6578 Test Plan: make check, db_basic_test. Reviewed By: anand1976 Differential Revision: D20607529 Pulled By: zhichao-cao fbshipit-source-id: dcead264d5c2da105912c18caad34b8510bb04b0	2020-03-23 18:51:09 -07:00
Yanqin Jin	617f479266	Fix LITE build (#6575 ) Summary: Fix LITE build by excluding some unit tests that use features not supported in LITE. ``` db/db_basic_test.cc:1778:8: error: ‘void rocksdb::{anonymous}::TableFileListener::OnTableFileCreated(const rocksdb::TableFileCreationInfo&)’ marked ‘override’, but does not override void OnTableFileCreated(const TableFileCreationInfo& info) override { ^~~~~~~~~~~~~~~~~~ make: *** [db/db_basic_test.o] Error 1 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6575 Reviewed By: ltamasi Differential Revision: D20598598 Pulled By: riversand963 fbshipit-source-id: 367f7cb2500360ad57030b138a94c0f731a04339	2020-03-23 13:05:36 -07:00
Zhichao Cao	5c6346c420	Revert "Added the safe-to-ignore tag to version_edit (#6530 )" (#6569 ) Summary: This reverts commit `e10553f2a6`. Pass make asan_check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6569 Reviewed By: riversand963 Differential Revision: D20574319 Pulled By: zhichao-cao fbshipit-source-id: ce36981a21596f5f2e14da6a59a2bb3619509a8b	2020-03-23 10:27:47 -07:00
Yanqin Jin	fb09ef05dc	Attempt to recover from db with missing table files (#6334 ) Summary: There are situations when RocksDB tries to recover, but the db is in an inconsistent state due to SST files referenced in the MANIFEST being missing. In this case, previous RocksDB will just fail the recovery and return a non-ok status. This PR enables another possibility. During recovery, RocksDB checks possible MANIFEST files, and try to recover to the most recent state without missing table file. `VersionSet::Recover()` applies version edits incrementally and "materializes" a version only when this version does not reference any missing table file. After processing the entire MANIFEST, the version created last will be the latest version. `DBImpl::Recover()` calls `VersionSet::Recover()`. Afterwards, WAL replay will not be performed. To use this capability, set `options.best_efforts_recovery = true` when opening the db. Best-efforts recovery is currently incompatible with atomic flush. Test plan (on devserver): ``` $make check $COMPILE_WITH_ASAN=1 make all && make check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6334 Reviewed By: anand1976 Differential Revision: D19778960 Pulled By: riversand963 fbshipit-source-id: c27ea80f29bc952e7d3311ecf5ee9c54393b40a8	2020-03-20 19:30:48 -07:00
Cheng Chang	5fd152b7ad	Get block size only in direct IO mode (#6522 ) Summary: When `use_direct_reads` and `use_direct_writes` are `false`, `logical_sector_size_` inside various `*File` implementations are not actually used, so `GetLogicalBlockSize` does not necessarily need to be called for `logical_sector_size_`, just set a default page size. This is a follow up PR for https://github.com/facebook/rocksdb/pull/6457. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6522 Test Plan: make check Reviewed By: siying Differential Revision: D20408885 Pulled By: cheng-chang fbshipit-source-id: f2d3808f41265237e7fa2c0be9f084f8fa97fe3d	2020-03-20 15:26:10 -07:00
Zhichao Cao	e10553f2a6	Added the safe-to-ignore tag to version_edit (#6530 ) Summary: Each time RocksDB switches to a new MANIFEST file from old one, it calls WriteCurrentStateToManifest() which writes a 'snapshot' of the current in-memory state of versions to the beginning of the new manifest as a bunch of version edits. We can distinguish these version edits from other version edits written during normal operations with a custom, safe-to-ignore tag. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6530 Test Plan: added test to version_edit_test, pass make asan_check Reviewed By: riversand963 Differential Revision: D20524516 Pulled By: zhichao-cao fbshipit-source-id: f1de102f5499bfa88dae3caa2f32c7f42cf904db	2020-03-19 11:30:26 -07:00
Levi Tamasi	442404558a	Clean up VersionBuilder a bit (#6556 ) Summary: The whole point of the pimpl idiom is to hide implementation details. Internal helper methods like `CheckConsistency`, `CheckConsistencyForDeletes`, and `MaybeAddFile` do not belong in the public interface of the class. In addition, the patch switches to `unique_ptr` for the implementation object instead of using a raw `delete`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6556 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20523568 Pulled By: ltamasi fbshipit-source-id: 5bbb0ccebd0c47a33b815398c7f9cfe13bd775ac	2020-03-19 10:44:16 -07:00
Yanqin Jin	66ed58083a	Reduce runtime of db_with_timestamp_basic_test (#6546 ) Summary: Reduce runtime by reducing test scale to avoid test time-outs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6546 Test Plan: time ./db_with_timestamp_basic_test and watch internal tests. Reviewed By: zhichao-cao Differential Revision: D20479292 Pulled By: riversand963 fbshipit-source-id: c9e4a155be7699dd4de60fa531de86d442a3ba0a	2020-03-17 10:50:48 -07:00
Cheng Chang	402da454cb	Migrate AppVeyor to CircleCI (#6518 ) Summary: CircleCI is the new recommended CI system internally. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6518 Test Plan: Watch https://app.circleci.com/pipelines/github/facebook/rocksdb Differential Revision: D20454743 Pulled By: cheng-chang fbshipit-source-id: 39031568d6c1d3d25b7fbd78fa9a0e6067ddc47c	2020-03-13 21:58:51 -07:00
Cheng Chang	23eae14d24	Destroy DB at the end of each test in db_logical_block_size_cache_test (#6532 ) Summary: If DB is not deleted, in concurrent test, the tests might fail because of the previously existing DB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6532 Test Plan: make clean && make -j24 LITE=1 db_logical_block_size_cache_test && ./db_logical_block_size_cache_test make clean && make -j24 db_logical_block_size_cache_test && ./db_logical_block_size_cache_test Differential Revision: D20454734 Pulled By: cheng-chang fbshipit-source-id: 8abede2ec1d79c1a4fe1bc95fbda489f8f7ee052	2020-03-13 21:53:38 -07:00
Zhichao Cao	5c30e6c088	Separate timestamp related test from db_basic_test (#6516 ) Summary: In some of the test, db_basic_test may cause time out due to its long running time. Separate the timestamp related test from db_basic_test to avoid the potential issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6516 Test Plan: pass make asan_check Differential Revision: D20423922 Pulled By: zhichao-cao fbshipit-source-id: d6306f89a8de55b07bf57233e4554c09ef1fe23a	2020-03-13 11:37:15 -07:00
Cheng Chang	0d2c8e47e8	OpenForReadOnly is not supported in LITE mode (#6523 ) Summary: In DBLogicalBlockSizeCacheTest, do not test OpenForReadOnly in LITE mode. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6523 Test Plan: watch test for LITE mode Differential Revision: D20420321 Pulled By: cheng-chang fbshipit-source-id: e45bf6f2800206d6f8ce9af7308e76a08de80643	2020-03-12 14:13:59 -07:00
Levi Tamasi	c15e85bdcb	Move BlobDB related files under db/ to db/blob/ (#6519 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6519 Test Plan: ``` make all make check ``` Differential Revision: D20400691 Pulled By: ltamasi fbshipit-source-id: 20ef911cf1c2c92c7f71ef0b493f9be64f2eef94	2020-03-12 11:00:56 -07:00
Cheng Chang	2d9efc9ab2	Cache result of GetLogicalBufferSize in Linux (#6457 ) Summary: In Linux, when reopening DB with many SST files, profiling shows that 100% system cpu time spent for a couple of seconds for `GetLogicalBufferSize`. This slows down MyRocks' recovery time when site is down. This PR introduces two new APIs: 1. `Env::RegisterDbPaths` and `Env::UnregisterDbPaths` lets `DB` tell the env when it starts or stops using its database directories . The `PosixFileSystem` takes this opportunity to set up a cache from database directories to the corresponding logical block sizes. 2. `LogicalBlockSizeCache` is defined only for OS_LINUX to cache the logical block sizes. Other modifications: 1. rename `logical buffer size` to `logical block size` to be consistent with Linux terms. 2. declare `GetLogicalBlockSize` in `PosixHelper` to expose it to `PosixFileSystem`. 3. change the functions `IOError` and `IOStatus` in `env/io_posix.h` to have external linkage since they are used in other translation units too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6457 Test Plan: 1. A new unit test is added for `LogicalBlockSizeCache` in `env/io_posix_test.cc`. 2. A new integration test is added for `DB` operations related to the cache in `db/db_logical_block_size_cache_test.cc`. `make check` Differential Revision: D20131243 Pulled By: cheng-chang fbshipit-source-id: 3077c50f8065c0bffb544d8f49fb10bba9408d04	2020-03-11 18:40:05 -07:00
sdong	331e6199df	Include more information in file lock failure (#6507 ) Summary: When users fail to open a DB with file lock failure, it is sometimes hard for users to debug. We now include the time the lock is acquired and the thread ID that acquired the lock, to help users debug problems like this. Default Env's thread ID is used. Since type of lockedFiles is changed, rename it to follow naming convention too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6507 Test Plan: Add a unit test and improve an existing test to validate the case. Differential Revision: D20378333 fbshipit-source-id: 312fe0e9733fd1d1e9969c321b90ce523cf4708a	2020-03-11 16:23:08 -07:00
Levi Tamasi	37a635cfe6	Disambiguate CustomFieldTags for the unity build (#6513 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6513 Test Plan: `make unity_test` Differential Revision: D20388919 Pulled By: ltamasi fbshipit-source-id: 88dbceab0723a54ee3939e1644e13dc9a4c70420	2020-03-11 14:45:12 -07:00
Adam Retter	8fc20ac468	Add ppc64le builds to Travis (#6144 ) Summary: Let's see how this goes... Pull Request resolved: https://github.com/facebook/rocksdb/pull/6144 Differential Revision: D20387515 Pulled By: pdillinger fbshipit-source-id: ba2669348c267141dfddff910b4c2224a22cbb38	2020-03-11 12:33:45 -07:00
Levi Tamasi	f5bc3b99d5	Split BlobFileState into an immutable and a mutable part (#6502 ) Summary: It's never too soon to refactor something. The patch splits the recently introduced (`VersionEdit` related) `BlobFileState` into two classes `BlobFileAddition` and `BlobFileGarbage`. The idea is that once blob files are closed, they are immutable, and the only thing that changes is the amount of garbage in them. In the new design, `BlobFileAddition` contains the immutable attributes (currently, the count and total size of all blobs, checksum method, and checksum value), while `BlobFileGarbage` contains the mutable GC-related information elements (count and total size of garbage blobs). This is a better fit for the GC logic and is more consistent with how SST files are handled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6502 Test Plan: `make check` Differential Revision: D20348352 Pulled By: ltamasi fbshipit-source-id: ff93f0121e80ab15e0e0a6525ba0d6af16a0e008	2020-03-10 17:27:26 -07:00
Yanqin Jin	fd1da22111	Support options.max_open_files != -1 with FIFO compaction (#6503 ) Summary: Allow user to specify options.max_open_files != -1 with FIFO compaction. If max_open_files != -1, not all table files are kept open. In the past, FIFO style compaction requires all table files to be open in order to read file creation time from table properties. Later, we added file creation time to MANIFEST, making it possible to read file creation time without opening file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6503 Test Plan: make check Differential Revision: D20353758 Pulled By: riversand963 fbshipit-source-id: ba5c61a648419e47e9ef6d74e0e280e3ee24f296	2020-03-09 18:45:06 -07:00
Yanqin Jin	d93812c9ae	Iterator with timestamp (#6255 ) Summary: Preliminary support for iterator with user timestamp. Current implementation does not consider merge operator and reverse iterator. Auto compaction is also disabled in unit tests. Create an iterator with timestamp. ``` ... read_opts.timestamp = &ts; auto* iter = db->NewIterator(read_opts); // target is key without timestamp. for (iter->Seek(target); iter->Valid(); iter->Next()) {} for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {} delete iter; read_opts.timestamp = &ts1; // lower_bound and upper_bound are without timestamp. read_opts.iterate_lower_bound = &lower_bound; read_opts.iterate_upper_bound = &upper_bound; auto* iter1 = db->NewIterator(read_opts); // Do Seek or SeekToFirst() delete iter1; ``` Test plan (dev server) ``` $make check ``` Simple benchmarking (dev server) 1. The overhead introduced by this PR even when timestamp is disabled. key size: 16 bytes value size: 100 bytes Entries: 1000000 Data reside in main memory, and try to stress iterator. Repeated three times on master and this PR. - Seek without next ``` ./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 ``` master: 159047.0 ops/sec this PR: 158922.3 ops/sec (2% drop in throughput) - Seek and next 10 times ``` ./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 -seek_nexts=10 ``` master: 109539.3 ops/sec this PR: 107519.7 ops/sec (2% drop in throughput) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6255 Differential Revision: D19438227 Pulled By: riversand963 fbshipit-source-id: b66b4979486f8474619f4aa6bdd88598870b0746	2020-03-06 16:24:27 -08:00
Yuqi Gu	e171a219d5	Fix db_wal_test::TruncateLastLogAfterRecoverWithoutFlush failure (#6437 ) Summary: `TruncateLastLogAfterRecoverWithoutFlush` case depends on fallocate support of underlying file system. On a file system which lacks of this feature, like zfs, it will fail to allocate predefined file size as this test case intends to do; So a check block is added to detect fallocate support and skip test if not. The related work is done by JunHe77. Thanks! Signed-off-by: Yuqi Gu <yuqi.gu@arm.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/6437 Differential Revision: D20145032 Pulled By: pdillinger fbshipit-source-id: c8b691dc508e95acfa2a004ddbc07e2faa76680d	2020-03-05 17:18:16 -08:00
Cheng Chang	afb97094ae	Skip high levels with no key falling in the range in CompactRange (#6482 ) Summary: In CompactRange, if there is no key in memtable falling in the specified range, then flush is skipped. This PR extends this skipping logic to SST file levels: it starts compaction from the highest level (starting from L0) that has files with key falling in the specified range, instead of always starts from L0. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6482 Test Plan: A new test ManualCompactionTest::SkipLevel is added. Also updated a test related to statistics of index block cache hit in db_test2, the index cache hit is increased by 1 in this PR because when checking overlap for the key range in L0, OverlapWithLevelIterator will do a seek in the table cache iterator, which will read from the cached index. Also updated db_compaction_test and db_test to use correct range for full compaction. Differential Revision: D20251149 Pulled By: cheng-chang fbshipit-source-id: f822157cf4796972bd5035d9d7178d8dfb7af08b	2020-03-04 20:15:25 -08:00
Zhichao Cao	e62fe50634	Introduce FaultInjectionTestFS to test fault File system instead of Env (#6414 ) Summary: In the current code base, we can use FaultInjectionTestEnv to simulate the env issue such as file write/read errors, which are used in most of the test. The PR https://github.com/facebook/rocksdb/issues/5761 introduce the File System as a new Env API. This PR implement the FaultInjectionTestFS, which can be used to simulate when File System has issues such as IO error. user can specify any IOStatus error as input, such that FS corresponding actions will return certain error to the caller. A set of ErrorHandlerFSTests are introduced for testing Pull Request resolved: https://github.com/facebook/rocksdb/pull/6414 Test Plan: pass make asan_check, pass error_handler_fs_test. Differential Revision: D20252421 Pulled By: zhichao-cao fbshipit-source-id: e922038f8ce7e6d1da329fd0bba7283c4b779a21	2020-03-04 12:35:05 -08:00
Kefu Chai	03dbd11ead	s/const auto/const auto&/ when doing loop (#6477 ) Summary: this silences following warning from clang-11 ``` rocksdb/db/db_impl/db_impl_compaction_flush.cc:1040:21: warning: loop variable 'newf' of type 'const std::pair<int, rocksdb::FileMetaData>' creates a copy from type 'const std::pair<int\ , rocksdb::FileMetaData>' [-Wrange-loop-analysis] for (const auto newf : c->edit()->GetNewFiles()) { ^ rocksdb/db/db_impl/db_impl_compaction_flush.cc:1040:10: note: use reference type 'const std::pair<int, rocksdb::FileMetaData> &' to prevent copying for (const auto newf : c->edit()->GetNewFiles()) { ^~~~~~~~~~~~~~~~~ & ``` Signed-off-by: Kefu Chai <tchaikov@gmail.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/6477 Differential Revision: D20211850 Pulled By: ltamasi fbshipit-source-id: 3e89e13a12bba79f1b934d46b7c4c0576cdafb01	2020-03-03 08:41:57 -08:00
sdong	17bef7d3a8	Fix data race of GetCreationTimeOfOldestFile() (#6473 ) Summary: When DBImpl::GetCreationTimeOfOldestFile() calls Version::GetCreationTimeOfOldestFile(), the version is not directly or indirectly referenced, so an event like compaction can race with the operation and cause DBImpl::GetCreationTimeOfOldestFile() to access delocated data. This was caught by an ASAN run: ==268==ERROR: AddressSanitizer: heap-use-after-free on address 0x612000b7d198 at pc 0x000018332913 bp 0x7f391510d310 sp 0x7f391510d308 READ of size 8 at 0x612000b7d198 thread T845 (store_load-33) SCARINESS: 51 (8-byte-read-heap-use-after-free) #0 0x18332912 in rocksdb::Version::GetCreationTimeOfOldestFile(unsigned long) rocksdb/src/db/version_set.cc:1488 https://github.com/facebook/rocksdb/issues/1 0x1803ddaa in rocksdb::DBImpl::GetCreationTimeOfOldestFile(unsigned long) rocksdb/src/db/db_impl/db_impl.cc:4499 https://github.com/facebook/rocksdb/issues/2 0xe24ca09 in rocksdb::StackableDB::GetCreationTimeOfOldestFile(unsigned long) rocksdb/utilities/stackable_db.h:392 ...... 0x612000b7d198 is located 216 bytes inside of 296-byte region [0x612000b7d0c0,0x612000b7d1e8) freed by thread T28 here: ...... https://github.com/facebook/rocksdb/issues/5 0x1832c73f in std::vector<rocksdb::FileMetaData, std::allocator<rocksdb::FileMetaData> >::~vector() third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/stl_vector.h:435 https://github.com/facebook/rocksdb/issues/6 0x1832c73f in rocksdb::VersionStorageInfo::~VersionStorageInfo() rocksdb/src/db/version_set.cc:734 https://github.com/facebook/rocksdb/issues/7 0x1832cf42 in rocksdb::Version::~Version() rocksdb/src/db/version_set.cc:758 https://github.com/facebook/rocksdb/issues/8 0x9d1bb5 in rocksdb::Version::Unref() rocksdb/src/db/version_set.cc:2869 https://github.com/facebook/rocksdb/issues/9 0x183e7631 in rocksdb::Compaction::~Compaction() rocksdb/src/db/compaction/compaction.cc:275 https://github.com/facebook/rocksdb/issues/10 0x9e6de6 in std::default_delete<rocksdb::Compaction>::operator()(rocksdb::Compaction) const third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/unique_ptr.h:78 https://github.com/facebook/rocksdb/issues/11 0x9e6de6 in std::unique_ptr<rocksdb::Compaction, std::default_delete<rocksdb::Compaction> >::reset(rocksdb::Compaction) third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/unique_ptr.h:376 https://github.com/facebook/rocksdb/issues/12 0x9e6de6 in rocksdb::DBImpl::BackgroundCompaction(bool, rocksdb::JobContext, rocksdb::LogBuffer, rocksdb::DBImpl::PrepickedCompaction, rocksdb::Env::Priority) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2826 https://github.com/facebook/rocksdb/issues/13 0x9ac3b8 in rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction, rocksdb::Env::Priority) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2320 https://github.com/facebook/rocksdb/issues/14 0x9abff7 in rocksdb::DBImpl::BGWorkCompaction(void*) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2096 ...... Fix the issue by reference the super version and use the referenced version from it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6473 Test Plan: Run ASAN for all existing tests. Differential Revision: D20196416 fbshipit-source-id: 5f4a7918110fc7b8dd7841932d376bc9d1e59d6f	2020-03-02 16:37:01 -08:00
Zhichao Cao	8d73137ae8	Replace Directory with FSDirectory in DB (#6468 ) Summary: In the current code base, we can use Directory from Env to manage directory (e.g, Fsync()). The PR https://github.com/facebook/rocksdb/issues/5761 introduce the File System as a new Env API. So we further replace the Directory class in DB with FSDirectory such that we can have more IO information from IOStatus returned by FSDirectory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6468 Test Plan: pass make asan_check Differential Revision: D20195261 Pulled By: zhichao-cao fbshipit-source-id: 93962cb9436852bfcfb76e086d9e7babd461cbe1	2020-03-02 16:16:26 -08:00
Huisheng Liu	904a60ff63	return timestamp from get (#6409 ) Summary: Added new Get() methods that return timestamp. Dummy implementation is given so that classes derived from DB don't need to be touched to provide their implementation. MultiGet is not included. ReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks. base line (commit `72ee067b9`): 101.712 micros/op 314602 ops/sec; 36.0 MB/s (5658999 of 5658999 found) This PR: 100.288 micros/op 319071 ops/sec; 36.5 MB/s (5674999 of 5674999 found) ./db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=readrandom --use_existing_db=1 --num=25000000 --threads=32 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6409 Differential Revision: D20200086 Pulled By: riversand963 fbshipit-source-id: 490edd74d924f62bd8ae9c29c2a6bbbb8410ca50	2020-03-02 16:01:00 -08:00
Michael R. Crusoe	051696bf98	fix some spelling typos (#6464 ) Summary: Found from Debian's "Lintian" program Pull Request resolved: https://github.com/facebook/rocksdb/pull/6464 Differential Revision: D20162862 Pulled By: zhichao-cao fbshipit-source-id: 06941ee2437b038b2b8045becbe9d2c6fbff3e12	2020-02-28 14:14:03 -08:00
Andrew Kryczka	69679e7375	Fix range deletion tombstone ingestion with global seqno (#6429 ) Summary: Original author: jeffrey-xiao If we are writing a global seqno for an ingested file, the range tombstone metablock gets accessed and put into the cache during ingestion preparation. At the time, the global seqno of the ingested file has not yet been determined, so the cached block will not have a global seqno. When the file is ingested and we read its range tombstone metablock, it will be returned from the cache with no global seqno. In that case, we use the actual seqnos stored in the range tombstones, which are all zero, so the tombstones cover nothing. This commit removes global_seqno_ variable from Block. When iterating over a block, the global seqno for the block is determined by the iterator instead of storing this mutable attribute in Block. Additionally, this commit adds a regression test to check that keys are deleted when ingesting a file with a global seqno and range deletion tombstones. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6429 Differential Revision: D19961563 Pulled By: ajkr fbshipit-source-id: 5cf777397fa3e452401f0bf0364b0750492487b7	2020-02-25 15:31:48 -08:00
Levi Tamasi	d87c10c6ab	Add blob file state to VersionEdit (#6416 ) Summary: BlobDB currently does not keep track of blob files: no records are written to the manifest when a blob file is added or removed, and upon opening a database, the list of blob files is populated simply based on the contents of the blob directory. This means that lost blob files cannot be detected at the moment. We plan to solve this issue by making blob files a part of `Version`; as a first step, this patch makes it possible to store information about blob files in `VersionEdit`. Currently, this information includes blob file number, total number and size of all blobs, and total number and size of garbage blobs. However, the format is extensible: new fields can be added in both a forward compatible and a forward incompatible manner if needed (similarly to `kNewFile4`). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6416 Test Plan: `make check` Differential Revision: D19894234 Pulled By: ltamasi fbshipit-source-id: f9753e1f2aedf6dadb70c09b345207cb9c58c329	2020-02-24 18:39:53 -08:00
Yanqin Jin	890d87fadc	Some minor fix-ups (#6440 ) Summary: Cleanup some code without any real change in functionality. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6440 Differential Revision: D20015891 Pulled By: riversand963 fbshipit-source-id: 33e18754b0f002006a6d4805e9aaf84c0c8ad25a	2020-02-21 15:09:56 -08:00
Zaiyang Li	fcec56e86c	Add function to set row cache on rocksdb_options_t (#6442 ) Summary: Adding a C API function to set `row_cache` on `rocksdb_options_t` as this functionality is missing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6442 Differential Revision: D20036813 Pulled By: riversand963 fbshipit-source-id: c1fa95ea343345fbc1e57961d0d048e0e79be373	2020-02-21 11:12:42 -08:00
Yanqin Jin	362b8d4393	Fix MANIFEST name assignment (#6426 ) Summary: Currently, a new MANIFEST file is assigned a new file number when 1) no MANIFEST is open, or 2) current MANIFEST file size exceeds a threshold. This is not sufficient. There are cases when the caller explicitly specifies that a new MANIFEST be created. For example, if user sets options.write_dbid_to_manifest = true, and there are WAL files, then RocksDB will run into an issue during recovery. `DBImpl::Recover()` will call `LogAndApply()` to write dbid. At this point, the db being recovered creates a new MANIFEST, say, MANIFEST-000003. Since there are WALs, `DBImpl::RecoverLogFiles` will be called. Towards the end of this function, we call `LogAndApply(new_descriptor_log=true)`, which explicitly creates a new MANIFEST. However, the manifest_file_number is wrong before this fix. Consequently, RocksDB opens an existing, non-empty file for append, effectively truncating the file to zero. If a crash occurs, then there will be data loss. Test Plan (devserver): make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6426 Test Plan: make check Differential Revision: D19951866 Pulled By: riversand963 fbshipit-source-id: 4b1b9fc28d4fe2ac12764b388ef9e61f05e766da	2020-02-20 14:30:58 -08:00
sdong	fdf882ded2	Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433 ) Summary: When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433 Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag. Differential Revision: D19977691 fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e	2020-02-20 12:09:57 -08:00
Andrew Kryczka	c6abe30ee3	Fix concurrent full purge and WAL recycling (#5900 ) Summary: We were removing the file from `log_recycle_files_` before renaming it with `ReuseWritableFile()`. Since `ReuseWritableFile()` occurs outside the DB mutex, it was possible for a concurrent full purge to sneak in and delete the file before it could be renamed. Consequently, `SwitchMemtable()` would fail and the DB would enter read-only mode. The fix is to hold the old file number in `log_recycle_files_` until after the file has been renamed. Full purge uses that list to decide which files to keep, so it can no longer delete a file pending recycling. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5900 Test Plan: new unit test Differential Revision: D19771719 Pulled By: ajkr fbshipit-source-id: 094346349ca3fb499712e62de03905acc30b5ce8	2020-02-18 13:54:13 -08:00
Andrew Kryczka	0f9dcb88b2	Return NotSupported from WriteBatchWithIndex::DeleteRange (#5393 ) Summary: As discovered in https://github.com/facebook/rocksdb/issues/5260 and https://github.com/facebook/rocksdb/issues/5392, reads on the indexed batch do not account for range tombstones. So, return `Status::NotSupported` from `WriteBatchWithIndex::DeleteRange` until we properly support it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5393 Test Plan: added unit test Differential Revision: D19912360 Pulled By: ajkr fbshipit-source-id: 0bbfc978ea015d64516ca708fce2429abba524cb	2020-02-18 11:18:25 -08:00
Cheng Chang	4034e289ad	Fail fast in paranoid mode when LoadTableHandlers fail during recovering (#6368 ) Summary: Previously, when recovering version set, LoadTableHandlers failures are ignored. If paranoid_checks is true, this failure should not be ignored, otherwise, the opened db might be in an inconsistent state. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6368 Test Plan: make check Differential Revision: D19713459 Pulled By: cheng-chang fbshipit-source-id: 68cb94f4f2cc43f8b024b14755193cd45cfcad55	2020-02-14 08:17:10 -08:00
Manuel Ung	908b1ee64e	WriteUnPrepared: Fix assertion during recovery (#6419 ) Summary: During recovery, multiple (un)prepared batches could exist in the same WAL record due to group commit. This breaks an assertion in `MemTableInserter::MarkBeginPrepare`. To fix, reset unprepared_batch_ to false after `MarkEndPrepare`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6419 Differential Revision: D19896148 Pulled By: lth fbshipit-source-id: b1a32ef88f775a0881264a18bd1a4a5b8c85eee3	2020-02-13 18:52:05 -08:00
sdong	ac8e89a443	Should flush and sync WAL when writing it in DB::Open() (#6417 ) Summary: A recent fix related to 2pc https://github.com/facebook/rocksdb/pull/6313/ writes something to WAL, but does not flush or sync. This causes assertion failure "impl->TEST_WALBufferIsEmpty()" if manual_wal_flush = true. We should fsync the entry to make sure a second power reset can recover. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6417 Test Plan: Add manual_wal_flush=true case in TransactionTest.DoubleCrashInRecovery and fix a bug in the test so that the bug can be reproduced. It passes with the fix. Differential Revision: D19894537 fbshipit-source-id: f1e84e49e2269f583c6019743118292cd8b6598e	2020-02-13 18:41:04 -08:00
Huisheng Liu	5138764eb5	Fix destroydb (#6308 ) Summary: It's observed on Windows DestroyDB failed to remove the log file because the logger is still alive in sst file manager and holding a handle to the log file. This fix makes sure the logger is released before attempt to clear the database directory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6308 Differential Revision: D19818829 Pulled By: riversand963 fbshipit-source-id: 54c3e6859aadaaba4a49b3e851b73dc35ec7dc6a	2020-02-13 11:21:27 -08:00
Cheng Chang	a676001f95	Revert usage of Defer. (#6410 ) Summary: Seems like this caused the following test failure on AppVeyor: DBTest2.CrashInRecoveryMultipleCF c:\projects\rocksdb\db\db_test_util.cc(107): error: DestroyDB(dbname_, options) IO error: Failed to delete: C:\projects\rocksdb\db_tests\\testrocksdb-3112//db_test2_10791409581227174103/000013.sst: Access is denied. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6410 Test Plan: Wait to see whether the AppVeyor test passes. Differential Revision: D19879872 Pulled By: cheng-chang fbshipit-source-id: 59a9c55ca88566e9210c0b715ecc45a4fd9afe26	2020-02-13 10:31:32 -08:00
anand76	3e49249d30	Ensure all MultiGet IO errors are propagated to user (#6403 ) Summary: Unrevert the previous fix to propagate error status, and an additional fix to not treat a memtable lookup MergeInProgress status as an error. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6403 Test Plan: Unit tests Tried running stress tests but couldn't repro the stress failure Differential Revision: D19846721 Pulled By: anand1976 fbshipit-source-id: 7db10cccbdc863d9b559497f0a46b608d2488ca4	2020-02-11 17:27:22 -08:00
anand76	35ed530d2c	Revert "Check KeyContext status in MultiGet (#6387 )" (#6401 ) Summary: This reverts commit `d70011bccc`. The commit is causing some stress test failure due to unexpected Status::MergeInProgress() return for some keys. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6401 Differential Revision: D19826623 Pulled By: anand1976 fbshipit-source-id: edd634cede9cb7bdd2cb8f46e662ea709b16d2f1	2020-02-10 22:23:36 -08:00
Cheng Chang	dafb568052	Add utility class Defer (#6382 ) Summary: Add a utility class `Defer` to defer the execution of a function until the Defer object goes out of scope. Used in VersionSet:: ProcessManifestWrites as an example. The inline comments for class `Defer` have more details. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6382 Test Plan: `make defer_test version_set_test && ./defer_test && ./version_set_test` Differential Revision: D19797538 Pulled By: cheng-chang fbshipit-source-id: b1a9b7306e4fd4f48ec2ab55783caa561a315f0f	2020-02-10 17:59:47 -08:00
Levi Tamasi	cbf5f3be43	Do not move VersionEdit into AtomicGroupReadBuffer (#6400 ) Summary: https://github.com/facebook/rocksdb/pull/6383 surfaced an issue with `VersionSet`/`ReactiveVersionSet` and `AtomicGroupReadBuffer::AddEdit` (which was added in https://github.com/facebook/rocksdb/pull/5411): `AddEdit` moves the `VersionEdit` passed to it into `replay_buffer_`, however, the client `VersionSet` classes keep using it afterwards. This seemed to work before the refactoring but it really did not: since `VersionEdit` used to have a user-declared destructor, no move constructor/move assignment operator was generated, and the `move` in `AddEdit` was really a copy. The patch makes the copy explicit. Note: it should be possible to rework this logic so that we can get away with the move but for now, this should fix the issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6400 Test Plan: `make check` `make analyze` Differential Revision: D19824466 Pulled By: ltamasi fbshipit-source-id: f38033967daf2a39c78dcd6e12978bafe37632b4	2020-02-10 17:15:42 -08:00
Zhichao Cao	4369f2c7bb	Checksum for each SST file and stores in MANIFEST (#6216 ) Summary: In the current code base, RocksDB generate the checksum for each block and verify the checksum at usage. Current PR enable SST file checksum. After a SST file is generated by Flush or Compaction, RocksDB generate the SST file checksum and store the checksum value and checksum method name in the vs_info and MANIFEST as part for the FileMetadata. Added the enable_sst_file_checksum to Options to enable or disable file checksum. Added sst_file_checksum to Options such that user can plugin their own SST file checksum calculate method via overriding the SstFileChecksum class. The checksum information inlcuding uint32_t checksum value and a checksum name (string). A new tool is added to LDB such that user can dump out a list of file checksum information from MANIFEST. If user enables the file checksum but does not provide the sst_file_checksum instance, RocksDB will use the default crc32checksum implemented in table/sst_file_checksum_crc32c.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/6216 Test Plan: Added the testing case in table_test and ldb_cmd_test to verify checksum is correct in different level. Pass make asan_check. Differential Revision: D19171461 Pulled By: zhichao-cao fbshipit-source-id: b2e53479eefc5bb0437189eaa1941670e5ba8b87	2020-02-10 15:52:52 -08:00
Yutian Li	2e0159ec9e	Add error status for no_slowdown & low priority write (#6396 ) Summary: When `no_slowdown` is enabled, it returns `Status::Incomplete("Write stall")` if a stall would occur. This patch adds descriptive text for when `no_slowdown` and `low_pri` are enabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6396 Differential Revision: D19808978 Pulled By: cheng-chang fbshipit-source-id: a53b0d25ed414c821a086531e0222027f925e627	2020-02-10 12:33:16 -08:00
anand76	d70011bccc	Check KeyContext status in MultiGet (#6387 ) Summary: Currently, any IO errors and checksum mismatches while reading data blocks, are being ignored by the batched MultiGet. Its only looking at the GetContext state. Fix that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6387 Test Plan: Add unit tests Differential Revision: D19799819 Pulled By: anand1976 fbshipit-source-id: 46133dccbb04e64067b9fe6cda73e282203db969	2020-02-07 16:48:16 -08:00
Robert Yang	4e457278fa	db/write_thread.cc: Initialize state (#6275 ) Summary: Fixed an error when compiled with -Og: db/write_thread.cc:183:14: error: 'state' may be used uninitialized in this function [-Werror=maybe-uninitialized] Signed-off-by: Robert Yang <liezhi.yang@windriver.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/6275 Differential Revision: D19381755 fbshipit-source-id: a90bf3cd4a7248d9d71219e918fc6253deb97e3c	2020-02-07 15:20:38 -08:00
Levi Tamasi	752c87af78	Clean up VersionEdit a bit (#6383 ) Summary: This is a bunch of small improvements to `VersionEdit`. Namely, the patch * Makes the names and order of variables, methods, and code chunks related to the various information elements more consistent, and adds missing getters for the sake of completeness. * Initializes previously uninitialized stack variables. * Marks all getters const to improve const correctness. * Adds in-class initializers and removes the default ctor that would create an object with uninitialized built-in fields and call `Clear` afterwards. * Adds a new type alias for new files and changes the existing `typedef` for deleted files into a type alias as well. * Makes the helper method `DecodeNewFile4From` private. * Switches from long-winded iterator syntax to range based loops in a couple of places. * Fixes a couple of assignments where an integer 0 was assigned to boolean members. * Fixes a getter which used to return a `const std::string` instead of the intended `const std::string&`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6383 Test Plan: make check Differential Revision: D19780537 Pulled By: ltamasi fbshipit-source-id: b0b4f09fee0ec0e7c7b7a6d76bfe5346e91824d0	2020-02-07 13:27:06 -08:00
Cheng Chang	5f478b9f75	Remove outdated comment (#6379 ) Summary: Since the logic for handling IDENTITY file is now inside `NewDB`, the comment above `NewDB` is no longer relevant. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6379 Test Plan: not needed Differential Revision: D19795440 Pulled By: cheng-chang fbshipit-source-id: 0b1cca87ac6d92474701c46aa4c8d4d708bfa19b	2020-02-07 13:18:43 -08:00
Cheng Chang	0a74e1b958	Add status checks during DB::Open (#6380 ) Summary: Several statuses were not checked during DB::Open. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6380 Test Plan: make check Differential Revision: D19780237 Pulled By: cheng-chang fbshipit-source-id: c8d189d20344bd1607890dd1449345bda2ef96b9	2020-02-07 12:32:09 -08:00
Yanqin Jin	f361cedf06	Atomic flush rollback once on failure (#6385 ) Summary: Before this fix, atomic flush codepath may hit an assertion failure on a specific failure case. If all flush jobs within an atomic flush succeed (they do not write to MANIFEST), but batch writing version edits to MANIFEST fails, then `cfd->imm()->RollbackMemTableFlush()` will be called twice, and the second invocation hits assertion failure `assert(m->flush_in_progress_)` since the first invocation resets the variable `flush_in_progress_` to false already. Test plan (dev server): ``` ./db_flush_test --gtest_filter=DBAtomicFlushTest/DBAtomicFlushTest.RollbackAfterFailToInstallResults make check ``` Both must succeed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6385 Differential Revision: D19782943 Pulled By: riversand963 fbshipit-source-id: 84e1592625e729d1b70fdc8479959387a74cb121	2020-02-07 10:52:10 -08:00
Cheng Chang	f5f79f01a2	Be able to read compatible leveldb sst files (#6370 ) Summary: In `DBSSTTest.SSTsWithLdbSuffixHandling`, some sst files are renamed to ldb files, the original intention of the test is to test that the ldb files can be loaded along with the sst files. The original test checks this by `ASSERT_NE("NOT_FOUND", Get(Key(k)))`, but the problem is `Get(Key(k))` returns IO error due to path not found instead of NOT_FOUND, so the success of ASSERT_NE does not mean the key can be retrieved. This PR updates the test to make sure Get(Key(k)) returns the original value. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6370 Test Plan: make db_sst_test && ./db_sst_test Differential Revision: D19726278 Pulled By: cheng-chang fbshipit-source-id: 993127f56457b315e669af4eeb92d6f956b7a4b7	2020-02-06 10:15:44 -08:00
Mike Kolupaev	1ed7d9b1b5	Avoid lots of calls to Env::GetFileSize() in SstFileManagerImpl when opening DB (#6363 ) Summary: Before this PR it calls GetFileSize() once for each sst file in the DB. This can take a long time if there are be tens of thousands of sst files (e.g. in thousands of column families), and even longer if Env is talking to some remote service rather than local filesystem. This PR makes DB::Open() use sst file sizes that are already known from manifest (typically almost all files in the DB) and only call GetFileSize() for non-sst or obsolete files. Note that GetFileSize() is also called and checked against manifest in CheckConsistency(), so the calls in SstFileManagerImpl were completely redundant. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6363 Test Plan: deployed to a test cluster, looked at a dump of Env calls (from a custom instrumented Env) - no more thousands of GetFileSize()s. Differential Revision: D19702509 Pulled By: al13n321 fbshipit-source-id: 99f8110620cb2e9d0c092dfcdbb11f3af4ff8b73	2020-02-04 13:41:53 -08:00
sdong	69c8614815	Avoid to get manifest file size when recovering from it. (#6369 ) Summary: Right now RocksDB gets manifest file size before recovering from it. The information is available in LogReader. Use it instead to prevent one file system call. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6369 Test Plan: Run all existing tests Differential Revision: D19714872 fbshipit-source-id: 0144be324d403c99e3da875ea2feccc8f64e883d	2020-02-04 11:39:23 -08:00
Mike Kolupaev	637e64b9ac	Add an option to prevent DB::Open() from querying sizes of all sst files (#6353 ) Summary: When paranoid_checks is on, DBImpl::CheckConsistency() iterates over all sst files and calls Env::GetFileSize() for each of them. As far as I could understand, this is pretty arbitrary and doesn't affect correctness - if filesystem doesn't corrupt fsynced files, the file sizes will always match; if it does, it may as well corrupt contents as well as sizes, and rocksdb doesn't check contents on open. If there are thousands of sst files, getting all their sizes takes a while. If, on top of that, Env is overridden to use some remote storage instead of local filesystem, it can be really slow and overload the remote storage service. This PR adds an option to not do GetFileSize(); instead it does GetChildren() for parent directory to check that all the expected sst files are at least present, but doesn't check their sizes. We can't just disable paranoid_checks instead because paranoid_checks do a few other important things: make the DB read-only on write errors, print error messages on read errors, etc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6353 Test Plan: ran the added sanity check unit test. Will try it out in a LogDevice test cluster where the GetFileSize() calls are causing a lot of trouble. Differential Revision: D19656425 Pulled By: al13n321 fbshipit-source-id: c2c421b367633033760d1f56747bad206d1fbf82	2020-02-04 01:27:26 -08:00

1 2 3 4 5 ...

4014 Commits