mirror of https://github.com/facebook/rocksdb.git
1499 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
anand76 | c3f720c60d |
Enable ReadAsync testing and fault injection in db_stress (#11037)
Summary: The db_stress code uses a wrapper Env on top of the raw/fault injection Env. The wrapper, DbStressEnvWrapper, is a legacy Env and thus has a default implementation of ReadAsync that just does a sync read. As a result, the ReadAsync implementations of PosixFileSystem and other file systems weren't being tested. Also, the ReadAsync interface wasn't implemented in FaultInjectionTestFS. This change implements the necessary interfaces in FaultInjectionTestFS and derives DbStressEnvWrapper from FileSystemWrapper rather than EnvWrapper. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11037 Test Plan: Run db_stress standalone and crash test. With this change, db_stress is able to repro the bug fixed in https://github.com/facebook/rocksdb/issues/10890. Reviewed By: akankshamahajan15 Differential Revision: D42061290 Pulled By: anand1976 fbshipit-source-id: 7f0331fd15ee33fb4f7f0f4b22b206fe801ba074 |
|
Yanqin Jin | c93ba7db5d |
Revise LockWAL/UnlockWAL implementation (#11020)
Summary: RocksDB has two public APIs: `DB::LockWAL()`/`DB::UnlockWAL()`. The current implementation acquires and releases the internal `DBImpl::log_write_mutex_`. According to the comment on `DBImpl::log_write_mutex_`: https://github.com/facebook/rocksdb/blob/7.8.fb/db/db_impl/db_impl.h#L2287:L2288 > Note: to avoid dealock, if needed to acquire both log_write_mutex_ and mutex_, the order should be first mutex_ and then log_write_mutex_. This puts limitations on how applications can use the `LockWAL()` API. After `LockWAL()` returns ok, then application should not perform any operation that acquires `mutex_`. Currently, the use case of `LockWAL()` is MyRocks implementing the MySQL storage engine handlerton `lock_hton_log` interface. The operation that MyRocks performs after `LockWAL()` is `GetSortedWalFiless()` which not only acquires mutex_, but also `log_write_mutex_`. There are two issues: 1. Applications using these two APIs may hang if one thread calls `GetSortedWalFiles()` after calling `LockWAL()` because log_write_mutex is not recursive. 2. Two threads may dead lock due to lock order inversion. To fix these issues, we can modify the implementation of LockWAL so that it does not keep `log_write_mutex_` held until UnlockWAL. To achieve the goal of locking the WAL, we can instead manually inject a write stall so that all future writes will be stopped. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11020 Test Plan: make check Reviewed By: ajkr Differential Revision: D41785203 Pulled By: riversand963 fbshipit-source-id: 5ccb7a9c6eb9a2c3fa80fd2c399cc2568b8f89ce |
|
Hui Xiao | 98d5db5c2e |
Sort L0 files by newly introduced epoch_num (#10922)
Summary:
**Context:**
Sorting L0 files by `largest_seqno` has at least two inconvenience:
- File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap.
- For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
- insert k1@1 to memtable m1
- ingest file s1 with k2@2, ingest file s2 with k3@3
- insert k4@4 to m1
- compact files s1, s2 and result in new file s3 of seqno range [2, 3]
- flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1
- However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption.
- Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption
- For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example)
- an existing SST s1 contains only k1@1
- insert k1@2 to memtable m1
- ingest file s2 with k3@3, ingest file s3 with k4@4
- insert single delete k5@5 in m1
- flush m1 and result in new file s4 of seqno range [2, 5]
- compact s1, s2, s3 and result in new file s5 of seqno range [1, 4]
- compact s4 and result in new file s6 of seqno range [2] due to single delete
- By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno`
Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways:
- In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more.
- In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption.
**Summary:**
- Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`.
- `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`)
- Compaction output file is assigned with the minimum `epoch_number` among input files'
- Refit level: reuse refitted file's epoch_number
- Other paths needing `epoch_number` treatment:
- Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`
- Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`.
- Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair).
- Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder.
- Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery
- Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more
- Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag`
- Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above
- Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`.
- Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR.
- Misc:
- update existing tests with `epoch_number` so make check will pass
- update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases
- assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922
Test Plan:
- `make check`
- New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
- Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930
- [Ongoing] Compatibility test: manually run
|
|
Peter Dillinger | 9b34c097a1 |
Fix bug updating latest backup on delete (#11029)
Summary: Previously, the "latest" valid backup would not be updated on delete. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11029 Test Plan: unit test included (added to an existing test for efficiency) Reviewed By: hx235 Differential Revision: D41967925 Pulled By: pdillinger fbshipit-source-id: ca143354d281eb979557ea421902cd26803a1137 |
|
Peter Dillinger | e079d562af |
Add a SecondaryCache::InsertSaved() API, use in CacheDumper impl (#10945)
Summary: Can simplify some ugly code in cache_dump_load_impl.cc by having an API in SecondaryCache that can directly consume persisted data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10945 Test Plan: existing tests for CacheDumper, added basic unit test Reviewed By: anand1976 Differential Revision: D41231497 Pulled By: pdillinger fbshipit-source-id: b8ec993ef7d3e7efd68aae8602fd3f858da58068 |
|
Levi Tamasi | fbd9077d66 |
Fix a bug where GetContext does not update READ_NUM_MERGE_OPERANDS (#10925)
Summary: The patch fixes a bug where `GetContext::Merge` (and `MergeEntity`) does not update the ticker `READ_NUM_MERGE_OPERANDS` because it implicitly uses the default parameter value of `update_num_ops_stats=false` when calling `MergeHelper::TimedFullMerge`. Also, to prevent such issues going forward, the PR removes the default parameter values from the `TimedFullMerge` methods. In addition, it removes an unused/unnecessary parameter from `TimedFullMergeWithEntity`, and does some cleanup at the call sites of these methods. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10925 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D41096453 Pulled By: ltamasi fbshipit-source-id: fc60646d32b4d516b8fe81e265c3f020a32fd7f8 |
|
Yanqin Jin | 75aca74017 |
Replace member variable lambda with methods (#10924)
Summary: In transaction unit tests, replace a few member variable lambdas with non-static methods. It's easier for gdb to work with variables in methods than in lambdas. (Seen similar things to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86675). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10924 Test Plan: make check Reviewed By: jay-zhuang Differential Revision: D41072241 Pulled By: riversand963 fbshipit-source-id: e4fa491de573c4656225a86a75af926c1df827f6 |
|
Yanqin Jin | 0547cecb81 |
Reduce access to atomic variables in a test (#10909)
Summary: With TSAN build on CircleCI (see mini-tsan in .circleci/config). Sometimes `SeqAdvanceConcurrentTest.SeqAdvanceConcurrent` will get stuck when an experimental feature called "unordered write" is enabled. Stack trace will be the following ``` Thread 7 (Thread 0x7f2284a1c700 (LWP 481523) "write_prepared_"): #0 0x00000000004fa3f5 in __tsan_atomic64_load () at ./db/merge_context.h:15 https://github.com/facebook/rocksdb/issues/1 0x00000000005e5942 in std::__atomic_base<unsigned long>::load (this=0x7b74000012f8, __m=std::memory_order_seq_cst) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:481 https://github.com/facebook/rocksdb/issues/2 std::__atomic_base<unsigned long>::operator unsigned long (this=0x7b74000012f8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:341 https://github.com/facebook/rocksdb/issues/3 0x00000000005bf001 in rocksdb::SeqAdvanceConcurrentTest_SeqAdvanceConcurrent_Test::TestBody()::$_9::operator()(void*) const (this=0x7b14000085e8) at utilities/transactions/write_prepared_transaction_test.cc:1702 Thread 6 (Thread 0x7f228421b700 (LWP 481521) "write_prepared_"): #0 0x000000000052178c in __tsan::MetaMap::GetAndLock(__tsan::ThreadState*, unsigned long, unsigned long, bool, bool) () at ./db/merge_context.h:15 https://github.com/facebook/rocksdb/issues/1 0x00000000004fa48e in __tsan_atomic64_load () at ./db/merge_context.h:15 https://github.com/facebook/rocksdb/issues/2 0x00000000005e5942 in std::__atomic_base<unsigned long>::load (this=0x7b74000012f8, __m=std::memory_order_seq_cst) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:481 https://github.com/facebook/rocksdb/issues/3 std::__atomic_base<unsigned long>::operator unsigned long (this=0x7b74000012f8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:341 https://github.com/facebook/rocksdb/issues/4 0x00000000005bf001 in rocksdb::SeqAdvanceConcurrentTest_SeqAdvanceConcurrent_Test::TestBody()::$_9::operator()(void*) const (this=0x7b14000085e8) at utilities/transactions/write_prepared_transaction_test.cc:1702 ``` This is problematic and suspicious. Two threads will get stuck in the same place trying to load from an atomic variable. https://github.com/facebook/rocksdb/blob/7.8.fb/utilities/transactions/write_prepared_transaction_test.cc#L1694:L1707. Not sure why two threads can reach the same point. The stack trace shows that there may be a deadlock, since the two threads are on the same write thread (one is doing Prepare, while the other is trying to commit). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10909 Test Plan: On CircleCI mini-tsan, apply a patch first so that we have a higher chance of hitting the same problematic situation, ``` diff --git a/utilities/transactions/write_prepared_transaction_test.cc b/utilities/transactions/write_prepared_transaction_test.cc index 4bc1f3744..bd5dc4924 100644 --- a/utilities/transactions/write_prepared_transaction_test.cc +++ b/utilities/transactions/write_prepared_transaction_test.cc @@ -1714,13 +1714,13 @@ TEST_P(SeqAdvanceConcurrentTest, SeqAdvanceConcurrent) { size_t d = (n % base[bi + 1]) / base[bi]; switch (d) { case 0: - threads.emplace_back(txn_t0, bi); + threads.emplace_back(txn_t3, bi); break; case 1: - threads.emplace_back(txn_t1, bi); + threads.emplace_back(txn_t3, bi); break; case 2: - threads.emplace_back(txn_t2, bi); + threads.emplace_back(txn_t3, bi); break; case 3: threads.emplace_back(txn_t3, bi); ``` then build and run tests ``` COMPILE_WITH_TSAN=1 CC=clang-13 CXX=clang++-13 ROCKSDB_DISABLE_ALIGNED_NEW=1 USE_CLANG=1 make V=1 -j32 check gtest-parallel -r 100 ./write_prepared_transaction_test --gtest_filter=TwoWriteQueues/SeqAdvanceConcurrentTest.SeqAdvanceConcurrent/19 ``` In the above, `SeqAdvanceConcurrent/19`. The tests 10 to 19 correspond to unordered write in which Prepare() and Commit() can both enter the same write thread. Before this PR, there is a high chance of hitting the deadlock. With this PR, no deadlock has been encountered so far. Reviewed By: ltamasi Differential Revision: D40869387 Pulled By: riversand963 fbshipit-source-id: 81e82a70c263e4f3417597a201b081ee54f1deab |
|
Yanqin Jin | 7d26e4c5a3 |
Basic Support for Merge with user-defined timestamp (#10819)
Summary: This PR implements the originally disabled `Merge()` APIs when user-defined timestamp is enabled. Simplest usage: ```cpp // assume string append merge op is used with '.' as delimiter. // ts1 < ts2 db->Put(WriteOptions(), "key", ts1, "v0"); db->Merge(WriteOptions(), "key", ts2, "1"); ReadOptions ro; ro.timestamp = &ts2; db->Get(ro, "key", &value); ASSERT_EQ("v0.1", value); ``` Some code comments are added for clarity. Note: support for timestamp in `DB::GetMergeOperands()` will be done in a follow-up PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10819 Test Plan: make check Reviewed By: ltamasi Differential Revision: D40603195 Pulled By: riversand963 fbshipit-source-id: f96d6f183258f3392d80377025529f7660503013 |
|
Yanqin Jin | 900f79126d |
Pass `const LockInfo&` to AcquireLocked() and AcquireWithTimeout (#10874)
Summary: The motivation and benefit of current behavior of passing `LockInfo&&` as argument to AcquireLocked() and AcquireWithTimeout() is not clear to me. Furthermore, in AcquireWithTimeout(), we access members of `LockInfo&&` after it is passed to AcquireLocked() as rvalue ref. In addition, we may call `AcquireLocked()` with `std::move(lock_info)` multiple times. This leads to linter warning of use-after-move. If future implementation of AcquireLocked() does something like moving-construct a new `LockedInfo` using the passed-in `LockInfo&&`, then the caller cannot use it because `LockInfo` has a member of type `autovector`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10874 Test Plan: make check Reviewed By: ltamasi Differential Revision: D40704210 Pulled By: riversand963 fbshipit-source-id: 20091df65b4fc63b072bcec9809efc49955d6d35 |
|
Yanqin Jin | 95a1935cb1 |
Run clang-format on utilities/transactions (#10871)
Summary: This PR is the result of running the following command ``` find ./utilities/transactions/ -name '*.cc' -o -name '*.h' -o -name '*.c' -o -name '*.hpp' -o -name '*.cpp' | xargs clang-format -i ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10871 Test Plan: make check Reviewed By: cbi42 Differential Revision: D40686871 Pulled By: riversand963 fbshipit-source-id: 613738d667ec8f8e13cce4802e0e166d6be52211 |
|
Levi Tamasi | 4d9cb433fa |
Run clang-format on utilities/ (except utilities/transactions/) (#10853)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10853 Test Plan: `make check` Reviewed By: siying Differential Revision: D40651315 Pulled By: ltamasi fbshipit-source-id: 8b270ff4777a06464be86e376c2a680427866a46 |
|
akankshamahajan | 0e7b27bfcf |
Refactor block cache tracing APIs (#10811)
Summary: Refactor the classes, APIs and data structures for block cache tracing to allow a user provided trace writer to be used. Currently, only a TraceWriter is supported, with a default built-in implementation of FileTraceWriter. The TraceWriter, however, takes a flat trace record and is thus only suitable for file tracing. This PR introduces an abstract BlockCacheTraceWriter class that takes a structured BlockCacheTraceRecord. The BlockCacheTraceWriter implementation can then format and log the record in whatever way it sees fit. The default BlockCacheTraceWriterImpl does file tracing using a user provided TraceWriter. `DB::StartBlockTrace` will internally redirect to changed `BlockCacheTrace::StartBlockCacheTrace`. New API `DB::StartBlockTrace` is also added that directly takes `BlockCacheTraceWriter` pointer. This same philosophy can be applied to KV and IO tracing as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10811 Test Plan: existing unit tests Old API DB::StartBlockTrace checked with db_bench tool create database ``` ./db_bench --benchmarks="fillseq" \ --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \ --cache_index_and_filter_blocks --cache_size=1048576 \ --disable_auto_compactions=1 --disable_wal=1 --compression_type=none \ --min_level_to_compress=-1 --compression_ratio=1 --num=10000000 ``` To trace block cache accesses when running readrandom benchmark: ``` ./db_bench --benchmarks="readrandom" --use_existing_db --duration=60 \ --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \ --cache_index_and_filter_blocks --cache_size=1048576 \ --disable_auto_compactions=1 --disable_wal=1 --compression_type=none \ --min_level_to_compress=-1 --compression_ratio=1 --num=10000000 \ --threads=16 \ -block_cache_trace_file="/tmp/binary_trace_test_example" \ -block_cache_trace_max_trace_file_size_in_bytes=1073741824 \ -block_cache_trace_sampling_frequency=1 ``` Reviewed By: anand1976 Differential Revision: D40435289 Pulled By: akankshamahajan15 fbshipit-source-id: fa2755f4788185e19f4605e731641cfd21ab3282 |
|
Peter Dillinger | e466173d5c |
Print stack traces on frozen tests in CI (#10828)
Summary: Instead of existing calls to ps from gnu_parallel, call a new wrapper that does ps, looks for unit test like processes, and uses pstack or gdb to print thread stack traces. Also, using `ps -wwf` instead of `ps -wf` ensures output is not cut off. For security, CircleCI runs with security restrictions on ptrace (/proc/sys/kernel/yama/ptrace_scope = 1), and this change adds a work-around to `InstallStackTraceHandler()` (only used by testing tools) to allow any process from the same user to debug it. (I've also touched >100 files to ensure all the unit tests call this function.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10828 Test Plan: local manual + temporary infinite loop in a unit test to observe in CircleCI Reviewed By: hx235 Differential Revision: D40447634 Pulled By: pdillinger fbshipit-source-id: 718a4c4a5b54fa0f9af2d01a446162b45e5e84e1 |
|
Peter Dillinger | 2d0380adbe |
Allow manifest fix-up without requiring prior state (#10796)
Summary: This change is motivated by ensuring that `ldb update_manifest` or `UpdateManifestForFilesState` can run without expecting files to open when the old temperature is provided (in case the FileSystem strictly interprets non-kUnknown), but ended up fixing a problem in `OfflineManifestWriter` (used by `ldb unsafe_remove_sst_file`) where it would open some SST files during recovery and expect them to match the prior manifest state, even if not required by the intended new state. Also update BackupEngine to retry with Temperature kUnknown when reading file with potentially "wrong" temperature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10796 Test Plan: tests added/updated, that fail before the change(s) and now pass Reviewed By: jay-zhuang Differential Revision: D40232645 Pulled By: jay-zhuang fbshipit-source-id: b5aa2688aecfe0c320b80a7da689b315414c20be |
|
gitbw95 | 47b57a3731 |
add SetCapacity and GetCapacity for secondary cache (#10712)
Summary: To support tuning secondary cache dynamically, add `SetCapacity()` and `GetCapacity()` for CompressedSecondaryCache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10712 Test Plan: Unit Tests Reviewed By: anand1976 Differential Revision: D39685212 Pulled By: gitbw95 fbshipit-source-id: 19573c67237011927320207732b5de083cb87240 |
|
Yanqin Jin | 7045b74b47 |
Remove timestamp before inserting to WBWI's index (#10742)
Summary: Currently, this original behavior should not lead to incorrect result, but will violate the contract of CompareWithTimestamp() that when a_has_ts or b_has_ts is false, the slice does not include timestamp. Resolves https://github.com/facebook/rocksdb/issues/10709 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10742 Test Plan: make check Reviewed By: ltamasi Differential Revision: D39834096 Pulled By: riversand963 fbshipit-source-id: c597600f5a7820734f07d0926cdc224cea5eabe1 |
|
Yanqin Jin | 07249fea8f |
Fix DBImpl::GetLatestSequenceForKey() for Merge (#10724)
Summary: Currently, without this fix, DBImpl::GetLatestSequenceForKey() may not return the latest sequence number for merge operands of the key. This can cause conflict checking during optimistic transaction commit phase to fail. Fix it by always returning the latest sequence number of the key, also considering range tombstones. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10724 Test Plan: make check Reviewed By: cbi42 Differential Revision: D39756847 Pulled By: riversand963 fbshipit-source-id: 0764c3dd4cb24960b37e18adccc6e7feed0e6876 |
|
Peter Dillinger | ef443cead4 |
Refactor to avoid confusing "raw block" (#10408)
Summary: We have a lot of confusing code because of mixed, sometimes completely opposite uses of of the term "raw block" or "raw contents", sometimes within the same source file. For example, in `BlockBasedTableBuilder`, `raw_block_contents` and `raw_size` generally referred to uncompressed block contents and size, while `WriteRawBlock` referred to writing a block that is already compressed if it is going to be. Meanwhile, in `BlockBasedTable`, `raw_block_contents` either referred to a (maybe compressed) block with trailer, or a maybe compressed block maybe without trailer. (Note: left as follow-up work to use C++ typing to better sort out the various kinds of BlockContents.) This change primarily tries to apply some consistent terminology around the kinds of block representations, avoiding the unclear "raw". (Any meaning of "raw" assumes some bias toward the storage layer or toward the logical data layer.) Preferred terminology: * **Serialized block** - bytes that go into storage. For block-based table (usually the case) this includes the block trailer. WART: block `size` may or may not include the trailer; need to be clear about whether it does or not. * **Maybe compressed block** - like a serialized block, but without the trailer (or no promise of including a trailer). Must be accompanied by a CompressionType. * **Uncompressed block** - "payload" bytes that are either stored with no compression, used as input to compression function, or result of decompression function. * **Parsed block** - an in-memory form of a block in block cache, as it is used by the table reader. Different C++ types are used depending on the block type (see block_like_traits.h). Other refactorings: * Misc corrections/improvements of internal API comments * Remove a few misleading / unhelpful / redundant comments. * Use move semantics in some places to simplify contracts * Use better parameter names to indicate which parameters are used for outputs * Remove some extraneous `extern` * Various clean-ups to `CacheDumperImpl` (mostly unnecessary code) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10408 Test Plan: existing tests Reviewed By: akankshamahajan15 Differential Revision: D38172617 Pulled By: pdillinger fbshipit-source-id: ccb99299f324ac5ca46996d34c5089621a4f260c |
|
Bo Wang | b418ace352 |
Disable PersistentCacheTierTest.BasicTest (#10683)
Summary: Disable this flaky test since PersistentCache is not used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10683 Test Plan: Unit Tests Reviewed By: cbi42 Differential Revision: D39545974 Pulled By: gitbw95 fbshipit-source-id: ac53e96f6ba880e7612e325eb5ff22ee2799efed |
|
Yanqin Jin | 832fd644fc |
Reset pessimistic transaction's read/commit timestamps during Initialize() (#10677)
Summary: RocksDB allows reusing old `Transaction` objects when creating new ones. Therefore, we need to reset the transaction's read and commit timestamps back to default values `kMaxTxnTimestamp`. Otherwise, `CommitAndTryCreateSnapshot()` may fail with "Status::InvalidArgument("Different commit ts specified")". Pull Request resolved: https://github.com/facebook/rocksdb/pull/10677 Test Plan: make check Reviewed By: ltamasi Differential Revision: D39513543 Pulled By: riversand963 fbshipit-source-id: bea01cac149bff3a23a2978fc0c3b198243a6291 |
|
Peter Dillinger | 6de7081cf3 |
Always verify SST unique IDs on SST file open (#10532)
Summary: Although we've been tracking SST unique IDs in the DB manifest unconditionally, checking has been opt-in and with an extra pass at DB::Open time. This changes the behavior of `verify_sst_unique_id_in_manifest` to check unique ID against manifest every time an SST file is opened through table cache (normal DB operations), replacing the explicit pass over files at DB::Open time. This change also enables the option by default and removes the "EXPERIMENTAL" designation. One possible criticism is that the option no longer ensures the integrity of a DB at Open time. This is far from an all-or-nothing issue. Verifying the IDs of all SST files hardly ensures all the data in the DB is readable. (VerifyChecksum is supposed to do that.) Also, with max_open_files=-1 (default, extremely common), all SST files are opened at DB::Open time anyway. Implementation details: * `VerifySstUniqueIdInManifest()` functions are the extra/explicit pass that is now removed. * Unit tests that manipulate/corrupt table properties have to opt out of this check, because that corrupts the "actual" unique id. (And even for testing we don't currently have a mechanism to set "no unique id" in the in-memory file metadata for new files.) * A lot of other unit test churn relates to (a) default checking on, and (b) checking on SST open even without DB::Open (e.g. on flush) * Use `FileMetaData` for more `TableCache` operations (in place of `FileDescriptor`) so that we have access to the unique_id whenever we might need to open an SST file. **There is the possibility of performance impact because we can no longer use the more localized `fd` part of an `FdWithKeyRange` but instead follow the `file_metadata` pointer. However, this change (possible regression) is only done for `GetMemoryUsageByTableReaders`.** * Removed a completely unnecessary constructor overload of `TableReaderOptions` Possible follow-up: * Verification only happens when opening through table cache. Are there more places where this should happen? * Improve error message when there is a file size mismatch vs. manifest (FIXME added in the appropriate place). * I'm not sure there's a justification for `FileDescriptor` to be distinct from `FileMetaData`. * I'm skeptical that `FdWithKeyRange` really still makes sense for optimizing some data locality by duplicating some data in memory, but I could be wrong. * An unnecessary overload of NewTableReader was recently added, in the public API nonetheless (though unusable there). It should be cleaned up to put most things under `TableReaderOptions`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10532 Test Plan: updated unit tests Performance test showing no significant difference (just noise I think): `./db_bench -benchmarks=readwhilewriting[-X10] -num=3000000 -disable_wal=1 -bloom_bits=8 -write_buffer_size=1000000 -target_file_size_base=1000000` Before: readwhilewriting [AVG 10 runs] : 68702 (± 6932) ops/sec After: readwhilewriting [AVG 10 runs] : 68239 (± 7198) ops/sec Reviewed By: jay-zhuang Differential Revision: D38765551 Pulled By: pdillinger fbshipit-source-id: a827a708155f12344ab2a5c16e7701c7636da4c2 |
|
Bo Wang | d490bfcdb6 |
Avoid recompressing cold block in CompressedSecondaryCache (#10527)
Summary: **Summary:** When a block is firstly `Lookup` from the secondary cache, we just insert a dummy block in the primary cache (charging the actual size of the block) and don’t erase the block from the secondary cache. A standalone handle is returned from `Lookup`. Only if the block is hit again, we erase it from the secondary cache and add it into the primary cache. When a block is firstly evicted from the primary cache to the secondary cache, we just insert a dummy block (size 0) in the secondary cache. When the block is evicted again, it is treated as a hot block and is inserted into the secondary cache. **Implementation Details** Add a new state of LRUHandle: The handle is never inserted into the LRUCache (both hash table and LRU list) and it doesn't experience the above three states. The entry can be freed when refs becomes 0. (refs >= 1 && in_cache == false && IS_STANDALONE == true) The behaviors of `LRUCacheShard::Lookup()` are updated if the secondary_cache is CompressedSecondaryCache: 1. If a handle is found in primary cache: 1.1. If the handle's value is not nullptr, it is returned immediately. 1.2. If the handle's value is nullptr, this means the handle is a dummy one. For a dummy handle, if it was retrieved from secondary cache, it may still exist in secondary cache. - 1.2.1. If no valid handle can be `Lookup` from secondary cache, return nullptr. - 1.2.2. If the handle from secondary cache is valid, erase it from the secondary cache and add it into the primary cache. 2. If a handle is not found in primary cache: 2.1. If no valid handle can be `Lookup` from secondary cache, return nullptr. 2.2. If the handle from secondary cache is valid, insert a dummy block in the primary cache (charging the actual size of the block) and return a standalone handle. The behaviors of `LRUCacheShard::Promote()` are updated as follows: 1. If `e->sec_handle` has value, one of the following steps can happen: 1.1. Insert a dummy handle and return a standalone handle to caller when `secondary_cache_` is `CompressedSecondaryCache` and e is a standalone handle. 1.2. Insert the item into the primary cache and return the handle to caller. 1.3. Exception handling. 3. If `e->sec_handle` has no value, mark the item as not in cache and charge the cache as its only metadata that'll shortly be released. The behavior of `CompressedSecondaryCache::Insert()` is updated: 1. If a block is evicted from the primary cache for the first time, a dummy item is inserted. 4. If a dummy item is found for a block, the block is inserted into the secondary cache. The behavior of `CompressedSecondaryCache:::Lookup()` is updated: 1. If a handle is not found or it is a dummy item, a nullptr is returned. 2. If `erase_handle` is true, the handle is erased. The behaviors of `LRUCacheShard::Release()` are adjusted for the standalone handles. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10527 Test Plan: 1. stress tests. 5. unit tests. 6. CPU profiling for db_bench. Reviewed By: siying Differential Revision: D38747613 Pulled By: gitbw95 fbshipit-source-id: 74a1eba7e1957c9affb2bd2ae3e0194584fa6eca |
|
Changyu Bi | 30bc495c03 |
Skip swaths of range tombstone covered keys in merging iterator (2022 edition) (#10449)
Summary: Delete range logic is moved from `DBIter` to `MergingIterator`, and `MergingIterator` will seek to the end of a range deletion if possible instead of scanning through each key and check with `RangeDelAggregator`. With the invariant that a key in level L (consider memtable as the first level, each immutable and L0 as a separate level) has a larger sequence number than all keys in any level >L, a range tombstone `[start, end)` from level L covers all keys in its range in any level >L. This property motivates optimizations in iterator: - in `Seek(target)`, if level L has a range tombstone `[start, end)` that covers `target.UserKey`, then for all levels > L, we can do Seek() on `end` instead of `target` to skip some range tombstone covered keys. - in `Next()/Prev()`, if the current key is covered by a range tombstone `[start, end)` from level L, we can do `Seek` to `end` for all levels > L. This PR implements the above optimizations in `MergingIterator`. As all range tombstone covered keys are now skipped in `MergingIterator`, the range tombstone logic is removed from `DBIter`. The idea in this PR is similar to https://github.com/facebook/rocksdb/issues/7317, but this PR leaves `InternalIterator` interface mostly unchanged. **Credit**: the cascading seek optimization and the sentinel key (discussed below) are inspired by [Pebble](https://github.com/cockroachdb/pebble/blob/master/merging_iter.go) and suggested by ajkr in https://github.com/facebook/rocksdb/issues/7317. The two optimizations are mostly implemented in `SeekImpl()/SeekForPrevImpl()` and `IsNextDeleted()/IsPrevDeleted()` in `merging_iterator.cc`. See comments for each method for more detail. One notable change is that the minHeap/maxHeap used by `MergingIterator` now contains range tombstone end keys besides point key iterators. This helps to reduce the number of key comparisons. For example, for a range tombstone `[start, end)`, a `start` and an `end` `HeapItem` are inserted into the heap. When a `HeapItem` for range tombstone start key is popped from the minHeap, we know this range tombstone becomes "active" in the sense that, before the range tombstone's end key is popped from the minHeap, all the keys popped from this heap is covered by the range tombstone's internal key range `[start, end)`. Another major change, *delete range sentinel key*, is made to `LevelIterator`. Before this PR, when all point keys in an SST file are iterated through in `MergingIterator`, a level iterator would advance to the next SST file in its level. In the case when an SST file has a range tombstone that covers keys beyond the SST file's last point key, advancing to the next SST file would lose this range tombstone. Consequently, `MergingIterator` could return keys that should have been deleted by some range tombstone. We prevent this by pretending that file boundaries in each SST file are sentinel keys. A `LevelIterator` now only advance the file iterator once the sentinel key is processed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10449 Test Plan: - Added many unit tests in db_range_del_test - Stress test: `./db_stress --readpercent=5 --prefixpercent=19 --writepercent=20 -delpercent=10 --iterpercent=44 --delrangepercent=2` - Additional iterator stress test is added to verify against iterators against expected state: https://github.com/facebook/rocksdb/issues/10538. This is based on ajkr's previous attempt https://github.com/facebook/rocksdb/pull/5506#issuecomment-506021913. ``` python3 ./tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --compression_type=none --max_background_compactions=8 --value_size_mult=33 --max_key=5000000 --interval=10 --duration=7200 --delrangepercent=3 --delpercent=9 --iterpercent=25 --writepercent=60 --readpercent=3 --prefixpercent=0 --num_iterations=1000 --range_deletion_width=100 --verify_iterator_with_expected_state_one_in=1 ``` - Performance benchmark: I used a similar setup as in the blog [post](http://rocksdb.org/blog/2018/11/21/delete-range.html) that introduced DeleteRange, "a database with 5 million data keys, and 10000 range tombstones (ignoring those dropped during compaction) that were written in regular intervals after 4.5 million data keys were written". As expected, the performance with this PR depends on the range tombstone width. ``` # Setup: TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=fillrandom --writes=4500000 --num=5000000 TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=overwrite --writes=500000 --num=5000000 --use_existing_db=true --writes_per_range_tombstone=50 # Scan entire DB TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=readseq[-X5] --use_existing_db=true --num=5000000 --disable_auto_compactions=true # Short range scan (10 Next()) TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=100000 --seek_nexts=10 --disable_auto_compactions=true # Long range scan(1000 Next()) TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=2500 --seek_nexts=1000 --disable_auto_compactions=true ``` Avg over of 10 runs (some slower tests had fews runs): For the first column (tombstone), 0 means no range tombstone, 100-10000 means width of the 10k range tombstones, and 1 means there is a single range tombstone in the entire DB (width is 1000). The 1 tombstone case is to test regression when there's very few range tombstones in the DB, as no range tombstone is likely to take a different code path than with range tombstones. - Scan entire DB | tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% | | ------------- | ------------- | ------------- | ------------- | | 0 range tombstone |2525600 (± 43564) |2486917 (± 33698) |-1.53% | | 100 |1853835 (± 24736) |2073884 (± 32176) |+11.87% | | 1000 |422415 (± 7466) |1115801 (± 22781) |+164.15% | | 10000 |22384 (± 227) |227919 (± 6647) |+918.22% | | 1 range tombstone |2176540 (± 39050) |2434954 (± 24563) |+11.87% | - Short range scan | tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% | | ------------- | ------------- | ------------- | ------------- | | 0 range tombstone |35398 (± 533) |35338 (± 569) |-0.17% | | 100 |28276 (± 664) |31684 (± 331) |+12.05% | | 1000 |7637 (± 77) |25422 (± 277) |+232.88% | | 10000 |1367 |28667 |+1997.07% | | 1 range tombstone |32618 (± 581) |32748 (± 506) |+0.4% | - Long range scan | tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% | | ------------- | ------------- | ------------- | ------------- | | 0 range tombstone |2262 (± 33) |2353 (± 20) |+4.02% | | 100 |1696 (± 26) |1926 (± 18) |+13.56% | | 1000 |410 (± 6) |1255 (± 29) |+206.1% | | 10000 |25 |414 |+1556.0% | | 1 range tombstone |1957 (± 30) |2185 (± 44) |+11.65% | - Microbench does not show significant regression: https://gist.github.com/cbi42/59f280f85a59b678e7e5d8561e693b61 Reviewed By: ajkr Differential Revision: D38450331 Pulled By: cbi42 fbshipit-source-id: b5ef12e8d8c289ed2e163ccdf277f5039b511fca |
|
sdong | 9509003503 |
Option migration tool to break down files for FIFO compaction (#10600)
Summary: Right now, when the option migration tool migrates to FIFO compaction, it compacts all the data into one single SST file and move to L0. Although it creates a valid LSM-tree for FIFO, for any data to be deleted for FIFO, the giant file will be deleted, which might make the DB almost empty. There is not good solution for it, because usually we don't have enough information to reconstruct the FIFO LSM-tree. This change changes to a solution that compromises the FIFO condition. We hope the solution is more useable. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10600 Test Plan: Add unit tests for that. Reviewed By: jay-zhuang Differential Revision: D39106424 fbshipit-source-id: bdfd852c3b343373765b8d9716fefc08fd27145c |
|
Peter Dillinger | c5afbbfe4b |
Don't wait for indirect flush in read-only DB (#10569)
Summary: Some APIs for getting live files, which are used by Checkpoint and BackupEngine, can optionally trigger and wait for a flush. These would deadlock when used on a read-only DB. Here we fix that by assuming the user wants the overall operation to succeed and is OK without flushing (because the DB is read-only). Follow-up work: the same or other issues can be hit by directly invoking some DB functions that are clearly not appropriate for read-only instance, but are not covered by overrides in DBImplReadOnly and CompactedDBImpl. These should be fixed to avoid similar problems on accidental misuse. (Long term, it would be nice to have a DBReadOnly class without those members, like BackupEngineReadOnly.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10569 Test Plan: tests updated to catch regression (hang before the fix) Reviewed By: riversand963 Differential Revision: D38995759 Pulled By: pdillinger fbshipit-source-id: f5f8bc7123e13cb45bd393dd974d7d6eda20bc68 |
|
Hui Xiao | b16655a547 |
Add missing synchronization in TestFSWritableFile (#10544)
Summary: **Context:** ajkr's command revealed an existing TSAN data race between `TestFSWritableFile::Append` and `TestFSWritableFile::Sync` on `TestFSWritableFile::state_` ``` $ make clean && COMPILE_WITH_TSAN=1 make -j56 db_stress $ python3 tools/db_crashtest.py blackbox --simple --duration=3600 --interval=10 --sync_fault_injection=1 --disable_wal=0 --max_key=10000 --checkpoint_one_in=1000 ``` The race is due to concurrent access from [checkpoint's WAL sync](https://github.com/facebook/rocksdb/blob/7.4.fb/utilities/fault_injection_fs.cc#L324) and [db put's WAL write when ‘sync_fault_injection=1 ‘](https://github.com/facebook/rocksdb/blob/7.4.fb/utilities/fault_injection_fs.cc#L208) to the `state_` on the same WAL `TestFSWritableFile` under the missing synchronization. ``` WARNING: ThreadSanitizer: data race (pid=11275) Write of size 8 at 0x7b480003d850 by thread T23 (mutexes: write M69230): #0 rocksdb::TestFSWritableFile::Sync(rocksdb::IOOptions const&, rocksdb::IODebugContext*) internal_repo_rocksdb/repo/utilities/fault_injection_fs.cc:297 (db_stress+0x716004) https://github.com/facebook/rocksdb/issues/1 rocksdb::(anonymous namespace)::CompositeWritableFileWrapper::Sync() internal_repo_rocksdb/repo/env/composite_env.cc:154 (db_stress+0x4dfa78) https://github.com/facebook/rocksdb/issues/2 rocksdb::(anonymous namespace)::LegacyWritableFileWrapper::Sync(rocksdb::IOOptions const&, rocksdb::IODebugContext*) internal_repo_rocksdb/repo/env/env.cc:280 (db_stress+0x6dfd24) https://github.com/facebook/rocksdb/issues/3 rocksdb::WritableFileWriter::SyncInternal(bool) internal_repo_rocksdb/repo/file/writable_file_writer.cc:460 (db_stress+0xa1b98c) https://github.com/facebook/rocksdb/issues/4 rocksdb::WritableFileWriter::SyncWithoutFlush(bool) internal_repo_rocksdb/repo/file/writable_file_writer.cc:435 (db_stress+0xa1e441) https://github.com/facebook/rocksdb/issues/5 rocksdb::DBImpl::SyncWAL() internal_repo_rocksdb/repo/db/db_impl/db_impl.cc:1385 (db_stress+0x529458) https://github.com/facebook/rocksdb/issues/6 rocksdb::DBImpl::FlushWAL(bool) internal_repo_rocksdb/repo/db/db_impl/db_impl.cc:1339 (db_stress+0x54f82a) https://github.com/facebook/rocksdb/issues/7 rocksdb::DBImpl::GetLiveFilesStorageInfo(rocksdb::LiveFilesStorageInfoOptions const&, std::vector<rocksdb::LiveFileStorageInfo, std::allocator<rocksdb::LiveFileStorageInfo> >*) internal_repo_rocksdb/repo/db/db_filesnapshot.cc:387 (db_stress+0x5c831d) https://github.com/facebook/rocksdb/issues/8 rocksdb::CheckpointImpl::CreateCustomCheckpoint(std::function<rocksdb::Status (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::FileType)>, std::function<rocksdb::Status (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, rocksdb::FileType, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Temperature)>, std::function<rocksdb::Status (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::FileType)>, unsigned long*, unsigned long, bool) internal_repo_rocksdb/repo/utilities/checkpoint/checkpoint_impl.cc:214 (db_stress+0x4c0343) https://github.com/facebook/rocksdb/issues/9 rocksdb::CheckpointImpl::CreateCheckpoint(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, unsigned long*) internal_repo_rocksdb/repo/utilities/checkpoint/checkpoint_impl.cc:123 (db_stress+0x4c237e) https://github.com/facebook/rocksdb/issues/10 rocksdb::StressTest::TestCheckpoint(rocksdb::ThreadState*, std::vector<int, std::allocator<int> > const&, std::vector<long, std::allocator<long> > const&) internal_repo_rocksdb/repo/db_stress_tool/db_stress_test_base.cc:1699 (db_stress+0x328340) https://github.com/facebook/rocksdb/issues/11 rocksdb::StressTest::OperateDb(rocksdb::ThreadState*) internal_repo_rocksdb/repo/db_stress_tool/db_stress_test_base.cc:825 (db_stress+0x33921f) https://github.com/facebook/rocksdb/issues/12 rocksdb::ThreadBody(void*) internal_repo_rocksdb/repo/db_stress_tool/db_stress_driver.cc:33 (db_stress+0x354857) https://github.com/facebook/rocksdb/issues/13 rocksdb::(anonymous namespace)::StartThreadWrapper(void*) internal_repo_rocksdb/repo/env/env_posix.cc:447 (db_stress+0x6eb2ad) Previous read of size 8 at 0x7b480003d850 by thread T64 (mutexes: write M980798978697532600, write M253744503184415024, write M1262): #0 memcpy <null> (db_stress+0xbc9696) https://github.com/facebook/rocksdb/issues/1 operator= internal_repo_rocksdb/repo/utilities/fault_injection_fs.h:35 (db_stress+0x70d5f1) https://github.com/facebook/rocksdb/issues/2 rocksdb::FaultInjectionTestFS::WritableFileAppended(rocksdb::FSFileState const&) internal_repo_rocksdb/repo/utilities/fault_injection_fs.cc:827 (db_stress+0x70d5f1) https://github.com/facebook/rocksdb/issues/3 rocksdb::TestFSWritableFile::Append(rocksdb::Slice const&, rocksdb::IOOptions const&, rocksdb::IODebugContext*) internal_repo_rocksdb/repo/utilities/fault_injection_fs.cc:173 (db_stress+0x7143af) https://github.com/facebook/rocksdb/issues/4 rocksdb::(anonymous namespace)::CompositeWritableFileWrapper::Append(rocksdb::Slice const&) internal_repo_rocksdb/repo/env/composite_env.cc:115 (db_stress+0x4de3ab) https://github.com/facebook/rocksdb/issues/5 rocksdb::(anonymous namespace)::LegacyWritableFileWrapper::Append(rocksdb::Slice const&, rocksdb::IOOptions const&, rocksdb::IODebugContext*) internal_repo_rocksdb/repo/env/env.cc:248 (db_stress+0x6df44b) https://github.com/facebook/rocksdb/issues/6 rocksdb::WritableFileWriter::WriteBuffered(char const*, unsigned long, rocksdb::Env::IOPriority) internal_repo_rocksdb/repo/file/writable_file_writer.cc:551 (db_stress+0xa1a953) https://github.com/facebook/rocksdb/issues/7 rocksdb::WritableFileWriter::Flush(rocksdb::Env::IOPriority) internal_repo_rocksdb/repo/file/writable_file_writer.cc:327 (db_stress+0xa16ee8) https://github.com/facebook/rocksdb/issues/8 rocksdb::log::Writer::AddRecord(rocksdb::Slice const&, rocksdb::Env::IOPriority) internal_repo_rocksdb/repo/db/log_writer.cc:147 (db_stress+0x7f121f) https://github.com/facebook/rocksdb/issues/9 rocksdb::DBImpl::WriteToWAL(rocksdb::WriteBatch const&, rocksdb::log::Writer*, unsigned long*, unsigned long*, rocksdb::Env::IOPriority, rocksdb::DBImpl::LogFileNumberSize&) internal_repo_rocksdb/repo/db/db_impl/db_impl_write.cc:1285 (db_stress+0x695042) https://github.com/facebook/rocksdb/issues/10 rocksdb::DBImpl::WriteToWAL(rocksdb::WriteThread::WriteGroup const&, rocksdb::log::Writer*, unsigned long*, bool, bool, unsigned long, rocksdb::DBImpl::LogFileNumberSize&) internal_repo_rocksdb/repo/db/db_impl/db_impl_write.cc:1328 (db_stress+0x6907e8) https://github.com/facebook/rocksdb/issues/11 rocksdb::DBImpl::PipelinedWriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*) internal_repo_rocksdb/repo/db/db_impl/db_impl_write.cc:731 (db_stress+0x68e8a7) https://github.com/facebook/rocksdb/issues/12 rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*, rocksdb::PostMemTableCallback*) internal_repo_rocksdb/repo/db/db_impl/db_impl_write.cc:283 (db_stress+0x688370) https://github.com/facebook/rocksdb/issues/13 rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*) internal_repo_rocksdb/repo/db/db_impl/db_impl_write.cc:126 (db_stress+0x69a7b5) https://github.com/facebook/rocksdb/issues/14 rocksdb::DB::Put(rocksdb::WriteOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Slice const&) internal_repo_rocksdb/repo/db/db_impl/db_impl_write.cc:2247 (db_stress+0x698634) https://github.com/facebook/rocksdb/issues/15 rocksdb::DBImpl::Put(rocksdb::WriteOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::Slice const&) internal_repo_rocksdb/repo/db/db_impl/db_impl_write.cc:37 (db_stress+0x699868) https://github.com/facebook/rocksdb/issues/16 rocksdb::NonBatchedOpsStressTest::TestPut(rocksdb::ThreadState*, rocksdb::WriteOptions&, rocksdb::ReadOptions const&, std::vector<int, std::allocator<int> > const&, std::vector<long, std::allocator<long> > const&, char (&) [100], std::unique_ptr<rocksdb::MutexLock, std::default_delete<rocksdb::MutexLock> >&) internal_repo_rocksdb/repo/db_stress_tool/no_batched_ops_stress.cc:681 (db_stress+0x38d20c) https://github.com/facebook/rocksdb/issues/17 rocksdb::StressTest::OperateDb(rocksdb::ThreadState*) internal_repo_rocksdb/repo/db_stress_tool/db_stress_test_base.cc:897 (db_stress+0x3399ec) https://github.com/facebook/rocksdb/issues/18 rocksdb::ThreadBody(void*) internal_repo_rocksdb/repo/db_stress_tool/db_stress_driver.cc:33 (db_stress+0x354857) https://github.com/facebook/rocksdb/issues/19 rocksdb::(anonymous namespace)::StartThreadWrapper(void*) internal_repo_rocksdb/repo/env/env_posix.cc:447 (db_stress+0x6eb2ad) Location is heap block of size 352 at 0x7b480003d800 allocated by thread T23: #0 operator new(unsigned long) <null> (db_stress+0xb685dc) https://github.com/facebook/rocksdb/issues/1 rocksdb::FaultInjectionTestFS::NewWritableFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::FileOptions const&, std::unique_ptr<rocksdb::FSWritableFile, std::default_delete<rocksdb::FSWritableFile> >*, rocksdb::IODebugContext*) internal_repo_rocksdb/repo/utilities/fault_injection_fs.cc:506 (db_stress+0x711192) https://github.com/facebook/rocksdb/issues/2 rocksdb::CompositeEnv::NewWritableFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unique_ptr<rocksdb::WritableFile, std::default_delete<rocksdb::WritableFile> >*, rocksdb::EnvOptions const&) internal_repo_rocksdb/repo/env/composite_env.cc:329 (db_stress+0x4d33fa) https://github.com/facebook/rocksdb/issues/3 rocksdb::EnvWrapper::NewWritableFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unique_ptr<rocksdb::WritableFile, std::default_delete<rocksdb::WritableFile> >*, rocksdb::EnvOptions const&) internal_repo_rocksdb/repo/include/rocksdb/env.h:1425 (db_stress+0x300662) ... ``` **Summary:** - Added the missing lock in functions mentioned above along with three other functions with a similar need in TestFSWritableFile - Added clarification comment Pull Request resolved: https://github.com/facebook/rocksdb/pull/10544 Test Plan: - Past the above race condition repro Reviewed By: ajkr Differential Revision: D38886634 Pulled By: hx235 fbshipit-source-id: 0571bae9615f35b16fbd8168204607e306b1b486 |
|
Bo Wang | 13cb7a84b6 |
Fix the memory leak in db_stress tests that are caused by `FaultInjectionSecondaryCache` and add `CompressedSecondaryCache` into stress tests. (#10523)
Summary: 1. Fix the memory leak in db_stress tests that are caused by `FaultInjectionSecondaryCache`. To address the test requirements for both CompressedSecondaryCache and CachlibWrapper, a new class variable `base_is_compressed_sec_cache_` is added to determine the different behaviors in `Lookup()` and `WaitAll()`. 2. Add `CompressedSecondaryCache` into stress tests. Before this PR, memory leak is reported during crash tests if `CompressedSecondaryCache` is in stress tests. One example is shown as follows: ``` ==70722==ERROR: LeakSanitizer: detected memory leaks Direct leak of 6648240 byte(s) in 83103 object(s) allocated from: #0 0x13de9d7 in operator new(unsigned long) (/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/fbcode/buck-out/dbgo/gen/aab7ed39/internal_repo_rocksdb/repo/db_stress+0x13de9d7) https://github.com/facebook/rocksdb/issues/1 0x9084c7 in rocksdb::BlocklikeTraits<rocksdb::Block>::Create(rocksdb::BlockContents&&, unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*) internal_repo_rocksdb/repo/table/block_based/block_like_traits.h:128 https://github.com/facebook/rocksdb/issues/2 0x9084c7 in std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)::operator()(void const*, unsigned long, void**, unsigned long*) const internal_repo_rocksdb/repo/table/block_based/block_like_traits.h:34 https://github.com/facebook/rocksdb/issues/3 0x9082c9 in rocksdb::Block std::__invoke_impl<rocksdb::Status, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)&, void const*, unsigned long, void**, unsigned long*>(std::__invoke_other, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)&, void const*&&, unsigned long&&, void**&&, unsigned long*&&) third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:61 https://github.com/facebook/rocksdb/issues/4 0x90825d in std::enable_if<is_invocable_r_v<rocksdb::Block, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)&, void const*, unsigned long, void**, unsigned long*>, rocksdb::Block>::type std::__invoke_r<rocksdb::Status, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)&, void const*, unsigned long, void**, unsigned long*>(std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)&, void const*&&, unsigned long&&, void**&&, unsigned long*&&) third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:114 https://github.com/facebook/rocksdb/issues/5 0x9081b0 in std::_Function_handler<rocksdb::Status (void const*, unsigned long, void**, unsigned long*), std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)>::_M_invoke(std::_Any_data const&, void const*&&, unsigned long&&, void**&&, unsigned long*&&) third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_function.h:291 https://github.com/facebook/rocksdb/issues/6 0x991f2c in std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)>::operator()(void const*, unsigned long, void**, unsigned long*) const third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_function.h:560 https://github.com/facebook/rocksdb/issues/7 0x990277 in rocksdb::CompressedSecondaryCache::Lookup(rocksdb::Slice const&, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> const&, bool, bool&) internal_repo_rocksdb/repo/cache/compressed_secondary_cache.cc:77 https://github.com/facebook/rocksdb/issues/8 0xd3aa4d in rocksdb::FaultInjectionSecondaryCache::Lookup(rocksdb::Slice const&, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> const&, bool, bool&) internal_repo_rocksdb/repo/utilities/fault_injection_secondary_cache.cc:92 https://github.com/facebook/rocksdb/issues/9 0xeadaab in rocksdb::lru_cache::LRUCacheShard::Lookup(rocksdb::Slice const&, unsigned int, rocksdb::Cache::CacheItemHelper const*, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> const&, rocksdb::Cache::Priority, bool, rocksdb::Statistics*) internal_repo_rocksdb/repo/cache/lru_cache.cc:445 https://github.com/facebook/rocksdb/issues/10 0x1064573 in rocksdb::ShardedCache::Lookup(rocksdb::Slice const&, rocksdb::Cache::CacheItemHelper const*, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> const&, rocksdb::Cache::Priority, bool, rocksdb::Statistics*) internal_repo_rocksdb/repo/cache/sharded_cache.cc:89 https://github.com/facebook/rocksdb/issues/11 0x8be0df in rocksdb::BlockBasedTable::GetEntryFromCache(rocksdb::CacheTier const&, rocksdb::Cache*, rocksdb::Slice const&, rocksdb::BlockType, bool, rocksdb::GetContext*, rocksdb::Cache::CacheItemHelper const*, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> const&, rocksdb::Cache::Priority) const internal_repo_rocksdb/repo/table/block_based/block_based_table_reader.cc:389 https://github.com/facebook/rocksdb/issues/12 0x905790 in rocksdb::Status rocksdb::BlockBasedTable::GetDataBlockFromCache<rocksdb::Block>(rocksdb::Slice const&, rocksdb::Cache*, rocksdb::Cache*, rocksdb::ReadOptions const&, rocksdb::CachableEntry<rocksdb::Block>*, rocksdb::UncompressionDict const&, rocksdb::BlockType, bool, rocksdb::GetContext*) const internal_repo_rocksdb/repo/table/block_based/block_based_table_reader.cc:1263 https://github.com/facebook/rocksdb/issues/13 0x8b9259 in rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::Block>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, bool, bool, rocksdb::CachableEntry<rocksdb::Block>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::BlockContents*, bool) const internal_repo_rocksdb/repo/table/block_based/block_based_table_reader.cc:1559 https://github.com/facebook/rocksdb/issues/14 0x8b710c in rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::Block>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::Block>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, bool, bool, bool, bool) const internal_repo_rocksdb/repo/table/block_based/block_based_table_reader.cc:1726 https://github.com/facebook/rocksdb/issues/15 0x8c329f in rocksdb::DataBlockIter* rocksdb::BlockBasedTable::NewDataBlockIterator<rocksdb::DataBlockIter>(rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::DataBlockIter*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::FilePrefetchBuffer*, bool, bool, rocksdb::Status&) const internal_repo_rocksdb/repo/table/block_based/block_based_table_reader_impl.h:58 https://github.com/facebook/rocksdb/issues/16 0x920117 in rocksdb::BlockBasedTableIterator::InitDataBlock() internal_repo_rocksdb/repo/table/block_based/block_based_table_iterator.cc:262 https://github.com/facebook/rocksdb/issues/17 0x920d42 in rocksdb::BlockBasedTableIterator::MaterializeCurrentBlock() internal_repo_rocksdb/repo/table/block_based/block_based_table_iterator.cc:332 https://github.com/facebook/rocksdb/issues/18 0xc6a201 in rocksdb::IteratorWrapperBase<rocksdb::Slice>::PrepareValue() internal_repo_rocksdb/repo/table/iterator_wrapper.h:78 https://github.com/facebook/rocksdb/issues/19 0xc6a201 in rocksdb::IteratorWrapperBase<rocksdb::Slice>::PrepareValue() internal_repo_rocksdb/repo/table/iterator_wrapper.h:78 https://github.com/facebook/rocksdb/issues/20 0xef9f6c in rocksdb::MergingIterator::PrepareValue() internal_repo_rocksdb/repo/table/merging_iterator.cc:260 https://github.com/facebook/rocksdb/issues/21 0xc6a201 in rocksdb::IteratorWrapperBase<rocksdb::Slice>::PrepareValue() internal_repo_rocksdb/repo/table/iterator_wrapper.h:78 https://github.com/facebook/rocksdb/issues/22 0xc67bcd in rocksdb::DBIter::FindNextUserEntryInternal(bool, rocksdb::Slice const*) internal_repo_rocksdb/repo/db/db_iter.cc:326 https://github.com/facebook/rocksdb/issues/23 0xc66d36 in rocksdb::DBIter::FindNextUserEntry(bool, rocksdb::Slice const*) internal_repo_rocksdb/repo/db/db_iter.cc:234 https://github.com/facebook/rocksdb/issues/24 0xc7ab47 in rocksdb::DBIter::Next() internal_repo_rocksdb/repo/db/db_iter.cc:161 https://github.com/facebook/rocksdb/issues/25 0x70d938 in rocksdb::BatchedOpsStressTest::TestPrefixScan(rocksdb::ThreadState*, rocksdb::ReadOptions const&, std::vector<int, std::allocator<int> > const&, std::vector<long, std::allocator<long> > const&) internal_repo_rocksdb/repo/db_stress_tool/batched_ops_stress.cc:320 https://github.com/facebook/rocksdb/issues/26 0x6dc6a8 in rocksdb::StressTest::OperateDb(rocksdb::ThreadState*) internal_repo_rocksdb/repo/db_stress_tool/db_stress_test_base.cc:907 https://github.com/facebook/rocksdb/issues/27 0x6867de in rocksdb::ThreadBody(void*) internal_repo_rocksdb/repo/db_stress_tool/db_stress_driver.cc:33 https://github.com/facebook/rocksdb/issues/28 0xce4cc2 in rocksdb::(anonymous namespace)::StartThreadWrapper(void*) internal_repo_rocksdb/repo/env/env_posix.cc:461 https://github.com/facebook/rocksdb/issues/29 0x7f23f9068c0e in start_thread /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/nptl/pthread_create.c:434:8 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10523 Test Plan: ``` $COMPILE_WITH_ASAN=1 make -j 24 $db_stress J=40 crash_test_with_txn ``` Reviewed By: anand1976 Differential Revision: D38646839 Pulled By: gitbw95 fbshipit-source-id: 9452895c7dc95481a9d7afe83b15193cf5b1c43e |
|
sdong | bc575c614c |
Fix two extra headers (#10525)
Summary: Fix copyright for two more extra headers to make internal tool happy. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10525 Reviewed By: jay-zhuang Differential Revision: D38661390 fbshipit-source-id: ab2d055bfd145dfe82b5bae7a6c25cc338c8de94 |
|
Peter Dillinger | 86a1e3e0e7 |
Derive cache keys from SST unique IDs (#10394)
Summary: ... so that cache keys can be derived from DB manifest data before reading the file from storage--so that every part of the file can potentially go in a persistent cache. See updated comments in cache_key.cc for technical details. Importantly, the new cache key encoding uses some fancy but efficient math to pack data into the cache key without depending on the sizes of the various pieces. This simplifies some existing code creating cache keys, like cache warming before the file size is known. This should provide us an essentially permanent mapping between SST unique IDs and base cache keys, with the ability to "upgrade" SST unique IDs (and thus cache keys) with new SST format_versions. These cache keys are of similar, perhaps indistinguishable quality to the previous generation. Before this change (see "corrected" days between collision): ``` ./cache_bench -stress_cache_key -sck_keep_bits=43 18 collisions after 2 x 90 days, est 10 days between (1.15292e+19 corrected) ``` After this change (keep 43 bits, up through 50, to validate "trajectory" is ok on "corrected" days between collision): ``` 19 collisions after 3 x 90 days, est 14.2105 days between (1.63836e+19 corrected) 16 collisions after 5 x 90 days, est 28.125 days between (1.6213e+19 corrected) 15 collisions after 7 x 90 days, est 42 days between (1.21057e+19 corrected) 15 collisions after 17 x 90 days, est 102 days between (1.46997e+19 corrected) 15 collisions after 49 x 90 days, est 294 days between (2.11849e+19 corrected) 15 collisions after 62 x 90 days, est 372 days between (1.34027e+19 corrected) 15 collisions after 53 x 90 days, est 318 days between (5.72858e+18 corrected) 15 collisions after 309 x 90 days, est 1854 days between (1.66994e+19 corrected) ``` However, the change does modify (probably weaken) the "guaranteed unique" promise from this > SST files generated in a single process are guaranteed to have unique cache keys, unless/until number session ids * max file number = 2**86 to this (see https://github.com/facebook/rocksdb/issues/10388) > With the DB id limitation, we only have nice guaranteed unique cache keys for files generated in a single process until biggest session_id_counter and offset_in_file reach combined 64 bits I don't think this is a practical concern, though. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10394 Test Plan: unit tests updated, see simulation results above Reviewed By: jay-zhuang Differential Revision: D38667529 Pulled By: pdillinger fbshipit-source-id: 49af3fe7f47e5b61162809a78b76c769fd519fba |
|
sdong | 9277569ba3 |
Add some missing headers (#10519)
Summary: Some files miss headers. Also some headers are irregular. Fix them to make an internal checkup tool happy. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10519 Reviewed By: jay-zhuang Differential Revision: D38603291 fbshipit-source-id: 13b1bbd6d48f5ee15ba20da67544396de48238f1 |
|
Jay Zhuang | 5d3aefb682 |
Migrate to docker for CI run (#10496)
Summary: Moved linux builds to using docker to avoid CI instability caused by dependency installation site down. Added the `Dockerfile` which is used to build the image. The build time is also significantly reduced, because no dependencies installation and with using 2xlarge+ instance for slow build (like tsan test). Also fixed a few issues detected while building this: * `DestoryDB()` Status not checked for a few tests * nullptr might be used in `inlineskiplist.cc` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10496 Test Plan: CI Reviewed By: ajkr Differential Revision: D38554200 Pulled By: jay-zhuang fbshipit-source-id: 16e8fb2bf07b9c84bb27fb18421c4d54f2f248fd |
|
Jay Zhuang | 3f763763aa |
Change `bottommost_temperture` to `last_level_temperture` (#10471)
Summary: Change tiered compaction feature from `bottommost_temperture` to `last_level_temperture`. The old option is kept for migration purpose only, which is behaving the same as `last_level_temperture` and it will be removed in the next release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10471 Test Plan: CI Reviewed By: siying Differential Revision: D38450621 Pulled By: jay-zhuang fbshipit-source-id: cc1cdf8bad409376fec0152abc0a64fb72a91527 |
|
Peter Dillinger | 27f3af5966 |
Fix serious FSDirectory use-after-Close bug (missing fsync) (#10460)
Summary: TL;DR: due to a recent change, if you drop a column family, often that DB will no longer fsync after writing new SST files to remaining or new column families, which could lead to data loss on power loss. More bug detail: The intent of https://github.com/facebook/rocksdb/issues/10049 was to Close FSDirectory objects at DB::Close time rather than waiting for DB object destruction. Unfortunately, it also closes shared FSDirectory objects on DropColumnFamily (& destroy remaining handles), which can lead to use-after-Close on FSDirectory shared with remaining column families. Those "uses" are only Fsyncs (or redundant Closes). In the default Posix filesystem, an Fsync on a closed FSDirectory is a quiet no-op. Consequently (under most configurations), if you drop a column family, that DB will no longer fsync after writing new SST files to column families sharing the same directory (true under most configurations). More fix detail: Basically, this removes unnecessary Close ops on destroying ColumnFamilyData. We let `shared_ptr` take care of calling the destructor at the right time. If the intent was to require Close be called before destroying FSDirectory, that was not made clear by the author of FileSystem and was not at all enforced by https://github.com/facebook/rocksdb/issues/10049, which could have added `assert(fd_ == -1)` to `~PosixDirectory()` but did not. To keep this fix simple, we relax the unit test for https://github.com/facebook/rocksdb/issues/10049 to allow timely destruction of FSDirectory to suffice as Close (in CountedFileSystem). Added a TODO to revisit that. Also in this PR: * Added a TODO to share FSDirectory instances between DB and its column families. (Already shared among column families.) * Made DB::Close attempt to close all its open FSDirectory objects even if there is a failure in closing one. Also code clean-up around this logic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10460 Test Plan: add an assert to check for use-after-Close. With that existing tests can detect the misuse. With fix, tests pass (except noted relaxing of unit test for https://github.com/facebook/rocksdb/issues/10049) Reviewed By: ajkr Differential Revision: D38357922 Pulled By: pdillinger fbshipit-source-id: d42079cadbedf0a969f03389bf586b3b4e1f9137 |
|
Jay Zhuang | fcccc412d7 |
Remove Travis CI (#10407)
Summary: Travis CI is depreciated and haven't been maintained for some time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10407 Reviewed By: ajkr Differential Revision: D38078382 Pulled By: jay-zhuang fbshipit-source-id: f42057f2f41f722bdce56bf195f67a94835191fb |
|
DaPorkchop_ | 6bebe65030 |
Correctly implement Create-/DropColumnFamilies for PessimisticTransactionDB (#10332)
Summary: This overrides `CreateColumnFamilies` and `DropColumnFamilies` in `PessimisticTransactionDB` in order to add/remove the created column families to/from the lock manager. Fixes https://github.com/facebook/rocksdb/issues/10322. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10332 Reviewed By: ajkr Differential Revision: D37841079 Pulled By: riversand963 fbshipit-source-id: 854d7d9948b0089e0054a8f2875485ba44436fd2 |
|
Wallace | 1e9bf25f61 |
Do not hold mutex when write keys if not necessary (#7516)
Summary: ## Problem Summary RocksDB will acquire the global mutex of db instance for every time when user calls `Write`. When RocksDB schedules a lot of compaction jobs, it will compete the mutex with write thread and it will hurt the write performance. ## Problem Solution: I want to use log_write_mutex to replace the global mutex in most case so that we do not acquire it in write-thread unless there is a write-stall event or a write-buffer-full event occur. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7516 Test Plan: 1. make check 2. CI 3. COMPILE_WITH_TSAN=1 make db_stress make crash_test make crash_test_with_multiops_wp_txn make crash_test_with_multiops_wc_txn make crash_test_with_atomic_flush Reviewed By: siying Differential Revision: D36908702 Pulled By: riversand963 fbshipit-source-id: 59b13881f4f5c0a58fd3ca79128a396d9cd98efe |
|
Andrew Kryczka | 25cc564ff7 |
Make RateLimiter not Customizable (#10378)
Summary: (PR created for informational/testing purposes only.) - Fixes lost dynamic updates to GenericRateLimiter bandwidth using `SetBytesPerSecond()` - Benefit over #10374 is eliminating race conditions with Configurable framework. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10378 Reviewed By: pdillinger Differential Revision: D37914865 fbshipit-source-id: d4f566d60ec9726d26932388c61671adf0ee0f30 |
|
Jay Zhuang | dcb6a3be4e |
Add helper function to get debug type name (#10243)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10243 Reviewed By: riversand963 Differential Revision: D37370236 Pulled By: jay-zhuang fbshipit-source-id: 6e7a6fadf45fdfb5afe97b3f6fe4acf1260d4a86 |
|
Peter Dillinger | e6c5e0ab9a |
Have Cache use Status::MemoryLimit (#10262)
Summary:
I noticed it would clean up some things to have Cache::Insert()
return our MemoryLimit Status instead of Incomplete for the case in
which the capacity limit is reached. I suspect this fixes some existing but
unknown bugs where this Incomplete could be confused with other uses
of Incomplete, especially no_io cases. This is the most suspicious case I
noticed, but was not able to reproduce a bug, in part because the existing
code is not covered by unit tests (FIXME added):
|
|
yite.gu | a9117a3490 |
BackupEngine: we can return immediately if GetFileSize failed (#10176)
Summary: In some case, GetFileSize would be failure in copy_file_cb. If failure, we can return immediately, the subsequent code is meaningless, and add a log info let user know that problem happen here. Singed-off-by: Yite Gu <ess_gyt@qq.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/10176 Reviewed By: cbi42 Differential Revision: D37510888 Pulled By: ajkr fbshipit-source-id: 044ad8c45852fd19b8cd564b11f65d40c39e296f |
|
Andrew Kryczka | ca81b80d83 |
Deflake RateLimiting/BackupEngineRateLimitingTestWithParam (#10271)
Summary: We saw flakes with the following failure: ``` [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/1 utilities/backup/backup_engine_test.cc:2667: Failure Expected: (restore_time) > (0.8 * rate_limited_restore_time), actual: 48269 vs 60470.4 terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException' what(): utilities/backup/backup_engine_test.cc:2667: Failure Expected: (restore_time) > (0.8 * rate_limited_restore_time), actual: 48269 vs 60470.4 Received signal 6 (Aborted) t/run-backup_engine_test-RateLimiting-BackupEngineRateLimitingTestWithParam.RateLimiting-1: line 4: 1032887 Aborted (core dumped) TEST_TMPDIR=$d ./backup_engine_test --gtest_filter=RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/1 ``` Investigation revealed we forgot to use the mock time `SystemClock` for restore rate limiting. Then the test used wall clock time, which made the execution of "GenericRateLimiter::Request:PostTimedWait" non-deterministic as wall clock time might have advanced enough that waiting was not needed. This PR changes restore rate limiting to use mock time, which guarantees we always execute "GenericRateLimiter::Request:PostTimedWait". Then the assertions that rely on times recorded inside that callback should be robust. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10271 Test Plan: Applied the following patch which guaranteed repro before the fix. Verified the test passes after this PR even with that patch applied. ``` diff --git a/util/rate_limiter.cc b/util/rate_limiter.cc index f369e3220..6b3ed82fa 100644 --- a/util/rate_limiter.cc +++ b/util/rate_limiter.cc @@ -158,6 +158,7 @@ void GenericRateLimiter::SetBytesPerSecond(int64_t bytes_per_second) { void GenericRateLimiter::Request(int64_t bytes, const Env::IOPriority pri, Statistics* stats) { + usleep(100000); assert(bytes <= refill_bytes_per_period_.load(std::memory_order_relaxed)); bytes = std::max(static_cast<int64_t>(0), bytes); TEST_SYNC_POINT("GenericRateLimiter::Request"); ``` Reviewed By: hx235 Differential Revision: D37499848 Pulled By: ajkr fbshipit-source-id: fd790d5a192996be8ba13b656751ccc7d8cb8f6e |
|
Yanqin Jin | 9586dcf1ce |
Expose the initial logger creation error (#10223)
Summary: https://github.com/facebook/rocksdb/issues/9984 changes the behavior of RocksDB: if logger creation failed during `SanitizeOptions()`, `DB::Open()` will fail. However, since `SanitizeOptions()` is called in `DBImpl::DBImpl()`, we cannot directly expose the error to caller without some additional work. This is a first version proposal which: - Adds a new member `init_logger_creation_s` to `DBImpl` to store the result of init logger creation - Checks the error during `DB::Open()` and return it to caller if non-ok This is not very ideal. We can alternatively move the logger creation logic out of the `SanitizeOptions()`. Since `SanitizeOptions()` is used in other places, we need to check whether this change breaks anything in case other callers of `SanitizeOptions()` assumes that a logger should be created. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10223 Test Plan: make check Reviewed By: pdillinger Differential Revision: D37321717 Pulled By: riversand963 fbshipit-source-id: 58042358a86369d606549dd9938933dd47591c4b |
|
Andrew Kryczka | d5d8920f2c |
Fix race condition with WAL tracking and `FlushWAL(true /* sync */)` (#10185)
Summary: `FlushWAL(true /* sync */)` is used internally and for manual WAL sync. It had a bug when used together with `track_and_verify_wals_in_manifest` where the synced size tracked in MANIFEST was larger than the number of bytes actually synced. The bug could be repro'd almost immediately with the following crash test command: `python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --max_bytes_for_level_base=2097152 --target_file_size_base=524288 --duration=3600 --interval=10 --sync_fault_injection=1 --disable_wal=0 --checkpoint_one_in=1000 --max_key=10000 --value_size_mult=33`. An example error message produced by the above command is shown below. The error sometimes arose from the checkpoint and other times arose from the main stress test DB. ``` Corruption: Size mismatch: WAL (log number: 119) in MANIFEST is 27938 bytes , but actually is 27859 bytes on disk. ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10185 Test Plan: - repro unit test - the above crash test command no longer finds the error. It does find a different error after a while longer such as "Corruption: WAL file 481 required by manifest but not in directory list" Reviewed By: riversand963 Differential Revision: D37200993 Pulled By: ajkr fbshipit-source-id: 98e0071c1a89f4d009888512ed89f9219779ae5f |
|
Hui Xiao | a5d773e077 |
Add rate-limiting support to batched MultiGet() (#10159)
Summary: **Context/Summary:** https://github.com/facebook/rocksdb/pull/9424 added rate-limiting support for user reads, which does not include batched `MultiGet()`s that call `RandomAccessFileReader::MultiRead()`. The reason is that it's harder (compared with RandomAccessFileReader::Read()) to implement the ideal rate-limiting where we first call `RateLimiter::RequestToken()` for allowed bytes to multi-read and then consume those bytes by satisfying as many requests in `MultiRead()` as possible. For example, it can be tricky to decide whether we want partially fulfilled requests within one `MultiRead()` or not. However, due to a recent urgent user request, we decide to pursue an elementary (but a conditionally ineffective) solution where we accumulate enough rate limiter requests toward the total bytes needed by one `MultiRead()` before doing that `MultiRead()`. This is not ideal when the total bytes are huge as we will actually consume a huge bandwidth from rate-limiter causing a burst on disk. This is not what we ultimately want with rate limiter. Therefore a follow-up work is noted through TODO comments. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10159 Test Plan: - Modified existing unit test `DBRateLimiterOnReadTest/DBRateLimiterOnReadTest.NewMultiGet` - Traced the underlying system calls `io_uring_enter` and verified they are 10 seconds apart from each other correctly under the setting of `strace -ftt -e trace=io_uring_enter ./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb2 -readonly -num=50 -threads=1 -multiread_batched=1 -batch_size=100 -duration=10 -rate_limiter_bytes_per_sec=200 -rate_limiter_refill_period_us=1000000 -rate_limit_bg_reads=1 -disable_auto_compactions=1 -rate_limit_user_ops=1` where each `MultiRead()` read about 2000 bytes (inspected by debugger) and the rate limiter grants 200 bytes per seconds. - Stress test: - Verified `./db_stress (-test_cf_consistency=1/test_batches_snapshots=1) -use_multiget=1 -cache_size=1048576 -rate_limiter_bytes_per_sec=10241024 -rate_limit_bg_reads=1 -rate_limit_user_ops=1` work Reviewed By: ajkr, anand1976 Differential Revision: D37135172 Pulled By: hx235 fbshipit-source-id: 73b8e8f14761e5d4b77235dfe5d41f4eea968bcd |
|
Andrew Kryczka | 5d6005c780 |
Add WriteOptions::protection_bytes_per_key (#10037)
Summary: Added an option, `WriteOptions::protection_bytes_per_key`, that controls how many bytes per key we use for integrity protection in `WriteBatch`. It takes effect when `WriteBatch::GetProtectionBytesPerKey() == 0`. Currently the only supported value is eight. Invoking a user API with it set to any other nonzero value will result in `Status::NotSupported` returned to the user. There is also a bug fix for integrity protection with `inplace_callback`, where we forgot to take into account the possible change in varint length when calculating KV checksum for the final encoded buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10037 Test Plan: - Manual - Set default value of `WriteOptions::protection_bytes_per_key` to eight and ran `make check -j24` - Enabled in MyShadow for 1+ week - Automated - Unit tests have a `WriteMode` that enables the integrity protection via `WriteOptions` - Crash test - in most cases, use `WriteOptions::protection_bytes_per_key` to enable integrity protection Reviewed By: cbi42 Differential Revision: D36614569 Pulled By: ajkr fbshipit-source-id: 8650087ceac9b61b560f1e5fafe5e1baf9c725fb |
|
Peter Dillinger | 126c223714 |
Remove deprecated block-based filter (#10184)
Summary: In https://github.com/facebook/rocksdb/issues/9535, release 7.0, we hid the old block-based filter from being created using the public API, because of its inefficiency. Although we normally maintain read compatibility on old DBs forever, filters are not required for reading a DB, only for optimizing read performance. Thus, it should be acceptable to remove this code and the substantial maintenance burden it carries as useful features are developed and validated (such as user timestamp). This change completely removes the code for reading and writing the old block-based filters, net removing about 1370 lines of code no longer needed. Options removed from testing / benchmarking tools. The prior existence is only evident in a couple of places: * `CacheEntryRole::kDeprecatedFilterBlock` - We can update this public API enum in a major release to minimize source code incompatibilities. * A warning is logged when an old table file is opened that used the old block-based filter. This is provided as a courtesy, and would be a pain to unit test, so manual testing should suffice. Unfortunately, sst_dump does not tell you whether a file uses block-based filter, and the structure of the code makes it very difficult to fix. * To detect that case, `kObsoleteFilterBlockPrefix` (renamed from `kFilterBlockPrefix`) for metaindex is maintained (for now). Other notes: * In some cases where numbers are associated with filter configurations, we have had to update the assigned numbers so that they all correspond to something that exists. * Fixed potential stat counting bug by assuming `filter_checked = false` for cases like `filter == nullptr` rather than assuming `filter_checked = true` * Removed obsolete `block_offset` and `prefix_extractor` parameters from several functions. * Removed some unnecessary checks `if (!table_prefix_extractor() && !prefix_extractor)` because the caller guarantees the prefix extractor exists and is compatible Pull Request resolved: https://github.com/facebook/rocksdb/pull/10184 Test Plan: tests updated, manually test new warning in LOG using base version to generate a DB Reviewed By: riversand963 Differential Revision: D37212647 Pulled By: pdillinger fbshipit-source-id: 06ee020d8de3b81260ffc36ad0c1202cbf463a80 |
|
Yanqin Jin | 1777e5f7e9 |
Snapshots with user-specified timestamps (#9879)
Summary: In RocksDB, keys are associated with (internal) sequence numbers which denote when the keys are written to the database. Sequence numbers in different RocksDB instances are unrelated, thus not comparable. It is nice if we can associate sequence numbers with their corresponding actual timestamps. One thing we can do is to support user-defined timestamp, which allows the applications to specify the format of custom timestamps and encode a timestamp with each key. More details can be found at https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29. This PR provides a different but complementary approach. We can associate rocksdb snapshots (defined in https://github.com/facebook/rocksdb/blob/7.2.fb/include/rocksdb/snapshot.h#L20) with **user-specified** timestamps. Since a snapshot is essentially an object representing a sequence number, this PR establishes a bi-directional mapping between sequence numbers and timestamps. In the past, snapshots are usually taken by readers. The current super-version is grabbed, and a `rocksdb::Snapshot` object is created with the last published sequence number of the super-version. You can see that the reader actually has no good idea of what timestamp to assign to this snapshot, because by the time the `GetSnapshot()` is called, an arbitrarily long period of time may have already elapsed since the last write, which is when the last published sequence number is written. This observation motivates the creation of "timestamped" snapshots on the write path. Currently, this functionality is exposed only to the layer of `TransactionDB`. Application can tell RocksDB to create a snapshot when a transaction commits, effectively associating the last sequence number with a timestamp. It is also assumed that application will ensure any two snapshots with timestamps should satisfy the following: ``` snapshot1.seq < snapshot2.seq iff. snapshot1.ts < snapshot2.ts ``` If the application can guarantee that when a reader takes a timestamped snapshot, there is no active writes going on in the database, then we also allow the user to use a new API `TransactionDB::CreateTimestampedSnapshot()` to create a snapshot with associated timestamp. Code example ```cpp // Create a timestamped snapshot when committing transaction. txn->SetCommitTimestamp(100); txn->SetSnapshotOnNextOperation(); txn->Commit(); // A wrapper API for convenience Status Transaction::CommitAndTryCreateSnapshot( std::shared_ptr<TransactionNotifier> notifier, TxnTimestamp ts, std::shared_ptr<const Snapshot>* ret); // Create a timestamped snapshot if caller guarantees no concurrent writes std::pair<Status, std::shared_ptr<const Snapshot>> snapshot = txn_db->CreateTimestampedSnapshot(100); ``` The snapshots created in this way will be managed by RocksDB with ref-counting and potentially shared with other readers. We provide the following APIs for readers to retrieve a snapshot given a timestamp. ```cpp // Return the timestamped snapshot correponding to given timestamp. If ts is // kMaxTxnTimestamp, then we return the latest timestamped snapshot if present. // Othersise, we return the snapshot whose timestamp is equal to `ts`. If no // such snapshot exists, then we return null. std::shared_ptr<const Snapshot> TransactionDB::GetTimestampedSnapshot(TxnTimestamp ts) const; // Return the latest timestamped snapshot if present. std::shared_ptr<const Snapshot> TransactionDB::GetLatestTimestampedSnapshot() const; ``` We also provide two additional APIs for stats collection and reporting purposes. ```cpp Status TransactionDB::GetAllTimestampedSnapshots( std::vector<std::shared_ptr<const Snapshot>>& snapshots) const; // Return timestamped snapshots whose timestamps fall in [ts_lb, ts_ub) and store them in `snapshots`. Status TransactionDB::GetTimestampedSnapshots( TxnTimestamp ts_lb, TxnTimestamp ts_ub, std::vector<std::shared_ptr<const Snapshot>>& snapshots) const; ``` To prevent the number of timestamped snapshots from growing infinitely, we provide the following API to release timestamped snapshots whose timestamps are older than or equal to a given threshold. ```cpp void TransactionDB::ReleaseTimestampedSnapshotsOlderThan(TxnTimestamp ts); ``` Before shutdown, RocksDB will release all timestamped snapshots. Comparison with user-defined timestamp and how they can be combined: User-defined timestamp persists every key with a timestamp, while timestamped snapshots maintain a volatile mapping between snapshots (sequence numbers) and timestamps. Different internal keys with the same user key but different timestamps will be treated as different by compaction, thus a newer version will not hide older versions (with smaller timestamps) unless they are eligible for garbage collection. In contrast, taking a timestamped snapshot at a certain sequence number and timestamp prevents all the keys visible in this snapshot from been dropped by compaction. Here, visible means (seq < snapshot and most recent). The timestamped snapshot supports the semantics of reading at an exact point in time. Timestamped snapshots can also be used with user-defined timestamp. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9879 Test Plan: ``` make check TEST_TMPDIR=/dev/shm make crash_test_with_txn ``` Reviewed By: siying Differential Revision: D35783919 Pulled By: riversand963 fbshipit-source-id: 586ad905e169189e19d3bfc0cb0177a7239d1bd4 |
|
Yu Zhang | a101c9de60 |
Return "invalid argument" when read timestamp is too old (#10109)
Summary: With this change, when a given read timestamp is smaller than the column-family's full_history_ts_low, Get(), MultiGet() and iterators APIs will return Status::InValidArgument(). Test plan ``` $COMPILE_WITH_ASAN=1 make -j24 all $./db_with_timestamp_basic_test --gtest_filter=DBBasicTestWithTimestamp.UpdateFullHistoryTsLow $ make -j24 check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10109 Reviewed By: riversand963 Differential Revision: D36901126 Pulled By: jowlyzhang fbshipit-source-id: 255feb1a66195351f06c1d0e42acb1ff74527f86 |
|
Peter Dillinger | 4f78f9699b |
Refactor: Add BlockTypes to make them imply C++ type in block cache (#10098)
Summary: We have three related concepts: * BlockType: an internal enum conceptually indicating a type of SST file block * CacheEntryRole: a user-facing enum for categorizing block cache entries, which is also involved in associated cache entries with an appropriate deleter. Can include categories for non-block cache entries (e.g. memory reservations). * TBlocklike: a C++ type for the actual type behind a void* cache entry. We had some existing code ugliness because BlockType did not imply TBlocklike, because of various kinds of "filter" block. This refactoring fixes that with new BlockTypes. More clean-up can come in later work. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10098 Test Plan: existing tests Reviewed By: akankshamahajan15 Differential Revision: D36897945 Pulled By: pdillinger fbshipit-source-id: 3ae496b5caa81e0a0ed85e873eb5b525e2d9a295 |
|
Yanqin Jin | 3e02c6e05a |
Point-lookup returns timestamps of Delete and SingleDelete (#10056)
Summary: If caller specifies a non-null `timestamp` argument in `DB::Get()` or a non-null `timestamps` in `DB::MultiGet()`, RocksDB will return the timestamps of the point tombstones. Note: DeleteRange is still unsupported. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10056 Test Plan: make check Reviewed By: ltamasi Differential Revision: D36677956 Pulled By: riversand963 fbshipit-source-id: 2d7af02cc7237b1829cd269086ea895a49d501ae |
|
Zichen Zhu | 65893ad959 |
Explicitly closing all directory file descriptors (#10049)
Summary: Currently, the DB directory file descriptor is left open until the deconstruction process (`DB::Close()` does not close the file descriptor). To verify this, comment out the lines between `db_ = nullptr` and `db_->Close()` (line 512, 513, 514, 515 in ldb_cmd.cc) to leak the ``db_'' object, build `ldb` tool and run ``` strace --trace=open,openat,close ./ldb --db=$TEST_TMPDIR --ignore_unknown_options put K1 V1 --create_if_missing ``` There is one directory file descriptor that is not closed in the strace log. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10049 Test Plan: Add a new unit test DBBasicTest.DBCloseAllDirectoryFDs: Open a database with different WAL directory and three different data directories, and all directory file descriptors should be closed after calling Close(). Explicitly call Close() after a directory file descriptor is not used so that the counter of directory open and close should be equivalent. Reviewed By: ajkr, hx235 Differential Revision: D36722135 Pulled By: littlepig2013 fbshipit-source-id: 07bdc2abc417c6b30997b9bbef1f79aa757b21ff |
|
Jay Zhuang | 0adac6f88e |
Deflake Transaction stress tests (#10063)
Summary: TSAN test is slower, for `TransactionStressTest` and `DeadlockStress`, they're reaching the timeout limit of 600 seconds. Decreasing the transaction test number. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10063 Test Plan: CI Reviewed By: ajkr Differential Revision: D36711727 Pulled By: jay-zhuang fbshipit-source-id: 600f82a6d32108f52fbe5572fcc7497607b7fe98 |
|
Yanqin Jin | 514f0b0937 |
Fail DB::Open() if logger cannot be created (#9984)
Summary: For regular db instance and secondary instance, we return error and refuse to open DB if Logger creation fails. Our current code allows it, but it is really difficult to debug because there will be no LOG files. The same for OPTIONS file, which will be explored in another PR. Furthermore, Arena::AllocateAligned(size_t bytes, size_t huge_page_size, Logger* logger) has an assertion as the following: ```cpp #ifdef MAP_HUGETLB if (huge_page_size > 0 && bytes > 0) { assert(logger != nullptr); } #endif ``` It can be removed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9984 Test Plan: make check Reviewed By: jay-zhuang Differential Revision: D36347754 Pulled By: riversand963 fbshipit-source-id: 529798c0511d2eaa2f0fd40cf7e61c4cbc6bc57e |
|
tagliavini | 6c50082654 |
Remove code that only compiles for Visual Studio versions older than 2015 (#10065)
Summary: There are currently some preprocessor checks that assume support for Visual Studio versions older than 2015 (i.e., 0 < _MSC_VER < 1900), although we don't support them any more. We removed all code that only compiles on those older versions, except third-party/ files. The ROCKSDB_NOEXCEPT symbol is now obsolete, since it now always gets replaced by noexcept. We removed it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10065 Reviewed By: pdillinger Differential Revision: D36721901 Pulled By: guidotag fbshipit-source-id: a2892d365ef53cce44a0a7d90dd6b72ee9b5e5f2 |
|
Levi Tamasi | af7ae912e2 |
Fix potential ambiguities in/around port/sys_time.h (#10045)
Summary: There are some time-related POSIX APIs that are not available on Windows (e.g. `localtime_r`), which we have worked around by providing our own implementations in `port/sys_time.h`. This workaround actually relies on some ambiguity: on Windows, a call to `localtime_r` calls `ROCKSDB_NAMESPACE::port::localtime_r` (which is pulled into `ROCKSDB_NAMESPACE` by a using-declaration), while on other platforms it calls the global `localtime_r`. This works fine as long as there is only one candidate function; however, it breaks down when there is more than one `localtime_r` visible in a scope. The patch fixes this by introducing `ROCKSDB_NAMESPACE::port::{TimeVal, GetTimeOfDay, LocalTimeR}` to eliminate any ambiguity. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10045 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D36639372 Pulled By: ltamasi fbshipit-source-id: fc13dbfa421b7c8918111a6d9e24ce77e91a7c50 |
|
Changyu Bi | 8515bd50c9 |
Support read rate-limiting in SequentialFileReader (#9973)
Summary: Added rate limiter and read rate-limiting support to SequentialFileReader. I've updated call sites to SequentialFileReader::Read with appropriate IO priority (or left a TODO and specified IO_TOTAL for now). The PR is separated into four commits: the first one added the rate-limiting support, but with some fixes in the unit test since the number of request bytes from rate limiter in SequentialFileReader are not accurate (there is overcharge at EOF). The second commit fixed this by allowing SequentialFileReader to check file size and determine how many bytes are left in the file to read. The third commit added benchmark related code. The fourth commit moved the logic of using file size to avoid overcharging the rate limiter into backup engine (the main user of SequentialFileReader). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9973 Test Plan: - `make check`, backup_engine_test covers usage of SequentialFileReader with rate limiter. - Run db_bench to check if rate limiting is throttling as expected: Verified that reads and writes are together throttled at 2MB/s, and at 0.2MB chunks that are 100ms apart. - Set up: `./db_bench --benchmarks=fillrandom -db=/dev/shm/test_rocksdb` - Benchmark: ``` strace -ttfe read,write ./db_bench --benchmarks=backup -db=/dev/shm/test_rocksdb --backup_rate_limit=2097152 --use_existing_db strace -ttfe read,write ./db_bench --benchmarks=restore -db=/dev/shm/test_rocksdb --restore_rate_limit=2097152 --use_existing_db ``` - db bench on backup and restore to ensure no performance regression. - backup (avg over 50 runs): pre-change: 1.90443e+06 micros/op; post-change: 1.8993e+06 micros/op (improve by 0.2%) - restore (avg over 50 runs): pre-change: 1.79105e+06 micros/op; post-change: 1.78192e+06 micros/op (improve by 0.5%) ``` # Set up ./db_bench --benchmarks=fillrandom -db=/tmp/test_rocksdb -num=10000000 # benchmark TEST_TMPDIR=/tmp/test_rocksdb NUM_RUN=50 for ((j=0;j<$NUM_RUN;j++)) do ./db_bench -db=$TEST_TMPDIR -num=10000000 -benchmarks=backup -use_existing_db | egrep 'backup' # Restore #./db_bench -db=$TEST_TMPDIR -num=10000000 -benchmarks=restore -use_existing_db done > rate_limit.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' rate_limit.txt >> rate_limit_2.txt ``` Reviewed By: hx235 Differential Revision: D36327418 Pulled By: cbi42 fbshipit-source-id: e75d4307cff815945482df5ba630c1e88d064691 |
|
XieJiSS | 8b1df101da |
fix: build on risc-v (#9215)
Summary:
Patch is modified from ~~https://reviews.llvm.org/file/data/du5ol5zctyqw53ma7dwz/PHID-FILE-knherxziu4tl4erti5ab/file~~
Tested on Arch Linux riscv64gc (qemu)
UPDATE: Seems like the above link is broken, so I tried to search for a link pointing to the original merge request. It turned out to me that the LLVM guys are cherry-picking from `google/benchmark`, and the upstream should be this:
|
|
mrambacher | b11ff347b4 |
Use STATIC_AVOID_DESTRUCTION for static objects with non-trivial destructors (#9958)
Summary: Changed the static objects that had non-trivial destructors to use the STATIC_AVOID_DESTRUCTION construct. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9958 Reviewed By: pdillinger Differential Revision: D36442982 Pulled By: mrambacher fbshipit-source-id: 029d47b1374d30d198bfede369a4c0ae7a4eb519 |
|
Hui Xiao | e66e6d2faa |
Use SpecialEnv to speed up some slow BackupEngineRateLimitingTestWithParam (#9974)
Summary: **Context:** `BackupEngineRateLimitingTestWithParam.RateLimiting` and `BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup` involve creating backup and restoring of a big database with rate-limiting. Using the normal env with a normal clock requires real elapse of time (13702 - 19848 ms/per test). As suggested in https://github.com/facebook/rocksdb/pull/8722#discussion_r703698603, this PR is to speed it up with SpecialEnv (`time_elapse_only_sleep=true`) where its clock accepts fake elapse of time during rate-limiting (100 - 600 ms/per test) **Summary:** - Added TEST_ function to set clock of the default rate limiters in backup engine - Shrunk testdb by 10 times while keeping it big enough for testing - Renamed some test variables and reorganized some if-else branch for clarity without changing the test Pull Request resolved: https://github.com/facebook/rocksdb/pull/9974 Test Plan: - Run tests pre/post PR the same time to verify the tests are sped up by 90 - 95% `BackupEngineRateLimitingTestWithParam.RateLimiting` Pre: ``` [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/0 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/0 (11123 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/1 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/1 (9441 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/2 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/2 (11096 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/3 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/3 (9339 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/4 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/4 (11121 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/5 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/5 (9413 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/6 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/6 (11185 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/7 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/7 (9511 ms) [----------] 8 tests from RateLimiting/BackupEngineRateLimitingTestWithParam (82230 ms total) ``` Post: ``` [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/0 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/0 (395 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/1 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/1 (564 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/2 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/2 (358 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/3 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/3 (567 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/4 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/4 (173 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/5 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/5 (176 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/6 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/6 (191 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/7 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/7 (177 ms) [----------] 8 tests from RateLimiting/BackupEngineRateLimitingTestWithParam (2601 ms total) ``` `BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup` Pre: ``` [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/0 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/0 (7275 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/1 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/1 (3961 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/2 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/2 (7117 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/3 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/3 (3921 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/4 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/4 (19862 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/5 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/5 (10231 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/6 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/6 (19848 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/7 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/7 (10372 ms) [----------] 8 tests from RateLimiting/BackupEngineRateLimitingTestWithParam (82587 ms total) ``` Post: ``` [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/0 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/0 (157 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/1 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/1 (152 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/2 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/2 (160 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/3 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/3 (158 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/4 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/4 (155 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/5 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/5 (151 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/6 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/6 (146 ms) [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/7 [ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/7 (153 ms) [----------] 8 tests from RateLimiting/BackupEngineRateLimitingTestWithParam (1232 ms total) ``` Reviewed By: pdillinger Differential Revision: D36336345 Pulled By: hx235 fbshipit-source-id: 724c6ba745f95f56d4440a6d2f1e4512a2987589 |
|
mrambacher | 204a42ca97 |
Added GetFactoryCount/Names/Types to ObjectRegistry (#9358)
Summary: These methods allow for more thorough testing of the ObjectRegistry and Customizable infrastructure in a simpler manner. With this change, the Customizable tests can now check what factories are registered and attempt to create each of them in a systematic fashion. With this change, I think all of the factories registered with the ObjectRegistry/CreateFromString are now tested via the customizable_test classes. Note that there were a few other minor changes. There was a "posix://*" register with the ObjectRegistry which was missed during the PatternEntry conversion -- these changes found that. The nickname and default names for the FileSystem classes was also inverted. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9358 Reviewed By: pdillinger Differential Revision: D33433542 Pulled By: mrambacher fbshipit-source-id: 9a32da74e6620745b4eeffb2712be70eeeadfa7e |
|
sdong | 736a7b5433 |
Remove own ToString() (#9955)
Summary: ToString() is created as some platform doesn't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just remove ToString(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955 Test Plan: Watch CI tests Reviewed By: riversand963 Differential Revision: D36176799 fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471 |
|
Andrew Kryczka | a62506aee2 |
Enable unsynced data loss in crash test (#9947)
Summary: `db_stress` already tracks expected state history to verify prefix-recoverability when `sync_fault_injection` is enabled. This PR enables `sync_fault_injection` in `db_crashtest.py`. Previously enabling `sync_fault_injection` would cause whole unsynced files to be dropped. This PR adds a more interesting case of losing only the tail of unsynced data by implementing `TestFSWritableFile::RangeSync()` and enabling `{wal_,}bytes_per_sync`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9947 Test Plan: - regular blackbox, blackbox --simple - various commands to stress this new case, such as `TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --max_key=100000 --write_buffer_size=2097152 --avoid_flush_during_recovery=1 --disable_wal=0 --interval=10 --db_write_buffer_size=0 --sync_fault_injection=1 --wal_compression=none --delpercent=0 --delrangepercent=0 --prefixpercent=0 --iterpercent=0 --writepercent=100 --readpercent=0 --wal_bytes_per_sync=131072 --duration=36000 --sync=0 --open_write_fault_one_in=16` Reviewed By: riversand963 Differential Revision: D36152775 Pulled By: ajkr fbshipit-source-id: 44b68a7fad0a4cf74af9fe1f39be01baab8141d8 |
|
sdong | 49628c9a83 |
Use std::numeric_limits<> (#9954)
Summary: Right now we still don't fully use std::numeric_limits but use a macro, mainly for supporting VS 2013. Right now we only support VS 2017 and up so it is not a problem. The code comment claims that MinGW still needs it. We don't have a CI running MinGW so it's hard to validate. since we now require C++17, it's hard to imagine MinGW would still build RocksDB but doesn't support std::numeric_limits<>. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9954 Test Plan: See CI Runs. Reviewed By: riversand963 Differential Revision: D36173954 fbshipit-source-id: a35a73af17cdcae20e258cdef57fcf29a50b49e0 |
|
Yanqin Jin | 2b5df21e95 |
Remove ifdef for try_emplace after upgrading to c++17 (#9932)
Summary: Test plan make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/9932 Reviewed By: ajkr Differential Revision: D36085404 Pulled By: riversand963 fbshipit-source-id: 2ece14ca0e2e4c1288339ff79e7e126b76eaf786 |
|
Yanqin Jin | 2b5c29f9f3 |
Enforce the contract of SingleDelete (#9888)
Summary: Enforce the contract of SingleDelete so that they are not mixed with Delete for the same key. Otherwise, it will lead to undefined behavior. See https://github.com/facebook/rocksdb/wiki/Single-Delete#notes. Also fix unit tests and write-unprepared. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9888 Test Plan: make check Reviewed By: ajkr Differential Revision: D35837817 Pulled By: riversand963 fbshipit-source-id: acd06e4dcba8cb18df92b44ed18c57e10e5a7635 |
|
Anvesh Komuravelli | aafb377bb5 |
Update protection info on recovered logs data (#9875)
Summary: Update protection info on recovered logs data Pull Request resolved: https://github.com/facebook/rocksdb/pull/9875 Test Plan: - Benchmark setup: `TEST_TMPDIR=/dev/shm/100MB_WAL_DB/ ./db_bench -benchmarks=fillrandom -write_buffer_size=1048576000` - Benchmark command: `TEST_TMPDIR=/dev/shm/100MB_WAL_DB/ /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=overwrite -write_buffer_size=1048576000 -writes=1 -report_open_timing=true` - Results before this PR ``` OpenDb: 2350.14 milliseconds OpenDb: 2296.94 milliseconds OpenDb: 2184.29 milliseconds OpenDb: 2167.59 milliseconds OpenDb: 2231.24 milliseconds OpenDb: 2109.57 milliseconds OpenDb: 2197.71 milliseconds OpenDb: 2120.8 milliseconds OpenDb: 2148.12 milliseconds OpenDb: 2207.95 milliseconds ``` - Results after this PR ``` OpenDb: 2424.52 milliseconds OpenDb: 2359.84 milliseconds OpenDb: 2317.68 milliseconds OpenDb: 2339.4 milliseconds OpenDb: 2325.36 milliseconds OpenDb: 2321.06 milliseconds OpenDb: 2353.98 milliseconds OpenDb: 2344.64 milliseconds OpenDb: 2384.09 milliseconds OpenDb: 2428.58 milliseconds ``` Mean regressed 7.2% (2201.4 -> 2359.9) Reviewed By: ajkr Differential Revision: D36012787 Pulled By: akomurav fbshipit-source-id: d2aba09f29c6beb2fd0fe8e1e359be910b4ef02a |
|
Yanqin Jin | 94e245a14d |
Improve stress test for MultiOpsTxnsStressTest (#9829)
Summary: Adds more coverage to `MultiOpsTxnsStressTest` with a focus on write-prepared transactions. 1. Add a hack to manually evict commit cache entries. We currently cannot assign small values to `wp_commit_cache_bits` because it requires a prepared transaction to commit within a certain range of sequence numbers, otherwise it will throw. 2. Add coverage for commit-time-write-batch. If write policy is write-prepared, we need to set `use_only_the_last_commit_time_batch_for_recovery` to true. 3. After each flush/compaction, verify data consistency. This is possible since data size can be small: default numbers of primary/secondary keys are just 1000. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9829 Test Plan: ``` TEST_TMPDIR=/dev/shm/rocksdb_crashtest_blackbox/ make blackbox_crash_test_with_multiops_wp_txn ``` Reviewed By: pdillinger Differential Revision: D35806678 Pulled By: riversand963 fbshipit-source-id: d7fde7a29fda0fb481a61f553e0ca0c47da93616 |
|
Herman Lee | d9d456de49 |
Fix locktree accesses to PessimisticTransactions (#9898)
Summary: The current locktree implementation stores the address of the PessimisticTransactions object as the TXNID. However, when a transaction is blocked on a lock, it records the list of waitees with conflicting locks using the rocksdb assigned TransactionID. This is performed by calling GetID() on PessimisticTransactions objects of the waitees, and then recorded in the waiter's list. However, there is no guarantee the objects are valid when recording the waitee list during the conflict callbacks because the waitee could have released the lock and freed the PessimisticTransactions object. The waitee/txnid values are only valid PessimisticTransaction objects while the mutex for the root of the locktree is held. The simplest fix for this problem is to use the address of the PessimisticTransaction as the TransactionID so that it is consistent with its usage in the locktree. The TXNID is only converted back to a PessimisticTransaction for the report_wait callbacks. Since these callbacks are now all made within the critical section where the lock_request queue mutx is held, these conversions will be safe. Otherwise, only the uint64_t TXNID of the waitee is registerd with the waiter transaction. The PessimisitcTransaction object of the waitee is never referenced. The main downside of this approach is the TransactionID will not change if the PessimisticTransaction object is reused for new transactions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9898 Test Plan: Add a new test case and run unit tests. Also verified with MyRocks workloads using range locks that the crash no longer happens. Reviewed By: riversand963 Differential Revision: D35950376 Pulled By: hermanlee fbshipit-source-id: 8c9cae272e23e487fc139b6a8ed5b8f8f24b1570 |
|
Yanqin Jin | d13825e586 |
Add rollback_deletion_type_callback to TxnDBOptions (#9873)
Summary: This PR does not affect write-committed. Add a member, `rollback_deletion_type_callback` to TransactionDBOptions so that a write-prepared transaction, when rolling back, can call this callback to decide if a `Delete` or `SingleDelete` should be used to cancel a prior `Put` written to the database during prepare phase. The purpose of this PR is to prevent mixing `Delete` and `SingleDelete` for the same key, causing undefined behaviors. Without this PR, the following can happen: ``` // The application always issues SingleDelete when deleting keys. txn1->Put('a'); txn1->Prepare(); // writes to memtable and potentially gets flushed/compacted to Lmax txn1->Rollback(); // inserts DELETE('a') txn2->Put('a'); txn2->Commit(); // writes to memtable and potentially gets flushed/compacted ``` In the database, we may have ``` L0: [PUT('a', s=100)] L1: [DELETE('a', s=90)] Lmax: [PUT('a', s=0)] ``` If a compaction compacts L0 and L1, then we have ``` L1: [PUT('a', s=100)] Lmax: [PUT('a', s=0)] ``` If a future transaction issues a SingleDelete, we have ``` L0: [SD('a', s=110)] L1: [PUT('a', s=100)] Lmax: [PUT('a', s=0)] ``` Then, a compaction including L0, L1 and Lmax leads to ``` Lmax: [PUT('a', s=0)] ``` which is incorrect. Similar bugs reported and addressed in https://github.com/cockroachdb/pebble/issues/1255. Based on our team's current priority, we have decided to take this approach for now. We may come back and revisit in the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9873 Test Plan: make check Reviewed By: ltamasi Differential Revision: D35762170 Pulled By: riversand963 fbshipit-source-id: b28d56eefc786b53c9844b9ef4a7807acdd82c8d |
|
sdong | 4f9c0fd083 |
Add Aggregation Merge Operator (#9780)
Summary: Add a merge operator that allows users to register specific aggregation function so that they can does aggregation based per key using different aggregation types. See comments of function CreateAggMergeOperator() for actual usage. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9780 Test Plan: Add a unit test to coverage various cases. Reviewed By: ltamasi Differential Revision: D35267444 fbshipit-source-id: 5b02f31c4f3e17e96dd4025cdc49fca8c2868628 |
|
Levi Tamasi | db536ee045 |
Propagate errors from UpdateBoundaries (#9851)
Summary: In `FileMetaData`, we keep track of the lowest-numbered blob file referenced by the SST file in question for the purposes of BlobDB's garbage collection in the `oldest_blob_file_number` field, which is updated in `UpdateBoundaries`. However, with the current code, `BlobIndex` decoding errors (or invalid blob file numbers) are swallowed in this method. The patch changes this by propagating these errors and failing the corresponding flush/compaction. (Note that since blob references are generated by the BlobDB code and also parsed by `CompactionIterator`, in reality this can only happen in the case of memory corruption.) This change necessitated updating some unit tests that involved fake/corrupt `BlobIndex` objects. Some of these just used a dummy string like `"blob_index"` as a placeholder; these were replaced with real `BlobIndex`es. Some were relying on the earlier behavior to simulate corruption; these were replaced with `SyncPoint`-based test code that corrupts a valid blob reference at read time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9851 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D35683671 Pulled By: ltamasi fbshipit-source-id: f7387af9945c48e4d5c4cd864f1ba425c7ad51f6 |
|
Peter Dillinger | efd035164b |
Meta-internal folly integration with F14FastMap (#9546)
Summary: Especially after updating to C++17, I don't see a compelling case for *requiring* any folly components in RocksDB. I was able to purge the existing hard dependencies, and it can be quite difficult to strip out non-trivial components from folly for use in RocksDB. (The prospect of doing that on F14 has changed my mind on the best approach here.) But this change creates an optional integration where we can plug in components from folly at compile time, starting here with F14FastMap to replace std::unordered_map when possible (probably no public APIs for example). I have replaced the biggest CPU users of std::unordered_map with compile-time pluggable UnorderedMap which will use F14FastMap when USE_FOLLY is set. USE_FOLLY is always set in the Meta-internal buck build, and a simulation of that is in the Makefile for public CI testing. A full folly build is not needed, but checking out the full folly repo is much simpler for getting the dependency, and anything else we might want to optionally integrate in the future. Some picky details: * I don't think the distributed mutex stuff is actually used, so it was easy to remove. * I implemented an alternative to `folly::constexpr_log2` (which is much easier in C++17 than C++11) so that I could pull out the hard dependencies on `ConstexprMath.h` * I had to add noexcept move constructors/operators to some types to make F14's complainUnlessNothrowMoveAndDestroy check happy, and I added a macro to make that easier in some common cases. * Updated Meta-internal buck build to use folly F14Map (always) No updates to HISTORY.md nor INSTALL.md as this is not (yet?) considered a production integration for open source users. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9546 Test Plan: CircleCI tests updated so that a couple of them use folly. Most internal unit & stress/crash tests updated to use Meta-internal latest folly. (Note: they should probably use buck but they currently use Makefile.) Example performance improvement: when filter partitions are pinned in cache, they are tracked by PartitionedFilterBlockReader::filter_map_ and we can build a test that exercises that heavily. Build DB with ``` TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters ``` and test with (simultaneous runs with & without folly, ~20 times each to see convergence) ``` TEST_TMPDIR=/dev/shm/rocksdb ./db_bench_folly -readonly -use_existing_db -benchmarks=readrandom -num=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters -duration=40 -pin_l0_filter_and_index_blocks_in_cache ``` Average ops/s no folly: 26229.2 Average ops/s with folly: 26853.3 (+2.4%) Reviewed By: ajkr Differential Revision: D34181736 Pulled By: pdillinger fbshipit-source-id: ffa6ad5104c2880321d8a1aa7187e00ab0d02e94 |
|
mrambacher | b7db7eae26 |
Plugin Registry (#7949)
Summary: Added a Plugin class to the ObjectRegistry. Enabled compile-time and program-time addition of plugins to the Registry. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7949 Reviewed By: mrambacher Differential Revision: D33517674 Pulled By: pdillinger fbshipit-source-id: c3e3270aab76a489bfa9e85d78cdfca951912557 |
|
gitbw95 | f241d082b6 |
Prevent double caching in the compressed secondary cache (#9747)
Summary: ### **Summary:** When both LRU Cache and CompressedSecondaryCache are configured together, there possibly are some data blocks double cached. **Changes include:** 1. Update IS_PROMOTED to IS_IN_SECONDARY_CACHE to prevent confusions. 2. This PR updates SecondaryCacheResultHandle and use IsErasedFromSecondaryCache to determine whether the handle is erased in the secondary cache. Then, the caller can determine whether to SetIsInSecondaryCache(). 3. Rename LRUSecondaryCache to CompressedSecondaryCache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9747 Test Plan: **Test Scripts:** 1. Populate a DB. The on disk footprint is 482 MB. The data is set to be 50% compressible, so the total decompressed size is expected to be 964 MB. ./db_bench --benchmarks=fillrandom --num=10000000 -db=/db_bench_1 2. overwrite it to a stable state: ./db_bench --benchmarks=overwrite,stats --num=10000000 -use_existing_db -duration=10 --benchmark_write_rate_limit=2000000 -db=/db_bench_1 4. Run read tests with diffeernt cache setting: T1: ./db_bench --benchmarks=seekrandom,stats --threads=16 --num=10000000 -use_existing_db -duration=120 --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=520000000 --statistics -db=/db_bench_1 T2: ./db_bench --benchmarks=seekrandom,stats --threads=16 --num=10000000 -use_existing_db -duration=120 --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=320000000 -compressed_secondary_cache_size=400000000 --statistics -use_compressed_secondary_cache -db=/db_bench_1 T3: ./db_bench --benchmarks=seekrandom,stats --threads=16 --num=10000000 -use_existing_db -duration=120 --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=520000000 -compressed_secondary_cache_size=400000000 --statistics -use_compressed_secondary_cache -db=/db_bench_1 T4: ./db_bench --benchmarks=seekrandom,stats --threads=16 --num=10000000 -use_existing_db -duration=120 --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=20000000 -compressed_secondary_cache_size=500000000 --statistics -use_compressed_secondary_cache -db=/db_bench_1 **Before this PR** | Cache Size | Compressed Secondary Cache Size | Cache Hit Rate | |------------|-------------------------------------|----------------| |520 MB | 0 MB | 85.5% | |320 MB | 400 MB | 96.2% | |520 MB | 400 MB | 98.3% | |20 MB | 500 MB | 98.8% | **Before this PR** | Cache Size | Compressed Secondary Cache Size | Cache Hit Rate | |------------|-------------------------------------|----------------| |520 MB | 0 MB | 85.5% | |320 MB | 400 MB | 99.9% | |520 MB | 400 MB | 99.9% | |20 MB | 500 MB | 99.2% | Reviewed By: anand1976 Differential Revision: D35117499 Pulled By: gitbw95 fbshipit-source-id: ea2657749fc13efebe91a8a1b56bc61d6a224a12 |
|
Yanqin Jin | 1a1c5bda23 |
Disallow commit-time-batch for write-prepared/write-unprepared txn conditionally (#9794)
Summary: For write-prepared/write-unprepared transactions, GetCommitTimeWriteBatch() can be used only if the transaction is started with `TransactionOptions::use_only_the_last_commit_time_batch_for_recovery` set to true. Otherwise, it is possible that multiple uncommitted versions of the same key exist in the database. During bottommost compaction, RocksDB may set the sequence numbers of both to zero once they become committed, causing output SST file to have two identical internal keys. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9794 Test Plan: make check pay special attention to the following ``` transaction_test --gtest_filter=MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/* ``` Reviewed By: lth Differential Revision: D35327214 Pulled By: riversand963 fbshipit-source-id: 3bae00a28359c10e96e4c6f676d20de5610d8a0f |
|
Peter Dillinger | 6534c6dea4 |
Fix remaining uses of "backupable" (#9792)
Summary: Various renaming and fixes to get rid of remaining uses of "backupable" which is terminology leftover from the original, flawed design of BackupableDB. Now any DB can be backed up, using BackupEngine. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9792 Test Plan: CI Reviewed By: ajkr Differential Revision: D35334386 Pulled By: pdillinger fbshipit-source-id: 2108a42b4575c8cccdfd791c549aae93ec2f3329 |
|
Chen Lixiang | cd59b139fc |
Fix some typos in comments and HISTORY.md (#9798)
Summary: compation --> compaction Pull Request resolved: https://github.com/facebook/rocksdb/pull/9798 Reviewed By: ajkr Differential Revision: D35341611 Pulled By: jay-zhuang fbshipit-source-id: 5ea07527c311de75cade219456b6ee52b23020f6 |
|
Peter Dillinger | 40e3f30a28 |
Fix FileStorageInfo fields from GetLiveFilesMetaData (#9769)
Summary: In making `SstFileMetaData` inherit from `FileStorageInfo`, I overlooked setting some `FileStorageInfo` fields when then default `SstFileMetaData()` ctor is used. This affected `GetLiveFilesMetaData()`. Also removed some buggy `static_cast<size_t>` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9769 Test Plan: Updated tests Reviewed By: jay-zhuang Differential Revision: D35220383 Pulled By: pdillinger fbshipit-source-id: 05b4ee468258dbd3699517e1124838bf405fe7f8 |
|
gitbw95 | 8102690a52 |
Update Cache::Release param from force_erase to erase_if_last_ref (#9728)
Summary: The param name force_erase may be misleading, since the handle is erased only if it has last reference even if the param is set true. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9728 Reviewed By: pdillinger Differential Revision: D35038673 Pulled By: gitbw95 fbshipit-source-id: 0d16d1e8fed17b97eba7fb53207119332f659a5f |
|
Peter Dillinger | cff0d1e8e6 |
New backup meta schema, with file temperatures (#9660)
Summary: The primary goal of this change is to add support for backing up and restoring (applying on restore) file temperature metadata, without committing to either the DB manifest or the FS reported "current" temperatures being exclusive "source of truth". To achieve this goal, we need to add temperature information to backup metadata, which requires updated backup meta schema. Fortunately I prepared for this in https://github.com/facebook/rocksdb/issues/8069, which began forward compatibility in version 6.19.0 for this kind of schema update. (Previously, backup meta schema was not extensible! Making this schema update public will allow some other "nice to have" features like taking backups with hard links, and avoiding crc32c checksum computation when another checksum is already available.) While schema version 2 is newly public, the default schema version is still 1. Until we change the default, users will need to set to 2 to enable features like temperature data backup+restore. New metadata like temperature information will be ignored with a warning in versions before this change and since 6.19.0. The metadata is considered ignorable because a functioning DB can be restored without it. Some detail: * Some renaming because "future schema" is now just public schema 2. * Initialize some atomics in TestFs (linter reported) * Add temperature hint support to SstFileDumper (used by BackupEngine) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9660 Test Plan: related unit test majorly updated for the new functionality, including some shared testing support for tracking temperatures in a FS. Some other tests and testing hooks into production code also updated for making the backup meta schema change public. Reviewed By: ajkr Differential Revision: D34686968 Pulled By: pdillinger fbshipit-source-id: 3ac1fa3e67ee97ca8a5103d79cc87d872c1d862a |
|
Yanqin Jin | 565fcead22 |
Fix clang-analyze by adding assertion (#9682)
Summary: Clang-analyze complains about potential nullptr dereference. Fix by adding an assertion to make clang happy. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9682 Test Plan: USE_CLANG=1 make -j20 analyze_incremental Reviewed By: ltamasi Differential Revision: D34755210 Pulled By: riversand963 fbshipit-source-id: 948e1899846ee1aa05a1b500a11ff43b0b412e0a |
|
Yanqin Jin | 3b6dc049f7 |
Support user-defined timestamps in write-committed txns (#9629)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9629 Pessimistic transactions use pessimistic concurrency control, i.e. locking. Keys are locked upon first operation that writes the key or has the intention of writing. For example, `PessimisticTransaction::Put()`, `PessimisticTransaction::Delete()`, `PessimisticTransaction::SingleDelete()` will write to or delete a key, while `PessimisticTransaction::GetForUpdate()` is used by application to indicate to RocksDB that the transaction has the intention of performing write operation later in the same transaction. Pessimistic transactions support two-phase commit (2PC). A transaction can be `Prepared()`'ed and then `Commit()`. The prepare phase is similar to a promise: once `Prepare()` succeeds, the transaction has acquired the necessary resources to commit. The resources include locks, persistence of WAL, etc. Write-committed transaction is the default pessimistic transaction implementation. In RocksDB write-committed transaction, `Prepare()` will write data to the WAL as a prepare section. `Commit()` will write a commit marker to the WAL and then write data to the memtables. While writing to the memtables, different keys in the transaction's write batch will be assigned different sequence numbers in ascending order. Until commit/rollback, the transaction holds locks on the keys so that no other transaction can write to the same keys. Furthermore, the keys' sequence numbers represent the order in which they are committed and should be made visible. This is convenient for us to implement support for user-defined timestamps. Since column families with and without timestamps can co-exist in the same database, a transaction may or may not involve timestamps. Based on this observation, we add two optional members to each `PessimisticTransaction`, `read_timestamp_` and `commit_timestamp_`. If no key in the transaction's write batch has timestamp, then setting these two variables do not have any effect. For the rest of this commit, we discuss only the cases when these two variables are meaningful. read_timestamp_ is used mainly for validation, and should be set before first call to `GetForUpdate()`. Otherwise, the latter will return non-ok status. `GetForUpdate()` calls `TryLock()` that can verify if another transaction has written the same key since `read_timestamp_` till this call to `GetForUpdate()`. If another transaction has indeed written the same key, then validation fails, and RocksDB allows this transaction to refine `read_timestamp_` by increasing it. Note that a transaction can still use `Get()` with a different timestamp to read, but the result of the read should not be used to determine data that will be written later. commit_timestamp_ must be set after finishing writing and before transaction commit. This applies to both 2PC and non-2PC cases. In the case of 2PC, it's usually set after prepare phase succeeds. We currently require that the commit timestamp be chosen after all keys are locked. This means we disallow the `TransactionDB`-level APIs if user-defined timestamp is used by the transaction. Specifically, calling `PessimisticTransactionDB::Put()`, `PessimisticTransactionDB::Delete()`, `PessimisticTransactionDB::SingleDelete()`, etc. will return non-ok status because they specify timestamps before locking the keys. Users are also prompted to use the `Transaction` APIs when they receive the non-ok status. Reviewed By: ltamasi Differential Revision: D31822445 fbshipit-source-id: b82abf8e230216dc89cc519564a588224a88fd43 |
|
Peter Dillinger | ce60d0cbe5 |
Test refactoring for Backups+Temperatures (#9655)
Summary: In preparation for more support for file Temperatures in BackupEngine, this change does some test refactoring: * Move DBTest2::BackupFileTemperature test to BackupEngineTest::FileTemperatures, with some updates to make it work in the new home. This test will soon be expanded for deeper backup work. * Move FileTemperatureTestFS from db_test2.cc to db_test_util.h, to support sharing because of above moved test, but split off the "no link" part to the test needing it. * Use custom FileSystems in backupable_db_test rather than custom Envs, because going through Env file interfaces doesn't support temperatures. * Fix RemapFileSystem to map DirFsyncOptions::renamed_new_name parameter to FsyncWithDirOptions, which was required because this limitation caused a crash only after moving to higher fidelity of FileSystem interface (vs. LegacyDirectoryWrapper throwing away some parameter details) * `backupable_options_` -> `engine_options_` as part of the ongoing work to get rid of the obsolete "backupable" naming. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9655 Test Plan: test code updates only Reviewed By: jay-zhuang Differential Revision: D34622183 Pulled By: pdillinger fbshipit-source-id: f24b7a596a89b9e089e960f4e5d772575513e93f |
|
Yanqin Jin | 6f12599863 |
Support WBWI for keys having timestamps (#9603)
Summary: This PR supports inserting keys to a `WriteBatchWithIndex` for column families that enable user-defined timestamps and reading the keys back. **The index does not have timestamps.** Writing a key to WBWI is unchanged, because the underlying WriteBatch already supports it. When reading the keys back, we need to make sure to distinguish between keys with and without timestamps before comparison. When user calls `GetFromBatchAndDB()`, no timestamp is needed to query the batch, but a timestamp has to be provided to query the db. The assumption is that data in the batch must be newer than data from the db. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9603 Test Plan: make check Reviewed By: ltamasi Differential Revision: D34354849 Pulled By: riversand963 fbshipit-source-id: d25d1f84e2240ce543e521fa30595082fb8db9a0 |
|
Jay Zhuang | d3a2f284d9 |
Add Temperature info in `NewSequentialFile()` (#9499)
Summary: Add Temperature hints information from RocksDB in API `NewSequentialFile()`. backup and checkpoint operations need to open the source files with `NewSequentialFile()`, which will have the temperature hints. Other operations are not covered. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9499 Test Plan: Added unittest Reviewed By: pdillinger Differential Revision: D34006115 Pulled By: jay-zhuang fbshipit-source-id: 568b34602b76520e53128672bd07e9d886786a2f |
|
mrambacher | 30b08878d8 |
Make FilterPolicy Customizable (#9590)
Summary: Make FilterPolicy into a Customizable class. Allow new FilterPolicy to be discovered through the ObjectRegistry Pull Request resolved: https://github.com/facebook/rocksdb/pull/9590 Reviewed By: pdillinger Differential Revision: D34327367 Pulled By: mrambacher fbshipit-source-id: 37e7edac90ec9457422b72f359ab8ef48829c190 |
|
Andrew Kryczka | babe56ddba |
Add rate limiter priority to ReadOptions (#9424)
Summary: Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working. `RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`. There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases I believe it would be good to let user control the priority some day (e.g., file footer reads), and no TODO in cases I believe it doesn't matter (e.g., trace file reads). The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9424 Test Plan: - new unit tests - new benchmarks on ~50MB database with 1MB/s read rate limit and 100ms refill interval; verified with strace reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart. - setup command: `./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true` - benchmarks command: `strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true` - crash test using IO_USER priority on non-validation reads with https://github.com/facebook/rocksdb/issues/9567 reverted: `python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10` Reviewed By: hx235 Differential Revision: D33747386 Pulled By: ajkr fbshipit-source-id: a2d985e97912fba8c54763798e04f006ccc56e0c |
|
Yanqin Jin | 1cda273dc3 |
Fix a silent data loss for write-committed txn (#9571)
Summary: The following sequence of events can cause silent data loss for write-committed transactions. ``` Time thread 1 bg flush | db->Put("a") | txn = NewTxn() | txn->Put("b", "v") | txn->Prepare() // writes only to 5.log | db->SwitchMemtable() // memtable 1 has "a" | // close 5.log, | // creates 8.log | trigger flush | pick memtable 1 | unlock db mutex | write new sst | txn->ctwb->Put("gtid", "1") // writes 8.log | txn->Commit() // writes to 8.log | // writes to memtable 2 | compute min_log_number_to_keep_2pc, this | will be 8 (incorrect). | | Purge obsolete wals, including 5.log | V ``` At this point, writes of txn exists only in memtable. Close db without flush because db thinks the data in memtable are backed by log. Then reopen, the writes are lost except key-value pair {"gtid"->"1"}, only the commit marker of txn is in 8.log The reason lies in `PrecomputeMinLogNumberToKeep2PC()` which calls `FindMinPrepLogReferencedByMemTable()`. In the above example, when bg flush thread tries to find obsolete wals, it uses the information computed by `PrecomputeMinLogNumberToKeep2PC()`. The return value of `PrecomputeMinLogNumberToKeep2PC()` depends on three components - `PrecomputeMinLogNumberToKeepNon2PC()`. This represents the WAL that has unflushed data. As the name of this method suggests, it does not account for 2PC. Although the keys reside in the prepare section of a previous WAL, the column family references the current WAL when they are actually inserted into the memtable during txn commit. - `prep_tracker->FindMinLogContainingOutstandingPrep()`. This represents the WAL with a prepare section but the txn hasn't committed. - `FindMinPrepLogReferencedByMemTable()`. This represents the WAL on which some memtables (mutable and immutable) depend for their unflushed data. The bug lies in `FindMinPrepLogReferencedByMemTable()`. Originally, this function skips checking the column families that are being flushed, but the unit test added in this PR shows that they should not be. In this unit test, there is only the default column family, and one of its memtables has unflushed data backed by a prepare section in 5.log. We should return this information via `FindMinPrepLogReferencedByMemTable()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9571 Test Plan: ``` ./transaction_test --gtest_filter=*/TransactionTest.SwitchMemtableDuringPrepareAndCommit_WC/* make check ``` Reviewed By: siying Differential Revision: D34235236 Pulled By: riversand963 fbshipit-source-id: 120eb21a666728a38dda77b96276c6af72b008b1 |
|
mrambacher | c42d0cf862 |
Add support for decimals to PatternEntry (#9577)
Summary: Add support for doubles to ObjectLibrary::PatternEntry. This support will allow patterns containing a non-integer number to be parsed correctly. Added appropriate test cases to cover this new option. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9577 Reviewed By: pdillinger Differential Revision: D34269763 Pulled By: mrambacher fbshipit-source-id: b5ce16cbd3665c2974ec0f3412ef2b403ef8b155 |
|
Yanqin Jin | 241b5aa15a |
Timestamp-based validation for pessimistic txn (#9562)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9562 With per-transaction `read_timestamp_`, it is possible to perform transaction validation after locking a key in addition to sequence-based validation. Specifically, if a transaction has a read_timestamp, then we perform timestamp-based validation as well after the key is locked via `GetForUpdate()`. This is to make sure that no other transaction has modified the key and committed successfully since the read timestamp (but before the locking operation) which represents a consistent view of the database. Reviewed By: ltamasi Differential Revision: D31822034 fbshipit-source-id: c6f1828b7fc23e4f85e2d1ed73ff51464a058d91 |
|
Yanqin Jin | d6e1e6f37a |
Add commit_timestamp and read_timestamp to Pessimistic transaction (#9537)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9537 Add `Transaction::SetReadTimestampForValidation()` and `Transaction::SetCommitTimestamp()` APIs with default implementation returning `Status::NotSupported()`. Currently, calling these two APIs do not have any effect. Also add checks to `PessimisticTransactionDB` to enforce that column families in the same db either - disable user-defined timestamp - enable 64-bit timestamp Just to clarify, a `PessimisticTransactionDB` can have some column families without timestamps as well as column families that enable timestamp. Each `PessimisticTransaction` can have two optional timestamps, `read_timestamp_` used for additional validation and `commit_timestamp_` which denotes when the transaction commits. For now, we are going to support `WriteCommittedTxn` (in a series of subsequent PRs) Once set, we do not allow decreasing `read_timestamp_`. The `commit_timestamp_` must be greater than `read_timestamp_` for each transaction and must be set before commit, unless the transaction does not involve any column family that enables user-defined timestamp. TransactionDB builds on top of RocksDB core `DB` layer. Though `DB` layer assumes that user-defined timestamps are byte arrays, `TransactionDB` uses uint64_t to store timestamps. When they are passed down, they are still interpreted as byte-arrays by `DB`. Reviewed By: ltamasi Differential Revision: D31567959 fbshipit-source-id: b0b6b69acab5d8e340cf174f33e8b09f1c3d3502 |
|
mrambacher | 81ada95bd7 |
Add STATIC_AVOID_DESTRUCTION for ObjectLibrary/Registry (#9464)
Summary: This change should guarantee that the default ObjectLibrary/Registry are long-lived and not destroyed while the process is running. This will prevent some issues of them being referenced after they were destroyed via the static destruction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9464 Reviewed By: pdillinger Differential Revision: D33849876 Pulled By: mrambacher fbshipit-source-id: 7a69177d7c58c81be293fc7ef8e600d47ddbc14b |
|
mrambacher | fe9d495112 |
Return different Status based on ObjectRegistry::NewObject calls (#9333)
Summary: This fix addresses https://github.com/facebook/rocksdb/issues/9299. If attempting to create a new object via the ObjectRegistry and a factory is not found, the ObjectRegistry will return a "NotSupported" status. This is the same behavior as previously. If the factory is found but could not successfully create the object, an "InvalidArgument" status is returned. If the factory returned a reason why (in the errmsg), this message will be in the returned status. In practice, there are two options in the ConfigOptions that control how these errors are propagated: - If "ignore_unknown_options=true", then both InvalidArgument and NotSupported status codes will be swallowed internally. Both cases will return success - If "ignore_unsupported_options=true", then having no factory will return success but a failing factory will return an error - If both options are false, both cases (no and failing factory) will return errors. In practice this likely only changes Customizable that may be partially available. For example, the JEMallocMemoryAllocator is a built-in allocator that is registered with the system but may not be compiled in. In this case, the status code for this allocator changed from NotSupported("JEMalloc not available") to InvalidArgumen("JEMalloc not available"). Other Customizable builtins/plugins would have the same semantics. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9333 Reviewed By: pdillinger Differential Revision: D33517681 Pulled By: mrambacher fbshipit-source-id: 8033052d4a4a7b88c2d9f90147b1b4467e51f6fd |
|
Peter Dillinger | 5cb137a860 |
Work around some new clang-analyze failures (#9515)
Summary: ... seen only in internal clang-analyze runs after https://github.com/facebook/rocksdb/issues/9481 * Mostly, this works around falsely reported leaks by using std::unique_ptr in some places where clang-analyze was getting confused. (I didn't see any changes in C++17 that could make our Status implementation leak memory.) * Also fixed SetBGError returning address of a stack variable. * Also fixed another false null deref report by adding an assert. Also, use SKIP_LINK=1 to speed up `make analyze` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9515 Test Plan: Was able to reproduce the reported errors locally and verify they're fixed (except SetBGError). Otherwise, existing tests Reviewed By: hx235 Differential Revision: D34054630 Pulled By: pdillinger fbshipit-source-id: 38600ef3da75ddca307dff96b7a1a523c2885c2e |
|
mrambacher | aae3093719 |
Introduce a CountedFileSystem for counting file operations (#9283)
Summary: Added a CountedFileSystem that tracks a number of file operations (opens, closes, deletes, renames, flushes, syncs, fsyncs, reads, writes). This class was based on the ReportFileOpEnv from db_bench. This is a stepping stone PR to be able to change the SpecialEnv into a SpecialFileSystem, where several of the file varieties wish to do operation counting. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9283 Reviewed By: pdillinger Differential Revision: D33062004 Pulled By: mrambacher fbshipit-source-id: d0d297a7fb9c48c06cbf685e5fa755c27193b6f5 |
|
Yanqin Jin | 3122cb4358 |
Revise APIs related to user-defined timestamp (#8946)
Summary: ajkr reminded me that we have a rule of not including per-kv related data in `WriteOptions`. Namely, `WriteOptions` should not include information about "what-to-write", but should just include information about "how-to-write". According to this rule, `WriteOptions::timestamp` (experimental) is clearly a violation. Therefore, this PR removes `WriteOptions::timestamp` for compliance. After the removal, we need to pass timestamp info via another set of APIs. This PR proposes a set of overloaded functions `Put(write_opts, key, value, ts)`, `Delete(write_opts, key, ts)`, and `SingleDelete(write_opts, key, ts)`. Planned to add `Write(write_opts, batch, ts)`, but its complexity made me reconsider doing it in another PR (maybe). For better checking and returning error early, we also add a new set of APIs to `WriteBatch` that take extra `timestamp` information when writing to `WriteBatch`es. These set of APIs in `WriteBatchWithIndex` are currently not supported, and are on our TODO list. Removed `WriteBatch::AssignTimestamps()` and renamed `WriteBatch::AssignTimestamp()` to `WriteBatch::UpdateTimestamps()` since this method require that all keys have space for timestamps allocated already and multiple timestamps can be updated. The constructor of `WriteBatch` now takes a fourth argument `default_cf_ts_sz` which is the timestamp size of the default column family. This will be used to allocate space when calling APIs that do not specify a column family handle. Also, updated `DB::Get()`, `DB::MultiGet()`, `DB::NewIterator()`, `DB::NewIterators()` methods, replacing some assertions about timestamp to returning Status code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8946 Test Plan: make check ./db_bench -benchmarks=fillseq,fillrandom,readrandom,readseq,deleterandom -user_timestamp_size=8 ./db_stress --user_timestamp_size=8 -nooverwritepercent=0 -test_secondary=0 -secondary_catch_up_one_in=0 -continuous_verification_interval=0 Make sure there is no perf regression by running the following ``` ./db_bench_opt -db=/dev/shm/rocksdb -use_existing_db=0 -level0_stop_writes_trigger=256 -level0_slowdown_writes_trigger=256 -level0_file_num_compaction_trigger=256 -disable_wal=1 -duration=10 -benchmarks=fillrandom ``` Before this PR ``` DB path: [/dev/shm/rocksdb] fillrandom : 1.831 micros/op 546235 ops/sec; 60.4 MB/s ``` After this PR ``` DB path: [/dev/shm/rocksdb] fillrandom : 1.820 micros/op 549404 ops/sec; 60.8 MB/s ``` Reviewed By: ltamasi Differential Revision: D33721359 Pulled By: riversand963 fbshipit-source-id: c131561534272c120ffb80711d42748d21badf09 |
|
Peter Dillinger | 78aee6fedc |
Remove obsolete backupable_db.h, utility_db.h (#9438)
Summary: This also removes the obsolete names BackupableDBOptions and UtilityDB. API users must now use BackupEngineOptions and DBWithTTL::Open. In C API, `rocksdb_backupable_db_*` is replaced `rocksdb_backup_engine_*`. Similar renaming in Java API. In reference to https://github.com/facebook/rocksdb/issues/9389 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9438 Test Plan: CI Reviewed By: mrambacher Differential Revision: D33780269 Pulled By: pdillinger fbshipit-source-id: 4a6cfc5c1b4c78bcad790b9d3dd13c5fdf4a1fac |
|
Peter Dillinger | 449029f865 |
Remove deprecated ObjectLibrary::Register() (and Regex public API) (#9439)
Summary: Regexes are considered potentially problematic for use in registering RocksDB extensions, so we are removing ObjectLibrary::Register() and the Regex public API it depended on (now unused). In reference to https://github.com/facebook/rocksdb/issues/9389 Why? * The power of Regexes can make it hard to reason about which extension will match what. (The replacement API isn't perfect, but we are at least "holding the line" on patterns we have seen in practice.) * It is easy to make regexes that don't quite mean what you think they mean, such as forgetting that the `.` in `foo.bar` can match any character or that matching is nondeterministic, as in `a🅱️42` matching `.*:[0-9]+`. * Some regexes and implementations can have disastrously bad performance. This might not be much practical concern for ObjectLibray here, but we don't want to encourage potentially dangerous further use in production code. (Testing code is fine. See TestRegex.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9439 Test Plan: CI Reviewed By: mrambacher Differential Revision: D33792342 Pulled By: pdillinger fbshipit-source-id: 4f64dcb04764e639162c8977a5fa196f67754cec |
|
Yanqin Jin | fa52376117 |
Move RADOS support to separate repo (#9206)
Summary: This PR moves RADOS support from RocksDB repo to a separate repo. The new (temporary?) repo in this PR serves as an example before we finalize the decision on where and who to host RADOS support. At this point, people can start from the example repo and fork. The goal is to include this commit in RocksDB 7.0 release. Reference: https://github.com/ajkr/dedupfs by ajkr Pull Request resolved: https://github.com/facebook/rocksdb/pull/9206 Test Plan: Follow instructions in https://github.com/riversand963/rocksdb-rados-env/blob/main/README.md and build test binary `env_librados_test` and run it. Also, make check Reviewed By: ajkr Differential Revision: D33751690 Pulled By: riversand963 fbshipit-source-id: 30466c62afa9e4619847a48567ed158e62835e35 |