rocksdb

Commit Graph

Author	SHA1	Message	Date
mrambacher	b6640c3117	Remove FactoryFunc from LoadXXXObject (#11203 ) Summary: The primary purpose of the FactoryFunc was to support LITE mode where the ObjectRegistry was not available. With the removal of LITE mode, the function was no longer required. Note that the MergeOperator had some private classes defined in header files. To gain access to their constructors (and name methods), the class definitions were moved into header files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11203 Reviewed By: cbi42 Differential Revision: D43160255 Pulled By: pdillinger fbshipit-source-id: f3a465fd5d1a7049b73ecf31e4b8c3762f6dae6c	2023-02-17 12:54:07 -08:00
Andrew Kryczka	25e1365227	Merge operator failed subcode (#11231 ) Summary: From HISTORY.md: Added a subcode of `Status::Corruption`, `Status::SubCode::kMergeOperatorFailed`, for users to identify corruption failures originating in the merge operator, as opposed to RocksDB's internally identified data corruptions. This is a followup to https://github.com/facebook/rocksdb/issues/11092, where we gave users the ability to keep running a DB despite merge operator failing. Now that the DB keeps running despite such failures, they want to be able to distinguish such failures from real corruptions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11231 Test Plan: updated unit test Reviewed By: akankshamahajan15 Differential Revision: D43396607 Pulled By: ajkr fbshipit-source-id: 17fbcc779ad724dafada8abd73efd38e1c5208b9	2023-02-17 10:58:46 -08:00
akankshamahajan	68e4581c67	Return NotSupported in scan if IOUring not supported and enable IOUring in db_stress for async_io testing (#11197 ) Summary: - Return NotSupported in scan if IOUring not supported if async_io is enabled - Enable IOUring in db_stress for async_io testing - Disable async_io in circleci crash testing as circleci doesn't support IOUring Pull Request resolved: https://github.com/facebook/rocksdb/pull/11197 Test Plan: CircleCI jobs Reviewed By: anand1976 Differential Revision: D43096313 Pulled By: akankshamahajan15 fbshipit-source-id: c2c53a87636950c0243038b9f5bd0d91608e4fda	2023-02-16 18:33:06 -08:00
Peter Dillinger	64a1f7670f	Customize CompressedSecondaryCache by block kind (#11204 ) Summary: Added `do_not_compress_roles` to `CompressedSecondaryCacheOptions` to disable compression on certain kinds of block. Filter blocks are now not compressed by CompressedSecondaryCache by default. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11204 Test Plan: unit test added Reviewed By: anand1976 Differential Revision: D43147698 Pulled By: pdillinger fbshipit-source-id: db496975ae975fa18f157f93fe131a16315ac875	2023-02-16 17:22:27 -08:00
Levi Tamasi	94ec433833	Mention the new MultiGetEntity API in HISTORY.md (#11226 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11226 Reviewed By: akankshamahajan15 Differential Revision: D43317602 Pulled By: ltamasi fbshipit-source-id: 4b7e063848d3cfbdb9f0c0f54d68aeab8a82595c	2023-02-15 11:50:18 -08:00
Peter Dillinger	3cacd4b4ec	Put Cache and CacheWrapper in new public header (#11192 ) Summary: The definition of the Cache class should not be needed by the vast majority of RocksDB users, so I think it is just distracting to include it in cache.h, which is primarily needed for configuring and creating caches. This change moves the class to a new header advanced_cache.h. It is just cut-and-paste except for modifying the class API comment. In general, operations on shared_ptr<Cache> should continue to work when only a forward declaration of Cache is available, as long as all the Cache instances provided are already shared_ptr. See https://stackoverflow.com/a/17650101/454544 Also, the most common way to customize a Cache is by wrapping an existing implementation, so it makes sense to provide CacheWrapper in the public API. This was a cut-and-paste job except removing the implementation of Name() so that derived classes must provide it. Intended follow-up: consolidate Release() into one function to reduce customization bugs / confusion Pull Request resolved: https://github.com/facebook/rocksdb/pull/11192 Test Plan: `make check` Reviewed By: anand1976 Differential Revision: D43055487 Pulled By: pdillinger fbshipit-source-id: 7b05492df35e0f30b581b4c24c579bc275b6d110	2023-02-09 12:12:02 -08:00
anand76	77b61abc7b	Fix bug in WAL streaming uncompression (#11198 ) Summary: Fix a bug in the calculation of the input buffer address/offset in log_reader.cc. The bug is when consecutive fragments of a compressed record are located at the same offset in the log reader buffer, the second fragment input buffer is treated as a leftover from the previous input buffer. As a result, the offset in the `ZSTD_inBuffer` is not reset. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11198 Test Plan: Add a unit test in log_test.cc that fails without the fix and passes with it. Reviewed By: ajkr, cbi42 Differential Revision: D43102692 Pulled By: anand1976 fbshipit-source-id: aa2648f4802c33991b76a3233c5a58d4cc9e77fd	2023-02-08 12:05:49 -08:00
Levi Tamasi	876d281592	Add compaction filter support for wide-column entities (#11196 ) Summary: The patch adds compaction filter support for wide-column entities by introducing a new `CompactionFilter` API called `FilterV3`. This API is called for regular key-values, merge operands, and wide-column entities as well. It is passed the existing value/operand or wide-column structure and it can update the value or columns or keep/delete/etc. the key-value as usual. For compatibility, the default implementation of `FilterV3` keeps all wide-column entities and falls back to calling `FilterV2` for plain old key-values and merge operands. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11196 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D43094147 Pulled By: ltamasi fbshipit-source-id: 75acabe9a35254f7f404ba6173ee9c2774382ebd	2023-02-07 16:17:39 -08:00
Hui Xiao	6650ca244e	Remove a couple deprecated convenience.h APIs (#11120 ) Summary: Context/Summary: As instructed by convenience.h comments, a few deprecated APIs are removed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11120 Test Plan: - make check & CI - eyeball check on test semantics. Reviewed By: pdillinger Differential Revision: D42937507 Pulled By: hx235 fbshipit-source-id: a9e4709387da01b1d0e9148c2e210f02e9746ee1	2023-02-07 14:11:53 -08:00
Peter Dillinger	cf756ed916	Use LIB_MODE=shared build by default with make (#11168 ) Summary: With https://github.com/facebook/rocksdb/issues/11150 this becomes a practical change that I think is overall good for developer efficiency. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11168 Test Plan: More efficient build of all unit tests and tools: ``` $ git clean -fdx $ du -sh . 522M . $ /usr/bin/time make -j32 LIB_MODE=static ... 14270.63user 1043.33system 11:19.85elapsed 2252%CPU (0avgtext+0avgdata 1929944maxresident)k ... $ du -sh . 62G . $ ``` Vs. ``` $ git clean -fdx $ du -sh . 522M . $ /usr/bin/time make -j32 LIB_MODE=shared ... 9479.87user 478.26system 7:20.82elapsed 2258%CPU (0avgtext+0avgdata 1929272maxresident)k ... $ du -sh . 5.4G . $ ``` So 1/3 less build time and >90% less space usage. Individual unit test edit-compile-run is not too different. Modifying an average unit test source file: ``` $ touch db/version_builder_test.cc $ /usr/bin/time make -j32 LIB_MODE=static version_builder_test ... 34.74user 3.37system 0:38.29elapsed 99%CPU (0avgtext+0avgdata 945520maxresident)k ``` Vs. ``` $ touch db/version_builder_test.cc $ /usr/bin/time make -j32 LIB_MODE=shared version_builder_test ... 116.26user 43.91system 0:28.65elapsed 559%CPU (0avgtext+0avgdata 675160maxresident)k ``` A little faster with shared. However, modifying an average DB implementation file has an extra linking step with shared lib: ``` $ touch db/db_impl/db_impl_files.cc $ /usr/bin/time make -j32 LIB_MODE=static version_builder_test ... 33.17user 5.13system 0:39.70elapsed 96%CPU (0avgtext+0avgdata 945544maxresident)k ``` Vs. ``` $ touch db/db_impl/db_impl_files.cc $ /usr/bin/time make -j32 LIB_MODE=shared version_builder_test ... 40.80user 4.66system 0:45.54elapsed 99%CPU (0avgtext+0avgdata 1056340maxresident)k ``` A little slower with shared. On the whole, should be faster and lighter weight because of the many unit test files case Reviewed By: cbi42 Differential Revision: D42894004 Pulled By: pdillinger fbshipit-source-id: 9e827e52ace79b86f849b6a24466e318b4b605a7	2023-02-03 15:28:52 -08:00
Peter Dillinger	0cf1008fe3	Deprecate write_global_seqno and default to false (#11179 ) Summary: This option has long been intended to be set to false by default and deprecated. It might never be practical to completely remove the feature, so that we can continue to test for backward compatibility by keeping the ability to generate DBs in the old way. Also improved API comments. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11179 Test Plan: existing tests (with one tiny update) Reviewed By: hx235 Differential Revision: D42973927 Pulled By: pdillinger fbshipit-source-id: e9bc161cb933266e094aea2dff8cc03753c39dab	2023-02-03 13:00:04 -08:00
anand76	63da9cfa26	Return any errors returned by ReadAsync to the MultiGet caller (#11171 ) Summary: Currently, we incorrectly return a Status::Corruption to the MultiGet caller if the file system ReadAsync cannot issue a read and returns an error for some reason, such as IOStatus::NotSupported(). In this PR, we copy the ReadAsync error to the request status so it can be returned to the user. Tests: Update existing unit tests and add a new one for this scenario Pull Request resolved: https://github.com/facebook/rocksdb/pull/11171 Reviewed By: akankshamahajan15 Differential Revision: D42950057 Pulled By: anand1976 fbshipit-source-id: 85ffcb015fa6c064c311f8a28488fec78c487869	2023-02-02 16:35:27 -08:00
Andrew Kryczka	6af16ac7c1	Update HISTORY.md for #11136 (#11177 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11177 Reviewed By: cbi42 Differential Revision: D42948946 Pulled By: ajkr fbshipit-source-id: 783d3d9007faaa036923a0364cdd0bfbd8e78062	2023-02-02 07:50:55 -08:00
Andrew Kryczka	071c33846d	Allow canceling manual compaction while waiting for conflicting compaction (#11165 ) Summary: This PR adds logic to the `RunManualCompaction()` loop to check for cancellation before waiting on any conflicting compactions to finish. In case of cancellation, `RunManualCompaction()` no longer waits on conflicting compactions Pull Request resolved: https://github.com/facebook/rocksdb/pull/11165 Test Plan: repro test case Reviewed By: cbi42 Differential Revision: D42864058 Pulled By: ajkr fbshipit-source-id: ea4dd1a8f294abe212905495a8fbe8f07fca3f5a	2023-01-31 16:57:49 -08:00
Levi Tamasi	a82021c3d0	Fix a bug where GetEntity would expose a blob reference (#11162 ) Summary: The patch fixes a feature interaction bug between BlobDB and the `GetEntity` API: without the patch, `GetEntity` would return the blob reference (wrapped into a single-column entity) instead of the actual blob value. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11162 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D42854092 Pulled By: ltamasi fbshipit-source-id: f750d0ff57def107da16f545077ddce9860ff21a	2023-01-31 09:59:25 -08:00
Peter Dillinger	94e3beec77	Cleanup, improve, stress test LockWAL() (#11143 ) Summary: The previous API comments for LockWAL didn't provide much about why you might want to use it, and didn't really meet what one would infer its contract was. Also, LockWAL was not in db_stress / crash test. In this change: * Implement a counting semantics for LockWAL()+UnlockWAL(), so that they can safely be used concurrently across threads or recursively within a thread. This should make the API much less bug-prone and easier to use. * Make sure no UnlockWAL() is needed after non-OK LockWAL() (to match RocksDB conventions) * Make UnlockWAL() reliably return non-OK when there's no matching LockWAL() (for debug-ability) * Clarify API comments on LockWAL(), UnlockWAL(), FlushWAL(), and SyncWAL(). Their exact meanings are not obvious, and I don't think it's appropriate to talk about implementation mutexes in the API comments, but about what operations might block each other. * Add LockWAL()/UnlockWAL() to db_stress and crash test, mostly to check for assertion failures, but also checks that latest seqno doesn't change while WAL is locked. This is simpler to add when LockWAL() is allowed in multiple threads. * Remove unnecessary use of sync points in test DBWALTest::LockWal. There was a bug during development of above changes that caused this test to fail sporadically, with and without this sync point change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11143 Test Plan: unit tests added / updated, added to stress/crash test Reviewed By: ajkr Differential Revision: D42848627 Pulled By: pdillinger fbshipit-source-id: 6d976c51791941a31fd8fbf28b0f82e888d9f4b4	2023-01-30 22:52:30 -08:00
Yu Zhang	24ac53d81a	Use user key on sst file for blob verification for Get and MultiGet (#11105 ) Summary: Use the user key on sst file for blob verification for `Get` and `MultiGet` instead of the user key passed from caller. Add tests for `Get` and `MultiGet` operations when user defined timestamp feature is enabled in a BlobDB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11105 Test Plan: make V=1 db_blob_basic_test ./db_blob_basic_test --gtest_filter="DBBlobTestWithTimestamp.*" Reviewed By: ltamasi Differential Revision: D42716487 Pulled By: jowlyzhang fbshipit-source-id: 5987ecbb7e56ddf46d2467a3649369390789506a	2023-01-30 10:21:21 -08:00
sdong	4720ba4391	Remove RocksDB LITE (#11147 ) Summary: We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support. Most of changes were done through following comments: unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE \| egrep '[.](cc\|h)'` by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147 Test Plan: See CI Reviewed By: pdillinger Differential Revision: D42796341 fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2	2023-01-27 13:14:19 -08:00
Yu Zhang	6943ff6e50	Remove deprecated util functions in options_util.h (#11126 ) Summary: Remove the util functions in options_util.h that have previously been marked deprecated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11126 Test Plan: `make check` Reviewed By: ltamasi Differential Revision: D42757496 Pulled By: jowlyzhang fbshipit-source-id: 2a138a3c207d0e0e0bbb4d99548cf2cadb44bcfb	2023-01-27 11:10:53 -08:00
Changyu Bi	c94c8fcbd4	Remove deprecated FileSystem::Load() (#11122 ) Summary: user should use FileSystem::CreateFromString() instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11122 Reviewed By: ajkr Differential Revision: D42727580 Pulled By: cbi42 fbshipit-source-id: c68b17bb82ba9dee46ba23b677d87ecf0a1e06c8	2023-01-26 20:20:58 -08:00
Karim TAAM	a1e92bd956	use verify checksum option in block based table reader Open() (#11099 ) Summary: ## Description In this issue https://github.com/facebook/rocksdb/issues/11002 we found that when we use rocksdb with the `verify checksum` read_option to false the verification is done anyway By analyzing the code along the stacktrace I saw that at the level of https://github.com/facebook/rocksdb/compare/main...matkt:feature/use-verify-checksum-in-block-based-table-reader?expand=1#diff-57ed8c49db2bdd4db7618646a177397674bbf25beacacecb104070071d30129f we are not keeping all the options and we forget the `verify_checksum` the comment in this class suggests that it should be managed https://github.com/facebook/rocksdb/compare/main...matkt:feature/use-verify-checksum-in-block-based-table-reader?expand=1#diff-57ed8c49db2bdd4db7618646a177397674bbf25beacacecb104070071d30129fL581 <img width="1724" alt="204511641-86ab4b9b-45e5-4a2b-a13d-81fa26435d38" src="https://user-images.githubusercontent.com/26581503/213152802-c46bc1c7-a3a2-4a6f-9bb1-bf92ee93af7a.png"> this PR just adds the line to manage the `verify checksum` ## Tests - Running unit tests - Test without setting `verify checksum` and verifying that we are calling the checksum code - Test by setting `verify checksum` to true and verifying that we are calling the checksum code - Test by setting `verify checksum` to false and verifying that we are not calling the checksum code Pull Request resolved: https://github.com/facebook/rocksdb/pull/11099 Reviewed By: cbi42 Differential Revision: D42679881 Pulled By: ajkr fbshipit-source-id: c7dd10768282fd0699f7e1bf397ceb7adbea4ab6	2023-01-26 17:38:59 -08:00
Levi Tamasi	a6cfdd4eda	Fix the HISTORY.md entry related to the removed statistics (#11140 ) Summary: Some histograms were incorrectly categorized as tickers. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11140 Reviewed By: anand1976 Differential Revision: D42780030 Pulled By: ltamasi fbshipit-source-id: 5aca8ec5baad8f73676aaa9d6cdbbd2a619c8a89	2023-01-26 10:38:45 -08:00
Levi Tamasi	6da2e20df3	Remove more obsolete statistics (#11131 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11131 Test Plan: `make check` Reviewed By: pdillinger Differential Revision: D42753997 Pulled By: ltamasi fbshipit-source-id: ce8b84c1e55374257e93ed74fd255c9b759723ce	2023-01-25 15:14:13 -08:00
Peter Dillinger	9afa0f05ad	Remove deprecated Env::LoadEnv() (#11121 ) Summary: Can use Env::CreateFromString() instead Pull Request resolved: https://github.com/facebook/rocksdb/pull/11121 Test Plan: unit tests updated Reviewed By: cbi42 Differential Revision: D42723813 Pulled By: pdillinger fbshipit-source-id: 5d4b5b10225dfdaf662f5f8049ee965a05d3edc9	2023-01-25 12:08:49 -08:00
Levi Tamasi	99e559533d	Remove some deprecated/obsolete statistics from the API (#11123 ) Summary: These tickers/histograms have been obsolete (and not populated) for a long time. The patch removes them from the API completely. Note that this means that the numeric values of the remaining tickers change in the C++ code as they get shifted up. This should be OK: the values of some existing tickers have changed many times over the years as items have been added in the middle. (In contrast, the convention in the Java bindings is to keep the ids, which are not guaranteed to be the same as the ids on the C++ side, the same across releases.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11123 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D42727793 Pulled By: ltamasi fbshipit-source-id: e058a155a20b05b45f53e67ee380aece1b43b6c5	2023-01-24 20:56:15 -08:00
sdong	2800aa069a	Remove compressed block cache (#11117 ) Summary: Compressed block cache is replaced by compressed secondary cache. Remove the feature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11117 Test Plan: See CI passes Reviewed By: pdillinger Differential Revision: D42700164 fbshipit-source-id: 6cbb24e460da29311150865f60ecb98637f9f67d	2023-01-24 17:09:19 -08:00
Hui Xiao	86fa2592be	Fix data race on `ColumnFamilyData::flush_reason` by letting FlushRequest/Job owns flush_reason instead of CFD (#11111 ) Summary: Context: Concurrent flushes on the same CF can set on `ColumnFamilyData::flush_reason` before each other flush finishes. An symptom is one CF has different flush_reason with others though all of them are in an atomic flush `db_stress: db/db_impl/db_impl_compaction_flush.cc:423: rocksdb::Status rocksdb::DBImpl::AtomicFlushMemTablesToOutputFiles(const rocksdb::autovector<rocksdb::DBImpl::BGFlushArg>&, bool, rocksdb::JobContext, rocksdb::LogBuffer, rocksdb::Env::Priority): Assertion cfd->GetFlushReason() == cfds[0]->GetFlushReason() failed. ` Summary:* Suggested by ltamasi, we now refactor and let FlushRequest/Job to own flush_reason as there is no good way to define `ColumnFamilyData::flush_reason` in face of concurrent flushes on the same CF (which wasn't the case a long time ago when `ColumnFamilyData::flush_reason ` first introduced`) Tets: - new unit test - make check - aggressive crash test rehearsal Pull Request resolved: https://github.com/facebook/rocksdb/pull/11111 Reviewed By: ajkr Differential Revision: D42644600 Pulled By: hx235 fbshipit-source-id: 8589c8184869d3415e5b780c887f877818a5ebaf	2023-01-24 09:54:04 -08:00
Hui Xiao	7e7548477c	Update HISTORY.md/version.h/format compatiblity test for 7.10 release (#11114 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11114 Reviewed By: ajkr Differential Revision: D42685234 Pulled By: hx235 fbshipit-source-id: 79908a66ab9052a2552f080049065462ebf2f94c	2023-01-23 13:26:11 -08:00
Andrew Kryczka	b7fbcefda8	Add API to limit blast radius of merge operator failure (#11092 ) Summary: Prior to this PR, `FullMergeV2()` can only return `false` to indicate failure, which causes any operation invoking it to fail. During a compaction, such a failure causes the compaction to fail and causes the DB to irreversibly enter read-only mode. Some users asked for a way to allow the merge operator to fail without such widespread damage. To limit the blast radius of merge operator failures, this PR introduces the `MergeOperationOutput::op_failure_scope` API. When unpopulated (`kDefault`) or set to `kTryMerge`, the merge operator failure handling is the same as before. When set to `kMustMerge`, merge operator failure still causes failure to operations that must merge (`Get()`, iterator, `MultiGet()`, etc.). However, under `kMustMerge`, flushes/compactions can survive merge operator failures by outputting the unmerged input operands. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11092 Reviewed By: siying Differential Revision: D42525673 Pulled By: ajkr fbshipit-source-id: 951dc3bf190f86347dccf3381be967565cda52ee	2023-01-20 14:40:30 -08:00
Peter Dillinger	fd911f9655	Upgrade xxhash.h to latest dev (#11098 ) Summary: Upgrading xxhash.h to latest dev version as of 1/17/2023, which is d7197ddea81364a539051f116ca77926100fc77f This should improve performance on some ARM machines. I allowed some of our RocksDB-specific changes to be made obsolete where it seemed appropriate, for example * xxhash.h has its own fallthrough marker (which I hope works for us) * As in https://github.com/Cyan4973/xxHash/pull/549 Merging and resolving conflicts one way or the other was all that went into this diff. Except I had to mix the two sides around `defined(__loongarch64)` How I did the upgrade (for future reference), so that I could use usual merge conflict resolution: ``` # New branch to help with merging git checkout -b xxh_merge_base # Check out RocksDB revision before last xxhash.h upgrade git reset --hard 22161b7547652af82a5dc67458de9ca8946ac83d^ # Create a commit with the raw base version from xxHash repo (from xxHash repo) git show 2c611a76f914828bed675f0f342d6c4199ffee1e:xxhash.h > ../rocksdb/util/xxhash.h # In RocksDB repo git commit -a # Merge in the last xxhash.h upgrade git merge `22161b7547` # Resolve conflict using committed version git show 22161b7547652af82a5dc67458de9ca8946ac83d:util/xxhash.h > util/xxhash.h git commit -a # Catch up to upstream git merge upstream/main # Create a different branch for applying raw upgrade git checkout -b xxh_upgrade_2023 # Find the RocksDB commit we made for the raw base version from xxHash git log main..HEAD # Rewind to it git reset --hard `2428b727a9` # Copy in latest raw version (from xxHash repo) cat xxhash.h > ../rocksdb/util/xxhash.h # Merge in RocksDB changes, use typical tools for conflict resolution git merge xxh_merge_base ``` Branch https://github.com/facebook/rocksdb/tree/xxhash_merge_base can be used as a base for future xxhash merges. Fixes https://github.com/facebook/rocksdb/issues/11073 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11098 Test Plan: existing tests (e.g. Bloom filter schema stability tests) Also seems to include a small performance boost on my Intel dev machine, using `./db_bench --benchmarks=xxh3[-X50] 2>&1 \| egrep -o 'operations;.*' \| sort` Fastest out of 50 runs, before: 15477.3 MB/s Fastest out of 50 runs, after: 15850.7 MB/s, and 11 more runs faster than the "before" number Slowest out of 50 runs, before: 12267.5 MB/s Slowest out of 50 runs, after: 13897.1 MB/s More repetitions show the distinction is repeatable Reviewed By: hx235 Differential Revision: D42560010 Pulled By: pdillinger fbshipit-source-id: c43ee52f1c5fe0ba3d6d6e4eebb22ded5f5492ea	2023-01-19 12:07:50 -08:00
Changyu Bi	4d0f9a995c	Consider TTL compaction file cutting earlier to prevent small output file (#11075 ) Summary: in `CompactionOutputs::ShouldStopBefore()`, TTL-related states, `cur_files_to_cut_for_ttl_` and `next_files_to_cut_for_ttl_`, are not updated if the function returns early. This can cause unnecessary compaction output file cuttings and hence produce smaller output files, which may hurt write amp. See the example in the unit test for how this "unnecessary file cutting" can happen. This PR fixes this issue by moving the code for updating TTL states earlier in `CompactionOutputs::ShouldStopBefore()` so that the states are updated for each key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11075 Test Plan: - Added new unit test. Reviewed By: hx235 Differential Revision: D42398739 Pulled By: cbi42 fbshipit-source-id: 09fab66679c1a734abcfc31bcea33dd9aeb9dbc7	2023-01-17 16:42:41 -08:00
Changyu Bi	f515d9d203	Revert #10802 Consider range tombstone in compaction output file cutting (#11089 ) Summary: This reverts commit `f02c708aa3` since it introduced several bugs (see https://github.com/facebook/rocksdb/issues/11078 and https://github.com/facebook/rocksdb/issues/11067 for attempts to fix them) and that I do not have a high confidence to fix all of them and ensure no further ones before the next release branch cut. There are also come existing issue found during bug fixing. We will work on it and try to merge it to the release after. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11089 Test Plan: existing CI. Reviewed By: ajkr Differential Revision: D42505972 Pulled By: cbi42 fbshipit-source-id: 2f66dcde6b85dc94977b317c2ce513872cfbc153	2023-01-13 12:28:21 -08:00
Peter Dillinger	9f7801c5f1	Major Cache refactoring, CPU efficiency improvement (#10975 ) Summary: This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache). The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below. * static_cast lines of code +29 -35 (net removed 6) * reinterpret_cast lines of code +6 -32 (net removed 26) ## cache.h and secondary_cache.h * Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications: * Simpler for implementations to deal with just one Insert and one Lookup. * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428. * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks). * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below). * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc. * Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation. * Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char` for serialized byte data all over RocksDB, with many advantages over `void`. `memcpy` etc. are legacy APIs that should not be mimicked.) * Add a type alias Cache::ObjectPtr = void, so that we can better indicate the intent of the void when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.) * Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774) * Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object. * Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change. ## typed_cache.h Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae). The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used. * PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value. * BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter. * FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue. * For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`. These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.) Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it. ## block_cache.h This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table. ## block_based_table_reader.cc Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation. The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions. ## block_based_table_builder.cc, cache_dump_load_impl.cc Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.) ## Everything else Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975 Test Plan: tests updated Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache): 34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844 34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594 34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297 34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523 34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602 34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293 34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926 34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488 233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984 233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922 233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559 233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93 233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418 233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273 233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691 233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82 1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55 1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02 1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45 1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24 1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92 1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78 1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36 1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83 Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn. Reviewed By: anand1976 Differential Revision: D42417818 Pulled By: pdillinger fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432	2023-01-11 14:20:40 -08:00
Hui Xiao	9502856edd	Add missing range conflict check between file ingestion and RefitLevel() (#10988 ) Summary: Context: File ingestion never checks whether the key range it acts on overlaps with an ongoing RefitLevel() (used in `CompactRange()` with `change_level=true`). That's because RefitLevel() doesn't register and make its key range known to file ingestion. Though it checks overlapping with other compactions by https://github.com/facebook/rocksdb/blob/7.8.fb/db/external_sst_file_ingestion_job.cc#L998. RefitLevel() (used in `CompactRange()` with `change_level=true`) doesn't check whether the key range it acts on overlaps with an ongoing file ingestion. That's because file ingestion does not register and make its key range known to other compactions. - Note that non-refitlevel-compaction (e.g, manual compaction w/o RefitLevel() or general compaction) also does not check key range overlap with ongoing file ingestion for the same reason. - But it's fine. Credited to cbi42's discovery, `WaitForIngestFile` was called by background and foreground compactions. They were introduced in `0f88160f67`, `5c64fb67d2` and `87dfc1d23e`. - Regardless, this PR registers file ingestion like a compaction is a general approach that will also add range conflict check between file ingestion and non-refitlevel-compaction, though it has not been the issue motivated this PR. Above are bugs resulting in two bad consequences: - If file ingestion and RefitLevel() creates files in the same level, then range-overlapped files will be created at that level and caught as corruption by `force_consistency_checks=true` - If file ingestion and RefitLevel() creates file in different levels, then with one further compaction on the ingested file, it can result in two same keys both with seqno 0 in two different levels. Then with iterator's [optimization](`c62f322169/db/db_iter.cc (L342-L343)`) that assumes no two same keys both with seqno 0, it will either break this assertion in debug build or, even worst, return value of this same key for the key after it, which is the wrong value to return, in release build. Therefore we decide to introduce range conflict check for file ingestion and RefitLevel() inspired from the existing range conflict check among compactions. Summary: - Treat file ingestion job and RefitLevel() as `Compaction` of new compaction reasons: `CompactionReason::kExternalSstIngestion` and `CompactionReason::kRefitLevel` and register/unregister them. File ingestion is treated as compaction from L0 to different levels and RefitLevel() as compaction from source level to target level. - Check for `RangeOverlapWithCompaction` with other ongoing compactions, `RegisterCompaction()` on this "compaction" before changing the LSM state in `VersionStorageInfo`, and `UnregisterCompaction()` after changing. - Replace scattered fixes (`0f88160f67`, `5c64fb67d2` and `87dfc1d23e`.) that prevents overlapping between file ingestion and non-refit-level compaction with this fix cuz those practices are easy to overlook. - Misc: logic cleanup, see PR comments Pull Request resolved: https://github.com/facebook/rocksdb/pull/10988 Test Plan: - New unit test `DBCompactionTestWithOngoingFileIngestionParam*` that failed pre-fix and passed afterwards. - Made compatible with existing tests, see PR comments - make check - [Ongoing] Stress test rehearsal with normal value and aggressive CI value https://github.com/facebook/rocksdb/pull/10761 Reviewed By: cbi42 Differential Revision: D41535685 Pulled By: hx235 fbshipit-source-id: 549833a577ba1496d20a870583d4caa737da1258	2022-12-29 15:05:36 -08:00
Peter Dillinger	02f2b20864	Add BackupEngine feature to exclude files (#11030 ) Summary: We have a request for RocksDB to essentially support disconnected incremental backup. In other words, if there is limited or no connectivity to the primary backup dir, we should still be able to take an incremental backup relative to that primary backup dir, assuming some metadata about that primary dir is available (and obviously anticipating primary backup dir will be fully available if restore is needed). To support that, this feature allows the API user to "exclude" DB files from backup. This only applies to files that can be shared between backups (sst and blob files), and excluded files are tracked in the backup metadata sufficiently to ensure they are restored at restore time. At restore time, the user provides a set of alternate backup directories (as open BackupEngines, which can be read-only), and excluded files must be found in one of the backup directories ("included" in some backup). This feature depends on backup schema version 2 features, though schema version 2.0 support is not sufficient to read / restore a backup with exclusions. This change updates the schema version to 2.1 because of this feature, so that it's easy to recognize whether a RocksDB release supports this feature, while backups not using the feature are fully compatible with 2.0. Also in this PR: * Stacked on https://github.com/facebook/rocksdb/pull/11029 * Allow progress_callback to be empty, not just no-op function, and recover from exceptions thrown by BackupEngine callbacks. * The internal-only `AsBackupEngine()` function is working around the diamond hierarchy of `BackupEngineImplThreadSafe` to get to the internals, without using confusing features like virtual inheritance. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11030 Test Plan: unit tests added / updated Reviewed By: ajkr Differential Revision: D42004388 Pulled By: pdillinger fbshipit-source-id: 31b6e533d308a5462e528d9012d650482d974077	2022-12-29 10:42:50 -08:00
anand76	bec4264813	Avoid mixing sync and async prefetch (#11050 ) Summary: Reading uncompression dict block always uses sync reads, while data blocks may use async reads and prefetching. This causes problems in FilePrefetchBuffer. So avoid mixing the two by reading the uncompression dict straight from the file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11050 Test Plan: Crash test Reviewed By: akankshamahajan15 Differential Revision: D42194682 Pulled By: anand1976 fbshipit-source-id: aaa8b396fdfe966b157e210f5ef8501c45b7b69e	2022-12-21 22:42:19 -08:00
Peter Dillinger	e6b6e74154	Make CompactRange() more aware of SstPartitionerFactory (#11032 ) Summary: Some users are at least considering using SstPartitioner to support efficient physical migration of specific key ranges between RocksDB instances. One might expect manual `CompactRange()` over a narrow key range across some partition to enforce partitioning of any SST files crossing that partition boundary, but that currently only works if there are keys within that range. This change makes the overlap logic in CompactRange more aware of the partitioner to automatically select relevant files crossing a partition boundary, even when they otherwise would not be selected due to the compaction range falling in a gap between entries. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11032 Test Plan: unit test included Reviewed By: hx235 Differential Revision: D41981380 Pulled By: pdillinger fbshipit-source-id: 2fe445bdddc73c00276c20f295cc1fa33d15b05a	2022-12-21 15:41:10 -08:00
anand76	dbf37c290a	Fix async prefetch heap use after free (#11049 ) Summary: This PR fixes a heap use after free bug in the async prefetch code that happens in the following scenario - 1. Scan thread starts 2 async reads for Seek, one for the seek block and one for prefetching 2. Before the first read in https://github.com/facebook/rocksdb/issues/1 completes, another thread reads and loads the block in cache 3. The first scan thread finds the block in cache, continues and the next block cache miss is for a block that spans the boundary of the 2 prefetch buffers, and the 1st read is complete but the 2nd one is not complete yet 4. The scan thread will reallocate (i.e free the old buffer and allocate a new one) the 2nd prefetch buffer, and the in-progress prefetch is orphaned 5. The orphaned prefetch finally completes, resulting in a use after free Also add a few asserts to surface bugs earlier in the crash tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11049 Test Plan: Repro with db_stress and verify the fix Reviewed By: akankshamahajan15 Differential Revision: D42181118 Pulled By: anand1976 fbshipit-source-id: 1ac55d2f64a89ce128c1c574262b8aa7d82eb8cc	2022-12-21 09:15:53 -08:00
Changyu Bi	f02c708aa3	Consider range tombstone in compaction output file cutting (#10802 ) Summary: This PR is the first step for Issue https://github.com/facebook/rocksdb/issues/4811. Currently compaction output files are cut at point keys, and the decision is made mainly in `CompactionOutputs::ShouldStopBefore()`. This makes it possible for range tombstones to cause large compactions that does not respect `max_compaction_bytes`. For example, we can have a large range tombstone that overlaps with too many files from the next level. Another example is when there is a gap between a range tombstone and another key. The first issue may be more acceptable, as a lot of data is deleted. This PR address the second issue by calling `ShouldStopBefore()` for range tombstone start keys. The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new `CompactionMergingIterator` is introduced and only used under `CompactionIterator` for this purpose. Further improvement after this PR include 1) cut compaction output at some grandparent boundary key instead of at the next point key or range tombstone start key and 2) cut compaction output file within a large range tombstone (it may be easier and reasonable to only do it for range tombstones at the end of a compaction output). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10802 Test Plan: - added unit tests in db_range_del_test. - stress test: `python3 tools/db_crashtest.py whitebox --[simple\|enable_ts] --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=2 --writepercent=58 --readpercen=21 --duration=36000 --range_deletion_width=1000000` Reviewed By: ajkr, jay-zhuang Differential Revision: D40308827 Pulled By: cbi42 fbshipit-source-id: a8fd6f70a3f09d0ef7a40e006f6c964bba8c00df	2022-12-15 09:11:54 -08:00
Yanqin Jin	c93ba7db5d	Revise LockWAL/UnlockWAL implementation (#11020 ) Summary: RocksDB has two public APIs: `DB::LockWAL()`/`DB::UnlockWAL()`. The current implementation acquires and releases the internal `DBImpl::log_write_mutex_`. According to the comment on `DBImpl::log_write_mutex_`: https://github.com/facebook/rocksdb/blob/7.8.fb/db/db_impl/db_impl.h#L2287:L2288 > Note: to avoid dealock, if needed to acquire both log_write_mutex_ and mutex_, the order should be first mutex_ and then log_write_mutex_. This puts limitations on how applications can use the `LockWAL()` API. After `LockWAL()` returns ok, then application should not perform any operation that acquires `mutex_`. Currently, the use case of `LockWAL()` is MyRocks implementing the MySQL storage engine handlerton `lock_hton_log` interface. The operation that MyRocks performs after `LockWAL()` is `GetSortedWalFiless()` which not only acquires mutex_, but also `log_write_mutex_`. There are two issues: 1. Applications using these two APIs may hang if one thread calls `GetSortedWalFiles()` after calling `LockWAL()` because log_write_mutex is not recursive. 2. Two threads may dead lock due to lock order inversion. To fix these issues, we can modify the implementation of LockWAL so that it does not keep `log_write_mutex_` held until UnlockWAL. To achieve the goal of locking the WAL, we can instead manually inject a write stall so that all future writes will be stopped. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11020 Test Plan: make check Reviewed By: ajkr Differential Revision: D41785203 Pulled By: riversand963 fbshipit-source-id: 5ccb7a9c6eb9a2c3fa80fd2c399cc2568b8f89ce	2022-12-13 21:45:00 -08:00
Hui Xiao	98d5db5c2e	Sort L0 files by newly introduced epoch_num (#10922 ) Summary: Context: Sorting L0 files by `largest_seqno` has at least two inconvenience: - File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap. - For example, consider the following sequence of events ("key@n" indicates key at seqno "n") - insert k1@1 to memtable m1 - ingest file s1 with k2@2, ingest file s2 with k3@3 - insert k4@4 to m1 - compact files s1, s2 and result in new file s3 of seqno range [2, 3] - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1 - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption. - Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example) - an existing SST s1 contains only k1@1 - insert k1@2 to memtable m1 - ingest file s2 with k3@3, ingest file s3 with k4@4 - insert single delete k5@5 in m1 - flush m1 and result in new file s4 of seqno range [2, 5] - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4] - compact s4 and result in new file s6 of seqno range [2] due to single delete - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno` Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways: - In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption` for more. - In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption. Summary:* - Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`. - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`) - Compaction output file is assigned with the minimum `epoch_number` among input files' - Refit level: reuse refitted file's epoch_number - Other paths needing `epoch_number` treatment: - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo` - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`. - Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair). - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder. - Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag` - Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`. - Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR. - Misc: - update existing tests with `epoch_number` so make check will pass - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber() Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922 Test Plan: - `make check` - New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc` - Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930 - [Ongoing] Compatibility test: manually run `36a5686ec0` (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox` - [Ongoing] normal db stress test - [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 Reviewed By: ajkr Differential Revision: D41063187 Pulled By: hx235 fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9	2022-12-13 13:29:37 -08:00
Peter Dillinger	9b34c097a1	Fix bug updating latest backup on delete (#11029 ) Summary: Previously, the "latest" valid backup would not be updated on delete. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11029 Test Plan: unit test included (added to an existing test for efficiency) Reviewed By: hx235 Differential Revision: D41967925 Pulled By: pdillinger fbshipit-source-id: ca143354d281eb979557ea421902cd26803a1137	2022-12-13 09:42:34 -08:00
Hui Xiao	5f52dbc505	Move two history entries mistake out of 7.9 section (#11013 ) Summary: https://github.com/facebook/rocksdb/pull/10892 and https://github.com/facebook/rocksdb/pull/10955 mistakenly added two entries under sealed 7.9.history section. This PR fixes these two. No need to update 7.9 branch (https://github.com/facebook/rocksdb/blob/7.9.fb/HISTORY.md) cuz it's cut before these two PRs landed Pull Request resolved: https://github.com/facebook/rocksdb/pull/11013 Reviewed By: cbi42 Differential Revision: D41666514 Pulled By: hx235 fbshipit-source-id: c4bc7a29ff663664bf0be1ba1c7eab4d00a61528	2022-12-06 18:33:35 -08:00
anand76	8ffabdc226	Fix table cache leak in MultiGet with async_io (#10997 ) Summary: When MultiGet with the async_io option encounters an IO error in TableCache::GetTableReader, it may result in leakage of table cache handles due to queued coroutines being abandoned. This PR fixes it by ensuring any queued coroutines are run before aborting the MultiGet. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10997 Test Plan: 1. New unit test in db_basic_test 2. asan_crash Reviewed By: pdillinger Differential Revision: D41587244 Pulled By: anand1976 fbshipit-source-id: 900920cd3fba47cb0fc744a62facc5ffe2eccb64	2022-12-04 22:58:25 -08:00
Hui Xiao	2f76ab150d	Fix missing WAL in new manifest by rolling over the WAL deletion record from prev manifest (#10892 ) Summary: Context `Options::track_and_verify_wals_in_manifest = true` verifies each of the WALs tracked in manifest indeed presents in the WAL folder. If not, a corruption "Missing WAL with log number" will be thrown. `DB::SyncWAL()` called at a specific timing (i.e, at the `TEST_SYNC_POINT("FindObsoleteFiles::PostMutexUnlock")`) can record in a new manifest the WAL addition of a WAL file that already had a WAL deletion recorded in the previous manifest. And the WAL deletion record is not rollover-ed to the new manifest. So the new manifest creates the illusion of such WAL never gets deleted and should presents at db re/open. - Such WAL deletion record can be caused by flushing the memtable associated with that WAL and such WAL deletion can actually happen in` PurgeObsoleteFiles()`. As a consequence, upon `DB::Reopen()`, this WAL file can be deleted while manifest still has its WAL addition record , which causes a false alarm of corruption "Missing WAL with log number" to be thrown. Summary This PR fixes this false alarm by rolling over the WAL deletion record from prev manifest to the new manifest by adding the WAL deletion record to the new manifest. Test - Make check - Added new unit test `TEST_F(DBWALTest, FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL)` that failed before the fix and passed after - [Ongoing]CI stress test + aggressive value as in https://github.com/facebook/rocksdb/pull/10761 , which is how this false alarm was first surfaced, to confirm such false alarm disappears - [Ongoing]Regular CI stress test to confirm such fix didn't harm anything Pull Request resolved: https://github.com/facebook/rocksdb/pull/10892 Reviewed By: ajkr Differential Revision: D40778965 Pulled By: hx235 fbshipit-source-id: a512364bfdeb0b1a55c171890e60d856c528f37f	2022-11-29 14:14:43 -08:00
Hui Xiao	f1574a20ff	Revert PR 10777 "Fix FIFO causing overlapping seqnos in L0 files due to overla…" (#10999 ) Summary: Context/Summary: This reverts commit `fc74abb436` and related HISTORY record. The issue with PR 10777 or general approach using earliest_mem_seqno like https://github.com/facebook/rocksdb/pull/5958#issue-511150930 is that the earliest seqno of memtable of each CFs does not get persisted and will always start with 0 upon Recover(). Later when creating a new memtable in certain CF, we use the last seqno of the whole DB (but not of that CF from previous DB session) for this CF. This will lead to false positive overlapping seqno and PR 10777 will throw something like https://github.com/facebook/rocksdb/blob/main/db/compaction/compaction_picker.cc#L1002-L1004 Luckily a more elegant and complete solution to the overlapping seqno problem these PR aim to solve does not have above problem, see https://github.com/facebook/rocksdb/pull/10922. It is already being pursued and in the process of review. So we can just revert this PR and focus on getting PR10922 to land. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10999 Test Plan: make check Reviewed By: anand1976 Differential Revision: D41572604 Pulled By: hx235 fbshipit-source-id: 9d9bdf594abd235e2137045cef513ca0b14e0a3a	2022-11-29 10:56:42 -08:00
Hui Xiao	d8c043f7ad	Trigger FIFO file deletion in non L0 only if exceeding max_table_files_size (#10955 ) Summary: Context https://github.com/facebook/rocksdb/pull/10348 allows multi-level FIFO but accidentally made change to the logic of deleting files in `FIFOCompactionPicker::PickSizeCompaction`. With [this](https://github.com/facebook/rocksdb/pull/10348/files#diff-d8fb3d50749aa69b378de447e3d9cf2f48abe0281437f010b5d61365a7b813fdR156) and [this](https://github.com/facebook/rocksdb/pull/10348/files#diff-d8fb3d50749aa69b378de447e3d9cf2f48abe0281437f010b5d61365a7b813fdR235) together, it deletes one file in non-L0 even when `total_size <= mutable_cf_options.compaction_options_fifo.max_table_files_size`, which is incorrect. As a consequence, FIFO exercises more file deletion in our crash testing, which is not able to verify correctly on deleted keys in the file deleted by compaction. This results in errors `error : inconsistent values for key 000000000000239F000000000000012B000000000000028B: expected state has the key, Get() returns NotFound. Verification failed :(` or `Expected state has key 00000000000023A90000000000000003787878, iterator is at key 00000000000023A9000000000000004178 Column family: default, op_logs: S 00000000000023A90000000000000003787878` Summary: - Delete file for non-L0 only if `total_size <= mutable_cf_options.compaction_options_fifo.max_table_files_size` - Add some helpful log to LOG file Pull Request resolved: https://github.com/facebook/rocksdb/pull/10955 Test Plan: - Errors repro-ed by ``` ./db_stress --preserve_unverified_changes=1 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --async_io=0 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=10 --bottommost_compression_type=none --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_style=2 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8589934591 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=1048576 --delpercent=0 --delrangepercent=0 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=2 --ingest_external_file_one_in=1000000 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --log2_keys_per_lock=10 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=0 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=40000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=7 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=0 --readpercent=65 --recycle_log_file_num=1 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=1 --use_full_merge_v1=1 --use_merge=0 --use_multiget=0 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_iterator_with_expected_state_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=20 ``` is gone after this fix - CI Reviewed By: ajkr Differential Revision: D41319441 Pulled By: hx235 fbshipit-source-id: 6939753767007f7449ea7055b1420aabd03d7709	2022-11-28 15:45:03 -08:00
Changyu Bi	534fb06dd3	Prevent iterating over range tombstones beyond `iterate_upper_bound` (#10966 ) Summary: Currently, `iterate_upper_bound` is not checked for range tombstone keys in MergingIterator. This may impact performance when there is a large number of range tombstones right after `iterate_upper_bound`. This PR fixes this issue by checking `iterate_upper_bound` in MergingIterator for range tombstone keys. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10966 Test Plan: - added unit test - stress test: `python3 tools/db_crashtest.py whitebox --simple --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=18 --writepercent=48 --readpercen=15 --duration=36000 --range_deletion_width=100` - ran different stress tests over sandcastle - Falcon team ran some test traffic and saw reduced CPU usage on processing range tombstones. Reviewed By: ajkr Differential Revision: D41414172 Pulled By: cbi42 fbshipit-source-id: 9b2c29eb3abb99327c6a649bdc412e70d863f981	2022-11-23 14:27:14 -08:00
Yanqin Jin	3d0d6b8140	Make best-efforts recovery verify SST unique ID before Version construction (#10962 ) Summary: The check for SST unique IDs added to best-efforts recovery (`Options::best_efforts_recovery` is true). With best_efforts_recovery being true, RocksDB will recover to the latest point in MANIFEST such that all valid SST files included up to this point pass unique ID checks as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10962 Test Plan: make check Reviewed By: pdillinger Differential Revision: D41378241 Pulled By: riversand963 fbshipit-source-id: a036064e2c17dec13d080a24ef2a9f85d607b16c	2022-11-22 22:53:31 -08:00
anand76	f4cfcfe824	Post 7.9.0 release branch cut updates (#10974 ) Summary: Update HISTORY.md, version.h, and check_format_compatible.sh Pull Request resolved: https://github.com/facebook/rocksdb/pull/10974 Reviewed By: akankshamahajan15 Differential Revision: D41455289 Pulled By: anand1976 fbshipit-source-id: 99888ebcb9109e5ced80584a66b20123f8783c0b	2022-11-21 19:24:42 -08:00
anand76	3ff6da6bd5	Update HISTORY.md for 7.9.0 (#10973 ) Summary: Update HISTORY.md for 7.9.0 release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10973 Reviewed By: pdillinger Differential Revision: D41453720 Pulled By: anand1976 fbshipit-source-id: 47a23d4b6539ec6a9a09c9e69c026f7c8b10afa7	2022-11-21 17:00:01 -08:00
Peter Dillinger	e079d562af	Add a SecondaryCache::InsertSaved() API, use in CacheDumper impl (#10945 ) Summary: Can simplify some ugly code in cache_dump_load_impl.cc by having an API in SecondaryCache that can directly consume persisted data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10945 Test Plan: existing tests for CacheDumper, added basic unit test Reviewed By: anand1976 Differential Revision: D41231497 Pulled By: pdillinger fbshipit-source-id: b8ec993ef7d3e7efd68aae8602fd3f858da58068	2022-11-21 16:17:36 -08:00
Andrew Kryczka	097f9f4425	Fix CompactionIterator flag for penultimate level output (#10967 ) Summary: We were not resetting it in non-debug mode so it could be true once and then stay true for future keys where it should be false. This PR adds the reset logic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10967 Test Plan: - built `db_bench` with DEBUG_LEVEL=0 - ran benchmark: `TEST_TMPDIR=/dev/shm/prefix ./db_bench -benchmarks=fillrandom -compaction_style=1 -preserve_internal_time_seconds=100 -preclude_last_level_data_seconds=10 -write_buffer_size=1048576 -target_file_size_base=1048576 -subcompactions=8 -duration=120` - compared "output_to_penultimate_level: X bytes + last: Y bytes" lines in LOG output - Before this fix, Y was always zero - After this fix, Y gradually increased throughout the benchmark Reviewed By: riversand963 Differential Revision: D41417726 Pulled By: ajkr fbshipit-source-id: ace1e9a289e751a5b0c2fbaa8addd4eda5525329	2022-11-21 16:14:03 -08:00
Peter Dillinger	3182beeffc	Observe and warn about misconfigured HyperClockCache (#10965 ) Summary: Background. One of the core risks of chosing HyperClockCache is ending up with degraded performance if estimated_entry_charge is very significantly wrong. Too low leads to under-utilized hash table, which wastes a bit of (tracked) memory and likely increases access times due to larger working set size (more TLB misses). Too high leads to fully populated hash table (at some limit with reasonable lookup performance) and not being able to cache as many objects as the memory limit would allow. In either case, performance degradation is graceful/continuous but can be quite significant. For example, cutting block size in half without updating estimated_entry_charge could lead to a large portion of configured block cache memory (up to roughly 1/3) going unused. Fix. This change adds a mechanism through which the DB periodically probes the block cache(s) for "problems" to report, and adds diagnostics to the HyperClockCache for bad estimated_entry_charge. The periodic probing is currently done with DumpStats / stats_dump_period_sec, and diagnostics reported to info_log (normally LOG file). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10965 Test Plan: unit test included. Doesn't cover all the implemented subtleties of reporting, but ensures basics of when to report or not. Also manual testing with db_bench. Create db with ``` ./db_bench --benchmarks=fillrandom,flush --num=3000000 --disable_wal=1 ``` Use and check LOG file for HyperClockCache for various block sizes (used as estimated_entry_charge) ``` ./db_bench --use_existing_db --benchmarks=readrandom --num=3000000 --duration=20 --stats_dump_period_sec=8 --cache_type=hyper_clock_cache -block_size=XXXX ``` Seeing warnings / errors or not as expected. Reviewed By: anand1976 Differential Revision: D41406932 Pulled By: pdillinger fbshipit-source-id: 4ca56162b73017e4b9cec2cad74466f49c27a0a7	2022-11-21 12:08:21 -08:00
Peter Dillinger	8c0f5b1fcf	Mark HyperClockCache as production-ready (#10963 ) Summary: After a couple minor bug fixes and successful productions roll-outs in a few places, I think we can mark this as production-ready. It has a clear value proposition for many workloads, even if we don't have clear advice for every workload yet. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10963 Test Plan: existing tests, comment changes only Reviewed By: siying Differential Revision: D41384083 Pulled By: pdillinger fbshipit-source-id: 56359f01a57bb28de8697666b342382fac72ce6d	2022-11-17 14:44:59 -08:00
Levi Tamasi	8fa8780932	Mention wide-column support in HISTORY.md (#10959 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10959 Reviewed By: akankshamahajan15 Differential Revision: D41348198 Pulled By: ltamasi fbshipit-source-id: 51e89d03c1fe87f576a766f609a7f233a519c83d	2022-11-16 12:22:35 -08:00
anand76	ecba6a320e	Add some async read stats (#10947 ) Summary: Add stats for time spent in the ReadAsync call, and async read errors. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10947 Test Plan: Run db_bench and look at stats Reviewed By: akankshamahajan15 Differential Revision: D41236637 Pulled By: anand1976 fbshipit-source-id: 70539b69a28491d57acead449436a761f7108acf	2022-11-13 21:38:35 -08:00
Peter Dillinger	f321e8fc98	Don't attempt to use SecondaryCache on block_cache_compressed (#10944 ) Summary: Compressed block cache depends on reading the block compression marker beyond the payload block size. Only the payload bytes were being saved and loaded from SecondaryCache -> boom! This removes some unnecessary code attempting to combine these two competing features. Note that BlockContents was previously used for block-based filter in block cache, but that support has been removed. Also marking block_cache_compressed as deprecated in this commit as we expect it to be replaced with SecondaryCache. This problem was discovered during refactoring but didn't want to combine bug fix with that refactoring. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10944 Test Plan: test added that fails on base revision (at least with ASAN) Reviewed By: akankshamahajan15 Differential Revision: D41205578 Pulled By: pdillinger fbshipit-source-id: 1b29d36c7a6552355ac6511fcdc67038ef4af29f	2022-11-11 17:35:53 -08:00
akankshamahajan	d1aca4a5ae	Fix async_io regression in scans (#10939 ) Summary: Fix async_io regression in scans due to incorrect check which was causing the valid data in buffer to be cleared during seek. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10939 Test Plan: - stress tests export CRASH_TEST_EXT_ARGS="--async_io=1" make crash_test -j32 - Ran db_bench command which was caught the regression: ./db_bench --db=/rocksdb_async_io_testing/prefix_scan --disable_wal=1 --use_existing_db=true --benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=963 -duration=30 -ops_between_duration_checks=1 --async_io=true --compaction_readahead_size=4194304 --log_readahead_size=0 --blob_compaction_readahead_size=0 --initial_auto_readahead_size=65536 --num_file_reads_for_auto_readahead=0 --max_auto_readahead_size=524288 seekrandom : 3777.415 micros/op 264 ops/sec 30.000 seconds 7942 operations; 132.3 MB/s (7942 of 7942 found) Reviewed By: anand1976 Differential Revision: D41173899 Pulled By: akankshamahajan15 fbshipit-source-id: 2d75b06457d65b1851c92382565d9c3fac329dfe	2022-11-11 13:34:49 -08:00
Levi Tamasi	fbd9077d66	Fix a bug where GetContext does not update READ_NUM_MERGE_OPERANDS (#10925 ) Summary: The patch fixes a bug where `GetContext::Merge` (and `MergeEntity`) does not update the ticker `READ_NUM_MERGE_OPERANDS` because it implicitly uses the default parameter value of `update_num_ops_stats=false` when calling `MergeHelper::TimedFullMerge`. Also, to prevent such issues going forward, the PR removes the default parameter values from the `TimedFullMerge` methods. In addition, it removes an unused/unnecessary parameter from `TimedFullMergeWithEntity`, and does some cleanup at the call sites of these methods. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10925 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D41096453 Pulled By: ltamasi fbshipit-source-id: fc60646d32b4d516b8fe81e265c3f020a32fd7f8	2022-11-07 15:42:10 -08:00
Andrew Kryczka	aa0a11e1b9	Fix flush picking non-consecutive memtables (#10921 ) Summary: Prevents `MemTableList::PickMemtablesToFlush()` from picking non-consecutive memtables. It leads to wrong ordering in L0 if the files are committed, or an error like below if force_consistency_checks=true catches it: ``` Corruption: force_consistency_checks: VersionBuilder: L0 file https://github.com/facebook/rocksdb/issues/25 with seqno 320416 368066 vs. file https://github.com/facebook/rocksdb/issues/24 with seqno 336037 352068 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10921 Test Plan: fix the expectation in the existing test of this behavior Reviewed By: riversand963 Differential Revision: D41046935 Pulled By: ajkr fbshipit-source-id: 783696bff56115063d5dc5856dfaed6a9881d1ab	2022-11-04 15:55:54 -07:00
akankshamahajan	ff9ad2c39b	Fix async_io failures in case there is error in reading data (#10890 ) Summary: Fix memory corruption error in scans if async_io is enabled. Memory corruption happened if data is overlapping between two buffers. If there is IOError while reading the data, it leads to empty buffer and other buffer already in progress of async read goes again for reading causing the error. Fix: Added check to abort IO in second buffer if curr_ got empty. This PR also fixes db_stress failures which happened when buffers are not aligned. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10890 Test Plan: - Ran make crash_test -j32 with async_io enabled. - Ran benchmarks to make sure there is no regression. Reviewed By: anand1976 Differential Revision: D40881731 Pulled By: akankshamahajan15 fbshipit-source-id: 39fcf2134c7b1bbb08415ede3e1ef261ac2dbc58	2022-11-01 16:06:51 -07:00
Yanqin Jin	7d26e4c5a3	Basic Support for Merge with user-defined timestamp (#10819 ) Summary: This PR implements the originally disabled `Merge()` APIs when user-defined timestamp is enabled. Simplest usage: ```cpp // assume string append merge op is used with '.' as delimiter. // ts1 < ts2 db->Put(WriteOptions(), "key", ts1, "v0"); db->Merge(WriteOptions(), "key", ts2, "1"); ReadOptions ro; ro.timestamp = &ts2; db->Get(ro, "key", &value); ASSERT_EQ("v0.1", value); ``` Some code comments are added for clarity. Note: support for timestamp in `DB::GetMergeOperands()` will be done in a follow-up PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10819 Test Plan: make check Reviewed By: ltamasi Differential Revision: D40603195 Pulled By: riversand963 fbshipit-source-id: f96d6f183258f3392d80377025529f7660503013	2022-10-31 22:28:58 -07:00
Hui Xiao	7f5e438aee	Move move wrong history entry out of 7.8 release (#10898 ) Summary: Context/Summary: https://github.com/facebook/rocksdb/pull/10777 mistakenly added a history entry under 7.8 release but the PR is not included in 7.8. This mistake was due to rebase and merge didn't realize it was a conflict when "## Unreleased" was changed to "## 7.8.0 (10/22/2022)". Pull Request resolved: https://github.com/facebook/rocksdb/pull/10898 Test Plan: Make check Reviewed By: akankshamahajan15 Differential Revision: D40861001 Pulled By: hx235 fbshipit-source-id: b2310c95490f6ebb90834a210c965a74c9560b51	2022-10-31 15:02:29 -07:00
sdong	d989300ad1	Avoid repeat periodic stats printing when there is no change (#10891 ) Summary: When there is a column family that doesn't get any traffic, its stats are still dumped when options.options.stats_dump_period_sec triggers. This sometimes spam the information logs. With this change, we skip the printing if there is not change, until 8 periods. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10891 Test Plan: Manually test the behavior with hacked db_bench setups. Reviewed By: jay-zhuang Differential Revision: D40777183 fbshipit-source-id: ef0b9a793e4f6282df099b464f01d1fb4c5a2cab	2022-10-31 09:51:38 -07:00
Changyu Bi	56715350d9	Reduce heap operations for range tombstone keys in iterator (#10877 ) Summary: Right now in MergingIterator, for each range tombstone start and end key, we pop one end from heap and push the other end into the heap. This involves extra downheap and upheap cost. In the likely cases when a range tombstone iterator emits relatively adjacent keys, these keys should have similar order within all keys in the heap. This can happen when there is a burst of consecutive range tombstones, and most of the keys covered by them are dropped already. This PR uses `replace_top()` when inserting new range tombstone keys, which is more efficient in these common cases. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10877 Test Plan: - existing UT - ran all flavors of stress test through sandcastle - benchmark: ``` # Set up: --writes_per_range_tombstone=1 means one point write and one delete range TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=100000000 --writes=800000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=64 Level Files Size(MB) -------------------- 0 8 152 1 0 0 2 0 0 3 0 0 4 0 0 5 0 0 6 0 0 # Benchmark TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone/ ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472 --num=100000000 --reads=1000000 --disable_auto_compactions=true --avoid_flush_during_recovery=true # Pre PR readseq [AVG 5 runs] : 1432116 (± 59664) ops/sec; 224.0 (± 9.3) MB/sec readseq [MEDIAN 5 runs] : 1454886 ops/sec; 227.5 MB/sec # Post PR readseq [AVG 5 runs] : 1944425 (± 29521) ops/sec; 304.1 (± 4.6) MB/sec readseq [MEDIAN 5 runs] : 1959430 ops/sec; 306.5 MB/sec ``` Reviewed By: ajkr Differential Revision: D40710936 Pulled By: cbi42 fbshipit-source-id: cb782fb9cdcd26c0c3eb9443215a4ef4d2f79022	2022-10-27 14:28:50 -07:00
Hui Xiao	fc74abb436	Fix FIFO causing overlapping seqnos in L0 files due to overlapped seqnos between ingested files and memtable's (#10777 ) Summary: Context: Same as https://github.com/facebook/rocksdb/pull/5958#issue-511150930 but apply the fix to FIFO Compaction case Repro: ``` COERCE_CONTEXT_SWICH=1 make -j56 db_stress ./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=18 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=1000 --compaction_pri=3 --open_files=-1 --compaction_style=2 --fifo_allow_compaction=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8388607 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test0/rocksdb_crashtest_whitebox --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=15 --index_type=3 --ingest_external_file_one_in=100 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=20 --ribbon_starting_level=999 --snapshot_hold_ops=1000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=524288 --write_dbid_to_manifest=0 --writepercent=35 put or merge error: Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 file https://github.com/facebook/rocksdb/issues/479 with seqno 23711 29070 vs. file https://github.com/facebook/rocksdb/issues/482 with seqno 27138 29049 ``` Summary: FIFO only does intra-L0 compaction in the following four cases. For other cases, FIFO drops data instead of compacting on data, which is irrelevant to the overlapping seqno issue we are solving. - [FIFOCompactionPicker::PickSizeCompaction](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L155) when `total size < compaction_options_fifo.max_table_files_size` and `compaction_options_fifo.allow_compaction == true` - For this path, we simply reuse the fix in `FindIntraL0Compaction` https://github.com/facebook/rocksdb/pull/5958/files#diff-c261f77d6dd2134333c4a955c311cf4a196a08d3c2bb6ce24fd6801407877c89R56 - This path was not stress-tested at all. Therefore we covered `fifo.allow_compaction` in stress test to surface the overlapping seqno issue we are fixing here. - [FIFOCompactionPicker::PickCompactionToWarm](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L313) when `compaction_options_fifo.age_for_warm > 0` - For this path, we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and skip files of largest seqno greater than `earliest_mem_seqno` - This path was not stress-tested at all. However covering `age_for_warm` option worths a separate PR to deal with db stress compatibility. Therefore we manually tested this path for this PR - [FIFOCompactionPicker::CompactRange](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L365) that ends up picking one of the above two compactions - [CompactionPicker::CompactFiles](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.cc#L378) - Since `SanitizeCompactionInputFiles()` will be called [before](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.h#L111-L113) `CompactionPicker::CompactFiles` , we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 in `SanitizeCompactionInputFiles()`. To simplify implementation, we return `Stats::Abort()` on encountering seqno-overlapped file when doing compaction to L0 instead of skipping the file and proceed with the compaction. Some additional clean-up included in this PR: - Renamed `earliest_memtable_seqno` to `earliest_mem_seqno` for consistent naming - Added comment about `earliest_memtable_seqno` in related APIs - Made parameter `earliest_memtable_seqno` constant and required Pull Request resolved: https://github.com/facebook/rocksdb/pull/10777 Test Plan: - make check - New unit test `TEST_P(DBCompactionTestFIFOCheckConsistencyWithParam, FlushAfterIntraL0CompactionWithIngestedFile)`corresponding to the above 4 cases, which will fail accordingly without the fix - Regular CI stress run on this PR + stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 and on FIFO compaction only Reviewed By: ajkr Differential Revision: D40090485 Pulled By: hx235 fbshipit-source-id: 52624186952ee7109117788741aeeac86b624a4f	2022-10-25 10:39:58 -07:00
akankshamahajan	daceb85c51	Update version.h, HISTORY.md and add branches to compatibility check (#10846 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10846 Reviewed By: ajkr Differential Revision: D40617997 Pulled By: akankshamahajan15 fbshipit-source-id: 4b2d6e85dbca7e73b930c4165869b693d3e4e137	2022-10-23 19:42:06 -07:00
akankshamahajan	9a55e5da17	Update HISTORY.md for 7.8 release (#10844 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10844 Reviewed By: ajkr Differential Revision: D40592956 Pulled By: akankshamahajan15 fbshipit-source-id: 6656f4bc5faa30fa7882bf44155f7931895590e2	2022-10-22 10:09:07 -07:00
Jay Zhuang	f726d29a82	Allow penultimate level output for the last level only compaction (#10822 ) Summary: Allow the last level only compaction able to output result to penultimate level if the penultimate level is empty. Which will also block the other compaction output to the penultimate level. (it includes the PR https://github.com/facebook/rocksdb/issues/10829) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10822 Reviewed By: siying Differential Revision: D40389180 Pulled By: jay-zhuang fbshipit-source-id: 4e5dcdce307795b5e07b5dd1fa29dd75bb093bad	2022-10-22 08:57:38 -07:00
Peter Dillinger	27c9705ac4	Use kXXH3 as default checksum (CPU efficiency) (#10778 ) Summary: Since this has been supported for about a year, I think it's time to make it the default. This should improve CPU efficiency slightly on most hardware. A current DB performance comparison using buck+clang build: ``` TEST_TMPDIR=/dev/shm ./db_bench -checksum_type={1,4} -benchmarks=fillseq[-X1000] -num=3000000 -disable_wal ``` kXXH3 (+0.2% DB write throughput): `fillseq [AVG 1000 runs] : 822149 (± 1004) ops/sec; 91.0 (± 0.1) MB/sec` kCRC32c: `fillseq [AVG 1000 runs] : 820484 (± 1203) ops/sec; 90.8 (± 0.1) MB/sec` Micro benchmark comparison: ``` ./db_bench --benchmarks=xxh3[-X20],crc32c[-X20] ``` Machine 1, buck+clang build: `xxh3 [AVG 20 runs] : 3358616 (± 19091) ops/sec; 13119.6 (± 74.6) MB/sec` `crc32c [AVG 20 runs] : 2578725 (± 7742) ops/sec; 10073.1 (± 30.2) MB/sec` Machine 2, make+gcc build, DEBUG_LEVEL=0 PORTABLE=0: `xxh3 [AVG 20 runs] : 6182084 (± 137223) ops/sec; 24148.8 (± 536.0) MB/sec` `crc32c [AVG 20 runs] : 5032465 (± 42454) ops/sec; 19658.1 (± 165.8) MB/sec` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10778 Test Plan: make check, unit tests updated Reviewed By: ajkr Differential Revision: D40112510 Pulled By: pdillinger fbshipit-source-id: e59a8d50a60346137732f8668ba7cfac93be2b37	2022-10-21 18:09:12 -07:00
sdong	5d17297b76	Make UserComparatorWrapper not Customizable (#10837 ) Summary: Right now UserComparatorWrapper is a Customizable object, although it is not, which introduces some intialization overhead for the object. In some benchmarks, it shows up in CPU profiling. Make it not configurable by defining most functions needed by UserComparatorWrapper to an interface and implement the interface. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10837 Test Plan: Make sure existing tests pass Reviewed By: pdillinger Differential Revision: D40528511 fbshipit-source-id: 70eaac89ecd55401a26e8ed32abbc413a9617c62	2022-10-21 12:27:50 -07:00
akankshamahajan	0e7b27bfcf	Refactor block cache tracing APIs (#10811 ) Summary: Refactor the classes, APIs and data structures for block cache tracing to allow a user provided trace writer to be used. Currently, only a TraceWriter is supported, with a default built-in implementation of FileTraceWriter. The TraceWriter, however, takes a flat trace record and is thus only suitable for file tracing. This PR introduces an abstract BlockCacheTraceWriter class that takes a structured BlockCacheTraceRecord. The BlockCacheTraceWriter implementation can then format and log the record in whatever way it sees fit. The default BlockCacheTraceWriterImpl does file tracing using a user provided TraceWriter. `DB::StartBlockTrace` will internally redirect to changed `BlockCacheTrace::StartBlockCacheTrace`. New API `DB::StartBlockTrace` is also added that directly takes `BlockCacheTraceWriter` pointer. This same philosophy can be applied to KV and IO tracing as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10811 Test Plan: existing unit tests Old API DB::StartBlockTrace checked with db_bench tool create database ``` ./db_bench --benchmarks="fillseq" \ --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \ --cache_index_and_filter_blocks --cache_size=1048576 \ --disable_auto_compactions=1 --disable_wal=1 --compression_type=none \ --min_level_to_compress=-1 --compression_ratio=1 --num=10000000 ``` To trace block cache accesses when running readrandom benchmark: ``` ./db_bench --benchmarks="readrandom" --use_existing_db --duration=60 \ --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \ --cache_index_and_filter_blocks --cache_size=1048576 \ --disable_auto_compactions=1 --disable_wal=1 --compression_type=none \ --min_level_to_compress=-1 --compression_ratio=1 --num=10000000 \ --threads=16 \ -block_cache_trace_file="/tmp/binary_trace_test_example" \ -block_cache_trace_max_trace_file_size_in_bytes=1073741824 \ -block_cache_trace_sampling_frequency=1 ``` Reviewed By: anand1976 Differential Revision: D40435289 Pulled By: akankshamahajan15 fbshipit-source-id: fa2755f4788185e19f4605e731641cfd21ab3282	2022-10-21 12:15:35 -07:00
Changyu Bi	333abe9c55	Ignore max_compaction_bytes for compaction input that are within output key-range (#10835 ) Summary: When picking compaction input files, we sometimes stop picking a file that is fully included in the output key-range due to hitting max_compaction_bytes. Including these input files can potentially reduce WA at the expense of larger compactions. Larger compaction should be fine as files from input level are usually 10X smaller than files from output level. This PR adds a mutable CF option `ignore_max_compaction_bytes_for_input` that is enabled by default. We can remove this option once we are sure it is safe. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10835 Test Plan: - CI, a unit test on max_compaction_bytes fails before turning this flag off. - Benchmark does not show much difference in WA: `./db_bench --benchmarks=fillrandom,waitforcompaction,stats,levelstats -max_background_jobs=12 -num=2000000000 -target_file_size_base=33554432 --write_buffer_size=33554432` ``` main: Compaction Stats [default] Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 3/0 91.59 MB 0.8 70.9 0.0 70.9 200.8 129.9 0.0 1.5 25.2 71.2 2886.55 2463.45 9725 0.297 1093M 254K 0.0 0.0 L1 9/0 248.03 MB 1.0 392.0 129.8 262.2 391.7 129.5 0.0 3.0 69.0 68.9 5821.71 5536.90 804 7.241 6029M 5814K 0.0 0.0 L2 87/0 2.50 GB 1.0 537.0 128.5 408.5 533.8 125.2 0.7 4.2 69.5 69.1 7912.24 7323.70 4417 1.791 8299M 36M 0.0 0.0 L3 836/0 24.99 GB 1.0 616.9 118.3 498.7 594.5 95.8 5.2 5.0 66.9 64.5 9442.38 8490.28 4204 2.246 9749M 306M 0.0 0.0 L4 2355/0 62.95 GB 0.3 67.3 37.1 30.2 54.2 24.0 38.9 1.5 72.2 58.2 954.37 821.18 917 1.041 1076M 173M 0.0 0.0 Sum 3290/0 90.77 GB 0.0 1684.2 413.7 1270.5 1775.0 504.5 44.9 13.7 63.8 67.3 27017.25 24635.52 20067 1.346 26G 522M 0.0 0.0 Cumulative compaction: 1774.96 GB write, 154.29 MB/s write, 1684.19 GB read, 146.40 MB/s read, 27017.3 seconds This PR: Compaction Stats [default] Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 3/0 45.71 MB 0.8 72.9 0.0 72.9 202.8 129.9 0.0 1.6 25.4 70.7 2938.16 2510.36 9741 0.302 1124M 265K 0.0 0.0 L1 8/0 234.54 MB 0.9 384.5 129.8 254.7 384.2 129.6 0.0 3.0 69.0 68.9 5708.08 5424.43 791 7.216 5913M 5753K 0.0 0.0 L2 84/0 2.47 GB 1.0 543.1 128.6 414.5 539.9 125.4 0.7 4.2 69.6 69.2 7989.31 7403.13 4418 1.808 8393M 36M 0.0 0.0 L3 839/0 24.96 GB 1.0 615.6 118.4 497.2 593.2 96.0 5.1 5.0 66.6 64.1 9471.23 8489.31 4193 2.259 9726M 306M 0.0 0.0 L4 2360/0 63.04 GB 0.3 67.6 37.3 30.3 54.4 24.1 38.9 1.5 71.5 57.6 967.30 827.99 907 1.066 1080M 173M 0.0 0.0 Sum 3294/0 90.75 GB 0.0 1683.8 414.2 1269.6 1774.5 504.9 44.8 13.7 63.7 67.1 27074.08 24655.22 20050 1.350 26G 522M 0.0 0.0 Cumulative compaction: 1774.52 GB write, 157.09 MB/s write, 1683.77 GB read, 149.06 MB/s read, 27074.1 seconds ``` Reviewed By: ajkr Differential Revision: D40518319 Pulled By: cbi42 fbshipit-source-id: f4ea614bc0ebefe007ffaf05bb9aec9a8ca25b60	2022-10-21 10:22:41 -07:00
Andrew Kryczka	33ceea9b76	Add DB property for fast block cache stats collection (#10832 ) Summary: This new property allows users to trigger the background block cache stats collection mode through the `GetProperty()` and `GetMapProperty()` APIs. The background mode has much lower overhead at the expense of returning stale values in more cases. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10832 Test Plan: updated unit test Reviewed By: pdillinger Differential Revision: D40497883 Pulled By: ajkr fbshipit-source-id: bdcc93402f426463abb2153756aad9e295447343	2022-10-20 15:04:29 -07:00
Yueh-Hsuan Chiang	e267909ecf	Enable a multi-level db to smoothly migrate to FIFO via DB::Open (#10348 ) Summary: FIFO compaction can theoretically open a DB with any compaction style. However, the current code only allows FIFO compaction to open a DB with a single level. This PR relaxes the limitation of FIFO compaction and allows it to open a DB with multiple levels. Below is the read / write / compaction behavior: * The read behavior is untouched, and it works like a regular rocksdb instance. * The write behavior is untouched as well. When a FIFO compacted DB is opened with multiple levels, all new files will still be in level 0, and no files will be moved to a different level. * Compaction logic is extended. It will first identify the bottom-most non-empty level. Then, it will delete the oldest file in that level. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10348 Test Plan: Added a new test to verify the migration from level to FIFO where the db has multiple levels. Extended existing test cases in db_test and db_basic_test to also verify all entries of a key after reopening the DB with FIFO compaction. Reviewed By: jay-zhuang Differential Revision: D40233744 fbshipit-source-id: 6cc011d6c3467e6bfb9b6a4054b87619e69815e1	2022-10-18 14:38:13 -07:00
Jay Zhuang	5a5f21c489	Allow the last level data moving up to penultimate level (#10782 ) Summary: Lock the penultimate level for the whole compaction inputs range, so any key in that compaction is safe to move up from the last level to penultimate level. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10782 Reviewed By: siying Differential Revision: D40231540 Pulled By: siying fbshipit-source-id: ca115cc8b4018b35d797329fa85a19b06cc8c13e	2022-10-10 22:50:34 -07:00
Peter Dillinger	2d0380adbe	Allow manifest fix-up without requiring prior state (#10796 ) Summary: This change is motivated by ensuring that `ldb update_manifest` or `UpdateManifestForFilesState` can run without expecting files to open when the old temperature is provided (in case the FileSystem strictly interprets non-kUnknown), but ended up fixing a problem in `OfflineManifestWriter` (used by `ldb unsafe_remove_sst_file`) where it would open some SST files during recovery and expect them to match the prior manifest state, even if not required by the intended new state. Also update BackupEngine to retry with Temperature kUnknown when reading file with potentially "wrong" temperature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10796 Test Plan: tests added/updated, that fail before the change(s) and now pass Reviewed By: jay-zhuang Differential Revision: D40232645 Pulled By: jay-zhuang fbshipit-source-id: b5aa2688aecfe0c320b80a7da689b315414c20be	2022-10-10 17:59:17 -07:00
akankshamahajan	ebf8c454fd	Provide support for async_io with tailing iterators (#10781 ) Summary: Provide support for async_io if ReadOptions.tailing is set true. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10781 Test Plan: - Update unit tests - Ran db_bench: ./db_bench --benchmarks="readrandom" --use_existing_db --use_tailing_iterator=1 --async_io=1 Reviewed By: anand1976 Differential Revision: D40128882 Pulled By: anand1976 fbshipit-source-id: 55e17855536871a5c47e2de92d238ae005c32d01	2022-10-10 15:48:48 -07:00
Jay Zhuang	c401f285c3	Add option `preserve_internal_time_seconds` to preserve the time info (#10747 ) Summary: Add option `preserve_internal_time_seconds` to preserve the internal time information. It's mostly for the migration of the existing data to tiered storage ( `preclude_last_level_data_seconds`). When the tiering feature is just enabled, the existing data won't have the time information to decide if it's hot or cold. Enabling this feature will start collect and preserve the time information for the new data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10747 Reviewed By: siying Differential Revision: D39910141 Pulled By: siying fbshipit-source-id: 25c21638e37b1a7c44006f636b7d714fe7242138	2022-10-07 18:49:40 -07:00
Peter Dillinger	b205c6d029	Fix bug in HyperClockCache ApplyToEntries; cleanup (#10768 ) Summary: We have seen some rare crash test failures in HyperClockCache, and the source could certainly be a bug fixed in this change, in ClockHandleTable::ConstApplyToEntriesRange. It wasn't properly accounting for the fact that incrementing the acquire counter could be ineffective, due to parallel updates. (When incrementing the acquire counter is ineffective, it is incorrect to then decrement it.) This change includes some other minor clean-up in HyperClockCache, and adds stats_dump_period_sec with a much lower period to the crash test. This should be the primary caller of ApplyToEntries, in collecting cache entry stats. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10768 Test Plan: haven't been able to reproduce the failure, but should be in a better state (bug fix and improved crash test) Reviewed By: anand1976 Differential Revision: D40034747 Pulled By: anand1976 fbshipit-source-id: a06fcefe146e17ee35001984445cedcf3b63eb68	2022-10-06 14:54:21 -07:00
Yanqin Jin	4d82b94896	Sanitize min_write_buffer_number_to_merge to 1 with atomic_flush (#10773 ) Summary: With current implementation, within the same RocksDB instance, all column families with non-empty memtables will be scheduled for flush if RocksDB determines that any column family needs to be flushed, e.g. memtable full, write buffer manager, etc., if atomic flush is enabled. Not doing so can lead to data loss and inconsistency when WAL is disabled, which is a common setting when atomic flush is enabled. Therefore, setting a per-column-family knob, min_write_buffer_number_to_merge to a value greater than 1 is not compatible with atomic flush, and should be sanitized during column family creation and db open. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10773 Test Plan: Reproduce: D39993203 has detailed steps. Run the test with and without the fix. Reviewed By: cbi42 Differential Revision: D40077955 Pulled By: cbi42 fbshipit-source-id: 451a9179eb531ac42eaccf40b451b9dec4085240	2022-10-05 12:24:39 -07:00
Changyu Bi	eca47fb696	Ignore kBottommostFiles compaction logic when allow_ingest_behind (#10767 ) Summary: fix for https://github.com/facebook/rocksdb/issues/10752 where RocksDB could be in an infinite compaction loop (with compaction reason kBottommostFiles) if allow_ingest_behind is enabled and the bottommost level is unfilled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10767 Test Plan: Added a unit test to reproduce the compaction loop. Reviewed By: ajkr Differential Revision: D40031861 Pulled By: ajkr fbshipit-source-id: 71c4b02931fbe507a847632905404c9b8fa8c96b	2022-10-05 09:27:14 -07:00
Changyu Bi	ffde463a5f	Cleanup SuperVersion in Iterator::Refresh() (#10770 ) Summary: Fix a bug in Iterator::Refresh() where the local SV it obtained could be obsolete upon return, and should be cleaned up. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10770 Test Plan: added a unit test to reproduce the issue. Reviewed By: ajkr Differential Revision: D40063809 Pulled By: ajkr fbshipit-source-id: 619e728eb0f1ac9540b4d0ad38e43acc37a514b2	2022-10-04 22:23:24 -07:00
Yanqin Jin	edda219fc3	Manual flush with `wait=false` should not stall when writes stopped (#10001 ) Summary: When `FlushOptions::wait` is set to false, manual flush should not stall forever. If the database has already stopped writes, then the thread calling `DB::Flush()` with `FlushOptions::wait=false` should not enter the `DBImpl::write_thread_`. To prevent this, we should do a check at the beginning and return `TryAgain()` Resolves: https://github.com/facebook/rocksdb/issues/9892 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10001 Reviewed By: siying Differential Revision: D36422303 Pulled By: siying fbshipit-source-id: 723bd3065e8edc4f17c82449d0d6b95a2381ac0a	2022-10-04 16:43:01 -07:00
Jay Zhuang	f007ad8b4f	RoundRobin TTL compaction (#10725 ) Summary: For RoundRobin compaction, the data should be mostly sorted per level and within level. Use normal compaction picker for RR until all expired data is compacted. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10725 Reviewed By: ajkr Differential Revision: D39771069 Pulled By: jay-zhuang fbshipit-source-id: 7ccf88d7c093fad5673bda73a7b08cc4757780cd	2022-10-04 14:53:32 -07:00
akankshamahajan	ae0f9c3339	Add new property in IOOptions to skip recursing through directories and list only files during GetChildren. (#10668 ) Summary: Add new property "do_not_recurse" in IOOptions for underlying file system to skip iteration of directories during DB::Open if there are no sub directories and list only files. By default this property is set to false. This property is set true currently in the code where RocksDB is sure only files are needed during DB::Open. Provided support in PosixFileSystem to use "do_not_recurse". TestPlan: - Existing tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/10668 Reviewed By: anand1976 Differential Revision: D39471683 Pulled By: akankshamahajan15 fbshipit-source-id: 90e32f0b86d5346d53bc2714d3a0e7002590527f	2022-10-03 10:59:45 -07:00
Changyu Bi	9f2363f4c4	User-defined timestamp support for `DeleteRange()` (#10661 ) Summary: Add user-defined timestamp support for range deletion. The new API is `DeleteRange(opt, cf, begin_key, end_key, ts)`. Most of the change is to update the comparator to compare without timestamp. Other than that, major changes are - internal range tombstone data structures (`FragmentedRangeTombstoneList`, `RangeTombstone`, etc.) to store timestamps. - Garbage collection of range tombstones and range tombstone covered keys during compaction. - Get()/MultiGet() to return the timestamp of a range tombstone when needed. - Get/Iterator with range tombstones bounded by readoptions.timestamp. - timestamp crash test now issues DeleteRange by default. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10661 Test Plan: - Added unit test: `make check` - Stress test: `python3 tools/db_crashtest.py --enable_ts whitebox --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4` - Ran `db_bench` to measure regression when timestamp is not enabled. The tests are for write (with some range deletion) and iterate with DB fitting in memory: `./db_bench--benchmarks=fillrandom,seekrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=500000 --seek_nexts=10 --disable_auto_compactions -disable_wal=true --max_num_range_tombstones=1000`. Did not see consistent regression in no timestamp case. \| micros/op \| fillrandom \| seekrandom \| \| --- \| --- \| --- \| \|main\| 2.58 \|10.96\| \|PR 10661\| 2.68 \|10.63\| Reviewed By: riversand963 Differential Revision: D39441192 Pulled By: cbi42 fbshipit-source-id: f05aca3c41605caf110daf0ff405919f300ddec2	2022-09-30 16:13:03 -07:00
Jay Zhuang	f3cc66632b	Align compaction output file boundaries to the next level ones (#10655 ) Summary: Try to align the compaction output file boundaries to the next level ones (grandparent level), to reduce the level compaction write-amplification. In level compaction, there are "wasted" data at the beginning and end of the output level files. Align the file boundary can avoid such "wasted" compaction. With this PR, it tries to align the non-bottommost level file boundaries to its next level ones. It may cut file when the file size is large enough (at least 50% of target_file_size) and not too large (2x target_file_size). db_bench shows about 12.56% compaction reduction: ``` TEST_TMPDIR=/data/dbbench2 ./db_bench --benchmarks=fillrandom,readrandom -max_background_jobs=12 -num=400000000 -target_file_size_base=33554432 # baseline: Flush(GB): cumulative 25.882, interval 7.216 Cumulative compaction: 285.90 GB write, 162.36 MB/s write, 269.68 GB read, 153.15 MB/s read, 2926.7 seconds # with this change: Flush(GB): cumulative 25.882, interval 7.753 Cumulative compaction: 249.97 GB write, 141.96 MB/s write, 233.74 GB read, 132.74 MB/s read, 2534.9 seconds ``` The compaction simulator shows a similar result (14% with 100G random data). As a side effect, with this PR, the SST file size can exceed the target_file_size, but is capped at 2x target_file_size. And there will be smaller files. Here are file size statistics when loading 100GB with the target file size 32MB: ``` baseline this_PR count 1.656000e+03 1.705000e+03 mean 3.116062e+07 3.028076e+07 std 7.145242e+06 8.046139e+06 ``` The feature is enabled by default, to revert to the old behavior disable it with `AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size = false` Also includes https://github.com/facebook/rocksdb/issues/1963 to cut file before skippable grandparent file. Which is for use case like user adding 2 or more non-overlapping data range at the same time, it can reduce the overlapping of 2 datasets in the lower levels. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10655 Reviewed By: cbi42 Differential Revision: D39552321 Pulled By: jay-zhuang fbshipit-source-id: 640d15f159ab0cd973f2426cfc3af266fc8bdde2	2022-09-29 19:43:55 -07:00
Changyu Bi	df492791b6	Fix segfault in Iterator::Refresh() (#10739 ) Summary: when a new internal iterator is constructed during iterator refresh, pointer to the previous memtable range tombstone iterator was not cleared. This could cause segfault for future `Refresh()` calls when they try to free the memtable range tombstones. This PR fixes this issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10739 Test Plan: added a unit test in db_range_del_test.cc to reproduce this issue. Reviewed By: ajkr, riversand963 Differential Revision: D39825283 Pulled By: cbi42 fbshipit-source-id: 3b59a2b73865aed39e28cdd5c1b57eed7991b94c	2022-09-26 18:57:23 -07:00
Yanqin Jin	52f2411722	Update HISTORY to mention PR #10724 (#10737 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10737 Reviewed By: cbi42 Differential Revision: D39825386 Pulled By: riversand963 fbshipit-source-id: a3c55f2777e034d6ae6ff44ef0219d9fbbf1cc96	2022-09-26 15:59:30 -07:00
Changyu Bi	93f46da1fa	Mention in HISTORY.md the fix in #10705 (#10720 ) Summary: Mention in HISTORY.md the fix in https://github.com/facebook/rocksdb/issues/10705. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10720 Reviewed By: riversand963 Differential Revision: D39709455 Pulled By: cbi42 fbshipit-source-id: 1a2c9dd8425c73f7095ddb1d0b1cca8ed35b7ef2	2022-09-21 15:08:43 -07:00
Akanksha Mahajan	0f4978d34f	Fix sqe->addr passed in cancel request in io_uring (#10644 ) Summary: Update io_uring_prep_cancel as it is now backward compatible. Also, io_uring_prep_cancel expects sqe->addr to match with read request submitted. It's being set wrong which is now fixed in this PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10644 Test Plan: - Ran internally with lastest liburing package and on RocksDB github repo with older version. - Ran seekrandom regression to confirm there is no regression. Reviewed By: anand1976 Differential Revision: D39284229 Pulled By: akankshamahajan15 fbshipit-source-id: fd52cdf23d676da114896163626b75c8ae09c980	2022-09-21 14:21:59 -07:00
Yanqin Jin	2b6f3510c2	Update version number and HISTORY in main branch (#10694 ) Summary: This PR bumps up version number from 7.7 to 7.8 in main branch, indicating that next release will be 7.8. We are going to release 7.7 soon. Since 7.7.fb branch has been created, we can land this to main. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10694 Reviewed By: gitbw95 Differential Revision: D39581577 Pulled By: riversand963 fbshipit-source-id: 84f3fecf25fd9ac96e46b4cd6d50ddb6edc89427	2022-09-19 12:25:44 -07:00
gitbw95	2cc5b39560	Add enable_split_merge option for CompressedSecondaryCache (#10690 ) Summary: `enable_custom_split_merge` is added for enabling the custom split and merge feature, which split the compressed value into chunks so that they may better fit jemalloc bins. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10690 Test Plan: Unit Tests Stress Tests Reviewed By: anand1976 Differential Revision: D39567604 Pulled By: gitbw95 fbshipit-source-id: f6d1d46200f365220055f793514601dcb0edc4b7	2022-09-16 15:41:49 -07:00
Peter Dillinger	0f91c72adc	Call experimental new clock cache HyperClockCache (#10684 ) Summary: This change establishes a distinctive name for the experimental new lock-free clock cache (originally developed by guidotag and revamped in PR https://github.com/facebook/rocksdb/issues/10626). A few reasons: * We want to make it clear that this is a fundamentally different implementation vs. the old clock cache, to avoid people saying "I already tried clock cache." * We want to highlight the key feature: it's fast (especially under parallel load) * Because it requires an estimated charge per entry, it is not drop-in API compatible with old clock cache. This estimate might always be required for highest performance, and giving it a distinct name should reduce confusion about the distinct API requirements. * We might develop a variant requiring the same estimate parameter but with LRU eviction. In that case, using the name HyperLRUCache should make things more clear. (FastLRUCache is just a prototype that might soon be removed.) Some API detail: * To reduce copy-pasting parameter lists, etc. as in LRUCache construction, I have a `MakeSharedCache()` function on `HyperClockCacheOptions` instead of `NewHyperClockCache()`. * Changes -cache_type=clock_cache to -cache_type=hyper_clock_cache for applicable tools. I think this is more consistent / sustainable for reasons already stated. For performance tests see https://github.com/facebook/rocksdb/pull/10626 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10684 Test Plan: no interesting functional changes; tests updated Reviewed By: anand1976 Differential Revision: D39547800 Pulled By: pdillinger fbshipit-source-id: 5c0fe1b5cf3cb680ab369b928c8569682b9795bf	2022-09-16 12:47:29 -07:00
anand76	37b75e1364	Fix some MultiGet stats (#10673 ) Summary: The stats were not accurate for the coroutine version of MultiGet. This PR fixes it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10673 Reviewed By: akankshamahajan15 Differential Revision: D39492615 Pulled By: anand1976 fbshipit-source-id: b46c04e15ea27e66f4c31f00c66497aa283bf9d3	2022-09-15 22:48:06 -07:00
anand76	7b11d48444	Change MultiGet multi-level optimization to default on (#10671 ) Summary: Change the ```ReadOptions.optimize_multiget_for_io``` flag to default on. It doesn't impact regular MultiGet users as its only applicable when ```ReadOptions.async_io``` is also set to true. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10671 Reviewed By: akankshamahajan15 Differential Revision: D39477439 Pulled By: anand1976 fbshipit-source-id: 47abcdbfa69f9bc60422ab68a238b232e085d4ba	2022-09-14 08:51:16 -07:00
Changyu Bi	f291eefb02	Cache fragmented range tombstone list for mutable memtables (#10547 ) Summary: Each read from memtable used to read and fragment all the range tombstones into a `FragmentedRangeTombstoneList`. https://github.com/facebook/rocksdb/issues/10380 improved the inefficient here by caching a `FragmentedRangeTombstoneList` with each immutable memtable. This PR extends the caching to mutable memtables. The fragmented range tombstone can be constructed in either read (This PR) or write path (https://github.com/facebook/rocksdb/issues/10584). With both implementation, each `DeleteRange()` will invalidate the cache, and the difference is where the cache is re-constructed.`CoreLocalArray` is used to store the cache with each memtable so that multi-threaded reads can be efficient. More specifically, each core will have a shared_ptr to a shared_ptr pointing to the current cache. Each read thread will only update the reference count in its core-local shared_ptr, and this is only needed when reading from mutable memtables. The choice between write path and read path is not an easy one: they are both improvement compared to no caching in the current implementation, but they favor different operations and could cause regression in the other operation (read vs write). The write path caching in (https://github.com/facebook/rocksdb/issues/10584) leads to a cleaner implementation, but I chose the read path caching here to avoid significant regression in write performance when there is a considerable amount of range tombstones in a single memtable (the number from the benchmark below suggests >1000 with concurrent writers). Note that even though the fragmented range tombstone list is only constructed in `DeleteRange()` operations, it could block other writes from proceeding, and hence affects overall write performance. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10547 Test Plan: - TestGet() in stress test is updated in https://github.com/facebook/rocksdb/issues/10553 to compare Get() result against expected state: `./db_stress_branch --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4` - Perf benchmark: tested read and write performance where a memtable has 0, 1, 10, 100 and 1000 range tombstones. ``` ./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=200000 --reads=100000 --disable_auto_compactions --max_num_range_tombstones=1000 ``` Write perf regressed since the cost of constructing fragmented range tombstone list is shifted from every read to a single write. 6cbe5d8e172dc5f1ef65c9d0a6eedbd9987b2c72 is included in the last column as a reference to see performance impact on multi-thread reads if `CoreLocalArray` is not used. micros/op averaged over 5 runs: first 4 columns are for fillrandom, last 4 columns are for readrandom. \| \|fillrandom main \| write path caching \| read path caching \|memtable V3 (https://github.com/facebook/rocksdb/issues/10308) \| readrandom main \| write path caching \| read path caching \|memtable V3 \| \|--- \|--- \|--- \|--- \|--- \| --- \| --- \| --- \| --- \| \| 0 \|6.35 \|6.15 \|5.82 \|6.12 \|2.24 \|2.26 \|2.03 \|2.07 \| \| 1 \|5.99 \|5.88 \|5.77 \|6.28 \|2.65 \|2.27 \|2.24 \|2.5 \| \| 10 \|6.15 \|6.02 \|5.92 \|5.95 \|5.15 \|2.61 \|2.31 \|2.53 \| \| 100 \|5.95 \|5.78 \|5.88 \|6.23 \|28.31 \|2.34 \|2.45 \|2.94 \| \| 100 25 threads \|52.01 \|45.85 \|46.18 \|47.52 \|35.97 \|3.34 \|3.34 \|3.56 \| \| 1000 \|6.0 \|7.07 \|5.98 \|6.08 \|333.18 \|2.86 \|2.7 \|3.6 \| \| 1000 25 threads \|52.6 \|148.86 \|79.06 \|45.52 \|473.49 \|3.66 \|3.48 \|4.38 \| - Benchmark performance of`readwhilewriting` from https://github.com/facebook/rocksdb/issues/10552, 100 range tombstones are written: `./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=500 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=100000 --reads=500000 --disable_auto_compactions --max_num_range_tombstones=10000 --finish_after_writes` readrandom micros/op: \| \|main \|write path caching \|read path caching \|memtable V3 \| \|---\|---\|---\|---\|---\| \| single thread \|48.28 \|1.55 \|1.52 \|1.96 \| \| 25 threads \|64.3 \|2.55 \|2.67 \|2.64 \| Reviewed By: ajkr Differential Revision: D38895410 Pulled By: cbi42 fbshipit-source-id: 930bfc309dd1b2f4e8e9042f5126785bba577559	2022-09-13 20:07:28 -07:00
Akanksha Mahajan	03fc43976d	Async optimization in scan path (#10602 ) Summary: Optimizations 1. In FilePrefetchBuffer, when data is overlapping between two buffers, it copies the data from first to third buffer, then from second to third buffer to return continuous buffer. This optimization will call ReadAsync on first buffer as soon as buffer is empty instead of getting blocked by second buffer to copy the data. 2. For fixed size readahead_size, FilePrefetchBuffer will issues two async read calls. One with length + readahead_size_/2 on first buffer(if buffer is empty) and readahead_size_/2 on second buffer during seek. - Add readahead_size to db_stress for stress testing these changes in https://github.com/facebook/rocksdb/pull/10632 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10602 Test Plan: - CircleCI tests - stress_test completed successfully export CRASH_TEST_EXT_ARGS="--async_io=1" make crash_test -j32 - db_bench showed no regression With this PR: ``` ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main1 -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=327680 -duration=30 -ops_between_duration_checks=1 -async_io=1 Set seed to 1661876074584472 because --seed was 0 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags Integrated BlobDB: blob cache disabled RocksDB: version 7.7.0 Date: Tue Aug 30 09:14:34 2022 CPU: 32 * Intel Xeon Processor (Skylake) CPUCache: 16384 KB Keys: 32 bytes each (+ 0 bytes user-defined timestamp) Values: 512 bytes each (256 bytes after compression) Entries: 50000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 25939.9 MB (estimated) FileSize: 13732.9 MB (estimated) Write rate: 0 bytes/second Read rate: 0 ops/second Compression: Snappy Compression sampling rate: 0 Memtablerep: SkipListFactory Perf Level: 1 ------------------------------------------------ DB path: [/tmp/prefix_scan_prefetch_main1] seekrandom : 270878.018 micros/op 3 ops/sec 30.068 seconds 111 operations; 618.7 MB/s (111 of 111 found) ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main1 -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=true -seek_nexts=327680 -duration=30 -ops_between_duration_checks=1 -async_io=1 Set seed to 1661875332862922 because --seed was 0 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags Integrated BlobDB: blob cache disabled RocksDB: version 7.7.0 Date: Tue Aug 30 09:02:12 2022 CPU: 32 * Intel Xeon Processor (Skylake) CPUCache: 16384 KB Keys: 32 bytes each (+ 0 bytes user-defined timestamp) Values: 512 bytes each (256 bytes after compression) Entries: 50000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 25939.9 MB (estimated) FileSize: 13732.9 MB (estimated) Write rate: 0 bytes/second Read rate: 0 ops/second Compression: Snappy Compression sampling rate: 0 Memtablerep: SkipListFactory Perf Level: 1 WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ DB path: [/tmp/prefix_scan_prefetch_main1] seekrandom : 358352.488 micros/op 2 ops/sec 30.102 seconds 84 operations; 474.4 MB/s (84 of 84 found) ``` Without PR in main: ``` ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main1 -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=327680 -duration=30 -ops_between_duration_checks=1 -async_io=1 Set seed to 1661876425983045 because --seed was 0 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags Integrated BlobDB: blob cache disabled RocksDB: version 7.7.0 Date: Tue Aug 30 09:20:26 2022 CPU: 32 * Intel Xeon Processor (Skylake) CPUCache: 16384 KB Keys: 32 bytes each (+ 0 bytes user-defined timestamp) Values: 512 bytes each (256 bytes after compression) Entries: 50000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 25939.9 MB (estimated) FileSize: 13732.9 MB (estimated) Write rate: 0 bytes/second Read rate: 0 ops/second Compression: Snappy Compression sampling rate: 0 Memtablerep: SkipListFactory Perf Level: 1 ------------------------------------------------ DB path: [/tmp/prefix_scan_prefetch_main1] seekrandom : 280881.953 micros/op 3 ops/sec 30.054 seconds 107 operations; 605.2 MB/s (107 of 107 found) ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main1 -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=327680 -duration=30 -ops_between_duration_checks=1 -async_io=0 Set seed to 1661876475267771 because --seed was 0 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags Integrated BlobDB: blob cache disabled RocksDB: version 7.7.0 Date: Tue Aug 30 09:21:15 2022 CPU: 32 * Intel Xeon Processor (Skylake) CPUCache: 16384 KB Keys: 32 bytes each (+ 0 bytes user-defined timestamp) Values: 512 bytes each (256 bytes after compression) Entries: 50000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 25939.9 MB (estimated) FileSize: 13732.9 MB (estimated) Write rate: 0 bytes/second Read rate: 0 ops/second Compression: Snappy Compression sampling rate: 0 Memtablerep: SkipListFactory Perf Level: 1 ------------------------------------------------ DB path: [/tmp/prefix_scan_prefetch_main1] seekrandom : 363155.084 micros/op 2 ops/sec 30.142 seconds 83 operations; 468.1 MB/s (83 of 83 found) ``` Reviewed By: anand1976 Differential Revision: D39141328 Pulled By: akankshamahajan15 fbshipit-source-id: 560655922c1a437a8569c228abb31b8c0b413120	2022-09-12 17:42:01 -07:00

1 2 3 4 5 ...

1529 Commits