rocksdb

Commit Graph

Author	SHA1	Message	Date
Alan Paxton	d9a441113e	JNI get_helper code sharing / multiGet() use efficient batch C++ support (#12344 ) Summary: Implement RAII-based helpers for JNIGet() and multiGet() Replace JNI C++ helpers `rocksdb_get_helper, rocksdb_get_helper_direct`, `multi_get_helper`, `multi_get_helper_direct`, `multi_get_helper_release_keys`, `txn_get_helper`, and `txn_multi_get_helper`. The model is to entirely do away with a single helper, instead a number of utility methods allow each separate JNI `Get()` and `MultiGet()` method to organise their parameters efficiently, then call the underlying C++ `db->Get()`, `db->MultiGet()`, `txn->Get()`, or `txn->MultiGet()` method itself, and use further utilities to retrieve results. Roughly speaking: * get keys into C++ form * Call C++ Get() * get results and status into Java form We achieve a useful performance gain as part of this work; by using the updated C++ multiGet we immediately pick up its performance gains (batch improvements to multiGet C++ were previously implemented, but not until now used by Java/JNI). multiGetBB already uses the batched C++ multiGet(), and all other benchmarks show consistent improvement after the changes: ## Before: ``` Benchmark (columnFamilyTestType) (keyCount) (keySize) (multiGetSize) (valueSize) Mode Cnt Score Error Units MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 256 thrpt 25 5315.459 ± 20.465 ops/s MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 1024 thrpt 25 5673.115 ± 78.299 ops/s MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 4096 thrpt 25 2616.860 ± 46.994 ops/s MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 16384 thrpt 25 1700.058 ± 24.034 ops/s MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 65536 thrpt 25 791.171 ± 13.955 ops/s MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 256 thrpt 25 6129.929 ± 94.200 ops/s MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 1024 thrpt 25 7012.405 ± 97.886 ops/s MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 4096 thrpt 25 2799.014 ± 39.352 ops/s MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 16384 thrpt 25 1417.205 ± 22.272 ops/s MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 65536 thrpt 25 655.594 ± 13.050 ops/s MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 256 thrpt 25 6147.247 ± 82.711 ops/s MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 1024 thrpt 25 7004.213 ± 79.251 ops/s MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 4096 thrpt 25 2715.154 ± 110.017 ops/s MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 16384 thrpt 25 1408.070 ± 31.714 ops/s MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 65536 thrpt 25 623.829 ± 57.374 ops/s MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 256 thrpt 25 6119.243 ± 116.313 ops/s MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 1024 thrpt 25 6931.873 ± 128.094 ops/s MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 4096 thrpt 25 2678.253 ± 39.113 ops/s MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 16384 thrpt 25 1337.384 ± 19.500 ops/s MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 65536 thrpt 25 625.596 ± 14.525 ops/s ``` ## After: ``` Benchmark (columnFamilyTestType) (keyCount) (keySize) (multiGetSize) (valueSize) Mode Cnt Score Error Units MultiGetBenchmarks.multiGetBB200 no_column_family 10000 1024 100 256 thrpt 25 5191.074 ± 78.250 ops/s MultiGetBenchmarks.multiGetBB200 no_column_family 10000 1024 100 1024 thrpt 25 5378.692 ± 260.682 ops/s MultiGetBenchmarks.multiGetBB200 no_column_family 10000 1024 100 4096 thrpt 25 2590.183 ± 34.844 ops/s MultiGetBenchmarks.multiGetBB200 no_column_family 10000 1024 100 16384 thrpt 25 1634.793 ± 34.022 ops/s MultiGetBenchmarks.multiGetBB200 no_column_family 10000 1024 100 65536 thrpt 25 786.455 ± 8.462 ops/s MultiGetBenchmarks.multiGetBB200 1_column_family 10000 1024 100 256 thrpt 25 5285.055 ± 11.676 ops/s MultiGetBenchmarks.multiGetBB200 1_column_family 10000 1024 100 1024 thrpt 25 5586.758 ± 213.008 ops/s MultiGetBenchmarks.multiGetBB200 1_column_family 10000 1024 100 4096 thrpt 25 2527.172 ± 17.106 ops/s MultiGetBenchmarks.multiGetBB200 1_column_family 10000 1024 100 16384 thrpt 25 1819.547 ± 12.958 ops/s MultiGetBenchmarks.multiGetBB200 1_column_family 10000 1024 100 65536 thrpt 25 803.861 ± 9.963 ops/s MultiGetBenchmarks.multiGetBB200 20_column_families 10000 1024 100 256 thrpt 25 5253.793 ± 28.020 ops/s MultiGetBenchmarks.multiGetBB200 20_column_families 10000 1024 100 1024 thrpt 25 5705.591 ± 20.556 ops/s MultiGetBenchmarks.multiGetBB200 20_column_families 10000 1024 100 4096 thrpt 25 2523.377 ± 15.415 ops/s MultiGetBenchmarks.multiGetBB200 20_column_families 10000 1024 100 16384 thrpt 25 1815.344 ± 11.309 ops/s MultiGetBenchmarks.multiGetBB200 20_column_families 10000 1024 100 65536 thrpt 25 820.792 ± 3.192 ops/s MultiGetBenchmarks.multiGetBB200 100_column_families 10000 1024 100 256 thrpt 25 5262.184 ± 20.477 ops/s MultiGetBenchmarks.multiGetBB200 100_column_families 10000 1024 100 1024 thrpt 25 5706.959 ± 23.123 ops/s MultiGetBenchmarks.multiGetBB200 100_column_families 10000 1024 100 4096 thrpt 25 2520.362 ± 9.170 ops/s MultiGetBenchmarks.multiGetBB200 100_column_families 10000 1024 100 16384 thrpt 25 1789.185 ± 14.239 ops/s MultiGetBenchmarks.multiGetBB200 100_column_families 10000 1024 100 65536 thrpt 25 818.401 ± 12.132 ops/s MultiGetBenchmarks.multiGetList10 no_column_family 10000 1024 100 256 thrpt 25 6978.310 ± 14.084 ops/s MultiGetBenchmarks.multiGetList10 no_column_family 10000 1024 100 1024 thrpt 25 7664.242 ± 22.304 ops/s MultiGetBenchmarks.multiGetList10 no_column_family 10000 1024 100 4096 thrpt 25 2881.778 ± 81.054 ops/s MultiGetBenchmarks.multiGetList10 no_column_family 10000 1024 100 16384 thrpt 25 1599.826 ± 7.190 ops/s MultiGetBenchmarks.multiGetList10 no_column_family 10000 1024 100 65536 thrpt 25 737.520 ± 6.809 ops/s MultiGetBenchmarks.multiGetList10 1_column_family 10000 1024 100 256 thrpt 25 6974.376 ± 10.716 ops/s MultiGetBenchmarks.multiGetList10 1_column_family 10000 1024 100 1024 thrpt 25 7637.440 ± 45.877 ops/s MultiGetBenchmarks.multiGetList10 1_column_family 10000 1024 100 4096 thrpt 25 2820.472 ± 42.231 ops/s MultiGetBenchmarks.multiGetList10 1_column_family 10000 1024 100 16384 thrpt 25 1716.663 ± 8.527 ops/s MultiGetBenchmarks.multiGetList10 1_column_family 10000 1024 100 65536 thrpt 25 755.848 ± 7.514 ops/s MultiGetBenchmarks.multiGetList10 20_column_families 10000 1024 100 256 thrpt 25 6943.651 ± 20.040 ops/s MultiGetBenchmarks.multiGetList10 20_column_families 10000 1024 100 1024 thrpt 25 7679.415 ± 9.114 ops/s MultiGetBenchmarks.multiGetList10 20_column_families 10000 1024 100 4096 thrpt 25 2844.564 ± 13.388 ops/s MultiGetBenchmarks.multiGetList10 20_column_families 10000 1024 100 16384 thrpt 25 1729.545 ± 5.983 ops/s MultiGetBenchmarks.multiGetList10 20_column_families 10000 1024 100 65536 thrpt 25 783.218 ± 1.530 ops/s MultiGetBenchmarks.multiGetList10 100_column_families 10000 1024 100 256 thrpt 25 6944.276 ± 29.995 ops/s MultiGetBenchmarks.multiGetList10 100_column_families 10000 1024 100 1024 thrpt 25 7670.301 ± 8.986 ops/s MultiGetBenchmarks.multiGetList10 100_column_families 10000 1024 100 4096 thrpt 25 2839.828 ± 12.421 ops/s MultiGetBenchmarks.multiGetList10 100_column_families 10000 1024 100 16384 thrpt 25 1730.005 ± 9.209 ops/s MultiGetBenchmarks.multiGetList10 100_column_families 10000 1024 100 65536 thrpt 25 787.096 ± 1.977 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 256 thrpt 25 6896.944 ± 21.530 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 1024 thrpt 25 7622.407 ± 12.824 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 4096 thrpt 25 2927.538 ± 19.792 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 16384 thrpt 25 1598.041 ± 4.312 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 65536 thrpt 25 744.564 ± 9.236 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 1_column_family 10000 1024 100 256 thrpt 25 6853.760 ± 78.041 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 1_column_family 10000 1024 100 1024 thrpt 25 7360.917 ± 355.365 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 1_column_family 10000 1024 100 4096 thrpt 25 2848.774 ± 13.409 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 1_column_family 10000 1024 100 16384 thrpt 25 1727.688 ± 3.329 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 1_column_family 10000 1024 100 65536 thrpt 25 776.088 ± 7.517 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 20_column_families 10000 1024 100 256 thrpt 25 6910.339 ± 14.366 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 20_column_families 10000 1024 100 1024 thrpt 25 7633.660 ± 10.830 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 20_column_families 10000 1024 100 4096 thrpt 25 2787.799 ± 81.775 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 20_column_families 10000 1024 100 16384 thrpt 25 1726.517 ± 6.830 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 20_column_families 10000 1024 100 65536 thrpt 25 787.597 ± 3.362 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 100_column_families 10000 1024 100 256 thrpt 25 6922.445 ± 10.493 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 100_column_families 10000 1024 100 1024 thrpt 25 7604.710 ± 48.043 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 100_column_families 10000 1024 100 4096 thrpt 25 2848.788 ± 15.783 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 100_column_families 10000 1024 100 16384 thrpt 25 1730.837 ± 6.497 ops/s MultiGetBenchmarks.multiGetListExplicitCF20 100_column_families 10000 1024 100 65536 thrpt 25 794.557 ± 1.869 ops/s MultiGetBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 256 thrpt 25 6918.716 ± 15.766 ops/s MultiGetBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 1024 thrpt 25 7626.692 ± 9.394 ops/s MultiGetBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 4096 thrpt 25 2871.382 ± 72.155 ops/s MultiGetBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 16384 thrpt 25 1598.786 ± 4.819 ops/s MultiGetBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 65536 thrpt 25 748.469 ± 7.234 ops/s MultiGetBenchmarks.multiGetListRandomCF30 1_column_family 10000 1024 100 256 thrpt 25 6922.666 ± 17.131 ops/s MultiGetBenchmarks.multiGetListRandomCF30 1_column_family 10000 1024 100 1024 thrpt 25 7623.890 ± 8.805 ops/s MultiGetBenchmarks.multiGetListRandomCF30 1_column_family 10000 1024 100 4096 thrpt 25 2850.698 ± 18.004 ops/s MultiGetBenchmarks.multiGetListRandomCF30 1_column_family 10000 1024 100 16384 thrpt 25 1727.623 ± 4.868 ops/s MultiGetBenchmarks.multiGetListRandomCF30 1_column_family 10000 1024 100 65536 thrpt 25 774.534 ± 10.025 ops/s MultiGetBenchmarks.multiGetListRandomCF30 20_column_families 10000 1024 100 256 thrpt 25 5486.251 ± 13.582 ops/s MultiGetBenchmarks.multiGetListRandomCF30 20_column_families 10000 1024 100 1024 thrpt 25 4920.656 ± 44.557 ops/s MultiGetBenchmarks.multiGetListRandomCF30 20_column_families 10000 1024 100 4096 thrpt 25 3922.913 ± 25.686 ops/s MultiGetBenchmarks.multiGetListRandomCF30 20_column_families 10000 1024 100 16384 thrpt 25 2873.106 ± 4.336 ops/s MultiGetBenchmarks.multiGetListRandomCF30 20_column_families 10000 1024 100 65536 thrpt 25 802.404 ± 8.967 ops/s MultiGetBenchmarks.multiGetListRandomCF30 100_column_families 10000 1024 100 256 thrpt 25 4817.996 ± 18.042 ops/s MultiGetBenchmarks.multiGetListRandomCF30 100_column_families 10000 1024 100 1024 thrpt 25 4243.922 ± 13.929 ops/s MultiGetBenchmarks.multiGetListRandomCF30 100_column_families 10000 1024 100 4096 thrpt 25 3175.998 ± 7.773 ops/s MultiGetBenchmarks.multiGetListRandomCF30 100_column_families 10000 1024 100 16384 thrpt 25 2321.990 ± 12.501 ops/s MultiGetBenchmarks.multiGetListRandomCF30 100_column_families 10000 1024 100 65536 thrpt 25 1753.028 ± 7.130 ops/s ``` Closes https://github.com/facebook/rocksdb/issues/11518 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12344 Reviewed By: cbi42 Differential Revision: D54809714 Pulled By: pdillinger fbshipit-source-id: bee3b949720abac073bce043b59ce976a11e99eb	2024-03-12 12:42:08 -07:00
Yu Zhang	210c8df820	Use do_validate flag to control timestamp based validation in WriteCommittedTxn::GetForUpdate (#12369 ) Summary: When PR https://github.com/facebook/rocksdb/issues/9629 introduced user-defined timestamp support for `WriteCommittedTxn`, it adds this usage mandate for API `GetForUpdate` when UDT is enabled. The `do_validate` flag has to be true, and user should have already called `Transaction::SetReadTimestampForValidation` to set a read timestamp for validation. The rationale behind this mandate is this: 1) with do_vaildate = true, `GetForUpdate` could verify this relationships: let's denote the user-defined timestamp in db for the key as `Ts_db` and the read timestamp user set via `Transaction::SetReadTimestampForValidation` as `Ts_read`. UDT based validation will only pass if `Ts_db <= Ts_read`. `5950907a82/utilities/transactions/transaction_util.cc (L141)` 2) Let's denote the committed timestamp set via `Transaction::SetCommitTimestamp` to be `Ts_cmt`. Later `WriteCommitedTxn::Commit` would only pass if this condition is met: `Ts_read < Ts_cmt`. `5950907a82/utilities/transactions/pessimistic_transaction.cc (L431)` Together these two checks can ensure `Ts_db < Ts_cmt` to meet the user-defined timestamp invariant that newer timestamp should have newer sequence number. The `do_validate` flag was originally intended to make snapshot based validation optional. If it's true, `GetForUpdate` checks no entry is written after the snapshot. If it's false, it will skip this snapshot based validation. In this PR, we are making the UDT based validation configurable too based on this flag instead of mandating it for below reasons: 1) in some cases the users themselves can enforce aformentioned invariant on their side independently, without RocksDB help, for example, if they are managing a monotonically increasing timestamp, and their transactions are only committed in a single thread. So they don't need this UDT based validation and wants to skip it, 2) It also could be expensive or not practical for users to come up with such a read timestamp that is exactly in between their commit timestamp and the db's timestamp. For example, in aformentioned case where a monotonically increasing timestamp is managed, the users would need to access this timestamp both for setting the read timestamp and for setting the commit timestamp. So it's preferable to skip this check too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12369 Test Plan: added unit tests Reviewed By: ltamasi Differential Revision: D54268920 Pulled By: jowlyzhang fbshipit-source-id: ca7693796f9bb11f376a2059d91841e51c89435a	2024-03-07 14:58:10 -08:00
yuzhangyu@fb.com	1cfdece85d	Run internal cpp modernizer on RocksDB repo (#12398 ) Summary: When internal cpp modernizer attempts to format rocksdb code, it will replace macro `ROCKSDB_NAMESPACE` with its default definition `rocksdb` when collapsing nested namespace. We filed a feedback for the tool T180254030 and the team filed a bug for this: https://github.com/llvm/llvm-project/issues/83452. At the same time, they suggested us to run the modernizer tool ourselves so future auto codemod attempts will be smaller. This diff contains: Running `xplat/scripts/codemod_service/cpp_modernizer.sh` in fbcode/internal_repo_rocksdb/repo (excluding some directories in utilities/transactions/lock/range/range_tree/lib that has a non meta copyright comment) without swapping out the namespace macro `ROCKSDB_NAMESPACE` Followed by RocksDB's own `make format` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12398 Test Plan: Auto tests Reviewed By: hx235 Differential Revision: D54382532 Pulled By: jowlyzhang fbshipit-source-id: e7d5b40f9b113b60e5a503558c181f080b9d02fa	2024-03-04 10:08:32 -08:00
Jay Huh	c00c16855d	Access DBImpl* and CFD* by CFHImpl* in Iterators (#12395 ) Summary: In the current implementation of iterators, `DBImpl` and `ColumnFamilyData` are held in `DBIter` and `ArenaWrappedDBIter` for two purposes: tracing and Refresh() API. With the introduction of a new iterator called MultiCfIterator in PR https://github.com/facebook/rocksdb/issues/12153 , which is a cross-column-family iterator that maintains multiple DBIters as child iterators from a consistent database state, we need to make some changes to the existing implementation. The new iterator will still be exposed through the generic Iterator interface with an additional capability to return AttributeGroups (via `attribute_groups()`) which is a list of wide columns grouped by column family. For more information about AttributeGroup, please refer to previous PRs: https://github.com/facebook/rocksdb/issues/11925 #11943, and https://github.com/facebook/rocksdb/issues/11977. To be able to return AttributeGroup in the default single CF iterator created, access to `ColumnFamilyHandle` within `DBIter` is necessary. However, this is not currently available in `DBIter`. Since `DBImpl` and `ColumnFamilyData` can be easily accessed via `ColumnFamilyHandleImpl`, we have decided to replace the pointers to `ColumnFamilyData` and `DBImpl` in `DBIter` with a pointer to `ColumnFamilyHandleImpl`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12395 Test Plan: # Summary In the current implementation of iterators, `DBImpl` and `ColumnFamilyData` are held in `DBIter` and `ArenaWrappedDBIter` for two purposes: tracing and Refresh() API. With the introduction of a new iterator called MultiCfIterator in PR #12153 , which is a cross-column-family iterator that maintains multiple DBIters as child iterators from a consistent database state, we need to make some changes to the existing implementation. The new iterator will still be exposed through the generic Iterator interface with an additional capability to return AttributeGroups (via `attribute_groups()`) which is a list of wide columns grouped by column family. For more information about AttributeGroup, please refer to previous PRs: #11925 #11943, and #11977. To be able to return AttributeGroup in the default single CF iterator created, access to `ColumnFamilyHandle` within `DBIter` is necessary. However, this is not currently available in `DBIter`. Since `DBImpl` and `ColumnFamilyData` can be easily accessed via `ColumnFamilyHandleImpl`, we have decided to replace the pointers to `ColumnFamilyData` and `DBImpl` in `DBIter` with a pointer to `ColumnFamilyHandleImpl`. # Test Plan There should be no behavior changes. Existing tests and CI for the correctness tests. Test for Perf Regression Build ``` $> make -j64 release ``` Setup ``` $> TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=1000000 -compression_type=none ``` Run ``` TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="newiterator,seekrandom" -cache_size=10485760000 ``` Before the change ``` DB path: [/dev/shm/db_bench/dbbench] newiterator : 0.552 micros/op 1810157 ops/sec 0.552 seconds 1000000 operations; DB path: [/dev/shm/db_bench/dbbench] seekrandom : 4.502 micros/op 222143 ops/sec 4.502 seconds 1000000 operations; (0 of 1000000 found) ``` After the change ``` DB path: [/dev/shm/db_bench/dbbench] newiterator : 0.520 micros/op 1924401 ops/sec 0.520 seconds 1000000 operations; DB path: [/dev/shm/db_bench/dbbench] seekrandom : 4.532 micros/op 220657 ops/sec 4.532 seconds 1000000 operations; (0 of 1000000 found) ``` Reviewed By: pdillinger Differential Revision: D54332713 Pulled By: jaykorean fbshipit-source-id: b28d897ad519e58b1ca82eb068a6319544a4fae5	2024-03-01 10:28:20 -08:00
Richard Barnes	a4ff83d1b2	Fix deprecated use of 0/NULL in internal_repo_rocksdb/repo/utilities/transactions/lock/range/range_tree/lib/locktree/wfg.cc + 3 Summary: `nullptr` is typesafe. `0` and `NULL` are not. In the future, only `nullptr` will be allowed. This diff helps us embrace the future _now_ in service of enabling `-Wzero-as-null-pointer-constant`. Reviewed By: meyering Differential Revision: D54163069 fbshipit-source-id: e5bb4b6ee79d82f1437ffed602bdb41dcfc0e59a	2024-02-25 22:17:04 -08:00
anand76	d227276147	Deprecate some variants of Get and MultiGet (#12327 ) Summary: A lot of variants of Get and MultiGet have been added to `include/rocksdb/db.h` over the years. Try to consolidate them by marking variants that don't return timestamps as deprecated. The underlying DB implementation will check and return Status::NotSupported() if it doesn't support returning timestamps and the caller asks for it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12327 Reviewed By: pdillinger Differential Revision: D53828151 Pulled By: anand1976 fbshipit-source-id: e0b5ca42d32daa2739d5f439a729815a2d4ff050	2024-02-16 09:21:06 -08:00
Peter Dillinger	54cb9c77d9	Prefer static_cast in place of most reinterpret_cast (#12308 ) Summary: The following are risks associated with pointer-to-pointer reinterpret_cast: * Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do. * Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally. I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement: * Going through `void` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have `struct Derived: public Base1, public Base2`. If we have `Derived` -> `void` -> `Base2` -> `Derived` through reinterpret casts, this could plausibly work (though technical UB) assuming the `Base2` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic. * Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived` -> `Base2` -> `Derived` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance. With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain. A couple of related interventions included here: Previously Cache::Handle was not actually derived from in the implementations and just used as a `void` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle. Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse). Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work. I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308 Test Plan: existing tests, CI Reviewed By: ltamasi Differential Revision: D53204947 Pulled By: pdillinger fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2	2024-02-07 10:44:11 -08:00
Peter Dillinger	76c834e441	Remove 'virtual' when implied by 'override' (#12319 ) Summary: ... to follow modern C++ style / idioms. Used this hack: ``` for FILE in `cat my_list_of_files`; do perl -pi -e 'BEGIN{undef $/;} s/ virtual( [^;{]* override)/$1/smg' $FILE; done ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12319 Test Plan: existing tests, CI Reviewed By: jaykorean Differential Revision: D53275303 Pulled By: pdillinger fbshipit-source-id: bc0881af270aa8ef4d0ae4f44c5a6614b6407377	2024-01-31 13:14:42 -08:00
Radek Hubner	1d8c54aeaa	Fix build on OpenBSD i386 (#12142 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12142 Reviewed By: pdillinger Differential Revision: D53150218 Pulled By: ajkr fbshipit-source-id: a4c4d9d22d99e8a82d93d1a7ef37ec5326855cb5	2024-01-29 16:19:59 -08:00
Richard Barnes	502a1754c4	Remove extra semi colon from internal_repo_rocksdb/repo/utilities/transactions/lock/range/range_tree/lib/locktree/manager.cc (#12276 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12276 `-Wextra-semi` or `-Wextra-semi-stmt` If the code compiles, this is safe to land. Reviewed By: jaykorean Differential Revision: D52969073 fbshipit-source-id: 1b2495548d939c32e7a89a6424767497fab9550e	2024-01-24 07:25:27 -08:00
Richard Barnes	0797616de0	Remove extra semi colon from internal_repo_rocksdb/repo/utilities/transactions/write_unprepared_txn.h (#12273 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12273 `-Wextra-semi` or `-Wextra-semi-stmt` If the code compiles, this is safe to land. Reviewed By: jaykorean Differential Revision: D52969166 fbshipit-source-id: 129715bfe69735b83b077c7d6cbf1786c1dfc410	2024-01-24 07:24:06 -08:00
Richard Barnes	532c940b79	Remove extra semi colon from internal_repo_rocksdb/repo/utilities/transactions/transaction_base.h (#12272 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12272 `-Wextra-semi` or `-Wextra-semi-stmt` If the code compiles, this is safe to land. Reviewed By: jaykorean Differential Revision: D52969170 fbshipit-source-id: 581304039be789cbce6760740e9557a925e02722	2024-01-24 07:22:45 -08:00
Hui Xiao	06e593376c	Group SST write in flush, compaction and db open with new stats (#11910 ) Summary: ## Context/Summary Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity. For that, this PR does the following: - Tag different write IOs by passing down and converting WriteOptions to IOOptions - Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH\|COMPACTION\|DB_OPEN}_MICROS Some related code refactory to make implementation cleaner: - Blob stats - Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH\|COMPACTION\|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info. - Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write. - Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority - Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification - Build table - TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables - Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder. This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more - Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority ## Test ### db bench Flush ``` ./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100 rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 ``` compaction, db oopen ``` Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279 rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213 rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66 ``` blob stats - just to make sure they aren't broken by this PR ``` Integrated Blob DB Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600 rocksdb.blobdb.blob.file.synced COUNT : 1 rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same) ``` ``` Stacked Blob DB Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876 rocksdb.blobdb.blob.file.synced COUNT : 8 rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same) ``` ### Rehearsal CI stress test Trigger 3 full runs of all our CI stress tests ### Performance Flush ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark; enable_statistics = true Pre-pr: avg 507515519.3 ns 497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908, Post-pr: avg 511971266.5 ns, regressed 0.88% 502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408, ``` Compaction ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre\|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 495346098.30 ns 492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846 Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97% 502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007 ``` Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats) ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 3848.10 ns 3814,3838,3839,3848,3854,3854,3854,3860,3860,3860 Post-pr: avg 3874.20 ns, regressed 0.68% 3863,3867,3871,3874,3875,3877,3877,3877,3880,3881 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910 Reviewed By: ajkr Differential Revision: D49788060 Pulled By: hx235 fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff	2023-12-29 15:29:23 -08:00
Peter Dillinger	106058c076	Initial CircleCI -> GitHub Actions migration (#12163 ) Summary: * Largely based on https://github.com/facebook/rocksdb/issues/12085 but grouped into one large workflow because of bad GHA UI design (see comments). * Windows job details consolidated into an action file so that those jobs can easily move between per-pr-push and nightly. * Simplify some handling of "CIRCLECI" environment and add "GITHUB_ACTIONS" in the same places * For jobs that we want to go in pr-jobs or nightly there are disabled "candidate" workflows with draft versions of those jobs. * ARM jobs are disabled waiting on full GHA support. * build-linux-java-static needed some special attention to work, due to GLIBC compatibility issues (see comments). Pull Request resolved: https://github.com/facebook/rocksdb/pull/12163 Test Plan: Nightly jobs can be seen passing between these two links: https://github.com/facebook/rocksdb/actions/runs/7266835435/job/19799390061?pr=12163 https://github.com/facebook/rocksdb/actions/runs/7269697823/job/19807724471?pr=12163 And per-PR jobs of course passing on this PR. Reviewed By: hx235 Differential Revision: D52335810 Pulled By: pdillinger fbshipit-source-id: bbb95196f33eabad8cddf3c6b52f4413c80e034d	2023-12-21 15:40:21 -08:00
Andrew Kryczka	8c568bac61	Sync a source file license from percona/PerconaFT (#12103 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/10478 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12103 Reviewed By: cbi42 Differential Revision: D51623089 Pulled By: ajkr fbshipit-source-id: 81f88262ed247144ae063a0552e0162db90c0e43	2023-12-18 11:53:27 -08:00
Richard Barnes	4f04f96742	Remove extra semi colon from infrasec/authorization/audit/AclAuditor.cpp Summary: `-Wextra-semi` or `-Wextra-semi-stmt` If the code compiles, this is safe to land. Reviewed By: palmje Differential Revision: D51995065 fbshipit-source-id: 9b55a0d8abd0927b76376cb7751bf0fcab10518c	2023-12-08 17:21:52 -08:00
Andrew Kryczka	04cbc77b90	Add missing license to source files (#12083 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/12079. Fixed missing licenses in "\.h" and "\.cc" files Pull Request resolved: https://github.com/facebook/rocksdb/pull/12083 Reviewed By: cbi42 Differential Revision: D51489634 Pulled By: ajkr fbshipit-source-id: 764bfee257b9d6603fd7606a55664b7537e1898f	2023-11-21 08:36:30 -08:00
rogertyang	543191f2ea	Add bounds checking to WBWIIteratorImpl and respect bounds of ReadOptions in Transaction (#11680 ) Summary: Fix https://github.com/facebook/rocksdb/issues/11607 Fix https://github.com/facebook/rocksdb/issues/11679 Fix https://github.com/facebook/rocksdb/issues/11606 Fix https://github.com/facebook/rocksdb/issues/2343 Add bounds checking to `WBWIIteratorImpl`, which will be reflected in `BaseDeltaIterator::delta_iterator_::Valid()`, just like `BaseDeltaIterator::base_iterator_::Valid()`. In this way, the two sub itertors become more aligned from `BaseDeltaIterator`'s perspective. Like `DBIter`, the added bounds checking caps in either bound when seeking and disvalidates the `WBWIIteratorImpl` iterator when the lower bound is past or the upper bound is reached. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11680 Test Plan: - A simple test added to write_batch_with_index_test.cc to exercise the bounds checking in `WBWIIteratorImpl`. - A sophisticated test added to transaction_test.cc to assert that `Transaction` with different write policies honor bounds in `ReadOptions`. It should be so as long as the `BaseDeltaIterator` is correctly coordinating the two sub iterators to perform iterating and bounds checking. Reviewed By: ajkr Differential Revision: D48125229 Pulled By: cbi42 fbshipit-source-id: c9acea52595aed1471a63d7ca6ef15d2a2af1367	2023-10-20 13:28:28 -07:00
Yu Zhang	552bc01669	Surface timestamp from db to the transaction iterator (#11847 ) Summary: Provide an override implementation of `Iterator::timestamp` API for `BaseDeltaIterator` so that timestamp read from DB can be surfaced by an iterator created from inside of a transaction. The behavior of the API follows this rule: 1) If the entry is read from within the transaction, an empty `Slice` is returned as the timestamp, regardless of whether `Transaction::SetCommitTimestamp` is called. 2) If the entry is read from the DB, the corresponding `DBIter::timestamp()` API's result is returned. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11847 Test Plan: make all check add some unit test Reviewed By: ltamasi Differential Revision: D49377359 Pulled By: jowlyzhang fbshipit-source-id: 1511ead262ce3515ee6c6e0f829f1b69a10fe994	2023-09-22 17:28:36 -07:00
Peter (Stig) Edwards	bf488c3052	Use *next_sequence -1 here (#11861 ) Summary: To fix off-by-one error: Transaction could not check for conflicts for operation at SequenceNumber 500000 as the MemTable only contains changes newer than SequenceNumber 500001. Fixes https://github.com/facebook/rocksdb/issues/11822 I think introduced in `a657ee9a9c` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11861 Reviewed By: pdillinger Differential Revision: D49457273 Pulled By: ajkr fbshipit-source-id: b527cbae4ecc7874633a11f07027adee62940d74	2023-09-21 13:52:01 -07:00
Changyu Bi	051cad3867	Fix CI failure due to transaction_test (#11843 ) Summary: Test ` build-linux-static_lib-alt_namespace-status_checked` has been failing in main branch. ``` utilities/transactions/transaction_test.cc:6777:3: error: 'rocksdb' has not been declared 6777 \| rocksdb::GetMergeOperandsOptions mergeOperandOptions; \| ^~~~~~~ ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11843 Test Plan: `ASSERT_STATUS_CHECKED=1 TEST_UINT128_COMPAT=1 ROCKSDB_MODIFY_NPHASH=1 LIB_MODE=static OPT="-DROCKSDB_NAMESPACE=alternative_rocksdb_ns" make V=1 -j24 J=24 transaction_test` Reviewed By: sarangbh Differential Revision: D49330210 Pulled By: cbi42 fbshipit-source-id: 85c99236eeca6a777af0101684fbab5a33cca1c9	2023-09-15 13:05:23 -07:00
Sarang Masti	c4a19ed399	Add Transaction::CollapseKey to collapse merge op chains ondemand (#11815 ) Summary: Application using rocksdb today dont have much control over the cost of reads when merge-ops are enabled, expect for waiting for compactions to kick in or using max_successive_merges hint, which only applies to memtable. This change adds Transaction::CollapseKey api giving applications the ability to request merge chain collapse on-demand, when they detect high read costs due to merges. Currently, this only supported on PessimisticTransactions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11815 Test Plan: Add a unit test Reviewed By: ajkr Differential Revision: D49309387 Pulled By: sarangbh fbshipit-source-id: a1eb5cc9e3bd4b3206a36150aacead770318e3e1	2023-09-15 10:25:57 -07:00
Yu Zhang	39a4ff2cab	Track full_history_ts_low per SuperVersion (#11784 ) Summary: As discussed in https://github.com/facebook/rocksdb/issues/11730 , this PR tracks the effective `full_history_ts_low` per SuperVersion and update existing sanity checks for `ReadOptions.timestamp >= full_history_ts_low` to use this per SuperVersion `full_history_ts_low` instead. This also means the check is moved to happen after acquiring SuperVersion. There are two motivations for this: 1) Each time `full_history_ts_low` really come into effect to collapse history, a new SuperVersion is always installed, because it would involve either a Flush or Compaction, both of which change the LSM tree shape. We can take advantage of this to ensure that as long as this sanity check is passed, even if `full_history_ts_low` can be concurrently increased and collapse some history above the requested `ReadOptions.timestamp`, a read request won’t have visibility to that part of history through this SuperVersion that it already acquired. 2) the existing sanity check uses `ColumnFamilyData::GetFullHistoryTsLow` without locking the db mutex, which is the mutex all `IncreaseFullHistoryTsLow` operation is using when mutating this field. So there is a race condition. This also solve the race condition on the read path. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11784 Test Plan: `make all check` // Checks success scenario really provide the read consistency attribute as mentioned above. `./db_with_timestamp_basic_test --gtest_filter=FullHistoryTsLowSanityCheckPassReadIsConsistent` // Checks failure scenario cleans up SuperVersion properly. `./db_with_timestamp_basic_test --gtest_filter=FullHistoryTsLowSanityCheckFail` `./db_secondary_test --gtest_filter=FullHistoryTsLowSanityCheckFail` `./db_readonly_with_timestamp_test --gtest_filter=FullHistoryTsLowSanitchCheckFail` Reviewed By: ltamasi Differential Revision: D48894795 Pulled By: jowlyzhang fbshipit-source-id: 1f801fe8e1bc8e63ca76c03cbdbd0974e5ff5bf6	2023-09-13 16:34:18 -07:00
jsteemann	0b8b17a9d1	avoid find() -> insert() sequence (#11743 ) Summary: when a key is recorded for locking in a pessimistic transaction, the key is first looked up in a map, and then inserted into the map if it was not already contained. this can be simplified to an unconditional insert. in the ideal case that all keys are unique, this saves all the find() operations. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11743 Reviewed By: anand1976 Differential Revision: D48656798 Pulled By: ajkr fbshipit-source-id: d0150de2db757e0c05e1797cfc24380790c71276	2023-08-29 18:34:59 -07:00
Changyu Bi	76ed9a3990	Add missing status check when compiling with `ASSERT_STATUS_CHECKED=1` (#11686 ) Summary: It seems the flag `-fno-elide-constructors` is incorrectly overwritten in Makefile by `9c2ebcc2c3/Makefile (L243)` Applying the change in PR https://github.com/facebook/rocksdb/issues/11675 shows a lot of missing status checks. This PR adds the missing status checks. Most of changes are just adding asserts in unit tests. I'll add pr comment around more interesting changes that need review. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11686 Test Plan: change Makefile as in https://github.com/facebook/rocksdb/issues/11675, and run `ASSERT_STATUS_CHECKED=1 TEST_UINT128_COMPAT=1 ROCKSDB_MODIFY_NPHASH=1 LIB_MODE=static OPT="-DROCKSDB_NAMESPACE=alternative_rocksdb_ns" make V=1 -j24 J=24 check` Reviewed By: hx235 Differential Revision: D48176132 Pulled By: cbi42 fbshipit-source-id: 6758946cfb1c6ff84c4c1e0ca540d05e6fc390bd	2023-08-09 15:46:44 -07:00
Yu Zhang	c751583c03	Set default cf ts sz for a reused transaction (#11685 ) Summary: Set up the default column family timestamp size for a reused write committed transaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11685 Test Plan: Added unit test. Reviewed By: ltamasi Differential Revision: D48195129 Pulled By: jowlyzhang fbshipit-source-id: 54faa900c123fc6daa412c01490e36c10a24a678	2023-08-09 13:49:42 -07:00
Hui Xiao	9a034801ce	Group rocksdb.sst.read.micros stat by different user read IOActivity + misc (#11444 ) Summary: Context/Summary: - Similar to https://github.com/facebook/rocksdb/pull/11288 but for user read such as `Get(), MultiGet(), DBIterator::XXX(), Verify(File)Checksum()`. - For this, I refactored some user-facing `MultiGet` calls in `TransactionBase` and various types of `DB` so that it does not call a user-facing `Get()` but `GetImpl()` for passing the `ReadOptions::io_activity` check (see PR conversation) - New user read stats breakdown are guarded by `kExceptDetailedTimers` since measurement shows they have 4-5% regression to the upstream/main. - Misc - More refactoring: with https://github.com/facebook/rocksdb/pull/11288, we complete passing `ReadOptions/IOOptions` to FS level. So we can now replace the previously [added](https://github.com/facebook/rocksdb/pull/9424) `rate_limiter_priority` parameter in `RandomAccessFileReader`'s `Read/MultiRead/Prefetch()` with `IOOptions::rate_limiter_priority` - Also, `ReadAsync()` call time is measured in `SST_READ_MICRO` now Pull Request resolved: https://github.com/facebook/rocksdb/pull/11444 Test Plan: - CI fake db crash/stress test - Microbenchmarking Build `make clean && ROCKSDB_NO_FBCODE=1 DEBUG_LEVEL=0 make -jN db_basic_bench` - google benchmark version: `604f6fd3f4` - db_basic_bench_base: upstream - db_basic_bench_pr: db_basic_bench_base + this PR - asyncread_db_basic_bench_base: upstream + [db basic bench patch for IteratorNext](https://github.com/facebook/rocksdb/compare/main...hx235:rocksdb:micro_bench_async_read) - asyncread_db_basic_bench_pr: asyncread_db_basic_bench_base + this PR Test Get ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_{null_stat\|base\|pr} --benchmark_filter=DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/negative_query:0/enable_filter:0/mmap:1/threads:1 --benchmark_repetitions=1000 ``` Result ``` Coming soon ``` AsyncRead ``` TEST_TMPDIR=/dev/shm ./asyncread_db_basic_bench_{base\|pr} --benchmark_filter=IteratorNext/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/async_io:1/include_detailed_timers:0 --benchmark_repetitions=1000 > syncread_db_basic_bench_{base\|pr}.out ``` Result ``` Base: 1956,1956,1968,1977,1979,1986,1988,1988,1988,1990,1991,1991,1993,1993,1993,1993,1994,1996,1997,1997,1997,1998,1999,2001,2001,2002,2004,2007,2007,2008, PR (2.3% regression, due to measuring `SST_READ_MICRO` that wasn't measured before): 1993,2014,2016,2022,2024,2027,2027,2028,2028,2030,2031,2031,2032,2032,2038,2039,2042,2044,2044,2047,2047,2047,2048,2049,2050,2052,2052,2052,2053,2053, ``` Reviewed By: ajkr Differential Revision: D45918925 Pulled By: hx235 fbshipit-source-id: 58a54560d9ebeb3a59b6d807639692614dad058a	2023-08-08 17:26:50 -07:00
Jay Huh	9a2a6db2a9	Use C++17 [[fallthrough]] in transaction_test.cc (#11663 ) Summary: (Copied from https://www.internalfb.com/diff/D46606060) This diff makes its files safe for use with -Wimplicit-fallthrough. Now that we're using C+20 there's no reason not to use this C++17 feature to make our code safer. It's currently possible to write code like this: ``` switch(x){ case 1: foo1(); case 2: foo2(); break; case 3: foo3(); } ``` But that's scary because we don't know whether the fallthrough from case 1 was intentional or not. The -Wimplicit-fallthrough flag will make this an error. The solution is to either fix the bug by inserting break or indicating intention by using [[fallthrough]]; (from C++17). ``` switch(x){ case 1: foo1(); [[fallthrough]]; // Solution if we intended to fallthrough break; // Solution if we did not intend to fallthrough case 2: foo2(); break; case 3: foo3(); } ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11663 Test Plan: Existing tests Reviewed By: jowlyzhang Differential Revision: D47961248 Pulled By: jaykorean fbshipit-source-id: 0d374c721bf1b328c14949dc5c17693da7311d03	2023-08-01 14:49:06 -07:00
Peter Dillinger	bb8fcc0044	db_stress: Reinstate Transaction::Rollback() calls before destruction (#11656 ) Summary: https://github.com/facebook/rocksdb/issues/11653 broke some crash tests. Apparently these Rollbacks are needed for pessimistic transaction cases. (I'm still not sure if the API makes any sense with regard to safe usage. It's certainly not documented. Will consider in follow-up PRs.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11656 Test Plan: manual crash test runs, crash_test_with_multiops_wc_txn and crash_test_with_multiops_wp_txn Reviewed By: cbi42 Differential Revision: D47906280 Pulled By: pdillinger fbshipit-source-id: d058a01b6dbb47a4f08d199e335364168304f81b	2023-07-30 17:30:01 -07:00
Peter Dillinger	b3c54186ab	Allow TryAgain in db_stress with optimistic txn, and refactoring (#11653 ) Summary: In rare cases, optimistic transaction commit returns TryAgain. This change tolerates that intentional behavior in db_stress, up to a small limit in a row. This way, we don't miss a possible regression with excessive TryAgain, and trying again (rolling back the transaction) should have a well renewed chance of success as the writes will be associated with fresh sequence numbers. Also, some of the APIs were not clear about Transaction semantics, so I have clarified: * (Best I can tell....) Destroying a Transaction is safe without calling Rollback() (or at least should be). I don't know why it's a common pattern in our test code and examples to rollback before unconditional destruction. Stress test updated not to call Rollback unnecessarily (to test safe destruction). * Despite essentially doing what is asked, simply trying Commit() again when it returns TryAgain does not have a chance of success, because of the transaction being bound to the DB state at the time of operations before Commit. Similar logic applies to Busy AFAIK. Commit() API comments updated, and expanded unit test in optimistic_transaction_test. Also also, because I can't stop myself, I refactored a good portion of the transaction handling code in db_stress. * Avoid existing and new copy-paste for most transaction interactions with a new ExecuteTransaction (higher-order) function. * Use unique_ptr (nicely complements removing unnecessary Rollbacks) * Abstract out a pattern for safely calling std::terminate() and use it in more places. (The TryAgain errors we saw did not have stack traces because of "terminate called recursively".) Intended follow-up: resurrect use of `FLAGS_rollback_one_in` but also include non-trivial cases Pull Request resolved: https://github.com/facebook/rocksdb/pull/11653 Test Plan: this is the test :) Also, temporarily bypassed the new retry logic and boosted the chance of hitting TryAgain. Quickly reproduced the TryAgain error. Then re-enabled the new retry logic, and was not able to hit the error after running for tens of minutes, even with the boosted chances. Reviewed By: cbi42 Differential Revision: D47882995 Pulled By: pdillinger fbshipit-source-id: 21eadb1525423340dbf28d17cf166b9583311a0d	2023-07-28 16:25:29 -07:00
Peter Dillinger	05a1d52e77	Use FaultInjectionTestFS in transaction_test, clarify Close() APIs (#11499 ) Summary: ... instead of race-condition-laden FaultInjectionTestEnv. See https://app.circleci.com/pipelines/github/facebook/rocksdb/27912/workflows/4c63e5a8-597e-439d-8c7e-82308056af02/jobs/609648 and similar PR https://github.com/facebook/rocksdb/issues/11271 Had to fix the semantics of FaultInjectionTestFS Close() operations to allow a non-OK Close() to fulfill the obligation to close before destruction. To me, this is the obvious choice of Close contract, because what is the caller supposed to do if Close() fails and they still have an obligation to successfully close before object destruction? Call Close() in an infinite loop? Leak the object? I have added API comments to the Env and Filesystem Close() functions to clarify the contracts. Note that `DB::Close()` has one exception to this kind of Close contract, but it is clearly described in API comments and it is really only for catching programming mistakes, not for dealing with exogenous errors. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11499 Test Plan: watch CI Reviewed By: jowlyzhang Differential Revision: D46375708 Pulled By: pdillinger fbshipit-source-id: 03d4d8251e5df50a82ecd139f7e83f613015fe40	2023-06-21 23:38:54 -07:00
Jay Huh	81aeb15988	Add WaitForCompact with WaitForCompactOptions to public API (#11436 ) Summary: Context: This is the first PR for WaitForCompact() Implementation with WaitForCompactOptions. In this PR, we are introducing `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` in the public API. This currently utilizes the existing internal `WaitForCompact()` implementation (with default abort_on_pause = false). `abort_on_pause` has been moved to `WaitForCompactOptions&`. In the later PRs, we will introduce the following two options in `WaitForCompactOptions` 1. `bool flush = false` by default - If true, flush before waiting for compactions to finish. Must be set to true to ensure no immediate compactions (except perhaps periodic compactions) after closing and re-opening the DB. 2. `bool close_db = false` by default - If true, will also close the DB upon compactions finishing. 1. struct `WaitForCompactOptions` added to options.h and `abort_on_pause` in the internal API moved to the option struct. 2. `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` introduced in `db.h` 3. Changed the internal WaitForCompact() to `WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` and checks for the `abort_on_pause` inside the option. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11436 Test Plan: Following tests added - `DBCompactionTest::WaitForCompactWaitsOnCompactionToFinish` - `DBCompactionTest::WaitForCompactAbortOnPauseAborted` - `DBCompactionTest::WaitForCompactContinueAfterPauseNotAborted` - `DBCompactionTest::WaitForCompactShutdownWhileWaiting` - `TransactionTest::WaitForCompactAbortOnPause` NOTE: `TransactionTest::WaitForCompactAbortOnPause` was added to use `StackableDB` to ensure the wrapper function is in place. Reviewed By: pdillinger Differential Revision: D45799659 Pulled By: jaykorean fbshipit-source-id: b5b58f95957f2ab47d1221dee32a61d6cdc4685b	2023-05-25 17:25:51 -07:00
Peter Dillinger	17bc27741f	Improve memory efficiency of many OptimisticTransactionDBs (#11439 ) Summary: Currently it's easy to use a ton of memory with many small OptimisticTransactionDB instances, because each one by default allocates a million mutexes (40 bytes each on my compiler) for validating transactions. It even puts a lot of pressure on the allocator by allocating each one individually! In this change: * Create a new object and option that enables sharing these buckets of mutexes between instances. This is generally good for load balancing potential contention as various DBs become hotter or colder with txn writes. About the only cases where this sharing wouldn't make sense (e.g. each DB usually written by one thread) are cases that would be better off with OccValidationPolicy::kValidateSerial which doesn't use the buckets anyway. * Allocate the mutexes in a contiguous array, for efficiency * Add an option to ensure the mutexes are cache-aligned. In several other places we use cache-aligned mutexes but OptimisticTransactionDB historically does not. It should be a space-time trade-off the user can choose. * Provide some visibility into the memory used by the mutex buckets with an ApproximateMemoryUsage() function (also used in unit testing) * Share code with other users of "striped" mutexes, appropriate refactoring for customization & efficiency (e.g. using FastRange instead of modulus) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11439 Test Plan: unit tests added. Ran sized-up versions of stress test in unit test, including a before-and-after performance test showing no consistent difference. (NOTE: OptimisticTransactionDB not currently covered by db_stress!) Reviewed By: ltamasi Differential Revision: D45796393 Pulled By: pdillinger fbshipit-source-id: ae2b3a26ad91ceeec15debcdc63ff48df6736a54	2023-05-24 11:57:15 -07:00
Yu Zhang	ffb5f1f445	Refactor WriteUnpreparedStressTest to be a unit test (#11424 ) Summary: This patch remove the "stress" aspect from the WriteUnpreparedStressTest and leave it to be a unit test for some correctness testing w.r.t. snapshot functionality. I added some read-your-write verification to the transaction test in db_stress. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11424 Test Plan: `./write_unprepared_transaction_test` `./db_crashtest.py whitebox --txn` `./db_crashtest.py blackbox --txn` Reviewed By: hx235 Differential Revision: D45551521 Pulled By: jowlyzhang fbshipit-source-id: 20c3d510eb4255b08ddd7b6c85bdb4945436f6e8	2023-05-22 12:31:52 -07:00
Jay Huh	586d78b31e	Remove wait_unscheduled from waitForCompact internal API (#11443 ) Summary: Context: In pull request https://github.com/facebook/rocksdb/issues/11436, we are introducing a new public API `waitForCompact(const WaitForCompactOptions& wait_for_compact_options)`. This API invokes the internal implementation `waitForCompact(bool wait_unscheduled=false)`. The unscheduled parameter indicates the compactions that are not yet scheduled but are required to process items in the queue. In certain cases, we are unable to wait for compactions, such as during a shutdown or when background jobs are paused. It is important to return the appropriate status in these scenarios. For all other cases, we should wait for all compaction and flush jobs, including the unscheduled ones. The primary purpose of this new API is to wait until the system has resolved its compaction debt. Currently, the usage of `wait_unscheduled` is limited to test code. This pull request eliminates the usage of wait_unscheduled. The internal `waitForCompact()` API now waits for unscheduled compactions unless the db is undergoing a shutdown. In the event of a shutdown, the API returns `Status::ShutdownInProgress()`. Additionally, a new parameter, `abort_on_pause`, has been introduced with a default value of `false`. This parameter addresses the possibility of waiting indefinitely for unscheduled jobs if `PauseBackgroundWork()` was called before `waitForCompact()` is invoked. By setting `abort_on_pause` to `true`, the API will immediately return `Status::Aborted`. Furthermore, all tests that previously called `waitForCompact(true)` have been fixed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11443 Test Plan: Existing tests that involve a shutdown in progress: - DBCompactionTest::CompactRangeShutdownWhileDelayed - DBTestWithParam::PreShutdownMultipleCompaction - DBTestWithParam::PreShutdownCompactionMiddle Reviewed By: pdillinger Differential Revision: D45923426 Pulled By: jaykorean fbshipit-source-id: 7dc93fe6a6841a7d9d2d72866fa647090dba8eae	2023-05-17 18:13:50 -07:00
Hui Xiao	151242ce46	Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288 ) Summary: Context: The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them. Summary - Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros` - Fixed the default histogram in `RandomAccessFileReader` - New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader` - Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288 Test Plan: - Stress test - Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's. (without blob) - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads. ``` ./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10) ``` ``` // BlockBasedTable rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805 rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116 rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689 // PlainTable Does not apply ``` - Db bench 2: performance Read SETUP: db with 900 files ``` ./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none ```run till convergence ``` ./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3 ``` Pre-change `readrandom [AVG 60 runs] : 21568 (± 248) ops/sec` Post-change (no regression, -0.3%) `readrandom [AVG 60 runs] : 21486 (± 236) ops/sec` Compaction/Flushrun till convergence ``` ./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none rocksdb.sst.read.micros COUNT : 33820 rocksdb.sst.read.flush.micros COUNT : 1800 rocksdb.sst.read.compaction.micros COUNT : 32020 ``` Pre-change `fillseq [AVG 46 runs] : 1391 (± 214) ops/sec; 0.7 (± 0.1) MB/sec` Post-change (no regression, ~-0.4%) `fillseq [AVG 46 runs] : 1385 (± 216) ops/sec; 0.7 (± 0.1) MB/sec` Reviewed By: ajkr Differential Revision: D44007011 Pulled By: hx235 fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132	2023-04-21 09:07:18 -07:00
Peter Dillinger	390cc0b156	Ensure LockWAL() stall cleared for UnlockWAL() return (#11172 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/11160 By counting the number of stalls placed on a write queue, we can check in UnlockWAL() whether the stall present at the start of UnlockWAL() has been cleared by the end, or wait until it's cleared. More details in code comments and new unit test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11172 Test Plan: unit test added. Yes, it uses sleep to amplify failure on buggy behavior if present, but using a sync point to only allow new behavior would fail with the old code only because it doesn't contain the new sync point. Basically, using a sync point in UnlockWAL() could easily mask a regression by artificially limiting key behaviors. The test would only check that UnlockWAL() invokes code that should do the right thing, without checking that it does the right thing. Reviewed By: ajkr Differential Revision: D42894341 Pulled By: pdillinger fbshipit-source-id: 15c9da0ca383e6aec845b29f5447d76cecbf46c3	2023-02-03 12:08:37 -08:00
sdong	4720ba4391	Remove RocksDB LITE (#11147 ) Summary: We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support. Most of changes were done through following comments: unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE \| egrep '[.](cc\|h)'` by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147 Test Plan: See CI Reviewed By: pdillinger Differential Revision: D42796341 fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2	2023-01-27 13:14:19 -08:00
Peter Dillinger	546e213c4f	Fix DelayWrite() calls for two_write_queues (#11130 ) Summary: PR https://github.com/facebook/rocksdb/issues/11020 fixed a case where it was easy to deadlock the DB with LockWAL() but introduced a bug showing up as a rare assertion failure in the stress test. Specifically, `assert(w->state == STATE_INIT)` in `WriteThread::LinkOne()` called from `BeginWriteStall()`, `DelayWrite()`, `WriteImplWALOnly()`. I haven't been about to generate a unit test that reproduces this failure but I believe the root cause is that DelayWrite() was never meant to be re-entrant, only called from the DB's write_thread_ leader. https://github.com/facebook/rocksdb/issues/11020 introduced a call to DelayWrite() from the nonmem_write_thread_ group leader. This fix is to make DelayWrite() apply to the specific write queue that it is being called from (inject a dummy write stall entry to the head of the appropriate write queue). WriteController is re-entrant, based on polling and state changes signalled with bg_cv_, so can manage stalling two queues. The only anticipated complication (called out by Andrew in previous PR) is that we don't want timed write delays being injected in parallel for the two queues, because that dimishes the intended throttling effect. Thus, we only allow timed delays for the primary write queue. HISTORY not updated because this is intended for the same release where the bug was introduced. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11130 Test Plan: Although I was not able to reproduce the assertion failure, I was able to reproduce a distinct flaw with what I believe is the same root cause: a kind of deadlock if both write queues need to wake up from stopped writes. Only one will be waiting on bg_cv_ (the other waiting in `LinkOne()` for the write queue to open up), so a single SignalAll() will only unblock one of the queues, with the other re-instating the stop until another signal on bg_cv_. A simple unit test is added for this case. Will also run crash_test_with_multiops_wc_txn for a while looking for issues. Reviewed By: ajkr Differential Revision: D42749330 Pulled By: pdillinger fbshipit-source-id: 4317dd899a93d57c26fd5af7143038f82d4d4d1b	2023-01-25 14:18:27 -08:00
Wenlong Zhang	1cfe3528a2	support loongarch64 for rocksdb (#10036 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10036 Reviewed By: hx235 Differential Revision: D42424074 Pulled By: ajkr fbshipit-source-id: 004adb75005a26bd01c5d568d1ec6ac442cd59dd	2023-01-13 08:42:44 -08:00
Yanqin Jin	c93ba7db5d	Revise LockWAL/UnlockWAL implementation (#11020 ) Summary: RocksDB has two public APIs: `DB::LockWAL()`/`DB::UnlockWAL()`. The current implementation acquires and releases the internal `DBImpl::log_write_mutex_`. According to the comment on `DBImpl::log_write_mutex_`: https://github.com/facebook/rocksdb/blob/7.8.fb/db/db_impl/db_impl.h#L2287:L2288 > Note: to avoid dealock, if needed to acquire both log_write_mutex_ and mutex_, the order should be first mutex_ and then log_write_mutex_. This puts limitations on how applications can use the `LockWAL()` API. After `LockWAL()` returns ok, then application should not perform any operation that acquires `mutex_`. Currently, the use case of `LockWAL()` is MyRocks implementing the MySQL storage engine handlerton `lock_hton_log` interface. The operation that MyRocks performs after `LockWAL()` is `GetSortedWalFiless()` which not only acquires mutex_, but also `log_write_mutex_`. There are two issues: 1. Applications using these two APIs may hang if one thread calls `GetSortedWalFiles()` after calling `LockWAL()` because log_write_mutex is not recursive. 2. Two threads may dead lock due to lock order inversion. To fix these issues, we can modify the implementation of LockWAL so that it does not keep `log_write_mutex_` held until UnlockWAL. To achieve the goal of locking the WAL, we can instead manually inject a write stall so that all future writes will be stopped. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11020 Test Plan: make check Reviewed By: ajkr Differential Revision: D41785203 Pulled By: riversand963 fbshipit-source-id: 5ccb7a9c6eb9a2c3fa80fd2c399cc2568b8f89ce	2022-12-13 21:45:00 -08:00
Yanqin Jin	75aca74017	Replace member variable lambda with methods (#10924 ) Summary: In transaction unit tests, replace a few member variable lambdas with non-static methods. It's easier for gdb to work with variables in methods than in lambdas. (Seen similar things to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86675). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10924 Test Plan: make check Reviewed By: jay-zhuang Differential Revision: D41072241 Pulled By: riversand963 fbshipit-source-id: e4fa491de573c4656225a86a75af926c1df827f6	2022-11-07 12:31:48 -08:00
Yanqin Jin	0547cecb81	Reduce access to atomic variables in a test (#10909 ) Summary: With TSAN build on CircleCI (see mini-tsan in .circleci/config). Sometimes `SeqAdvanceConcurrentTest.SeqAdvanceConcurrent` will get stuck when an experimental feature called "unordered write" is enabled. Stack trace will be the following ``` Thread 7 (Thread 0x7f2284a1c700 (LWP 481523) "write_prepared_"): #0 0x00000000004fa3f5 in __tsan_atomic64_load () at ./db/merge_context.h:15 https://github.com/facebook/rocksdb/issues/1 0x00000000005e5942 in std::__atomic_base<unsigned long>::load (this=0x7b74000012f8, __m=std::memory_order_seq_cst) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:481 https://github.com/facebook/rocksdb/issues/2 std::__atomic_base<unsigned long>::operator unsigned long (this=0x7b74000012f8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:341 https://github.com/facebook/rocksdb/issues/3 0x00000000005bf001 in rocksdb::SeqAdvanceConcurrentTest_SeqAdvanceConcurrent_Test::TestBody()::$_9::operator()(void) const (this=0x7b14000085e8) at utilities/transactions/write_prepared_transaction_test.cc:1702 Thread 6 (Thread 0x7f228421b700 (LWP 481521) "write_prepared_"): #0 0x000000000052178c in __tsan::MetaMap::GetAndLock(__tsan::ThreadState, unsigned long, unsigned long, bool, bool) () at ./db/merge_context.h:15 https://github.com/facebook/rocksdb/issues/1 0x00000000004fa48e in __tsan_atomic64_load () at ./db/merge_context.h:15 https://github.com/facebook/rocksdb/issues/2 0x00000000005e5942 in std::__atomic_base<unsigned long>::load (this=0x7b74000012f8, __m=std::memory_order_seq_cst) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:481 https://github.com/facebook/rocksdb/issues/3 std::__atomic_base<unsigned long>::operator unsigned long (this=0x7b74000012f8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:341 https://github.com/facebook/rocksdb/issues/4 0x00000000005bf001 in rocksdb::SeqAdvanceConcurrentTest_SeqAdvanceConcurrent_Test::TestBody()::$_9::operator()(void*) const (this=0x7b14000085e8) at utilities/transactions/write_prepared_transaction_test.cc:1702 ``` This is problematic and suspicious. Two threads will get stuck in the same place trying to load from an atomic variable. https://github.com/facebook/rocksdb/blob/7.8.fb/utilities/transactions/write_prepared_transaction_test.cc#L1694:L1707. Not sure why two threads can reach the same point. The stack trace shows that there may be a deadlock, since the two threads are on the same write thread (one is doing Prepare, while the other is trying to commit). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10909 Test Plan: On CircleCI mini-tsan, apply a patch first so that we have a higher chance of hitting the same problematic situation, ``` diff --git a/utilities/transactions/write_prepared_transaction_test.cc b/utilities/transactions/write_prepared_transaction_test.cc index 4bc1f3744..bd5dc4924 100644 --- a/utilities/transactions/write_prepared_transaction_test.cc +++ b/utilities/transactions/write_prepared_transaction_test.cc @@ -1714,13 +1714,13 @@ TEST_P(SeqAdvanceConcurrentTest, SeqAdvanceConcurrent) { size_t d = (n % base[bi + 1]) / base[bi]; switch (d) { case 0: - threads.emplace_back(txn_t0, bi); + threads.emplace_back(txn_t3, bi); break; case 1: - threads.emplace_back(txn_t1, bi); + threads.emplace_back(txn_t3, bi); break; case 2: - threads.emplace_back(txn_t2, bi); + threads.emplace_back(txn_t3, bi); break; case 3: threads.emplace_back(txn_t3, bi); ``` then build and run tests ``` COMPILE_WITH_TSAN=1 CC=clang-13 CXX=clang++-13 ROCKSDB_DISABLE_ALIGNED_NEW=1 USE_CLANG=1 make V=1 -j32 check gtest-parallel -r 100 ./write_prepared_transaction_test --gtest_filter=TwoWriteQueues/SeqAdvanceConcurrentTest.SeqAdvanceConcurrent/19 ``` In the above, `SeqAdvanceConcurrent/19`. The tests 10 to 19 correspond to unordered write in which Prepare() and Commit() can both enter the same write thread. Before this PR, there is a high chance of hitting the deadlock. With this PR, no deadlock has been encountered so far. Reviewed By: ltamasi Differential Revision: D40869387 Pulled By: riversand963 fbshipit-source-id: 81e82a70c263e4f3417597a201b081ee54f1deab	2022-11-02 14:54:58 -07:00
Yanqin Jin	7d26e4c5a3	Basic Support for Merge with user-defined timestamp (#10819 ) Summary: This PR implements the originally disabled `Merge()` APIs when user-defined timestamp is enabled. Simplest usage: ```cpp // assume string append merge op is used with '.' as delimiter. // ts1 < ts2 db->Put(WriteOptions(), "key", ts1, "v0"); db->Merge(WriteOptions(), "key", ts2, "1"); ReadOptions ro; ro.timestamp = &ts2; db->Get(ro, "key", &value); ASSERT_EQ("v0.1", value); ``` Some code comments are added for clarity. Note: support for timestamp in `DB::GetMergeOperands()` will be done in a follow-up PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10819 Test Plan: make check Reviewed By: ltamasi Differential Revision: D40603195 Pulled By: riversand963 fbshipit-source-id: f96d6f183258f3392d80377025529f7660503013	2022-10-31 22:28:58 -07:00
Yanqin Jin	900f79126d	Pass `const LockInfo&` to AcquireLocked() and AcquireWithTimeout (#10874 ) Summary: The motivation and benefit of current behavior of passing `LockInfo&&` as argument to AcquireLocked() and AcquireWithTimeout() is not clear to me. Furthermore, in AcquireWithTimeout(), we access members of `LockInfo&&` after it is passed to AcquireLocked() as rvalue ref. In addition, we may call `AcquireLocked()` with `std::move(lock_info)` multiple times. This leads to linter warning of use-after-move. If future implementation of AcquireLocked() does something like moving-construct a new `LockedInfo` using the passed-in `LockInfo&&`, then the caller cannot use it because `LockInfo` has a member of type `autovector`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10874 Test Plan: make check Reviewed By: ltamasi Differential Revision: D40704210 Pulled By: riversand963 fbshipit-source-id: 20091df65b4fc63b072bcec9809efc49955d6d35	2022-10-28 14:05:12 -07:00
Yanqin Jin	95a1935cb1	Run clang-format on utilities/transactions (#10871 ) Summary: This PR is the result of running the following command ``` find ./utilities/transactions/ -name '.cc' -o -name '.h' -o -name '.c' -o -name '.hpp' -o -name '*.cpp' \| xargs clang-format -i ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10871 Test Plan: make check Reviewed By: cbi42 Differential Revision: D40686871 Pulled By: riversand963 fbshipit-source-id: 613738d667ec8f8e13cce4802e0e166d6be52211	2022-10-25 14:15:22 -07:00
Peter Dillinger	e466173d5c	Print stack traces on frozen tests in CI (#10828 ) Summary: Instead of existing calls to ps from gnu_parallel, call a new wrapper that does ps, looks for unit test like processes, and uses pstack or gdb to print thread stack traces. Also, using `ps -wwf` instead of `ps -wf` ensures output is not cut off. For security, CircleCI runs with security restrictions on ptrace (/proc/sys/kernel/yama/ptrace_scope = 1), and this change adds a work-around to `InstallStackTraceHandler()` (only used by testing tools) to allow any process from the same user to debug it. (I've also touched >100 files to ensure all the unit tests call this function.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10828 Test Plan: local manual + temporary infinite loop in a unit test to observe in CircleCI Reviewed By: hx235 Differential Revision: D40447634 Pulled By: pdillinger fbshipit-source-id: 718a4c4a5b54fa0f9af2d01a446162b45e5e84e1	2022-10-18 00:35:35 -07:00
Yanqin Jin	07249fea8f	Fix DBImpl::GetLatestSequenceForKey() for Merge (#10724 ) Summary: Currently, without this fix, DBImpl::GetLatestSequenceForKey() may not return the latest sequence number for merge operands of the key. This can cause conflict checking during optimistic transaction commit phase to fail. Fix it by always returning the latest sequence number of the key, also considering range tombstones. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10724 Test Plan: make check Reviewed By: cbi42 Differential Revision: D39756847 Pulled By: riversand963 fbshipit-source-id: 0764c3dd4cb24960b37e18adccc6e7feed0e6876	2022-09-23 17:29:05 -07:00
Yanqin Jin	832fd644fc	Reset pessimistic transaction's read/commit timestamps during Initialize() (#10677 ) Summary: RocksDB allows reusing old `Transaction` objects when creating new ones. Therefore, we need to reset the transaction's read and commit timestamps back to default values `kMaxTxnTimestamp`. Otherwise, `CommitAndTryCreateSnapshot()` may fail with "Status::InvalidArgument("Different commit ts specified")". Pull Request resolved: https://github.com/facebook/rocksdb/pull/10677 Test Plan: make check Reviewed By: ltamasi Differential Revision: D39513543 Pulled By: riversand963 fbshipit-source-id: bea01cac149bff3a23a2978fc0c3b198243a6291	2022-09-14 18:28:21 -07:00
sdong	bc575c614c	Fix two extra headers (#10525 ) Summary: Fix copyright for two more extra headers to make internal tool happy. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10525 Reviewed By: jay-zhuang Differential Revision: D38661390 fbshipit-source-id: ab2d055bfd145dfe82b5bae7a6c25cc338c8de94	2022-08-12 15:54:35 -07:00

1 2 3 4 5 ...

470 Commits