Commit graph

475 commits

Author SHA1 Message Date
Andrew Kryczka 8897bf2d04 Drop unsynced data in TestFSWritableFile::Close() (#12528)
Summary:
Our `FileSystem` for simulating unsynced data loss should not sync during `Close()` because it masks bugs where we forgot to sync as long as we closed the file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12528

Test Plan:
Peeled back https://github.com/facebook/rocksdb/issues/10560 fix and verified it is caught much faster now (few seconds vs. ???) with command like

```
$ TEST_TMPDIR=./ python3 tools/db_crashtest.py blackbox --disable_wal=0 --max_key=1000 --write_buffer_size=131072 --max_bytes_for_level_base=524288 --target_file_size_base=131072 --interval=3 --sync_fault_injection=1 --enable_blob_files=0 --manual_wal_flush_one_in=10 --sync_wal_one_in=0 --get_live_files_one_in=0 --get_sorted_wal_files_one_in=0 --backup_one_in=0 --checkpoint_one_in=0 --write_fault_one_in=0 --read_fault_one_in=0 --open_write_fault_one_in=0 --compact_range_one_in=0 --compact_files_one_in=0 --open_read_fault_one_in=0 --get_property_one_in=0 --writepercent=100 -readpercent=0 -prefixpercent=0 -delpercent=0 -delrangepercent=0 -iterpercent=0
```

Reviewed By: anand1976

Differential Revision: D56033250

Pulled By: ajkr

fbshipit-source-id: 6bbf480d79a06c46f08f6214010937f6654af5ca
2024-04-12 09:57:56 -07:00
hasagi 5a86635e12 Fix CreateColumnFamilyWithImport for PessimisticTransactionDB (#12490)
Summary:
When we use the CreateColumnFamilyWithImport interface of PessimisticTransactionDB to create column family, the lack of related information may cause subsequent writes to be unable to find the Column Family ID.

The issue: (https://github.com/facebook/rocksdb/issues/12493)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12490

Reviewed By: jowlyzhang

Differential Revision: D55700343

Pulled By: cbi42

fbshipit-source-id: dc992a3eef433e1193d579cbf58b6ba940fa460d
2024-04-03 10:56:30 -07:00
Richard Barnes 90d61381bf Fix deprecated use of 0/NULL in internal_repo_rocksdb/repo/util/xxhash.h + 5
Summary:
`nullptr` is typesafe. `0` and `NULL` are not. In the future, only `nullptr` will be allowed.

This diff helps us embrace the future _now_ in service of enabling `-Wzero-as-null-pointer-constant`.

Reviewed By: dmm-fb

Differential Revision: D55559752

fbshipit-source-id: 9f1edc836ded919022c4b53722f6f86208fecf8d
2024-04-01 21:20:51 -07:00
Richard Barnes fc40165614 Remove extra semi colon from internal_repo_rocksdb/repo/tools/ldb_cmd.cc
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: palmje

Differential Revision: D54362227

fbshipit-source-id: ac634ba34f9351ba559c4ed96448f51d6ef33175
2024-03-18 18:51:50 -07:00
Levi Tamasi a93edea7e5 Deduplicate WriteBatchWithIndex::{Get,GetEntity}FromBatch (#12442)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12442

The patch deduplicates and unifies the logic of `WriteBatchWithIndex::{Get,GetEntity}FromBatch` using templates and makes some small code hygiene improvements, including consistently clearing the output value in the various non-success cases.

Reviewed By: jaykorean

Differential Revision: D54922935

fbshipit-source-id: c92e89f905a3c80cef57c2c840f49f806629238f
2024-03-18 12:04:54 -07:00
Alan Paxton d9a441113e JNI get_helper code sharing / multiGet() use efficient batch C++ support (#12344)
Summary:
Implement RAII-based helpers for JNIGet() and multiGet()

Replace JNI C++ helpers `rocksdb_get_helper, rocksdb_get_helper_direct`, `multi_get_helper`, `multi_get_helper_direct`, `multi_get_helper_release_keys`, `txn_get_helper`, and `txn_multi_get_helper`.

The model is to entirely do away with a single helper, instead a number of utility methods allow each separate
JNI `Get()` and `MultiGet()` method to organise their parameters efficiently, then call the underlying C++ `db->Get()`,
`db->MultiGet()`, `txn->Get()`, or `txn->MultiGet()` method itself, and use further utilities to retrieve results.

Roughly speaking:

* get keys into C++ form
* Call C++ Get()
* get results and status into Java form

We achieve a useful performance gain as part of this work; by using the updated C++ multiGet we immediately pick up its performance gains (batch improvements to multiGet C++ were previously implemented, but not until now used by Java/JNI). multiGetBB already uses the batched C++ multiGet(), and all other benchmarks show consistent improvement after the changes:

## Before:
```
Benchmark (columnFamilyTestType) (keyCount) (keySize) (multiGetSize) (valueSize) Mode Cnt Score Error Units
MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 256 thrpt 25 5315.459 ± 20.465 ops/s
MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 1024 thrpt 25 5673.115 ± 78.299 ops/s
MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 4096 thrpt 25 2616.860 ± 46.994 ops/s
MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 16384 thrpt 25 1700.058 ± 24.034 ops/s
MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 65536 thrpt 25 791.171 ± 13.955 ops/s
MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 256 thrpt 25 6129.929 ± 94.200 ops/s
MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 1024 thrpt 25 7012.405 ± 97.886 ops/s
MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 4096 thrpt 25 2799.014 ± 39.352 ops/s
MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 16384 thrpt 25 1417.205 ± 22.272 ops/s
MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 65536 thrpt 25 655.594 ± 13.050 ops/s
MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 256 thrpt 25 6147.247 ± 82.711 ops/s
MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 1024 thrpt 25 7004.213 ± 79.251 ops/s
MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 4096 thrpt 25 2715.154 ± 110.017 ops/s
MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 16384 thrpt 25 1408.070 ± 31.714 ops/s
MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 65536 thrpt 25 623.829 ± 57.374 ops/s
MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 256 thrpt 25 6119.243 ± 116.313 ops/s
MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 1024 thrpt 25 6931.873 ± 128.094 ops/s
MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 4096 thrpt 25 2678.253 ± 39.113 ops/s
MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 16384 thrpt 25 1337.384 ± 19.500 ops/s
MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 65536 thrpt 25 625.596 ± 14.525 ops/s
```

## After:
```
Benchmark                                    (columnFamilyTestType)  (keyCount)  (keySize)  (multiGetSize)  (valueSize)   Mode  Cnt     Score     Error  Units
MultiGetBenchmarks.multiGetBB200                   no_column_family       10000       1024             100          256  thrpt   25  5191.074 ±  78.250  ops/s
MultiGetBenchmarks.multiGetBB200                   no_column_family       10000       1024             100         1024  thrpt   25  5378.692 ± 260.682  ops/s
MultiGetBenchmarks.multiGetBB200                   no_column_family       10000       1024             100         4096  thrpt   25  2590.183 ±  34.844  ops/s
MultiGetBenchmarks.multiGetBB200                   no_column_family       10000       1024             100        16384  thrpt   25  1634.793 ±  34.022  ops/s
MultiGetBenchmarks.multiGetBB200                   no_column_family       10000       1024             100        65536  thrpt   25   786.455 ±   8.462  ops/s
MultiGetBenchmarks.multiGetBB200                    1_column_family       10000       1024             100          256  thrpt   25  5285.055 ±  11.676  ops/s
MultiGetBenchmarks.multiGetBB200                    1_column_family       10000       1024             100         1024  thrpt   25  5586.758 ± 213.008  ops/s
MultiGetBenchmarks.multiGetBB200                    1_column_family       10000       1024             100         4096  thrpt   25  2527.172 ±  17.106  ops/s
MultiGetBenchmarks.multiGetBB200                    1_column_family       10000       1024             100        16384  thrpt   25  1819.547 ±  12.958  ops/s
MultiGetBenchmarks.multiGetBB200                    1_column_family       10000       1024             100        65536  thrpt   25   803.861 ±   9.963  ops/s
MultiGetBenchmarks.multiGetBB200                 20_column_families       10000       1024             100          256  thrpt   25  5253.793 ±  28.020  ops/s
MultiGetBenchmarks.multiGetBB200                 20_column_families       10000       1024             100         1024  thrpt   25  5705.591 ±  20.556  ops/s
MultiGetBenchmarks.multiGetBB200                 20_column_families       10000       1024             100         4096  thrpt   25  2523.377 ±  15.415  ops/s
MultiGetBenchmarks.multiGetBB200                 20_column_families       10000       1024             100        16384  thrpt   25  1815.344 ±  11.309  ops/s
MultiGetBenchmarks.multiGetBB200                 20_column_families       10000       1024             100        65536  thrpt   25   820.792 ±   3.192  ops/s
MultiGetBenchmarks.multiGetBB200                100_column_families       10000       1024             100          256  thrpt   25  5262.184 ±  20.477  ops/s
MultiGetBenchmarks.multiGetBB200                100_column_families       10000       1024             100         1024  thrpt   25  5706.959 ±  23.123  ops/s
MultiGetBenchmarks.multiGetBB200                100_column_families       10000       1024             100         4096  thrpt   25  2520.362 ±   9.170  ops/s
MultiGetBenchmarks.multiGetBB200                100_column_families       10000       1024             100        16384  thrpt   25  1789.185 ±  14.239  ops/s
MultiGetBenchmarks.multiGetBB200                100_column_families       10000       1024             100        65536  thrpt   25   818.401 ±  12.132  ops/s
MultiGetBenchmarks.multiGetList10                  no_column_family       10000       1024             100          256  thrpt   25  6978.310 ±  14.084  ops/s
MultiGetBenchmarks.multiGetList10                  no_column_family       10000       1024             100         1024  thrpt   25  7664.242 ±  22.304  ops/s
MultiGetBenchmarks.multiGetList10                  no_column_family       10000       1024             100         4096  thrpt   25  2881.778 ±  81.054  ops/s
MultiGetBenchmarks.multiGetList10                  no_column_family       10000       1024             100        16384  thrpt   25  1599.826 ±   7.190  ops/s
MultiGetBenchmarks.multiGetList10                  no_column_family       10000       1024             100        65536  thrpt   25   737.520 ±   6.809  ops/s
MultiGetBenchmarks.multiGetList10                   1_column_family       10000       1024             100          256  thrpt   25  6974.376 ±  10.716  ops/s
MultiGetBenchmarks.multiGetList10                   1_column_family       10000       1024             100         1024  thrpt   25  7637.440 ±  45.877  ops/s
MultiGetBenchmarks.multiGetList10                   1_column_family       10000       1024             100         4096  thrpt   25  2820.472 ±  42.231  ops/s
MultiGetBenchmarks.multiGetList10                   1_column_family       10000       1024             100        16384  thrpt   25  1716.663 ±   8.527  ops/s
MultiGetBenchmarks.multiGetList10                   1_column_family       10000       1024             100        65536  thrpt   25   755.848 ±   7.514  ops/s
MultiGetBenchmarks.multiGetList10                20_column_families       10000       1024             100          256  thrpt   25  6943.651 ±  20.040  ops/s
MultiGetBenchmarks.multiGetList10                20_column_families       10000       1024             100         1024  thrpt   25  7679.415 ±   9.114  ops/s
MultiGetBenchmarks.multiGetList10                20_column_families       10000       1024             100         4096  thrpt   25  2844.564 ±  13.388  ops/s
MultiGetBenchmarks.multiGetList10                20_column_families       10000       1024             100        16384  thrpt   25  1729.545 ±   5.983  ops/s
MultiGetBenchmarks.multiGetList10                20_column_families       10000       1024             100        65536  thrpt   25   783.218 ±   1.530  ops/s
MultiGetBenchmarks.multiGetList10               100_column_families       10000       1024             100          256  thrpt   25  6944.276 ±  29.995  ops/s
MultiGetBenchmarks.multiGetList10               100_column_families       10000       1024             100         1024  thrpt   25  7670.301 ±   8.986  ops/s
MultiGetBenchmarks.multiGetList10               100_column_families       10000       1024             100         4096  thrpt   25  2839.828 ±  12.421  ops/s
MultiGetBenchmarks.multiGetList10               100_column_families       10000       1024             100        16384  thrpt   25  1730.005 ±   9.209  ops/s
MultiGetBenchmarks.multiGetList10               100_column_families       10000       1024             100        65536  thrpt   25   787.096 ±   1.977  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20        no_column_family       10000       1024             100          256  thrpt   25  6896.944 ±  21.530  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20        no_column_family       10000       1024             100         1024  thrpt   25  7622.407 ±  12.824  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20        no_column_family       10000       1024             100         4096  thrpt   25  2927.538 ±  19.792  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20        no_column_family       10000       1024             100        16384  thrpt   25  1598.041 ±   4.312  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20        no_column_family       10000       1024             100        65536  thrpt   25   744.564 ±   9.236  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20         1_column_family       10000       1024             100          256  thrpt   25  6853.760 ±  78.041  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20         1_column_family       10000       1024             100         1024  thrpt   25  7360.917 ± 355.365  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20         1_column_family       10000       1024             100         4096  thrpt   25  2848.774 ±  13.409  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20         1_column_family       10000       1024             100        16384  thrpt   25  1727.688 ±   3.329  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20         1_column_family       10000       1024             100        65536  thrpt   25   776.088 ±   7.517  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20      20_column_families       10000       1024             100          256  thrpt   25  6910.339 ±  14.366  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20      20_column_families       10000       1024             100         1024  thrpt   25  7633.660 ±  10.830  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20      20_column_families       10000       1024             100         4096  thrpt   25  2787.799 ±  81.775  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20      20_column_families       10000       1024             100        16384  thrpt   25  1726.517 ±   6.830  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20      20_column_families       10000       1024             100        65536  thrpt   25   787.597 ±   3.362  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20     100_column_families       10000       1024             100          256  thrpt   25  6922.445 ±  10.493  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20     100_column_families       10000       1024             100         1024  thrpt   25  7604.710 ±  48.043  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20     100_column_families       10000       1024             100         4096  thrpt   25  2848.788 ±  15.783  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20     100_column_families       10000       1024             100        16384  thrpt   25  1730.837 ±   6.497  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20     100_column_families       10000       1024             100        65536  thrpt   25   794.557 ±   1.869  ops/s
MultiGetBenchmarks.multiGetListRandomCF30          no_column_family       10000       1024             100          256  thrpt   25  6918.716 ±  15.766  ops/s
MultiGetBenchmarks.multiGetListRandomCF30          no_column_family       10000       1024             100         1024  thrpt   25  7626.692 ±   9.394  ops/s
MultiGetBenchmarks.multiGetListRandomCF30          no_column_family       10000       1024             100         4096  thrpt   25  2871.382 ±  72.155  ops/s
MultiGetBenchmarks.multiGetListRandomCF30          no_column_family       10000       1024             100        16384  thrpt   25  1598.786 ±   4.819  ops/s
MultiGetBenchmarks.multiGetListRandomCF30          no_column_family       10000       1024             100        65536  thrpt   25   748.469 ±   7.234  ops/s
MultiGetBenchmarks.multiGetListRandomCF30           1_column_family       10000       1024             100          256  thrpt   25  6922.666 ±  17.131  ops/s
MultiGetBenchmarks.multiGetListRandomCF30           1_column_family       10000       1024             100         1024  thrpt   25  7623.890 ±   8.805  ops/s
MultiGetBenchmarks.multiGetListRandomCF30           1_column_family       10000       1024             100         4096  thrpt   25  2850.698 ±  18.004  ops/s
MultiGetBenchmarks.multiGetListRandomCF30           1_column_family       10000       1024             100        16384  thrpt   25  1727.623 ±   4.868  ops/s
MultiGetBenchmarks.multiGetListRandomCF30           1_column_family       10000       1024             100        65536  thrpt   25   774.534 ±  10.025  ops/s
MultiGetBenchmarks.multiGetListRandomCF30        20_column_families       10000       1024             100          256  thrpt   25  5486.251 ±  13.582  ops/s
MultiGetBenchmarks.multiGetListRandomCF30        20_column_families       10000       1024             100         1024  thrpt   25  4920.656 ±  44.557  ops/s
MultiGetBenchmarks.multiGetListRandomCF30        20_column_families       10000       1024             100         4096  thrpt   25  3922.913 ±  25.686  ops/s
MultiGetBenchmarks.multiGetListRandomCF30        20_column_families       10000       1024             100        16384  thrpt   25  2873.106 ±   4.336  ops/s
MultiGetBenchmarks.multiGetListRandomCF30        20_column_families       10000       1024             100        65536  thrpt   25   802.404 ±   8.967  ops/s
MultiGetBenchmarks.multiGetListRandomCF30       100_column_families       10000       1024             100          256  thrpt   25  4817.996 ±  18.042  ops/s
MultiGetBenchmarks.multiGetListRandomCF30       100_column_families       10000       1024             100         1024  thrpt   25  4243.922 ±  13.929  ops/s
MultiGetBenchmarks.multiGetListRandomCF30       100_column_families       10000       1024             100         4096  thrpt   25  3175.998 ±   7.773  ops/s
MultiGetBenchmarks.multiGetListRandomCF30       100_column_families       10000       1024             100        16384  thrpt   25  2321.990 ±  12.501  ops/s
MultiGetBenchmarks.multiGetListRandomCF30       100_column_families       10000       1024             100        65536  thrpt   25  1753.028 ±   7.130  ops/s
```

Closes https://github.com/facebook/rocksdb/issues/11518

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12344

Reviewed By: cbi42

Differential Revision: D54809714

Pulled By: pdillinger

fbshipit-source-id: bee3b949720abac073bce043b59ce976a11e99eb
2024-03-12 12:42:08 -07:00
Yu Zhang 210c8df820 Use do_validate flag to control timestamp based validation in WriteCommittedTxn::GetForUpdate (#12369)
Summary:
When PR https://github.com/facebook/rocksdb/issues/9629 introduced user-defined timestamp support for `WriteCommittedTxn`, it adds this usage mandate for API `GetForUpdate` when UDT is enabled. The `do_validate` flag has to be true, and user should have already called `Transaction::SetReadTimestampForValidation` to set a read timestamp for validation. The rationale behind this mandate is this:
1) with do_vaildate = true, `GetForUpdate` could verify this relationships: let's denote the user-defined timestamp in db for the key as  `Ts_db` and the read timestamp user set via `Transaction::SetReadTimestampForValidation` as `Ts_read`. UDT based validation will only pass if `Ts_db <= Ts_read`.
5950907a82/utilities/transactions/transaction_util.cc (L141)

2)  Let's denote the committed timestamp set via `Transaction::SetCommitTimestamp` to be `Ts_cmt`. Later `WriteCommitedTxn::Commit` would only pass if this condition is met: `Ts_read < Ts_cmt`. 5950907a82/utilities/transactions/pessimistic_transaction.cc (L431)

Together these two checks can ensure `Ts_db < Ts_cmt` to meet the user-defined timestamp invariant that newer timestamp should have newer sequence number.

The `do_validate` flag was originally intended to make snapshot based validation optional. If it's true, `GetForUpdate` checks no entry is written after the snapshot. If it's false, it will skip this snapshot based validation. In this PR, we are making the UDT based validation configurable too based on this flag instead of mandating it for below reasons: 1) in some cases the users themselves can enforce aformentioned invariant on their side independently, without RocksDB help, for example, if they are managing a monotonically increasing timestamp, and their transactions are only committed in a single thread. So they don't need this UDT based validation and wants to skip it, 2) It also could be expensive or not practical for users to come up with such a read timestamp that is exactly in between their commit timestamp and the db's timestamp. For example, in aformentioned case where a monotonically increasing timestamp is managed, the users would need to access this timestamp both for setting the read timestamp and for setting the commit timestamp. So it's preferable to skip this check too.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12369

Test Plan: added unit tests

Reviewed By: ltamasi

Differential Revision: D54268920

Pulled By: jowlyzhang

fbshipit-source-id: ca7693796f9bb11f376a2059d91841e51c89435a
2024-03-07 14:58:10 -08:00
yuzhangyu@fb.com 1cfdece85d Run internal cpp modernizer on RocksDB repo (#12398)
Summary:
When internal cpp modernizer attempts to format rocksdb code, it will replace macro `ROCKSDB_NAMESPACE`  with its default definition `rocksdb` when collapsing nested namespace. We filed a feedback for the tool T180254030 and the team filed a bug for this: https://github.com/llvm/llvm-project/issues/83452. At the same time, they suggested us to run the modernizer tool ourselves so future auto codemod attempts will be smaller. This diff contains:

Running
`xplat/scripts/codemod_service/cpp_modernizer.sh`
in fbcode/internal_repo_rocksdb/repo (excluding some directories in utilities/transactions/lock/range/range_tree/lib that has a non meta copyright comment)
without swapping out the namespace macro `ROCKSDB_NAMESPACE`

Followed by RocksDB's own
`make format`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12398

Test Plan: Auto tests

Reviewed By: hx235

Differential Revision: D54382532

Pulled By: jowlyzhang

fbshipit-source-id: e7d5b40f9b113b60e5a503558c181f080b9d02fa
2024-03-04 10:08:32 -08:00
Jay Huh c00c16855d Access DBImpl* and CFD* by CFHImpl* in Iterators (#12395)
Summary:
In the current implementation of iterators, `DBImpl*` and `ColumnFamilyData*` are held in `DBIter` and `ArenaWrappedDBIter` for two purposes: tracing and Refresh() API. With the introduction of a new iterator called MultiCfIterator in PR https://github.com/facebook/rocksdb/issues/12153 , which is a cross-column-family iterator that maintains multiple DBIters as child iterators from a consistent database state, we need to make some changes to the existing implementation. The new iterator will still be exposed through the generic Iterator interface with an additional capability to return AttributeGroups (via `attribute_groups()`) which is a list of wide columns grouped by column family. For more information about AttributeGroup, please refer to previous PRs:  https://github.com/facebook/rocksdb/issues/11925 #11943, and https://github.com/facebook/rocksdb/issues/11977.

To be able to return AttributeGroup in the default single CF iterator created, access to `ColumnFamilyHandle*` within `DBIter` is necessary. However, this is not currently available in `DBIter`. Since `DBImpl*` and `ColumnFamilyData*` can be easily accessed via `ColumnFamilyHandleImpl*`, we have decided to replace the pointers to `ColumnFamilyData` and `DBImpl` in `DBIter` with a pointer to `ColumnFamilyHandleImpl`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12395

Test Plan:
# Summary

In the current implementation of iterators, `DBImpl*` and `ColumnFamilyData*` are held in `DBIter` and `ArenaWrappedDBIter` for two purposes: tracing and Refresh() API. With the introduction of a new iterator called MultiCfIterator in PR #12153 , which is a cross-column-family iterator that maintains multiple DBIters as child iterators from a consistent database state, we need to make some changes to the existing implementation. The new iterator will still be exposed through the generic Iterator interface with an additional capability to return AttributeGroups (via `attribute_groups()`) which is a list of wide columns grouped by column family. For more information about AttributeGroup, please refer to previous PRs:  #11925 #11943, and #11977.

To be able to return AttributeGroup in the default single CF iterator created, access to `ColumnFamilyHandle*` within `DBIter` is necessary. However, this is not currently available in `DBIter`. Since `DBImpl*` and `ColumnFamilyData*` can be easily accessed via `ColumnFamilyHandleImpl*`, we have decided to replace the pointers to `ColumnFamilyData` and `DBImpl` in `DBIter` with a pointer to `ColumnFamilyHandleImpl`.

# Test Plan

There should be no behavior changes. Existing tests and CI for the correctness tests.

**Test for Perf Regression**
Build
```
$> make -j64 release
```
Setup
```
$> TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=1000000 -compression_type=none
```
Run
```
TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="newiterator,seekrandom" -cache_size=10485760000
```

Before the change
```
DB path: [/dev/shm/db_bench/dbbench]
newiterator  :       0.552 micros/op 1810157 ops/sec 0.552 seconds 1000000 operations;
DB path: [/dev/shm/db_bench/dbbench]
seekrandom   :       4.502 micros/op 222143 ops/sec 4.502 seconds 1000000 operations; (0 of 1000000 found)
```
After the change
```
DB path: [/dev/shm/db_bench/dbbench]
newiterator  :       0.520 micros/op 1924401 ops/sec 0.520 seconds 1000000 operations;
DB path: [/dev/shm/db_bench/dbbench]
seekrandom   :       4.532 micros/op 220657 ops/sec 4.532 seconds 1000000 operations; (0 of 1000000 found)
```

Reviewed By: pdillinger

Differential Revision: D54332713

Pulled By: jaykorean

fbshipit-source-id: b28d897ad519e58b1ca82eb068a6319544a4fae5
2024-03-01 10:28:20 -08:00
Richard Barnes a4ff83d1b2 Fix deprecated use of 0/NULL in internal_repo_rocksdb/repo/utilities/transactions/lock/range/range_tree/lib/locktree/wfg.cc + 3
Summary:
`nullptr` is typesafe. `0` and `NULL` are not. In the future, only `nullptr` will be allowed.

This diff helps us embrace the future _now_ in service of enabling `-Wzero-as-null-pointer-constant`.

Reviewed By: meyering

Differential Revision: D54163069

fbshipit-source-id: e5bb4b6ee79d82f1437ffed602bdb41dcfc0e59a
2024-02-25 22:17:04 -08:00
anand76 d227276147 Deprecate some variants of Get and MultiGet (#12327)
Summary:
A lot of variants of Get and MultiGet have been added to `include/rocksdb/db.h` over the years. Try to consolidate them by marking variants that don't return timestamps as deprecated. The underlying DB implementation will check and return Status::NotSupported() if it doesn't support returning timestamps and the caller asks for it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12327

Reviewed By: pdillinger

Differential Revision: D53828151

Pulled By: anand1976

fbshipit-source-id: e0b5ca42d32daa2739d5f439a729815a2d4ff050
2024-02-16 09:21:06 -08:00
Peter Dillinger 54cb9c77d9 Prefer static_cast in place of most reinterpret_cast (#12308)
Summary:
The following are risks associated with pointer-to-pointer reinterpret_cast:
* Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do.
* Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally.

I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement:
* Going through `void*` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have
`struct Derived: public Base1, public Base2`.  If we have `Derived*` -> `void*` -> `Base2*` -> `Derived*` through reinterpret casts, this could plausibly work (though technical UB) assuming the `Base2*` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic.
* Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived*` -> `Base2*` -> `Derived*` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance.

With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain.

A couple of related interventions included here:
* Previously Cache::Handle was not actually derived from in the implementations and just used as a `void*` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle.
* Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse).

Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work.

I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308

Test Plan: existing tests, CI

Reviewed By: ltamasi

Differential Revision: D53204947

Pulled By: pdillinger

fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2
2024-02-07 10:44:11 -08:00
Peter Dillinger 76c834e441 Remove 'virtual' when implied by 'override' (#12319)
Summary:
... to follow modern C++ style / idioms.

Used this hack:
```
for FILE in `cat my_list_of_files`; do perl -pi -e 'BEGIN{undef $/;} s/ virtual( [^;{]* override)/$1/smg' $FILE; done
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12319

Test Plan: existing tests, CI

Reviewed By: jaykorean

Differential Revision: D53275303

Pulled By: pdillinger

fbshipit-source-id: bc0881af270aa8ef4d0ae4f44c5a6614b6407377
2024-01-31 13:14:42 -08:00
Radek Hubner 1d8c54aeaa Fix build on OpenBSD i386 (#12142)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12142

Reviewed By: pdillinger

Differential Revision: D53150218

Pulled By: ajkr

fbshipit-source-id: a4c4d9d22d99e8a82d93d1a7ef37ec5326855cb5
2024-01-29 16:19:59 -08:00
Richard Barnes 502a1754c4 Remove extra semi colon from internal_repo_rocksdb/repo/utilities/transactions/lock/range/range_tree/lib/locktree/manager.cc (#12276)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12276

`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: jaykorean

Differential Revision: D52969073

fbshipit-source-id: 1b2495548d939c32e7a89a6424767497fab9550e
2024-01-24 07:25:27 -08:00
Richard Barnes 0797616de0 Remove extra semi colon from internal_repo_rocksdb/repo/utilities/transactions/write_unprepared_txn.h (#12273)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12273

`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: jaykorean

Differential Revision: D52969166

fbshipit-source-id: 129715bfe69735b83b077c7d6cbf1786c1dfc410
2024-01-24 07:24:06 -08:00
Richard Barnes 532c940b79 Remove extra semi colon from internal_repo_rocksdb/repo/utilities/transactions/transaction_base.h (#12272)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12272

`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: jaykorean

Differential Revision: D52969170

fbshipit-source-id: 581304039be789cbce6760740e9557a925e02722
2024-01-24 07:22:45 -08:00
Hui Xiao 06e593376c Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.

For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS

Some related code refactory to make implementation cleaner:
- Blob stats
   - Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS  include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
   - Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
   - TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
   - Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see  https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
   - Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority

## Test
### db bench

Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100

rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```

compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

Run:./db_bench --statistics=1 --benchmarks=compact  --db=../db_bench --use_existing_db=1

rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```

blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB

Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact  --db=../db_bench --use_existing_db=1

pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842

post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164

rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```

```
Stacked Blob DB

Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445

post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164

rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```

###  Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests

###  Performance

Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true

Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,

Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```

Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1  --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark

Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846

Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```

Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1  --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark

Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860

Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910

Reviewed By: ajkr

Differential Revision: D49788060

Pulled By: hx235

fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 15:29:23 -08:00
Peter Dillinger 106058c076 Initial CircleCI -> GitHub Actions migration (#12163)
Summary:
* Largely based on https://github.com/facebook/rocksdb/issues/12085 but grouped into one large workflow because of bad GHA UI design (see comments).
* Windows job details consolidated into an action file so that those jobs can easily move between per-pr-push and nightly.
* Simplify some handling of "CIRCLECI" environment and add "GITHUB_ACTIONS" in the same places
* For jobs that we want to go in pr-jobs or nightly there are disabled "candidate" workflows with draft versions of those jobs.
* ARM jobs are disabled waiting on full GHA support.
* build-linux-java-static needed some special attention to work, due to GLIBC compatibility issues (see comments).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12163

Test Plan:
Nightly jobs can be seen passing between these two links:
https://github.com/facebook/rocksdb/actions/runs/7266835435/job/19799390061?pr=12163
https://github.com/facebook/rocksdb/actions/runs/7269697823/job/19807724471?pr=12163

And per-PR jobs of course passing on this PR.

Reviewed By: hx235

Differential Revision: D52335810

Pulled By: pdillinger

fbshipit-source-id: bbb95196f33eabad8cddf3c6b52f4413c80e034d
2023-12-21 15:40:21 -08:00
Andrew Kryczka 8c568bac61 Sync a source file license from percona/PerconaFT (#12103)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/10478

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12103

Reviewed By: cbi42

Differential Revision: D51623089

Pulled By: ajkr

fbshipit-source-id: 81f88262ed247144ae063a0552e0162db90c0e43
2023-12-18 11:53:27 -08:00
Richard Barnes 4f04f96742 Remove extra semi colon from infrasec/authorization/audit/AclAuditor.cpp
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: palmje

Differential Revision: D51995065

fbshipit-source-id: 9b55a0d8abd0927b76376cb7751bf0fcab10518c
2023-12-08 17:21:52 -08:00
Andrew Kryczka 04cbc77b90 Add missing license to source files (#12083)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/12079.

Fixed missing licenses in "\*.h" and "\*.cc" files

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12083

Reviewed By: cbi42

Differential Revision: D51489634

Pulled By: ajkr

fbshipit-source-id: 764bfee257b9d6603fd7606a55664b7537e1898f
2023-11-21 08:36:30 -08:00
rogertyang 543191f2ea Add bounds checking to WBWIIteratorImpl and respect bounds of ReadOptions in Transaction (#11680)
Summary:
Fix https://github.com/facebook/rocksdb/issues/11607
Fix https://github.com/facebook/rocksdb/issues/11679
Fix https://github.com/facebook/rocksdb/issues/11606
Fix https://github.com/facebook/rocksdb/issues/2343

Add bounds checking to `WBWIIteratorImpl`, which will be reflected in `BaseDeltaIterator::delta_iterator_::Valid()`, just like `BaseDeltaIterator::base_iterator_::Valid()`. In this way, the two sub itertors become more aligned from `BaseDeltaIterator`'s perspective. Like `DBIter`, the added bounds checking caps in either bound when seeking and disvalidates the `WBWIIteratorImpl` iterator when the lower bound is past or the upper bound is reached.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11680

Test Plan:
- A simple test added to write_batch_with_index_test.cc to exercise the bounds checking in `WBWIIteratorImpl`.
- A sophisticated test added to transaction_test.cc to assert that `Transaction` with different write policies honor bounds in `ReadOptions`. It should be so as long as the  `BaseDeltaIterator` is correctly coordinating the two sub iterators to perform iterating and bounds checking.

Reviewed By: ajkr

Differential Revision: D48125229

Pulled By: cbi42

fbshipit-source-id: c9acea52595aed1471a63d7ca6ef15d2a2af1367
2023-10-20 13:28:28 -07:00
Yu Zhang 552bc01669 Surface timestamp from db to the transaction iterator (#11847)
Summary:
Provide an override implementation of `Iterator::timestamp` API for `BaseDeltaIterator` so that timestamp read from DB can be surfaced by an iterator created from inside of a transaction.

The behavior of the API follows this rule:
1) If the entry is read from within the transaction, an empty `Slice` is returned as the timestamp, regardless of whether `Transaction::SetCommitTimestamp` is called.
2) If the entry is read from the DB, the corresponding `DBIter::timestamp()` API's result is returned.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11847

Test Plan:
make all check
add some unit test

Reviewed By: ltamasi

Differential Revision: D49377359

Pulled By: jowlyzhang

fbshipit-source-id: 1511ead262ce3515ee6c6e0f829f1b69a10fe994
2023-09-22 17:28:36 -07:00
Peter (Stig) Edwards bf488c3052 Use *next_sequence -1 here (#11861)
Summary:
To fix off-by-one error:   Transaction could not check for conflicts for operation at SequenceNumber 500000 as the MemTable only contains changes newer than SequenceNumber 500001.

Fixes https://github.com/facebook/rocksdb/issues/11822

I think introduced in a657ee9a9c

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11861

Reviewed By: pdillinger

Differential Revision: D49457273

Pulled By: ajkr

fbshipit-source-id: b527cbae4ecc7874633a11f07027adee62940d74
2023-09-21 13:52:01 -07:00
Changyu Bi 051cad3867 Fix CI failure due to transaction_test (#11843)
Summary:
Test ` build-linux-static_lib-alt_namespace-status_checked` has been failing in main branch.

```
utilities/transactions/transaction_test.cc:6777:3: error: 'rocksdb' has not been declared
 6777 |   rocksdb::GetMergeOperandsOptions mergeOperandOptions;
      |   ^~~~~~~
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11843

Test Plan: `ASSERT_STATUS_CHECKED=1 TEST_UINT128_COMPAT=1 ROCKSDB_MODIFY_NPHASH=1 LIB_MODE=static OPT="-DROCKSDB_NAMESPACE=alternative_rocksdb_ns" make V=1 -j24 J=24 transaction_test`

Reviewed By: sarangbh

Differential Revision: D49330210

Pulled By: cbi42

fbshipit-source-id: 85c99236eeca6a777af0101684fbab5a33cca1c9
2023-09-15 13:05:23 -07:00
Sarang Masti c4a19ed399 Add Transaction::CollapseKey to collapse merge op chains ondemand (#11815)
Summary:
Application using rocksdb today dont have much control over the cost of reads when merge-ops are enabled, expect for waiting for compactions to kick in or using max_successive_merges hint, which only applies to memtable. This change adds Transaction::CollapseKey api giving applications the ability to request merge chain collapse on-demand, when they detect high read costs due to merges. Currently, this only supported on PessimisticTransactions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11815

Test Plan: Add a unit test

Reviewed By: ajkr

Differential Revision: D49309387

Pulled By: sarangbh

fbshipit-source-id: a1eb5cc9e3bd4b3206a36150aacead770318e3e1
2023-09-15 10:25:57 -07:00
Yu Zhang 39a4ff2cab Track full_history_ts_low per SuperVersion (#11784)
Summary:
As discussed in https://github.com/facebook/rocksdb/issues/11730 , this PR tracks the effective `full_history_ts_low` per SuperVersion and update existing sanity checks for `ReadOptions.timestamp >= full_history_ts_low` to use this per SuperVersion `full_history_ts_low` instead. This also means the check is moved to happen after acquiring SuperVersion.

There are two motivations for this: 1) Each time `full_history_ts_low` really come into effect to collapse history, a new SuperVersion is always installed, because it would involve either a Flush or Compaction, both of which change the LSM tree shape. We can take advantage of this to ensure that as long as this sanity check is passed, even if `full_history_ts_low` can be concurrently increased and collapse some history above the requested `ReadOptions.timestamp`, a read request won’t have visibility to that part of history through this SuperVersion that it already acquired.  2) the existing sanity check uses `ColumnFamilyData::GetFullHistoryTsLow` without locking the db mutex, which is the mutex all `IncreaseFullHistoryTsLow` operation is using when mutating this field. So there is a race condition. This also solve the race condition on the read path.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11784

Test Plan:
`make all check`

// Checks success scenario really provide the read consistency attribute as mentioned above.
`./db_with_timestamp_basic_test --gtest_filter=*FullHistoryTsLowSanityCheckPassReadIsConsistent*`

// Checks failure scenario cleans up SuperVersion properly.
`./db_with_timestamp_basic_test --gtest_filter=*FullHistoryTsLowSanityCheckFail*`
`./db_secondary_test --gtest_filter=*FullHistoryTsLowSanityCheckFail*`
`./db_readonly_with_timestamp_test --gtest_filter=*FullHistoryTsLowSanitchCheckFail*`

Reviewed By: ltamasi

Differential Revision: D48894795

Pulled By: jowlyzhang

fbshipit-source-id: 1f801fe8e1bc8e63ca76c03cbdbd0974e5ff5bf6
2023-09-13 16:34:18 -07:00
jsteemann 0b8b17a9d1 avoid find() -> insert() sequence (#11743)
Summary:
when a key is recorded for locking in a pessimistic transaction, the key is first looked up in a map, and then inserted into the map if it was not already contained.
this can be simplified to an unconditional insert. in the ideal case that all keys are unique, this saves all the find() operations.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11743

Reviewed By: anand1976

Differential Revision: D48656798

Pulled By: ajkr

fbshipit-source-id: d0150de2db757e0c05e1797cfc24380790c71276
2023-08-29 18:34:59 -07:00
Changyu Bi 76ed9a3990 Add missing status check when compiling with ASSERT_STATUS_CHECKED=1 (#11686)
Summary:
It seems the flag `-fno-elide-constructors` is incorrectly overwritten in Makefile by 9c2ebcc2c3/Makefile (L243)
Applying the change in PR https://github.com/facebook/rocksdb/issues/11675 shows a lot of missing status checks. This PR adds the missing status checks.

Most of changes are just adding asserts in unit tests. I'll add pr comment around more interesting changes that need review.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11686

Test Plan: change Makefile as in https://github.com/facebook/rocksdb/issues/11675, and run `ASSERT_STATUS_CHECKED=1 TEST_UINT128_COMPAT=1 ROCKSDB_MODIFY_NPHASH=1 LIB_MODE=static OPT="-DROCKSDB_NAMESPACE=alternative_rocksdb_ns" make V=1 -j24 J=24 check`

Reviewed By: hx235

Differential Revision: D48176132

Pulled By: cbi42

fbshipit-source-id: 6758946cfb1c6ff84c4c1e0ca540d05e6fc390bd
2023-08-09 15:46:44 -07:00
Yu Zhang c751583c03 Set default cf ts sz for a reused transaction (#11685)
Summary:
Set up the default column family timestamp size for a reused write committed transaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11685

Test Plan: Added unit test.

Reviewed By: ltamasi

Differential Revision: D48195129

Pulled By: jowlyzhang

fbshipit-source-id: 54faa900c123fc6daa412c01490e36c10a24a678
2023-08-09 13:49:42 -07:00
Hui Xiao 9a034801ce Group rocksdb.sst.read.micros stat by different user read IOActivity + misc (#11444)
Summary:
**Context/Summary:**
- Similar to https://github.com/facebook/rocksdb/pull/11288 but for user read such as `Get(), MultiGet(), DBIterator::XXX(), Verify(File)Checksum()`.
   - For this, I refactored some user-facing `MultiGet` calls in `TransactionBase` and various types of `DB` so that it does not call a user-facing `Get()` but `GetImpl()` for passing the `ReadOptions::io_activity` check (see PR conversation)
   - New user read stats breakdown are guarded by `kExceptDetailedTimers` since measurement shows they have 4-5% regression to the upstream/main.

- Misc
   - More refactoring: with https://github.com/facebook/rocksdb/pull/11288, we complete passing `ReadOptions/IOOptions` to FS level. So we can now replace the previously [added](https://github.com/facebook/rocksdb/pull/9424) `rate_limiter_priority` parameter in `RandomAccessFileReader`'s `Read/MultiRead/Prefetch()` with `IOOptions::rate_limiter_priority`
   - Also, `ReadAsync()` call time is measured in `SST_READ_MICRO` now

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11444

Test Plan:
- CI fake db crash/stress test
- Microbenchmarking

**Build** `make clean && ROCKSDB_NO_FBCODE=1 DEBUG_LEVEL=0 make -jN db_basic_bench`
- google benchmark version: 604f6fd3f4
- db_basic_bench_base: upstream
- db_basic_bench_pr: db_basic_bench_base + this PR
- asyncread_db_basic_bench_base: upstream + [db basic bench patch for IteratorNext](https://github.com/facebook/rocksdb/compare/main...hx235:rocksdb:micro_bench_async_read)
- asyncread_db_basic_bench_pr: asyncread_db_basic_bench_base + this PR

**Test**

Get
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{null_stat|base|pr} --benchmark_filter=DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/negative_query:0/enable_filter:0/mmap:1/threads:1 --benchmark_repetitions=1000
```

Result
```
Coming soon
```

AsyncRead
```
TEST_TMPDIR=/dev/shm ./asyncread_db_basic_bench_{base|pr} --benchmark_filter=IteratorNext/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/async_io:1/include_detailed_timers:0 --benchmark_repetitions=1000 > syncread_db_basic_bench_{base|pr}.out
```

Result
```
Base:
1956,1956,1968,1977,1979,1986,1988,1988,1988,1990,1991,1991,1993,1993,1993,1993,1994,1996,1997,1997,1997,1998,1999,2001,2001,2002,2004,2007,2007,2008,

PR (2.3% regression, due to measuring `SST_READ_MICRO` that wasn't measured before):
1993,2014,2016,2022,2024,2027,2027,2028,2028,2030,2031,2031,2032,2032,2038,2039,2042,2044,2044,2047,2047,2047,2048,2049,2050,2052,2052,2052,2053,2053,
```

Reviewed By: ajkr

Differential Revision: D45918925

Pulled By: hx235

fbshipit-source-id: 58a54560d9ebeb3a59b6d807639692614dad058a
2023-08-08 17:26:50 -07:00
Jay Huh 9a2a6db2a9 Use C++17 [[fallthrough]] in transaction_test.cc (#11663)
Summary:
(Copied from https://www.internalfb.com/diff/D46606060)

This diff makes its files safe for use with -Wimplicit-fallthrough. Now that we're using C+20 there's no reason not to use this C++17 feature to make our code safer.
It's currently possible to write code like this:
```
switch(x){
  case 1:
    foo1();
  case 2:
    foo2();
    break;
  case 3:
    foo3();
}
```
But that's scary because we don't know whether the fallthrough from case 1 was intentional or not.
The -Wimplicit-fallthrough flag will make this an error. The solution is to either  fix the bug by inserting break or indicating intention by using [[fallthrough]]; (from C++17).
```
switch(x){
  case 1:
    foo1();
    [[fallthrough]]; // Solution if we intended to fallthrough
    break;           // Solution if we did not intend to fallthrough
  case 2:
    foo2();
    break;
  case 3:
    foo3();
}
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11663

Test Plan: Existing tests

Reviewed By: jowlyzhang

Differential Revision: D47961248

Pulled By: jaykorean

fbshipit-source-id: 0d374c721bf1b328c14949dc5c17693da7311d03
2023-08-01 14:49:06 -07:00
Peter Dillinger bb8fcc0044 db_stress: Reinstate Transaction::Rollback() calls before destruction (#11656)
Summary:
https://github.com/facebook/rocksdb/issues/11653 broke some crash tests.
Apparently these Rollbacks are needed for pessimistic transaction cases. (I'm still not sure if the API makes any sense with regard to safe usage. It's certainly not documented. Will consider in follow-up PRs.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11656

Test Plan: manual crash test runs, crash_test_with_multiops_wc_txn and crash_test_with_multiops_wp_txn

Reviewed By: cbi42

Differential Revision: D47906280

Pulled By: pdillinger

fbshipit-source-id: d058a01b6dbb47a4f08d199e335364168304f81b
2023-07-30 17:30:01 -07:00
Peter Dillinger b3c54186ab Allow TryAgain in db_stress with optimistic txn, and refactoring (#11653)
Summary:
In rare cases, optimistic transaction commit returns TryAgain. This change tolerates that intentional behavior in db_stress, up to a small limit in a row. This way, we don't miss a possible regression with excessive TryAgain, and trying again (rolling back the transaction) should have a well renewed chance of success as the writes will be associated with fresh sequence numbers.

Also, some of the APIs were not clear about Transaction semantics, so I have clarified:
* (Best I can tell....) Destroying a Transaction is safe without calling Rollback() (or at least should be). I don't know why it's a common pattern in our test code and examples to rollback before unconditional destruction. Stress test updated not to call Rollback unnecessarily (to test safe destruction).
* Despite essentially doing what is asked, simply trying Commit() again when it returns TryAgain does not have a chance of success, because of the transaction being bound to the DB state at the time of operations before Commit. Similar logic applies to Busy AFAIK. Commit() API comments updated, and expanded unit test in optimistic_transaction_test.

Also also, because I can't stop myself, I refactored a good portion of the transaction handling code in db_stress.
* Avoid existing and new copy-paste for most transaction interactions with a new ExecuteTransaction (higher-order) function.
* Use unique_ptr (nicely complements removing unnecessary Rollbacks)
* Abstract out a pattern for safely calling std::terminate() and use it in more places. (The TryAgain errors we saw did not have stack traces because of "terminate called recursively".)

Intended follow-up: resurrect use of `FLAGS_rollback_one_in` but also include non-trivial cases

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11653

Test Plan:
this is the test :)

Also, temporarily bypassed the new retry logic and boosted the chance of hitting TryAgain. Quickly reproduced the TryAgain error. Then re-enabled the new retry logic, and was not able to hit the error after running for tens of minutes, even with the boosted chances.

Reviewed By: cbi42

Differential Revision: D47882995

Pulled By: pdillinger

fbshipit-source-id: 21eadb1525423340dbf28d17cf166b9583311a0d
2023-07-28 16:25:29 -07:00
Peter Dillinger 05a1d52e77 Use FaultInjectionTestFS in transaction_test, clarify Close() APIs (#11499)
Summary:
... instead of race-condition-laden FaultInjectionTestEnv. See https://app.circleci.com/pipelines/github/facebook/rocksdb/27912/workflows/4c63e5a8-597e-439d-8c7e-82308056af02/jobs/609648 and similar PR https://github.com/facebook/rocksdb/issues/11271

Had to fix the semantics of FaultInjectionTestFS Close() operations to allow a non-OK Close() to fulfill the obligation to close before destruction. To me, this is the obvious choice of Close contract, because what is the caller supposed to do if Close() fails and they still have an obligation to successfully close before object destruction? Call Close() in an infinite loop? Leak the object? I have added API comments to the Env and Filesystem Close() functions to clarify the contracts.

Note that `DB::Close()` has one exception to this kind of Close contract, but it is clearly described in API comments and it is really only for catching programming mistakes, not for dealing with exogenous errors.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11499

Test Plan: watch CI

Reviewed By: jowlyzhang

Differential Revision: D46375708

Pulled By: pdillinger

fbshipit-source-id: 03d4d8251e5df50a82ecd139f7e83f613015fe40
2023-06-21 23:38:54 -07:00
Jay Huh 81aeb15988 Add WaitForCompact with WaitForCompactOptions to public API (#11436)
Summary:
Context:

This is the first PR for WaitForCompact() Implementation with WaitForCompactOptions. In this PR, we are introducing `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` in the public API. This currently utilizes the existing internal `WaitForCompact()` implementation (with default abort_on_pause = false). `abort_on_pause` has been moved to `WaitForCompactOptions&`. In the later PRs, we will introduce the following two options in `WaitForCompactOptions`

1. `bool flush = false` by default - If true, flush before waiting for compactions to finish. Must be set to true to ensure no immediate compactions (except perhaps periodic compactions) after closing and re-opening the DB.
2. `bool close_db = false` by default - If true, will also close the DB upon compactions finishing.

1. struct `WaitForCompactOptions` added to options.h and `abort_on_pause` in the internal API moved to the option struct.
2. `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` introduced in `db.h`
3. Changed the internal WaitForCompact() to `WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` and checks for the `abort_on_pause` inside the option.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11436

Test Plan:
Following tests added
- `DBCompactionTest::WaitForCompactWaitsOnCompactionToFinish`
- `DBCompactionTest::WaitForCompactAbortOnPauseAborted`
- `DBCompactionTest::WaitForCompactContinueAfterPauseNotAborted`
- `DBCompactionTest::WaitForCompactShutdownWhileWaiting`
- `TransactionTest::WaitForCompactAbortOnPause`

NOTE: `TransactionTest::WaitForCompactAbortOnPause` was added to use `StackableDB` to ensure the wrapper function is in place.

Reviewed By: pdillinger

Differential Revision: D45799659

Pulled By: jaykorean

fbshipit-source-id: b5b58f95957f2ab47d1221dee32a61d6cdc4685b
2023-05-25 17:25:51 -07:00
Peter Dillinger 17bc27741f Improve memory efficiency of many OptimisticTransactionDBs (#11439)
Summary:
Currently it's easy to use a ton of memory with many small OptimisticTransactionDB instances, because each one by default allocates a million mutexes (40 bytes each on my compiler) for validating transactions. It even puts a lot of pressure on the allocator by allocating each one individually!

In this change:
* Create a new object and option that enables sharing these buckets of mutexes between instances. This is generally good for load balancing potential contention as various DBs become hotter or colder with txn writes. About the only cases where this sharing wouldn't make sense (e.g. each DB usually written by one thread) are cases that would be better off with OccValidationPolicy::kValidateSerial which doesn't use the buckets anyway.
* Allocate the mutexes in a contiguous array, for efficiency
* Add an option to ensure the mutexes are cache-aligned. In several other places we use cache-aligned mutexes but OptimisticTransactionDB historically does not. It should be a space-time trade-off the user can choose.
* Provide some visibility into the memory used by the mutex buckets with an ApproximateMemoryUsage() function (also used in unit testing)
* Share code with other users of "striped" mutexes, appropriate refactoring for customization & efficiency (e.g. using FastRange instead of modulus)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11439

Test Plan: unit tests added. Ran sized-up versions of stress test in unit test, including a before-and-after performance test showing no consistent difference. (NOTE: OptimisticTransactionDB not currently covered by db_stress!)

Reviewed By: ltamasi

Differential Revision: D45796393

Pulled By: pdillinger

fbshipit-source-id: ae2b3a26ad91ceeec15debcdc63ff48df6736a54
2023-05-24 11:57:15 -07:00
Yu Zhang ffb5f1f445 Refactor WriteUnpreparedStressTest to be a unit test (#11424)
Summary:
This patch remove the "stress" aspect from the WriteUnpreparedStressTest and leave it to be a unit test for some correctness testing w.r.t. snapshot functionality. I added some read-your-write verification to the transaction test in db_stress.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11424

Test Plan:
`./write_unprepared_transaction_test`
`./db_crashtest.py whitebox --txn`
`./db_crashtest.py blackbox --txn`

Reviewed By: hx235

Differential Revision: D45551521

Pulled By: jowlyzhang

fbshipit-source-id: 20c3d510eb4255b08ddd7b6c85bdb4945436f6e8
2023-05-22 12:31:52 -07:00
Jay Huh 586d78b31e Remove wait_unscheduled from waitForCompact internal API (#11443)
Summary:
Context:

In pull request https://github.com/facebook/rocksdb/issues/11436, we are introducing a new public API `waitForCompact(const WaitForCompactOptions& wait_for_compact_options)`. This API invokes the internal implementation `waitForCompact(bool wait_unscheduled=false)`. The unscheduled parameter indicates the compactions that are not yet scheduled but are required to process items in the queue.

In certain cases, we are unable to wait for compactions, such as during a shutdown or when background jobs are paused. It is important to return the appropriate status in these scenarios. For all other cases, we should wait for all compaction and flush jobs, including the unscheduled ones. The primary purpose of this new API is to wait until the system has resolved its compaction debt. Currently, the usage of `wait_unscheduled` is limited to test code.

This pull request eliminates the usage of wait_unscheduled. The internal `waitForCompact()` API now waits for unscheduled compactions unless the db is undergoing a shutdown. In the event of a shutdown, the API returns `Status::ShutdownInProgress()`.

Additionally, a new parameter, `abort_on_pause`, has been introduced with a default value of `false`. This parameter addresses the possibility of waiting indefinitely for unscheduled jobs if `PauseBackgroundWork()` was called before `waitForCompact()` is invoked. By setting `abort_on_pause` to `true`, the API will immediately return `Status::Aborted`.

Furthermore, all tests that previously called `waitForCompact(true)` have been fixed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11443

Test Plan:
Existing tests that involve a shutdown in progress:

- DBCompactionTest::CompactRangeShutdownWhileDelayed
- DBTestWithParam::PreShutdownMultipleCompaction
- DBTestWithParam::PreShutdownCompactionMiddle

Reviewed By: pdillinger

Differential Revision: D45923426

Pulled By: jaykorean

fbshipit-source-id: 7dc93fe6a6841a7d9d2d72866fa647090dba8eae
2023-05-17 18:13:50 -07:00
Hui Xiao 151242ce46 Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288)
Summary:
**Context:**
The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them.

**Summary**
- Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros`
   - Fixed the default histogram in `RandomAccessFileReader`
- New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader`
- Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288

Test Plan:
- **Stress test**
- **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.**  (without blob)
     - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads.
```
./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10)
```
```
// BlockBasedTable
rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805
rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116
rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689

// PlainTable
Does not apply
```
- **Db bench 2: performance**

**Read**

SETUP: db with 900 files
```
./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655  -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none
```run till convergence
```
./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3
```
Pre-change
`readrandom [AVG 60 runs] : 21568 (± 248) ops/sec`
Post-change (no regression, -0.3%)
`readrandom [AVG 60 runs] : 21486 (± 236) ops/sec`

**Compaction/Flush**run till convergence
```
./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655  -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none

rocksdb.sst.read.micros  COUNT : 33820
rocksdb.sst.read.flush.micros COUNT : 1800
rocksdb.sst.read.compaction.micros COUNT : 32020
```
Pre-change
`fillseq [AVG 46 runs] : 1391 (± 214) ops/sec;    0.7 (± 0.1) MB/sec`

Post-change (no regression, ~-0.4%)
`fillseq [AVG 46 runs] : 1385 (± 216) ops/sec;    0.7 (± 0.1) MB/sec`

Reviewed By: ajkr

Differential Revision: D44007011

Pulled By: hx235

fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
2023-04-21 09:07:18 -07:00
Peter Dillinger 390cc0b156 Ensure LockWAL() stall cleared for UnlockWAL() return (#11172)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/11160

By counting the number of stalls placed on a write queue, we can check in UnlockWAL() whether the stall present at the start of UnlockWAL() has been cleared by the end, or wait until it's cleared.

More details in code comments and new unit test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11172

Test Plan: unit test added. Yes, it uses sleep to amplify failure on buggy behavior if present, but using a sync point to only allow new behavior would fail with the old code only because it doesn't contain the new sync point. Basically, using a sync point in UnlockWAL() could easily mask a regression by artificially limiting key behaviors. The test would only check that UnlockWAL() invokes code that *should* do the right thing, without checking that it *does* the right thing.

Reviewed By: ajkr

Differential Revision: D42894341

Pulled By: pdillinger

fbshipit-source-id: 15c9da0ca383e6aec845b29f5447d76cecbf46c3
2023-02-03 12:08:37 -08:00
sdong 4720ba4391 Remove RocksDB LITE (#11147)
Summary:
We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support.

Most of changes were done through following comments:

unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`

by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147

Test Plan: See CI

Reviewed By: pdillinger

Differential Revision: D42796341

fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
2023-01-27 13:14:19 -08:00
Peter Dillinger 546e213c4f Fix DelayWrite() calls for two_write_queues (#11130)
Summary:
PR https://github.com/facebook/rocksdb/issues/11020 fixed a case where it was easy to deadlock the DB with LockWAL() but introduced a bug showing up as a rare assertion failure in the stress test. Specifically, `assert(w->state == STATE_INIT)` in `WriteThread::LinkOne()` called from `BeginWriteStall()`, `DelayWrite()`, `WriteImplWALOnly()`. I haven't been about to generate a unit test that reproduces this failure but I believe the root cause is that DelayWrite() was never meant to be re-entrant, only called from the DB's write_thread_ leader. https://github.com/facebook/rocksdb/issues/11020 introduced a call to DelayWrite() from the nonmem_write_thread_ group leader.

This fix is to make DelayWrite() apply to the specific write queue that it is being called from (inject a dummy write stall entry to the head of the appropriate write queue). WriteController is re-entrant, based on polling and state changes signalled with bg_cv_, so can manage stalling two queues. The only anticipated complication (called out by Andrew in previous PR) is that we don't want timed write delays being injected in parallel for the two queues, because that dimishes the intended throttling effect. Thus, we only allow timed delays for the primary write queue.

HISTORY not updated because this is intended for the same release where the bug was introduced.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11130

Test Plan:
Although I was not able to reproduce the assertion failure, I was able to reproduce a distinct flaw with what I believe is the same root cause: a kind of deadlock if both write queues need to wake up from stopped writes. Only one will be waiting on bg_cv_ (the other waiting in `LinkOne()` for the write queue to open up), so a single SignalAll() will only unblock one of the queues, with the other re-instating the stop until another signal on bg_cv_. A simple unit test is added for this case.

Will also run crash_test_with_multiops_wc_txn for a while looking for issues.

Reviewed By: ajkr

Differential Revision: D42749330

Pulled By: pdillinger

fbshipit-source-id: 4317dd899a93d57c26fd5af7143038f82d4d4d1b
2023-01-25 14:18:27 -08:00
Wenlong Zhang 1cfe3528a2 support loongarch64 for rocksdb (#10036)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10036

Reviewed By: hx235

Differential Revision: D42424074

Pulled By: ajkr

fbshipit-source-id: 004adb75005a26bd01c5d568d1ec6ac442cd59dd
2023-01-13 08:42:44 -08:00
Yanqin Jin c93ba7db5d Revise LockWAL/UnlockWAL implementation (#11020)
Summary:
RocksDB has two public APIs: `DB::LockWAL()`/`DB::UnlockWAL()`. The current implementation acquires and
releases the internal `DBImpl::log_write_mutex_`.

According to the comment on `DBImpl::log_write_mutex_`: https://github.com/facebook/rocksdb/blob/7.8.fb/db/db_impl/db_impl.h#L2287:L2288
> Note: to avoid dealock, if needed to acquire both log_write_mutex_ and mutex_, the order should be first mutex_ and then log_write_mutex_.

This puts limitations on how applications can use the `LockWAL()` API. After `LockWAL()` returns ok, then application
should not perform any operation that acquires `mutex_`. Currently, the use case of `LockWAL()` is MyRocks implementing
the MySQL storage engine handlerton `lock_hton_log` interface. The operation that MyRocks performs after `LockWAL()`
is `GetSortedWalFiless()` which not only acquires mutex_, but also `log_write_mutex_`.

There are two issues:
1. Applications using these two APIs may hang if one thread calls `GetSortedWalFiles()` after
calling `LockWAL()` because log_write_mutex is not recursive.
2. Two threads may dead lock due to lock order inversion.

To fix these issues, we can modify the implementation of LockWAL so that it does not keep
`log_write_mutex_` held until UnlockWAL. To achieve the goal of locking the WAL, we can
instead manually inject a write stall so that all future writes will be stopped.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11020

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D41785203

Pulled By: riversand963

fbshipit-source-id: 5ccb7a9c6eb9a2c3fa80fd2c399cc2568b8f89ce
2022-12-13 21:45:00 -08:00
Yanqin Jin 75aca74017 Replace member variable lambda with methods (#10924)
Summary:
In transaction unit tests, replace a few member variable lambdas with
non-static methods. It's easier for gdb to work with variables in methods than in lambdas.
(Seen similar things to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86675).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10924

Test Plan: make check

Reviewed By: jay-zhuang

Differential Revision: D41072241

Pulled By: riversand963

fbshipit-source-id: e4fa491de573c4656225a86a75af926c1df827f6
2022-11-07 12:31:48 -08:00
Yanqin Jin 0547cecb81 Reduce access to atomic variables in a test (#10909)
Summary:
With TSAN build on CircleCI (see mini-tsan in .circleci/config).
Sometimes `SeqAdvanceConcurrentTest.SeqAdvanceConcurrent` will get stuck when an experimental feature called
"unordered write" is enabled. Stack trace will be the following
```
Thread 7 (Thread 0x7f2284a1c700 (LWP 481523) "write_prepared_"):
#0  0x00000000004fa3f5 in __tsan_atomic64_load () at ./db/merge_context.h:15
https://github.com/facebook/rocksdb/issues/1  0x00000000005e5942 in std::__atomic_base<unsigned long>::load (this=0x7b74000012f8, __m=std::memory_order_seq_cst) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:481
https://github.com/facebook/rocksdb/issues/2  std::__atomic_base<unsigned long>::operator unsigned long (this=0x7b74000012f8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:341
https://github.com/facebook/rocksdb/issues/3  0x00000000005bf001 in rocksdb::SeqAdvanceConcurrentTest_SeqAdvanceConcurrent_Test::TestBody()::$_9::operator()(void*) const (this=0x7b14000085e8) at utilities/transactions/write_prepared_transaction_test.cc:1702

Thread 6 (Thread 0x7f228421b700 (LWP 481521) "write_prepared_"):
#0  0x000000000052178c in __tsan::MetaMap::GetAndLock(__tsan::ThreadState*, unsigned long, unsigned long, bool, bool) () at ./db/merge_context.h:15
https://github.com/facebook/rocksdb/issues/1  0x00000000004fa48e in __tsan_atomic64_load () at ./db/merge_context.h:15
https://github.com/facebook/rocksdb/issues/2  0x00000000005e5942 in std::__atomic_base<unsigned long>::load (this=0x7b74000012f8, __m=std::memory_order_seq_cst) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:481
https://github.com/facebook/rocksdb/issues/3  std::__atomic_base<unsigned long>::operator unsigned long (this=0x7b74000012f8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:341
https://github.com/facebook/rocksdb/issues/4  0x00000000005bf001 in rocksdb::SeqAdvanceConcurrentTest_SeqAdvanceConcurrent_Test::TestBody()::$_9::operator()(void*) const (this=0x7b14000085e8) at utilities/transactions/write_prepared_transaction_test.cc:1702
```

This is problematic and suspicious. Two threads will get stuck in the same place trying to load from an atomic variable.
https://github.com/facebook/rocksdb/blob/7.8.fb/utilities/transactions/write_prepared_transaction_test.cc#L1694:L1707. Not sure why two threads can reach the same point.

The stack trace shows that there may be a deadlock, since the two threads are on the same write thread (one is doing Prepare, while the other is trying to commit).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10909

Test Plan:
On CircleCI mini-tsan, apply a patch first so that we have a higher chance of hitting the same problematic situation,
```
 diff --git a/utilities/transactions/write_prepared_transaction_test.cc b/utilities/transactions/write_prepared_transaction_test.cc
index 4bc1f3744..bd5dc4924 100644
 --- a/utilities/transactions/write_prepared_transaction_test.cc
+++ b/utilities/transactions/write_prepared_transaction_test.cc
@@ -1714,13 +1714,13 @@ TEST_P(SeqAdvanceConcurrentTest, SeqAdvanceConcurrent) {
       size_t d = (n % base[bi + 1]) / base[bi];
       switch (d) {
         case 0:
-          threads.emplace_back(txn_t0, bi);
+          threads.emplace_back(txn_t3, bi);
           break;
         case 1:
-          threads.emplace_back(txn_t1, bi);
+          threads.emplace_back(txn_t3, bi);
           break;
         case 2:
-          threads.emplace_back(txn_t2, bi);
+          threads.emplace_back(txn_t3, bi);
           break;
         case 3:
           threads.emplace_back(txn_t3, bi);
```
then build and run tests
```
COMPILE_WITH_TSAN=1 CC=clang-13 CXX=clang++-13 ROCKSDB_DISABLE_ALIGNED_NEW=1 USE_CLANG=1 make V=1 -j32 check
gtest-parallel -r 100 ./write_prepared_transaction_test --gtest_filter=TwoWriteQueues/SeqAdvanceConcurrentTest.SeqAdvanceConcurrent/19
```
In the above, `SeqAdvanceConcurrent/19`. The tests 10 to 19 correspond to unordered write in which Prepare() and Commit() can both enter the same write thread.
Before this PR, there is a high chance of hitting the deadlock. With this PR, no deadlock has been encountered so far.

Reviewed By: ltamasi

Differential Revision: D40869387

Pulled By: riversand963

fbshipit-source-id: 81e82a70c263e4f3417597a201b081ee54f1deab
2022-11-02 14:54:58 -07:00
Yanqin Jin 7d26e4c5a3 Basic Support for Merge with user-defined timestamp (#10819)
Summary:
This PR implements the originally disabled `Merge()` APIs when user-defined timestamp is enabled.

Simplest usage:
```cpp
// assume string append merge op is used with '.' as delimiter.
// ts1 < ts2
db->Put(WriteOptions(), "key", ts1, "v0");
db->Merge(WriteOptions(), "key", ts2, "1");
ReadOptions ro;
ro.timestamp = &ts2;
db->Get(ro, "key", &value);
ASSERT_EQ("v0.1", value);
```

Some code comments are added for clarity.

Note: support for timestamp in `DB::GetMergeOperands()` will be done in a follow-up PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10819

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D40603195

Pulled By: riversand963

fbshipit-source-id: f96d6f183258f3392d80377025529f7660503013
2022-10-31 22:28:58 -07:00
Yanqin Jin 900f79126d Pass const LockInfo& to AcquireLocked() and AcquireWithTimeout (#10874)
Summary:
The motivation and benefit of current behavior of passing `LockInfo&&` as argument to AcquireLocked() and AcquireWithTimeout() is not clear to me. Furthermore, in AcquireWithTimeout(), we access members of `LockInfo&&` after it is passed to AcquireLocked() as rvalue ref. In addition, we may call `AcquireLocked()` with `std::move(lock_info)` multiple times.

This leads to linter warning of use-after-move. If future implementation of AcquireLocked() does something like moving-construct a new `LockedInfo` using the passed-in `LockInfo&&`, then the caller cannot use it because `LockInfo` has a member of type `autovector`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10874

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D40704210

Pulled By: riversand963

fbshipit-source-id: 20091df65b4fc63b072bcec9809efc49955d6d35
2022-10-28 14:05:12 -07:00