Commit graph

3 commits

Author SHA1 Message Date
Alan Paxton d9a441113e JNI get_helper code sharing / multiGet() use efficient batch C++ support (#12344)
Summary:
Implement RAII-based helpers for JNIGet() and multiGet()

Replace JNI C++ helpers `rocksdb_get_helper, rocksdb_get_helper_direct`, `multi_get_helper`, `multi_get_helper_direct`, `multi_get_helper_release_keys`, `txn_get_helper`, and `txn_multi_get_helper`.

The model is to entirely do away with a single helper, instead a number of utility methods allow each separate
JNI `Get()` and `MultiGet()` method to organise their parameters efficiently, then call the underlying C++ `db->Get()`,
`db->MultiGet()`, `txn->Get()`, or `txn->MultiGet()` method itself, and use further utilities to retrieve results.

Roughly speaking:

* get keys into C++ form
* Call C++ Get()
* get results and status into Java form

We achieve a useful performance gain as part of this work; by using the updated C++ multiGet we immediately pick up its performance gains (batch improvements to multiGet C++ were previously implemented, but not until now used by Java/JNI). multiGetBB already uses the batched C++ multiGet(), and all other benchmarks show consistent improvement after the changes:

## Before:
```
Benchmark (columnFamilyTestType) (keyCount) (keySize) (multiGetSize) (valueSize) Mode Cnt Score Error Units
MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 256 thrpt 25 5315.459 ± 20.465 ops/s
MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 1024 thrpt 25 5673.115 ± 78.299 ops/s
MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 4096 thrpt 25 2616.860 ± 46.994 ops/s
MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 16384 thrpt 25 1700.058 ± 24.034 ops/s
MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 65536 thrpt 25 791.171 ± 13.955 ops/s
MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 256 thrpt 25 6129.929 ± 94.200 ops/s
MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 1024 thrpt 25 7012.405 ± 97.886 ops/s
MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 4096 thrpt 25 2799.014 ± 39.352 ops/s
MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 16384 thrpt 25 1417.205 ± 22.272 ops/s
MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 65536 thrpt 25 655.594 ± 13.050 ops/s
MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 256 thrpt 25 6147.247 ± 82.711 ops/s
MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 1024 thrpt 25 7004.213 ± 79.251 ops/s
MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 4096 thrpt 25 2715.154 ± 110.017 ops/s
MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 16384 thrpt 25 1408.070 ± 31.714 ops/s
MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 65536 thrpt 25 623.829 ± 57.374 ops/s
MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 256 thrpt 25 6119.243 ± 116.313 ops/s
MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 1024 thrpt 25 6931.873 ± 128.094 ops/s
MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 4096 thrpt 25 2678.253 ± 39.113 ops/s
MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 16384 thrpt 25 1337.384 ± 19.500 ops/s
MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 65536 thrpt 25 625.596 ± 14.525 ops/s
```

## After:
```
Benchmark                                    (columnFamilyTestType)  (keyCount)  (keySize)  (multiGetSize)  (valueSize)   Mode  Cnt     Score     Error  Units
MultiGetBenchmarks.multiGetBB200                   no_column_family       10000       1024             100          256  thrpt   25  5191.074 ±  78.250  ops/s
MultiGetBenchmarks.multiGetBB200                   no_column_family       10000       1024             100         1024  thrpt   25  5378.692 ± 260.682  ops/s
MultiGetBenchmarks.multiGetBB200                   no_column_family       10000       1024             100         4096  thrpt   25  2590.183 ±  34.844  ops/s
MultiGetBenchmarks.multiGetBB200                   no_column_family       10000       1024             100        16384  thrpt   25  1634.793 ±  34.022  ops/s
MultiGetBenchmarks.multiGetBB200                   no_column_family       10000       1024             100        65536  thrpt   25   786.455 ±   8.462  ops/s
MultiGetBenchmarks.multiGetBB200                    1_column_family       10000       1024             100          256  thrpt   25  5285.055 ±  11.676  ops/s
MultiGetBenchmarks.multiGetBB200                    1_column_family       10000       1024             100         1024  thrpt   25  5586.758 ± 213.008  ops/s
MultiGetBenchmarks.multiGetBB200                    1_column_family       10000       1024             100         4096  thrpt   25  2527.172 ±  17.106  ops/s
MultiGetBenchmarks.multiGetBB200                    1_column_family       10000       1024             100        16384  thrpt   25  1819.547 ±  12.958  ops/s
MultiGetBenchmarks.multiGetBB200                    1_column_family       10000       1024             100        65536  thrpt   25   803.861 ±   9.963  ops/s
MultiGetBenchmarks.multiGetBB200                 20_column_families       10000       1024             100          256  thrpt   25  5253.793 ±  28.020  ops/s
MultiGetBenchmarks.multiGetBB200                 20_column_families       10000       1024             100         1024  thrpt   25  5705.591 ±  20.556  ops/s
MultiGetBenchmarks.multiGetBB200                 20_column_families       10000       1024             100         4096  thrpt   25  2523.377 ±  15.415  ops/s
MultiGetBenchmarks.multiGetBB200                 20_column_families       10000       1024             100        16384  thrpt   25  1815.344 ±  11.309  ops/s
MultiGetBenchmarks.multiGetBB200                 20_column_families       10000       1024             100        65536  thrpt   25   820.792 ±   3.192  ops/s
MultiGetBenchmarks.multiGetBB200                100_column_families       10000       1024             100          256  thrpt   25  5262.184 ±  20.477  ops/s
MultiGetBenchmarks.multiGetBB200                100_column_families       10000       1024             100         1024  thrpt   25  5706.959 ±  23.123  ops/s
MultiGetBenchmarks.multiGetBB200                100_column_families       10000       1024             100         4096  thrpt   25  2520.362 ±   9.170  ops/s
MultiGetBenchmarks.multiGetBB200                100_column_families       10000       1024             100        16384  thrpt   25  1789.185 ±  14.239  ops/s
MultiGetBenchmarks.multiGetBB200                100_column_families       10000       1024             100        65536  thrpt   25   818.401 ±  12.132  ops/s
MultiGetBenchmarks.multiGetList10                  no_column_family       10000       1024             100          256  thrpt   25  6978.310 ±  14.084  ops/s
MultiGetBenchmarks.multiGetList10                  no_column_family       10000       1024             100         1024  thrpt   25  7664.242 ±  22.304  ops/s
MultiGetBenchmarks.multiGetList10                  no_column_family       10000       1024             100         4096  thrpt   25  2881.778 ±  81.054  ops/s
MultiGetBenchmarks.multiGetList10                  no_column_family       10000       1024             100        16384  thrpt   25  1599.826 ±   7.190  ops/s
MultiGetBenchmarks.multiGetList10                  no_column_family       10000       1024             100        65536  thrpt   25   737.520 ±   6.809  ops/s
MultiGetBenchmarks.multiGetList10                   1_column_family       10000       1024             100          256  thrpt   25  6974.376 ±  10.716  ops/s
MultiGetBenchmarks.multiGetList10                   1_column_family       10000       1024             100         1024  thrpt   25  7637.440 ±  45.877  ops/s
MultiGetBenchmarks.multiGetList10                   1_column_family       10000       1024             100         4096  thrpt   25  2820.472 ±  42.231  ops/s
MultiGetBenchmarks.multiGetList10                   1_column_family       10000       1024             100        16384  thrpt   25  1716.663 ±   8.527  ops/s
MultiGetBenchmarks.multiGetList10                   1_column_family       10000       1024             100        65536  thrpt   25   755.848 ±   7.514  ops/s
MultiGetBenchmarks.multiGetList10                20_column_families       10000       1024             100          256  thrpt   25  6943.651 ±  20.040  ops/s
MultiGetBenchmarks.multiGetList10                20_column_families       10000       1024             100         1024  thrpt   25  7679.415 ±   9.114  ops/s
MultiGetBenchmarks.multiGetList10                20_column_families       10000       1024             100         4096  thrpt   25  2844.564 ±  13.388  ops/s
MultiGetBenchmarks.multiGetList10                20_column_families       10000       1024             100        16384  thrpt   25  1729.545 ±   5.983  ops/s
MultiGetBenchmarks.multiGetList10                20_column_families       10000       1024             100        65536  thrpt   25   783.218 ±   1.530  ops/s
MultiGetBenchmarks.multiGetList10               100_column_families       10000       1024             100          256  thrpt   25  6944.276 ±  29.995  ops/s
MultiGetBenchmarks.multiGetList10               100_column_families       10000       1024             100         1024  thrpt   25  7670.301 ±   8.986  ops/s
MultiGetBenchmarks.multiGetList10               100_column_families       10000       1024             100         4096  thrpt   25  2839.828 ±  12.421  ops/s
MultiGetBenchmarks.multiGetList10               100_column_families       10000       1024             100        16384  thrpt   25  1730.005 ±   9.209  ops/s
MultiGetBenchmarks.multiGetList10               100_column_families       10000       1024             100        65536  thrpt   25   787.096 ±   1.977  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20        no_column_family       10000       1024             100          256  thrpt   25  6896.944 ±  21.530  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20        no_column_family       10000       1024             100         1024  thrpt   25  7622.407 ±  12.824  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20        no_column_family       10000       1024             100         4096  thrpt   25  2927.538 ±  19.792  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20        no_column_family       10000       1024             100        16384  thrpt   25  1598.041 ±   4.312  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20        no_column_family       10000       1024             100        65536  thrpt   25   744.564 ±   9.236  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20         1_column_family       10000       1024             100          256  thrpt   25  6853.760 ±  78.041  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20         1_column_family       10000       1024             100         1024  thrpt   25  7360.917 ± 355.365  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20         1_column_family       10000       1024             100         4096  thrpt   25  2848.774 ±  13.409  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20         1_column_family       10000       1024             100        16384  thrpt   25  1727.688 ±   3.329  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20         1_column_family       10000       1024             100        65536  thrpt   25   776.088 ±   7.517  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20      20_column_families       10000       1024             100          256  thrpt   25  6910.339 ±  14.366  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20      20_column_families       10000       1024             100         1024  thrpt   25  7633.660 ±  10.830  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20      20_column_families       10000       1024             100         4096  thrpt   25  2787.799 ±  81.775  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20      20_column_families       10000       1024             100        16384  thrpt   25  1726.517 ±   6.830  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20      20_column_families       10000       1024             100        65536  thrpt   25   787.597 ±   3.362  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20     100_column_families       10000       1024             100          256  thrpt   25  6922.445 ±  10.493  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20     100_column_families       10000       1024             100         1024  thrpt   25  7604.710 ±  48.043  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20     100_column_families       10000       1024             100         4096  thrpt   25  2848.788 ±  15.783  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20     100_column_families       10000       1024             100        16384  thrpt   25  1730.837 ±   6.497  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20     100_column_families       10000       1024             100        65536  thrpt   25   794.557 ±   1.869  ops/s
MultiGetBenchmarks.multiGetListRandomCF30          no_column_family       10000       1024             100          256  thrpt   25  6918.716 ±  15.766  ops/s
MultiGetBenchmarks.multiGetListRandomCF30          no_column_family       10000       1024             100         1024  thrpt   25  7626.692 ±   9.394  ops/s
MultiGetBenchmarks.multiGetListRandomCF30          no_column_family       10000       1024             100         4096  thrpt   25  2871.382 ±  72.155  ops/s
MultiGetBenchmarks.multiGetListRandomCF30          no_column_family       10000       1024             100        16384  thrpt   25  1598.786 ±   4.819  ops/s
MultiGetBenchmarks.multiGetListRandomCF30          no_column_family       10000       1024             100        65536  thrpt   25   748.469 ±   7.234  ops/s
MultiGetBenchmarks.multiGetListRandomCF30           1_column_family       10000       1024             100          256  thrpt   25  6922.666 ±  17.131  ops/s
MultiGetBenchmarks.multiGetListRandomCF30           1_column_family       10000       1024             100         1024  thrpt   25  7623.890 ±   8.805  ops/s
MultiGetBenchmarks.multiGetListRandomCF30           1_column_family       10000       1024             100         4096  thrpt   25  2850.698 ±  18.004  ops/s
MultiGetBenchmarks.multiGetListRandomCF30           1_column_family       10000       1024             100        16384  thrpt   25  1727.623 ±   4.868  ops/s
MultiGetBenchmarks.multiGetListRandomCF30           1_column_family       10000       1024             100        65536  thrpt   25   774.534 ±  10.025  ops/s
MultiGetBenchmarks.multiGetListRandomCF30        20_column_families       10000       1024             100          256  thrpt   25  5486.251 ±  13.582  ops/s
MultiGetBenchmarks.multiGetListRandomCF30        20_column_families       10000       1024             100         1024  thrpt   25  4920.656 ±  44.557  ops/s
MultiGetBenchmarks.multiGetListRandomCF30        20_column_families       10000       1024             100         4096  thrpt   25  3922.913 ±  25.686  ops/s
MultiGetBenchmarks.multiGetListRandomCF30        20_column_families       10000       1024             100        16384  thrpt   25  2873.106 ±   4.336  ops/s
MultiGetBenchmarks.multiGetListRandomCF30        20_column_families       10000       1024             100        65536  thrpt   25   802.404 ±   8.967  ops/s
MultiGetBenchmarks.multiGetListRandomCF30       100_column_families       10000       1024             100          256  thrpt   25  4817.996 ±  18.042  ops/s
MultiGetBenchmarks.multiGetListRandomCF30       100_column_families       10000       1024             100         1024  thrpt   25  4243.922 ±  13.929  ops/s
MultiGetBenchmarks.multiGetListRandomCF30       100_column_families       10000       1024             100         4096  thrpt   25  3175.998 ±   7.773  ops/s
MultiGetBenchmarks.multiGetListRandomCF30       100_column_families       10000       1024             100        16384  thrpt   25  2321.990 ±  12.501  ops/s
MultiGetBenchmarks.multiGetListRandomCF30       100_column_families       10000       1024             100        65536  thrpt   25  1753.028 ±   7.130  ops/s
```

Closes https://github.com/facebook/rocksdb/issues/11518

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12344

Reviewed By: cbi42

Differential Revision: D54809714

Pulled By: pdillinger

fbshipit-source-id: bee3b949720abac073bce043b59ce976a11e99eb
2024-03-12 12:42:08 -07:00
Alan Paxton c1ec0b28eb java / jni io_uring support (#9224)
Summary:
Existing multiGet() in java calls multi_get_helper() which then calls DB::std::vector MultiGet(). This doesn't take advantage of io_uring.

This change adds another JNI level method that runs a parallel code path using the DB::void MultiGet(), using ByteBuffers at the JNI level. We call it multiGetDirect(). In addition to using the io_uring path, this code internally returns pinned slices which we can copy out of into our direct byte buffers; this should reduce the overall number of copies in the code path to/from Java. Some jmh benchmark runs (100k keys, 1000 key multiGet) suggest that for value sizes > 1k, we see about a 20% performance improvement, although performance is slightly reduced for small value sizes, there's a little bit more overhead in the JNI methods.

Closes https://github.com/facebook/rocksdb/issues/8407

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9224

Reviewed By: mrambacher

Differential Revision: D32951754

Pulled By: jay-zhuang

fbshipit-source-id: 1f70df7334be2b6c42a9c8f92725f67c71631690
2021-12-15 18:09:25 -08:00
Adam Retter 6477075f2c JMH microbenchmarks for RocksJava (#6241)
Summary:
This is the start of some JMH microbenchmarks for RocksJava.

Such benchmarks can help us decide on performance improvements of the Java API.

At the moment, I have only added benchmarks for various Comparator options, as that is one of the first areas where I want to improve performance. I plan to expand this to many more tests.

Details of how to compile and run the benchmarks are in the `README.md`.

A run of these on a XEON 3.5 GHz 4vCPU (QEMU Virtual CPU version 2.5+) / 8GB RAM KVM with Ubuntu 18.04, OpenJDK 1.8.0_232, and gcc 8.3.0 produced the following:

```
# Run complete. Total time: 01:43:17

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                                         (comparatorName)   Mode  Cnt       Score       Error  Units
ComparatorBenchmarks.put                           native_bytewise thrpt   25   122373.920 ±  2200.538  ops/s
ComparatorBenchmarks.put              java_bytewise_adaptive_mutex thrpt   25    17388.201 ±  1444.006  ops/s
ComparatorBenchmarks.put          java_bytewise_non-adaptive_mutex thrpt   25    16887.150 ±  1632.204  ops/s
ComparatorBenchmarks.put       java_direct_bytewise_adaptive_mutex thrpt   25    15644.572 ±  1791.189  ops/s
ComparatorBenchmarks.put   java_direct_bytewise_non-adaptive_mutex thrpt   25    14869.601 ±  2252.135  ops/s
ComparatorBenchmarks.put                   native_reverse_bytewise thrpt   25   116528.735 ±  4168.797  ops/s
ComparatorBenchmarks.put      java_reverse_bytewise_adaptive_mutex thrpt   25    10651.975 ±   545.998  ops/s
ComparatorBenchmarks.put  java_reverse_bytewise_non-adaptive_mutex thrpt   25    10514.224 ±   930.069  ops/s
```

Indicating a ~7x difference between comparators implemented natively (C++) and those implemented in Java. Let's see if we can't improve on that in the near future...
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6241

Differential Revision: D19290410

Pulled By: pdillinger

fbshipit-source-id: 25d44bf3a31de265502ed0c5d8a28cf4c7cb9c0b
2020-01-07 15:46:09 -08:00