rocksdb/java/GetBenchmarks.md

# RocksDB Get Performance Benchmarks

Results associated with [Improve Java API `get()` performance by reducing copies](https://github.com/facebook/rocksdb/pull/10970)

## Build/Run

Mac
```
make clean jclean
DEBUG_LEVEL=0 make -j12 rocksdbjava
(cd java/target; cp rocksdbjni-7.9.0-osx.jar rocksdbjni-7.9.0-SNAPSHOT-osx.jar)
mvn install:install-file -Dfile=./java/target/rocksdbjni-7.9.0-SNAPSHOT-osx.jar -DgroupId=org.rocksdb -DartifactId=rocksdbjni -Dversion=7.9.0-SNAPSHOT -Dpackaging=jar
```

Linux
```
make clean jclean
DEBUG_LEVEL=0 make -j12 rocksdbjava
(cd java/target; cp rocksdbjni-7.9.0-linux64.jar rocksdbjni-7.9.0-SNAPSHOT-linux64.jar)
mvn install:install-file -Dfile=./java/target/rocksdbjni-7.9.0-SNAPSHOT-linux64.jar -DgroupId=org.rocksdb -DartifactId=rocksdbjni -Dversion=7.9.0-SNAPSHOT -Dpackaging=jar
```

Build jmh test package, on either platform
```
pushd java/jmh
mvn clean package
```

A quick test run, just as a sanity check, using a small number of keys, would be
```
java -jar target/rocksdbjni-jmh-1.0-SNAPSHOT-benchmarks.jar -p keyCount=1000 -p keySize=128 -p valueSize=32768 -p columnFamilyTestType="no_column_family" GetBenchmarks
```
The long performance run (as big as we can make it on our Ubuntu box without filling the disk)
```
java -jar target/rocksdbjni-jmh-1.0-SNAPSHOT-benchmarks.jar -p keyCount=1000,50000 -p keySize=128 -p valueSize=1024,16384 -p columnFamilyTestType="1_column_family","20_column_families" GetBenchmarks.get GetBenchmarks.preallocatedByteBufferGet GetBenchmarks.preallocatedGet
```

## Results (small runs, Mac)

These are run on a 10-core M1 with 64GB of memory and 2TB of SSD.
They probably reflect the absolute best case for this optimization, hitting in-memory buffers and completely eliminating a buffer copy.

### Before
Benchmark                                (columnFamilyTestType)  (keyCount)  (keySize)  (multiGetSize)  (valueSize)   Mode  Cnt          Score        Error  Units
GetBenchmarks.get                              no_column_family        1000        128             N/A        32768  thrpt   25      43496.578 ±   5743.090  ops/s
GetBenchmarks.preallocatedByteBufferGet        no_column_family        1000        128             N/A        32768  thrpt   25      70765.578 ±    697.548  ops/s
GetBenchmarks.preallocatedGet                  no_column_family        1000        128             N/A        32768  thrpt   25      69883.554 ±    944.184  ops/s

### After fixing byte[] (.get and .preallocatedGet)

Benchmark                                (columnFamilyTestType)  (keyCount)  (keySize)  (multiGetSize)  (valueSize)   Mode  Cnt          Score        Error  Units
GetBenchmarks.get                              no_column_family        1000        128             N/A        32768  thrpt   25     149207.681 ±   2261.671  ops/s
GetBenchmarks.preallocatedByteBufferGet        no_column_family        1000        128             N/A        32768  thrpt   25      68920.489 ±   1574.664  ops/s
GetBenchmarks.preallocatedGet                  no_column_family        1000        128             N/A        32768  thrpt   25     177399.022 ±   2107.375  ops/s

### After fixing ByteBuffer (.preallocatedByteBufferGet)

Benchmark                                (columnFamilyTestType)  (keyCount)  (keySize)  (multiGetSize)  (valueSize)   Mode  Cnt          Score        Error  Units
GetBenchmarks.get                              no_column_family        1000        128             N/A        32768  thrpt   25     150389.259 ±   1371.473  ops/s
GetBenchmarks.preallocatedByteBufferGet        no_column_family        1000        128             N/A        32768  thrpt   25     179919.468 ±   1670.714  ops/s
GetBenchmarks.preallocatedGet                  no_column_family        1000        128             N/A        32768  thrpt   25     178261.938 ±   2630.571  ops/s
## Results (Ubuntu, big runs)
These take 3-4 hours
```
java -jar target/rocksdbjni-jmh-1.0-SNAPSHOT-benchmarks.jar -p keyCount=1000,50000 -p keySize=128 -p valueSize=1024,16384 -p columnFamilyTestType="1_column_family","20_column_families" GetBenchmarks.get GetBenchmarks.preallocatedByteBufferGet GetBenchmarks.preallocatedGet
```
It's clear that all `get()` variants have noticeably improved performance, though not the spectacular gains of the M1.
### With fixes for all of the `get()` instances

Benchmark                                (columnFamilyTestType)  (keyCount)  (keySize)  (valueSize)   Mode  Cnt        Score       Error  Units
GetBenchmarks.get                               1_column_family        1000        128         1024  thrpt   25   935648.793 ± 22879.910  ops/s
GetBenchmarks.get                               1_column_family        1000        128        16384  thrpt   25   204366.301 ±  1326.570  ops/s
GetBenchmarks.get                               1_column_family       50000        128         1024  thrpt   25   693451.990 ± 19822.720  ops/s
GetBenchmarks.get                               1_column_family       50000        128        16384  thrpt   25    50473.768 ±   497.335  ops/s
GetBenchmarks.get                            20_column_families        1000        128         1024  thrpt   25   550118.874 ± 14289.009  ops/s
GetBenchmarks.get                            20_column_families        1000        128        16384  thrpt   25   120545.549 ±   648.280  ops/s
GetBenchmarks.get                            20_column_families       50000        128         1024  thrpt   25   235671.353 ±  2231.195  ops/s
GetBenchmarks.get                            20_column_families       50000        128        16384  thrpt   25    12463.887 ±  1950.746  ops/s
GetBenchmarks.preallocatedByteBufferGet         1_column_family        1000        128         1024  thrpt   25  1196026.040 ± 35435.729  ops/s
GetBenchmarks.preallocatedByteBufferGet         1_column_family        1000        128        16384  thrpt   25   403252.655 ±  3287.054  ops/s
GetBenchmarks.preallocatedByteBufferGet         1_column_family       50000        128         1024  thrpt   25   829965.448 ± 16945.452  ops/s
GetBenchmarks.preallocatedByteBufferGet         1_column_family       50000        128        16384  thrpt   25    63798.042 ±  1292.858  ops/s
GetBenchmarks.preallocatedByteBufferGet      20_column_families        1000        128         1024  thrpt   25   724557.253 ± 12710.828  ops/s
GetBenchmarks.preallocatedByteBufferGet      20_column_families        1000        128        16384  thrpt   25   176846.615 ±  1121.644  ops/s
GetBenchmarks.preallocatedByteBufferGet      20_column_families       50000        128         1024  thrpt   25   263553.764 ±  1304.243  ops/s
GetBenchmarks.preallocatedByteBufferGet      20_column_families       50000        128        16384  thrpt   25    14721.693 ±  2574.240  ops/s
GetBenchmarks.preallocatedGet                   1_column_family        1000        128         1024  thrpt   25  1093947.765 ± 42846.276  ops/s
GetBenchmarks.preallocatedGet                   1_column_family        1000        128        16384  thrpt   25   391629.913 ±  4039.965  ops/s
GetBenchmarks.preallocatedGet                   1_column_family       50000        128         1024  thrpt   25   769332.958 ± 24180.749  ops/s
GetBenchmarks.preallocatedGet                   1_column_family       50000        128        16384  thrpt   25    61712.038 ±   423.494  ops/s
GetBenchmarks.preallocatedGet                20_column_families        1000        128         1024  thrpt   25   694684.465 ±  5484.205  ops/s
GetBenchmarks.preallocatedGet                20_column_families        1000        128        16384  thrpt   25   172383.593 ±   841.679  ops/s
GetBenchmarks.preallocatedGet                20_column_families       50000        128         1024  thrpt   25   257447.351 ±  1388.667  ops/s
GetBenchmarks.preallocatedGet                20_column_families       50000        128        16384  thrpt   25    13418.522 ±  2418.619  ops/s

### Baseline (no fixes)

Benchmark                                (columnFamilyTestType)  (keyCount)  (keySize)  (valueSize)   Mode  Cnt        Score       Error  Units
GetBenchmarks.get                               1_column_family        1000        128         1024  thrpt   25   866745.224 ±  8834.629  ops/s
GetBenchmarks.get                               1_column_family        1000        128        16384  thrpt   25   184332.195 ±  2304.217  ops/s
GetBenchmarks.get                               1_column_family       50000        128         1024  thrpt   25   666794.288 ± 16150.684  ops/s
GetBenchmarks.get                               1_column_family       50000        128        16384  thrpt   25    47221.788 ±   433.165  ops/s
GetBenchmarks.get                            20_column_families        1000        128         1024  thrpt   25   551513.636 ±  7763.681  ops/s
GetBenchmarks.get                            20_column_families        1000        128        16384  thrpt   25   113117.720 ±   580.738  ops/s
GetBenchmarks.get                            20_column_families       50000        128         1024  thrpt   25   238675.555 ±  1758.978  ops/s
GetBenchmarks.get                            20_column_families       50000        128        16384  thrpt   25    11639.390 ±  1459.765  ops/s
GetBenchmarks.preallocatedByteBufferGet         1_column_family        1000        128         1024  thrpt   25  1153617.917 ± 26350.028  ops/s
GetBenchmarks.preallocatedByteBufferGet         1_column_family        1000        128        16384  thrpt   25   401710.334 ±  4324.539  ops/s
GetBenchmarks.preallocatedByteBufferGet         1_column_family       50000        128         1024  thrpt   25   809384.073 ± 13833.871  ops/s
GetBenchmarks.preallocatedByteBufferGet         1_column_family       50000        128        16384  thrpt   25    59279.005 ±   443.207  ops/s
GetBenchmarks.preallocatedByteBufferGet      20_column_families        1000        128         1024  thrpt   25   715466.403 ±  6591.375  ops/s
GetBenchmarks.preallocatedByteBufferGet      20_column_families        1000        128        16384  thrpt   25   175279.163 ±   910.923  ops/s
GetBenchmarks.preallocatedByteBufferGet      20_column_families       50000        128         1024  thrpt   25   263295.180 ±   856.456  ops/s
GetBenchmarks.preallocatedByteBufferGet      20_column_families       50000        128        16384  thrpt   25    14001.928 ±  2462.067  ops/s
GetBenchmarks.preallocatedGet                   1_column_family        1000        128         1024  thrpt   25  1072866.854 ± 27030.592  ops/s
GetBenchmarks.preallocatedGet                   1_column_family        1000        128        16384  thrpt   25   383950.853 ±  4510.654  ops/s
GetBenchmarks.preallocatedGet                   1_column_family       50000        128         1024  thrpt   25   764395.469 ± 10097.417  ops/s
GetBenchmarks.preallocatedGet                   1_column_family       50000        128        16384  thrpt   25    56851.330 ±   388.029  ops/s
GetBenchmarks.preallocatedGet                20_column_families        1000        128         1024  thrpt   25   668518.593 ±  9764.117  ops/s
GetBenchmarks.preallocatedGet                20_column_families        1000        128        16384  thrpt   25   171309.695 ±   875.895  ops/s
GetBenchmarks.preallocatedGet                20_column_families       50000        128         1024  thrpt   25   256057.801 ±   954.621  ops/s
GetBenchmarks.preallocatedGet                20_column_families       50000        128        16384  thrpt   25    13319.380 ±  2126.654  ops/s

### Comparison

It does at least look best when the data is cached. That is to say, smallest number of column families, and least keys.

GetBenchmarks.get                               1_column_family        1000        128        16384  thrpt   25   204366.301 ±  1326.570  ops/s
GetBenchmarks.get                               1_column_family        1000        128        16384  thrpt   25   184332.195 ±  2304.217  ops/s

GetBenchmarks.get                               1_column_family       50000        128        16384  thrpt   25    50473.768 ±   497.335  ops/s
GetBenchmarks.get                               1_column_family       50000        128        16384  thrpt   25    47221.788 ±   433.165  ops/s

GetBenchmarks.get                            20_column_families        1000        128        16384  thrpt   25   120545.549 ±   648.280  ops/s
GetBenchmarks.get                            20_column_families        1000        128        16384  thrpt   25   113117.720 ±   580.738  ops/s

GetBenchmarks.get                            20_column_families       50000        128        16384  thrpt   25    12463.887 ±  1950.746  ops/s
GetBenchmarks.get                            20_column_families       50000        128        16384  thrpt   25    11639.390 ±  1459.765  ops/s

### Baseline
25 minute run, small number of keys
```
java -jar target/rocksdbjni-jmh-1.0-SNAPSHOT-benchmarks.jar -p keyCount=1000 -p keySize=128 -p valueSize=32768 -p columnFamilyTestType="no_column_families" GetBenchmarks.get GetBenchmarks.preallocatedByteBufferGet GetBenchmarks.preallocatedGet
```

Benchmark                                (columnFamilyTestType)  (keyCount)  (keySize)  (valueSize)   Mode  Cnt      Score     Error  Units
GetBenchmarks.get                            no_column_families        1000        128        32768  thrpt   25  32344.908 ± 296.651  ops/s
GetBenchmarks.preallocatedByteBufferGet      no_column_families        1000        128        32768  thrpt   25  45266.968 ± 424.514  ops/s
GetBenchmarks.preallocatedGet                no_column_families        1000        128        32768  thrpt   25  43531.088 ± 291.785  ops/s

### Optimized

Benchmark                                (columnFamilyTestType)  (keyCount)  (keySize)  (valueSize)   Mode  Cnt      Score     Error  Units
GetBenchmarks.get                            no_column_families        1000        128        32768  thrpt   25  37463.716 ± 235.744  ops/s
GetBenchmarks.preallocatedByteBufferGet      no_column_families        1000        128        32768  thrpt   25  48946.105 ± 466.463  ops/s
GetBenchmarks.preallocatedGet                no_column_families        1000        128        32768  thrpt   25  47143.624 ± 576.763  ops/s

## Conclusion

The performance improvement is real.