Evolved Binary has been working on several aspects of how the Java API to RocksDB can be improved. Two aspects of this which are of particular importance are performance and the developer experience.
* We have built some synthetic benchmark code to determine which are the most efficient methods of transferring data between Java and C++.
* We have used the results of the synthetic benchmarking to guide plans for rationalising the API interfaces.
* We have made some opportunistic performance optimizations/fixes within the Java API which have already yielded noticeable improvements.
## Synthetic JNI API Performance Benchmarks
The synthetic benchmark repository contains tests designed to isolate the Java to/from C++ interaction of a canonical data intensive Key/Value Store implemented in C++ with a Java (JNI) API layered on top.
JNI provides several mechanisms for allowing transfer of data between Java buffers and C++ buffers. These mechanisms are not trivial, because they require the JNI system to ensure that Java memory under the control of the JVM is not moved or garbage collected whilst it is being accessed outside the direct control of the JVM.
We set out to determine which of multiple options for transfer of data from `C++` to `Java` and vice-versa were the most efficient. We used the [Java Microbenchmark Harness](https://github.com/openjdk/jmh) to set up repeatable benchmarks to measure all the options.
We explore these and some other potential mechanisms in the detailed results (in our [Synthetic JNI performance repository](https://github.com/evolvedbinary/jni-benchmarks/blob/main/DataBenchmarks.md)).
We summarise this work here:
### The Model
* In `C++` we represent the on-disk data as an in-memory map of `(key, value)`
pairs.
* For a fetch query, we expect the result to be a Java object with access to the
contents of the _value_. This may be a standard Java object which does the job
of data access (a `byte[]` or a `ByteBuffer`) or an object of our own devising
which holds references to the value in some form (a `FastBuffer` pointing to
`sun.misc.Unsafe` unsafe memory, for instance).
### Data Types
There are several potential data types for holding data for transfer, and they
are unsurprisingly quite connected underneath.
#### Byte Array
The simplest data container is a _raw_ array of bytes (`byte[]`).
There are 3 different mechanisms for transferring data between a `byte[]` and
We benchmarked `Put` methods in a similar synthetic fashion in less depth, but enough to confirm that the performance profile is similar/symmetrical. As with `get()`, using `GetElements` is the least performant way of implementing transfers to/from Java objects in C++/JNI, and the other JNI mechanisms do not differ greatly from one another.
## Lessons from Synthetic API
Performance analysis shows that for `get()`, fetching into an allocated `byte[]` is
as efficient as any other mechanism, as long as JNI region methods are used
for the internal data transfer. Copying out or otherwise using the
result on the Java side is straightforward and efficient. Using `byte[]` avoids the manual memory
management required with direct `nio.ByteBuffer`s, and that extra work does not
appear to provide any gain. A C++ implementation using the `GetRegion` JNI
method is probably preferable to one using `GetCritical` because, while their
performance is equal, `GetRegion` is a higher-level/simpler abstraction.
Vitally, whatever JNI transfer mechanism is chosen, the buffer allocation
mechanism and pattern are crucial to achieving good performance. We experimented
with making use of Netty's pooled allocator as part of the benchmark, and the
difference between `getIntoPooledNettyByteBuf`, which uses the allocator, and
`getIntoNettyByteBuf`, which uses the same pre-allocation on setup as every other
benchmark, is significant.
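To make the allocation point concrete, here is a minimal sketch of recycling direct buffers rather than allocating one per call. `DirectBufferPool` is a hypothetical class written for this illustration; it is not Netty's pooled allocator and not part of the RocksDB API, and it is not thread-safe.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

/** Hypothetical, single-threaded pool of direct buffers (illustration only). */
final class DirectBufferPool {
    private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();
    private final int bufferSize;
    private int allocations = 0;

    DirectBufferPool(int bufferSize) {
        this.bufferSize = bufferSize;
    }

    /** Hand out a cleared buffer, allocating only when the pool is empty. */
    ByteBuffer acquire() {
        ByteBuffer buf = free.pollFirst();
        if (buf == null) {
            allocations++;
            buf = ByteBuffer.allocateDirect(bufferSize);
        }
        buf.clear();
        return buf;
    }

    /** Return a buffer so that a later acquire() can reuse it. */
    void release(ByteBuffer buf) {
        free.addFirst(buf);
    }

    /** Number of buffers actually allocated so far. */
    int allocations() {
        return allocations;
    }

    public static void main(String[] args) {
        DirectBufferPool pool = new DirectBufferPool(4096);
        for (int i = 0; i < 1000; i++) {
            ByteBuffer buf = pool.acquire();
            buf.putInt(i); // stand-in for a get() call that fills the buffer
            pool.release(buf);
        }
        // 1000 uses, but only one allocation
        System.out.println("allocations: " + pool.allocations());
    }
}
```

A real client would probably prefer an existing pooled allocator; the sketch only shows the acquire/release pattern that keeps allocation off the hot path.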
Equally importantly, transfer of data to or from buffers should where possible
be done in bulk, using array copy or buffer copy mechanisms. Thought should
perhaps be given to supporting common transformations in the underlying C++
layer.
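As an illustration of bulk transfer on the Java side, the following sketch contrasts an element-wise copy with a single bulk `ByteBuffer.get(byte[])` call. `BulkCopyDemo` and its method names are hypothetical, and the sketch only demonstrates that the two approaches produce the same bytes; it makes no timing claims.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical demo class comparing element-wise and bulk copies. */
final class BulkCopyDemo {
    /** Element-wise copy: one bounds-checked call per byte. */
    static byte[] copyByteByByte(ByteBuffer src) {
        ByteBuffer view = src.duplicate(); // leave the caller's position untouched
        byte[] out = new byte[view.remaining()];
        for (int i = 0; i < out.length; i++) {
            out[i] = view.get();
        }
        return out;
    }

    /** Bulk copy: a single call moves the whole remaining range. */
    static byte[] copyBulk(ByteBuffer src) {
        ByteBuffer view = src.duplicate();
        byte[] out = new byte[view.remaining()];
        view.get(out);
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer value = ByteBuffer.allocateDirect(1024);
        for (int i = 0; i < 1024; i++) {
            value.put((byte) i);
        }
        value.flip();
        System.out.println(Arrays.equals(copyByteByByte(value), copyBulk(value)));
    }
}
```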
## API Recommendations
Of course there is some noise within the results, but we can agree:
Translating this into designing an efficient API, we want to:
* Support API methods that return results in buffers supplied by the client.
* Support `byte[]`-based APIs as the simplest way of getting data into a usable configuration for a broad range of Java use.
* Support direct `ByteBuffer`s as these can reduce copies when used as part of a chain of `ByteBuffer`-based operations. This sort of sophisticated streaming model is most likely to be used by clients where performance is important, and so we decide to support it.
* Support indirect `ByteBuffer`s for a combination of reasons:
  * API consistency between direct and indirect buffers
  * Simplicity of implementation, as we can wrap `byte[]`-oriented methods
* Continue to support methods which allocate return buffers per-call, as these are the easiest to use on initial encounter with the RocksDB API.
* Use more complex (client supplied buffer) API methods where performance matters
* Don't allocate/deallocate where you don't need to
  * recycle your own buffers where this makes sense
  * or make sure that you are supplying the ultimate destination buffer (your cache, or a target network buffer) as input to RocksDB `get()` and `put()` calls
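The API shapes listed above can be sketched against a hypothetical in-memory stand-in. `SketchStore`, its `NOT_FOUND` constant, and the "return the full value size" convention are assumptions made for this illustration and should not be read as the RocksDB Java API. The sketch shows an allocate-per-call `get()`, a client-supplied `byte[]` variant, and an indirect `ByteBuffer` variant implemented by wrapping the `byte[]` method.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for a key/value store (not the RocksDB API). */
final class SketchStore {
    static final int NOT_FOUND = -1;
    private final Map<String, byte[]> data = new HashMap<>();

    // ISO-8859-1 maps every byte to exactly one char, so keys round-trip losslessly.
    private static String keyOf(byte[] key) {
        return new String(key, StandardCharsets.ISO_8859_1);
    }

    void put(byte[] key, byte[] value) {
        data.put(keyOf(key), value.clone());
    }

    /** Easiest to use: allocates a fresh result array on every call. */
    byte[] get(byte[] key) {
        byte[] v = data.get(keyOf(key));
        return v == null ? null : v.clone();
    }

    /** Client-supplied buffer: copies what fits, returns the full value size. */
    int get(byte[] key, byte[] value, int offset, int length) {
        byte[] v = data.get(keyOf(key));
        if (v == null) {
            return NOT_FOUND;
        }
        System.arraycopy(v, 0, value, offset, Math.min(v.length, length));
        return v.length;
    }

    /** Indirect (heap) ByteBuffer variant, wrapping the byte[] method above. */
    int get(byte[] key, ByteBuffer value) {
        if (!value.hasArray()) {
            throw new IllegalArgumentException("this sketch handles heap buffers only");
        }
        int remaining = value.remaining();
        int size = get(key, value.array(),
                value.arrayOffset() + value.position(), remaining);
        if (size != NOT_FOUND) {
            // Leave position alone; set limit to the end of the copied data.
            value.limit(value.position() + Math.min(size, remaining));
        }
        return size;
    }

    public static void main(String[] args) {
        SketchStore store = new SketchStore();
        byte[] key = "k".getBytes(StandardCharsets.UTF_8);
        store.put(key, "hello".getBytes(StandardCharsets.UTF_8));

        byte[] reusable = new byte[64]; // allocated once, reused across calls
        int size = store.get(key, reusable, 0, reusable.length);
        System.out.println(new String(reusable, 0, size, StandardCharsets.UTF_8));
    }
}
```

Returning the full value size lets a caller detect that its buffer was too small and retry with a larger one, without the store allocating on its behalf.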
We are currently implementing a number of extra methods consistently across the Java fetch and store APIs to RocksDB in the PR [Java API consistency between RocksDB.put() , .merge() and Transaction.put() , .merge()](https://github.com/facebook/rocksdb/pull/11019) according to these principles.
## Optimizations
### Reduce Copies within API Implementation
Having analysed JNI performance as described, we reviewed the core of RocksJNI for opportunities to improve the performance. We noticed one thing in particular: some of the `get()` methods of the Java API had not been updated to take advantage of the new [`PinnableSlice`](http://rocksdb.org/blog/2017/08/24/pinnableslice.html) methods.
Fixing this turned out to be a straightforward change, which has now been incorporated in the codebase: [Improve Java API `get()` performance by reducing copies](https://github.com/facebook/rocksdb/pull/10970).
#### Performance Results
Using the JMH performance tests we updated as part of the above PR, we can see a small but consistent improvement in performance for all of the different `get()` method variants which we have enhanced in the PR.
So stage (3) costs us a copy into Java. It's mostly unavoidable that there will be at least the one copy from a C++ buffer into a Java buffer.
But what does stage (2) do?
* Create a `PinnableSlice(std::string&)` which uses the value as the slice's backing buffer.
* Call `DB::Get()` using the PinnableSlice variant
* Work out if the slice has pinned data, in which case copy the pinned data into value and release it.
* ..or, if the slice has not pinned data, it is already in value (because we tried, but couldn't pin anything).
So stage (2) costs us a copy into a `std::string`. But! It's just a naive `std::string` that we have copied a large buffer into. And in RocksDB, the buffer is or can be large, so an extra copy is something we need to worry about.
Luckily this is easy to fix. In the Java API (JNI) implementation:
1. Create a `PinnableSlice()` which uses its own default backing buffer.
2. Call `DB::Get()` using the `PinnableSlice` variant of the RocksDB API.
3. Copy the data indicated by the `PinnableSlice` straight into the Java output buffer using the JNI `SetByteArrayRegion()` method, then release the slice:
   * If the slice has successfully pinned data, copy the pinned data straight into the Java output buffer using the JNI `SetByteArrayRegion()` method, then release the pin.
   * If the slice has not pinned data, the data is in the slice's default backing buffer. All that is left is to copy it straight into the Java output buffer using the JNI `SetByteArrayRegion()` method.
In the case where the `PinnableSlice` has successfully pinned the data, this saves us the intermediate copy to the `std::string`. In the case where it hasn't, we still have the extra copy, so the observed performance improvement depends on how often the data can be pinned. Luckily, our benchmarking suggests that the pin is happening in a significant number of cases.
On discussion with the RocksDB core team we understand that the core `PinnableSlice` optimization is most likely to succeed when pages are loaded from the block cache, rather than when they are in `memtable`. And it might be possible to successfully pin in the `memtable` as well, with some extra coding effort. This would likely improve the results for these benchmarks.