rocksdb

Commit Graph

Author	SHA1	Message	Date
Alan Paxton	f8969ad7d4	Improve Java API get() performance by reducing copies (#10970 ) Summary: Performance improvements for `get()` paths in the RocksJava API (JNI). Document describing the performance results. Replace uses of the legacy `DB::Get()` method wrapper returning data in a `std::string` with direct calls to `DB::Get()` passing a pinnable slice to receive this data. Copying from a pinned slice direct to the destination java byte array, without going via an intervening std::string, is a major performance gain for this code path. Note that this gain only comes where `DB::Get()` is able to return a pinned buffer; where it has to copy into the buffer owned by the slice, there is still the intervening copy and no performance gain. It may be possible to address this case too, but it is not trivial. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10970 Reviewed By: pdillinger Differential Revision: D42125567 Pulled By: ajkr fbshipit-source-id: b7a4df7523b0420cadb1e9b6c7da3ec030a8da34	2022-12-21 11:54:24 -08:00
Alan Paxton	6a8920f988	JNI native memory leak - release array elements (#10981 ) Summary: Closes https://github.com/facebook/rocksdb/issues/10980 Reproduced as per the suggestion in the ticket, and `$ jcmd <PID> VM.native_memory | grep Internal` reports that we are no longer leaking internal memory with the suggested fix. I did the repro in `MultiGetTest.java` which I have optimized imports on. It did not seem helpful to leave the test code around as it would be onerous to build a memory leak reproducer, and regression seems a remote possibility. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10981 Reviewed By: riversand963 Differential Revision: D41498748 Pulled By: ajkr fbshipit-source-id: 8c6dd0d608172879c8bda479c7c9c05c12d34e70	2022-12-14 10:49:32 -08:00
anand76	ecba6a320e	Add some async read stats (#10947 ) Summary: Add stats for time spent in the ReadAsync call, and async read errors. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10947 Test Plan: Run db_bench and look at stats Reviewed By: akankshamahajan15 Differential Revision: D41236637 Pulled By: anand1976 fbshipit-source-id: 70539b69a28491d57acead449436a761f7108acf	2022-11-13 21:38:35 -08:00
Alan Paxton	17553bdd5e	RocksJava API - fix Transaction.multiGet() size limit, remove bogus EnsureLocalCapacity() calls (#10674 ) Summary: Resolves https://github.com/facebook/rocksdb/issues/9006 Fixes 2 related issues with JNI local references in the RocksJava API. 1. Some instances of RocksJava API JNI code appear to have misunderstood the reason for `JNIEnv->EnsureLocalCapacity()` and are carrying out bogus checks which happen to fail with some larger parameter values (many column families in a single call, very long key names or values). Remove these checks and add some regression tests for the previous failures. 2. The helper for Transaction multiGet operations (`multiGet()`, `multiGetForUpdate()`,...) is limited in the number of keys it can `get()` for because it requires a corresponding number of live local references. Refactor the helper slightly, copying out the key contents within a loop so that the references don't have to exist at the same time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10674 Reviewed By: ajkr Differential Revision: D40515361 Pulled By: jay-zhuang fbshipit-source-id: f1be0126181a698b3ad27c0945a39c54d950aa25	2022-10-26 17:25:33 -07:00
Brendan MacDonell	5f915b447d	Fix ChecksumType::kXXH3 in the Java API (#10862 ) Summary: While PR#9749 nominally added support for XXH3 in the Java API, it did not update the `toCppChecksumType` method. As a result, setting the checksum type to XXH3 actually set it to CRC32c instead. This commit adds the missing entry to portal.h, and also updates the tests so that they verify the options passed to RocksDB, instead of simply checking that the getter returns the value set by the setter. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10862 Reviewed By: pdillinger Differential Revision: D40665031 Pulled By: ajkr fbshipit-source-id: 2834419b3361a4bac47db3b858951fb451b5bdc8	2022-10-25 19:25:44 -07:00
sdong	2a551976f4	Run format check for .h and .cc files under java/ (#10851 ) Summary: Run format check for .h and .cc files to clean the format Pull Request resolved: https://github.com/facebook/rocksdb/pull/10851 Test Plan: Watch CI tests to pass Reviewed By: ajkr Differential Revision: D40649723 fbshipit-source-id: 62d32cead0b3b8e6540e86d25451bd72642109eb	2022-10-25 09:26:51 -07:00
anand76	72a3fb3424	Update statistics for async scan readaheads (#10585 ) Summary: Imported a fix to "rocksdb.prefetched.bytes.discarded" stat from https://github.com/facebook/rocksdb/issues/10561, and added a new stat "rocksdb.async.prefetch.abort.micros" to measure time spent waiting for async reads to abort. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10585 Reviewed By: akankshamahajan15 Differential Revision: D39067000 Pulled By: anand1976 fbshipit-source-id: d7cda71abb48017239bd5fd832345a16c7024faf	2022-08-29 14:37:44 -07:00
Levi Tamasi	64e74723f7	Use the default metadata charge policy when creating an LRU cache via the Java API (#10577 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10577 Reviewed By: akankshamahajan15 Differential Revision: D39035884 Pulled By: ltamasi fbshipit-source-id: 48f116f8ca172b7eb5eb3651f39ddb891a7ffade	2022-08-29 09:42:04 -07:00
Brendan MacDonell	418b36a9bc	Support CompactionPri::kRoundRobin in RocksJava (#10572 ) Summary: Pretty trivial — this PR just adds the new compaction priority to the Java API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10572 Reviewed By: hx235 Differential Revision: D39006523 Pulled By: ajkr fbshipit-source-id: ea8d665817e7b05826c397afa41c3abcda81484e	2022-08-25 13:32:03 -07:00
Gang Liao	275cd80cdb	Add a blob-specific cache priority (#10461 ) Summary: RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them. This task is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10461 Reviewed By: siying Differential Revision: D38672823 Pulled By: ltamasi fbshipit-source-id: 90cf7362036563d79891f47be2cc24b827482743	2022-08-12 17:59:06 -07:00
sdong	9277569ba3	Add some missing headers (#10519 ) Summary: Some files are missing headers, and some headers are irregular. Fix them to make an internal checkup tool happy. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10519 Reviewed By: jay-zhuang Differential Revision: D38603291 fbshipit-source-id: 13b1bbd6d48f5ee15ba20da67544396de48238f1	2022-08-11 12:45:50 -07:00
Peter Dillinger	65036e4217	Revert "Add a blob-specific cache priority (#10309 )" (#10434 ) Summary: This reverts commit `8d178090be` because of a clear performance regression seen in internal dashboard https://fburl.com/unidash/tpz75iee Pull Request resolved: https://github.com/facebook/rocksdb/pull/10434 Reviewed By: ltamasi Differential Revision: D38256373 Pulled By: pdillinger fbshipit-source-id: 134aa00f50dd7b1bbe037c227884a351342ec44b	2022-07-29 07:18:15 -07:00
Gang Liao	8d178090be	Add a blob-specific cache priority (#10309 ) Summary: RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them. This task is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10309 Reviewed By: ltamasi Differential Revision: D38211655 Pulled By: gangliao fbshipit-source-id: 65ef33337db4d85277cc6f9782d67c421ad71dd5	2022-07-27 19:09:24 -07:00
Gang Liao	ec4ebeff30	Support prepopulating/warming the blob cache (#10298 ) Summary: Many workloads have temporal locality, where recently written items are read back in a short period of time. When using remote file systems, this is inefficient since it involves network traffic and higher latencies. Because of this, we would like to support prepopulating the blob cache during flush. This task is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10298 Reviewed By: ltamasi Differential Revision: D37908743 Pulled By: gangliao fbshipit-source-id: 9feaed234bc719d38f0c02975c1ad19fa4bb37d1	2022-07-17 07:13:59 -07:00
Guido Tagliavini Ponce	7e1b417824	Revert NewClockCache signature (#10358 ) Summary: This complements https://github.com/facebook/rocksdb/issues/10351. This PR reverts NewClockCache's signature to an older version, expected by the users of the old (buggy) ClockCache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10358 Test Plan: ``make -j24 check`` and re-run the pre-release tests. Reviewed By: siying Differential Revision: D37832601 Pulled By: guidotag fbshipit-source-id: 32a91d3da4119be187935003b7b897272ceb1950	2022-07-13 17:43:39 -07:00
Guido Tagliavini Ponce	57a0e2f304	Clock cache (#10273 ) Summary: This is the initial step in the development of a lock-free clock cache. This PR includes the base hash table design (which we mostly ported over from FastLRUCache) and the clock eviction algorithm. Importantly, it's still _not_ lock-free---all operations use a shard lock. Besides the locking, there are other features left as future work: - Remove keys from the handles. Instead, use 128-bit bijective hashes of them for handle comparisons, probing (we need two 32-bit hashes of the key for double hashing) and sharding (we need one 6-bit hash). - Remove the clock_usage_ field, which is updated on every lookup. Even if it were atomically updated, it could cause memory invalidations across cores. - Middle insertions into the clock list. - A test that exercises the clock eviction policy. - Update the Java API of ClockCache and Java calls to C++. Along the way, we improved the code and comments quality of FastLRUCache. These changes are relatively minor. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10273 Test Plan: ``make -j24 check`` Reviewed By: pdillinger Differential Revision: D37522461 Pulled By: guidotag fbshipit-source-id: 3d70b737dbb70dcf662f00cef8c609750f083943	2022-06-29 21:50:39 -07:00
Gang Liao	d7ebb58cb5	Add blob cache tickers, perf context statistics, and DB properties (#10203 ) Summary: In order to be able to monitor the performance of the new blob cache, we made the following changes: - Add blob cache hit/miss/insertion tickers (see https://github.com/facebook/rocksdb/wiki/Statistics) - Extend the perf context similarly (see https://github.com/facebook/rocksdb/wiki/Perf-Context-and-IO-Stats-Context) - Implement new DB properties (see e.g. https://github.com/facebook/rocksdb/blob/main/include/rocksdb/db.h#L1042-L1051) that expose the capacity and current usage of the blob cache. This PR is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10203 Reviewed By: ltamasi Differential Revision: D37478658 Pulled By: gangliao fbshipit-source-id: d8ee3f41d47315ef725e4551226330b4b6832e40	2022-06-28 13:52:35 -07:00
Baptiste Lemaire	5879053fd0	Dynamically changeable `MemPurge` option (#10011 ) Summary: Make the mempurge option flag a Mutable Column Family option flag, so that the mempurge feature can be dynamically toggled. Motivation: RocksDB users prefer having the ability to switch features on and off without having to close and reopen the DB. This is particularly important if the feature causes issues and needs to be turned off. Dynamically changing a DB option flag does not seem currently possible. Moreover, with this new change, the MemPurge feature can be toggled on or off independently between column families, which we see as a major improvement. Content of this PR: This PR includes removal of the `experimental_mempurge_threshold` flag as a DB option flag, and its re-introduction as a `MutableCFOption` flag. I updated the code to handle dynamic changes of the flag (in particular inside the `FlushJob` file). Additionally, this PR includes a new test to demonstrate the capacity of the code to toggle the MemPurge feature on and off, as well as the addition in the `db_stress` module of 2 different mempurge threshold values (0.0 and 1.0) that can be randomly changed with the `set_option_one_in` flag. This is useful to stress test the dynamic changes. Benchmarking: I will add numbers to prove that there is no performance impact within the next 12 hours. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10011 Reviewed By: pdillinger Differential Revision: D36462357 Pulled By: bjlemaire fbshipit-source-id: 5e3d63bdadf085c0572ecc2349e7dd9729ce1802	2022-06-23 09:42:18 -07:00
Levi Tamasi	7b2c0140ba	Fix Java build (#10105 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10105 Reviewed By: cbi42 Differential Revision: D36891073 Pulled By: ltamasi fbshipit-source-id: 16487ec708fc96add2a1ebc2d98f6439dfc852ca	2022-06-03 08:11:31 -07:00
Gang Liao	e6432dfd4c	Make it possible to enable blob files starting from a certain LSM tree level (#10077 ) Summary: Currently, if blob files are enabled (i.e. `enable_blob_files` is true), large values are extracted both during flush/recovery (when SST files are written into level 0 of the LSM tree) and during compaction into any LSM tree level. For certain use cases that have a mix of short-lived and long-lived values, it might make sense to support extracting large values only during compactions whose output level is greater than or equal to a specified LSM tree level (e.g. compactions into L1/L2/... or above). This could reduce the space amplification caused by large values that are turned into garbage shortly after being written at the price of some write amplification incurred by long-lived values whose extraction to blob files is delayed. In order to achieve this, we would like to do the following: - Add a new configuration option `blob_file_starting_level` (default: 0) to `AdvancedColumnFamilyOptions` (and `MutableCFOptions` and extend the related logic) - Instantiate `BlobFileBuilder` in `BuildTable` (used during flush and recovery, where the LSM tree level is L0) and `CompactionJob` iff `enable_blob_files` is set and the LSM tree level is `>= blob_file_starting_level` - Add unit tests for the new functionality, and add the new option to our stress tests (`db_stress` and `db_crashtest.py` ) - Add the new option to our benchmarking tool `db_bench` and the BlobDB benchmark script `run_blob_bench.sh` - Add the new option to the `ldb` tool (see https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool) - Ideally extend the C and Java bindings with the new option - Update the BlobDB wiki to document the new option. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10077 Reviewed By: ltamasi Differential Revision: D36884156 Pulled By: gangliao fbshipit-source-id: 942bab025f04633edca8564ed64791cb5e31627d	2022-06-02 20:04:33 -07:00
Changyu Bi	cc23b46da1	Support using ZDICT_finalizeDictionary to generate zstd dictionary (#9857 ) Summary: An untrained dictionary is currently simply the concatenation of several samples. The ZSTD API, ZDICT_finalizeDictionary(), can improve such a dictionary's effectiveness at low cost. This PR changes how the dictionary is created by calling the ZSTD ZDICT_finalizeDictionary() API instead of creating a raw content dictionary (when max_dict_buffer_bytes > 0), and passing in all buffered uncompressed data blocks as samples. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9857 Test Plan: #### db_bench test for cpu/memory of compression+decompression and space saving on synthetic data: Set up: change the parameter [here](`fb9a167a55/tools/db_bench_tool.cc (L1766)`) to 16384 to make synthetic data more compressible. ``` # linked local ZSTD with version 1.5.2 # DEBUG_LEVEL=0 ROCKSDB_NO_FBCODE=1 ROCKSDB_DISABLE_ZSTD=1 EXTRA_CXXFLAGS="-DZSTD_STATIC_LINKING_ONLY -DZSTD -I/data/users/changyubi/install/include/" EXTRA_LDFLAGS="-L/data/users/changyubi/install/lib/ -l:libzstd.a" make -j32 db_bench dict_bytes=16384 train_bytes=1048576 echo "========== No Dictionary ==========" TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=0 -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1 TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=0 -block_size=4096 2>&1 | grep elapsed du -hc /dev/shm/dbbench/sst | grep total echo "========== Raw Content Dictionary ==========" TEST_TMPDIR=/dev/shm ./db_bench_main -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1 TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench_main -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -block_size=4096 2>&1 | grep elapsed du -hc /dev/shm/dbbench/sst | grep total echo "========== FinalizeDictionary ==========" TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1 TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false -block_size=4096 2>&1 | grep elapsed du -hc /dev/shm/dbbench/sst | grep total echo "========== TrainDictionary ==========" TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1 TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -block_size=4096 2>&1 | grep elapsed du -hc /dev/shm/dbbench/sst | grep total # Result: TrainDictionary is much better on space saving, but FinalizeDictionary seems to use less memory. # before compression data size: 1.2GB dict_bytes=16384 max_dict_buffer_bytes = 1048576 space cpu/memory No Dictionary 468M 14.93user 1.00system 0:15.92elapsed 100%CPU (0avgtext+0avgdata 23904maxresident)k Raw Dictionary 251M 15.81user 0.80system 0:16.56elapsed 100%CPU (0avgtext+0avgdata 156808maxresident)k FinalizeDictionary 236M 11.93user 0.64system 0:12.56elapsed 100%CPU (0avgtext+0avgdata 89548maxresident)k TrainDictionary 84M 7.29user 0.45system 0:07.75elapsed 100%CPU (0avgtext+0avgdata 97288maxresident)k ``` #### Benchmark on 10 sample SST files for space saving and CPU time on compression: FinalizeDictionary is comparable to TrainDictionary in terms of space saving, and takes less time in compression. ``` dict_bytes=16384 train_bytes=1048576 for sst_file in `ls ../temp/myrock-sst/` do echo "******** $sst_file ********" echo "========== No Dictionary ==========" ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD echo "========== Raw Content Dictionary ==========" ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes echo "========== FinalizeDictionary ==========" ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes --compression_zstd_max_train_bytes=$train_bytes --compression_use_zstd_finalize_dict echo "========== TrainDictionary ==========" ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes --compression_zstd_max_train_bytes=$train_bytes done 010240.sst (Size/Time) 011029.sst 013184.sst 021552.sst 185054.sst 185137.sst 191666.sst 7560381.sst 7604174.sst 7635312.sst No Dictionary 28165569 / 2614419 32899411 / 2976832 32977848 / 3055542 31966329 / 2004590 33614351 / 1755877 33429029 / 1717042 33611933 / 1776936 33634045 / 2771417 33789721 / 2205414 33592194 / 388254 Raw Content Dictionary 28019950 / 2697961 33748665 / 3572422 33896373 / 3534701 26418431 / 2259658 28560825 / 1839168 28455030 / 1846039 28494319 / 1861349 32391599 / 3095649 33772142 / 2407843 33592230 / 474523 FinalizeDictionary 27896012 / 2650029 33763886 / 3719427 33904283 / 3552793 26008225 / 2198033 28111872 / 1869530 28014374 / 1789771 28047706 / 1848300 32296254 / 3204027 33698698 / 2381468 33592344 / 517433 TrainDictionary 28046089 / 2740037 33706480 / 3679019 33885741 / 3629351 25087123 / 2204558 27194353 / 1970207 27234229 / 1896811 27166710 / 1903119 32011041 / 3322315 32730692 / 2406146 33608631 / 570593 ``` #### Decompression/Read test: With FinalizeDictionary/TrainDictionary, some data structures used for decompression are stored in the dictionary, so they are expected to be faster in terms of decompression/reads. ``` dict_bytes=16384 train_bytes=1048576 echo "No Dictionary" TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=0 > /dev/null 2>&1 TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=0 2>&1 | grep MB/s echo "Raw Dictionary" TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes > /dev/null 2>&1 TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes 2>&1 | grep MB/s echo "FinalizeDict" TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false > /dev/null 2>&1 TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false 2>&1 | grep MB/s echo "Train Dictionary" TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes > /dev/null 2>&1 TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes 2>&1 | grep MB/s No Dictionary readrandom : 12.183 micros/op 82082 ops/sec 12.183 seconds 1000000 operations; 9.1 MB/s (1000000 of 1000000 found) Raw Dictionary readrandom : 12.314 micros/op 81205 ops/sec 12.314 seconds 1000000 operations; 9.0 MB/s (1000000 of 1000000 found) FinalizeDict readrandom : 9.787 micros/op 102180 ops/sec 9.787 seconds 1000000 operations; 11.3 MB/s (1000000 of 1000000 found) Train Dictionary readrandom : 9.698 micros/op 103108 ops/sec 9.699 seconds 1000000 operations; 11.4 MB/s (1000000 of 1000000 found) ``` Reviewed By: ajkr Differential Revision: D35720026 Pulled By: cbi42 fbshipit-source-id: 24d230fdff0fd28a1bb650658798f00dfcfb2a1f	2022-05-20 12:09:09 -07:00
anand76	57997ddaaf	Multi file concurrency in MultiGet using coroutines and async IO (#9968 ) Summary: This PR implements a coroutine version of batched MultiGet in order to concurrently read from multiple SST files in a level using async IO, thus reducing the latency of the MultiGet. The API from the user perspective is still synchronous and single threaded, with the RocksDB part of the processing happening in the context of the caller's thread. In Version::MultiGet, the decision is made whether to call synchronous or coroutine code. A good way to review this PR is to review the first 4 commits in order - de773b3, 70c2f70, 10b50e1, and 377a597 - before reviewing the rest. TODO: 1. Figure out how to build it in CircleCI (requires some dependencies to be installed) 2. Do some stress testing with coroutines enabled No regression in synchronous MultiGet between this branch and main - ``` ./db_bench -use_existing_db=true --db=/data/mysql/rocksdb/prefix_scan -benchmarks="readseq,multireadrandom" -key_size=32 -value_size=512 -num=5000000 -batch_size=64 -multiread_batched=true -use_direct_reads=false -duration=60 -ops_between_duration_checks=1 -readonly=true -adaptive_readahead=true -threads=16 -cache_size=10485760000 -async_io=false -multiread_stride=40000 -statistics ``` Branch - ```multireadrandom : 4.025 micros/op 3975111 ops/sec 60.001 seconds 238509056 operations; 2062.3 MB/s (14767808 of 14767808 found)``` Main - ```multireadrandom : 3.987 micros/op 4013216 ops/sec 60.001 seconds 240795392 operations; 2082.1 MB/s (15231040 of 15231040 found)``` More benchmarks in various scenarios are given below. The measurements were taken with ```async_io=false``` (no coroutines) and ```async_io=true``` (use coroutines). For an IO bound workload (with every key requiring an IO), the coroutines version shows a clear benefit, being ~2.6X faster. For CPU bound workloads, the coroutines version has ~6-15% higher CPU utilization, depending on how many keys overlap an SST file. 1. Single thread IO bound workload on remote storage with sparse MultiGet batch keys (~1 key overlap/file) - No coroutines - ```multireadrandom : 831.774 micros/op 1202 ops/sec 60.001 seconds 72136 operations; 0.6 MB/s (72136 of 72136 found)``` Using coroutines - ```multireadrandom : 318.742 micros/op 3137 ops/sec 60.003 seconds 188248 operations; 1.6 MB/s (188248 of 188248 found)``` 2. Single thread CPU bound workload (all data cached) with ~1 key overlap/file - No coroutines - ```multireadrandom : 4.127 micros/op 242322 ops/sec 60.000 seconds 14539384 operations; 125.7 MB/s (14539384 of 14539384 found)``` Using coroutines - ```multireadrandom : 4.741 micros/op 210935 ops/sec 60.000 seconds 12656176 operations; 109.4 MB/s (12656176 of 12656176 found)``` 3. Single thread CPU bound workload with ~2 key overlap/file - No coroutines - ```multireadrandom : 3.717 micros/op 269000 ops/sec 60.000 seconds 16140024 operations; 139.6 MB/s (16140024 of 16140024 found)``` Using coroutines - ```multireadrandom : 4.146 micros/op 241204 ops/sec 60.000 seconds 14472296 operations; 125.1 MB/s (14472296 of 14472296 found)``` 4. CPU bound multi-threaded (16 threads) with ~4 key overlap/file - No coroutines - ```multireadrandom : 4.534 micros/op 3528792 ops/sec 60.000 seconds 211728728 operations; 1830.7 MB/s (12737024 of 12737024 found) ``` Using coroutines - ```multireadrandom : 4.872 micros/op 3283812 ops/sec 60.000 seconds 197030096 operations; 1703.6 MB/s (12548032 of 12548032 found) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9968 Reviewed By: akankshamahajan15 Differential Revision: D36348563 Pulled By: anand1976 fbshipit-source-id: c0ce85a505fd26ebfbb09786cbd7f25202038696	2022-05-19 15:36:27 -07:00
sdong	736a7b5433	Remove own ToString() (#9955 ) Summary: ToString() was created because some platforms do not support std::to_string(). However, we have already been using std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just removes ToString(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955 Test Plan: Watch CI tests Reviewed By: riversand963 Differential Revision: D36176799 fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471	2022-05-06 13:03:58 -07:00
Akanksha Mahajan	3653029dda	Add stats related to async prefetching (#9845 ) Summary: Add stats PREFETCHED_BYTES_DISCARDED and POLL_WAIT_MICROS. PREFETCHED_BYTES_DISCARDED records the number of prefetched bytes discarded by FilePrefetchBuffer. POLL_WAIT_MICROS records the time taken by the underlying file_system Poll API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9845 Test Plan: Update existing tests Reviewed By: anand1976 Differential Revision: D35909694 Pulled By: akankshamahajan15 fbshipit-source-id: e009ef940bb9ed72c9446f5529095caabb8a1e36	2022-04-25 21:58:22 -07:00
Akanksha Mahajan	0b8f885939	Update stats for Read and ReadAsync in random_access_file_reader for async prefetching (#9810 ) Summary: Update stats in random_access_file_reader for Read and ReadAsync API to take into account the read latency for async prefetching. It also fixes ERROR_HANDLER_AUTORESUME_RETRY_COUNT stat whose value was incorrect in portal.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/9810 Test Plan: Update unit test Reviewed By: anand1976 Differential Revision: D35433081 Pulled By: akankshamahajan15 fbshipit-source-id: aeec3901270e58a003ce6b5214bd25ddcb3a12a9	2022-04-06 14:26:53 -07:00
Peter Dillinger	6534c6dea4	Fix remaining uses of "backupable" (#9792 ) Summary: Various renaming and fixes to get rid of remaining uses of "backupable" which is terminology leftover from the original, flawed design of BackupableDB. Now any DB can be backed up, using BackupEngine. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9792 Test Plan: CI Reviewed By: ajkr Differential Revision: D35334386 Pulled By: pdillinger fbshipit-source-id: 2108a42b4575c8cccdfd791c549aae93ec2f3329	2022-04-05 09:52:33 -07:00
Alan Paxton	b6ad0d958f	Fb 9718 verify checksums is ignored (#9767 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/9718 The verify_checksums flag of read_options should be passed to the read options used by the BlockFetcher in a couple of cases where it is not at present. It will now happen (but did not, previously) on iteration and on [multi]get, where a fetcher is created as part of the iterate/get call. This may result in much better performance in a few workloads where the client chooses to remove verification. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9767 Reviewed By: mrambacher Differential Revision: D35218986 Pulled By: jay-zhuang fbshipit-source-id: 329d29764bb70fbc7f2673440bc46c107a813bc8	2022-03-29 11:54:54 -07:00
Jermy Li	b83263bbe4	jni: uniformly use GetByteArrayRegion() to copy bytes (#9380 ) Summary: Uniformly use GetByteArrayRegion() instead of GetByteArrayElements() to copy bytes. In addition, it can avoid an inefficient ReleaseByteArrayElements() operation. Some benefits of GetByteArrayRegion() can be referred to: https://stackoverflow.com/a/2480493 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9380 Reviewed By: ajkr Differential Revision: D35135474 Pulled By: jay-zhuang fbshipit-source-id: a32c1774d37f2d22b9bcd105d83e0bb984b71b54	2022-03-25 10:24:58 -07:00
Alan Paxton	dec144f172	Extend Java RocksDB iterators to support indirect Byte Buffers (#9222 ) Summary: Extend Java RocksDB iterators to support indirect byte buffers, to add to the existing support for direct byte buffers. Code to distinguish direct/indirect buffers is switched in Java, and a 2nd separate JNI call implemented to support indirect buffers. Indirect support passes contained buffers using byte[] There are some Java subclasses of iterator (WBWIIterator, SstFileReaderIterator) which also now have parallel JNI support functions implemented, along with direct/indirect switches in Java methods. Closes https://github.com/facebook/rocksdb/issues/6282 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9222 Reviewed By: ajkr Differential Revision: D35115283 Pulled By: jay-zhuang fbshipit-source-id: f8d5d20b975aef700560fbcc99f707bb028dc42e	2022-03-24 12:50:38 -07:00
Alan Paxton	8ae0c33a7a	Add new checksum type kXXH3 to Java API (#9749 ) Summary: Fix https://github.com/facebook/rocksdb/issues/9720 And make a couple of incidental tests test the thing they were meant to test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9749 Reviewed By: ajkr Differential Revision: D35115298 Pulled By: jay-zhuang fbshipit-source-id: d687d1f070d29216be9693601c71131bbea87c79	2022-03-24 12:33:12 -07:00
Tomas Kolda	9e05c5e251	NPE in Java_org_rocksdb_ColumnFamilyOptions_setSstPartitionerFactory (#9622 ) Summary: There was a mistake that incorrectly cast SstPartitionerFactory (missed shared pointer). It worked for database (correct cast), but not for family. Trying to set it in family has caused Access violation. I have also added test and improved it. Older version was passing even without sst partitioner which is weird, because on Level1 we had two SST files with same key "aaaa1". I was not sure if it is a new feature and changed it to overlapping keys "aaaa0" - "aaaa2" overlaps "aaaa1". Pull Request resolved: https://github.com/facebook/rocksdb/pull/9622 Reviewed By: ajkr Differential Revision: D34871968 Pulled By: pdillinger fbshipit-source-id: a08009766da49fc198692a610e8beb19caf737e6	2022-03-14 14:12:30 -07:00
Si Ke	06c8afeff5	Fix pointer to jlong conversion in 32 bit OS (#9396 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9396 Reviewed By: jay-zhuang Differential Revision: D34529654 Pulled By: pdillinger fbshipit-source-id: cf62152ba86b02f9ffa7780f370ad49089e56a0b	2022-03-01 09:02:15 -08:00
Akanksha Mahajan	3699b171e4	Change enum SizeApproximationFlags to enum class (#9604 ) Summary: Change enum SizeApproximationFlags to enum class and add overloaded operators for the transition between enum class and uint8_t Pull Request resolved: https://github.com/facebook/rocksdb/pull/9604 Test Plan: Circle CI jobs Reviewed By: riversand963 Differential Revision: D34360281 Pulled By: akankshamahajan15 fbshipit-source-id: 6351dfdb717ae3c4530d324c3d37a8ecb01dd1ef	2022-02-18 20:22:57 -08:00
Jay Zhuang	f4b2500e12	Add last level and non-last level read statistics (#9519 ) Summary: Add last level and non-last level read statistics: ``` LAST_LEVEL_READ_BYTES, LAST_LEVEL_READ_COUNT, NON_LAST_LEVEL_READ_BYTES, NON_LAST_LEVEL_READ_COUNT, ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9519 Test Plan: added unittest Reviewed By: siying Differential Revision: D34062539 Pulled By: jay-zhuang fbshipit-source-id: 908644c3050878b4234febdc72e3e19d89af38cd	2022-02-18 14:23:07 -08:00
Alan Paxton	36ce2e2a0a	Update build files for java8 build (#9541 ) Summary: For RocksJava 7 we will move from requiring Java 7 to Java 8. * This simplifies the `Makefile` as we no longer need to deal with Java 7; so we no longer use `javah`. * Added a java-version target which is invoked by the java target, and which exits if the version of java being used is not 8 or greater. * Enforces java 8 as a minimum. * Fixed CMake build. * Fixed broken java event listener test, as the test was broken and the assertions in the callbacks were not causing assertions in the tests. The callbacks now queue up assertion errors for the main thread of the tests to check. * Fixed C++ dangling pointers in the test code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9541 Reviewed By: pdillinger Differential Revision: D34214929 Pulled By: jay-zhuang fbshipit-source-id: fdff348758d0a23a742e83c87d5f54073ce16ca6	2022-02-17 13:29:21 -08:00
Peter Dillinger	420d51b9a0	Update Java API for FilterPolicy changes (#9569 ) Summary: Obsolete block-based filter no longer in public API, from https://github.com/facebook/rocksdb/issues/9535 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9569 Test Plan: existing tests Reviewed By: jay-zhuang Differential Revision: D34243579 Pulled By: pdillinger fbshipit-source-id: ec5127d9bb9cc3f70501c531829a735bffdd1418	2022-02-15 12:18:52 -08:00
Levi Tamasi	ac251aa641	Add Java bindings for blob compaction readahead size (#9554 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9554 Test Plan: Added new unit tests. Reviewed By: mrambacher Differential Revision: D34197121 Pulled By: ltamasi fbshipit-source-id: 15056e26d632057a7c052a5024a560ba0eac554c	2022-02-14 09:15:42 -08:00
Akanksha Mahajan	9745c68eb1	Remove deprecated option new_table_reader_for_compaction_inputs (#9443 ) Summary: In RocksDB option new_table_reader_for_compaction_inputs has no effect on Compaction or on the behavior of RocksDB library. Therefore, we are removing it in the upcoming 7.0 release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9443 Test Plan: CircleCI Reviewed By: ajkr Differential Revision: D33788508 Pulled By: akankshamahajan15 fbshipit-source-id: 324ca6f12bfd019e9bd5e1b0cdac39be5c3cec7d	2022-02-08 19:31:28 -08:00
Radek Hubner	42c8afd85a	WriteOptions - add missing java API. (#9295 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9295 Reviewed By: riversand963 Differential Revision: D33672440 Pulled By: ajkr fbshipit-source-id: 85f73a9297888b00255b636e7826b37186aba45c	2022-02-04 16:08:06 -08:00
Si Ke	2c3a780901	Fixed all RocksJava test failures in Centos and Alpine (#9395 ) Summary: Fixed all RocksJava test failures in Centos and Alpine 32 bit and 64 bit OSes Pull Request resolved: https://github.com/facebook/rocksdb/pull/9395 Reviewed By: mrambacher Differential Revision: D33771987 Pulled By: ajkr fbshipit-source-id: fed91033b8df08f191ad65e1fb745a9264bbfa70	2022-02-04 16:03:56 -08:00
Jermy Li	83ff350ff2	jni: expose memtable_whole_key_filtering option (#9394 ) Summary: refer to: https://github.com/facebook/rocksdb/wiki/Prefix-Seek#configure-prefix-bloom-filter Pull Request resolved: https://github.com/facebook/rocksdb/pull/9394 Reviewed By: mrambacher Differential Revision: D33671533 Pulled By: ajkr fbshipit-source-id: d90db1712efdd5dd65020329867381d6b3cf2626	2022-02-04 16:01:16 -08:00
Yanqin Jin	d10c5c08d3	Remove iter_start_seqnum and preserve_deletes (#9430 ) Summary: According to https://github.com/facebook/rocksdb/blob/6.27.fb/db/db_impl/db_impl.cc#L2896:L2911 and https://github.com/facebook/rocksdb/blob/6.27.fb/db/db_impl/db_impl_open.cc#L203:L208, we are going to remove `iter_start_seqnum` and `preserve_deletes` starting from RocksDB 7.0 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9430 Test Plan: make check and CI Reviewed By: ajkr Differential Revision: D33753639 Pulled By: riversand963 fbshipit-source-id: c80aab8e8d8fc33e52472fed524ed703d0ffc8b6	2022-01-28 13:28:38 -08:00
Jay Zhuang	22321e1027	Remove unused API base_background_compactions (#9462 ) Summary: The API was deprecated a long time ago. Clean up the codebase by removing it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9462 Test Plan: CI, fake release: D33835220 Reviewed By: riversand963 Differential Revision: D33835103 Pulled By: jay-zhuang fbshipit-source-id: 6d2dc12c8e7fdbe2700865a3e61f0e3f78bd8184	2022-01-27 21:05:18 -08:00
Peter Dillinger	78aee6fedc	Remove obsolete backupable_db.h, utility_db.h (#9438 ) Summary: This also removes the obsolete names BackupableDBOptions and UtilityDB. API users must now use BackupEngineOptions and DBWithTTL::Open. In C API, `rocksdb_backupable_db_` is replaced with `rocksdb_backup_engine_`. Similar renaming in Java API. In reference to https://github.com/facebook/rocksdb/issues/9389 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9438 Test Plan: CI Reviewed By: mrambacher Differential Revision: D33780269 Pulled By: pdillinger fbshipit-source-id: 4a6cfc5c1b4c78bcad790b9d3dd13c5fdf4a1fac	2022-01-27 15:45:30 -08:00
Yanqin Jin	50135c1bf3	Move HDFS support to separate repo (#9170 ) Summary: This PR moves HDFS support from RocksDB repo to a separate repo. The new (temporary?) repo in this PR serves as an example before we finalize the decision on where and who to host hdfs support. At this point, people can start from the example repo and fork. Java/JNI is not included yet, and needs to be done later if necessary. The goal is to include this commit in RocksDB 7.0 release. Reference: https://github.com/ajkr/dedupfs by ajkr Pull Request resolved: https://github.com/facebook/rocksdb/pull/9170 Test Plan: Follow the instructions in https://github.com/riversand963/rocksdb-hdfs-env/blob/master/README.md. Build and run db_bench and db_stress. make check Reviewed By: ajkr Differential Revision: D33751662 Pulled By: riversand963 fbshipit-source-id: 22b4db7f31762ed417a20239f5a08dcd1696244f	2022-01-24 20:23:54 -08:00
Yanqin Jin	0376869f05	Remove using namespace (#9369 ) Summary: As title. This is part of an fb-internal task. First, remove all `using namespace` statements if applicable. Next, utilize multiple build platforms and see if anything is broken. Should anything become broken, fix the compilation errors with as little extra change as possible. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9369 Test Plan: internal build and make check make clean && make static_lib && cd examples && make all Reviewed By: pdillinger Differential Revision: D33517260 Pulled By: riversand963 fbshipit-source-id: 3fc4ce6402a073421dfd9a9b2d1c79441dca7a40	2022-01-12 09:31:12 -08:00
Alan Paxton	c1ec0b28eb	java / jni io_uring support (#9224 ) Summary: Existing multiGet() in java calls multi_get_helper() which then calls DB::std::vector MultiGet(). This doesn't take advantage of io_uring. This change adds another JNI level method that runs a parallel code path using the DB::void MultiGet(), using ByteBuffers at the JNI level. We call it multiGetDirect(). In addition to using the io_uring path, this code internally returns pinned slices which we can copy out of into our direct byte buffers; this should reduce the overall number of copies in the code path to/from Java. Some jmh benchmark runs (100k keys, 1000 key multiGet) suggest that for value sizes > 1k, we see about a 20% performance improvement, although performance is slightly reduced for small value sizes, there's a little bit more overhead in the JNI methods. Closes https://github.com/facebook/rocksdb/issues/8407 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9224 Reviewed By: mrambacher Differential Revision: D32951754 Pulled By: jay-zhuang fbshipit-source-id: 1f70df7334be2b6c42a9c8f92725f67c71631690	2021-12-15 18:09:25 -08:00
Radek Hubner	7ac3a5d406	ReadOptions - Add missing java API. (#9248 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9248 Reviewed By: mrambacher Differential Revision: D33011237 Pulled By: jay-zhuang fbshipit-source-id: b6544ad40cb722e327bac60a0af711db253e36d7	2021-12-15 17:46:05 -08:00
Yanqin Jin	bd513fd075	Add commit marker with timestamp (#9266 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9266 This diff adds a new tag `CommitWithTimestamp`. Currently, there is no API to trigger writing this tag to WAL, thus it is unavailable to users. This is an ongoing effort to add user-defined timestamp support to write-committed transactions. This diff also indicates all column families that may potentially participate in the same transaction must either disable timestamp or have the same timestamp format, since `CommitWithTimestamp` tag is followed by a single byte-array denoting the commit timestamp of the transaction. We will enforce this checking in a future diff. We keep this diff small. Reviewed By: ltamasi Differential Revision: D31721350 fbshipit-source-id: e1450811443647feb6ca01adec4c8aaae270ffc6	2021-12-10 11:05:35 -08:00
Jermy Li	c39a808cb6	Deprecate WriteBatch.remove() and use the new style delete() (#9256 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9256 Reviewed By: mrambacher Differential Revision: D32971447 Pulled By: jay-zhuang fbshipit-source-id: 6954d7287229a8c776092bd82af3a8a8cd92b35e	2021-12-10 09:18:17 -08:00

485 Commits