Commit Graph

909 Commits

Gang Liao e6432dfd4c Make it possible to enable blob files starting from a certain LSM tree level (#10077)
Summary:
Currently, if blob files are enabled (i.e. `enable_blob_files` is true), large values are extracted both during flush/recovery (when SST files are written into level 0 of the LSM tree) and during compaction into any LSM tree level. For certain use cases that have a mix of short-lived and long-lived values, it might make sense to support extracting large values only during compactions whose output level is greater than or equal to a specified LSM tree level (e.g. compactions into L1/L2/... or above). This could reduce the space amplification caused by large values that are turned into garbage shortly after being written at the price of some write amplification incurred by long-lived values whose extraction to blob files is delayed.
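
For illustration, a minimal sketch of the resulting configuration, using the option names described above (`min_blob_size` is shown only as a typical companion setting): blob extraction happens only for compactions into L1 or deeper.
```cpp
#include "rocksdb/options.h"

rocksdb::Options MakeBlobOptions() {
  rocksdb::Options options;
  options.enable_blob_files = true;      // enable key-value separation
  options.min_blob_size = 4096;          // values >= 4 KB are eligible
  options.blob_file_starting_level = 1;  // flushes (L0) keep values inline;
                                         // compactions into L1+ extract blobs
  return options;
}
```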

In order to achieve this, we would like to do the following:
- Add a new configuration option `blob_file_starting_level` (default: 0) to `AdvancedColumnFamilyOptions` (and `MutableCFOptions` and extend the related logic)
- Instantiate `BlobFileBuilder` in `BuildTable` (used during flush and recovery, where the LSM tree level is L0) and `CompactionJob` iff `enable_blob_files` is set and the LSM tree level is `>= blob_file_starting_level`
- Add unit tests for the new functionality, and add the new option to our stress tests (`db_stress` and `db_crashtest.py` )
- Add the new option to our benchmarking tool `db_bench` and the BlobDB benchmark script `run_blob_bench.sh`
- Add the new option to the `ldb` tool (see https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool)
- Ideally extend the C and Java bindings with the new option
- Update the BlobDB wiki to document the new option.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10077

Reviewed By: ltamasi

Differential Revision: D36884156

Pulled By: gangliao

fbshipit-source-id: 942bab025f04633edca8564ed64791cb5e31627d
2022-06-02 20:04:33 -07:00
Akanksha Mahajan d04df2752a Persist the new MANIFEST after successfully syncing the new WAL during recovery (#9922)
Summary:
In the case of a non-TransactionDB with avoid_flush_during_recovery = true, RocksDB won't
flush the data from the WAL to L0 for all column families if possible. As a
result, not all column families can advance their log_numbers, and
min_log_number_to_keep won't change.
For a transaction DB (allow_2pc), even with the flush, there may be old WAL files that it must not delete because they can contain data of uncommitted transactions, and min_log_number_to_keep won't change.
If we persist a new MANIFEST with
advanced log_numbers for some column families, then during a second
crash after persisting the MANIFEST, RocksDB will see some column
families' log_numbers larger than the corrupted wal, and the "column family inconsistency" error will be hit, causing recovery to fail.

As a solution, RocksDB will persist the new MANIFEST after successfully syncing the new WAL.
If a future recovery starts from the new MANIFEST, then it means the new WAL was successfully synced. Due to the sentinel empty write batch at the beginning, kPointInTimeRecovery of the WAL is guaranteed to proceed past this point.
If a future recovery starts from the old MANIFEST, it means writing the new MANIFEST failed. We won't have the "SST ahead of WAL" error.
Currently, RocksDB DB::Open() may create and write to two new MANIFEST files even before recovery succeeds. This PR buffers the edits in a structure and writes to a new MANIFEST only after recovery is successful.
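
As context, a minimal sketch of the recovery configuration under which this fix matters (the DB path is hypothetical; `avoid_flush_during_recovery` is the DBOptions field referenced above):
```cpp
#include <cassert>
#include "rocksdb/db.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.avoid_flush_during_recovery = true;  // keep WAL data unflushed on open
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/recovery_example", &db);
  assert(s.ok());
  delete db;
  return 0;
}
```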

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9922

Test Plan:
1. Update unit tests to fail without this change
2. make crash_test -j

Branch with unit test and no fix: https://github.com/facebook/rocksdb/pull/9942 (kept to track the unit test without the fix)

Reviewed By: riversand963

Differential Revision: D36043701

Pulled By: akankshamahajan15

fbshipit-source-id: 5760970db0a0920fb73d3c054a4155733500acd9
2022-06-01 10:52:26 -07:00
Yanqin Jin 514f0b0937 Fail DB::Open() if logger cannot be created (#9984)
Summary:
For both regular and secondary db instances, we now return an error and refuse to open the DB if Logger creation fails.

Our current code allows it, but it is really difficult to debug because
there will be no LOG files. The same applies to the OPTIONS file, which will be explored in another PR.

Furthermore, Arena::AllocateAligned(size_t bytes, size_t huge_page_size, Logger* logger) contains the
following assertion:

```cpp
#ifdef MAP_HUGETLB
if (huge_page_size > 0 && bytes > 0) {
  assert(logger != nullptr);
}
#endif
```

It can be removed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9984

Test Plan: make check

Reviewed By: jay-zhuang

Differential Revision: D36347754

Pulled By: riversand963

fbshipit-source-id: 529798c0511d2eaa2f0fd40cf7e61c4cbc6bc57e
2022-05-27 07:23:31 -07:00
Peter Dillinger 1e4850f626 Java build: finish compiling before testing (etc) (#10034)
Summary:
Lack of ordering dependencies could lead to random
build-linux-java failures with "Truncated class file" because tests
started before compilation was finished. (Fix to java/Makefile)

Also:
* export SHA256_CMD to save copy-paste
* Actually fail if Java sample build fails--which it was in CircleCI
* Don't require Snappy for Java sample build (for more compatibility)
* Remove check_all_python from jtest because it's running in `make
check` builds in CircleCI

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10034

Test Plan: CI, some manual

Reviewed By: jay-zhuang

Differential Revision: D36596541

Pulled By: pdillinger

fbshipit-source-id: 230d79db4b7ae93a366871ff09d0a88e8e1c8af3
2022-05-23 09:56:40 -07:00
Changyu Bi cc23b46da1 Support using ZDICT_finalizeDictionary to generate zstd dictionary (#9857)
Summary:
An untrained dictionary is currently simply the concatenation of several samples. The ZSTD API ZDICT_finalizeDictionary() can improve such a dictionary's effectiveness at low cost. This PR changes how the dictionary is created: when max_dict_buffer_bytes > 0, it calls the ZSTD ZDICT_finalizeDictionary() API instead of creating a raw content dictionary, passing in all buffered uncompressed data blocks as samples.
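
For reference, a minimal sketch of the compression options exercised in the test plan below, assuming the `use_zstd_dict_trainer` field this PR adds to `CompressionOptions`:
```cpp
#include "rocksdb/options.h"

rocksdb::Options MakeZstdDictOptions() {
  rocksdb::Options options;
  options.compression = rocksdb::kZSTD;
  options.compression_opts.max_dict_bytes = 16384;          // dictionary size
  options.compression_opts.zstd_max_train_bytes = 1048576;  // sample budget
  // false selects ZDICT_finalizeDictionary() over the full trainer
  options.compression_opts.use_zstd_dict_trainer = false;
  return options;
}
```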

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9857

Test Plan:
#### db_bench test for cpu/memory of compression+decompression and space saving on synthetic data:
Set up: change the parameter [here](fb9a167a55/tools/db_bench_tool.cc (L1766)) to 16384 to make synthetic data more compressible.
```
# linked local ZSTD with version 1.5.2
# DEBUG_LEVEL=0 ROCKSDB_NO_FBCODE=1 ROCKSDB_DISABLE_ZSTD=1  EXTRA_CXXFLAGS="-DZSTD_STATIC_LINKING_ONLY -DZSTD -I/data/users/changyubi/install/include/" EXTRA_LDFLAGS="-L/data/users/changyubi/install/lib/ -l:libzstd.a" make -j32 db_bench

dict_bytes=16384
train_bytes=1048576
echo "========== No Dictionary =========="
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=0 -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=0 -block_size=4096 2>&1 | grep elapsed
du -hc /dev/shm/dbbench/*sst | grep total

echo "========== Raw Content Dictionary =========="
TEST_TMPDIR=/dev/shm ./db_bench_main -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench_main -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -block_size=4096 2>&1 | grep elapsed
du -hc /dev/shm/dbbench/*sst | grep total

echo "========== FinalizeDictionary =========="
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false -block_size=4096 2>&1 | grep elapsed
du -hc /dev/shm/dbbench/*sst | grep total

echo "========== TrainDictionary =========="
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -block_size=4096 2>&1 | grep elapsed
du -hc /dev/shm/dbbench/*sst | grep total

# Result: TrainDictionary is much better on space saving, but FinalizeDictionary seems to use less memory.
# before compression data size: 1.2GB
dict_bytes=16384
max_dict_buffer_bytes=1048576
                    space   cpu/memory
No Dictionary       468M    14.93user 1.00system 0:15.92elapsed 100%CPU (0avgtext+0avgdata 23904maxresident)k
Raw Dictionary      251M    15.81user 0.80system 0:16.56elapsed 100%CPU (0avgtext+0avgdata 156808maxresident)k
FinalizeDictionary  236M    11.93user 0.64system 0:12.56elapsed 100%CPU (0avgtext+0avgdata 89548maxresident)k
TrainDictionary     84M     7.29user 0.45system 0:07.75elapsed 100%CPU (0avgtext+0avgdata 97288maxresident)k
```

#### Benchmark on 10 sample SST files for space saving and CPU time on compression:
FinalizeDictionary is comparable to TrainDictionary in terms of space saving, and takes less time in compression.
```
dict_bytes=16384
train_bytes=1048576

for sst_file in `ls ../temp/myrock-sst/`
do
  echo "********** $sst_file **********"
  echo "========== No Dictionary =========="
  ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD

  echo "========== Raw Content Dictionary =========="
  ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes

  echo "========== FinalizeDictionary =========="
  ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes --compression_zstd_max_train_bytes=$train_bytes --compression_use_zstd_finalize_dict

  echo "========== TrainDictionary =========="
  ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes --compression_zstd_max_train_bytes=$train_bytes
done

                         010240.sst (Size/Time) 011029.sst              013184.sst              021552.sst              185054.sst              185137.sst              191666.sst              7560381.sst             7604174.sst             7635312.sst
No Dictionary           28165569 / 2614419      32899411 / 2976832      32977848 / 3055542      31966329 / 2004590      33614351 / 1755877      33429029 / 1717042      33611933 / 1776936      33634045 / 2771417      33789721 / 2205414      33592194 / 388254
Raw Content Dictionary  28019950 / 2697961      33748665 / 3572422      33896373 / 3534701      26418431 / 2259658      28560825 / 1839168      28455030 / 1846039      28494319 / 1861349      32391599 / 3095649      33772142 / 2407843      33592230 / 474523
FinalizeDictionary      27896012 / 2650029      33763886 / 3719427      33904283 / 3552793      26008225 / 2198033      28111872 / 1869530      28014374 / 1789771      28047706 / 1848300      32296254 / 3204027      33698698 / 2381468      33592344 / 517433
TrainDictionary         28046089 / 2740037      33706480 / 3679019      33885741 / 3629351      25087123 / 2204558      27194353 / 1970207      27234229 / 1896811      27166710 / 1903119      32011041 / 3322315      32730692 / 2406146      33608631 / 570593
```

#### Decompression/Read test:
With FinalizeDictionary/TrainDictionary, some data structures used for decompression are stored in the dictionary, so decompression/reads are expected to be faster.
```
dict_bytes=16384
train_bytes=1048576
echo "No Dictionary"
TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=0 > /dev/null 2>&1
TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=0 2>&1 | grep MB/s

echo "Raw Dictionary"
TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes > /dev/null 2>&1
TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd  -compression_max_dict_bytes=$dict_bytes 2>&1 | grep MB/s

echo "FinalizeDict"
TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false  > /dev/null 2>&1
TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false 2>&1 | grep MB/s

echo "Train Dictionary"
TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes > /dev/null 2>&1
TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes 2>&1 | grep MB/s

No Dictionary
readrandom   :      12.183 micros/op 82082 ops/sec 12.183 seconds 1000000 operations;    9.1 MB/s (1000000 of 1000000 found)
Raw Dictionary
readrandom   :      12.314 micros/op 81205 ops/sec 12.314 seconds 1000000 operations;    9.0 MB/s (1000000 of 1000000 found)
FinalizeDict
readrandom   :       9.787 micros/op 102180 ops/sec 9.787 seconds 1000000 operations;   11.3 MB/s (1000000 of 1000000 found)
Train Dictionary
readrandom   :       9.698 micros/op 103108 ops/sec 9.699 seconds 1000000 operations;   11.4 MB/s (1000000 of 1000000 found)
```

Reviewed By: ajkr

Differential Revision: D35720026

Pulled By: cbi42

fbshipit-source-id: 24d230fdff0fd28a1bb650658798f00dfcfb2a1f
2022-05-20 12:09:09 -07:00
anand76 57997ddaaf Multi file concurrency in MultiGet using coroutines and async IO (#9968)
Summary:
This PR implements a coroutine version of batched MultiGet in order to concurrently read from multiple SST files in a level using async IO, thus reducing the latency of the MultiGet. The API from the user perspective is still synchronous and single threaded, with the RocksDB part of the processing happening in the context of the caller's thread. In Version::MultiGet, the decision is made whether to call synchronous or coroutine code.
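
For orientation, a minimal sketch of the batched MultiGet API in question (assuming an already-open DB); setting `async_io` in ReadOptions opts into the async path when the library is built with coroutine support:
```cpp
#include <vector>
#include "rocksdb/db.h"

void BatchedLookup(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf) {
  rocksdb::ReadOptions read_opts;
  read_opts.async_io = true;  // allow concurrent reads across SST files
  std::vector<rocksdb::Slice> keys = {"k1", "k2", "k3"};
  std::vector<rocksdb::PinnableSlice> values(keys.size());
  std::vector<rocksdb::Status> statuses(keys.size());
  db->MultiGet(read_opts, cf, keys.size(), keys.data(), values.data(),
               statuses.data());
}
```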

A good way to review this PR is to review the first 4 commits in order - de773b3, 70c2f70, 10b50e1, and 377a597 - before reviewing the rest.

TODO:
1. Figure out how to build it in CircleCI (requires some dependencies to be installed)
2. Do some stress testing with coroutines enabled

No regression in synchronous MultiGet between this branch and main -
```
./db_bench -use_existing_db=true --db=/data/mysql/rocksdb/prefix_scan -benchmarks="readseq,multireadrandom" -key_size=32 -value_size=512 -num=5000000 -batch_size=64 -multiread_batched=true -use_direct_reads=false -duration=60 -ops_between_duration_checks=1 -readonly=true -adaptive_readahead=true -threads=16 -cache_size=10485760000 -async_io=false -multiread_stride=40000 -statistics
```
Branch - ```multireadrandom :       4.025 micros/op 3975111 ops/sec 60.001 seconds 238509056 operations; 2062.3 MB/s (14767808 of 14767808 found)```

Main - ```multireadrandom :       3.987 micros/op 4013216 ops/sec 60.001 seconds 240795392 operations; 2082.1 MB/s (15231040 of 15231040 found)```

More benchmarks in various scenarios are given below. The measurements were taken with ```async_io=false``` (no coroutines) and ```async_io=true``` (use coroutines). For an IO bound workload (with every key requiring an IO), the coroutines version shows a clear benefit, being ~2.6X faster. For CPU bound workloads, the coroutines version has ~6-15% higher CPU utilization, depending on how many keys overlap an SST file.

1. Single thread IO bound workload on remote storage with sparse MultiGet batch keys (~1 key overlap/file) -
No coroutines - ```multireadrandom :     831.774 micros/op 1202 ops/sec 60.001 seconds 72136 operations;    0.6 MB/s (72136 of 72136 found)```
Using coroutines - ```multireadrandom :     318.742 micros/op 3137 ops/sec 60.003 seconds 188248 operations;    1.6 MB/s (188248 of 188248 found)```

2. Single thread CPU bound workload (all data cached) with ~1 key overlap/file -
No coroutines - ```multireadrandom :       4.127 micros/op 242322 ops/sec 60.000 seconds 14539384 operations;  125.7 MB/s (14539384 of 14539384 found)```
Using coroutines - ```multireadrandom :       4.741 micros/op 210935 ops/sec 60.000 seconds 12656176 operations;  109.4 MB/s (12656176 of 12656176 found)```

3. Single thread CPU bound workload with ~2 key overlap/file -
No coroutines - ```multireadrandom :       3.717 micros/op 269000 ops/sec 60.000 seconds 16140024 operations;  139.6 MB/s (16140024 of 16140024 found)```
Using coroutines - ```multireadrandom :       4.146 micros/op 241204 ops/sec 60.000 seconds 14472296 operations;  125.1 MB/s (14472296 of 14472296 found)```

4. CPU bound multi-threaded (16 threads) with ~4 key overlap/file -
No coroutines - ```multireadrandom :       4.534 micros/op 3528792 ops/sec 60.000 seconds 211728728 operations; 1830.7 MB/s (12737024 of 12737024 found) ```
Using coroutines - ```multireadrandom :       4.872 micros/op 3283812 ops/sec 60.000 seconds 197030096 operations; 1703.6 MB/s (12548032 of 12548032 found) ```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9968

Reviewed By: akankshamahajan15

Differential Revision: D36348563

Pulled By: anand1976

fbshipit-source-id: c0ce85a505fd26ebfbb09786cbd7f25202038696
2022-05-19 15:36:27 -07:00
Davide Angelocola 4527bb2fed Fix conversion issues in MutableOptions (#9194)
Summary:
Removing unnecessary checks around conversion from int/long to double as it does not lose information (see https://docs.oracle.com/javase/specs/jls/se9/html/jls-5.html#jls-5.1.2).

For example, `value > Double.MAX_VALUE` is always false when value is long or int.

Can you please have a look adamretter? Also fixed some other minor issues (do you prefer a separate PR?)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9194

Reviewed By: ajkr

Differential Revision: D36221694

fbshipit-source-id: bf327c07386560b87ddc0c98039e8d6e8f2f1e82
2022-05-09 12:34:26 -07:00
♚ PH⑦ de Soria™♛ 9381436bf3 Fixed typo (#9331)
Summary:
Just fixing a very minor typo in the doc block :) Hope it will help anyway 😊

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9331

Reviewed By: riversand963

Differential Revision: D34339823

fbshipit-source-id: b76104bc3efbc9d1f38cbf5c6dd7648dc909ced3
2022-05-06 17:41:07 -07:00
Roman Puchkovskiy 00889cf8f2 Never use String#getBytes() in the production code (#9487)
Summary:
There are encodings that are not ASCII-compatible (like cp1140), so it is possible that a JVM is run with a default encoding in which String#getBytes() would return unexpected values even for ASCII strings.

A little bit of context: https://stackoverflow.com/questions/70913929/can-an-encoding-incompatible-with-ascii-encoding-be-set-as-a-default-encoding-in/70914154

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9487

Reviewed By: riversand963

Differential Revision: D34097728

fbshipit-source-id: afd654ecaf20f6d30d9fc20c6a090398de2585eb
2022-05-06 16:22:15 -07:00
sdong 736a7b5433 Remove own ToString() (#9955)
Summary:
ToString() was created because some platforms didn't support std::to_string(). However, we've already been using std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit simply removes ToString().

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955

Test Plan: Watch CI tests

Reviewed By: riversand963

Differential Revision: D36176799

fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471
2022-05-06 13:03:58 -07:00
Andrew Kryczka c5d367f472 Revert open logic changes in #9634 (#9906)
Summary:
Left the HISTORY.md entries and unit tests in place.
Added a new unit test to repro the corruption scenario that this PR fixes, and a HISTORY.md line for that.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9906

Reviewed By: riversand963

Differential Revision: D35940093

Pulled By: ajkr

fbshipit-source-id: 9816f99e1ce405ba36f316beb4f6378c37c8c86b
2022-04-26 14:46:53 -07:00
Akanksha Mahajan 3653029dda Add stats related to async prefetching (#9845)
Summary:
Add stats PREFETCHED_BYTES_DISCARDED and POLL_WAIT_MICROS.
PREFETCHED_BYTES_DISCARDED records the number of prefetched bytes discarded by
FilePrefetchBuffer. POLL_WAIT_MICROS records the time taken by the underlying
file system's Poll API.
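
A minimal sketch of reading the new ticker, assuming the DB was opened with `options.statistics` set (e.g. via `CreateDBStatistics()`):
```cpp
#include <cstdint>
#include "rocksdb/statistics.h"

uint64_t PrefetchedBytesDiscarded(rocksdb::Statistics* stats) {
  return stats->getTickerCount(rocksdb::PREFETCHED_BYTES_DISCARDED);
}
```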

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9845

Test Plan: Update existing tests

Reviewed By: anand1976

Differential Revision: D35909694

Pulled By: akankshamahajan15

fbshipit-source-id: e009ef940bb9ed72c9446f5529095caabb8a1e36
2022-04-25 21:58:22 -07:00
Akanksha Mahajan ae82d91492 Remove corrupted WAL files in kPointInTimeRecovery mode with avoid_flush_during_recovery set true (#9634)
Summary:
1) In the case of a non-TransactionDB with avoid_flush_during_recovery = true, RocksDB won't
flush the data from the WAL to L0 for all column families if possible. As a
result, not all column families can advance their log_numbers, and
min_log_number_to_keep won't change.
2) For a transaction DB (allow_2pc), even with the flush, there may be old WAL files that it must not delete because they can contain data of uncommitted transactions, and min_log_number_to_keep won't change.

If we persist a new MANIFEST with
advanced log_numbers for some column families, then during a second
crash after persisting the MANIFEST, RocksDB will see some column
families' log_numbers larger than the corrupted wal, and the "column family inconsistency" error will be hit, causing recovery to fail.

As a solution,
1. the WALs whose numbers are larger than the
corrupted WAL and smaller than the new WAL will be moved to the archive folder.
2. Currently, RocksDB DB::Open() may create and write to two new MANIFEST files even before recovery succeeds. This PR buffers the edits in a structure and writes to a new MANIFEST only after recovery is successful.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9634

Test Plan:
1. Added new unit tests
                2. make crash_test -j

Reviewed By: riversand963

Differential Revision: D34463666

Pulled By: akankshamahajan15

fbshipit-source-id: e233d3af0ed4e2028ca0cf051e5a334a0fdc9d19
2022-04-11 15:39:31 -07:00
Akanksha Mahajan 0b8f885939 Update stats for Read and ReadAsync in random_access_file_reader for async prefetching (#9810)
Summary:
Update stats in random_access_file_reader for the Read and
ReadAsync APIs to take into account the read latency for async
prefetching.

It also fixes the ERROR_HANDLER_AUTORESUME_RETRY_COUNT stat, whose value was
incorrect in portal.h.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9810

Test Plan: Update unit test

Reviewed By: anand1976

Differential Revision: D35433081

Pulled By: akankshamahajan15

fbshipit-source-id: aeec3901270e58a003ce6b5214bd25ddcb3a12a9
2022-04-06 14:26:53 -07:00
Peter Dillinger 6534c6dea4 Fix remaining uses of "backupable" (#9792)
Summary:
Various renaming and fixes to get rid of remaining uses of
"backupable" which is terminology leftover from the original, flawed
design of BackupableDB. Now any DB can be backed up, using BackupEngine.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9792

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D35334386

Pulled By: pdillinger

fbshipit-source-id: 2108a42b4575c8cccdfd791c549aae93ec2f3329
2022-04-05 09:52:33 -07:00
Alan Paxton b6ad0d958f Fb 9718 verify checksums is ignored (#9767)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/9718

The verify_checksums flag of read_options should be passed to the read options used by the BlockFetcher in a couple of cases where it is not at present. It will now happen (but did not, previously) on iteration and on [multi]get, where a fetcher is created as part of the iterate/get call.

This may result in much better performance in the few workloads where the client chooses to disable verification.
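
For illustration, a minimal sketch of the flag in question; with this fix it now also takes effect for block fetches made during iteration and [multi]get:
```cpp
#include "rocksdb/options.h"

rocksdb::ReadOptions MakeUnverifiedReadOptions() {
  rocksdb::ReadOptions read_opts;
  read_opts.verify_checksums = false;  // skip block checksum verification
  return read_opts;
}
```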

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9767

Reviewed By: mrambacher

Differential Revision: D35218986

Pulled By: jay-zhuang

fbshipit-source-id: 329d29764bb70fbc7f2673440bc46c107a813bc8
2022-03-29 11:54:54 -07:00
Jermy Li b83263bbe4 jni: uniformly use GetByteArrayRegion() to copy bytes (#9380)
Summary:
Uniformly use GetByteArrayRegion() instead of GetByteArrayElements()
to copy bytes.
In addition, it avoids an inefficient ReleaseByteArrayElements()
operation.
Some benefits of GetByteArrayRegion() are described at:
https://stackoverflow.com/a/2480493
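
A minimal sketch (hypothetical helper) of the JNI pattern this commit standardizes on:
```cpp
#include <string>
#include <jni.h>

// Copy a Java byte[] into a C++ buffer without pinning the array,
// so no matching ReleaseByteArrayElements() call is required.
std::string CopyJavaBytes(JNIEnv* env, jbyteArray jbytes) {
  const jsize len = env->GetArrayLength(jbytes);
  std::string buf(static_cast<size_t>(len), '\0');
  env->GetByteArrayRegion(jbytes, 0, len, reinterpret_cast<jbyte*>(&buf[0]));
  return buf;
}
```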

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9380

Reviewed By: ajkr

Differential Revision: D35135474

Pulled By: jay-zhuang

fbshipit-source-id: a32c1774d37f2d22b9bcd105d83e0bb984b71b54
2022-03-25 10:24:58 -07:00
Alan Paxton dec144f172 Extend Java RocksDB iterators to support indirect Byte Buffers (#9222)
Summary:
Extend Java RocksDB iterators to support indirect byte buffers, to add to the existing support for direct byte buffers.
Code to distinguish direct/indirect buffers is switched in Java, and a second, separate JNI call is implemented to support indirect
buffers. Indirect support passes the contained buffers using byte[].

There are some Java subclasses of iterator (WBWIIterator, SstFileReaderIterator) which also now have parallel JNI support functions implemented, along with direct/indirect switches in Java methods.

Closes https://github.com/facebook/rocksdb/issues/6282

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9222

Reviewed By: ajkr

Differential Revision: D35115283

Pulled By: jay-zhuang

fbshipit-source-id: f8d5d20b975aef700560fbcc99f707bb028dc42e
2022-03-24 12:50:38 -07:00
Alan Paxton 8ae0c33a7a Add new checksum type kXXH3 to Java API (#9749)
Summary:
Fix https://github.com/facebook/rocksdb/issues/9720

And make a couple of incidental tests test the thing they were meant to test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9749

Reviewed By: ajkr

Differential Revision: D35115298

Pulled By: jay-zhuang

fbshipit-source-id: d687d1f070d29216be9693601c71131bbea87c79
2022-03-24 12:33:12 -07:00
Tomas Kolda 9e05c5e251 NPE in Java_org_rocksdb_ColumnFamilyOptions_setSstPartitionerFactory (#9622)
Summary:
There was a mistake that incorrectly cast SstPartitionerFactory (it missed the shared pointer). It worked for the database (correct cast), but not for the column family. Trying to set it on a column family caused an access violation.

I have also added a test and improved it. The older version was passing even without the SST partitioner, which is weird, because on Level 1 we had two SST files with the same key "aaaa1". I was not sure if that was a new feature, so I changed the test to use overlapping keys: "aaaa0" - "aaaa2" overlaps "aaaa1".

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9622

Reviewed By: ajkr

Differential Revision: D34871968

Pulled By: pdillinger

fbshipit-source-id: a08009766da49fc198692a610e8beb19caf737e6
2022-03-14 14:12:30 -07:00
Adam Retter dab19afe56 Fix RocksJava releases for macOS (#9662)
Summary:
Addresses the problems described in https://github.com/facebook/rocksdb/pull/9254#issuecomment-1054598516 and https://github.com/facebook/rocksdb/pull/9254#issuecomment-1059574837 that have blocked a RocksJava release

**NOTE** Also needs to be ported to 6.29.fb branch.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9662

Reviewed By: ajkr

Differential Revision: D34689200

Pulled By: pdillinger

fbshipit-source-id: c62fe34c54f05be5a00ee1daec8ec7454baa5eb8
2022-03-07 10:50:52 -08:00
Si Ke 06c8afeff5 Fix pointer to jlong conversion in 32 bit OS (#9396)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9396

Reviewed By: jay-zhuang

Differential Revision: D34529654

Pulled By: pdillinger

fbshipit-source-id: cf62152ba86b02f9ffa7780f370ad49089e56a0b
2022-03-01 09:02:15 -08:00
stefan-zobel ddb7620a61 Fix trivial Javadoc omissions (#9534)
Summary:
- fix spelling of `valueSizeSofLimit` and add "param" description in ReadOptions
- add 3 missing "return" in RocksDB

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9534

Reviewed By: riversand963

Differential Revision: D34131186

Pulled By: mrambacher

fbshipit-source-id: 7eb7ec177906052837180b291d67fb1c838c49e1
2022-02-28 11:51:17 -08:00
Akanksha Mahajan 3699b171e4 Change enum SizeApproximationFlags to enum class (#9604)
Summary:
Change enum SizeApproximationFlags to enum class and add
overloaded operators for the transition between the enum class and uint8_t.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9604

Test Plan: Circle CI jobs

Reviewed By: riversand963

Differential Revision: D34360281

Pulled By: akankshamahajan15

fbshipit-source-id: 6351dfdb717ae3c4530d324c3d37a8ecb01dd1ef
2022-02-18 20:22:57 -08:00
Jay Zhuang f4b2500e12 Add last level and non-last level read statistics (#9519)
Summary:
Add last level and non-last level read statistics:
```
LAST_LEVEL_READ_BYTES,
LAST_LEVEL_READ_COUNT,
NON_LAST_LEVEL_READ_BYTES,
NON_LAST_LEVEL_READ_COUNT,
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9519

Test Plan: added unittest

Reviewed By: siying

Differential Revision: D34062539

Pulled By: jay-zhuang

fbshipit-source-id: 908644c3050878b4234febdc72e3e19d89af38cd
2022-02-18 14:23:07 -08:00
Alan Paxton ce84e50288 Plugin java jni support (#9575)
Summary:
Extend the plugin architecture to allow for the inclusion, building and testing of Java and JNI components of a plugin. This will cause the JAR built by `$ make rocksdbjava` to include the extra functionality provided by the plugin, and will cause `$ make jtest` to add the java tests provided by the plugin to the tests built and run by Java testing.

The plugin's `<plugin>.mk` file can define:
```
<plugin>_JNI_NATIVE_SOURCES
<plugin>_NATIVE_JAVA_CLASSES
<plugin>_JAVA_TESTS
```
The plugin should provide java/src, java/test and java/rocksjni directories. When a plugin is required to be built, it must be named in the ROCKSDB_PLUGINS environment variable (as per the plugin architecture). This now has the effect of adding the files specified by the above definitions to the appropriate parts of the build.

An example of a plugin with a Java component can be found as part of the hdfs plugin in https://github.com/riversand963/rocksdb-hdfs-env - at the time of writing the Java part of this fails tests, and needs a little work to complete, but it builds correctly under the plugin model.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9575

Reviewed By: hx235

Differential Revision: D34253948

Pulled By: riversand963

fbshipit-source-id: b3dde5da06f3d3c25c54246892097ae2a369b42d
2022-02-17 19:39:23 -08:00
Alan Paxton 8d9c203f69 Remove previously deprecated Java where RocksDB also removed it, or where no direct equivalent existed. (#9576)
Summary:
For RocksDB v7 major release. Remove previously deprecated Java API methods and associated tests
- where equivalent/alternative functionality exists and is already tested AND
- where the core RocksDB function/feature has also been removed
- OR the functionality exists only in Java so the previous deprecation only affected Java methods

RETAIN deprecated Java APIs which reflect functionality that is deprecated by, but still supported by, the core of RocksDB.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9576

Reviewed By: ajkr

Differential Revision: D34314983

Pulled By: jay-zhuang

fbshipit-source-id: 7cf9c17e3e07be9d289beb99f81b71e8e09ac403
2022-02-17 17:29:35 -08:00
Alan Paxton 36ce2e2a0a Update build files for java8 build (#9541)
Summary:
For RocksJava 7 we will move from requiring Java 7 to Java 8.

* This simplifies the `Makefile` as we no longer need to deal with Java 7, so we no longer use `javah`.
* Added a java-version target which is invoked by the java target, and which exits if the version of java being used is not 8 or greater.
* Enforces java 8 as a minimum.
* Fixed CMake build.

* Fixed the broken Java event listener test: the assertions in the callbacks were not causing test failures. The callbacks now queue up assertion errors for the main thread of the tests to check.
* Fixed C++ dangling pointers in the test code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9541

Reviewed By: pdillinger

Differential Revision: D34214929

Pulled By: jay-zhuang

fbshipit-source-id: fdff348758d0a23a742e83c87d5f54073ce16ca6
2022-02-17 13:29:21 -08:00
Adam Retter 5e64407923 Support C++17 Docker build environments for RocksJava (#9500)
Summary:
See https://github.com/facebook/rocksdb/issues/9388#issuecomment-1029583789

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9500

Reviewed By: pdillinger

Differential Revision: D34114687

Pulled By: jay-zhuang

fbshipit-source-id: 22129d99ccd0dba7e8f1b263ddc5520d939641bf
2022-02-17 12:48:38 -08:00
Peter Dillinger 420d51b9a0 Update Java API for FilterPolicy changes (#9569)
Summary:
The obsolete block-based filter is no longer in the public API; see https://github.com/facebook/rocksdb/issues/9535.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9569

Test Plan: existing tests

Reviewed By: jay-zhuang

Differential Revision: D34243579

Pulled By: pdillinger

fbshipit-source-id: ec5127d9bb9cc3f70501c531829a735bffdd1418
2022-02-15 12:18:52 -08:00
Levi Tamasi ac251aa641 Add Java bindings for blob compaction readahead size (#9554)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9554

Test Plan: Added new unit tests.

Reviewed By: mrambacher

Differential Revision: D34197121

Pulled By: ltamasi

fbshipit-source-id: 15056e26d632057a7c052a5024a560ba0eac554c
2022-02-14 09:15:42 -08:00
Alan Paxton eed71dfa82 Transaction multiGet convert to list-based (#9522)
Summary:
Transaction multiGet convert to list-based.

RocksDB Java (non-transactional) has multiGetAsList() methods to expose multiGet(). These return a list of results. These methods replaced multiGet() methods returning an array of results, which were deprecated in Rocks 6 and are being removed in Rocks 7.

The transactional API still presents multiGet() methods returning arrays, so in Rocks 7 we replace these with multiGetAsList() methods and deprecate the multiGet() methods.

This does not require any changes to the supporting JNI/C++ code, only to the wrappers which present the Java API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9522

Reviewed By: mrambacher

Differential Revision: D34114373

Pulled By: jay-zhuang

fbshipit-source-id: cb22d6095934d951b6aee4aed3e07923d3c18007
2022-02-14 08:33:02 -08:00
Alan Paxton 99d86252b6 remove deprecated dispose() for Rocks JNI interface Java objects. (#9523)
Summary:
For RocksDB 7: remove the deprecated dispose() and, as a consequence, remove finalize(), which is good modern Java hygiene.

It is extremely non-deterministic when `finalize()` is called on an object, and resource closure/recovery of underlying native/C++ objects and/or non-memory resource cannot be adequately controlled through GC finalization. The RocksDB Java/JNI interface provides and encourages the use of AutoCloseable objects with close() methods, allowing predictable disposal of resources at exit from try-with-resource blocks.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9523

Reviewed By: mrambacher

Differential Revision: D34079843

Pulled By: jay-zhuang

fbshipit-source-id: d1f0463a89a548b5d57bfaa50154379e722d189a
2022-02-09 11:32:53 -08:00
Akanksha Mahajan 9745c68eb1 Remove deprecated option new_table_reader_for_compaction_inputs (#9443)
Summary:
The RocksDB option new_table_reader_for_compaction_inputs has
no effect on compaction or on the behavior of the RocksDB library.
Therefore, we are removing it in the upcoming 7.0 release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9443

Test Plan: CircleCI

Reviewed By: ajkr

Differential Revision: D33788508

Pulled By: akankshamahajan15

fbshipit-source-id: 324ca6f12bfd019e9bd5e1b0cdac39be5c3cec7d
2022-02-08 19:31:28 -08:00
Radek Hubner 42c8afd85a WriteOptions - add missing java API. (#9295)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9295

Reviewed By: riversand963

Differential Revision: D33672440

Pulled By: ajkr

fbshipit-source-id: 85f73a9297888b00255b636e7826b37186aba45c
2022-02-04 16:08:06 -08:00
Si Ke 2c3a780901 Fixed all RocksJava test failures in Centos and Alpine (#9395)
Summary:
Fixed all RocksJava test failures in CentOS and Alpine 32 bit and 64 bit OSes

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9395

Reviewed By: mrambacher

Differential Revision: D33771987

Pulled By: ajkr

fbshipit-source-id: fed91033b8df08f191ad65e1fb745a9264bbfa70
2022-02-04 16:03:56 -08:00
Jermy Li 83ff350ff2 jni: expose memtable_whole_key_filtering option (#9394)
Summary:
refer to: https://github.com/facebook/rocksdb/wiki/Prefix-Seek#configure-prefix-bloom-filter
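
For context, a minimal sketch of the underlying C++ option being exposed, alongside the memtable prefix bloom settings it complements (the values shown are illustrative):
```cpp
#include "rocksdb/options.h"
#include "rocksdb/slice_transform.h"

rocksdb::Options MakePrefixBloomOptions() {
  rocksdb::Options options;
  options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(4));
  options.memtable_prefix_bloom_size_ratio = 0.1;  // enable memtable bloom
  options.memtable_whole_key_filtering = true;     // also filter whole keys
  return options;
}
```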

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9394

Reviewed By: mrambacher

Differential Revision: D33671533

Pulled By: ajkr

fbshipit-source-id: d90db1712efdd5dd65020329867381d6b3cf2626
2022-02-04 16:01:16 -08:00
Hui Xiao 42cca28ebb Remove deprecated API AdvancedColumnFamilyOptions::rate_limit_delay_max_milliseconds (#9455)
Summary:
**Context/Summary:**
AdvancedColumnFamilyOptions::rate_limit_delay_max_milliseconds has been marked as deprecated and it's time to actually remove the code.
- Keep `rate_limit_delay_max_milliseconds` in `cf_mutable_options_type_info` to prevent throwing `InvalidArgument` in `GetColumnFamilyOptionsFromMap` when reading an option file that still contains this option (e.g., an old option file generated by RocksDB before the deprecation)
- Keep `rate_limit_delay_max_milliseconds` under `OptionsOldApiTest.GetOptionsFromMapTest` to test the case mentioned above.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9455

Test Plan: Rely on my eyeball and CI

Reviewed By: ajkr

Differential Revision: D33811664

Pulled By: hx235

fbshipit-source-id: 866859427fe710354a90f1095057f80116365ff0
2022-01-28 16:47:08 -08:00
Yanqin Jin d10c5c08d3 Remove iter_start_seqnum and preserve_deletes (#9430)
Summary:
According to https://github.com/facebook/rocksdb/blob/6.27.fb/db/db_impl/db_impl.cc#L2896:L2911 and https://github.com/facebook/rocksdb/blob/6.27.fb/db/db_impl/db_impl_open.cc#L203:L208,
we are going to remove `iter_start_seqnum` and `preserve_deletes` starting from RocksDB 7.0

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9430

Test Plan: make check and CI

Reviewed By: ajkr

Differential Revision: D33753639

Pulled By: riversand963

fbshipit-source-id: c80aab8e8d8fc33e52472fed524ed703d0ffc8b6
2022-01-28 13:28:38 -08:00
Jay Zhuang 22321e1027 Remove unused API base_background_compactions (#9462)
Summary:
The API was deprecated a long time ago. Clean up the codebase by
removing it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9462

Test Plan: CI, fake release: D33835220

Reviewed By: riversand963

Differential Revision: D33835103

Pulled By: jay-zhuang

fbshipit-source-id: 6d2dc12c8e7fdbe2700865a3e61f0e3f78bd8184
2022-01-27 21:05:18 -08:00
Peter Dillinger 78aee6fedc Remove obsolete backupable_db.h, utility_db.h (#9438)
Summary:
This also removes the obsolete names BackupableDBOptions
and UtilityDB. API users must now use BackupEngineOptions and
DBWithTTL::Open. In the C API, `rocksdb_backupable_db_*` is replaced by
`rocksdb_backup_engine_*`. Similar renaming applies in the Java API.

In reference to https://github.com/facebook/rocksdb/issues/9389

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9438

Test Plan: CI

Reviewed By: mrambacher

Differential Revision: D33780269

Pulled By: pdillinger

fbshipit-source-id: 4a6cfc5c1b4c78bcad790b9d3dd13c5fdf4a1fac
2022-01-27 15:45:30 -08:00
Hui Xiao 1e0e883ca5 Remove deprecated API AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit (#9452)
Summary:
**Context/Summary:**
AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit have been marked as deprecated and it's time to actually remove the code.
- Keep `soft_rate_limit`/`hard_rate_limit` in `cf_mutable_options_type_info` to prevent throwing `InvalidArgument` in `GetColumnFamilyOptionsFromMap` when reading an option file that still contains these options (e.g., an old option file generated by RocksDB before the deprecation)
- Keep `soft_rate_limit`/`hard_rate_limit` under `OptionsOldApiTest.GetOptionsFromMapTest` to test the case mentioned above.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9452

Test Plan: Rely on my eyeball and CI

Reviewed By: ajkr

Differential Revision: D33804938

Pulled By: hx235

fbshipit-source-id: 133d49f7ec5238d7efceeb0a3122a5792a2b9945
2022-01-27 13:01:09 -08:00
Yanqin Jin 50135c1bf3 Move HDFS support to separate repo (#9170)
Summary:
This PR moves HDFS support from the RocksDB repo to a separate repo. The new (temporary?) repo
in this PR serves as an example before we finalize the decision on where to host HDFS support and who should host it. At this point,
people can start from the example repo and fork.

Java/JNI is not included yet, and needs to be done later if necessary.

The goal is to include this commit in RocksDB 7.0 release.

Reference:
https://github.com/ajkr/dedupfs by ajkr

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9170

Test Plan:
Follow the instructions in https://github.com/riversand963/rocksdb-hdfs-env/blob/master/README.md. Build and run db_bench and db_stress.

make check

Reviewed By: ajkr

Differential Revision: D33751662

Pulled By: riversand963

fbshipit-source-id: 22b4db7f31762ed417a20239f5a08dcd1696244f
2022-01-24 20:23:54 -08:00
Eric Thérond 5602b1d3d9 Add support for Apple Silicon to RocksJava (#9254)
Summary:
Fixes facebook/rocksdb#7720

Updated the Makefile with flags to define the target architecture when compiling/linking,
and added the goal `rocksdbjavastaticosxub` to build an OS X Universal Binary native library.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9254

Reviewed By: mrambacher

Differential Revision: D33551160

Pulled By: pdillinger

fbshipit-source-id: 9ce9962e03aacf55014545a6cdf638b5b14b8fa9
2022-01-12 17:20:58 -08:00
Yanqin Jin 0376869f05 Remove using namespace (#9369)
Summary:
As title.
This is part of an fb-internal task.
First, remove all `using namespace` statements if applicable.
Next, utilize multiple build platforms and see if anything is broken.
Should anything become broken, fix the compilation errors with as little extra change as possible.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9369

Test Plan:
internal build and make check
make clean && make static_lib && cd examples && make all

Reviewed By: pdillinger

Differential Revision: D33517260

Pulled By: riversand963

fbshipit-source-id: 3fc4ce6402a073421dfd9a9b2d1c79441dca7a40
2022-01-12 09:31:12 -08:00
Andrew Kryczka b860a42158 Recover to exact latest seqno of data committed to MANIFEST (#9305)
Summary:
The LastSequence field in the MANIFEST file is the baseline seqno for a recovered DB. Recovering WAL entries might cause the recovered DB's seqno to advance above this baseline, but the recovered DB will never use a smaller seqno.

Before this PR, we were writing the DB's seqno at the time of LogAndApply() as the LastSequence value. This works in the sense that it is a large enough baseline for the recovered DB that it'll never overwrite any records in existing SST files. At the same time, it's arbitrarily larger than what's needed. This behavior comes from LevelDB, where there was no tracking of largest seqno in an SST file.

Now we know the largest seqno of newly written SST files, so we can write an exact value in LastSequence that actually reflects the largest seqno in any file referred to by the MANIFEST. This is primarily useful for correctness testing with unsynced data loss, where the recovered DB's seqno needs to indicate what records were recovered.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9305

Test Plan:
- https://github.com/facebook/rocksdb/issues/9338 adds crash-recovery correctness testing coverage for WAL disabled use cases
- https://github.com/facebook/rocksdb/issues/9357 will extend that testing to cover file ingestion
- Added assertion at end of LogAndApply() for `VersionSet::descriptor_last_sequence_` consistency with files
- Manually tested upgrade/downgrade compatibility with a custom crash test that randomly picks between a `db_stress` built with and without this PR (for old code it must run with `-disable_wal=0`)

Reviewed By: riversand963

Differential Revision: D33182770

Pulled By: ajkr

fbshipit-source-id: 0bfafaf685f347cc8cb0e1d62e0186340a738f7d
2022-01-05 16:02:21 -08:00
Adam Retter 65996dd757 Fixes for building RocksJava builds on s390x (#9321)
Summary:
* Added Docker build environment for RocksJava on s390x
* Cache alignment size for s390x was incorrectly calculated on gcc 6.4.0
* Tighter control over which installed version of Java is used is required - the build now correctly adheres to `JAVA_HOME` if it is set
* Alpine build scripts should be used on Alpine (previously CentOS script worked by falling through to minimal gcc version)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9321

Reviewed By: mrambacher

Differential Revision: D33259624

Pulled By: jay-zhuang

fbshipit-source-id: d791a5150581344925c3c3f9cbb9a3622d63b3b6
2021-12-22 12:57:50 -08:00
stefan-zobel 7ae213f735 Minor Javadoc fixes (#9203)
Summary:
Added two missing parameter tags with description and added some descriptions for parameter / return tags

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9203

Reviewed By: jay-zhuang

Differential Revision: D32990607

Pulled By: mrambacher

fbshipit-source-id: 10aea4c4cf1c28d5e97d19722ee835a965d1eb55
2021-12-21 05:40:51 -08:00
Jermy Li 9828b6d5fd fix java doc issues (#9253)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9253

Reviewed By: jay-zhuang

Differential Revision: D32990516

Pulled By: mrambacher

fbshipit-source-id: c7cdb6562ac6871bca6ea0d9efa454f3a902a137
2021-12-16 21:04:41 -08:00
Andrea Cavalli 9918e1ee5a Set KeyMayExist fields visibility to public (#9285)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/9284

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9285

Reviewed By: pdillinger

Differential Revision: D33062006

Pulled By: mrambacher

fbshipit-source-id: c3471c2db717fa5bc2337cf996ce744af0ed877d
2021-12-16 10:59:05 -08:00
Alan Paxton c1ec0b28eb java / jni io_uring support (#9224)
Summary:
Existing multiGet() in Java calls multi_get_helper(), which then calls the std::vector-returning DB::MultiGet(). This doesn't take advantage of io_uring.

This change adds another JNI-level method that runs a parallel code path using the void-returning DB::MultiGet() overload, using ByteBuffers at the JNI level. We call it multiGetDirect(). In addition to using the io_uring path, this code internally returns pinned slices which we can copy out of into our direct byte buffers; this should reduce the overall number of copies in the code path to/from Java. Some jmh benchmark runs (100k keys, 1000-key multiGet) suggest that for value sizes > 1k we see about a 20% performance improvement, although performance is slightly reduced for small value sizes, as there's a little more overhead in the JNI methods.

Closes https://github.com/facebook/rocksdb/issues/8407

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9224

Reviewed By: mrambacher

Differential Revision: D32951754

Pulled By: jay-zhuang

fbshipit-source-id: 1f70df7334be2b6c42a9c8f92725f67c71631690
2021-12-15 18:09:25 -08:00
Radek Hubner 7ac3a5d406 ReadOptions - Add missing java API. (#9248)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9248

Reviewed By: mrambacher

Differential Revision: D33011237

Pulled By: jay-zhuang

fbshipit-source-id: b6544ad40cb722e327bac60a0af711db253e36d7
2021-12-15 17:46:05 -08:00
Davide Angelocola 8a97c541e4 Fix copy constructors of Options and ColumnFamilyOptions (#9166)
Summary:
Looks like some fields are not copied by the copy constructor.

Please confirm if it is a real issue!

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9166

Reviewed By: jay-zhuang

Differential Revision: D32532093

Pulled By: mrambacher

fbshipit-source-id: f636ef9425a530a8655947115160ae471916252b
2021-12-13 07:22:56 -08:00
Yanqin Jin bd513fd075 Add commit marker with timestamp (#9266)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9266

This diff adds a new tag `CommitWithTimestamp`. Currently, there is no API to trigger writing
this tag to the WAL, thus it is unavailable to users.
This is an ongoing effort to add user-defined timestamp support to write-committed transactions.
This diff also stipulates that all column families that may potentially participate in the same
transaction must either disable timestamps or have the same timestamp format, since the
`CommitWithTimestamp` tag is followed by a single byte array denoting the commit
timestamp of the transaction. We will enforce this check in a future diff. We keep this
diff small.

Reviewed By: ltamasi

Differential Revision: D31721350

fbshipit-source-id: e1450811443647feb6ca01adec4c8aaae270ffc6
2021-12-10 11:05:35 -08:00
Jermy Li c39a808cb6 Deprecate WriteBatch.remove() and use the new style delete() (#9256)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9256

Reviewed By: mrambacher

Differential Revision: D32971447

Pulled By: jay-zhuang

fbshipit-source-id: 6954d7287229a8c776092bd82af3a8a8cd92b35e
2021-12-10 09:18:17 -08:00
stefan-zobel f57745814f Minor RocksJava Java code cosmetics (#9204)
Summary:
Specifically:
- unused imports
- code formatting
- typos in comments
- unnecessary casts
- missing default label in switch statement
- explicit use of long literals in multiplication
- use generics where possible without backward compatibility risk

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9204

Reviewed By: ajkr

Differential Revision: D32955184

Pulled By: jay-zhuang

fbshipit-source-id: 42d05ce42639d982b9ea34c8081266dfba7f1efa
2021-12-09 20:00:48 -08:00
Hui Xiao 9daf07305c Replace TableProperties::properties_offsets map with external_sst_file_global_seqno_offset (#9212)
Summary:
**Context:**
Searching `TableProperties::properties_offsets` across the codebase reveals that internally it is only used to find the external SST file's global seqno offset. Therefore we can narrow it down and replace this map property with a uint64_t property `external_sst_file_global_seqno_offset` to save memory usage related to table properties.

Note:
- See PR comments for discussion about potential impact on existing external usage of `TableProperties::properties_offsets`
- See PR comments for discussion on keeping external SST file global seqno's offset VS using a simple flag indicating seqno's existence.

**Summary:**
- Replaced `TableProperties::properties_offsets` with `TableProperties::external_sst_file_global_seqno_offset`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9212

Test Plan: - Relied on existing tests should be sufficient since `TableProperties::properties_offsets` existed before and should already be tested.

Reviewed By: ajkr

Differential Revision: D32665941

Pulled By: hx235

fbshipit-source-id: 718e44617346dc4f3b1276ee953e61c196277795
2021-12-02 08:30:36 -08:00
Adam Retter d94932323a Check that newIteratorWithBase regardless of WBWI Overwrite Mode (#8134)
Summary:
The behaviour of WBWI has changed when calling newIteratorWithBase when overwrite is set to true or false. This PR simply adds tests to assert the new correct behaviour.

Closes https://github.com/facebook/rocksdb/issues/7370
Closes https://github.com/facebook/rocksdb/pull/8134

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9107

Reviewed By: pdillinger

Differential Revision: D32099475

Pulled By: mrambacher

fbshipit-source-id: 245f483f73db866cc8a51219a2bff2e09e59faa0
2021-11-18 11:53:09 -08:00
Davide Angelocola c9539ede76 Fix integer overflow in TraceOptions (#9157)
Summary:
Hello from a happy user of rocksdb java :-)

Default constructor of TraceOptions is supposed to initialize size to 64GB but the expression contains an integer overflow.

Simple test case with JShell:
```
jshell> 64 * 1024 * 1024 * 1024
$1 ==> 0

jshell> 64L * 1024 * 1024 * 1024
$2 ==> 68719476736
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9157

Reviewed By: pdillinger, zhichao-cao

Differential Revision: D32369273

Pulled By: mrambacher

fbshipit-source-id: 6a0c95fff7a91f27ff15d65b662c6b101756b450
2021-11-17 08:41:48 -08:00
Zhichao Cao b694cd0e0d Add tiered storage related read bytes stats to Statistic (#9123)
Summary:
Add the 3 read-bytes counters to Statistics; they will be used by storage tiering to get read information for files with different temperatures.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9123

Test Plan: added new testing cases.

Reviewed By: siying

Differential Revision: D32154745

Pulled By: zhichao-cao

fbshipit-source-id: b7905d6dae469a72428742364ec07b634b6f15da
2021-11-16 15:17:17 -08:00
Adam Retter 1a8eec461b Remove invalid RocksJava native entry (#9147)
Summary:
It seems that an incorrect native source file entry was introduced in https://github.com/facebook/rocksdb/pull/8999. For some reason it appears that CI was not run against that PR, and so the problem was not detected.

This PR fixes the problem by removing the invalid entry, allowing RocksJava to build correctly again.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9147

Reviewed By: pdillinger

Differential Revision: D32300976

fbshipit-source-id: dbd763b806bacf0fc08f4deaf07c63d0a266c4cf
2021-11-09 17:21:58 -08:00
Alan Paxton e5b34f5867 Fb 5789 max total WAL size clarification (#9108)
Summary:
Add clarification/extension to comments on max_total_wal_size and the Java wrapper MaxTotalWalSize to better explain the effect of the option on log file sizes.
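
For reference, a minimal sketch of setting the option whose documentation is being clarified:
```cpp
#include "rocksdb/options.h"

rocksdb::Options MakeWalCappedOptions() {
  rocksdb::Options options;
  // Once total WAL size exceeds this cap, column families holding the
  // oldest data are flushed so stale log files can be released.
  options.max_total_wal_size = 1ULL << 30;  // 1 GB
  return options;
}
```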

Closes https://github.com/facebook/rocksdb/issues/5789

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9108

Reviewed By: pdillinger

Differential Revision: D32066640

Pulled By: mrambacher

fbshipit-source-id: 7d5affc87e4119019054af9c884a2ea01d68f5b7
2021-11-08 08:54:37 -08:00
Adam Retter be351f4754 Restore Java 7 Compatibility (#9103)
Summary:
RocksDB should still compile on Java 7.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9103

Reviewed By: pdillinger

Differential Revision: D32067561

Pulled By: mrambacher

fbshipit-source-id: bbe9c18c8007ab3e113de4add56a84c9bde61c8e
2021-11-08 08:21:02 -08:00
Alan Paxton ec9082d698 Regression tests for tickets fixed by previous change. (#9019)
Summary:
closes https://github.com/facebook/rocksdb/issues/5891
closes https://github.com/facebook/rocksdb/issues/2001

Java BytewiseComparator is now unsigned compliant, consistent with the default C++ comparator, which has always been unsigned. Consequently, 2 tickets reporting the previous broken state can be closed.

This test confirms that the following issues were in fact resolved by a change made between 6.2.2 and 6.22.1, namely https://github.com/facebook/rocksdb/commit/7242dae7, which, as part of its effect, changed the Java bytewise comparators.
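
A small standalone illustration (not from the PR) of why signedness matters when comparing key bytes:

```
public class UnsignedCompareDemo {
  public static void main(String[] args) {
    final byte a = (byte) 0x7f;  // 127, signed or unsigned
    final byte b = (byte) 0x80;  // -128 signed, 128 unsigned

    // Signed order puts b before a, disagreeing with the C++ default.
    System.out.println(Byte.compare(a, b) > 0);                   // true: a > b
    // Unsigned (lexicographic) order matches the default comparator.
    System.out.println(Integer.compare(a & 0xff, b & 0xff) < 0);  // true: a < b
  }
}
```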

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9019

Reviewed By: pdillinger

Differential Revision: D31610910

Pulled By: mrambacher

fbshipit-source-id: 664230f1377a1aa270136edd63eea2c206b907e9
2021-11-01 15:06:47 -07:00
Alan Paxton 73e6b89fad Java wrapper for blob_gc_force_threshold as blobGarbageCollectionForceThreshold (#9109)
Summary:
Extra option added as a supplement to https://github.com/facebook/rocksdb/pull/8999

Closes https://github.com/facebook/rocksdb/issues/8221

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9109

Reviewed By: mrambacher

Differential Revision: D32065039

Pulled By: ltamasi

fbshipit-source-id: 6c484050a30fe0523850a8a3c95dc85b0a501362
2021-11-01 11:59:10 -07:00
myasuka dc00e4b120 Introduce allowStall option for write buffer manager constructor (#9076)
Summary:
https://github.com/facebook/rocksdb/pull/7898 enabled the write buffer manager to stall writes when memory usage exceeds the buffer size, which is really useful for limiting memory usage when running in containers. However, this feature is not yet visible to RocksJava.

This PR introduces this feature to RocksJava.
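
A minimal sketch of how the new constructor could be used, assuming a three-argument form `WriteBufferManager(long, Cache, boolean)` as described above:

```
import org.rocksdb.*;

public class WriteBufferManagerStall {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (final Cache cache = new LRUCache(512L * 1024 * 1024);
         // allowStall = true: writers stall rather than fail once total
         // memtable memory exceeds the 256 MB budget.
         final WriteBufferManager wbm =
             new WriteBufferManager(256L * 1024 * 1024, cache, true);
         final Options options = new Options()
             .setCreateIfMissing(true)
             .setWriteBufferManager(wbm);
         final RocksDB db = RocksDB.open(options, "/tmp/wbm-demo")) {
      db.put("key".getBytes(), "value".getBytes());
    }
  }
}
```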

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9076

Reviewed By: akankshamahajan15

Differential Revision: D31931092

Pulled By: anand1976

fbshipit-source-id: 5531c16a87598663a02368c07b5e13a503164578
2021-10-26 12:09:54 -07:00
Jonathan Albrecht e970248602 Add support for building on s390x platform (#8962)
Summary:
This PR adds support for building on s390x including updating travis CI. It uses the previous work in https://github.com/facebook/rocksdb/pull/6168 and adds some more changes to get all current tests (make check and jni tests) to pass. The tests were run with snappy, lz4, bzip2 and zstd all compiled in.

There are a few pieces still needed to get the travis build working that I don't think I can do. adamretter is this something you could help with?

1. A prebuilt https://rocksdb-deps.s3-us-west-2.amazonaws.com/cmake/cmake-3.14.5-Linux-s390x.deb package
2. A https://hub.docker.com/r/evolvedbinary/rocksjava s390x image

Not sure if there is more required for travis. Happy to help in any way I can.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8962

Reviewed By: mrambacher

Differential Revision: D31802198

Pulled By: pdillinger

fbshipit-source-id: 683511466fa6b505f85ba5a9964a268c6151f0c2
2021-10-22 10:13:15 -07:00
Alan Paxton 8d615a2b1d New-style blob option bindings, Java option getter and improve/fix option parsing (#8999)
Summary:
Implementation of https://github.com/facebook/rocksdb/issues/8221, plus/including extension of Java options API to allow the get() of options from RocksDB. The extension allows more comprehensive testing of options at the Java side, by validating that the options are set at the C++ side.

Variations on methods:
MutableColumnFamilyOptions.MutableColumnFamilyOptionsBuilder getOptions()
MutableDBOptions.MutableDBOptionsBuilder getDBOptions()

retrieve the options via RocksDB C++ interfaces, and parse the resulting string into one of the Java-style option objects.

This necessitated generalising the parsing of option strings in Java, which now parses the full range of option strings returned by the C++ interface, rather than just a useful subset. This in turn necessitated changing the list separator from , (comma) to : (colon).
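
A hedged sketch of the resulting round trip; the call sites below follow the method names listed above but are illustrative only:

```
import org.rocksdb.*;

public class OptionsRoundTrip {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (final Options options = new Options().setCreateIfMissing(true);
         final RocksDB db = RocksDB.open(options, "/tmp/opts-demo")) {
      // Read the currently effective mutable CF options back from C++ ...
      final MutableColumnFamilyOptions.MutableColumnFamilyOptionsBuilder builder =
          db.getOptions(db.getDefaultColumnFamily());
      // ... adjust one, then apply the result again.
      builder.setWriteBufferSize(64L * 1024 * 1024);
      db.setOptions(db.getDefaultColumnFamily(), builder.build());
    }
  }
}
```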

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8999

Reviewed By: jay-zhuang

Differential Revision: D31655487

Pulled By: ltamasi

fbshipit-source-id: c38e98145c81c61dc38238b0df580db176ce4efd
2021-10-19 09:21:52 -07:00
Alan Paxton 86cf7266c3 keyMayExist() supports ByteBuffer (#9013)
Summary:
closes https://github.com/facebook/rocksdb/issues/7917

Implemented ByteBuffer API variants of Java keyMayExist() uniformly with and without column families, read options and return data values. Implemented 2 supporting C++ JNI methods.
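
A hedged sketch of the new ByteBuffer variants; the exact signatures and the `KeyMayExist` field names are assumptions based on the summary above:

```
import java.nio.ByteBuffer;
import org.rocksdb.*;

public class KeyMayExistDemo {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (final Options options = new Options().setCreateIfMissing(true);
         final RocksDB db = RocksDB.open(options, "/tmp/kme-demo")) {
      db.put("k1".getBytes(), "v1".getBytes());

      final ByteBuffer key = ByteBuffer.allocateDirect(16);
      key.put("k1".getBytes()).flip();

      // Cheap, possibly false-positive membership check.
      if (db.keyMayExist(key)) {
        key.rewind();
        final ByteBuffer value = ByteBuffer.allocateDirect(16);
        // Variant that also returns the value when it is cheaply available.
        final KeyMayExist result = db.keyMayExist(key, value);
        System.out.println(result.exists + " valueLength=" + result.valueLength);
      }
    }
  }
}
```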

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9013

Reviewed By: mrambacher

Differential Revision: D31665989

Pulled By: jay-zhuang

fbshipit-source-id: 8adc1730217dba38d6fa7b31d788650a33e28af1
2021-10-18 17:20:07 -07:00
Alan Paxton f5526af8ed Fix multiget throwing NPE for num of keys > 70k (#9012)
Summary:
closes https://github.com/facebook/rocksdb/issues/8039

Unnecessary simultaneous use of multiple local JNI references, one per key, was limiting the size of the key array. The local references don't need to be held simultaneously, so by rearranging the code we can make it work for bigger key arrays.

Incidentally, errors now throw helpful exception messages rather than returning a null pointer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9012

Reviewed By: mrambacher

Differential Revision: D31580862

Pulled By: jay-zhuang

fbshipit-source-id: ce05831d52ede332e1b20e74d2dc621d219b9616
2021-10-14 11:48:12 -07:00
Jay Zhuang 6b34eb0ebc Add remote compaction read/write bytes statistics (#8939)
Summary:
Add basic read/write bytes statistics on the primary side:
`REMOTE_COMPACT_READ_BYTES`
`REMOTE_COMPACT_WRITE_BYTES`

Fixed existing statistics missing some IO for remote compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8939

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D31074672

Pulled By: jay-zhuang

fbshipit-source-id: c57afdba369990185008ffaec7e3fe7c62e8902f
2021-09-28 14:00:37 -07:00
Peter Dillinger cb5b851ff8 Add (& fix) some simple source code checks (#8821)
Summary:
* Don't hardcode namespace rocksdb (use ROCKSDB_NAMESPACE)
* Don't #include <rocksdb/...> (use double quotes)
* Support putting NOCOMMIT (any case) in source code that should not be
committed/pushed in current state.

These checks will be run with `make check` and in GitHub Actions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8821

Test Plan: existing tests, manually try out new checks

Reviewed By: zhichao-cao

Differential Revision: D30791726

Pulled By: pdillinger

fbshipit-source-id: 399c883f312be24d9e55c58951d4013e18429d92
2021-09-07 21:19:27 -07:00
Andrew Kryczka 9308ff366c Bytes read/written stats for `CreateNewBackup*()` (#8819)
Summary:
Gets `Statistics` from the options associated with the `DB` undergoing backup, and populates new ticker stats with the thread-local `IOContext` read/write counters for the threads doing backup work.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8819

Reviewed By: pdillinger

Differential Revision: D30779238

Pulled By: ajkr

fbshipit-source-id: 75ccafc355f90906df5cf80367f7245b985772d8
2021-09-07 18:25:16 -07:00
Andrew Kryczka 941543721d Bytes read stat for `VerifyChecksum()` and `VerifyFileChecksums()` APIs (#8741)
Summary:
- Clarified some comments on compatibility for adding new ticker stats
- Added read I/O stats for `VerifyChecksum()` and `VerifyFileChecksums()` APIs

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8741

Test Plan: new unit test

Reviewed By: zhichao-cao

Differential Revision: D30708578

Pulled By: ajkr

fbshipit-source-id: d06b961f7e199ae92c266b683e39870aa8f63449
2021-09-07 13:28:29 -07:00
Peter Dillinger c9cd5d25a8 Remove some unneeded code (#8736)
Summary:
* FullKey and ParseFullKey appear to serve no purpose in the public API
(or anything else) so removed. Only use in one test updated.
* NumberToString serves no purpose vs. ToString so removed, numerous
calls updated
* Remove unnecessary forward declarations in metadata.h by re-arranging
class definitions.
* Remove some unneeded semicolons

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8736

Test Plan: existing tests

Reviewed By: mrambacher

Differential Revision: D30700039

Pulled By: pdillinger

fbshipit-source-id: 1e436a576f511a6ed8b4d97af7cc8216bc729af2
2021-09-01 14:28:58 -07:00
anand76 add68bd28a Add a stat to count secondary cache hits (#8666)
Summary:
Add a stat for secondary cache hits. The ```Cache::Lookup``` API had an unused ```stats``` parameter. This PR uses that to pass the pointer to a ```Statistics``` object that ```LRUCache``` uses to record the stat.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8666

Test Plan: Update a unit test in lru_cache_test

Reviewed By: zhichao-cao

Differential Revision: D30353816

Pulled By: anand1976

fbshipit-source-id: 2046f78b460428877a26ffdd2bb914ae47dfbe77
2021-08-16 21:01:14 -07:00
sdong e7c24168d8 Move old files to warm tier in FIFO compactions (#8310)
Summary:
Some FIFO users want to keep the data for longer, but the old data is rarely accessed. This feature allows users to configure FIFO compaction so that data older than a threshold is moved to a warm storage tier.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8310

Test Plan: Add several unit tests.

Reviewed By: ajkr

Differential Revision: D28493792

fbshipit-source-id: c14824ea634814dee5278b449ab5c98b6e0b5501
2021-08-09 12:51:14 -07:00
Brendan MacDonell 8ca081780b Correct javadoc for Env#setBackgroundThreads(int) (#8576)
Summary:
By default, the low priority pool is not the flush pool, so calling `Env#setBackgroundThreads` without providing a priority will not do what the caller expected.
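
A short illustration of the corrected usage, assuming the standard RocksJava `Priority` enum:

```
import org.rocksdb.*;

public class BackgroundThreadsDemo {
  public static void main(String[] args) {
    RocksDB.loadLibrary();
    try (final Options options = new Options()) {
      final Env env = options.getEnv();
      // A bare setBackgroundThreads(n) resizes the LOW (compaction) pool,
      // not the flush pool, so address each pool explicitly.
      env.setBackgroundThreads(4, Priority.LOW);   // compaction threads
      env.setBackgroundThreads(2, Priority.HIGH);  // flush threads
    }
  }
}
```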

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8576

Reviewed By: ajkr

Differential Revision: D29925154

Pulled By: mrambacher

fbshipit-source-id: cd7211fc374e7d9929a9b88ea0a5ba8134b76099
2021-08-06 08:52:14 -07:00
Mikhail Golubev 8f52972cf9 Allow a string to be used as a delimiter in StringAppendOperator (#8536)
Summary:
An arbitrary string can now be used as the delimiter in the StringAppend merge operator. In particular, this allows using an empty string, combining binary values for the same key byte-to-byte, one right next to another.
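
A hedged Java sketch, assuming a `StringAppendOperator(String)` constructor mirroring the new behaviour:

```
import org.rocksdb.*;

public class EmptyDelimiterAppend {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (final StringAppendOperator append = new StringAppendOperator("");
         final Options options = new Options()
             .setCreateIfMissing(true)
             .setMergeOperator(append);
         final RocksDB db = RocksDB.open(options, "/tmp/append-demo")) {
      db.merge("k".getBytes(), "ab".getBytes());
      db.merge("k".getBytes(), "cd".getBytes());
      // Empty delimiter: values concatenate byte-to-byte.
      System.out.println(new String(db.get("k".getBytes())));  // "abcd"
    }
  }
}
```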

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8536

Reviewed By: mrambacher

Differential Revision: D29962120

Pulled By: zhichao-cao

fbshipit-source-id: 4ef5d846a47835cf428a11200409e30e2dbffc4f
2021-08-02 16:50:41 -07:00
Anatolii Zhmaiev 9ddb55a8f6 Add periodic_compaction_seconds option to RocksJava (#8579)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/8578
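
A minimal sketch of the new option in Java; the camel-case setter name is an assumption following the usual RocksJava convention:

```
import org.rocksdb.*;

public class PeriodicCompactionDemo {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (final Options options = new Options()
             .setCreateIfMissing(true)
             // Re-compact any file older than 30 days, even if no other
             // compaction trigger fires for it.
             .setPeriodicCompactionSeconds(30L * 24 * 60 * 60);
         final RocksDB db = RocksDB.open(options, "/tmp/periodic-demo")) {
      System.out.println(options.periodicCompactionSeconds());  // 2592000
    }
  }
}
```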

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8579

Reviewed By: ajkr

Differential Revision: D29895081

Pulled By: mrambacher

fbshipit-source-id: 3e4120e26a3e8252f8301d657c0aaa0b8550cddf
2021-07-26 17:33:42 -07:00
Peter Dillinger df5dc73bec Don't hold DB mutex for block cache entry stat scans (#8538)
Summary:
I previously didn't notice that the DB mutex was being held during
block cache entry stat scans, probably because I primarily checked for
read performance regressions, since reads require the block cache and
are traditionally latency-sensitive.

This change does some refactoring to avoid holding DB mutex and to
avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats").
Some tests have to be updated because now the stats collector is
populated in the Cache aggressively on DB startup rather than lazily.
(I hope to clean up some of this added complexity in the future.)

This change also ensures proper treatment of need_out_of_mutex for
non-int DB properties.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538

Test Plan:
Added unit test logic that uses sync points to fail if the DB mutex
is held during a scan, covering the various ways that a scan might be
triggered.

Performance test - the known impact to holding the DB mutex is on
TransactionDB, and the easiest way to see the impact is to hack the
scan code to almost always miss and take an artificially long time
scanning. Here I've injected an unconditional 5s sleep at the call to
ApplyToAllEntries.

Before (hacked):

    $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     433.219 micros/op 2308 ops/sec;    0.1 MB/s ( transactions:78999 aborts:0)
    rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856
    $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     448.802 micros/op 2228 ops/sec;    0.1 MB/s ( transactions:75999 aborts:0)
    rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323

Notice the 5s P100 write time.

After (hacked):

    $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     303.645 micros/op 3293 ops/sec;    0.1 MB/s ( transactions:98999 aborts:0)
    rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407
    $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     310.383 micros/op 3221 ops/sec;    0.1 MB/s ( transactions:96999 aborts:0)
    rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918

P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code:

    $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     311.365 micros/op 3211 ops/sec;    0.1 MB/s ( transactions:96999 aborts:0)
    rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767
    $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     308.395 micros/op 3242 ops/sec;    0.1 MB/s ( transactions:97999 aborts:0)
    rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832

No substantial difference.

Reviewed By: siying

Differential Revision: D29738847

Pulled By: pdillinger

fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
2021-07-16 14:13:08 -07:00
longlijian 4e4ec16957 Replace the namespace "rocksdb" with "ROCKSDB_NAMESPACE" (#8531)
Summary:
For more detail, see https://github.com/facebook/rocksdb/issues/6433
(https://github.com/facebook/rocksdb/pull/6433).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8531

Reviewed By: siying

Differential Revision: D29717057

Pulled By: ajkr

fbshipit-source-id: 3ccad9501e5612590e54a7cf8c447118f323c7f4
2021-07-15 17:23:39 -07:00
Baptiste Lemaire e817bc9628 Added memtable garbage statistics (#8411)
Summary:
**Summary**:
2 new statistics counters are added to RocksDB: `MEMTABLE_PAYLOAD_BYTES_AT_FLUSH` and `MEMTABLE_GARBAGE_BYTES_AT_FLUSH`. The former tracks how many raw bytes of useful data are present in the memtable at flush time, whereas the latter tracks how many of these raw bytes are considered garbage, meaning that they ended up not being included in the SSTables resulting from the flush operations.
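
A hedged sketch of reading the two counters from Java, assuming the tickers are exposed under the same names in `TickerType` (the PR itself is C++-side, so treat the Java names as an assumption):

```
import org.rocksdb.*;

public class MemtableGarbageDemo {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (final Statistics stats = new Statistics();
         final Options options = new Options()
             .setCreateIfMissing(true)
             .setStatistics(stats);
         final RocksDB db = RocksDB.open(options, "/tmp/garbage-demo");
         final FlushOptions flushOptions = new FlushOptions()) {
      db.put("k".getBytes(), "v1".getBytes());
      db.put("k".getBytes(), "v2".getBytes());  // v1 becomes garbage
      db.flush(flushOptions);
      System.out.println(
          stats.getTickerCount(TickerType.MEMTABLE_PAYLOAD_BYTES_AT_FLUSH) + " / "
          + stats.getTickerCount(TickerType.MEMTABLE_GARBAGE_BYTES_AT_FLUSH));
    }
  }
}
```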

**Unit test**: run `make db_flush_test -j$(nproc); ./db_flush_test` to run the unit test.
This executable includes 3 tests, that test support and correct stat calculations for workloads with inserts, deletes, and DeleteRanges. The parameters are set such that the workloads are performed on a single memtable, and a single SSTable is created as a result of the flush operation. The flush operation is manually called in the test file. The tests verify that the values of these 2 statistics counters introduced in this PR  can be exactly predicted, showing that we have a full understanding of the underlying operations.

**Performance testing**:
`./db_bench -statistics -benchmarks=fillrandom -num=10000000` repeated 10 times.
Timing done using "date" function in a bash script.
_Results_:
Original Rocksdb fork: mean 66.6 sec, std 1.18 sec.
This feature branch: mean 67.4 sec, std 1.35 sec.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8411

Reviewed By: akankshamahajan15

Differential Revision: D29150629

Pulled By: bjlemaire

fbshipit-source-id: 7b3c2e86d50c6aa34fa50fd134282eacb543a5b1
2021-06-18 04:57:27 -07:00
Sidi Mohamed EL AATIFI 298edae941 Fix a typo in Javadoc (#8394)
Summary:
`iterateLowerBound`: Slice representing the lower bound

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8394

Reviewed By: ajkr

Differential Revision: D29085721

Pulled By: jay-zhuang

fbshipit-source-id: a154375879395c48e9bd3794d296e70316894056
2021-06-17 12:02:57 -07:00
mrambacher 8948dc8524 Make ImmutableOptions struct that inherits from ImmutableCFOptions and ImmutableDBOptions (#8262)
Summary:
The ImmutableCFOptions contained a bunch of fields that belonged to the ImmutableDBOptions.  This change cleans that up by introducing an ImmutableOptions struct.  Following the pattern of Options struct, this class inherits from the DB and CFOption structs (of the Immutable form).

Only one structural change (the ImmutableCFOptions::fs was changed to a shared_ptr from a raw one) is in this PR.  All of the other changes involve moving the member variables from the ImmutableCFOptions into the ImmutableOptions and changing member variables or function parameters as required for compilation purposes.

Follow-on PRs may do a further clean-up of the code, such as renaming variables (such as "ImmutableOptions cf_options") and potentially eliminating un-needed function parameters (there is no longer a need to pass both an ImmutableDBOptions and an ImmutableOptions to a function).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8262

Reviewed By: pdillinger

Differential Revision: D28226540

Pulled By: mrambacher

fbshipit-source-id: 18ae71eadc879dedbe38b1eb8e6f9ff5c7147dbf
2021-05-05 14:00:17 -07:00
Adam Retter 69c986825e Fix javadoc for keyMayExist (#8232)
Summary:
Closes https://github.com/facebook/rocksdb/issues/6985

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8232

Reviewed By: jay-zhuang

Differential Revision: D27999779

Pulled By: mrambacher

fbshipit-source-id: a37c88d93bde2692b8be9e46e673dda7bea701b2
2021-04-26 08:34:10 -07:00
Yanqin Jin a376c22066 Handle rename() failure in non-local FS (#8192)
Summary:
In a distributed environment, a file `rename()` operation can succeed on server (remote)
side, but the client can somehow return non-ok status to RocksDB. Possible reasons include
network partition, connection issue, etc. This happens in `rocksdb::SetCurrentFile()`, which
can be called in `LogAndApply() -> ProcessManifestWrites()` if RocksDB tries to switch to a
new MANIFEST. We currently always delete the new MANIFEST if an error occurs.

This is problematic in distributed world. If the server-side successfully updates the CURRENT
file via renaming, then a subsequent `DB::Open()` will try to look for the new MANIFEST and fail.

As a fix, we can track the execution result of IO operations on the new MANIFEST.
- If IO operations on the new MANIFEST fail, then we know the CURRENT must point to the original
  MANIFEST. Therefore, it is safe to remove the new MANIFEST.
- If IO operations on the new MANIFEST all succeed, but somehow we end up in the clean up
  code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For local
  POSIX-compliant FS, it should still point to old MANIFEST, but it does not matter if we keep the
  new MANIFEST.) Therefore, we keep the new MANIFEST.
    - Any future `LogAndApply()` will switch to a new MANIFEST and update CURRENT.
    - If process reopens the db immediately after the failure, then the CURRENT file can point
      to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can
      succeed and ignore the other.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8192

Test Plan: make check

Reviewed By: zhichao-cao

Differential Revision: D27804648

Pulled By: riversand963

fbshipit-source-id: 9c16f2a5ce41bc6aadf085e48449b19ede8423e4
2021-04-19 18:11:13 -07:00
Andrew Kryczka 1ba2b8a568 Add sample_for_compression results to table properties (#8139)
Summary:
Added `TableProperties::{fast,slow}_compression_estimated_data_size`.
These properties are present in block-based tables when
`ColumnFamilyOptions::sample_for_compression > 0` and the necessary
compression library is supported when the file is generated. They
contain estimates of what `TableProperties::data_size` would be if the
"fast"/"slow" compression library had been used instead. One
limitation is we do not record exactly which "fast" (ZSTD or Zlib)
or "slow" (LZ4 or Snappy) compression library produced the result.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8139

Test Plan:
- new unit test
- ran `db_bench` with `sample_for_compression=1`; verified the `data_size` property matches the `{slow,fast}_compression_estimated_data_size` when the same compression type is used for the output file compression and the sampled compression

Reviewed By: riversand963

Differential Revision: D27454338

Pulled By: ajkr

fbshipit-source-id: 9529293de93ddac7f03b2e149d746e9f634abac4
2021-03-31 18:21:50 -07:00
Jay Zhuang a781b103da Fix getApproximateMemTableStats() return type (#8098)
Summary:
getApproximateMemTableStats() should return two longs instead of an array.
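
A hedged sketch of the corrected API, assuming the holder type is `RocksDB.CountAndSize` with `count` and `size` fields:

```
import org.rocksdb.*;

public class MemTableStatsDemo {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (final Options options = new Options().setCreateIfMissing(true);
         final RocksDB db = RocksDB.open(options, "/tmp/mts-demo");
         final Slice start = new Slice("a".getBytes());
         final Slice limit = new Slice("z".getBytes())) {
      db.put("key".getBytes(), "value".getBytes());
      // Two named longs instead of a long[2] array.
      final RocksDB.CountAndSize stats =
          db.getApproximateMemTableStats(new Range(start, limit));
      System.out.println("count=" + stats.count + " size=" + stats.size);
    }
  }
}
```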

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8098

Reviewed By: mrambacher

Differential Revision: D27308741

Pulled By: jay-zhuang

fbshipit-source-id: 44beea2bd28cf6779b048bebc98f2426fe95e25c
2021-03-31 09:46:47 -07:00
Vlad Artamonov 4a6bc47b2e Fix possible mistype in a comment (#8086)
Summary:
This is a small fix to what I think is a mistype in two comments in `DBOptionsInterface.java`. If it was not an error, feel free to close.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8086

Reviewed By: ajkr

Differential Revision: D27260488

Pulled By: mrambacher

fbshipit-source-id: 469daadaf6039d5b5187132b8e0c7c3672842f21
2021-03-23 12:37:24 -07:00
Zhichao Cao 08ec5e7321 Add the statistics and info log for Error handler (#8050)
Summary:
Add statistics and info log for the error handler: counters for bg error, bg io error, bg retryable io error, auto resume, auto resume total retry, and auto resume success; plus a histogram for the auto resume retry count in each recovery call.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8050

Test Plan: make check and add test to error_handler_fs_test

Reviewed By: anand1976

Differential Revision: D26990565

Pulled By: zhichao-cao

fbshipit-source-id: 49f71e8ea4e9db8b189943976404205b56ab883f
2021-03-17 22:38:13 -07:00
Xiaopeng Zhang c603f2f898 support getUsage and getPinnedUsage in JavaAPI for Cache (#7925)
Summary:
Support getUsage and getPinnedUsage in the Java API for Cache.
Also fix a typo in LRUCacheTest.java where the highPriPoolRatio was not valid (set to 5; I guess it means 0.05).
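
A hedged sketch of the new accessors (with the ratio written correctly as 0.05):

```
import org.rocksdb.*;

public class CacheUsageDemo {
  public static void main(String[] args) {
    RocksDB.loadLibrary();
    try (final Cache cache = new LRUCache(
             128L * 1024 * 1024,  // capacity
             -1,                  // numShardBits (auto)
             false,               // strictCapacityLimit
             0.05)) {             // highPriPoolRatio: a ratio, not "5"
      System.out.println("usage=" + cache.getUsage()
          + " pinned=" + cache.getPinnedUsage());
    }
  }
}
```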

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7925

Reviewed By: mrambacher

Differential Revision: D26900241

Pulled By: ajkr

fbshipit-source-id: 735d1e40a16fa8919c89c7c7154ba7f81208ec33
2021-03-17 09:30:33 -07:00
stefan-zobel 8d9088464b Java-API: Fix minor Javadoc copy-paste errors (#8034)
Summary:
Fixes 3 minor Javadoc copy-paste errors in the `RocksDB#newIterator()` and `Transaction#getIterator()` variants that take a column family handle but are talking about iterating over "the database" or "the default column family".

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8034

Reviewed By: jay-zhuang

Differential Revision: D26877667

Pulled By: mrambacher

fbshipit-source-id: 95dd95b667c496e389f221acc9a91b340e4b63bf
2021-03-16 18:07:09 -07:00
stefan-zobel cc34da75b5 Java-API: byteCompressionType should be declared as primitive type byte (#7981)
Summary:
The variable `byteCompressionType` is only assigned values of primitive type and is never 'null', but it is declared with the boxed type 'Byte'.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7981

Reviewed By: ajkr

Differential Revision: D26546600

Pulled By: jay-zhuang

fbshipit-source-id: 07b579cdfcfc2262a448ca3626e216416fd05892
2021-03-09 22:05:16 -08:00
Peter Dillinger 0028e3398b Make format_version=5 new default (#8017)
Summary:
We haven't seen any production issues with the new Bloom filter, and
it's now > 1 year old (added in 6.6.0).

Updated check_format_compatible.sh and HISTORY.md

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8017

Test Plan: tests updated (or prior bugs fixed)

Reviewed By: ajkr

Differential Revision: D26762197

Pulled By: pdillinger

fbshipit-source-id: 0e755c46b443087c1544da0fd545beb9c403d1c2
2021-03-09 12:42:53 -08:00
stefan-zobel 430842f948 Java-API: Missing space in string literal (#7982)
Summary:
`TtlDB.open()`: missing space after 'column'
`AdvancedColumnFamilyOptionsInterface.setLevelCompactionDynamicLevelBytes()`: missing space after 'cause'

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7982

Reviewed By: ajkr

Differential Revision: D26546632

Pulled By: jay-zhuang

fbshipit-source-id: 885dedcaa2200842764fbac9ce3766d54e1c8914
2021-03-09 11:30:29 -08:00
Andrew Kryczka d904233d2f Limit buffering for collecting samples for compression dictionary (#7970)
Summary:
For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file.

However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how much data blocks are buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). It is not strict as we currently buffer more than just data blocks -- also keys are buffered. But it does make a step towards giving users predictable memory usage.

Related changes include:

- Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks
- Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary
- Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string.
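
A hedged Java sketch of configuring the new limit; whether RocksJava exposes `setMaxDictBufferBytes` is an assumption here, since this PR is C++-side:

```
import org.rocksdb.*;

public class DictBufferLimitDemo {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (final CompressionOptions compression = new CompressionOptions()
             .setEnabled(true)
             .setMaxDictBytes(16 * 1024)                 // target dictionary size
             .setZStdMaxTrainBytes(100 * 16 * 1024)      // training sample budget
             .setMaxDictBufferBytes(64L * 1024 * 1024);  // cap buffering at 64 MB
         final Options options = new Options()
             .setCreateIfMissing(true)
             .setCompressionType(CompressionType.ZSTD_COMPRESSION)
             .setCompressionOptions(compression);
         final RocksDB db = RocksDB.open(options, "/tmp/dict-demo")) {
      db.put("key".getBytes(), "value".getBytes());
    }
  }
}
```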

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970

Test Plan:
- updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level
- looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set.

Reviewed By: pdillinger

Differential Revision: D26467994

Pulled By: ajkr

fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465
2021-02-19 14:09:54 -08:00
stefan-zobel 251143f8fb rocksdbjni: Possible NPE in RocksDB.setOptions #7869 (#7909)
Summary:
Fix for https://github.com/facebook/rocksdb/issues/7869

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7909

Reviewed By: akankshamahajan15

Differential Revision: D26181440

Pulled By: ajkr

fbshipit-source-id: f323aec9d91e177fa873599b99801b391cf094b1
2021-02-18 15:48:39 -08:00
Xiaopeng Zhang bf6795aea0 fix java sample typo and replace deprecated code with latest (#7906)
Summary:
1. Replace deprecated code in the Java samples with the latest API
2. Fix a typo in the OptimisticTransaction sample code

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7906

Reviewed By: ajkr

Differential Revision: D26127429

Pulled By: jay-zhuang

fbshipit-source-id: f015ad1435f565cffb8798a4fb5afc44c72d73d7
2021-02-01 14:45:34 -08:00
Xiaopeng Zhang 36963dc2ca fix write option typo in java samples (#7894)
Summary:
This is a trivial PR for the RocksDB Java samples; I think there is a typo in the write options. To do a sync write, the WAL should not be disabled.
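
A minimal sketch of the corrected pattern, a synchronous write with the WAL left enabled:

```
import org.rocksdb.*;

public class SyncWriteDemo {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (final Options options = new Options().setCreateIfMissing(true);
         final RocksDB db = RocksDB.open(options, "/tmp/sync-demo");
         final WriteOptions writeOpts = new WriteOptions()
             .setSync(true)            // fsync the WAL before returning
             .setDisableWAL(false)) {  // disabling the WAL would defeat sync
      db.put(writeOpts, "key".getBytes(), "value".getBytes());
    }
  }
}
```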

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7894

Reviewed By: jay-zhuang

Differential Revision: D26047128

Pulled By: mrambacher

fbshipit-source-id: a06ce54cb61af0d3f2578a709c34a0b1ccecb0b2
2021-01-26 19:13:08 -08:00