Commit Graph

106 Commits

Author SHA1 Message Date
Changyu Bi c2aad555c3 Add `CompressionOptions::checksum` for enabling ZSTD checksum (#11666)
Summary:
Optionally enable the zstd checksum flag (d857369028/lib/zstd.h (L428)) to detect corruption during decompression. Main changes are in compression.h:
* Users can set CompressionOptions::checksum to true to enable this feature.
* We enable this feature in ZSTD by setting the checksum flag in the ZSTD compression context: `ZSTD_CCtx`.
* Uses `ZSTD_compress2()` for compression since it supports frame parameters like the checksum flag. The compression level is also set in the compression context as a flag.
* Added error handling during decompression to propagate error messages from ZSTD.
* Updated microbench to test read performance impact.

Regarding compatibility, the current compression decoders should continue to work with data created by the new compression API `ZSTD_compress2()`: https://github.com/facebook/zstd/issues/3711.
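For illustration (not part of this PR's diff; path and error handling simplified), a minimal sketch of enabling the new option through the public C++ API:
```
#include <cassert>
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.compression = rocksdb::kZSTD;
  // New in this PR: embed a ZSTD checksum in each compressed block so
  // corruption can be detected during decompression.
  options.compression_opts.checksum = true;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/zstd_checksum_demo", &db);
  assert(s.ok());
  delete db;
  return 0;
}
```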

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11666

Test Plan:
* Existing unit tests for zstd compression
* Add unit test `DBTest2.ZSTDChecksum` to test the corruption case
* Manually tested that compression levels, parallel compression, dictionary compression, index compression all work with the new ZSTD_compress2() API.
* Manually tested with `sst_dump --command=recompress` that different compression levels and dictionary compression settings all work.
* Manually tested compiling with older versions of ZSTD: v1.3.8, v1.1.0, v0.6.2.
* Perf impact: from public benchmark data (http://fastcompression.blogspot.com/2019/03/presenting-xxh3.html for checksum and https://github.com/facebook/zstd#benchmarks), if decompression runs at 1700MB/s and checksum computation at 70000MB/s, checksum computation adds ~2.4% to decompression time. Compression is slower, so the checksum cost should be even less noticeable there.
* Microbench:
```
TEST_TMPDIR=/dev/shm ./branch_db_basic_bench --benchmark_filter=DBGet/comp_style:0/max_data:1048576/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:0/compression_type:7/compression_checksum:1/no_blockcache:1/iterations:10000/threads:1 --benchmark_repetitions=100

Min out of 100 runs:
Main:
10390 10436 10456 10484 10499 10535 10544 10545 10565 10568

After this PR, checksum=false
10285 10397 10503 10508 10515 10557 10562 10635 10640 10660

After this PR, checksum=true
10827 10876 10925 10949 10971 11052 11061 11063 11100 11109
```
* db_bench:
```
Write perf
TEST_TMPDIR=/dev/shm/ ./db_bench_ichecksum --benchmarks=fillseq[-X10] --compression_type=zstd --num=10000000 --compression_checksum=..

[FillSeq checksum=0]
fillseq [AVG    10 runs] : 281635 (± 31711) ops/sec;   31.2 (± 3.5) MB/sec
fillseq [MEDIAN 10 runs] : 294027 ops/sec;   32.5 MB/sec

[FillSeq checksum=1]
fillseq [AVG    10 runs] : 286961 (± 34700) ops/sec;   31.7 (± 3.8) MB/sec
fillseq [MEDIAN 10 runs] : 283278 ops/sec;   31.3 MB/sec

Read perf
TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=readrandom[-X20] --num=100000000 --reads=1000000 --use_existing_db=true --readonly=1

[Readrandom checksum=1]
readrandom [AVG    20 runs] : 360928 (± 3579) ops/sec;    4.0 (± 0.0) MB/sec
readrandom [MEDIAN 20 runs] : 362468 ops/sec;    4.0 MB/sec

[Readrandom checksum=0]
readrandom [AVG    20 runs] : 380365 (± 2384) ops/sec;    4.2 (± 0.0) MB/sec
readrandom [MEDIAN 20 runs] : 379800 ops/sec;    4.2 MB/sec

Compression
TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=compress[-X20] --compression_type=zstd --num=100000000 --compression_checksum=1

checksum=1
compress [AVG    20 runs] : 54074 (± 634) ops/sec;  211.2 (± 2.5) MB/sec
compress [MEDIAN 20 runs] : 54396 ops/sec;  212.5 MB/sec

checksum=0
compress [AVG    20 runs] : 54598 (± 393) ops/sec;  213.3 (± 1.5) MB/sec
compress [MEDIAN 20 runs] : 54592 ops/sec;  213.3 MB/sec

Decompression:
TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=uncompress[-X20] --compression_type=zstd --compression_checksum=1

checksum = 0
uncompress [AVG    20 runs] : 167499 (± 962) ops/sec;  654.3 (± 3.8) MB/sec
uncompress [MEDIAN 20 runs] : 167210 ops/sec;  653.2 MB/sec
checksum = 1
uncompress [AVG    20 runs] : 167980 (± 924) ops/sec;  656.2 (± 3.6) MB/sec
uncompress [MEDIAN 20 runs] : 168465 ops/sec;  658.1 MB/sec
```

Reviewed By: ajkr

Differential Revision: D48019378

Pulled By: cbi42

fbshipit-source-id: 674120c6e1853c2ced1436ac8138559d0204feba
2023-08-18 15:01:59 -07:00
Yu Zhang 1e77e35d26 Add a per column family default temperature option for accounting (#11708)
Summary:
Add a column family option `default_temperature` that will be used for file-reading accounting purposes, such as IO statistics, for files that don't have an explicitly set temperature.

This option is not mutable; changing its value requires a DB restart. This is to avoid confusion: if the option were mutable, users might expect it to take effect on all files immediately, while in reality it would only become effective for SST files opened in the future.

This `default_temperature` also only affects accounting during one DB session. It is not recorded in the manifest as the file's temperature and can differ across DB sessions.
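A minimal sketch of setting this option through the C++ API (illustrative only; assumes the public RocksDB headers):
```
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Files without an explicitly set temperature are accounted for
  // (e.g. in IO statistics) as kWarm during this DB session.
  options.default_temperature = rocksdb::Temperature::kWarm;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/default_temp_demo", &db);
  if (s.ok()) delete db;
  return 0;
}
```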

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11708

Test Plan:
```
make all check
```

Reviewed By: pdillinger

Differential Revision: D48375763

Pulled By: jowlyzhang

fbshipit-source-id: eb756696c14a694c6e2a93d2bb6f040563194981
2023-08-17 17:06:57 -07:00
Changyu Bi d1ff401472 Delay bottommost level single file compactions (#11701)
Summary:
For leveled compaction, RocksDB has a special kind of compaction with reason "kBottommostFiles" that compacts bottommost level files to clear data held by snapshots (more detail in https://github.com/facebook/rocksdb/issues/3009). Such compactions can happen soon after a relevant snapshot is released. For some use cases, a bottommost file may contain only a small number of keys that can be cleared, so compacting such a file has a high write amp. In addition, these bottommost files may be compacted in compactions with reasons other than "kBottommostFiles" if we wait for some time (so that enough data is ingested to trigger such a compaction). This PR introduces an option `bottommost_file_compaction_delay` to specify the delay of these bottommost level single file compactions.

* The main change is in `VersionStorageInfo::ComputeBottommostFilesMarkedForCompaction()` where we only add a file to `bottommost_files_marked_for_compaction_` if the oldest_snapshot is larger than its non-zero largest_seqno **and** the file is old enough. Note that if a file is not old enough but its largest_seqno is less than oldest_snapshot, we exclude it from the calculation of `bottommost_files_mark_threshold_`. This makes the change simpler, but such a file's eligibility for compaction will only be checked the next time `ComputeBottommostFilesMarkedForCompaction()` is called. This happens when a new Version is created (compaction, flush, SetOptions()...), a new enough snapshot is released (`VersionStorageInfo::UpdateOldestSnapshot()`), or when a compaction is picked and the compaction score has to be re-calculated.
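For illustration, a hedged sketch of configuring the new delay (value in seconds; assumes the public RocksDB headers):
```
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Only schedule a "kBottommostFiles" single-file compaction once the
  // file has been eligible for at least one hour.
  options.bottommost_file_compaction_delay = 60 * 60;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/bottommost_delay_demo", &db);
  if (s.ok()) delete db;
  return 0;
}
```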

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11701

Test Plan:
* Add two unit tests to test when bottommost_file_compaction_delay > 0.
* Ran crash test with the new option.

Reviewed By: jaykorean, ajkr

Differential Revision: D48331564

Pulled By: cbi42

fbshipit-source-id: c584f3dc5f6354fce3ed65f4c6366dc450b15ba8
2023-08-16 17:45:44 -07:00
Vardhan 87a21d08fe Add an option to trigger flush when the number of range deletions reach a threshold (#11358)
Summary:
Add a mutable column family option `memtable_max_range_deletions`. When non-zero, RocksDB will try to flush the current memtable after it has at least `memtable_max_range_deletions` range deletions. A Java API is added, and the crash test is updated accordingly to randomly enable this option.
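A minimal illustrative sketch (not from the PR) of setting the option at open time and changing it at runtime:
```
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Try to flush the memtable once it holds at least 100 range deletions.
  options.memtable_max_range_deletions = 100;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/range_del_demo", &db);
  if (s.ok()) {
    // The option is mutable, so it can also be changed at runtime.
    db->SetOptions({{"memtable_max_range_deletions", "20"}});
    delete db;
  }
  return 0;
}
```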

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11358

Test Plan:
* New unit test: `DBRangeDelTest.MemtableMaxRangeDeletions`
* Ran crash test `python3 ./tools/db_crashtest.py whitebox --simple --memtable_max_range_deletions=20` and saw logs showing flushed memtables usually with 20 range deletions.

Reviewed By: ajkr

Differential Revision: D46582680

Pulled By: cbi42

fbshipit-source-id: f23d6fa8d8264ecf0a18d55c113ba03f5e2504da
2023-08-02 19:58:56 -07:00
Changyu Bi 8827cd0618 Support compacting files to different temperatures in FIFO compaction (#11428)
Summary:
- Add a new option `CompactionOptionsFIFO::file_temperature_age_thresholds` that allows users to specify age thresholds for compacting files to different temperatures. File temperature can be used to store files in different storage media. The new option allows specifying multiple temperature-age pairs. The option uses a struct for each temperature-age pair to reuse the existing parsing functionality and make the option dynamically settable.
- Deprecate the old option `age_for_warm` that was added for a similar purpose.
- Compaction score calculation logic is updated to check if a file needs to be compacted to change its temperature.
- Some refactoring is done in `FIFOCompactionPicker::PickTemperatureChangeCompaction`.
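A sketch of configuring the new option (struct field order per the RocksDB headers; treat as an illustration, not the PR's test code):
```
#include "rocksdb/advanced_options.h"
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.compaction_style = rocksdb::kCompactionStyleFIFO;
  // Compact files older than 1 hour to kWarm and older than 1 day to kCold
  // (ages are in seconds; field order is {temperature, age}).
  options.compaction_options_fifo.file_temperature_age_thresholds = {
      {rocksdb::Temperature::kWarm, 60 * 60},
      {rocksdb::Temperature::kCold, 24 * 60 * 60}};

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/fifo_temp_demo", &db);
  if (s.ok()) delete db;
  return 0;
}
```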

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11428

Test Plan: adapted unit tests that were for `age_for_warm` to this new option.

Reviewed By: ajkr

Differential Revision: D45611412

Pulled By: cbi42

fbshipit-source-id: 2dc384841f61cc04abb9681e31aa2de0f0b06106
2023-05-11 16:40:59 -07:00
Changyu Bi 62fc15f009 Block per key-value checksum (#11287)
Summary:
Add option `block_protection_bytes_per_key` and an implementation of block per key-value checksums. The main changes are
1. checksum construction and verification in block.cc/h
2. pass the option `block_protection_bytes_per_key` around (mainly for methods defined in table_cache.h)
3. unit tests/crash test updates
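For illustration, a minimal sketch of enabling the feature (assumes the public RocksDB headers):
```
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Protect each block-based table key-value with a 1-byte checksum
  // (supported values are 0, 1, 2, 4 and 8; 0 disables the feature).
  options.block_protection_bytes_per_key = 1;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/block_prot_demo", &db);
  if (s.ok()) delete db;
  return 0;
}
```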

Tests:
* Added unit tests
* Crash test: `python3 tools/db_crashtest.py blackbox --simple --block_protection_bytes_per_key=1 --write_buffer_size=1048576`

Follow up (maybe as a separate PR): make sure corruption statuses returned from BlockIters are correctly handled.

Performance:
Turning on block per KV protection has a non-trivial negative impact on read performance and costs additional memory.
For memory, each block includes an additional 24 bytes for checksum-related state besides the checksum itself. For CPU, I set up a DB of size ~1.2GB with 5M keys (32 byte keys and 200 byte values) which compacts to ~5 SST files (target file size 256 MB) in L6 without compression. I tested readrandom performance with various block cache sizes (to mimic various cache hit rates):

```
SETUP
make OPTIMIZE_LEVEL="-O3" USE_LTO=1 DEBUG_LEVEL=0 -j32 db_bench
./db_bench -benchmarks=fillseq,compact0,waitforcompaction,compact,waitforcompaction -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -target_file_size_base=268435456 --num=5000000 --key_size=32 --value_size=200 --compression_type=none

BENCHMARK
./db_bench --use_existing_db -benchmarks=readtocache,readrandom[-X10] --num=5000000 --key_size=32 --disable_auto_compactions --reads=1000000 --block_protection_bytes_per_key=[0|1] --cache_size=$CACHESIZE

The readrandom ops/sec looks like the following:
Block cache size:  2GB        1.2GB * 0.9    1.2GB * 0.8     1.2GB * 0.5   8MB
Main              240805     223604         198176           161653       139040
PR prot_bytes=0   238691     226693         200127           161082       141153
PR prot_bytes=1   214983     193199         178532           137013       108211
prot_bytes=1 vs    -10%        -15%          -10.8%          -15%        -23%
prot_bytes=0
```

The benchmark has a lot of variance, but there was a 5% to 25% regression in this benchmark with different cache hit rates.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11287

Reviewed By: ajkr

Differential Revision: D43970708

Pulled By: cbi42

fbshipit-source-id: ef98d898b71779846fa74212b9ec9e08b7183940
2023-04-25 12:08:23 -07:00
Peter Dillinger d79be3dca2 Changes and enhancements to compression stats, thresholds (#11388)
Summary:
## Option API updates
* Add new CompressionOptions::max_compressed_bytes_per_kb, which corresponds to 1024.0 / min allowable compression ratio. This avoids the hard-coded minimum ratio of 8/7.
* Remove unnecessary constructor for CompressionOptions.
* Document undocumented CompressionOptions. Use idiom for default values shown clearly in one place (not precariously repeated).

 ## Stat API updates
* Deprecate the BYTES_COMPRESSED, BYTES_DECOMPRESSED histograms. Histograms incur substantial extra space & time costs compared to tickers, and the distribution of uncompressed data block sizes tends to be uninteresting. If we're interested in that distribution, I don't see why it should be limited to blocks stored as compressed.
* Deprecate the NUMBER_BLOCK_NOT_COMPRESSED ticker, because the name is very confusing.
* New or existing tickers relevant to compression:
  * BYTES_COMPRESSED_FROM
  * BYTES_COMPRESSED_TO
  * BYTES_COMPRESSION_BYPASSED
  * BYTES_COMPRESSION_REJECTED
  * COMPACT_WRITE_BYTES + FLUSH_WRITE_BYTES (both existing)
  * NUMBER_BLOCK_COMPRESSED (existing)
  * NUMBER_BLOCK_COMPRESSION_BYPASSED
  * NUMBER_BLOCK_COMPRESSION_REJECTED
  * BYTES_DECOMPRESSED_FROM
  * BYTES_DECOMPRESSED_TO

We can compute a number of things with these stats:
* "Successful" compression ratio: BYTES_COMPRESSED_FROM / BYTES_COMPRESSED_TO
* Compression ratio of data on which compression was attempted: (BYTES_COMPRESSED_FROM + BYTES_COMPRESSION_REJECTED) / (BYTES_COMPRESSED_TO + BYTES_COMPRESSION_REJECTED)
* Compression ratio of data that could be eligible for compression: (BYTES_COMPRESSED_FROM + X) / (BYTES_COMPRESSED_TO + X) where X = BYTES_COMPRESSION_REJECTED + NUMBER_BLOCK_COMPRESSION_REJECTED
* Overall SST compression ratio (compression disabled vs. actual): (Y - BYTES_COMPRESSED_TO + BYTES_COMPRESSED_FROM) / Y where Y = COMPACT_WRITE_BYTES + FLUSH_WRITE_BYTES

Keeping _REJECTED separate from _BYPASSED helps us to understand "wasted" CPU time in compression.
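For illustration, a sketch of computing the "successful" compression ratio from the new tickers (assumes the standard Statistics API; not from the PR):
```
#include <cstdint>
#include <iostream>
#include "rocksdb/options.h"
#include "rocksdb/statistics.h"

double SuccessfulCompressionRatio(const rocksdb::Statistics& stats) {
  uint64_t from = stats.getTickerCount(rocksdb::BYTES_COMPRESSED_FROM);
  uint64_t to = stats.getTickerCount(rocksdb::BYTES_COMPRESSED_TO);
  return to == 0 ? 0.0 : static_cast<double>(from) / static_cast<double>(to);
}

int main() {
  rocksdb::Options options;
  options.statistics = rocksdb::CreateDBStatistics();
  // ... run a workload against a DB opened with these options ...
  std::cout << SuccessfulCompressionRatio(*options.statistics) << "\n";
  return 0;
}
```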

 ## BlockBasedTableBuilder
Various small refactorings, optimizations, and name clean-ups.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11388

Test Plan:
unit tests added

* `options_settable_test.cc`: use non-deprecated idiom for configuring CompressionOptions from string. The old idiom is tested elsewhere and does not need to be updated to support the new field.

Reviewed By: ajkr

Differential Revision: D45128202

Pulled By: pdillinger

fbshipit-source-id: 5a652bf5c022b7ec340cf79018cccf0686962803
2023-04-21 21:57:40 -07:00
Yu Zhang 647cd73674 Initial add UDT in memtable only option (#11362)
Summary:
This option is immutable throughout the lifetime of a DB open. For now, updating its value between different DB open sessions is also an incompatible change. When support for updating the comparator is added, the corresponding types of updates to this option will be supported as well.
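The message does not spell out the option name; per the RocksDB headers it is `persist_user_defined_timestamps` (an assumption here), used together with a timestamp-aware comparator. A minimal sketch:
```
#include "rocksdb/comparator.h"
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // A comparator with a 64-bit user-defined timestamp suffix.
  options.comparator = rocksdb::BytewiseComparatorWithU64Ts();
  // Assumed option name: keep UDTs in the memtable only, i.e. do not
  // persist them to SST files. Immutable for the lifetime of the DB open.
  options.persist_user_defined_timestamps = false;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/udt_demo", &db);
  if (s.ok()) delete db;
  return 0;
}
```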

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11362

Test Plan: `make check`

Reviewed By: ltamasi

Differential Revision: D44873870

Pulled By: jowlyzhang

fbshipit-source-id: aa02094754b58d99abf9af4c9a8108c1350254cb
2023-04-11 17:50:34 -07:00
Peter Dillinger 3cacd4b4ec Put Cache and CacheWrapper in new public header (#11192)
Summary:
The definition of the Cache class should not be needed by the vast majority of RocksDB users, so I think it is just distracting to include it in cache.h, which is primarily needed for configuring and creating caches. This change moves the class to a new header advanced_cache.h. It is just cut-and-paste except for modifying the class API comment.

In general, operations on shared_ptr<Cache> should continue to work when only a forward declaration of Cache is available, as long as all the Cache instances provided are already shared_ptr. See https://stackoverflow.com/a/17650101/454544

Also, the most common way to customize a Cache is by wrapping an existing implementation, so it makes sense to provide CacheWrapper in the public API. This was a cut-and-paste job except for removing the implementation of Name() so that derived classes must provide it.
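For illustration, a minimal sketch of wrapping an existing cache with the now-public CacheWrapper (not from the PR):
```
#include <memory>
#include "rocksdb/advanced_cache.h"
#include "rocksdb/cache.h"

// A trivial custom cache that wraps an existing implementation. Because
// CacheWrapper no longer implements Name(), derived classes must provide it.
class LoggingCache : public rocksdb::CacheWrapper {
 public:
  explicit LoggingCache(std::shared_ptr<rocksdb::Cache> target)
      : CacheWrapper(std::move(target)) {}
  const char* Name() const override { return "LoggingCache"; }
};

int main() {
  auto cache = std::make_shared<LoggingCache>(rocksdb::NewLRUCache(64 << 20));
  return 0;
}
```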

Intended follow-up: consolidate Release() into one function to reduce customization bugs / confusion

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11192

Test Plan: `make check`

Reviewed By: anand1976

Differential Revision: D43055487

Pulled By: pdillinger

fbshipit-source-id: 7b05492df35e0f30b581b4c24c579bc275b6d110
2023-02-09 12:12:02 -08:00
sdong 4720ba4391 Remove RocksDB LITE (#11147)
Summary:
We haven't been actively maintaining RocksDB LITE recently, and its size must have gone up significantly. We are removing the support.

Most of the changes were done with the following command:

unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`

by Peter Dillinger. Other changes were manually applied to build scripts, CircleCI manifests, places where ROCKSDB_LITE is used in an expression, and the file db_stress_test_base.cc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147

Test Plan: See CI

Reviewed By: pdillinger

Differential Revision: D42796341

fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
2023-01-27 13:14:19 -08:00
Changyu Bi 333abe9c55 Ignore max_compaction_bytes for compaction input that are within output key-range (#10835)
Summary:
When picking compaction input files, we sometimes stop picking a file that is fully included in the output key-range due to hitting max_compaction_bytes. Including these input files can potentially reduce WA at the expense of larger compactions. Larger compactions should be fine, as files from the input level are usually 10X smaller than files from the output level. This PR adds a mutable CF option `ignore_max_compaction_bytes_for_input` that is enabled by default. We can remove this option once we are sure it is safe.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10835

Test Plan:
- CI; a unit test on max_compaction_bytes fails until this flag is turned off.
- Benchmark does not show much difference in WA: `./db_bench --benchmarks=fillrandom,waitforcompaction,stats,levelstats -max_background_jobs=12 -num=2000000000 -target_file_size_base=33554432 --write_buffer_size=33554432`
```
main:
** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      3/0   91.59 MB   0.8     70.9     0.0     70.9     200.8    129.9       0.0   1.5     25.2     71.2   2886.55           2463.45      9725    0.297   1093M   254K       0.0       0.0
  L1      9/0   248.03 MB   1.0    392.0   129.8    262.2     391.7    129.5       0.0   3.0     69.0     68.9   5821.71           5536.90       804    7.241   6029M  5814K       0.0       0.0
  L2     87/0    2.50 GB   1.0    537.0   128.5    408.5     533.8    125.2       0.7   4.2     69.5     69.1   7912.24           7323.70      4417    1.791   8299M    36M       0.0       0.0
  L3    836/0   24.99 GB   1.0    616.9   118.3    498.7     594.5     95.8       5.2   5.0     66.9     64.5   9442.38           8490.28      4204    2.246   9749M   306M       0.0       0.0
  L4   2355/0   62.95 GB   0.3     67.3    37.1     30.2      54.2     24.0      38.9   1.5     72.2     58.2    954.37            821.18       917    1.041   1076M   173M       0.0       0.0
 Sum   3290/0   90.77 GB   0.0   1684.2   413.7   1270.5    1775.0    504.5      44.9  13.7     63.8     67.3  27017.25          24635.52     20067    1.346     26G   522M       0.0       0.0

Cumulative compaction: 1774.96 GB write, 154.29 MB/s write, 1684.19 GB read, 146.40 MB/s read, 27017.3 seconds

This PR:
** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      3/0   45.71 MB   0.8     72.9     0.0     72.9     202.8    129.9       0.0   1.6     25.4     70.7   2938.16           2510.36      9741    0.302   1124M   265K       0.0       0.0
  L1      8/0   234.54 MB   0.9    384.5   129.8    254.7     384.2    129.6       0.0   3.0     69.0     68.9   5708.08           5424.43       791    7.216   5913M  5753K       0.0       0.0
  L2     84/0    2.47 GB   1.0    543.1   128.6    414.5     539.9    125.4       0.7   4.2     69.6     69.2   7989.31           7403.13      4418    1.808   8393M    36M       0.0       0.0
  L3    839/0   24.96 GB   1.0    615.6   118.4    497.2     593.2     96.0       5.1   5.0     66.6     64.1   9471.23           8489.31      4193    2.259   9726M   306M       0.0       0.0
  L4   2360/0   63.04 GB   0.3     67.6    37.3     30.3      54.4     24.1      38.9   1.5     71.5     57.6    967.30            827.99       907    1.066   1080M   173M       0.0       0.0
 Sum   3294/0   90.75 GB   0.0   1683.8   414.2   1269.6    1774.5    504.9      44.8  13.7     63.7     67.1  27074.08          24655.22     20050    1.350     26G   522M       0.0       0.0

Cumulative compaction: 1774.52 GB write, 157.09 MB/s write, 1683.77 GB read, 149.06 MB/s read, 27074.1 seconds
```

Reviewed By: ajkr

Differential Revision: D40518319

Pulled By: cbi42

fbshipit-source-id: f4ea614bc0ebefe007ffaf05bb9aec9a8ca25b60
2022-10-21 10:22:41 -07:00
Jay Zhuang c401f285c3 Add option `preserve_internal_time_seconds` to preserve the time info (#10747)
Summary:
Add option `preserve_internal_time_seconds` to preserve the internal
time information.
It's mostly for the migration of existing data to tiered storage (
`preclude_last_level_data_seconds`). When the tiering feature is just
enabled, the existing data won't have the time information to decide if
it's hot or cold. Enabling this feature will start collecting and preserving
the time information for the new data.
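A minimal illustrative sketch (not from the PR) of enabling the option:
```
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Record seqno-to-time information covering roughly the last 24 hours
  // of writes (in seconds), so tiering can later tell hot data from cold.
  options.preserve_internal_time_seconds = 24 * 60 * 60;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/preserve_time_demo", &db);
  if (s.ok()) delete db;
  return 0;
}
```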

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10747

Reviewed By: siying

Differential Revision: D39910141

Pulled By: siying

fbshipit-source-id: 25c21638e37b1a7c44006f636b7d714fe7242138
2022-10-07 18:49:40 -07:00
Jay Zhuang f3cc66632b Align compaction output file boundaries to the next level ones (#10655)
Summary:
Try to align the compaction output file boundaries to the next level ones
(grandparent level), to reduce the level compaction write-amplification.

In level compaction, there is "wasted" data at the beginning and end of the
output level files. Aligning the file boundaries can avoid such "wasted" compaction.
With this PR, it tries to align the non-bottommost level file boundaries to their
next level ones. It may cut a file when the file size is large enough (at least
50% of target_file_size) and not too large (2x target_file_size).

db_bench shows about 12.56% compaction reduction:
```
TEST_TMPDIR=/data/dbbench2 ./db_bench --benchmarks=fillrandom,readrandom -max_background_jobs=12 -num=400000000 -target_file_size_base=33554432

# baseline:
Flush(GB): cumulative 25.882, interval 7.216
Cumulative compaction: 285.90 GB write, 162.36 MB/s write, 269.68 GB read, 153.15 MB/s read, 2926.7 seconds

# with this change:
Flush(GB): cumulative 25.882, interval 7.753
Cumulative compaction: 249.97 GB write, 141.96 MB/s write, 233.74 GB read, 132.74 MB/s read, 2534.9 seconds
```

The compaction simulator shows a similar result (14% with 100G random data).
As a side effect, with this PR, the SST file size can exceed the
target_file_size, but is capped at 2x target_file_size. And there will be
smaller files. Here are file size statistics when loading 100GB with the target
file size 32MB:
```
          baseline      this_PR
count  1.656000e+03  1.705000e+03
mean   3.116062e+07  3.028076e+07
std    7.145242e+06  8.046139e+06
```

The feature is enabled by default; to revert to the old behavior, disable it
with `AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size = false`.

Also includes https://github.com/facebook/rocksdb/issues/1963 to cut files before a skippable grandparent file, which helps
use cases like a user adding 2 or more non-overlapping data ranges at the same
time; it can reduce the overlap of the 2 datasets in the lower levels.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10655

Reviewed By: cbi42

Differential Revision: D39552321

Pulled By: jay-zhuang

fbshipit-source-id: 640d15f159ab0cd973f2426cfc3af266fc8bdde2
2022-09-29 19:43:55 -07:00
Changyu Bi fd165c869d Add memtable per key-value checksum (#10281)
Summary:
Append a per key-value checksum to the internal key. These checksums are verified on read paths including Get, Iterator and during Flush. Get and Iterator will return a `Corruption` status if there is a checksum verification failure. Flush will make the DB read-only upon a memtable entry checksum verification failure.
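For illustration, a minimal sketch of enabling the feature via the public option (assumes the RocksDB headers):
```
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Append a 2-byte checksum to each memtable entry, verified on Get,
  // iteration and flush (supported values include 0, 1, 2, 4 and 8).
  options.memtable_protection_bytes_per_key = 2;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/memtable_prot_demo", &db);
  if (s.ok()) delete db;
  return 0;
}
```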

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10281

Test Plan:
- Added new unit test cases: `make check`
- Benchmark on memtable insert
```
TEST_TMPDIR=/dev/shm/memtable_write ./db_bench -benchmarks=fillseq -disable_wal=true -max_write_buffer_number=100 -num=10000000 -min_write_buffer_number_to_merge=100

# avg over 10 runs
Baseline: 1166936 ops/sec
memtable 2 bytes kv checksum : 1.11674e+06 ops/sec (-4%)
memtable 2 bytes kv checksum + write batch 8 bytes kv checksum: 1.08579e+06 ops/sec (-6.95%)
write batch 8 bytes kv checksum: 1.17979e+06 ops/sec (+1.1%)
```
-  Benchmark on memtable-only reads: ops/sec dropped 31% for `readseq` due to time spent verifying checksums.
ops/sec for `readrandom` dropped ~6.8%.
```
# Readseq
sudo TEST_TMPDIR=/dev/shm/memtable_read ./db_bench -benchmarks=fillseq,readseq"[-X20]" -disable_wal=true -max_write_buffer_number=100 -num=10000000 -min_write_buffer_number_to_merge=100

readseq [AVG    20 runs] : 7432840 (± 212005) ops/sec;  822.3 (± 23.5) MB/sec
readseq [MEDIAN 20 runs] : 7573878 ops/sec;  837.9 MB/sec

With -memtable_protection_bytes_per_key=2:

readseq [AVG    20 runs] : 5134607 (± 119596) ops/sec;  568.0 (± 13.2) MB/sec
readseq [MEDIAN 20 runs] : 5232946 ops/sec;  578.9 MB/sec

# Readrandom
sudo TEST_TMPDIR=/dev/shm/memtable_read ./db_bench -benchmarks=fillrandom,readrandom"[-X10]" -disable_wal=true -max_write_buffer_number=100 -num=1000000 -min_write_buffer_number_to_merge=100
readrandom [AVG    10 runs] : 140236 (± 3938) ops/sec;    9.8 (± 0.3) MB/sec
readrandom [MEDIAN 10 runs] : 140545 ops/sec;    9.8 MB/sec

With -memtable_protection_bytes_per_key=2:
readrandom [AVG    10 runs] : 130632 (± 2738) ops/sec;    9.1 (± 0.2) MB/sec
readrandom [MEDIAN 10 runs] : 130341 ops/sec;    9.1 MB/sec
```

- Stress test: `python3 -u tools/db_crashtest.py whitebox --duration=1800`

Reviewed By: ajkr

Differential Revision: D37607896

Pulled By: cbi42

fbshipit-source-id: fdaefb475629d2471780d4a5f5bf81b44ee56113
2022-08-12 13:51:32 -07:00
Jay Zhuang 3f763763aa Change `bottommost_temperature` to `last_level_temperature` (#10471)
Summary:
Change the tiered compaction option from `bottommost_temperature` to
`last_level_temperature`. The old option is kept for migration purposes only;
it behaves the same as `last_level_temperature` and will be removed in
the next release.
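A minimal sketch of migrating to the new option name (illustrative only):
```
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // New name; replaces the deprecated bottommost_temperature.
  options.last_level_temperature = rocksdb::Temperature::kCold;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/last_level_temp_demo", &db);
  if (s.ok()) delete db;
  return 0;
}
```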

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10471

Test Plan: CI

Reviewed By: siying

Differential Revision: D38450621

Pulled By: jay-zhuang

fbshipit-source-id: cc1cdf8bad409376fec0152abc0a64fb72a91527
2022-08-08 14:36:34 -07:00
Gang Liao ec4ebeff30 Support prepopulating/warming the blob cache (#10298)
Summary:
Many workloads have temporal locality, where recently written items are read back in a short period of time. When using remote file systems, this is inefficient since it involves network traffic and higher latencies. Because of this, we would like to support prepopulating the blob cache during flush.
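For illustration, a hedged sketch of enabling cache warming (enum name per the RocksDB headers; treat as an assumption):
```
#include "rocksdb/cache.h"
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.enable_blob_files = true;
  options.blob_cache = rocksdb::NewLRUCache(256 << 20);
  // Insert freshly written blobs into the blob cache during flush.
  options.prepopulate_blob_cache = rocksdb::PrepopulateBlobCache::kFlushOnly;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/blob_warm_demo", &db);
  if (s.ok()) delete db;
  return 0;
}
```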

This task is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10298

Reviewed By: ltamasi

Differential Revision: D37908743

Pulled By: gangliao

fbshipit-source-id: 9feaed234bc719d38f0c02975c1ad19fa4bb37d1
2022-07-17 07:13:59 -07:00
Jay Zhuang a3acf2ef87 Add seqno to time mapping (#10338)
Summary:
This will be used by tiered storage to preclude hot data from
compacting to the cold tier (the last level).
Internally, this adds a seqno-to-time mapping. A periodic_task is scheduled
to record the current_seqno -> current_time mapping at a certain cadence. When a
memtable is flushed, the mapping information is stored in an sstable property.
During compaction, the mapping information is merged to get the
approximate time of a sequence number, which is used to determine whether a key
was recently inserted and to preclude it from the last level if it was
(within `preclude_last_level_data_seconds`).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10338

Test Plan: CI

Reviewed By: siying

Differential Revision: D37810187

Pulled By: jay-zhuang

fbshipit-source-id: 6953be7a18a99de8b1cb3b162d712f79c2b4899f
2022-07-14 21:49:34 -07:00
Baptiste Lemaire 5879053fd0 Dynamically changeable `MemPurge` option (#10011)
Summary:
**Summary**
Make the mempurge option flag a Mutable Column Family option flag. Therefore, the mempurge feature can be dynamically toggled.

**Motivation**
RocksDB users prefer having the ability to switch features on and off without having to close and reopen the DB. This is particularly important if the feature causes issues and needs to be turned off. Dynamically changing a DB option flag was not possible before this change.
Moreover, with this new change, the MemPurge feature can be toggled on or off independently between column families, which we see as a major improvement.

**Content of this PR**
This PR includes removal of the `experimental_mempurge_threshold` flag as a DB option flag, and its re-introduction as a `MutableCFOption` flag. I updated the code to handle dynamic changes of the flag (in particular inside the `FlushJob` file). Additionally, this PR includes a new test to demonstrate the capacity of the code to toggle the MemPurge feature on and off, as well as the addition in the `db_stress` module of 2 different mempurge threshold values (0.0 and 1.0) that can be randomly changed with the `set_option_one_in` flag. This is useful to stress test the dynamic changes.
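For illustration, a minimal sketch of toggling the flag at runtime via SetOptions (not from the PR; the string values are assumed to round-trip through the options parser):
```
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/mempurge_demo", &db);
  if (s.ok()) {
    // Toggle MemPurge on (threshold 1.0) and later off (0.0) without
    // closing the DB, now that the flag is a mutable CF option.
    db->SetOptions({{"experimental_mempurge_threshold", "1.0"}});
    db->SetOptions({{"experimental_mempurge_threshold", "0.0"}});
    delete db;
  }
  return 0;
}
```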

**Benchmarking**
Within the next 12 hours, I will add numbers to show that there is no performance impact.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10011

Reviewed By: pdillinger

Differential Revision: D36462357

Pulled By: bjlemaire

fbshipit-source-id: 5e3d63bdadf085c0572ecc2349e7dd9729ce1802
2022-06-23 09:42:18 -07:00
Gang Liao cba398df8a Add blob cache option in the column family options (#10155)
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
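A minimal illustrative sketch (not from the PR) of configuring a blob cache:
```
#include "rocksdb/cache.h"
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.enable_blob_files = true;
  // Cache blobs read from blob files (a 512 MB LRU cache here) instead of
  // relying on the OS page cache.
  options.blob_cache = rocksdb::NewLRUCache(512 << 20);

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/blob_cache_demo", &db);
  if (s.ok()) delete db;
  return 0;
}
```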

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10155

Reviewed By: ltamasi

Differential Revision: D37150819

Pulled By: gangliao

fbshipit-source-id: b807c7916ea5d411588128f8e22a49f171388fe2
2022-06-14 14:19:26 -07:00
Gang Liao e6432dfd4c Make it possible to enable blob files starting from a certain LSM tree level (#10077)
Summary:
Currently, if blob files are enabled (i.e. `enable_blob_files` is true), large values are extracted both during flush/recovery (when SST files are written into level 0 of the LSM tree) and during compaction into any LSM tree level. For certain use cases that have a mix of short-lived and long-lived values, it might make sense to support extracting large values only during compactions whose output level is greater than or equal to a specified LSM tree level (e.g. compactions into L1/L2/... or above). This could reduce the space amplification caused by large values that are turned into garbage shortly after being written at the price of some write amplification incurred by long-lived values whose extraction to blob files is delayed.
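For illustration, a hedged sketch of what configuring such an option could look like (assumes the public RocksDB headers):
```
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.enable_blob_files = true;
  options.min_blob_size = 1024;
  // Only extract large values to blob files for compactions whose output
  // level is L2 or deeper; flushes and L0/L1 compactions keep values inline.
  options.blob_file_starting_level = 2;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/blob_level_demo", &db);
  if (s.ok()) delete db;
  return 0;
}
```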

In order to achieve this, we would like to do the following:
- Add a new configuration option `blob_file_starting_level` (default: 0) to `AdvancedColumnFamilyOptions` (and `MutableCFOptions` and extend the related logic)
- Instantiate `BlobFileBuilder` in `BuildTable` (used during flush and recovery, where the LSM tree level is L0) and `CompactionJob` iff `enable_blob_files` is set and the LSM tree level is `>= blob_file_starting_level`
- Add unit tests for the new functionality, and add the new option to our stress tests (`db_stress` and `db_crashtest.py` )
- Add the new option to our benchmarking tool `db_bench` and the BlobDB benchmark script `run_blob_bench.sh`
- Add the new option to the `ldb` tool (see https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool)
- Ideally extend the C and Java bindings with the new option
- Update the BlobDB wiki to document the new option.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10077

Reviewed By: ltamasi

Differential Revision: D36884156

Pulled By: gangliao

fbshipit-source-id: 942bab025f04633edca8564ed64791cb5e31627d
2022-06-02 20:04:33 -07:00
Changyu Bi cc23b46da1 Support using ZDICT_finalizeDictionary to generate zstd dictionary (#9857)
Summary:
An untrained dictionary is currently simply the concatenation of several samples. The ZSTD API ZDICT_finalizeDictionary() can improve such a dictionary's effectiveness at low cost. This PR changes how the dictionary is created: it calls the ZSTD ZDICT_finalizeDictionary() API instead of creating a raw content dictionary (when max_dict_buffer_bytes > 0), passing in all buffered uncompressed data blocks as samples.
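For illustration, a minimal sketch of selecting ZDICT_finalizeDictionary() through the public options (not from the PR):
```
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.compression = rocksdb::kZSTD;
  options.compression_opts.max_dict_bytes = 16 * 1024;
  options.compression_opts.zstd_max_train_bytes = 1024 * 1024;
  // false selects ZDICT_finalizeDictionary() instead of the ZSTD trainer.
  options.compression_opts.use_zstd_dict_trainer = false;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/zstd_dict_demo", &db);
  if (s.ok()) delete db;
  return 0;
}
```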

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9857

Test Plan:
#### db_bench test for cpu/memory of compression+decompression and space saving on synthetic data:
Set up: change the parameter [here](fb9a167a55/tools/db_bench_tool.cc (L1766)) to 16384 to make synthetic data more compressible.
```
# linked local ZSTD with version 1.5.2
# DEBUG_LEVEL=0 ROCKSDB_NO_FBCODE=1 ROCKSDB_DISABLE_ZSTD=1  EXTRA_CXXFLAGS="-DZSTD_STATIC_LINKING_ONLY -DZSTD -I/data/users/changyubi/install/include/" EXTRA_LDFLAGS="-L/data/users/changyubi/install/lib/ -l:libzstd.a" make -j32 db_bench

dict_bytes=16384
train_bytes=1048576
echo "========== No Dictionary =========="
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=0 -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=0 -block_size=4096 2>&1 | grep elapsed
du -hc /dev/shm/dbbench/*sst | grep total

echo "========== Raw Content Dictionary =========="
TEST_TMPDIR=/dev/shm ./db_bench_main -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench_main -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -block_size=4096 2>&1 | grep elapsed
du -hc /dev/shm/dbbench/*sst | grep total

echo "========== FinalizeDictionary =========="
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false -block_size=4096 2>&1 | grep elapsed
du -hc /dev/shm/dbbench/*sst | grep total

echo "========== TrainDictionary =========="
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -block_size=4096 2>&1 | grep elapsed
du -hc /dev/shm/dbbench/*sst | grep total

# Result: TrainDictionary is much better on space saving, but FinalizeDictionary seems to use less memory.
# before compression data size: 1.2GB
dict_bytes=16384
max_dict_buffer_bytes =  1048576
                    space   cpu/memory
No Dictionary       468M    14.93user 1.00system 0:15.92elapsed 100%CPU (0avgtext+0avgdata 23904maxresident)k
Raw Dictionary      251M    15.81user 0.80system 0:16.56elapsed 100%CPU (0avgtext+0avgdata 156808maxresident)k
FinalizeDictionary  236M    11.93user 0.64system 0:12.56elapsed 100%CPU (0avgtext+0avgdata 89548maxresident)k
TrainDictionary     84M     7.29user 0.45system 0:07.75elapsed 100%CPU (0avgtext+0avgdata 97288maxresident)k
```

#### Benchmark on 10 sample SST files for spacing saving and CPU time on compression:
FinalizeDictionary is comparable to TrainDictionary in terms of space saving, and takes less time in compression.
```
dict_bytes=16384
train_bytes=1048576

for sst_file in `ls ../temp/myrock-sst/`
do
  echo "********** $sst_file **********"
  echo "========== No Dictionary =========="
  ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD

  echo "========== Raw Content Dictionary =========="
  ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes

  echo "========== FinalizeDictionary =========="
  ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes --compression_zstd_max_train_bytes=$train_bytes --compression_use_zstd_finalize_dict

  echo "========== TrainDictionary =========="
  ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes --compression_zstd_max_train_bytes=$train_bytes
done

                         010240.sst (Size/Time) 011029.sst              013184.sst              021552.sst              185054.sst              185137.sst              191666.sst              7560381.sst             7604174.sst             7635312.sst
No Dictionary           28165569 / 2614419      32899411 / 2976832      32977848 / 3055542      31966329 / 2004590      33614351 / 1755877      33429029 / 1717042      33611933 / 1776936      33634045 / 2771417      33789721 / 2205414      33592194 / 388254
Raw Content Dictionary  28019950 / 2697961      33748665 / 3572422      33896373 / 3534701      26418431 / 2259658      28560825 / 1839168      28455030 / 1846039      28494319 / 1861349      32391599 / 3095649      33772142 / 2407843      33592230 / 474523
FinalizeDictionary      27896012 / 2650029      33763886 / 3719427      33904283 / 3552793      26008225 / 2198033      28111872 / 1869530      28014374 / 1789771      28047706 / 1848300      32296254 / 3204027      33698698 / 2381468      33592344 / 517433
TrainDictionary         28046089 / 2740037      33706480 / 3679019      33885741 / 3629351      25087123 / 2204558      27194353 / 1970207      27234229 / 1896811      27166710 / 1903119      32011041 / 3322315      32730692 / 2406146      33608631 / 570593
```

#### Decompression/Read test:
With FinalizeDictionary/TrainDictionary, some data structures used for decompression are stored in the dictionary, so decompression/reads are expected to be faster.
```
dict_bytes=16384
train_bytes=1048576
echo "No Dictionary"
TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=0 > /dev/null 2>&1
TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=0 2>&1 | grep MB/s

echo "Raw Dictionary"
TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes > /dev/null 2>&1
TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd  -compression_max_dict_bytes=$dict_bytes 2>&1 | grep MB/s

echo "FinalizeDict"
TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false  > /dev/null 2>&1
TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false 2>&1 | grep MB/s

echo "Train Dictionary"
TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes > /dev/null 2>&1
TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes 2>&1 | grep MB/s

No Dictionary
readrandom   :      12.183 micros/op 82082 ops/sec 12.183 seconds 1000000 operations;    9.1 MB/s (1000000 of 1000000 found)
Raw Dictionary
readrandom   :      12.314 micros/op 81205 ops/sec 12.314 seconds 1000000 operations;    9.0 MB/s (1000000 of 1000000 found)
FinalizeDict
readrandom   :       9.787 micros/op 102180 ops/sec 9.787 seconds 1000000 operations;   11.3 MB/s (1000000 of 1000000 found)
Train Dictionary
readrandom   :       9.698 micros/op 103108 ops/sec 9.699 seconds 1000000 operations;   11.4 MB/s (1000000 of 1000000 found)
```

Reviewed By: ajkr

Differential Revision: D35720026

Pulled By: cbi42

fbshipit-source-id: 24d230fdff0fd28a1bb650658798f00dfcfb2a1f
2022-05-20 12:09:09 -07:00
mrambacher 204a42ca97 Added GetFactoryCount/Names/Types to ObjectRegistry (#9358)
Summary:
These methods allow for more thorough testing of the ObjectRegistry and Customizable infrastructure in a simpler manner.  With this change, the Customizable tests can now check what factories are registered and attempt to create each of them in a systematic fashion.

With this change, I think all of the factories registered with the ObjectRegistry/CreateFromString are now tested via the customizable_test classes.

Note that there were a few other minor changes.  There was a "posix://*" entry registered with the ObjectRegistry which was missed during the PatternEntry conversion -- these changes found that.  The nickname and default names for the FileSystem classes were also inverted.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9358

Reviewed By: pdillinger

Differential Revision: D33433542

Pulled By: mrambacher

fbshipit-source-id: 9a32da74e6620745b4eeffb2712be70eeeadfa7e
2022-05-16 09:44:43 -07:00
mrambacher bfc6a8ee4a Option type info functions (#9411)
Summary:
Add methods to set the various functions (Parse, Serialize, Equals) to the OptionTypeInfo.  These methods simplify the number of constructors required for OptionTypeInfo and make the code a little clearer.

Add functions to the OptionTypeInfo for Prepare and Validate.  These methods allow types other than Configurable and Customizable to have Prepare and Validate logic.  These methods could be used by an option to guarantee that its settings were in a range or that a value was initialized.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9411

Reviewed By: pdillinger

Differential Revision: D36174849

Pulled By: mrambacher

fbshipit-source-id: 72517d8c6bab4723788a4c1a9e16590bff870125
2022-05-13 04:57:08 -07:00
sdong 49628c9a83 Use std::numeric_limits<> (#9954)
Summary:
We still don't fully use std::numeric_limits; instead we use a macro, mainly for supporting VS 2013. We now only support VS 2017 and up, so this is no longer a problem. The code comment claims that MinGW still needs it. We don't have a CI running MinGW, so it's hard to validate. Since we now require C++17, it's hard to imagine MinGW would still build RocksDB but not support std::numeric_limits<>.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9954

Test Plan: See CI Runs.

Reviewed By: riversand963

Differential Revision: D36173954

fbshipit-source-id: a35a73af17cdcae20e258cdef57fcf29a50b49e0
2022-05-05 13:08:21 -07:00
Jay Zhuang 36aec94d85 `compression_per_level` should be used for flush and changeable (#9658)
Summary:
- Make `compression_per_level` dynamically changeable with `SetOptions`;
- Fix a bug where `compression_per_level` was not used for flush;
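For illustration, a hedged sketch of changing the option at runtime (the colon-separated value string is an assumption based on the options parser's list format):
```
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/comp_per_level_demo", &db);
  if (s.ok()) {
    // Change the per-level compression at runtime; the value string is a
    // colon-separated list of compression types, one per level.
    db->SetOptions(
        {{"compression_per_level", "kNoCompression:kNoCompression:kZSTD"}});
    delete db;
  }
  return 0;
}
```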

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9658

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D34700749

Pulled By: jay-zhuang

fbshipit-source-id: a23b9dfa7ad03d393c1d71781d19e91de796f49c
2022-03-07 18:06:19 -08:00
Peter Dillinger e24734f843 Use -Wno-invalid-offsetof instead of dangerous offset_of hack (#9563)
Summary:
After https://github.com/facebook/rocksdb/issues/9515 added a unique_ptr to Status, we see some
warnings-as-error in some internal builds like this:

```
stderr: rocksdb/src/db/compaction/compaction_job.cc:2839:7: error:
offset of on non-standard-layout type 'struct CompactionServiceResult'
[-Werror,-Winvalid-offsetof]
     {offsetof(struct CompactionServiceResult, status),
      ^                                        ~~~~~~
```

I see three potential solutions to resolving this:

* Expand our use of an idiom that works around the warning (see offset_of
functions removed in this change, inspired by
https://gist.github.com/graphitemaster/494f21190bb2c63c5516). However,
this construction invokes undefined behavior by assuming a consistent
layout with no compiler-introduced indirection. A compiler incompatible
with our assumptions will likely compile the code and exhibit undefined
behavior.
* Migrate to something in place of offset, like a function mapping
CompactionServiceResult* to Status* (for the `status` field). This might
be required in the long term.
* **Selected:** Use our new C++17 dependency to use offsetof in a well-defined way
when the compiler allows it. From a comment on
https://gist.github.com/graphitemaster/494f21190bb2c63c5516:

> A final note: in C++17, offsetof is conditionally supported, which
> means that you can use it on any type (not just standard layout
> types) and the compiler will error if it can't compile it correctly.
> That appears to be the best option if you can live with C++17 and
> don't need constexpr support.

The C++17 semantics are confirmed on
https://en.cppreference.com/w/cpp/types/offsetof, so we can suppress the
warning as long as we accept that we might run into a compiler that
rejects the code, and at that point we will find a solution, such as
the more intrusive "migrate" solution above.

Although this is currently only showing in our buck build, it will
surely show up also with make and cmake, so I have updated those
configurations as well.

Also in the buck build, -Wno-expansion-to-defined does not appear to be
needed anymore (both current compiler configurations) so I
removed it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9563

Test Plan: Tried out buck builds with both current compiler configurations

Reviewed By: riversand963

Differential Revision: D34220931

Pulled By: pdillinger

fbshipit-source-id: d39436008259bd1eaaa87c77be69fb2a5b559e1f
2022-02-15 09:19:19 -08:00
Baptiste Lemaire bec9ab4316 Remove deprecated option DBOptions::max_mem_compaction_level (#9446)
Summary:
In RocksDB, this option had already been marked as "NOT SUPPORTED" for a long time, and setting it does not have any effect on the behavior of the RocksDB library. Therefore, we are removing it in preparation for the upcoming 7.0 release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9446

Reviewed By: ajkr

Differential Revision: D33793048

Pulled By: bjlemaire

fbshipit-source-id: 73316efdb194e90225005246673dae99e65577ae
2022-02-04 05:32:28 -08:00
Jay Zhuang 3e27add385 Fix a backward compatibility issue (#9456)
Summary:
Fix a backward compatibility issue caused by removing
`purge_redundant_kvs_while_flush`. Reserve the option internally.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9456

Test Plan: CI: https://app.circleci.com/pipelines/github/facebook/rocksdb/11122/workflows/b7bc0f35-1be8-432c-9292-79125e22ecc7/jobs/280595

Reviewed By: ajkr, ltamasi

Differential Revision: D33808474

Pulled By: jay-zhuang

fbshipit-source-id: 7c3b553bc8e85c8a560514e8e460a2dbaf25718d
2022-01-26 22:23:21 -08:00
Jay Zhuang 961d8dacf2 Remove unused option purge_redundant_kvs_while_flush (#9429)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9429

Test Plan: fake release for test: D33754513

Reviewed By: riversand963

Differential Revision: D33753637

Pulled By: jay-zhuang

fbshipit-source-id: 18db4701e8f28dda8f1ab660c2be9890a8312c12
2022-01-26 10:24:16 -08:00
Jay Zhuang 022b400cba Make `bottommost_temperature` dynamically changeable (#9402)
Summary:
Make `AdvancedColumnFamilyOptions.bottommost_temperature`
dynamically changeable with the `SetOptions` API.
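A minimal illustrative sketch (the "kWarm" string serialization is an assumption):
```
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/bottom_temp_demo", &db);
  if (s.ok()) {
    // Change the bottommost level temperature at runtime.
    db->SetOptions({{"bottommost_temperature", "kWarm"}});
    delete db;
  }
  return 0;
}
```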

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9402

Test Plan: added unittest

Reviewed By: siying

Differential Revision: D33674487

Pulled By: jay-zhuang

fbshipit-source-id: 8943768156aa6197c63850a64238a8092527d517
2022-01-25 15:23:04 -08:00
mrambacher 5486717ee2 Fix an issue with MemTableRepFactory::CreateFromString (#9273)
Summary:
If ignore_unsupported_options=true, then it is possible for MemTableRepFactory::CreateFromString to succeed without setting a result (result=nullptr).  This would cause the original value to be overwritten with null and an error would be raised later when PrepareOptions is invoked.

Added unit test for this condition.  Will add (in another PR unless required by reviewers) comparable tests for all of the other Customizable classes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9273

Reviewed By: ltamasi

Differential Revision: D32990365

Pulled By: mrambacher

fbshipit-source-id: b150724c3f5ae7346357b3866244fd93466875c7
2021-12-09 12:36:18 -08:00
Levi Tamasi dc5de45af8 Support readahead during compaction for blob files (#9187)
Summary:
The patch adds a new BlobDB configuration option `blob_compaction_readahead_size`
that can be used to enable prefetching data from blob files during compaction.
This is important when using storage with higher latencies like HDDs or remote filesystems.
If enabled, prefetching is used for all cases when blobs are read during compaction,
namely garbage collection, compaction filters (when the existing value has to be read from
a blob file), and `Merge` (when the value of the base `Put` is stored in a blob file).
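For illustration, a minimal sketch of enabling the readahead (value in bytes; assumes the public RocksDB headers):
```
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.enable_blob_files = true;
  options.enable_blob_garbage_collection = true;
  // Prefetch blob data in 2 MB chunks when blobs are read during
  // compaction; useful on HDDs and remote filesystems.
  options.blob_compaction_readahead_size = 2 * 1024 * 1024;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/blob_readahead_demo", &db);
  if (s.ok()) delete db;
  return 0;
}
```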

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9187

Test Plan: Ran `make check` and the stress/crash test.

Reviewed By: riversand963

Differential Revision: D32565512

Pulled By: ltamasi

fbshipit-source-id: 87be9cebc3aa01cc227bec6b5f64d827b8164f5d
2021-11-19 17:53:47 -08:00
sdong c66b4429ff Incremental Space Amp Compactions in Universal Style (#8655)
Summary:
This commit introduces incremental compaction in universal style for space amplification. This follows the first improvement mentioned in https://rocksdb.org/blog/2021/04/12/universal-improvements.html . The implementation simply picks up files of about the size of max_compaction_bytes to compact, and executes the compaction if the penalty is not too big. More optimizations can be done in the future, e.g. prioritizing between this compaction and other types. But for now, the feature is supposed to be functional and can often reduce the frequency of full compactions, although it can introduce a penalty.

In order to cut files more efficiently so that more files from upper levels can be included, the SST file cutting threshold (for the current file + overlapping parent level files) is set to 1.5X of the target file size. A 2MB target file size will generate files like this: https://gist.github.com/siying/29d2676fba417404f3c95e6c013c7de8 The number of files indeed increases, but it is not out of control.

Two set of write benchmarks are run:
1. For ingestion rate limited scenario, we can see full compaction is mostly eliminated: https://gist.github.com/siying/959bc1186066906831cf4c808d6e0a19 . The write amp increased from 7.7 to 9.4, as expected. After applying file cutting, the number is improved to 8.9. In another benchmark, the write amp is even better with the incremental approach: https://gist.github.com/siying/d1c16c286d7c59c4d7bba718ca198163
2. For ingestion rate unlimited scenario, incremental compaction turns out to be too expensive most of the time and is not executed, as expected.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8655

Test Plan: Add unit tests to the functionality.

Reviewed By: ajkr

Differential Revision: D31787034

fbshipit-source-id: ce813e63b15a61d5a56e97bf8902a1b28e011beb
2021-10-20 10:04:13 -07:00
Levi Tamasi 3e1bf771a3 Make it possible to force the garbage collection of the oldest blob files (#8994)
Summary:
The current BlobDB garbage collection logic works by relocating the valid
blobs from the oldest blob files as they are encountered during compaction,
and cleaning up blob files once they contain nothing but garbage. However,
with sufficiently skewed workloads, it is theoretically possible to end up in a
situation where few or no compactions get scheduled for the SST files that contain
references to the oldest blob files, which can lead to increased space amp due
to the lack of GC.

In order to efficiently handle such workloads, the patch adds a new BlobDB
configuration option called `blob_garbage_collection_force_threshold`,
which signals to BlobDB to schedule targeted compactions for the SST files
that keep alive the oldest batch of blob files if the overall ratio of garbage in
the given blob files meets the threshold *and* all the given blob files are
eligible for GC based on `blob_garbage_collection_age_cutoff`. (For example,
if the new option is set to 0.9, targeted compactions will get scheduled if the
sum of garbage bytes meets or exceeds 90% of the sum of total bytes in the
oldest blob files, assuming all affected blob files are below the age-based cutoff.)
The net result of these targeted compactions is that the valid blobs in the oldest
blob files are relocated and the oldest blob files themselves cleaned up (since
*all* SST files that rely on them get compacted away).

These targeted compactions are similar to periodic compactions in the sense
that they force certain SST files that otherwise would not get picked up to undergo
compaction and also in the sense that instead of merging files from multiple levels,
they target a single file. (Note: such compactions might still include neighboring files
from the same level due to the need of having a "clean cut" boundary but they never
include any files from any other level.)

This functionality is currently only supported with the leveled compaction style
and is inactive by default (since the default value is set to 1.0, i.e. 100%).
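For illustration, a minimal configuration sketch (the two GC options are the ones named above; `enable_blob_files` and `enable_blob_garbage_collection` are the standard BlobDB toggles):

```
#include <rocksdb/options.h>

rocksdb::Options MakeBlobGCOptions() {
  rocksdb::Options options;
  options.enable_blob_files = true;
  options.enable_blob_garbage_collection = true;
  // Only blob files in the oldest 25% (by age) are eligible for GC.
  options.blob_garbage_collection_age_cutoff = 0.25;
  // Schedule targeted compactions once >=90% of the bytes in the oldest
  // eligible batch of blob files are garbage; 1.0 (the default) disables this.
  options.blob_garbage_collection_force_threshold = 0.9;
  return options;
}
```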

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8994

Test Plan: Ran `make check` and tested using `db_bench` and the stress/crash tests.

Reviewed By: riversand963

Differential Revision: D31489850

Pulled By: ltamasi

fbshipit-source-id: 44057d511726a0e2a03c5d9313d7511b3f0c4eab
2021-10-11 18:03:01 -07:00
mrambacher 13ae16c315 Cleanup includes in dbformat.h (#8930)
Summary:
This header file was including everything and the kitchen sink when it did not need to.  As a result, many files included this header when all they needed was one of the smaller pieces it pulled in.

Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing.

Hopefully, this sort of code hygiene cleanup will speed up the builds...

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930

Reviewed By: pdillinger

Differential Revision: D31142788

Pulled By: mrambacher

fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d
2021-09-29 04:04:40 -07:00
mrambacher 7fd68b7c39 Make WalFilter, SstPartitionerFactory, FileChecksumGenFactory, and TableProperties Customizable (#8638)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8638

Reviewed By: zhichao-cao

Differential Revision: D31024729

Pulled By: mrambacher

fbshipit-source-id: 954c04ccab0b8dee64050a27aadf78ed119106c0
2021-09-28 05:32:02 -07:00
mrambacher e0f697d2bd Make SliceTransform into a Customizable class (#8641)
Summary:
Made SliceTransform into a Customizable class.

Would be nice to write a test that stored and used a custom transform  in an SST table.

There is a set of tests (DBBlockFliterTest.PrefixExtractor*, SamePrefixTest.InDomainTest, PrefixTest.PrefixAndWholeKeyTest) that run the same with or without a SliceTransform/PrefixFilter.  Is this expected?
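As a usage sketch, a SliceTransform can now be created from a string ID via the Customizable machinery; the exact ID string ("rocksdb.FixedPrefix.8") is an assumption based on how fixed-prefix extractors are serialized in options files:

```
#include <rocksdb/convenience.h>
#include <rocksdb/slice_transform.h>

rocksdb::Status MakePrefixExtractor(
    std::shared_ptr<const rocksdb::SliceTransform>* result) {
  rocksdb::ConfigOptions config_options;
  // Assumed equivalent to NewFixedPrefixTransform(8).
  return rocksdb::SliceTransform::CreateFromString(
      config_options, "rocksdb.FixedPrefix.8", result);
}
```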

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8641

Reviewed By: zhichao-cao

Differential Revision: D31142793

Pulled By: mrambacher

fbshipit-source-id: bb08672fccbfdc263dcae21f25a62307e1facda1
2021-09-27 07:43:47 -07:00
mrambacher beed86473a Make MemTableRepFactory into a Customizable class (#8419)
Summary:
This PR does the following:
-> Makes the MemTableRepFactory into a Customizable class that is creatable/configurable via CreateFromString
-> Makes the existing implementations compatible with configurations
-> Moves the "SpecialRepFactory" test class into testutil, accessible via the ObjectRegistry or a NewSpecial API

New tests were added to validate the functionality and all existing tests pass.  db_bench and memtablerep_bench were hand-tested to verify the functionality in those tools.
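A rough usage sketch of the new CreateFromString path (the "skip_list" ID and the shared_ptr out-parameter are assumptions; see memtablerep.h for the exact signature):

```
#include <rocksdb/convenience.h>
#include <rocksdb/memtablerep.h>
#include <rocksdb/options.h>

rocksdb::Status SetMemTableFactory(rocksdb::ColumnFamilyOptions* cf_opts) {
  rocksdb::ConfigOptions config_options;
  // "skip_list" is assumed to be the registered name of SkipListFactory.
  return rocksdb::MemTableRepFactory::CreateFromString(
      config_options, "skip_list", &cf_opts->memtable_factory);
}
```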

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8419

Reviewed By: zhichao-cao

Differential Revision: D29558961

Pulled By: mrambacher

fbshipit-source-id: 81b7229636e4e649a0c914e73ac7b0f8454c931c
2021-09-08 07:46:44 -07:00
Peter Dillinger 4750421ece Replace most typedef with using= (#8751)
Summary:
Old typedef syntax is confusing

Most, but not all, of the changes were made with

    perl -pi -e 's/typedef (.*) ([a-zA-Z0-9_]+);/using $2 = $1;/g' list_of_files
    make format
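For illustration, the rewrite turns declarations like the first alias below into the second (a sketch, not taken from the diff):

```
#include <cstdint>
#include <string>
#include <unordered_map>

// Old: typedef puts the alias name last, which is harder to scan.
typedef std::unordered_map<std::string, uint64_t> CountMapOld;

// New: using= puts the alias name first and also supports template aliases,
// e.g. template <typename V> using StringMap = std::unordered_map<std::string, V>;
using CountMap = std::unordered_map<std::string, uint64_t>;
```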

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8751

Test Plan: existing

Reviewed By: zhichao-cao

Differential Revision: D30745277

Pulled By: pdillinger

fbshipit-source-id: 6f65f0631c3563382d43347896020413cc2366d9
2021-09-07 11:31:59 -07:00
sdong e7c24168d8 Move old files to warm tier in FIFO compactions (#8310)
Summary:
Some FIFO users want to keep the data for longer, but the old data is rarely accessed. This feature allows users to configure FIFO compaction so that data older than a threshold is moved to a warm storage tier.
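For illustration, a configuration sketch; the `age_for_warm` field name on `CompactionOptionsFIFO` is an assumption based on this description:

```
#include <rocksdb/options.h>

rocksdb::Options MakeFifoWarmOptions() {
  rocksdb::Options options;
  options.compaction_style = rocksdb::kCompactionStyleFIFO;
  options.compaction_options_fifo.max_table_files_size = 10ull << 30;  // 10GB
  // Assumed field: files older than one day are moved to the warm tier.
  options.compaction_options_fifo.age_for_warm = 24 * 60 * 60;  // seconds
  return options;
}
```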

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8310

Test Plan: Add several unit tests.

Reviewed By: ajkr

Differential Revision: D28493792

fbshipit-source-id: c14824ea634814dee5278b449ab5c98b6e0b5501
2021-08-09 12:51:14 -07:00
mrambacher d057e8326d Make MergeOperator+CompactionFilter/Factory into Customizable Classes (#8481)
Summary:
- Changed MergeOperator, CompactionFilter, and CompactionFilterFactory into Customizable classes.
- Added Options/Configurable/Object Registration for the TTL and Cassandra variants.
- Changed the StringAppend MergeOperators to accept a string delimiter rather than a simple char, and made the delimiter a configurable option (see the sketch after this list).
- Added tests for the new functionality.
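
A usage sketch of the Customizable creation path; the "StringAppendOperator" ID and the "delimiter" property name are assumptions based on this description:

```
#include <rocksdb/convenience.h>
#include <rocksdb/merge_operator.h>

rocksdb::Status MakeStringAppendOperator(
    std::shared_ptr<rocksdb::MergeOperator>* result) {
  rocksdb::ConfigOptions config_options;
  // The delimiter is now a configurable string option instead of a char.
  return rocksdb::MergeOperator::CreateFromString(
      config_options, "id=StringAppendOperator; delimiter=|", result);
}
```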

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8481

Reviewed By: zhichao-cao

Differential Revision: D30136050

Pulled By: mrambacher

fbshipit-source-id: 271d1772835935b6773abaf018ee71e42f9491af
2021-08-06 08:27:25 -07:00
mrambacher 373e3a154d Fix Immutable Customizable Serialization (#8457)
Summary:
If a Customizable option was not mutable, it would still appear in the list of mutable options when serialized. This meant that when the immutable options were used to configure another immutable object, an "option not changeable" status would be returned.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8457

Reviewed By: zhichao-cao

Differential Revision: D29428298

Pulled By: mrambacher

fbshipit-source-id: 3945b0b822f8e5955a7c5590fe64dfd5bc1fe6a0
2021-06-28 12:28:28 -07:00
mrambacher 6ad0810393 Make Comparator into a Customizable Object (#8336)
Summary:
Makes the Comparator class into a Customizable object.  Added/updated the CreateFromString method to create Comparators, and added a test that uses the ObjectRegistry to create one.
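A usage sketch; built-in comparators are static singletons, hence the raw const pointer out-parameter (assuming this is the signature in comparator.h):

```
#include <rocksdb/comparator.h>
#include <rocksdb/convenience.h>

rocksdb::Status MakeComparator(const rocksdb::Comparator** result) {
  rocksdb::ConfigOptions config_options;
  // "leveldb.BytewiseComparator" is the Name() of the default comparator.
  return rocksdb::Comparator::CreateFromString(
      config_options, "leveldb.BytewiseComparator", result);
}
```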

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8336

Reviewed By: jay-zhuang

Differential Revision: D28999612

Pulled By: mrambacher

fbshipit-source-id: bff2cb2814eeb9fef6a00fddc61d6e34b6fbcf2e
2021-06-11 06:22:59 -07:00
Jay Zhuang d15fbae449 Refactor Option obj address from char* to void* (#8295)
Summary:
And replace `reinterpret_cast` with `static_cast` or no cast.
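The motivation, in a generic sketch (not from the diff): any object pointer converts to void* implicitly and comes back with static_cast, so no reinterpret_cast is needed at the point of use:

```
struct MyOptions {
  int level = 0;
};

int main() {
  MyOptions opts;
  void* addr = &opts;  // implicit conversion, no cast needed
  // static_cast suffices to convert back to the concrete type.
  static_cast<MyOptions*>(addr)->level = 4;
  return 0;
}
```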

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8295

Test Plan: `make check`

Reviewed By: mrambacher

Differential Revision: D28420303

Pulled By: jay-zhuang

fbshipit-source-id: 645be123a0df624dc2bea37cd54a35403fc494fa
2021-05-13 14:29:42 -07:00
mrambacher 78e82410eb Added static methods for simple types to OptionTypeInfo (#8249)
Summary:
Added ParseType, SerializeType, and TypesAreEqual methods to OptionTypeInfo.  These methods can be used for the serialization and deserialization of basic types.

Changed the MutableCF/DB Options to use this format.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8249

Reviewed By: jay-zhuang

Differential Revision: D28351190

Pulled By: mrambacher

fbshipit-source-id: 72a78643b804f2f0bf59c32ffefa63346672ad16
2021-05-11 16:15:47 -07:00
mrambacher 9f2d255aed Add ObjectRegistry to ConfigOptions (#8166)
Summary:
This change enables a couple of things:
- Different ConfigOptions can have different registries/factories associated with them, thereby allowing things like a "Test" ConfigOptions versus a "Production" one
- The ObjectRegistry is created fewer times and can be re-used

The ConfigOptions can also be initialized/constructed from a DBOptions, in which case it will grab some of its settings (Env, Logger) from the DBOptions.
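For illustration (assuming `ObjectRegistry::NewInstance()` is the factory for a fresh registry and that `registry` is a public member of ConfigOptions):

```
#include <rocksdb/convenience.h>
#include <rocksdb/options.h>
#include <rocksdb/utilities/object_registry.h>

void ConfigureForTests(const rocksdb::DBOptions& db_opts) {
  // Grabs settings such as Env and Logger from the DBOptions.
  rocksdb::ConfigOptions config_options(db_opts);
  // A dedicated registry, e.g. one that knows about test-only objects,
  // instead of the shared default instance.
  config_options.registry = rocksdb::ObjectRegistry::NewInstance();
}
```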

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8166

Reviewed By: zhichao-cao

Differential Revision: D27657952

Pulled By: mrambacher

fbshipit-source-id: ae1d6200bb7ab127405cdeefaba43c7fe694dfdd
2021-05-11 06:47:22 -07:00
mrambacher 8948dc8524 Make ImmutableOptions struct that inherits from ImmutableCFOptions and ImmutableDBOptions (#8262)
Summary:
The ImmutableCFOptions contained a bunch of fields that belonged to the ImmutableDBOptions.  This change cleans that up by introducing an ImmutableOptions struct.  Following the pattern of the Options struct, this class inherits from the DB and CF option structs (of the Immutable form).

Only one structural change (ImmutableCFOptions::fs was changed from a raw pointer to a shared_ptr) is in this PR.  All of the other changes involve moving member variables from the ImmutableCFOptions into the ImmutableOptions and changing member variables or function parameters as required for compilation purposes.

Follow-on PRs may do a further clean-up of the code, such as renaming variables (such as "ImmutableOptions cf_options") and potentially eliminating unneeded function parameters (there is no longer a need to pass both an ImmutableDBOptions and an ImmutableOptions to a function).
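A simplified, self-contained sketch of the inheritance pattern (the real structs live in RocksDB's internal options headers and carry many members):

```
// Stand-ins for the real internal structs.
struct ImmutableDBOptions {
  bool use_fsync = false;
};
struct ImmutableCFOptions {
  bool inplace_update_support = false;
};

// Mirrors how Options inherits from DBOptions and ColumnFamilyOptions.
struct ImmutableOptions : public ImmutableDBOptions, public ImmutableCFOptions {
  ImmutableOptions(const ImmutableDBOptions& db, const ImmutableCFOptions& cf)
      : ImmutableDBOptions(db), ImmutableCFOptions(cf) {}
};
```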

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8262

Reviewed By: pdillinger

Differential Revision: D28226540

Pulled By: mrambacher

fbshipit-source-id: 18ae71eadc879dedbe38b1eb8e6f9ff5c7147dbf
2021-05-05 14:00:17 -07:00
mrambacher 0ca6d6297f Rename variables in ImmutableCFOptions to avoid conflicts with ImmutableDBOptions (#8227)
Summary:
Renaming ImmutableCFOptions::info_log and statistics to logger and stats.  This is stage 2 in creating an ImmutableOptions class.  It is necessary because the names conflict with those in ImmutableDBOptions, which have different types.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8227

Reviewed By: jay-zhuang

Differential Revision: D28000967

Pulled By: mrambacher

fbshipit-source-id: 3bf2aa04e8f1e8724d825b7deacf41080c14420b
2021-04-26 12:43:45 -07:00
mrambacher 6bab3a34e9 Move RegisterOptions into the Configurable API (#8223)
Summary:
As previously coded, a Configurable extension would need access to code not in the public API.  This change moves RegisterOptions into the Configurable class, making it available to public extensions.
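A sketch of what a public extension might now do; the class, its options struct, and the option map entry are hypothetical, and the RegisterOptions call follows the pattern used by the built-in objects (check configurable.h and options_type.h in your version for the exact signatures):

```
#include <cstddef>
#include <unordered_map>

#include <rocksdb/configurable.h>
#include <rocksdb/utilities/options_type.h>

// Hypothetical options struct for an out-of-tree extension.
struct MyPluginOptions {
  int window = 10;
};

static std::unordered_map<std::string, rocksdb::OptionTypeInfo>
    my_plugin_type_info = {
        {"window",
         {offsetof(struct MyPluginOptions, window), rocksdb::OptionType::kInt,
          rocksdb::OptionVerificationType::kNormal,
          rocksdb::OptionTypeFlags::kNone}},
};

class MyPlugin : public rocksdb::Configurable {
 public:
  MyPlugin() {
    // Now callable from outside the RocksDB source tree.
    RegisterOptions("MyPluginOptions", &options_, &my_plugin_type_info);
  }

 private:
  MyPluginOptions options_;
};
```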

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8223

Reviewed By: anand1976

Differential Revision: D27960188

Pulled By: mrambacher

fbshipit-source-id: ac88b19397183df633902def5b5701b9b65fbf40
2021-04-26 03:13:24 -07:00
mrambacher 01e460d538 Make types of Immutable/Mutable Options fields match that of the underlying Option (#8176)
Summary:
This PR is a first step toward cleaning up some of the Mutable/Immutable Options code.  With this change, a DBOptions and a ColumnFamilyOptions can be reconstructed from their Mutable and Immutable equivalents, respectively.

readrandom tests do not show any performance degradation versus master (though both are slightly slower than the current 6.19 release).

There are still fields in the ImmutableCFOptions that are not CF options but DB options.  Eventually, I would like to move those into an ImmutableOptions (= ImmutableDBOptions + ImmutableCFOptions), but to minimize changes and disruption, that will be part of a future PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8176

Reviewed By: pdillinger

Differential Revision: D27954339

Pulled By: mrambacher

fbshipit-source-id: ec6b805ba9afe6e094bffdbd76246c2d99aa9fad
2021-04-22 20:43:54 -07:00