Commit graph

11 commits

Author SHA1 Message Date
Changyu Bi 6a82b68788 Avoid counting extra range tombstone compensated size in AddRangeDels() (#11091)
Summary:
in `CompactionOutputs::AddRangeDels()`, range tombstones with the same start and end key but different sequence numbers all contribute to compensated range tombstone size. This PR removes this redundancy. This PR also includes a fix from https://github.com/facebook/rocksdb/issues/11067 where a range tombstone that is not within a file's range was being added to the file. This fixes an assertion failure for `icmp.Compare(start, end) <= 0` in VersionSet::ApproximateSize() when calculating compensated range tombstone size. Assertions and a comment/essay was added to reason that no such range tombstone will be added after this fix.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11091

Test Plan:
- Added unit tests
- Stress test with small key range: `python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=600 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10`

Reviewed By: ajkr

Differential Revision: D42521588

Pulled By: cbi42

fbshipit-source-id: 5bda3fe38997995314e1f7592319af12b69bc4f8
2023-01-17 12:47:44 -08:00
Changyu Bi f515d9d203 Revert #10802 Consider range tombstone in compaction output file cutting (#11089)
Summary:
This reverts commit f02c708aa3 since it introduced several bugs (see https://github.com/facebook/rocksdb/issues/11078 and https://github.com/facebook/rocksdb/issues/11067 for attempts to fix them) and that I do not have a high confidence to fix all of them and ensure no further ones before the next release branch cut. There are also come existing issue found during bug fixing. We will work on it and try to merge it to the release after.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11089

Test Plan: existing CI.

Reviewed By: ajkr

Differential Revision: D42505972

Pulled By: cbi42

fbshipit-source-id: 2f66dcde6b85dc94977b317c2ce513872cfbc153
2023-01-13 12:28:21 -08:00
Changyu Bi cc6f323705 Include estimated bytes deleted by range tombstones in compensated file size (#10734)
Summary:
compensate file sizes in compaction picking so files with range tombstones are preferred, such that they get compacted down earlier as they tend to delete a lot of data. This PR adds a `compensated_range_deletion_size` field in FileMeta that is computed during Flush/Compaction and persisted in MANIFEST. This value is added to `compensated_file_size` which will be used for compaction picking. Currently, for a file in level L, `compensated_range_deletion_size` is set to the estimated bytes deleted by range tombstone of this file in all levels > L. This helps to reduce space amp when data in older levels are covered by range tombstones in level L.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10734

Test Plan:
- Added unit tests.
- benchmark to check if the above definition `compensated_range_deletion_size` is reducing space amp as intended, without affecting write amp too much. The experiment set up favorable for this optimization: large range tombstone issued infrequently. Command used:
```
./db_bench -benchmarks=fillrandom,waitforcompaction,stats,levelstats -use_existing_db=false -avoid_flush_during_recovery=true -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -max_bytes_for_level_base=134217728 -target_file_size_base=33554432 -writes_per_range_tombstone=500000 -range_tombstone_width=5000000 -num=50000000 -benchmark_write_rate_limit=8388608 -threads=16 -duration=1800 --max_num_range_tombstones=1000000000
```

In this experiment, each thread wrote 16 range tombstones over the duration of 30 minutes, each range tombstone has width 5M that is the 10% of the key space width. Results shows this PR generates a smaller DB size.

Compaction stats from this PR:
```
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      2/0   31.54 MB   0.5      0.0     0.0      0.0       8.4      8.4       0.0   1.0      0.0     63.4    135.56            110.94       544    0.249       0      0       0.0       0.0
  L4      3/0   96.55 MB   0.8     18.5     6.7     11.8      18.4      6.6       0.0   2.7     65.3     64.9    290.08            284.03       108    2.686    284M  1957K       0.0       0.0
  L5     15/0   404.41 MB   1.0     19.1     7.7     11.4      18.8      7.4       0.3   2.5     66.6     65.7    292.93            285.34       220    1.332    293M  3808K       0.0       0.0
  L6    143/0    4.12 GB   0.0     45.0     7.5     37.5      41.6      4.1       0.0   5.5     71.2     65.9    647.00            632.66       251    2.578    739M    47M       0.0       0.0
 Sum    163/0    4.64 GB   0.0     82.6    21.9     60.7      87.2     26.5       0.3  10.4     61.9     65.4   1365.58           1312.97      1123    1.216   1318M    52M       0.0       0.0
```

Compaction stats from main:
```
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0    0.00 KB   0.0      0.0     0.0      0.0       8.4      8.4       0.0   1.0      0.0     60.5    142.12            115.89       569    0.250       0      0       0.0       0.0
  L4      3/0   85.68 MB   1.0     17.7     6.8     10.9      17.6      6.7       0.0   2.6     62.7     62.3    289.05            281.79       112    2.581    272M  2309K       0.0       0.0
  L5     11/0   293.73 MB   1.0     18.8     7.5     11.2      18.5      7.2       0.5   2.5     64.9     63.9    296.07            288.50       220    1.346    288M  4365K       0.0       0.0
  L6    130/0    3.94 GB   0.0     51.5     7.6     43.9      47.9      3.9       0.0   6.3     67.2     62.4    784.95            765.92       258    3.042    848M    51M       0.0       0.0
 Sum    144/0    4.31 GB   0.0     88.0    21.9     66.0      92.3     26.3       0.5  11.0     59.6     62.5   1512.19           1452.09      1159    1.305   1409M    58M       0.0       0.0```

Reviewed By: ajkr

Differential Revision: D39834713

Pulled By: cbi42

fbshipit-source-id: fe9341040b8704a8fbb10cad5cf5c43e962c7e6b
2022-12-29 13:28:24 -08:00
Changyu Bi 53b703eafe Fix an assertion failure in CompactionOutputs::AddRangeDels() (#11040)
Summary:
the [assertion](c3f720c60d/db/compaction/compaction_outputs.cc (L643)) in `CompactionOutputs::AddRangeDels()` can fail after https://github.com/facebook/rocksdb/pull/10802. The assertion fails when `lower_bound_from_range_tombstone` is true during `AddRangeDels()` for a new compaction output file, while the lower bound range tombstone key has seqno 0 and op_type kTypeRangeDeletion. It can have seqno 0 when it was truncated at a point key whose seqno was zeroed out during compaction, the seqno and op_type could be set [here](c3f720c60d/db/compaction/compaction_outputs.cc (L594)). This PR fixes the assertion excluding the case when `lower_bound_from_range_tombstone` is true.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11040

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D42119914

Pulled By: cbi42

fbshipit-source-id: 0897e71b5304cb02aac30f71667b590c37b72baf
2022-12-19 16:36:39 -08:00
Changyu Bi f02c708aa3 Consider range tombstone in compaction output file cutting (#10802)
Summary:
This PR is the first step for Issue https://github.com/facebook/rocksdb/issues/4811. Currently compaction output files are cut at point keys, and the decision is made mainly in `CompactionOutputs::ShouldStopBefore()`. This makes it possible for range tombstones to cause large compactions that does not respect `max_compaction_bytes`. For example, we can have a large range tombstone that overlaps with too many files from the next level. Another example is when there is a gap between a range tombstone and another key. The first issue may be more acceptable, as a lot of data is deleted. This PR address the second issue by calling `ShouldStopBefore()` for range tombstone start keys. The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new `CompactionMergingIterator` is introduced and only used under `CompactionIterator` for this purpose. Further improvement after this PR include 1) cut compaction output at some grandparent boundary key instead of at the next point key or range tombstone start key and 2) cut compaction output file within a large range tombstone (it may be easier and reasonable to only do it for range tombstones at the end of a compaction output).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10802

Test Plan:
- added unit tests in db_range_del_test.
- stress test: `python3 tools/db_crashtest.py whitebox --[simple|enable_ts] --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=2 --writepercent=58 --readpercen=21 --duration=36000 --range_deletion_width=1000000`

Reviewed By: ajkr, jay-zhuang

Differential Revision: D40308827

Pulled By: cbi42

fbshipit-source-id: a8fd6f70a3f09d0ef7a40e006f6c964bba8c00df
2022-12-15 09:11:54 -08:00
Jay Zhuang 23fa5b7789 Use sstableKeyCompare() for compaction output boundary check (#10763)
Summary:
To make it consistent with the compaction picker which uses the `sstableKeyCompare()` to pick the overlap files. For example, without this change, it may cut L1 files like:
```
 L1: [2-21]  [22-30]
 L2: [1-10] [21-30]
```
Because "21" on L1 is smaller than "21" on L2. But for compaction, these 2 files are overlapped.
`sstableKeyCompare()` also take range delete into consideration which may cut file for the same key.
It also makes the `max_compaction_bytes` calculation more accurate for cases like above, the overlapped bytes was under estimated. Also make sure the 2 keys won't be splitted to 2 files because of reaching `max_compaction_bytes`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10763

Reviewed By: cbi42

Differential Revision: D39971904

Pulled By: cbi42

fbshipit-source-id: bcc309e9c3dc61a8f50667a6f633e6132c0154a8
2022-10-06 15:54:58 -07:00
Changyu Bi 9f2363f4c4 User-defined timestamp support for DeleteRange() (#10661)
Summary:
Add user-defined timestamp support for range deletion. The new API is `DeleteRange(opt, cf, begin_key, end_key, ts)`. Most of the change is to update the comparator to compare without timestamp. Other than that, major changes are
- internal range tombstone data structures (`FragmentedRangeTombstoneList`, `RangeTombstone`, etc.) to store timestamps.
- Garbage collection of range tombstones and range tombstone covered keys during compaction.
- Get()/MultiGet() to return the timestamp of a range tombstone when needed.
- Get/Iterator with range tombstones bounded by readoptions.timestamp.
- timestamp crash test now issues DeleteRange by default.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10661

Test Plan:
- Added unit test: `make check`
- Stress test: `python3 tools/db_crashtest.py --enable_ts whitebox --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
- Ran `db_bench` to measure regression when timestamp is not enabled. The tests are for write (with some range deletion) and iterate with DB fitting in memory: `./db_bench--benchmarks=fillrandom,seekrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=500000 --seek_nexts=10 --disable_auto_compactions -disable_wal=true --max_num_range_tombstones=1000`.  Did not see consistent regression in no timestamp case.

| micros/op | fillrandom | seekrandom |
| --- | --- | --- |
|main| 2.58 |10.96|
|PR 10661| 2.68 |10.63|

Reviewed By: riversand963

Differential Revision: D39441192

Pulled By: cbi42

fbshipit-source-id: f05aca3c41605caf110daf0ff405919f300ddec2
2022-09-30 16:13:03 -07:00
Jay Zhuang f3cc66632b Align compaction output file boundaries to the next level ones (#10655)
Summary:
Try to align the compaction output file boundaries to the next level ones
(grandparent level), to reduce the level compaction write-amplification.

In level compaction, there are "wasted" data at the beginning and end of the
output level files. Align the file boundary can avoid such "wasted" compaction.
With this PR, it tries to align the non-bottommost level file boundaries to its
next level ones. It may cut file when the file size is large enough (at least
50% of target_file_size) and not too large (2x target_file_size).

db_bench shows about 12.56% compaction reduction:
```
TEST_TMPDIR=/data/dbbench2 ./db_bench --benchmarks=fillrandom,readrandom -max_background_jobs=12 -num=400000000 -target_file_size_base=33554432

# baseline:
Flush(GB): cumulative 25.882, interval 7.216
Cumulative compaction: 285.90 GB write, 162.36 MB/s write, 269.68 GB read, 153.15 MB/s read, 2926.7 seconds

# with this change:
Flush(GB): cumulative 25.882, interval 7.753
Cumulative compaction: 249.97 GB write, 141.96 MB/s write, 233.74 GB read, 132.74 MB/s read, 2534.9 seconds
```

The compaction simulator shows a similar result (14% with 100G random data).
As a side effect, with this PR, the SST file size can exceed the
target_file_size, but is capped at 2x target_file_size. And there will be
smaller files. Here are file size statistics when loading 100GB with the target
file size 32MB:
```
          baseline      this_PR
count  1.656000e+03  1.705000e+03
mean   3.116062e+07  3.028076e+07
std    7.145242e+06  8.046139e+06
```

The feature is enabled by default, to revert to the old behavior disable it
with `AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size = false`

Also includes https://github.com/facebook/rocksdb/issues/1963 to cut file before skippable grandparent file. Which is for
use case like user adding 2 or more non-overlapping data range at the same
time, it can reduce the overlapping of 2 datasets in the lower levels.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10655

Reviewed By: cbi42

Differential Revision: D39552321

Pulled By: jay-zhuang

fbshipit-source-id: 640d15f159ab0cd973f2426cfc3af266fc8bdde2
2022-09-29 19:43:55 -07:00
Jay Zhuang 849cf1bf68 Refactor Compaction file cut ShouldStopBefore() (#10629)
Summary:
Consolidate compaction output cut logic to `ShouldStopBefore()` and move
it inside of CompactionOutputs class.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10629

Reviewed By: cbi42

Differential Revision: D39315536

Pulled By: jay-zhuang

fbshipit-source-id: 7d81037babbd35c276bbaad02dbc2bb555fdac18
2022-09-14 22:09:12 -07:00
Jay Zhuang a3acf2ef87 Add seqno to time mapping (#10338)
Summary:
Which will be used for tiered storage to preclude hot data from
compacting to the cold tier (the last level).
Internally, adding seqno to time mapping. A periodic_task is scheduled
to record the current_seqno -> current_time in certain cadence. When
memtable flush, the mapping informaiton is stored in sstable property.
During compaction, the mapping information are merged and get the
approximate time of sequence number, which is used to determine if a key
is recently inserted or not and preclude it from the last level if it's
recently inserted (within the `preclude_last_level_data_seconds`).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10338

Test Plan: CI

Reviewed By: siying

Differential Revision: D37810187

Pulled By: jay-zhuang

fbshipit-source-id: 6953be7a18a99de8b1cb3b162d712f79c2b4899f
2022-07-14 21:49:34 -07:00
Jay Zhuang 6ce0b2ca34 Tiered Compaction: per key placement support (#9964)
Summary:
Support per_key_placement for last level compaction, which will
be used for tiered compaction.
* compaction iterator reports which level a key should output to;
* compaction get the output level information and check if it's safe to
  output the data to penultimate level;
* all compaction output files will be installed.
* extra internal compaction stats added for penultimate level.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9964

Test Plan:
* Unittest
* db_bench, no significate difference: https://gist.github.com/jay-zhuang/3645f8fb97ec0ab47c10704bb39fd6e4
* microbench manual compaction no significate difference: https://gist.github.com/jay-zhuang/ba679b3e89e24992615ee9eef310e6dd
* run the db_stress multiple times (not covering the new feature) looks good (internal: https://fburl.com/sandcastle/9w84pp2m)

Reviewed By: ajkr

Differential Revision: D36249494

Pulled By: jay-zhuang

fbshipit-source-id: a96da57c8031c1df83e4a7a8567b657a112b80a3
2022-07-13 20:54:49 -07:00