Commit graph

83 commits

Author SHA1 Message Date
Yu Zhang c73cf7a878 Add CompactForTieringCollector to support automatically trigger compaction for tiering use case (#12760)
Summary:
This PR adds user property collector factory `CompactForTieringCollectorFactory` to support observe SST file and mark it as need compaction for fast tracking data to the proper tier.

A triggering ratio `compaction_trigger_ratio_` can be configured to achieve the following:
1) Setting the ratio to be equal to or smaller than 0 disables this collector
2) Setting the ratio to be within (0, 1] will write the number of observed eligible entries into a user property and marks a file as need-compaction when aforementioned condition is met.
3) Setting the ratio to be higher than 1 can be used to just writes the user table property, and not mark any file as need compaction.
 For a column family that does not enable tiering feature, even if an effective configuration is provided, this collector is still disabled. For a file that is already on the last level, this collector is also disabled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12760

Test Plan: Added unit tests

Reviewed By: pdillinger

Differential Revision: D58734976

Pulled By: jowlyzhang

fbshipit-source-id: 6daab2c4f62b5c6689c3c03e3b3907bbbe6b7a81
2024-06-18 10:51:29 -07:00
Yu Zhang 9dc171e3bb Fix issue that cause false alarm corruption report (#12626)
Summary:
The state of `saved_seq_for_penul_check_` is not correctly maintained with the current flow. It's supposed to store the original sequence number for a `kTypeValuePreferredSeqno` entry for use in the `DecideOutputLevel` function. However, it's not always properly cleared.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12626

Test Plan:
Added unit test that would fail before the fix
./tiered_compaction_test --gtest_filter="*InterleavedTimedPutAndPut*"

Reviewed By: pdillinger

Differential Revision: D57123469

Pulled By: jowlyzhang

fbshipit-source-id: 8d73214b3b6dc152daf19b6bd6ee9063581dc277
2024-05-08 15:51:38 -07:00
Yu Zhang 2c02a9b76f Preserve TimedPut on penultimate level until it actually expires (#12543)
Summary:
To make sure `TimedPut` are placed on proper tier before and when it becomes eligible for cold tier
1) flush and compaction need to keep relevant seqno to time mapping for not just the sequence number contained in internal keys, but also preferred sequence number for `TimedPut` entries.

This PR also fix some bugs in for handling `TimedPut` during compaction:
1) dealing with an edge case when a `TimedPut` entry's internal key is the right bound for penultimate level, the internal key after swapping in its preferred sequence number will fall outside of the penultimate range because preferred sequence number is smaller than its original sequence number. The entry however is still safe to be placed on penultimate level, so we keep track of `TimedPut` entry's original sequence number for this check. The idea behind this is that as long as it's safe for the original key to be placed on penultimate level, it's safe for the entry with swapped preferred sequence number to be placed on penultimate level too. Because we only swap in preferred sequence number when that entry is visible to the earliest snapshot and there is no other data points with the same user key in lower levels. On the other hand, as long as it's not safe for the original key to be placed on penultimate level, we will not place the entry after swapping the preferred seqno on penultimate level either.

2) the assertion that preferred seqno is always bigger than original sequence number may fail if this logic is only exercised after sequence number is zeroed out. We adjust the assertion to handle that case too. In this case, we don't swap in the preferred seqno but will adjust the its type to `kTypeValue`.

3) there was a special case handling for when range deletion may end up incorrectly covering an entry if preferred seqno is swapped in. But it missed the case that if the original entry is already covered by range deletion. The original handling will mistakenly output the entry instead of omitting it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12543

Test Plan:
./tiered_compaction_test --gtest_filter="PrecludeLastLevelTest.PreserveTimedPutOnPenultimateLevel"
./compaction_iterator_test --gtest_filter="*TimedPut*"

Reviewed By: pdillinger

Differential Revision: D56195096

Pulled By: jowlyzhang

fbshipit-source-id: 37ebb09d2513abbd9e90cda0217e26874584b8f3
2024-04-30 11:16:02 -07:00
Yu Zhang 1104eaa35e Add initial support for TimedPut API (#12419)
Summary:
This PR adds support for `TimedPut` API. We introduced a new type `kTypeValuePreferredSeqno` for entries added to the DB via the `TimedPut` API.

The life cycle of such an entry on the write/flush/compaction paths are:

1) It is initially added to memtable as:
`<user_key, seq, kTypeValuePreferredSeqno>: {value, write_unix_time}`

2) When it's flushed to L0 sst files, it's converted to:
`<user_key, seq, kTypeValuePreferredSeqno>: {value, preferred_seqno}`
 when we have easy access to the seqno to time mapping.

3) During compaction, if certain conditions are met, we swap in the `preferred_seqno` and the entry will become:
`<user_key, preferred_seqno, kTypeValue>: value`. This step helps fast track these entries to the cold tier if they are eligible after the sequence number swap.

On the read path:
A `kTypeValuePreferredSeqno` entry acts the same as a `kTypeValue` entry, the unix_write_time/preferred seqno part packed in value is completely ignored.

Needed follow ups:
1) The seqno to time mapping accessible in flush needs to be extended to cover the `write_unix_time` for possible `kTypeValuePreferredSeqno` entries. This also means we need to track these `write_unix_time` in memtable.

2) Compaction filter support for the new `kTypeValuePreferredSeqno` type for feature parity with other `kTypeValue` and equivalent types.

3) Stress test coverage for the feature

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12419

Test Plan: Added unit tests

Reviewed By: pdillinger

Differential Revision: D54920296

Pulled By: jowlyzhang

fbshipit-source-id: c8b43f7a7c465e569141770e93c748371ff1da9e
2024-03-14 15:44:55 -07:00
Changyu Bi 4e58cc6437 Check internal key range when compacting from last level to penultimate level (#12063)
Summary:
The test failure in https://github.com/facebook/rocksdb/issues/11909 shows that we may compact keys outside of internal key range of penultimate level input files from last level to penultimate level, which can potentially cause overlapping files in the penultimate level. This PR updates the  `Compaction::WithinPenultimateLevelOutputRange()` to check internal key range instead of user key.

Other fixes:
* skip range del sentinels when deciding output level for tiered compaction

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12063

Test Plan:
- existing unit tests
- apply the fix to https://github.com/facebook/rocksdb/issues/11905 and run `./tiered_compaction_test --gtest_filter="*RangeDelsCauseFileEndpointsToOverlap*"`

Reviewed By: ajkr

Differential Revision: D51288985

Pulled By: cbi42

fbshipit-source-id: 70085db5f5c3b15300bcbc39057d57b83fd9902a
2023-11-17 10:50:40 -08:00
Levi Tamasi 1e63fc9925 Add a helper method WideColumnsHelper::SortColumns (#11823)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11823

Similarly to https://github.com/facebook/rocksdb/pull/11813, the patch is a small refactoring that eliminates some copy-paste around sorting the columns of entities by column name.

Reviewed By: jaykorean

Differential Revision: D49195504

fbshipit-source-id: d48c9f290e3203f838cc5949856c469ecf730008
2023-09-12 12:36:07 -07:00
Changyu Bi 6a0f637633 Compare the number of input keys and processed keys for compactions (#11571)
Summary:
... to improve data integrity validation during compaction.

A new option `compaction_verify_record_count` is introduced for this verification and is enabled by default. One exception when the verification is not done is when a compaction filter returns kRemoveAndSkipUntil which can cause CompactionIterator to seek until some key and hence not able to keep track of the number of keys processed.

For expected number of input keys, we sum over the number of total keys - number of range tombstones across compaction input files (`CompactionJob::UpdateCompactionStats()`). Table properties are consulted if `FileMetaData` is not initialized for some input file. Since table properties for all input files were also constructed during `DBImpl::NotifyOnCompactionBegin()`, `Compaction::GetTableProperties()` is introduced to reduce duplicated code.

For actual number of keys processed, each subcompaction will record its number of keys processed to `sub_compact->compaction_job_stats.num_input_records` and aggregated when all subcompactions finish (`CompactionJob::AggregateCompactionStats()`). In the case when some subcompaction encountered kRemoveAndSkipUntil from compaction filter and does not have accurate count, it propagates this information through `sub_compact->compaction_job_stats.has_num_input_records`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11571

Test Plan:
* Add a new unit test `DBCompactionTest.VerifyRecordCount` for the corruption case.
* All other unit tests for non-corrupted case.
* Ran crash test for a few hours: `python3 ./tools/db_crashtest.py whitebox --simple`

Reviewed By: ajkr

Differential Revision: D47131965

Pulled By: cbi42

fbshipit-source-id: cc8e94565dd526c4347e9d3843ecf32f6727af92
2023-07-28 09:47:31 -07:00
Hui Xiao 151242ce46 Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288)
Summary:
**Context:**
The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them.

**Summary**
- Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros`
   - Fixed the default histogram in `RandomAccessFileReader`
- New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader`
- Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288

Test Plan:
- **Stress test**
- **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.**  (without blob)
     - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads.
```
./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10)
```
```
// BlockBasedTable
rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805
rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116
rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689

// PlainTable
Does not apply
```
- **Db bench 2: performance**

**Read**

SETUP: db with 900 files
```
./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655  -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none
```run till convergence
```
./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3
```
Pre-change
`readrandom [AVG 60 runs] : 21568 (± 248) ops/sec`
Post-change (no regression, -0.3%)
`readrandom [AVG 60 runs] : 21486 (± 236) ops/sec`

**Compaction/Flush**run till convergence
```
./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655  -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none

rocksdb.sst.read.micros  COUNT : 33820
rocksdb.sst.read.flush.micros COUNT : 1800
rocksdb.sst.read.compaction.micros COUNT : 32020
```
Pre-change
`fillseq [AVG 46 runs] : 1391 (± 214) ops/sec;    0.7 (± 0.1) MB/sec`

Post-change (no regression, ~-0.4%)
`fillseq [AVG 46 runs] : 1385 (± 216) ops/sec;    0.7 (± 0.1) MB/sec`

Reviewed By: ajkr

Differential Revision: D44007011

Pulled By: hx235

fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
2023-04-21 09:07:18 -07:00
Changyu Bi b3c43a5b99 Drain unnecessary levels when level_compaction_dynamic_level_bytes=true (#11340)
Summary:
When a user migrates to level compaction + `level_compaction_dynamic_level_bytes=true`, or when a DB shrinks, there can be unnecessary levels in the DB. Before this PR, this is no way to remove these levels except a manual compaction. These extra unnecessary levels make it harder to guarantee max_bytes_for_level_multiplier and can cause extra space amp. This PR boosts compaction score for these levels to allow RocksDB to automatically drain these levels. Together with https://github.com/facebook/rocksdb/issues/11321, this makes migration to `level_compaction_dynamic_level_bytes=true` automatic without needing user to do a one time full manual compaction. Credit: this PR is modified from https://github.com/facebook/rocksdb/issues/3921.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11340

Test Plan:
- New unit tests
- `python3 tools/db_crashtest.py whitebox --simple` which randomly sets level_compaction_dynamic_level_bytes in each run.

Reviewed By: ajkr

Differential Revision: D44563884

Pulled By: cbi42

fbshipit-source-id: e20d3620bd73dff22be18c5a91a07f340740bcc8
2023-04-06 11:20:43 -07:00
Changyu Bi 229297d1b8 Refactor AddRangeDels() + consider range tombstone during compaction file cutting (#11113)
Summary:
A second attempt after https://github.com/facebook/rocksdb/issues/10802, with bug fixes and refactoring. This PR updates compaction logic to take range tombstones into account when determining whether to cut the current compaction output file (https://github.com/facebook/rocksdb/issues/4811). Before this change, only point keys were considered, and range tombstones could cause large compactions. For example, if the current compaction outputs is a range tombstone [a, b) and 2 point keys y, z, they would be added to the same file, and may overlap with too many files in the next level and cause a large compaction in the future. This PR also includes ajkr's effort to simplify the logic to add range tombstones to compaction output files in `AddRangeDels()` ([https://github.com/facebook/rocksdb/issues/11078](https://github.com/facebook/rocksdb/pull/11078#issuecomment-1386078861)).

The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new class `CompactionMergingIterator` is introduced to replace `MergingIterator` under `CompactionIterator` to enable emitting of range tombstone start keys. Further improvement after this PR include cutting compaction output at some grandparent boundary key (instead of the next output key) when cutting within a range tombstone to reduce overlap with grandparents.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11113

Test Plan:
* added unit test in db_range_del_test
* crash test with a small key range: `python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=600 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10 --use_multiget=1 --delpercent=3 --delrangepercent=2 --verify_iterator_with_expected_state_one_in=2 --num_iterations=10`

Reviewed By: ajkr

Differential Revision: D42655709

Pulled By: cbi42

fbshipit-source-id: 8367e36ef5640e8f21c14a3855d4a8d6e360a34c
2023-02-22 12:28:18 -08:00
Levi Tamasi 876d281592 Add compaction filter support for wide-column entities (#11196)
Summary:
The patch adds compaction filter support for wide-column entities by introducing
a new `CompactionFilter` API called `FilterV3`. This API is called for regular
key-values, merge operands, and wide-column entities as well. It is passed the
existing value/operand or wide-column structure and it can update the value or
columns or keep/delete/etc. the key-value as usual. For compatibility, the default
implementation of `FilterV3` keeps all wide-column entities and falls back to calling
`FilterV2` for plain old key-values and merge operands.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11196

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D43094147

Pulled By: ltamasi

fbshipit-source-id: 75acabe9a35254f7f404ba6173ee9c2774382ebd
2023-02-07 16:17:39 -08:00
Levi Tamasi df680b24ef Clean up InvokeFilterIfNeeded a bit (#11174)
Summary:
The patch makes some code quality enhancements in `CompactionIterator::InvokeFilterIfNeeded`
including the renaming of `filter` (which is most likely a remnant of the days before the `FilterV2`
API when the compaction filter used to return a boolean) to `decision`, the removal of some
outdated comments, the elimination of an `error` flag which was only used in one failure case
out of many, as well as some small stylistic improvements. (Some the above will also come in
handy when adding compaction filter support for wide-column entities.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11174

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D42901408

Pulled By: ltamasi

fbshipit-source-id: ab382d59a4990c5dfe1cee219d49e1d80902b666
2023-02-01 10:03:07 -08:00
Andrew Kryczka b7fbcefda8 Add API to limit blast radius of merge operator failure (#11092)
Summary:
Prior to this PR, `FullMergeV2()` can only return `false` to indicate failure, which causes any operation invoking it to fail. During a compaction, such a failure causes the compaction to fail and causes the DB to irreversibly enter read-only mode. Some users asked for a way to allow the merge operator to fail without such widespread damage.

To limit the blast radius of merge operator failures, this PR introduces the `MergeOperationOutput::op_failure_scope` API. When unpopulated (`kDefault`) or set to `kTryMerge`, the merge operator failure handling is the same as before. When set to `kMustMerge`, merge operator failure still causes failure to operations that must merge (`Get()`, iterator, `MultiGet()`, etc.). However, under `kMustMerge`, flushes/compactions can survive merge operator failures by outputting the unmerged input operands.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11092

Reviewed By: siying

Differential Revision: D42525673

Pulled By: ajkr

fbshipit-source-id: 951dc3bf190f86347dccf3381be967565cda52ee
2023-01-20 14:40:30 -08:00
Changyu Bi f515d9d203 Revert #10802 Consider range tombstone in compaction output file cutting (#11089)
Summary:
This reverts commit f02c708aa3 since it introduced several bugs (see https://github.com/facebook/rocksdb/issues/11078 and https://github.com/facebook/rocksdb/issues/11067 for attempts to fix them) and that I do not have a high confidence to fix all of them and ensure no further ones before the next release branch cut. There are also come existing issue found during bug fixing. We will work on it and try to merge it to the release after.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11089

Test Plan: existing CI.

Reviewed By: ajkr

Differential Revision: D42505972

Pulled By: cbi42

fbshipit-source-id: 2f66dcde6b85dc94977b317c2ce513872cfbc153
2023-01-13 12:28:21 -08:00
Changyu Bi f02c708aa3 Consider range tombstone in compaction output file cutting (#10802)
Summary:
This PR is the first step for Issue https://github.com/facebook/rocksdb/issues/4811. Currently compaction output files are cut at point keys, and the decision is made mainly in `CompactionOutputs::ShouldStopBefore()`. This makes it possible for range tombstones to cause large compactions that does not respect `max_compaction_bytes`. For example, we can have a large range tombstone that overlaps with too many files from the next level. Another example is when there is a gap between a range tombstone and another key. The first issue may be more acceptable, as a lot of data is deleted. This PR address the second issue by calling `ShouldStopBefore()` for range tombstone start keys. The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new `CompactionMergingIterator` is introduced and only used under `CompactionIterator` for this purpose. Further improvement after this PR include 1) cut compaction output at some grandparent boundary key instead of at the next point key or range tombstone start key and 2) cut compaction output file within a large range tombstone (it may be easier and reasonable to only do it for range tombstones at the end of a compaction output).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10802

Test Plan:
- added unit tests in db_range_del_test.
- stress test: `python3 tools/db_crashtest.py whitebox --[simple|enable_ts] --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=2 --writepercent=58 --readpercen=21 --duration=36000 --range_deletion_width=1000000`

Reviewed By: ajkr, jay-zhuang

Differential Revision: D40308827

Pulled By: cbi42

fbshipit-source-id: a8fd6f70a3f09d0ef7a40e006f6c964bba8c00df
2022-12-15 09:11:54 -08:00
Andrew Kryczka 097f9f4425 Fix CompactionIterator flag for penultimate level output (#10967)
Summary:
We were not resetting it in non-debug mode so it could be true once and then stay true for future keys where it should be false. This PR adds the reset logic.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10967

Test Plan:
- built `db_bench` with DEBUG_LEVEL=0
- ran benchmark: `TEST_TMPDIR=/dev/shm/prefix ./db_bench -benchmarks=fillrandom -compaction_style=1 -preserve_internal_time_seconds=100 -preclude_last_level_data_seconds=10 -write_buffer_size=1048576 -target_file_size_base=1048576 -subcompactions=8 -duration=120`
- compared "output_to_penultimate_level: X bytes + last: Y bytes" lines in LOG output
  - Before this fix, Y was always zero
  - After this fix, Y gradually increased throughout the benchmark

Reviewed By: riversand963

Differential Revision: D41417726

Pulled By: ajkr

fbshipit-source-id: ace1e9a289e751a5b0c2fbaa8addd4eda5525329
2022-11-21 16:14:03 -08:00
Yanqin Jin 7d26e4c5a3 Basic Support for Merge with user-defined timestamp (#10819)
Summary:
This PR implements the originally disabled `Merge()` APIs when user-defined timestamp is enabled.

Simplest usage:
```cpp
// assume string append merge op is used with '.' as delimiter.
// ts1 < ts2
db->Put(WriteOptions(), "key", ts1, "v0");
db->Merge(WriteOptions(), "key", ts2, "1");
ReadOptions ro;
ro.timestamp = &ts2;
db->Get(ro, "key", &value);
ASSERT_EQ("v0.1", value);
```

Some code comments are added for clarity.

Note: support for timestamp in `DB::GetMergeOperands()` will be done in a follow-up PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10819

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D40603195

Pulled By: riversand963

fbshipit-source-id: f96d6f183258f3392d80377025529f7660503013
2022-10-31 22:28:58 -07:00
Jay Zhuang b36ec37a4b clang-format for db/compaction (#10882)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10882

Reviewed By: riversand963

Differential Revision: D40724867

Pulled By: jay-zhuang

fbshipit-source-id: 7f387724f8cd07d8d2b90566a515a4e9078d21f1
2022-10-26 12:35:12 -07:00
Jay Zhuang 1663f77d2a Fix no internal time recorded for small preclude_last_level (#10829)
Summary:
When the `preclude_last_level_data_seconds` or
`preserve_internal_time_seconds` is smaller than 100 (seconds), no seqno->time information was recorded.
Also make sure all data will be compacted to the last level even if there's no write to record the time information.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10829

Test Plan: added unittest

Reviewed By: siying

Differential Revision: D40443934

Pulled By: jay-zhuang

fbshipit-source-id: 2ecf1361daf9f3e5c3385aee6dc924fa59e2813a
2022-10-20 17:11:38 -07:00
Jay Zhuang c401f285c3 Add option preserve_internal_time_seconds to preserve the time info (#10747)
Summary:
Add option `preserve_internal_time_seconds` to preserve the internal
time information.
It's mostly for the migration of the existing data to tiered storage (
`preclude_last_level_data_seconds`). When the tiering feature is just
enabled, the existing data won't have the time information to decide if
it's hot or cold. Enabling this feature will start collect and preserve
the time information for the new data.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10747

Reviewed By: siying

Differential Revision: D39910141

Pulled By: siying

fbshipit-source-id: 25c21638e37b1a7c44006f636b7d714fe7242138
2022-10-07 18:49:40 -07:00
Changyu Bi 9f2363f4c4 User-defined timestamp support for DeleteRange() (#10661)
Summary:
Add user-defined timestamp support for range deletion. The new API is `DeleteRange(opt, cf, begin_key, end_key, ts)`. Most of the change is to update the comparator to compare without timestamp. Other than that, major changes are
- internal range tombstone data structures (`FragmentedRangeTombstoneList`, `RangeTombstone`, etc.) to store timestamps.
- Garbage collection of range tombstones and range tombstone covered keys during compaction.
- Get()/MultiGet() to return the timestamp of a range tombstone when needed.
- Get/Iterator with range tombstones bounded by readoptions.timestamp.
- timestamp crash test now issues DeleteRange by default.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10661

Test Plan:
- Added unit test: `make check`
- Stress test: `python3 tools/db_crashtest.py --enable_ts whitebox --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
- Ran `db_bench` to measure regression when timestamp is not enabled. The tests are for write (with some range deletion) and iterate with DB fitting in memory: `./db_bench--benchmarks=fillrandom,seekrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=500000 --seek_nexts=10 --disable_auto_compactions -disable_wal=true --max_num_range_tombstones=1000`.  Did not see consistent regression in no timestamp case.

| micros/op | fillrandom | seekrandom |
| --- | --- | --- |
|main| 2.58 |10.96|
|PR 10661| 2.68 |10.63|

Reviewed By: riversand963

Differential Revision: D39441192

Pulled By: cbi42

fbshipit-source-id: f05aca3c41605caf110daf0ff405919f300ddec2
2022-09-30 16:13:03 -07:00
Changyu Bi fd165c869d Add memtable per key-value checksum (#10281)
Summary:
Append per key-value checksum to internal key. These checksums are verified on read paths including Get, Iterator and during Flush. Get and Iterator will return `Corruption` status if there is a checksum verification failure. Flush will make DB become read-only upon memtable entry checksum verification failure.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10281

Test Plan:
- Added new unit test cases: `make check`
- Benchmark on memtable insert
```
TEST_TMPDIR=/dev/shm/memtable_write ./db_bench -benchmarks=fillseq -disable_wal=true -max_write_buffer_number=100 -num=10000000 -min_write_buffer_number_to_merge=100

# avg over 10 runs
Baseline: 1166936 ops/sec
memtable 2 bytes kv checksum : 1.11674e+06 ops/sec (-4%)
memtable 2 bytes kv checksum + write batch 8 bytes kv checksum: 1.08579e+06 ops/sec (-6.95%)
write batch 8 bytes kv checksum: 1.17979e+06 ops/sec (+1.1%)
```
-  Benchmark on only memtable read: ops/sec dropped 31% for `readseq` due to time spend on verifying checksum.
ops/sec for `readrandom` dropped ~6.8%.
```
# Readseq
sudo TEST_TMPDIR=/dev/shm/memtable_read ./db_bench -benchmarks=fillseq,readseq"[-X20]" -disable_wal=true -max_write_buffer_number=100 -num=10000000 -min_write_buffer_number_to_merge=100

readseq [AVG    20 runs] : 7432840 (± 212005) ops/sec;  822.3 (± 23.5) MB/sec
readseq [MEDIAN 20 runs] : 7573878 ops/sec;  837.9 MB/sec

With -memtable_protection_bytes_per_key=2:

readseq [AVG    20 runs] : 5134607 (± 119596) ops/sec;  568.0 (± 13.2) MB/sec
readseq [MEDIAN 20 runs] : 5232946 ops/sec;  578.9 MB/sec

# Readrandom
sudo TEST_TMPDIR=/dev/shm/memtable_read ./db_bench -benchmarks=fillrandom,readrandom"[-X10]" -disable_wal=true -max_write_buffer_number=100 -num=1000000 -min_write_buffer_number_to_merge=100
readrandom [AVG    10 runs] : 140236 (± 3938) ops/sec;    9.8 (± 0.3) MB/sec
readrandom [MEDIAN 10 runs] : 140545 ops/sec;    9.8 MB/sec

With -memtable_protection_bytes_per_key=2:
readrandom [AVG    10 runs] : 130632 (± 2738) ops/sec;    9.1 (± 0.2) MB/sec
readrandom [MEDIAN 10 runs] : 130341 ops/sec;    9.1 MB/sec
```

- Stress test: `python3 -u tools/db_crashtest.py whitebox --duration=1800`

Reviewed By: ajkr

Differential Revision: D37607896

Pulled By: cbi42

fbshipit-source-id: fdaefb475629d2471780d4a5f5bf81b44ee56113
2022-08-12 13:51:32 -07:00
sherriiiliu 4753e5a2e9 Fix wrong value passed in compaction filter in BlobDB (#10391)
Summary:
New blobdb has a bug in compaction filter, where `blob_value_` is not reset for next iterated key. This will cause blob_value_ not empty and previous value read from blob is passed into the filter function for next key, even if its value is not in blob. Fixed by reseting regardless of key type.

Test Case:
Add `FilterByValueLength` test case in `DBBlobCompactionTest`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10391

Reviewed By: riversand963

Differential Revision: D38629900

Pulled By: ltamasi

fbshipit-source-id: 47d23ff2e5ec697958a210db9e6ceeb8b2fc49fa
2022-08-11 13:55:28 -07:00
Yanqin Jin fee2c472d0 Include minimal contextual information in CompactionIterator (#10505)
Summary:
The main purpose is to make debugging easier without sacrificing performance.

Instead of using a boolean variable for `CompactionIterator::valid_`, we can extend it to an `uint8_t`,
using the LSB to denote if the compaction iterator is valid and 4 additional bits to denote where
the iterator is set valid inside `NextFromInput()`. Therefore, when the control flow reaches
`PrepareOutput()` and hits assertion there, we can have a better idea of what has gone wrong.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10505

Test Plan:
make check
```
TEST_TMPDIR=/dev/shm/rocksdb time ./db_bench -compression_type=none -write_buffer_size=1073741824 -benchmarks=fillseq,flush
```
The above command has a 'flush' benchmark which uses `CompactionIterator`.  I haven't observed any CPU regression or drop in throughput or latency increase.

Reviewed By: ltamasi

Differential Revision: D38551615

Pulled By: riversand963

fbshipit-source-id: 1250848fc118bb753d71fa9ff8ba840df999f5e0
2022-08-09 17:07:24 -07:00
Jay Zhuang 3f763763aa Change bottommost_temperture to last_level_temperture (#10471)
Summary:
Change tiered compaction feature from `bottommost_temperture` to
`last_level_temperture`. The old option is kept for migration purpose only,
which is behaving the same as `last_level_temperture` and it will be removed in
the next release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10471

Test Plan: CI

Reviewed By: siying

Differential Revision: D38450621

Pulled By: jay-zhuang

fbshipit-source-id: cc1cdf8bad409376fec0152abc0a64fb72a91527
2022-08-08 14:36:34 -07:00
Levi Tamasi cc8ded6152 Do not put blobs read during compaction into cache (#10457)
Summary:
During compaction, blobs are currently read using the default
`ReadOptions`, which has the `fill_cache` flag set to true. Earlier,
this didn't make any difference since we didn't have a blob cache;
however, now we have to explicitly set this flag to false to avoid
polluting the cache during compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10457

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D38333528

Pulled By: ltamasi

fbshipit-source-id: 5b4d49a1e39543bee73c7df2aa9194fb101875e2
2022-08-01 19:49:05 -07:00
Jay Zhuang faa0f9723c Tiered compaction: integrate Seqno time mapping with per key placement (#10370)
Summary:
Using the Sequence number to time mapping to decide if a key is hot or not in
compaction and place it in the corresponding level.

Note: the feature is not complete, level compaction will run indefinitely until
all penultimate level data is cold and small enough to not trigger compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10370

Test Plan:
CI
* Run basic db_bench for universal compaction manually

Reviewed By: siying

Differential Revision: D37892338

Pulled By: jay-zhuang

fbshipit-source-id: 792bbd91b1ccc2f62b5d14c53118434bcaac4bbe
2022-07-15 19:01:30 -07:00
Jay Zhuang a3acf2ef87 Add seqno to time mapping (#10338)
Summary:
Which will be used for tiered storage to preclude hot data from
compacting to the cold tier (the last level).
Internally, adding seqno to time mapping. A periodic_task is scheduled
to record the current_seqno -> current_time in certain cadence. When
memtable flush, the mapping informaiton is stored in sstable property.
During compaction, the mapping information are merged and get the
approximate time of sequence number, which is used to determine if a key
is recently inserted or not and preclude it from the last level if it's
recently inserted (within the `preclude_last_level_data_seconds`).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10338

Test Plan: CI

Reviewed By: siying

Differential Revision: D37810187

Pulled By: jay-zhuang

fbshipit-source-id: 6953be7a18a99de8b1cb3b162d712f79c2b4899f
2022-07-14 21:49:34 -07:00
Jay Zhuang 6ce0b2ca34 Tiered Compaction: per key placement support (#9964)
Summary:
Support per_key_placement for last level compaction, which will
be used for tiered compaction.
* compaction iterator reports which level a key should output to;
* compaction get the output level information and check if it's safe to
  output the data to penultimate level;
* all compaction output files will be installed.
* extra internal compaction stats added for penultimate level.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9964

Test Plan:
* Unittest
* db_bench, no significate difference: https://gist.github.com/jay-zhuang/3645f8fb97ec0ab47c10704bb39fd6e4
* microbench manual compaction no significate difference: https://gist.github.com/jay-zhuang/ba679b3e89e24992615ee9eef310e6dd
* run the db_stress multiple times (not covering the new feature) looks good (internal: https://fburl.com/sandcastle/9w84pp2m)

Reviewed By: ajkr

Differential Revision: D36249494

Pulled By: jay-zhuang

fbshipit-source-id: a96da57c8031c1df83e4a7a8567b657a112b80a3
2022-07-13 20:54:49 -07:00
Levi Tamasi c73d2a9d18 Add API for writing wide-column entities (#10242)
Summary:
The patch builds on https://github.com/facebook/rocksdb/pull/9915 and adds
a new API called `PutEntity` that can be used to write a wide-column entity
to the database. The new API is added to both `DB` and `WriteBatch`. Note
that currently there is no way to retrieve these entities; more precisely, all
read APIs (`Get`, `MultiGet`, and iterator) return `NotSupported` when they
encounter a wide-column entity that is required to answer a query. Read-side
support (as well as other missing functionality like `Merge`, compaction filter,
and timestamp support) will be added in later PRs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10242

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D37369748

Pulled By: ltamasi

fbshipit-source-id: 7f5e412359ed7a400fd80b897dae5599dbcd685d
2022-06-25 15:30:47 -07:00
Yanqin Jin 4d31d3c2ed Abort in dbg mode after logging (#10183)
Summary:
In CompactionIterator code, there are multiple places where the process
will abort in dbg mode before logging the error message describing the
cause. This PR changes only the logging behavior for compaction iterator so
that error message is written to LOG before the process aborts in debug
mode.

Also updated the triggering condition for an assertion for single delete with
user-defined timestamp.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10183

Test Plan: make check

Reviewed By: akankshamahajan15

Differential Revision: D37190218

Pulled By: riversand963

fbshipit-source-id: 741bb007067be7cfbe94ac9e530ad4b2b339c009
2022-06-15 22:00:24 -07:00
zczhu 3ee6c9baec Consolidate manual_compaction_paused_ check (#10070)
Summary:
As pointed out by [https://github.com/facebook/rocksdb/pull/8351#discussion_r645765422](https://github.com/facebook/rocksdb/pull/8351#discussion_r645765422), check `manual_compaction_paused` and `manual_compaction_canceled` can be reduced by setting `*canceled` to be true in `DisableManualCompaction()` and `*canceled` to be false in the last time calling `EnableManualCompaction()`.

Changed Tests: The origin `DBTest2.PausingManualCompaction1` uses a callback function to increase `manual_compaction_paused` and the origin CompactionJob/CompactionIterator with `manual_compaction_paused` can detect this. I changed the callback function so that it sets `*canceled` as true if `canceled` is not `nullptr` (to notify CompactionJob/CompactionIterator the compaction has been canceled).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10070

Test Plan: This change does not introduce new features, but some slight difference in compaction implementation. Run the same manual compaction unit tests as before (e.g., PausingManualCompaction[1-4], CancelManualCompaction[1-2], CancelManualCompactionWithListener in db_test2, and db_compaction_test).

Reviewed By: ajkr

Differential Revision: D36949133

Pulled By: littlepig2013

fbshipit-source-id: c5dc4c956fbf8f624003a0f5ad2690240063a821
2022-06-06 18:32:26 -07:00
Yanqin Jin 3f263ef536 Add a temporary option for user to opt-out enforcement of SingleDelete contract (#9983)
Summary:
PR https://github.com/facebook/rocksdb/issues/9888 started to enforce the contract of single delete described in https://github.com/facebook/rocksdb/wiki/Single-Delete.

For some of existing use cases, it is desirable to have a transition during which compaction will not fail
if the contract is violated. Therefore, we add a temporary option `enforce_single_del_contracts` to allow
application to opt out from this new strict behavior. Once transition completes, the flag can be set to `true` again.

In a future release, the option will be removed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9983

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D36333672

Pulled By: riversand963

fbshipit-source-id: dcb703ea0ed08076a1422f1bfb9914afe3c2caa2
2022-05-16 15:44:59 -07:00
Yanqin Jin 9d634dd5b6 Rename kRemoveWithSingleDelete to kPurge (#9951)
Summary:
PR 9929 adds a new CompactionFilter::Decision, i.e.
kRemoveWithSingleDelete so that CompactionFilter can indicate to
CompactionIterator that a PUT can only be removed with SD. However, how
CompactionIterator handles such a key is implementation detail which
should not be implied in the public API. In fact,
such a PUT can just be dropped. This is an optimization which we will apply in the near future.

Discussion thread: https://github.com/facebook/rocksdb/pull/9929#discussion_r863198964

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9951

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D36156590

Pulled By: riversand963

fbshipit-source-id: 7b7d01f47bba4cad7d9cca6ca52984f27f88b372
2022-05-05 08:16:20 -07:00
Yanqin Jin 06394ff4e7 Fix a bug of CompactionIterator/CompactionFilter using Delete (#9929)
Summary:
When compaction filter determines that a key should be removed, it updates the internal key's type
to `Delete`. If this internal key is preserved in current compaction but seen by a later compaction
together with `SingleDelete`, it will cause compaction iterator to return Corruption.

To fix the issue, compaction filter should return more information in addition to the intention of removing
a key. Therefore, we add a new `kRemoveWithSingleDelete` to `CompactionFilter::Decision`. Seeing
`kRemoveWithSingleDelete`, compaction iterator will update the op type of the internal key to `kTypeSingleDelete`.

In addition, I updated db_stress_shared_state.[cc|h] so that `no_overwrite_ids_` becomes `const`. It is easier to
reason about thread-safety if accessed from multiple threads. This information is passed to `PrepareTxnDBOptions()`
when calling from `Open()` so that we can set up the rollback deletion type callback for transactions.

Finally, disable compaction filter for multiops_txn because the key removal logic of `DbStressCompactionFilter` does
not quite work with `MultiOpsTxnsStressTest`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9929

Test Plan:
make check
make crash_test
make crash_test_with_txn

Reviewed By: anand1976

Differential Revision: D36069678

Pulled By: riversand963

fbshipit-source-id: cedd2f1ba958af59ad3916f1ba6f424307955f92
2022-05-02 13:25:45 -07:00
Yanqin Jin 2b5c29f9f3 Enforce the contract of SingleDelete (#9888)
Summary:
Enforce the contract of SingleDelete so that they are not mixed with
Delete for the same key. Otherwise, it will lead to undefined behavior.
See https://github.com/facebook/rocksdb/wiki/Single-Delete#notes.

Also fix unit tests and write-unprepared.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9888

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D35837817

Pulled By: riversand963

fbshipit-source-id: acd06e4dcba8cb18df92b44ed18c57e10e5a7635
2022-04-28 14:48:27 -07:00
Levi Tamasi db536ee045 Propagate errors from UpdateBoundaries (#9851)
Summary:
In `FileMetaData`, we keep track of the lowest-numbered blob file
referenced by the SST file in question for the purposes of BlobDB's
garbage collection in the `oldest_blob_file_number` field, which is
updated in `UpdateBoundaries`. However, with the current code,
`BlobIndex` decoding errors (or invalid blob file numbers) are swallowed
in this method. The patch changes this by propagating these errors
and failing the corresponding flush/compaction. (Note that since blob
references are generated by the BlobDB code and also parsed by
`CompactionIterator`, in reality this can only happen in the case of
memory corruption.)

This change necessitated updating some unit tests that involved
fake/corrupt `BlobIndex` objects. Some of these just used a dummy string like
`"blob_index"` as a placeholder; these were replaced with real `BlobIndex`es.
Some were relying on the earlier behavior to simulate corruption; these
were replaced with `SyncPoint`-based test code that corrupts a valid
blob reference at read time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9851

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D35683671

Pulled By: ltamasi

fbshipit-source-id: f7387af9945c48e4d5c4cd864f1ba425c7ad51f6
2022-04-15 20:25:48 -07:00
Yanqin Jin 0bd4dcde6b CompactionIterator sees consistent view of which keys are committed (#9830)
Summary:
**This PR does not affect the functionality of `DB` and write-committed transactions.**

`CompactionIterator` uses `KeyCommitted(seq)` to determine if a key in the database is committed.
As the name 'write-committed' implies, if write-committed policy is used, a key exists in the database only if
it is committed. In fact, the implementation of `KeyCommitted()` is as follows:

```
inline bool KeyCommitted(SequenceNumber seq) {
  // For non-txn-db and write-committed, snapshot_checker_ is always nullptr.
  return snapshot_checker_ == nullptr ||
         snapshot_checker_->CheckInSnapshot(seq, kMaxSequence) == SnapshotCheckerResult::kInSnapshot;
}
```

With that being said, we focus on write-prepared/write-unprepared transactions.

A few notes:
- A key can exist in the db even if it's uncommitted. Therefore, we rely on `snapshot_checker_` to determine data visibility. We also require that all writes go through transaction API instead of the raw `WriteBatch` + `Write`, thus at most one uncommitted version of one user key can exist in the database.
- `CompactionIterator` outputs a key as long as the key is uncommitted.

Due to the above reasons, it is possible that `CompactionIterator` decides to output an uncommitted key without
doing further checks on the key (`NextFromInput()`). By the time the key is being prepared for output, the key becomes
committed because the `snapshot_checker_(seq, kMaxSequence)` becomes true in the implementation of `KeyCommitted()`.
Then `CompactionIterator` will try to zero its sequence number and hit assertion error if the key is a tombstone.

To fix this issue, we should make the `CompactionIterator` see a consistent view of the input keys. Note that
for write-prepared/write-unprepared, the background flush/compaction jobs already take a "job snapshot" before starting
processing keys. The job snapshot is released only after the entire flush/compaction finishes. We can use this snapshot
to determine whether a key is committed or not with minor change to `KeyCommitted()`.

```
inline bool KeyCommitted(SequenceNumber sequence) {
  // For non-txn-db and write-committed, snapshot_checker_ is always nullptr.
  return snapshot_checker_ == nullptr ||
         snapshot_checker_->CheckInSnapshot(sequence, job_snapshot_) ==
             SnapshotCheckerResult::kInSnapshot;
}
```

As a result, whether a key is committed or not will remain a constant throughout compaction, causing no trouble
for `CompactionIterator`s assertions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9830

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D35561162

Pulled By: riversand963

fbshipit-source-id: 0e00d200c195240341cfe6d34cbc86798b315b9f
2022-04-14 11:11:04 -07:00
Yanqin Jin 0ad9ee30ce Remove dead code (#9825)
Summary:
Options `preserve_deletes` and `iter_start_seqnum` have been removed since 7.0.

This PR removes dead code related to these two removed options.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9825

Test Plan: make check

Reviewed By: akankshamahajan15

Differential Revision: D35517950

Pulled By: riversand963

fbshipit-source-id: 86282ce5ec4087acb94a06a42a1b6d55b1715482
2022-04-11 10:26:55 -07:00
Levi Tamasi 320d9a8e8a Use a sorted vector instead of a map to store blob file metadata (#9526)
Summary:
The patch replaces `std::map` with a sorted `std::vector` for
`VersionStorageInfo::blob_files_` and preallocates the space
for the `vector` before saving the `BlobFileMetaData` into the
new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`.
These changes reduce the time the DB mutex is held while
saving new `Version`s, and using a sorted `vector` also makes
lookups faster thanks to better memory locality.

In addition, the patch introduces helper methods
`VersionStorageInfo::GetBlobFileMetaData` and
`VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by
clients to perform lookups in the `vector`, and does some general
cleanup in the parts of code where blob file metadata are used.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526

Test Plan:
Ran `make check` and the crash test script for a while.

Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced:

```
numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value>
```

Final statistics before the patch:

```
Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s
Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s
```

With the patch:

```
Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s
Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s
```

Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs.

Reviewed By: riversand963

Differential Revision: D34082728

Pulled By: ltamasi

fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
2022-02-09 12:36:43 -08:00
Levi Tamasi dc5de45af8 Support readahead during compaction for blob files (#9187)
Summary:
The patch adds a new BlobDB configuration option `blob_compaction_readahead_size`
that can be used to enable prefetching data from blob files during compaction.
This is important when using storage with higher latencies like HDDs or remote filesystems.
If enabled, prefetching is used for all cases when blobs are read during compaction,
namely garbage collection, compaction filters (when the existing value has to be read from
a blob file), and `Merge` (when the value of the base `Put` is stored in a blob file).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9187

Test Plan: Ran `make check` and the stress/crash test.

Reviewed By: riversand963

Differential Revision: D32565512

Pulled By: ltamasi

fbshipit-source-id: 87be9cebc3aa01cc227bec6b5f64d827b8164f5d
2021-11-19 17:53:47 -08:00
Yanqin Jin a113cecfc9 Fix a bug in timestamp-related GC (#9116)
Summary:
For multiple versions (ts + seq) of the same user key, if they cross the boundary of `full_history_ts_low_`,
we should retain the version that is visible to the `full_history_ts_low_`. Namely, we keep the internal key
with the largest timestamp smaller than `full_history_ts_low`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9116

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D32261514

Pulled By: riversand963

fbshipit-source-id: e10f47c254c04c05261440051e4f50cb7d95474e
2021-11-09 13:08:55 -08:00
Yanqin Jin 5237b39d2e Fix assertion error during compaction with write-prepared txn enabled (#9105)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9105

The user contract of SingleDelete is that: a SingleDelete can only be issued to
a key that exists and has NOT been updated. For example, application can insert
one key `key`, and uses a SingleDelete to delete it in the future. The `key`
cannot be updated or removed using Delete.
In reality, especially when write-prepared transaction is being used, things
can get tricky. For example, a prepared transaction already writes `key` to the
memtable after a successful Prepare(). Afterwards, should the transaction
rollback, it will insert a Delete into the memtable to cancel out the prior
Put. Consider the following sequence of operations.

```
// operation sequence 1
Begin txn
Put(key)
Prepare()
Flush()

Rollback txn
Flush()
```

There will be two SSTs resulting from above. One of the contains a PUT, while
the second one contains a Delete. It is also known that releasing a snapshot
can lead to an L0 containing only a SD for a particular key. Consider the
following operations following the above block.

```
// operation sequence 2
db->Put(key)
db->SingleDelete(key)
Flush()
```

The operation sequence 2 can result in an L0 with only the SD.

Should there be a snapshot for conflict checking created before operation
sequence 1, then an attempt to compact the db may hit the assertion failure
below, because ikey_.type is Delete (from a rollback).

```
else if (clear_and_output_next_key_) {
  assert(ikey_.type == kTypeValue || ikey_.type == kTypeBlobIndex);
}
```

To fix the assertion failure, we can skip the SingleDelete if we detect an
earlier Delete in the same snapshot interval.

Reviewed By: ltamasi

Differential Revision: D32056848

fbshipit-source-id: 23620a91e28562d91c45cf7e95f414b54b729748
2021-11-05 15:29:18 -07:00
Yanqin Jin 9b53f14a35 Fixed a bug in CompactionIterator when write-preared transaction is used (#9060)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9060

RocksDB bottommost level compaction may zero out an internal key's sequence if
the key's sequence is in the earliest_snapshot.
In write-prepared transaction, checking the visibility of a certain sequence in
a specific released snapshot may return a "snapshot released" result.
Therefore, it is possible, after a certain sequence of events, a PUT has its
sequence zeroed out, but a subsequent SingleDelete of the same key will still
be output with its original sequence. This violates the ascending order of
keys and leads to incorrect result.

The solution is to use an extra variable `last_key_seq_zeroed_` to track the
information about visibility in earliest snapshot. With this variable, we can
know for sure that a SingleDelete is in the earliest snapshot even if the said
snapshot is released during compaction before processing the SD.

Reviewed By: ltamasi

Differential Revision: D31813016

fbshipit-source-id: d8cff59d6f34e0bdf282614034aaea99be9174e1
2021-11-03 15:55:00 -07:00
Yanqin Jin fdf2a0d7eb Fix a compaction bug for write-prepared txn (#9061)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9061

In write-prepared txn, checking a sequence's visibility in a released (old)
snapshot may return "Snapshot released". Suppose we have two snapshots:

```
earliest_snap < earliest_write_conflict_snap
```

If we release `earliest_write_conflict_snap` but keep `earliest_snap` during
bottommost level compaction, then it is possible that certain sequence of
events can lead to a PUT being seq-zeroed followed by a SingleDelete of the
same key. This violates the ascending order of keys, and will cause data
inconsistency.

Reviewed By: ltamasi

Differential Revision: D31813017

fbshipit-source-id: dc68ba2541d1228489b93cf3edda5f37ed06f285
2021-10-29 15:23:17 -07:00
mrambacher 13ae16c315 Cleanup includes in dbformat.h (#8930)
Summary:
This header file was including everything and the kitchen sink when it did not need to.  This resulted in many places including this header when they needed other pieces instead.

Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing.

Hopefully, this sort of code hygiene cleanup will speed up the builds...

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930

Reviewed By: pdillinger

Differential Revision: D31142788

Pulled By: mrambacher

fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d
2021-09-29 04:04:40 -07:00
Akanksha Mahajan 78afb4d81e Support SingleDelete for user-defined timestamps (#8921)
Summary:
Added support for SingleDelete for user-defined timestamps. Users can now Get and Iterate over keys deleted with SingleDelete. It also includes changes in CompactionIterator which  preserves the same user key with different timestamps, unless the timestamp is below a certain threshold full_history_ts_low.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8921

Test Plan: Added new unit tests

Reviewed By: riversand963

Differential Revision: D31098191

Pulled By: akankshamahajan15

fbshipit-source-id: 78a59ef4b4884ae324fcd10f56e62a27d5ee2f49
2021-09-27 11:51:07 -07:00
Yanqin Jin 2b367fa8cc Fix bug caused by releasing snapshot(s) during compaction (#8608)
Summary:
In debug mode, we are seeing assertion failure as follows

```
db/compaction/compaction_iterator.cc:980: void rocksdb::CompactionIterator::PrepareOutput(): \
Assertion `ikey_.type != kTypeDeletion && ikey_.type != kTypeSingleDeletion' failed.
```

It is caused by releasing earliest snapshot during compaction between the execution of
`NextFromInput()` and `PrepareOutput()`.

In one case, as demonstrated in unit test `WritePreparedTransaction.ReleaseEarliestSnapshotDuringCompaction_WithSD2`,
incorrect result may be returned by a following range scan if we disable assertion, as in opt compilation
level: the SingleDelete marker's sequence number is zeroed out, but the preceding PUT is also
outputted to the SST file after compaction. Due to the logic of DBIter, the PUT will not be
skipped and will be returned by iterator in range scan. https://github.com/facebook/rocksdb/issues/8661 illustrates what happened.

Fix by taking a more conservative approach: make compaction zero out sequence number only
if key is in the earliest snapshot when the compaction starts.

Another assertion failure is
```
Assertion `current_user_key_snapshot_ == last_snapshot' failed.
```

It's caused by releasing the snapshot between the PUT and SingleDelete during compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8608

Test Plan: make check

Reviewed By: jay-zhuang

Differential Revision: D30145645

Pulled By: riversand963

fbshipit-source-id: 699f58e66faf70732ad53810ccef43935d3bbe81
2021-08-17 22:14:20 -07:00
Levi Tamasi 6878cedcc3 Add statistics support to integrated BlobDB (#8667)
Summary:
The patch adds statistics support to the integrated BlobDB implementation,
namely the tickers `BLOB_DB_BLOB_FILE_BYTES_READ` and
`BLOB_DB_GC_{NUM_KEYS,BYTES}_RELOCATED`, and the histograms
`BLOB_DB_(DE)COMPRESSION_MICROS`. (Some other statistics, like
`BLOB_DB_BLOB_FILE_BYTES_WRITTEN`, `BLOB_DB_BLOB_FILE_SYNCED`,
`BLOB_DB_BLOB_FILE_{READ,WRITE,SYNC}_MICROS` were already supported.)
Note that the vast majority of the old BlobDB's tickers/histograms are not
really applicable to the new implementation, since they e.g. pertain to calling
dedicated BlobDB APIs (which the integrated BlobDB does not have) or are
tied to the legacy BlobDB's design of writing blob files synchronously when
a write API is called. Such statistics are marked "legacy BlobDB only" in
`statistics.h`.

Fixes https://github.com/facebook/rocksdb/issues/8645 .

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8667

Test Plan: Ran `make check` and tested the new statistics using `db_bench`.

Reviewed By: riversand963

Differential Revision: D30356884

Pulled By: ltamasi

fbshipit-source-id: 5f8a833faee60401c5643c2f0a6c0415488190a4
2021-08-17 17:22:31 -07:00
jimmycleary e0ff365a76 Replace macros in compaction_iterator.cc with inline functions (#8592)
Summary:
Internal task T96186510.

Created new inline member functions in `CompactionIterator`,
`DefinitelyInSnapshot`, `DefinitelyNotInSnapshot`, and
`InEarliestSnapshot` to replace the macros at the top of
`compaction_iterator.cc`.

Placed the definitions in `compaction_iterator.h` in accordance with
Google's style guide for inline functions. Separated the declarations
and definitions, and only placed the `inline` keyword on the
definitions, in line with ISO CPP recommendations.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8592

Test Plan: Ran `make check`.  Successful build and all tests appeared to pass.

Reviewed By: riversand963

Differential Revision: D29966782

Pulled By: jimmycFB

fbshipit-source-id: 3584290bbbabf862e9ab58852281f46d37f58be6
2021-07-28 14:53:29 -07:00