rocksdb

mirror of https://github.com/facebook/rocksdb.git synced 2024-11-26 07:30:54 +00:00

Author	SHA1	Message	Date
Yu Zhang	c73cf7a878	Add CompactForTieringCollector to support automatically trigger compaction for tiering use case (#12760 ) Summary: This PR adds user property collector factory `CompactForTieringCollectorFactory` to support observe SST file and mark it as need compaction for fast tracking data to the proper tier. A triggering ratio `compaction_trigger_ratio_` can be configured to achieve the following: 1) Setting the ratio to be equal to or smaller than 0 disables this collector 2) Setting the ratio to be within (0, 1] will write the number of observed eligible entries into a user property and marks a file as need-compaction when aforementioned condition is met. 3) Setting the ratio to be higher than 1 can be used to just writes the user table property, and not mark any file as need compaction. For a column family that does not enable tiering feature, even if an effective configuration is provided, this collector is still disabled. For a file that is already on the last level, this collector is also disabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12760 Test Plan: Added unit tests Reviewed By: pdillinger Differential Revision: D58734976 Pulled By: jowlyzhang fbshipit-source-id: 6daab2c4f62b5c6689c3c03e3b3907bbbe6b7a81	2024-06-18 10:51:29 -07:00
Yu Zhang	9dc171e3bb	Fix issue that cause false alarm corruption report (#12626 ) Summary: The state of `saved_seq_for_penul_check_` is not correctly maintained with the current flow. It's supposed to store the original sequence number for a `kTypeValuePreferredSeqno` entry for use in the `DecideOutputLevel` function. However, it's not always properly cleared. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12626 Test Plan: Added unit test that would fail before the fix ./tiered_compaction_test --gtest_filter="InterleavedTimedPutAndPut" Reviewed By: pdillinger Differential Revision: D57123469 Pulled By: jowlyzhang fbshipit-source-id: 8d73214b3b6dc152daf19b6bd6ee9063581dc277	2024-05-08 15:51:38 -07:00
Yu Zhang	2c02a9b76f	Preserve TimedPut on penultimate level until it actually expires (#12543 ) Summary: To make sure `TimedPut` are placed on proper tier before and when it becomes eligible for cold tier 1) flush and compaction need to keep relevant seqno to time mapping for not just the sequence number contained in internal keys, but also preferred sequence number for `TimedPut` entries. This PR also fix some bugs in for handling `TimedPut` during compaction: 1) dealing with an edge case when a `TimedPut` entry's internal key is the right bound for penultimate level, the internal key after swapping in its preferred sequence number will fall outside of the penultimate range because preferred sequence number is smaller than its original sequence number. The entry however is still safe to be placed on penultimate level, so we keep track of `TimedPut` entry's original sequence number for this check. The idea behind this is that as long as it's safe for the original key to be placed on penultimate level, it's safe for the entry with swapped preferred sequence number to be placed on penultimate level too. Because we only swap in preferred sequence number when that entry is visible to the earliest snapshot and there is no other data points with the same user key in lower levels. On the other hand, as long as it's not safe for the original key to be placed on penultimate level, we will not place the entry after swapping the preferred seqno on penultimate level either. 2) the assertion that preferred seqno is always bigger than original sequence number may fail if this logic is only exercised after sequence number is zeroed out. We adjust the assertion to handle that case too. In this case, we don't swap in the preferred seqno but will adjust the its type to `kTypeValue`. 3) there was a special case handling for when range deletion may end up incorrectly covering an entry if preferred seqno is swapped in. But it missed the case that if the original entry is already covered by range deletion. The original handling will mistakenly output the entry instead of omitting it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12543 Test Plan: ./tiered_compaction_test --gtest_filter="PrecludeLastLevelTest.PreserveTimedPutOnPenultimateLevel" ./compaction_iterator_test --gtest_filter="TimedPut" Reviewed By: pdillinger Differential Revision: D56195096 Pulled By: jowlyzhang fbshipit-source-id: 37ebb09d2513abbd9e90cda0217e26874584b8f3	2024-04-30 11:16:02 -07:00
Yu Zhang	1104eaa35e	Add initial support for TimedPut API (#12419 ) Summary: This PR adds support for `TimedPut` API. We introduced a new type `kTypeValuePreferredSeqno` for entries added to the DB via the `TimedPut` API. The life cycle of such an entry on the write/flush/compaction paths are: 1) It is initially added to memtable as: `<user_key, seq, kTypeValuePreferredSeqno>: {value, write_unix_time}` 2) When it's flushed to L0 sst files, it's converted to: `<user_key, seq, kTypeValuePreferredSeqno>: {value, preferred_seqno}` when we have easy access to the seqno to time mapping. 3) During compaction, if certain conditions are met, we swap in the `preferred_seqno` and the entry will become: `<user_key, preferred_seqno, kTypeValue>: value`. This step helps fast track these entries to the cold tier if they are eligible after the sequence number swap. On the read path: A `kTypeValuePreferredSeqno` entry acts the same as a `kTypeValue` entry, the unix_write_time/preferred seqno part packed in value is completely ignored. Needed follow ups: 1) The seqno to time mapping accessible in flush needs to be extended to cover the `write_unix_time` for possible `kTypeValuePreferredSeqno` entries. This also means we need to track these `write_unix_time` in memtable. 2) Compaction filter support for the new `kTypeValuePreferredSeqno` type for feature parity with other `kTypeValue` and equivalent types. 3) Stress test coverage for the feature Pull Request resolved: https://github.com/facebook/rocksdb/pull/12419 Test Plan: Added unit tests Reviewed By: pdillinger Differential Revision: D54920296 Pulled By: jowlyzhang fbshipit-source-id: c8b43f7a7c465e569141770e93c748371ff1da9e	2024-03-14 15:44:55 -07:00
Changyu Bi	4e58cc6437	Check internal key range when compacting from last level to penultimate level (#12063 ) Summary: The test failure in https://github.com/facebook/rocksdb/issues/11909 shows that we may compact keys outside of internal key range of penultimate level input files from last level to penultimate level, which can potentially cause overlapping files in the penultimate level. This PR updates the `Compaction::WithinPenultimateLevelOutputRange()` to check internal key range instead of user key. Other fixes: * skip range del sentinels when deciding output level for tiered compaction Pull Request resolved: https://github.com/facebook/rocksdb/pull/12063 Test Plan: - existing unit tests - apply the fix to https://github.com/facebook/rocksdb/issues/11905 and run `./tiered_compaction_test --gtest_filter="RangeDelsCauseFileEndpointsToOverlap"` Reviewed By: ajkr Differential Revision: D51288985 Pulled By: cbi42 fbshipit-source-id: 70085db5f5c3b15300bcbc39057d57b83fd9902a	2023-11-17 10:50:40 -08:00
Levi Tamasi	1e63fc9925	Add a helper method WideColumnsHelper::SortColumns (#11823 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11823 Similarly to https://github.com/facebook/rocksdb/pull/11813, the patch is a small refactoring that eliminates some copy-paste around sorting the columns of entities by column name. Reviewed By: jaykorean Differential Revision: D49195504 fbshipit-source-id: d48c9f290e3203f838cc5949856c469ecf730008	2023-09-12 12:36:07 -07:00
Changyu Bi	6a0f637633	Compare the number of input keys and processed keys for compactions (#11571 ) Summary: ... to improve data integrity validation during compaction. A new option `compaction_verify_record_count` is introduced for this verification and is enabled by default. One exception when the verification is not done is when a compaction filter returns kRemoveAndSkipUntil which can cause CompactionIterator to seek until some key and hence not able to keep track of the number of keys processed. For expected number of input keys, we sum over the number of total keys - number of range tombstones across compaction input files (`CompactionJob::UpdateCompactionStats()`). Table properties are consulted if `FileMetaData` is not initialized for some input file. Since table properties for all input files were also constructed during `DBImpl::NotifyOnCompactionBegin()`, `Compaction::GetTableProperties()` is introduced to reduce duplicated code. For actual number of keys processed, each subcompaction will record its number of keys processed to `sub_compact->compaction_job_stats.num_input_records` and aggregated when all subcompactions finish (`CompactionJob::AggregateCompactionStats()`). In the case when some subcompaction encountered kRemoveAndSkipUntil from compaction filter and does not have accurate count, it propagates this information through `sub_compact->compaction_job_stats.has_num_input_records`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11571 Test Plan: * Add a new unit test `DBCompactionTest.VerifyRecordCount` for the corruption case. * All other unit tests for non-corrupted case. * Ran crash test for a few hours: `python3 ./tools/db_crashtest.py whitebox --simple` Reviewed By: ajkr Differential Revision: D47131965 Pulled By: cbi42 fbshipit-source-id: cc8e94565dd526c4347e9d3843ecf32f6727af92	2023-07-28 09:47:31 -07:00
Hui Xiao	151242ce46	Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288 ) Summary: Context: The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them. Summary - Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros` - Fixed the default histogram in `RandomAccessFileReader` - New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader` - Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288 Test Plan: - Stress test - Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's. (without blob) - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads. ``` ./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10) ``` ``` // BlockBasedTable rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805 rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116 rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689 // PlainTable Does not apply ``` - Db bench 2: performance Read SETUP: db with 900 files ``` ./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none ```run till convergence ``` ./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3 ``` Pre-change `readrandom [AVG 60 runs] : 21568 (± 248) ops/sec` Post-change (no regression, -0.3%) `readrandom [AVG 60 runs] : 21486 (± 236) ops/sec` Compaction/Flushrun till convergence ``` ./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none rocksdb.sst.read.micros COUNT : 33820 rocksdb.sst.read.flush.micros COUNT : 1800 rocksdb.sst.read.compaction.micros COUNT : 32020 ``` Pre-change `fillseq [AVG 46 runs] : 1391 (± 214) ops/sec; 0.7 (± 0.1) MB/sec` Post-change (no regression, ~-0.4%) `fillseq [AVG 46 runs] : 1385 (± 216) ops/sec; 0.7 (± 0.1) MB/sec` Reviewed By: ajkr Differential Revision: D44007011 Pulled By: hx235 fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132	2023-04-21 09:07:18 -07:00
Changyu Bi	b3c43a5b99	Drain unnecessary levels when `level_compaction_dynamic_level_bytes=true` (#11340 ) Summary: When a user migrates to level compaction + `level_compaction_dynamic_level_bytes=true`, or when a DB shrinks, there can be unnecessary levels in the DB. Before this PR, this is no way to remove these levels except a manual compaction. These extra unnecessary levels make it harder to guarantee max_bytes_for_level_multiplier and can cause extra space amp. This PR boosts compaction score for these levels to allow RocksDB to automatically drain these levels. Together with https://github.com/facebook/rocksdb/issues/11321, this makes migration to `level_compaction_dynamic_level_bytes=true` automatic without needing user to do a one time full manual compaction. Credit: this PR is modified from https://github.com/facebook/rocksdb/issues/3921. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11340 Test Plan: - New unit tests - `python3 tools/db_crashtest.py whitebox --simple` which randomly sets level_compaction_dynamic_level_bytes in each run. Reviewed By: ajkr Differential Revision: D44563884 Pulled By: cbi42 fbshipit-source-id: e20d3620bd73dff22be18c5a91a07f340740bcc8	2023-04-06 11:20:43 -07:00
Changyu Bi	229297d1b8	Refactor AddRangeDels() + consider range tombstone during compaction file cutting (#11113 ) Summary: A second attempt after https://github.com/facebook/rocksdb/issues/10802, with bug fixes and refactoring. This PR updates compaction logic to take range tombstones into account when determining whether to cut the current compaction output file (https://github.com/facebook/rocksdb/issues/4811). Before this change, only point keys were considered, and range tombstones could cause large compactions. For example, if the current compaction outputs is a range tombstone [a, b) and 2 point keys y, z, they would be added to the same file, and may overlap with too many files in the next level and cause a large compaction in the future. This PR also includes ajkr's effort to simplify the logic to add range tombstones to compaction output files in `AddRangeDels()` ([https://github.com/facebook/rocksdb/issues/11078](https://github.com/facebook/rocksdb/pull/11078#issuecomment-1386078861)). The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new class `CompactionMergingIterator` is introduced to replace `MergingIterator` under `CompactionIterator` to enable emitting of range tombstone start keys. Further improvement after this PR include cutting compaction output at some grandparent boundary key (instead of the next output key) when cutting within a range tombstone to reduce overlap with grandparents. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11113 Test Plan: * added unit test in db_range_del_test * crash test with a small key range: `python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=600 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10 --use_multiget=1 --delpercent=3 --delrangepercent=2 --verify_iterator_with_expected_state_one_in=2 --num_iterations=10` Reviewed By: ajkr Differential Revision: D42655709 Pulled By: cbi42 fbshipit-source-id: 8367e36ef5640e8f21c14a3855d4a8d6e360a34c	2023-02-22 12:28:18 -08:00
Levi Tamasi	876d281592	Add compaction filter support for wide-column entities (#11196 ) Summary: The patch adds compaction filter support for wide-column entities by introducing a new `CompactionFilter` API called `FilterV3`. This API is called for regular key-values, merge operands, and wide-column entities as well. It is passed the existing value/operand or wide-column structure and it can update the value or columns or keep/delete/etc. the key-value as usual. For compatibility, the default implementation of `FilterV3` keeps all wide-column entities and falls back to calling `FilterV2` for plain old key-values and merge operands. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11196 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D43094147 Pulled By: ltamasi fbshipit-source-id: 75acabe9a35254f7f404ba6173ee9c2774382ebd	2023-02-07 16:17:39 -08:00
Levi Tamasi	df680b24ef	Clean up InvokeFilterIfNeeded a bit (#11174 ) Summary: The patch makes some code quality enhancements in `CompactionIterator::InvokeFilterIfNeeded` including the renaming of `filter` (which is most likely a remnant of the days before the `FilterV2` API when the compaction filter used to return a boolean) to `decision`, the removal of some outdated comments, the elimination of an `error` flag which was only used in one failure case out of many, as well as some small stylistic improvements. (Some the above will also come in handy when adding compaction filter support for wide-column entities.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11174 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D42901408 Pulled By: ltamasi fbshipit-source-id: ab382d59a4990c5dfe1cee219d49e1d80902b666	2023-02-01 10:03:07 -08:00
Andrew Kryczka	b7fbcefda8	Add API to limit blast radius of merge operator failure (#11092 ) Summary: Prior to this PR, `FullMergeV2()` can only return `false` to indicate failure, which causes any operation invoking it to fail. During a compaction, such a failure causes the compaction to fail and causes the DB to irreversibly enter read-only mode. Some users asked for a way to allow the merge operator to fail without such widespread damage. To limit the blast radius of merge operator failures, this PR introduces the `MergeOperationOutput::op_failure_scope` API. When unpopulated (`kDefault`) or set to `kTryMerge`, the merge operator failure handling is the same as before. When set to `kMustMerge`, merge operator failure still causes failure to operations that must merge (`Get()`, iterator, `MultiGet()`, etc.). However, under `kMustMerge`, flushes/compactions can survive merge operator failures by outputting the unmerged input operands. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11092 Reviewed By: siying Differential Revision: D42525673 Pulled By: ajkr fbshipit-source-id: 951dc3bf190f86347dccf3381be967565cda52ee	2023-01-20 14:40:30 -08:00
Changyu Bi	f515d9d203	Revert #10802 Consider range tombstone in compaction output file cutting (#11089 ) Summary: This reverts commit `f02c708aa3` since it introduced several bugs (see https://github.com/facebook/rocksdb/issues/11078 and https://github.com/facebook/rocksdb/issues/11067 for attempts to fix them) and that I do not have a high confidence to fix all of them and ensure no further ones before the next release branch cut. There are also come existing issue found during bug fixing. We will work on it and try to merge it to the release after. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11089 Test Plan: existing CI. Reviewed By: ajkr Differential Revision: D42505972 Pulled By: cbi42 fbshipit-source-id: 2f66dcde6b85dc94977b317c2ce513872cfbc153	2023-01-13 12:28:21 -08:00
Changyu Bi	f02c708aa3	Consider range tombstone in compaction output file cutting (#10802 ) Summary: This PR is the first step for Issue https://github.com/facebook/rocksdb/issues/4811. Currently compaction output files are cut at point keys, and the decision is made mainly in `CompactionOutputs::ShouldStopBefore()`. This makes it possible for range tombstones to cause large compactions that does not respect `max_compaction_bytes`. For example, we can have a large range tombstone that overlaps with too many files from the next level. Another example is when there is a gap between a range tombstone and another key. The first issue may be more acceptable, as a lot of data is deleted. This PR address the second issue by calling `ShouldStopBefore()` for range tombstone start keys. The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new `CompactionMergingIterator` is introduced and only used under `CompactionIterator` for this purpose. Further improvement after this PR include 1) cut compaction output at some grandparent boundary key instead of at the next point key or range tombstone start key and 2) cut compaction output file within a large range tombstone (it may be easier and reasonable to only do it for range tombstones at the end of a compaction output). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10802 Test Plan: - added unit tests in db_range_del_test. - stress test: `python3 tools/db_crashtest.py whitebox --[simple\|enable_ts] --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=2 --writepercent=58 --readpercen=21 --duration=36000 --range_deletion_width=1000000` Reviewed By: ajkr, jay-zhuang Differential Revision: D40308827 Pulled By: cbi42 fbshipit-source-id: a8fd6f70a3f09d0ef7a40e006f6c964bba8c00df	2022-12-15 09:11:54 -08:00
Andrew Kryczka	097f9f4425	Fix CompactionIterator flag for penultimate level output (#10967 ) Summary: We were not resetting it in non-debug mode so it could be true once and then stay true for future keys where it should be false. This PR adds the reset logic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10967 Test Plan: - built `db_bench` with DEBUG_LEVEL=0 - ran benchmark: `TEST_TMPDIR=/dev/shm/prefix ./db_bench -benchmarks=fillrandom -compaction_style=1 -preserve_internal_time_seconds=100 -preclude_last_level_data_seconds=10 -write_buffer_size=1048576 -target_file_size_base=1048576 -subcompactions=8 -duration=120` - compared "output_to_penultimate_level: X bytes + last: Y bytes" lines in LOG output - Before this fix, Y was always zero - After this fix, Y gradually increased throughout the benchmark Reviewed By: riversand963 Differential Revision: D41417726 Pulled By: ajkr fbshipit-source-id: ace1e9a289e751a5b0c2fbaa8addd4eda5525329	2022-11-21 16:14:03 -08:00
Yanqin Jin	7d26e4c5a3	Basic Support for Merge with user-defined timestamp (#10819 ) Summary: This PR implements the originally disabled `Merge()` APIs when user-defined timestamp is enabled. Simplest usage: ```cpp // assume string append merge op is used with '.' as delimiter. // ts1 < ts2 db->Put(WriteOptions(), "key", ts1, "v0"); db->Merge(WriteOptions(), "key", ts2, "1"); ReadOptions ro; ro.timestamp = &ts2; db->Get(ro, "key", &value); ASSERT_EQ("v0.1", value); ``` Some code comments are added for clarity. Note: support for timestamp in `DB::GetMergeOperands()` will be done in a follow-up PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10819 Test Plan: make check Reviewed By: ltamasi Differential Revision: D40603195 Pulled By: riversand963 fbshipit-source-id: f96d6f183258f3392d80377025529f7660503013	2022-10-31 22:28:58 -07:00
Jay Zhuang	b36ec37a4b	clang-format for db/compaction (#10882 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10882 Reviewed By: riversand963 Differential Revision: D40724867 Pulled By: jay-zhuang fbshipit-source-id: 7f387724f8cd07d8d2b90566a515a4e9078d21f1	2022-10-26 12:35:12 -07:00
Jay Zhuang	1663f77d2a	Fix no internal time recorded for small preclude_last_level (#10829 ) Summary: When the `preclude_last_level_data_seconds` or `preserve_internal_time_seconds` is smaller than 100 (seconds), no seqno->time information was recorded. Also make sure all data will be compacted to the last level even if there's no write to record the time information. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10829 Test Plan: added unittest Reviewed By: siying Differential Revision: D40443934 Pulled By: jay-zhuang fbshipit-source-id: 2ecf1361daf9f3e5c3385aee6dc924fa59e2813a	2022-10-20 17:11:38 -07:00
Jay Zhuang	c401f285c3	Add option `preserve_internal_time_seconds` to preserve the time info (#10747 ) Summary: Add option `preserve_internal_time_seconds` to preserve the internal time information. It's mostly for the migration of the existing data to tiered storage ( `preclude_last_level_data_seconds`). When the tiering feature is just enabled, the existing data won't have the time information to decide if it's hot or cold. Enabling this feature will start collect and preserve the time information for the new data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10747 Reviewed By: siying Differential Revision: D39910141 Pulled By: siying fbshipit-source-id: 25c21638e37b1a7c44006f636b7d714fe7242138	2022-10-07 18:49:40 -07:00
Changyu Bi	9f2363f4c4	User-defined timestamp support for `DeleteRange()` (#10661 ) Summary: Add user-defined timestamp support for range deletion. The new API is `DeleteRange(opt, cf, begin_key, end_key, ts)`. Most of the change is to update the comparator to compare without timestamp. Other than that, major changes are - internal range tombstone data structures (`FragmentedRangeTombstoneList`, `RangeTombstone`, etc.) to store timestamps. - Garbage collection of range tombstones and range tombstone covered keys during compaction. - Get()/MultiGet() to return the timestamp of a range tombstone when needed. - Get/Iterator with range tombstones bounded by readoptions.timestamp. - timestamp crash test now issues DeleteRange by default. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10661 Test Plan: - Added unit test: `make check` - Stress test: `python3 tools/db_crashtest.py --enable_ts whitebox --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4` - Ran `db_bench` to measure regression when timestamp is not enabled. The tests are for write (with some range deletion) and iterate with DB fitting in memory: `./db_bench--benchmarks=fillrandom,seekrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=500000 --seek_nexts=10 --disable_auto_compactions -disable_wal=true --max_num_range_tombstones=1000`. Did not see consistent regression in no timestamp case. \| micros/op \| fillrandom \| seekrandom \| \| --- \| --- \| --- \| \|main\| 2.58 \|10.96\| \|PR 10661\| 2.68 \|10.63\| Reviewed By: riversand963 Differential Revision: D39441192 Pulled By: cbi42 fbshipit-source-id: f05aca3c41605caf110daf0ff405919f300ddec2	2022-09-30 16:13:03 -07:00
Changyu Bi	fd165c869d	Add memtable per key-value checksum (#10281 ) Summary: Append per key-value checksum to internal key. These checksums are verified on read paths including Get, Iterator and during Flush. Get and Iterator will return `Corruption` status if there is a checksum verification failure. Flush will make DB become read-only upon memtable entry checksum verification failure. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10281 Test Plan: - Added new unit test cases: `make check` - Benchmark on memtable insert ``` TEST_TMPDIR=/dev/shm/memtable_write ./db_bench -benchmarks=fillseq -disable_wal=true -max_write_buffer_number=100 -num=10000000 -min_write_buffer_number_to_merge=100 # avg over 10 runs Baseline: 1166936 ops/sec memtable 2 bytes kv checksum : 1.11674e+06 ops/sec (-4%) memtable 2 bytes kv checksum + write batch 8 bytes kv checksum: 1.08579e+06 ops/sec (-6.95%) write batch 8 bytes kv checksum: 1.17979e+06 ops/sec (+1.1%) ``` - Benchmark on only memtable read: ops/sec dropped 31% for `readseq` due to time spend on verifying checksum. ops/sec for `readrandom` dropped ~6.8%. ``` # Readseq sudo TEST_TMPDIR=/dev/shm/memtable_read ./db_bench -benchmarks=fillseq,readseq"[-X20]" -disable_wal=true -max_write_buffer_number=100 -num=10000000 -min_write_buffer_number_to_merge=100 readseq [AVG 20 runs] : 7432840 (± 212005) ops/sec; 822.3 (± 23.5) MB/sec readseq [MEDIAN 20 runs] : 7573878 ops/sec; 837.9 MB/sec With -memtable_protection_bytes_per_key=2: readseq [AVG 20 runs] : 5134607 (± 119596) ops/sec; 568.0 (± 13.2) MB/sec readseq [MEDIAN 20 runs] : 5232946 ops/sec; 578.9 MB/sec # Readrandom sudo TEST_TMPDIR=/dev/shm/memtable_read ./db_bench -benchmarks=fillrandom,readrandom"[-X10]" -disable_wal=true -max_write_buffer_number=100 -num=1000000 -min_write_buffer_number_to_merge=100 readrandom [AVG 10 runs] : 140236 (± 3938) ops/sec; 9.8 (± 0.3) MB/sec readrandom [MEDIAN 10 runs] : 140545 ops/sec; 9.8 MB/sec With -memtable_protection_bytes_per_key=2: readrandom [AVG 10 runs] : 130632 (± 2738) ops/sec; 9.1 (± 0.2) MB/sec readrandom [MEDIAN 10 runs] : 130341 ops/sec; 9.1 MB/sec ``` - Stress test: `python3 -u tools/db_crashtest.py whitebox --duration=1800` Reviewed By: ajkr Differential Revision: D37607896 Pulled By: cbi42 fbshipit-source-id: fdaefb475629d2471780d4a5f5bf81b44ee56113	2022-08-12 13:51:32 -07:00
sherriiiliu	4753e5a2e9	Fix wrong value passed in compaction filter in BlobDB (#10391 ) Summary: New blobdb has a bug in compaction filter, where `blob_value_` is not reset for next iterated key. This will cause blob_value_ not empty and previous value read from blob is passed into the filter function for next key, even if its value is not in blob. Fixed by reseting regardless of key type. Test Case: Add `FilterByValueLength` test case in `DBBlobCompactionTest` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10391 Reviewed By: riversand963 Differential Revision: D38629900 Pulled By: ltamasi fbshipit-source-id: 47d23ff2e5ec697958a210db9e6ceeb8b2fc49fa	2022-08-11 13:55:28 -07:00
Yanqin Jin	fee2c472d0	Include minimal contextual information in `CompactionIterator` (#10505 ) Summary: The main purpose is to make debugging easier without sacrificing performance. Instead of using a boolean variable for `CompactionIterator::valid_`, we can extend it to an `uint8_t`, using the LSB to denote if the compaction iterator is valid and 4 additional bits to denote where the iterator is set valid inside `NextFromInput()`. Therefore, when the control flow reaches `PrepareOutput()` and hits assertion there, we can have a better idea of what has gone wrong. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10505 Test Plan: make check ``` TEST_TMPDIR=/dev/shm/rocksdb time ./db_bench -compression_type=none -write_buffer_size=1073741824 -benchmarks=fillseq,flush ``` The above command has a 'flush' benchmark which uses `CompactionIterator`. I haven't observed any CPU regression or drop in throughput or latency increase. Reviewed By: ltamasi Differential Revision: D38551615 Pulled By: riversand963 fbshipit-source-id: 1250848fc118bb753d71fa9ff8ba840df999f5e0	2022-08-09 17:07:24 -07:00
Jay Zhuang	3f763763aa	Change `bottommost_temperture` to `last_level_temperture` (#10471 ) Summary: Change tiered compaction feature from `bottommost_temperture` to `last_level_temperture`. The old option is kept for migration purpose only, which is behaving the same as `last_level_temperture` and it will be removed in the next release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10471 Test Plan: CI Reviewed By: siying Differential Revision: D38450621 Pulled By: jay-zhuang fbshipit-source-id: cc1cdf8bad409376fec0152abc0a64fb72a91527	2022-08-08 14:36:34 -07:00
Levi Tamasi	cc8ded6152	Do not put blobs read during compaction into cache (#10457 ) Summary: During compaction, blobs are currently read using the default `ReadOptions`, which has the `fill_cache` flag set to true. Earlier, this didn't make any difference since we didn't have a blob cache; however, now we have to explicitly set this flag to false to avoid polluting the cache during compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10457 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D38333528 Pulled By: ltamasi fbshipit-source-id: 5b4d49a1e39543bee73c7df2aa9194fb101875e2	2022-08-01 19:49:05 -07:00
Jay Zhuang	faa0f9723c	Tiered compaction: integrate Seqno time mapping with per key placement (#10370 ) Summary: Using the Sequence number to time mapping to decide if a key is hot or not in compaction and place it in the corresponding level. Note: the feature is not complete, level compaction will run indefinitely until all penultimate level data is cold and small enough to not trigger compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10370 Test Plan: CI * Run basic db_bench for universal compaction manually Reviewed By: siying Differential Revision: D37892338 Pulled By: jay-zhuang fbshipit-source-id: 792bbd91b1ccc2f62b5d14c53118434bcaac4bbe	2022-07-15 19:01:30 -07:00
Jay Zhuang	a3acf2ef87	Add seqno to time mapping (#10338 ) Summary: Which will be used for tiered storage to preclude hot data from compacting to the cold tier (the last level). Internally, adding seqno to time mapping. A periodic_task is scheduled to record the current_seqno -> current_time in certain cadence. When memtable flush, the mapping informaiton is stored in sstable property. During compaction, the mapping information are merged and get the approximate time of sequence number, which is used to determine if a key is recently inserted or not and preclude it from the last level if it's recently inserted (within the `preclude_last_level_data_seconds`). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10338 Test Plan: CI Reviewed By: siying Differential Revision: D37810187 Pulled By: jay-zhuang fbshipit-source-id: 6953be7a18a99de8b1cb3b162d712f79c2b4899f	2022-07-14 21:49:34 -07:00
Jay Zhuang	6ce0b2ca34	Tiered Compaction: per key placement support (#9964 ) Summary: Support per_key_placement for last level compaction, which will be used for tiered compaction. * compaction iterator reports which level a key should output to; * compaction get the output level information and check if it's safe to output the data to penultimate level; * all compaction output files will be installed. * extra internal compaction stats added for penultimate level. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9964 Test Plan: * Unittest * db_bench, no significate difference: https://gist.github.com/jay-zhuang/3645f8fb97ec0ab47c10704bb39fd6e4 * microbench manual compaction no significate difference: https://gist.github.com/jay-zhuang/ba679b3e89e24992615ee9eef310e6dd * run the db_stress multiple times (not covering the new feature) looks good (internal: https://fburl.com/sandcastle/9w84pp2m) Reviewed By: ajkr Differential Revision: D36249494 Pulled By: jay-zhuang fbshipit-source-id: a96da57c8031c1df83e4a7a8567b657a112b80a3	2022-07-13 20:54:49 -07:00
Levi Tamasi	c73d2a9d18	Add API for writing wide-column entities (#10242 ) Summary: The patch builds on https://github.com/facebook/rocksdb/pull/9915 and adds a new API called `PutEntity` that can be used to write a wide-column entity to the database. The new API is added to both `DB` and `WriteBatch`. Note that currently there is no way to retrieve these entities; more precisely, all read APIs (`Get`, `MultiGet`, and iterator) return `NotSupported` when they encounter a wide-column entity that is required to answer a query. Read-side support (as well as other missing functionality like `Merge`, compaction filter, and timestamp support) will be added in later PRs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10242 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D37369748 Pulled By: ltamasi fbshipit-source-id: 7f5e412359ed7a400fd80b897dae5599dbcd685d	2022-06-25 15:30:47 -07:00
Yanqin Jin	4d31d3c2ed	Abort in dbg mode after logging (#10183 ) Summary: In CompactionIterator code, there are multiple places where the process will abort in dbg mode before logging the error message describing the cause. This PR changes only the logging behavior for compaction iterator so that error message is written to LOG before the process aborts in debug mode. Also updated the triggering condition for an assertion for single delete with user-defined timestamp. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10183 Test Plan: make check Reviewed By: akankshamahajan15 Differential Revision: D37190218 Pulled By: riversand963 fbshipit-source-id: 741bb007067be7cfbe94ac9e530ad4b2b339c009	2022-06-15 22:00:24 -07:00
zczhu	3ee6c9baec	Consolidate manual_compaction_paused_ check (#10070 ) Summary: As pointed out by [https://github.com/facebook/rocksdb/pull/8351#discussion_r645765422](https://github.com/facebook/rocksdb/pull/8351#discussion_r645765422), check `manual_compaction_paused` and `manual_compaction_canceled` can be reduced by setting `canceled` to be true in `DisableManualCompaction()` and `canceled` to be false in the last time calling `EnableManualCompaction()`. Changed Tests: The origin `DBTest2.PausingManualCompaction1` uses a callback function to increase `manual_compaction_paused` and the origin CompactionJob/CompactionIterator with `manual_compaction_paused` can detect this. I changed the callback function so that it sets `*canceled` as true if `canceled` is not `nullptr` (to notify CompactionJob/CompactionIterator the compaction has been canceled). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10070 Test Plan: This change does not introduce new features, but some slight difference in compaction implementation. Run the same manual compaction unit tests as before (e.g., PausingManualCompaction[1-4], CancelManualCompaction[1-2], CancelManualCompactionWithListener in db_test2, and db_compaction_test). Reviewed By: ajkr Differential Revision: D36949133 Pulled By: littlepig2013 fbshipit-source-id: c5dc4c956fbf8f624003a0f5ad2690240063a821	2022-06-06 18:32:26 -07:00
Yanqin Jin	3f263ef536	Add a temporary option for user to opt-out enforcement of SingleDelete contract (#9983 ) Summary: PR https://github.com/facebook/rocksdb/issues/9888 started to enforce the contract of single delete described in https://github.com/facebook/rocksdb/wiki/Single-Delete. For some of existing use cases, it is desirable to have a transition during which compaction will not fail if the contract is violated. Therefore, we add a temporary option `enforce_single_del_contracts` to allow application to opt out from this new strict behavior. Once transition completes, the flag can be set to `true` again. In a future release, the option will be removed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9983 Test Plan: make check Reviewed By: ajkr Differential Revision: D36333672 Pulled By: riversand963 fbshipit-source-id: dcb703ea0ed08076a1422f1bfb9914afe3c2caa2	2022-05-16 15:44:59 -07:00
Yanqin Jin	9d634dd5b6	Rename kRemoveWithSingleDelete to kPurge (#9951 ) Summary: PR 9929 adds a new CompactionFilter::Decision, i.e. kRemoveWithSingleDelete so that CompactionFilter can indicate to CompactionIterator that a PUT can only be removed with SD. However, how CompactionIterator handles such a key is implementation detail which should not be implied in the public API. In fact, such a PUT can just be dropped. This is an optimization which we will apply in the near future. Discussion thread: https://github.com/facebook/rocksdb/pull/9929#discussion_r863198964 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9951 Test Plan: make check Reviewed By: ajkr Differential Revision: D36156590 Pulled By: riversand963 fbshipit-source-id: 7b7d01f47bba4cad7d9cca6ca52984f27f88b372	2022-05-05 08:16:20 -07:00
Yanqin Jin	06394ff4e7	Fix a bug of CompactionIterator/CompactionFilter using `Delete` (#9929 ) Summary: When compaction filter determines that a key should be removed, it updates the internal key's type to `Delete`. If this internal key is preserved in current compaction but seen by a later compaction together with `SingleDelete`, it will cause compaction iterator to return Corruption. To fix the issue, compaction filter should return more information in addition to the intention of removing a key. Therefore, we add a new `kRemoveWithSingleDelete` to `CompactionFilter::Decision`. Seeing `kRemoveWithSingleDelete`, compaction iterator will update the op type of the internal key to `kTypeSingleDelete`. In addition, I updated db_stress_shared_state.[cc\|h] so that `no_overwrite_ids_` becomes `const`. It is easier to reason about thread-safety if accessed from multiple threads. This information is passed to `PrepareTxnDBOptions()` when calling from `Open()` so that we can set up the rollback deletion type callback for transactions. Finally, disable compaction filter for multiops_txn because the key removal logic of `DbStressCompactionFilter` does not quite work with `MultiOpsTxnsStressTest`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9929 Test Plan: make check make crash_test make crash_test_with_txn Reviewed By: anand1976 Differential Revision: D36069678 Pulled By: riversand963 fbshipit-source-id: cedd2f1ba958af59ad3916f1ba6f424307955f92	2022-05-02 13:25:45 -07:00
Yanqin Jin	2b5c29f9f3	Enforce the contract of SingleDelete (#9888 ) Summary: Enforce the contract of SingleDelete so that they are not mixed with Delete for the same key. Otherwise, it will lead to undefined behavior. See https://github.com/facebook/rocksdb/wiki/Single-Delete#notes. Also fix unit tests and write-unprepared. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9888 Test Plan: make check Reviewed By: ajkr Differential Revision: D35837817 Pulled By: riversand963 fbshipit-source-id: acd06e4dcba8cb18df92b44ed18c57e10e5a7635	2022-04-28 14:48:27 -07:00
Levi Tamasi	db536ee045	Propagate errors from UpdateBoundaries (#9851 ) Summary: In `FileMetaData`, we keep track of the lowest-numbered blob file referenced by the SST file in question for the purposes of BlobDB's garbage collection in the `oldest_blob_file_number` field, which is updated in `UpdateBoundaries`. However, with the current code, `BlobIndex` decoding errors (or invalid blob file numbers) are swallowed in this method. The patch changes this by propagating these errors and failing the corresponding flush/compaction. (Note that since blob references are generated by the BlobDB code and also parsed by `CompactionIterator`, in reality this can only happen in the case of memory corruption.) This change necessitated updating some unit tests that involved fake/corrupt `BlobIndex` objects. Some of these just used a dummy string like `"blob_index"` as a placeholder; these were replaced with real `BlobIndex`es. Some were relying on the earlier behavior to simulate corruption; these were replaced with `SyncPoint`-based test code that corrupts a valid blob reference at read time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9851 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D35683671 Pulled By: ltamasi fbshipit-source-id: f7387af9945c48e4d5c4cd864f1ba425c7ad51f6	2022-04-15 20:25:48 -07:00
Yanqin Jin	0bd4dcde6b	CompactionIterator sees consistent view of which keys are committed (#9830 ) Summary: This PR does not affect the functionality of `DB` and write-committed transactions. `CompactionIterator` uses `KeyCommitted(seq)` to determine if a key in the database is committed. As the name 'write-committed' implies, if write-committed policy is used, a key exists in the database only if it is committed. In fact, the implementation of `KeyCommitted()` is as follows: ``` inline bool KeyCommitted(SequenceNumber seq) { // For non-txn-db and write-committed, snapshot_checker_ is always nullptr. return snapshot_checker_ == nullptr \|\| snapshot_checker_->CheckInSnapshot(seq, kMaxSequence) == SnapshotCheckerResult::kInSnapshot; } ``` With that being said, we focus on write-prepared/write-unprepared transactions. A few notes: - A key can exist in the db even if it's uncommitted. Therefore, we rely on `snapshot_checker_` to determine data visibility. We also require that all writes go through transaction API instead of the raw `WriteBatch` + `Write`, thus at most one uncommitted version of one user key can exist in the database. - `CompactionIterator` outputs a key as long as the key is uncommitted. Due to the above reasons, it is possible that `CompactionIterator` decides to output an uncommitted key without doing further checks on the key (`NextFromInput()`). By the time the key is being prepared for output, the key becomes committed because the `snapshot_checker_(seq, kMaxSequence)` becomes true in the implementation of `KeyCommitted()`. Then `CompactionIterator` will try to zero its sequence number and hit assertion error if the key is a tombstone. To fix this issue, we should make the `CompactionIterator` see a consistent view of the input keys. Note that for write-prepared/write-unprepared, the background flush/compaction jobs already take a "job snapshot" before starting processing keys. The job snapshot is released only after the entire flush/compaction finishes. We can use this snapshot to determine whether a key is committed or not with minor change to `KeyCommitted()`. ``` inline bool KeyCommitted(SequenceNumber sequence) { // For non-txn-db and write-committed, snapshot_checker_ is always nullptr. return snapshot_checker_ == nullptr \|\| snapshot_checker_->CheckInSnapshot(sequence, job_snapshot_) == SnapshotCheckerResult::kInSnapshot; } ``` As a result, whether a key is committed or not will remain a constant throughout compaction, causing no trouble for `CompactionIterator`s assertions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9830 Test Plan: make check Reviewed By: ltamasi Differential Revision: D35561162 Pulled By: riversand963 fbshipit-source-id: 0e00d200c195240341cfe6d34cbc86798b315b9f	2022-04-14 11:11:04 -07:00
Yanqin Jin	0ad9ee30ce	Remove dead code (#9825 ) Summary: Options `preserve_deletes` and `iter_start_seqnum` have been removed since 7.0. This PR removes dead code related to these two removed options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9825 Test Plan: make check Reviewed By: akankshamahajan15 Differential Revision: D35517950 Pulled By: riversand963 fbshipit-source-id: 86282ce5ec4087acb94a06a42a1b6d55b1715482	2022-04-11 10:26:55 -07:00
Levi Tamasi	320d9a8e8a	Use a sorted vector instead of a map to store blob file metadata (#9526 ) Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc	2022-02-09 12:36:43 -08:00
Levi Tamasi	dc5de45af8	Support readahead during compaction for blob files (#9187 ) Summary: The patch adds a new BlobDB configuration option `blob_compaction_readahead_size` that can be used to enable prefetching data from blob files during compaction. This is important when using storage with higher latencies like HDDs or remote filesystems. If enabled, prefetching is used for all cases when blobs are read during compaction, namely garbage collection, compaction filters (when the existing value has to be read from a blob file), and `Merge` (when the value of the base `Put` is stored in a blob file). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9187 Test Plan: Ran `make check` and the stress/crash test. Reviewed By: riversand963 Differential Revision: D32565512 Pulled By: ltamasi fbshipit-source-id: 87be9cebc3aa01cc227bec6b5f64d827b8164f5d	2021-11-19 17:53:47 -08:00
Yanqin Jin	a113cecfc9	Fix a bug in timestamp-related GC (#9116 ) Summary: For multiple versions (ts + seq) of the same user key, if they cross the boundary of `full_history_ts_low_`, we should retain the version that is visible to the `full_history_ts_low_`. Namely, we keep the internal key with the largest timestamp smaller than `full_history_ts_low`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9116 Test Plan: make check Reviewed By: ltamasi Differential Revision: D32261514 Pulled By: riversand963 fbshipit-source-id: e10f47c254c04c05261440051e4f50cb7d95474e	2021-11-09 13:08:55 -08:00
Yanqin Jin	5237b39d2e	Fix assertion error during compaction with write-prepared txn enabled (#9105 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9105 The user contract of SingleDelete is that: a SingleDelete can only be issued to a key that exists and has NOT been updated. For example, application can insert one key `key`, and uses a SingleDelete to delete it in the future. The `key` cannot be updated or removed using Delete. In reality, especially when write-prepared transaction is being used, things can get tricky. For example, a prepared transaction already writes `key` to the memtable after a successful Prepare(). Afterwards, should the transaction rollback, it will insert a Delete into the memtable to cancel out the prior Put. Consider the following sequence of operations. ``` // operation sequence 1 Begin txn Put(key) Prepare() Flush() Rollback txn Flush() ``` There will be two SSTs resulting from above. One of the contains a PUT, while the second one contains a Delete. It is also known that releasing a snapshot can lead to an L0 containing only a SD for a particular key. Consider the following operations following the above block. ``` // operation sequence 2 db->Put(key) db->SingleDelete(key) Flush() ``` The operation sequence 2 can result in an L0 with only the SD. Should there be a snapshot for conflict checking created before operation sequence 1, then an attempt to compact the db may hit the assertion failure below, because ikey_.type is Delete (from a rollback). ``` else if (clear_and_output_next_key_) { assert(ikey_.type == kTypeValue \|\| ikey_.type == kTypeBlobIndex); } ``` To fix the assertion failure, we can skip the SingleDelete if we detect an earlier Delete in the same snapshot interval. Reviewed By: ltamasi Differential Revision: D32056848 fbshipit-source-id: 23620a91e28562d91c45cf7e95f414b54b729748	2021-11-05 15:29:18 -07:00
Yanqin Jin	9b53f14a35	Fixed a bug in CompactionIterator when write-preared transaction is used (#9060 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9060 RocksDB bottommost level compaction may zero out an internal key's sequence if the key's sequence is in the earliest_snapshot. In write-prepared transaction, checking the visibility of a certain sequence in a specific released snapshot may return a "snapshot released" result. Therefore, it is possible, after a certain sequence of events, a PUT has its sequence zeroed out, but a subsequent SingleDelete of the same key will still be output with its original sequence. This violates the ascending order of keys and leads to incorrect result. The solution is to use an extra variable `last_key_seq_zeroed_` to track the information about visibility in earliest snapshot. With this variable, we can know for sure that a SingleDelete is in the earliest snapshot even if the said snapshot is released during compaction before processing the SD. Reviewed By: ltamasi Differential Revision: D31813016 fbshipit-source-id: d8cff59d6f34e0bdf282614034aaea99be9174e1	2021-11-03 15:55:00 -07:00
Yanqin Jin	fdf2a0d7eb	Fix a compaction bug for write-prepared txn (#9061 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9061 In write-prepared txn, checking a sequence's visibility in a released (old) snapshot may return "Snapshot released". Suppose we have two snapshots: ``` earliest_snap < earliest_write_conflict_snap ``` If we release `earliest_write_conflict_snap` but keep `earliest_snap` during bottommost level compaction, then it is possible that certain sequence of events can lead to a PUT being seq-zeroed followed by a SingleDelete of the same key. This violates the ascending order of keys, and will cause data inconsistency. Reviewed By: ltamasi Differential Revision: D31813017 fbshipit-source-id: dc68ba2541d1228489b93cf3edda5f37ed06f285	2021-10-29 15:23:17 -07:00
mrambacher	13ae16c315	Cleanup includes in dbformat.h (#8930 ) Summary: This header file was including everything and the kitchen sink when it did not need to. This resulted in many places including this header when they needed other pieces instead. Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing. Hopefully, this sort of code hygiene cleanup will speed up the builds... Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930 Reviewed By: pdillinger Differential Revision: D31142788 Pulled By: mrambacher fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d	2021-09-29 04:04:40 -07:00
Akanksha Mahajan	78afb4d81e	Support SingleDelete for user-defined timestamps (#8921 ) Summary: Added support for SingleDelete for user-defined timestamps. Users can now Get and Iterate over keys deleted with SingleDelete. It also includes changes in CompactionIterator which preserves the same user key with different timestamps, unless the timestamp is below a certain threshold full_history_ts_low. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8921 Test Plan: Added new unit tests Reviewed By: riversand963 Differential Revision: D31098191 Pulled By: akankshamahajan15 fbshipit-source-id: 78a59ef4b4884ae324fcd10f56e62a27d5ee2f49	2021-09-27 11:51:07 -07:00
Yanqin Jin	2b367fa8cc	Fix bug caused by releasing snapshot(s) during compaction (#8608 ) Summary: In debug mode, we are seeing assertion failure as follows ``` db/compaction/compaction_iterator.cc:980: void rocksdb::CompactionIterator::PrepareOutput(): \ Assertion `ikey_.type != kTypeDeletion && ikey_.type != kTypeSingleDeletion' failed. ``` It is caused by releasing earliest snapshot during compaction between the execution of `NextFromInput()` and `PrepareOutput()`. In one case, as demonstrated in unit test `WritePreparedTransaction.ReleaseEarliestSnapshotDuringCompaction_WithSD2`, incorrect result may be returned by a following range scan if we disable assertion, as in opt compilation level: the SingleDelete marker's sequence number is zeroed out, but the preceding PUT is also outputted to the SST file after compaction. Due to the logic of DBIter, the PUT will not be skipped and will be returned by iterator in range scan. https://github.com/facebook/rocksdb/issues/8661 illustrates what happened. Fix by taking a more conservative approach: make compaction zero out sequence number only if key is in the earliest snapshot when the compaction starts. Another assertion failure is ``` Assertion `current_user_key_snapshot_ == last_snapshot' failed. ``` It's caused by releasing the snapshot between the PUT and SingleDelete during compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8608 Test Plan: make check Reviewed By: jay-zhuang Differential Revision: D30145645 Pulled By: riversand963 fbshipit-source-id: 699f58e66faf70732ad53810ccef43935d3bbe81	2021-08-17 22:14:20 -07:00
Levi Tamasi	6878cedcc3	Add statistics support to integrated BlobDB (#8667 ) Summary: The patch adds statistics support to the integrated BlobDB implementation, namely the tickers `BLOB_DB_BLOB_FILE_BYTES_READ` and `BLOB_DB_GC_{NUM_KEYS,BYTES}_RELOCATED`, and the histograms `BLOB_DB_(DE)COMPRESSION_MICROS`. (Some other statistics, like `BLOB_DB_BLOB_FILE_BYTES_WRITTEN`, `BLOB_DB_BLOB_FILE_SYNCED`, `BLOB_DB_BLOB_FILE_{READ,WRITE,SYNC}_MICROS` were already supported.) Note that the vast majority of the old BlobDB's tickers/histograms are not really applicable to the new implementation, since they e.g. pertain to calling dedicated BlobDB APIs (which the integrated BlobDB does not have) or are tied to the legacy BlobDB's design of writing blob files synchronously when a write API is called. Such statistics are marked "legacy BlobDB only" in `statistics.h`. Fixes https://github.com/facebook/rocksdb/issues/8645 . Pull Request resolved: https://github.com/facebook/rocksdb/pull/8667 Test Plan: Ran `make check` and tested the new statistics using `db_bench`. Reviewed By: riversand963 Differential Revision: D30356884 Pulled By: ltamasi fbshipit-source-id: 5f8a833faee60401c5643c2f0a6c0415488190a4	2021-08-17 17:22:31 -07:00
jimmycleary	e0ff365a76	Replace macros in compaction_iterator.cc with inline functions (#8592 ) Summary: Internal task T96186510. Created new inline member functions in `CompactionIterator`, `DefinitelyInSnapshot`, `DefinitelyNotInSnapshot`, and `InEarliestSnapshot` to replace the macros at the top of `compaction_iterator.cc`. Placed the definitions in `compaction_iterator.h` in accordance with Google's style guide for inline functions. Separated the declarations and definitions, and only placed the `inline` keyword on the definitions, in line with ISO CPP recommendations. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8592 Test Plan: Ran `make check`. Successful build and all tests appeared to pass. Reviewed By: riversand963 Differential Revision: D29966782 Pulled By: jimmycFB fbshipit-source-id: 3584290bbbabf862e9ab58852281f46d37f58be6	2021-07-28 14:53:29 -07:00

1 2

83 commits