rocksdb/db/compaction/compaction_picker_test.cc

3994 lines
160 KiB
C++
Raw Normal View History

// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
#include <limits>
#include <string>
#include <utility>
Try to start TTL earlier with kMinOverlappingRatio is used (#8749) Summary: Right now, when options.ttl is set, compactions are triggered around the time when TTL is reached. This might cause extra compactions which are often bursty. This commit tries to mitigate it by picking those files earlier in normal compaction picking process. This is only implemented using kMinOverlappingRatio with Leveled compaction as it is the default value and it is more complicated to change other styles. When a file is aged more than ttl/2, RocksDB starts to boost the compaction priority of files in normal compaction picking process, and hope by the time TTL is reached, very few extra compaction is needed. In order for this to work, another change is made: during a compaction, if an output level file is older than ttl/2, cut output files based on original boundary (if it is not in the last level). This is to make sure that after an old file is moved to the next level, and new data is merged from the upper level, the new data falling into this range isn't reset with old timestamp. Without this change, in many cases, most files from one level will keep having old timestamp, even if they have newer data and we stuck in it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8749 Test Plan: Add a unit test to test the boosting logic. Will add a unit test to test it end-to-end. Reviewed By: jay-zhuang Differential Revision: D30735261 fbshipit-source-id: 503c2d89250b22911eb99e72b379be154de3428e
2021-11-01 21:32:12 +00:00
#include "db/compaction/compaction.h"
#include "db/compaction/compaction_picker_fifo.h"
#include "db/compaction/compaction_picker_level.h"
#include "db/compaction/compaction_picker_universal.h"
Try to start TTL earlier with kMinOverlappingRatio is used (#8749) Summary: Right now, when options.ttl is set, compactions are triggered around the time when TTL is reached. This might cause extra compactions which are often bursty. This commit tries to mitigate it by picking those files earlier in normal compaction picking process. This is only implemented using kMinOverlappingRatio with Leveled compaction as it is the default value and it is more complicated to change other styles. When a file is aged more than ttl/2, RocksDB starts to boost the compaction priority of files in normal compaction picking process, and hope by the time TTL is reached, very few extra compaction is needed. In order for this to work, another change is made: during a compaction, if an output level file is older than ttl/2, cut output files based on original boundary (if it is not in the last level). This is to make sure that after an old file is moved to the next level, and new data is merged from the upper level, the new data falling into this range isn't reset with old timestamp. Without this change, in many cases, most files from one level will keep having old timestamp, even if they have newer data and we stuck in it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8749 Test Plan: Add a unit test to test the boosting logic. Will add a unit test to test it end-to-end. Reviewed By: jay-zhuang Differential Revision: D30735261 fbshipit-source-id: 503c2d89250b22911eb99e72b379be154de3428e
2021-11-01 21:32:12 +00:00
#include "db/compaction/file_pri.h"
#include "rocksdb/advanced_options.h"
#include "table/unique_id_impl.h"
#include "test_util/testharness.h"
#include "test_util/testutil.h"
#include "util/string_util.h"
namespace ROCKSDB_NAMESPACE {
class CountingLogger : public Logger {
public:
2015-02-01 19:08:19 +00:00
using Logger::Logv;
void Logv(const char* /*format*/, va_list /*ap*/) override { log_count++; }
size_t log_count;
};
Fix overlapping check by excluding timestamp (#10615) Summary: With user-defined timestamp, checking overlapping should exclude timestamp part from key. This has already been done for range checking for files in sstableKeyCompare(), but not yet done when checking with concurrent compactions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10615 Test Plan: (Will add more tests) make check (Repro seems easier with this commit sha: git checkout 78bbdef530bd36fa299d496bd1013cf39d8e203a) rm -rf /dev/shm/rocksdb/* && mkdir /dev/shm/rocksdb/rocksdb_crashtest_expected && ./db_stress --allow_data_in_errors=True --clear_column_family_one_in=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5 --delrangepercent=0 --expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected --iterpercent=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1 --ops_per_thread=1000000 --paranoid_file_checks=1 --partition_filters=0 --prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0 --snapshot_hold_ops=100000 --subcompactions=1 --compaction_pri=3 --target_file_size_base=65536 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1 --user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1 --write_buffer_size=65536 --writepercent=60 -disable_wal=1 Reviewed By: akankshamahajan15 Differential Revision: D39146797 Pulled By: riversand963 fbshipit-source-id: 7fca800026ca6219220100b8b6cf84d907828163
2022-09-08 20:03:07 +00:00
class CompactionPickerTestBase : public testing::Test {
public:
const Comparator* ucmp_;
InternalKeyComparator icmp_;
Options options_;
ImmutableOptions ioptions_;
MutableCFOptions mutable_cf_options_;
MutableDBOptions mutable_db_options_;
LevelCompactionPicker level_compaction_picker;
std::string cf_name_;
CountingLogger logger_;
LogBuffer log_buffer_;
uint32_t file_num_;
CompactionOptionsFIFO fifo_options_;
std::unique_ptr<VersionStorageInfo> vstorage_;
std::vector<std::unique_ptr<FileMetaData>> files_;
// does not own FileMetaData
std::unordered_map<uint32_t, std::pair<FileMetaData*, int>> file_map_;
// input files to compaction process.
std::vector<CompactionInputFiles> input_files_;
int compaction_level_start_;
Fix overlapping check by excluding timestamp (#10615) Summary: With user-defined timestamp, checking overlapping should exclude timestamp part from key. This has already been done for range checking for files in sstableKeyCompare(), but not yet done when checking with concurrent compactions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10615 Test Plan: (Will add more tests) make check (Repro seems easier with this commit sha: git checkout 78bbdef530bd36fa299d496bd1013cf39d8e203a) rm -rf /dev/shm/rocksdb/* && mkdir /dev/shm/rocksdb/rocksdb_crashtest_expected && ./db_stress --allow_data_in_errors=True --clear_column_family_one_in=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5 --delrangepercent=0 --expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected --iterpercent=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1 --ops_per_thread=1000000 --paranoid_file_checks=1 --partition_filters=0 --prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0 --snapshot_hold_ops=100000 --subcompactions=1 --compaction_pri=3 --target_file_size_base=65536 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1 --user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1 --write_buffer_size=65536 --writepercent=60 -disable_wal=1 Reviewed By: akankshamahajan15 Differential Revision: D39146797 Pulled By: riversand963 fbshipit-source-id: 7fca800026ca6219220100b8b6cf84d907828163
2022-09-08 20:03:07 +00:00
explicit CompactionPickerTestBase(const Comparator* _ucmp)
: ucmp_(_ucmp),
icmp_(ucmp_),
Fix overlapping check by excluding timestamp (#10615) Summary: With user-defined timestamp, checking overlapping should exclude timestamp part from key. This has already been done for range checking for files in sstableKeyCompare(), but not yet done when checking with concurrent compactions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10615 Test Plan: (Will add more tests) make check (Repro seems easier with this commit sha: git checkout 78bbdef530bd36fa299d496bd1013cf39d8e203a) rm -rf /dev/shm/rocksdb/* && mkdir /dev/shm/rocksdb/rocksdb_crashtest_expected && ./db_stress --allow_data_in_errors=True --clear_column_family_one_in=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5 --delrangepercent=0 --expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected --iterpercent=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1 --ops_per_thread=1000000 --paranoid_file_checks=1 --partition_filters=0 --prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0 --snapshot_hold_ops=100000 --subcompactions=1 --compaction_pri=3 --target_file_size_base=65536 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1 --user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1 --write_buffer_size=65536 --writepercent=60 -disable_wal=1 Reviewed By: akankshamahajan15 Differential Revision: D39146797 Pulled By: riversand963 fbshipit-source-id: 7fca800026ca6219220100b8b6cf84d907828163
2022-09-08 20:03:07 +00:00
options_(CreateOptions(ucmp_)),
ioptions_(options_),
mutable_cf_options_(options_),
mutable_db_options_(),
level_compaction_picker(ioptions_, &icmp_),
cf_name_("dummy"),
log_buffer_(InfoLogLevel::INFO_LEVEL, &logger_),
file_num_(1),
vstorage_(nullptr) {
mutable_cf_options_.ttl = 0;
mutable_cf_options_.periodic_compaction_seconds = 0;
// ioptions_.compaction_pri = kMinOverlappingRatio has its own set of
// tests to cover.
ioptions_.compaction_pri = kByCompensatedSize;
fifo_options_.max_table_files_size = 1;
mutable_cf_options_.RefreshDerivedOptions(ioptions_);
ioptions_.cf_paths.emplace_back("dummy",
std::numeric_limits<uint64_t>::max());
}
Fix overlapping check by excluding timestamp (#10615) Summary: With user-defined timestamp, checking overlapping should exclude timestamp part from key. This has already been done for range checking for files in sstableKeyCompare(), but not yet done when checking with concurrent compactions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10615 Test Plan: (Will add more tests) make check (Repro seems easier with this commit sha: git checkout 78bbdef530bd36fa299d496bd1013cf39d8e203a) rm -rf /dev/shm/rocksdb/* && mkdir /dev/shm/rocksdb/rocksdb_crashtest_expected && ./db_stress --allow_data_in_errors=True --clear_column_family_one_in=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5 --delrangepercent=0 --expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected --iterpercent=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1 --ops_per_thread=1000000 --paranoid_file_checks=1 --partition_filters=0 --prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0 --snapshot_hold_ops=100000 --subcompactions=1 --compaction_pri=3 --target_file_size_base=65536 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1 --user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1 --write_buffer_size=65536 --writepercent=60 -disable_wal=1 Reviewed By: akankshamahajan15 Differential Revision: D39146797 Pulled By: riversand963 fbshipit-source-id: 7fca800026ca6219220100b8b6cf84d907828163
2022-09-08 20:03:07 +00:00
~CompactionPickerTestBase() override {}
void NewVersionStorage(int num_levels, CompactionStyle style) {
DeleteVersionStorage();
options_.num_levels = num_levels;
Sort L0 files by newly introduced epoch_num (#10922) Summary: **Context:** Sorting L0 files by `largest_seqno` has at least two inconvenience: - File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap. - For example, consider the following sequence of events ("key@n" indicates key at seqno "n") - insert k1@1 to memtable m1 - ingest file s1 with k2@2, ingest file s2 with k3@3 - insert k4@4 to m1 - compact files s1, s2 and result in new file s3 of seqno range [2, 3] - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1 - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption. - Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example) - an existing SST s1 contains only k1@1 - insert k1@2 to memtable m1 - ingest file s2 with k3@3, ingest file s3 with k4@4 - insert single delete k5@5 in m1 - flush m1 and result in new file s4 of seqno range [2, 5] - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4] - compact s4 and result in new file s6 of seqno range [2] due to single delete - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno` Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways: - In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more. - In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption. **Summary:** - Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`. - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`) - Compaction output file is assigned with the minimum `epoch_number` among input files' - Refit level: reuse refitted file's epoch_number - Other paths needing `epoch_number` treatment: - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo` - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`. - Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair). - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder. - Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag` - Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`. - Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR. - Misc: - update existing tests with `epoch_number` so make check will pass - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber() Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922 Test Plan: - `make check` - New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc` - Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930 - [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox` - [Ongoing] normal db stress test - [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 Reviewed By: ajkr Differential Revision: D41063187 Pulled By: hx235 fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2022-12-13 21:29:37 +00:00
vstorage_.reset(new VersionStorageInfo(
&icmp_, ucmp_, options_.num_levels, style, nullptr, false,
EpochNumberRequirement::kMustPresent));
Clean up VersionStorageInfo a bit (#9494) Summary: The patch does some cleanup in and around `VersionStorageInfo`: * Renames the method `PrepareApply` to `PrepareAppend` in `Version` to make it clear that it is to be called before appending the `Version` to `VersionSet` (via `AppendVersion`), not before applying any `VersionEdit`s. * Introduces a helper method `VersionStorageInfo::PrepareForVersionAppend` (called by `Version::PrepareAppend`) that encapsulates the population of the various derived data structures in `VersionStorageInfo`, and turns the methods computing the derived structures (`UpdateNumNonEmptyLevels`, `CalculateBaseBytes` etc.) into private helpers. * Changes `Version::PrepareAppend` so it only calls `UpdateAccumulatedStats` if the `update_stats` flag is set. (Earlier, this was checked by the callee.) Related to this, it also moves the call to `ComputeCompensatedSizes` to `VersionStorageInfo::PrepareForVersionAppend`. * Updates and cleans up `version_builder_test`, `version_set_test`, and `compaction_picker_test` so `PrepareForVersionAppend` is called anytime a new `VersionStorageInfo` is set up or saved. This cleanup also involves splitting `VersionStorageInfoTest.MaxBytesForLevelDynamic` into multiple smaller test cases. * Fixes up a bunch of comments that were outdated or just plain incorrect. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9494 Test Plan: Ran `make check` and the crash test script for a while. Reviewed By: riversand963 Differential Revision: D33971666 Pulled By: ltamasi fbshipit-source-id: fda52faac7783041126e4f8dec0fe01bdcadf65a
2022-02-04 16:18:18 +00:00
vstorage_->PrepareForVersionAppend(ioptions_, mutable_cf_options_);
}
// Create a new VersionStorageInfo object so we can add mode files and then
// merge it with the existing VersionStorageInfo
void AddVersionStorage() {
temp_vstorage_.reset(new VersionStorageInfo(
&icmp_, ucmp_, options_.num_levels, ioptions_.compaction_style,
Sort L0 files by newly introduced epoch_num (#10922) Summary: **Context:** Sorting L0 files by `largest_seqno` has at least two inconvenience: - File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap. - For example, consider the following sequence of events ("key@n" indicates key at seqno "n") - insert k1@1 to memtable m1 - ingest file s1 with k2@2, ingest file s2 with k3@3 - insert k4@4 to m1 - compact files s1, s2 and result in new file s3 of seqno range [2, 3] - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1 - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption. - Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example) - an existing SST s1 contains only k1@1 - insert k1@2 to memtable m1 - ingest file s2 with k3@3, ingest file s3 with k4@4 - insert single delete k5@5 in m1 - flush m1 and result in new file s4 of seqno range [2, 5] - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4] - compact s4 and result in new file s6 of seqno range [2] due to single delete - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno` Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways: - In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more. - In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption. **Summary:** - Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`. - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`) - Compaction output file is assigned with the minimum `epoch_number` among input files' - Refit level: reuse refitted file's epoch_number - Other paths needing `epoch_number` treatment: - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo` - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`. - Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair). - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder. - Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag` - Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`. - Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR. - Misc: - update existing tests with `epoch_number` so make check will pass - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber() Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922 Test Plan: - `make check` - New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc` - Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930 - [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox` - [Ongoing] normal db stress test - [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 Reviewed By: ajkr Differential Revision: D41063187 Pulled By: hx235 fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2022-12-13 21:29:37 +00:00
vstorage_.get(), false, EpochNumberRequirement::kMustPresent));
}
void DeleteVersionStorage() {
vstorage_.reset();
temp_vstorage_.reset();
files_.clear();
file_map_.clear();
input_files_.clear();
}
Fix overlapping check by excluding timestamp (#10615) Summary: With user-defined timestamp, checking overlapping should exclude timestamp part from key. This has already been done for range checking for files in sstableKeyCompare(), but not yet done when checking with concurrent compactions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10615 Test Plan: (Will add more tests) make check (Repro seems easier with this commit sha: git checkout 78bbdef530bd36fa299d496bd1013cf39d8e203a) rm -rf /dev/shm/rocksdb/* && mkdir /dev/shm/rocksdb/rocksdb_crashtest_expected && ./db_stress --allow_data_in_errors=True --clear_column_family_one_in=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5 --delrangepercent=0 --expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected --iterpercent=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1 --ops_per_thread=1000000 --paranoid_file_checks=1 --partition_filters=0 --prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0 --snapshot_hold_ops=100000 --subcompactions=1 --compaction_pri=3 --target_file_size_base=65536 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1 --user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1 --write_buffer_size=65536 --writepercent=60 -disable_wal=1 Reviewed By: akankshamahajan15 Differential Revision: D39146797 Pulled By: riversand963 fbshipit-source-id: 7fca800026ca6219220100b8b6cf84d907828163
2022-09-08 20:03:07 +00:00
// REQUIRES: smallest and largest are c-style strings ending with '\0'
void Add(int level, uint32_t file_number, const char* smallest,
const char* largest, uint64_t file_size = 1, uint32_t path_id = 0,
SequenceNumber smallest_seq = 100, SequenceNumber largest_seq = 100,
size_t compensated_file_size = 0, bool marked_for_compact = false,
Temperature temperature = Temperature::kUnknown,
Fix overlapping check by excluding timestamp (#10615) Summary: With user-defined timestamp, checking overlapping should exclude timestamp part from key. This has already been done for range checking for files in sstableKeyCompare(), but not yet done when checking with concurrent compactions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10615 Test Plan: (Will add more tests) make check (Repro seems easier with this commit sha: git checkout 78bbdef530bd36fa299d496bd1013cf39d8e203a) rm -rf /dev/shm/rocksdb/* && mkdir /dev/shm/rocksdb/rocksdb_crashtest_expected && ./db_stress --allow_data_in_errors=True --clear_column_family_one_in=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5 --delrangepercent=0 --expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected --iterpercent=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1 --ops_per_thread=1000000 --paranoid_file_checks=1 --partition_filters=0 --prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0 --snapshot_hold_ops=100000 --subcompactions=1 --compaction_pri=3 --target_file_size_base=65536 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1 --user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1 --write_buffer_size=65536 --writepercent=60 -disable_wal=1 Reviewed By: akankshamahajan15 Differential Revision: D39146797 Pulled By: riversand963 fbshipit-source-id: 7fca800026ca6219220100b8b6cf84d907828163
2022-09-08 20:03:07 +00:00
uint64_t oldest_ancestor_time = kUnknownOldestAncesterTime,
Sort L0 files by newly introduced epoch_num (#10922) Summary: **Context:** Sorting L0 files by `largest_seqno` has at least two inconvenience: - File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap. - For example, consider the following sequence of events ("key@n" indicates key at seqno "n") - insert k1@1 to memtable m1 - ingest file s1 with k2@2, ingest file s2 with k3@3 - insert k4@4 to m1 - compact files s1, s2 and result in new file s3 of seqno range [2, 3] - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1 - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption. - Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example) - an existing SST s1 contains only k1@1 - insert k1@2 to memtable m1 - ingest file s2 with k3@3, ingest file s3 with k4@4 - insert single delete k5@5 in m1 - flush m1 and result in new file s4 of seqno range [2, 5] - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4] - compact s4 and result in new file s6 of seqno range [2] due to single delete - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno` Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways: - In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more. - In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption. **Summary:** - Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`. - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`) - Compaction output file is assigned with the minimum `epoch_number` among input files' - Refit level: reuse refitted file's epoch_number - Other paths needing `epoch_number` treatment: - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo` - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`. - Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair). - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder. - Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag` - Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`. - Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR. - Misc: - update existing tests with `epoch_number` so make check will pass - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber() Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922 Test Plan: - `make check` - New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc` - Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930 - [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox` - [Ongoing] normal db stress test - [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 Reviewed By: ajkr Differential Revision: D41063187 Pulled By: hx235 fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2022-12-13 21:29:37 +00:00
Slice ts_of_smallest = Slice(), Slice ts_of_largest = Slice(),
uint64_t epoch_number = kUnknownEpochNumber) {
Fix overlapping check by excluding timestamp (#10615) Summary: With user-defined timestamp, checking overlapping should exclude timestamp part from key. This has already been done for range checking for files in sstableKeyCompare(), but not yet done when checking with concurrent compactions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10615 Test Plan: (Will add more tests) make check (Repro seems easier with this commit sha: git checkout 78bbdef530bd36fa299d496bd1013cf39d8e203a) rm -rf /dev/shm/rocksdb/* && mkdir /dev/shm/rocksdb/rocksdb_crashtest_expected && ./db_stress --allow_data_in_errors=True --clear_column_family_one_in=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5 --delrangepercent=0 --expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected --iterpercent=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1 --ops_per_thread=1000000 --paranoid_file_checks=1 --partition_filters=0 --prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0 --snapshot_hold_ops=100000 --subcompactions=1 --compaction_pri=3 --target_file_size_base=65536 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1 --user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1 --write_buffer_size=65536 --writepercent=60 -disable_wal=1 Reviewed By: akankshamahajan15 Differential Revision: D39146797 Pulled By: riversand963 fbshipit-source-id: 7fca800026ca6219220100b8b6cf84d907828163
2022-09-08 20:03:07 +00:00
assert(ts_of_smallest.size() == ucmp_->timestamp_size());
assert(ts_of_largest.size() == ucmp_->timestamp_size());
VersionStorageInfo* vstorage;
if (temp_vstorage_) {
vstorage = temp_vstorage_.get();
} else {
vstorage = vstorage_.get();
}
assert(level < vstorage->num_levels());
Fix overlapping check by excluding timestamp (#10615) Summary: With user-defined timestamp, checking overlapping should exclude timestamp part from key. This has already been done for range checking for files in sstableKeyCompare(), but not yet done when checking with concurrent compactions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10615 Test Plan: (Will add more tests) make check (Repro seems easier with this commit sha: git checkout 78bbdef530bd36fa299d496bd1013cf39d8e203a) rm -rf /dev/shm/rocksdb/* && mkdir /dev/shm/rocksdb/rocksdb_crashtest_expected && ./db_stress --allow_data_in_errors=True --clear_column_family_one_in=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5 --delrangepercent=0 --expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected --iterpercent=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1 --ops_per_thread=1000000 --paranoid_file_checks=1 --partition_filters=0 --prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0 --snapshot_hold_ops=100000 --subcompactions=1 --compaction_pri=3 --target_file_size_base=65536 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1 --user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1 --write_buffer_size=65536 --writepercent=60 -disable_wal=1 Reviewed By: akankshamahajan15 Differential Revision: D39146797 Pulled By: riversand963 fbshipit-source-id: 7fca800026ca6219220100b8b6cf84d907828163
2022-09-08 20:03:07 +00:00
char* smallest_key_buf = nullptr;
char* largest_key_buf = nullptr;
if (!ts_of_smallest.empty()) {
smallest_key_buf = new char[strlen(smallest) + ucmp_->timestamp_size()];
memcpy(smallest_key_buf, smallest, strlen(smallest));
memcpy(smallest_key_buf + strlen(smallest), ts_of_smallest.data(),
ucmp_->timestamp_size());
largest_key_buf = new char[strlen(largest) + ucmp_->timestamp_size()];
memcpy(largest_key_buf, largest, strlen(largest));
memcpy(largest_key_buf + strlen(largest), ts_of_largest.data(),
ucmp_->timestamp_size());
}
InternalKey smallest_ikey = InternalKey(
smallest_key_buf ? Slice(smallest_key_buf,
ucmp_->timestamp_size() + strlen(smallest))
: smallest,
smallest_seq, kTypeValue);
InternalKey largest_ikey = InternalKey(
largest_key_buf
? Slice(largest_key_buf, ucmp_->timestamp_size() + strlen(largest))
: largest,
largest_seq, kTypeValue);
FileMetaData* f = new FileMetaData(
Fix overlapping check by excluding timestamp (#10615) Summary: With user-defined timestamp, checking overlapping should exclude timestamp part from key. This has already been done for range checking for files in sstableKeyCompare(), but not yet done when checking with concurrent compactions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10615 Test Plan: (Will add more tests) make check (Repro seems easier with this commit sha: git checkout 78bbdef530bd36fa299d496bd1013cf39d8e203a) rm -rf /dev/shm/rocksdb/* && mkdir /dev/shm/rocksdb/rocksdb_crashtest_expected && ./db_stress --allow_data_in_errors=True --clear_column_family_one_in=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5 --delrangepercent=0 --expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected --iterpercent=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1 --ops_per_thread=1000000 --paranoid_file_checks=1 --partition_filters=0 --prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0 --snapshot_hold_ops=100000 --subcompactions=1 --compaction_pri=3 --target_file_size_base=65536 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1 --user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1 --write_buffer_size=65536 --writepercent=60 -disable_wal=1 Reviewed By: akankshamahajan15 Differential Revision: D39146797 Pulled By: riversand963 fbshipit-source-id: 7fca800026ca6219220100b8b6cf84d907828163
2022-09-08 20:03:07 +00:00
file_number, path_id, file_size, smallest_ikey, largest_ikey,
smallest_seq, largest_seq, marked_for_compact, temperature,
kInvalidBlobFileNumber, kUnknownOldestAncesterTime,
Sort L0 files by newly introduced epoch_num (#10922) Summary: **Context:** Sorting L0 files by `largest_seqno` has at least two inconvenience: - File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap. - For example, consider the following sequence of events ("key@n" indicates key at seqno "n") - insert k1@1 to memtable m1 - ingest file s1 with k2@2, ingest file s2 with k3@3 - insert k4@4 to m1 - compact files s1, s2 and result in new file s3 of seqno range [2, 3] - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1 - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption. - Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example) - an existing SST s1 contains only k1@1 - insert k1@2 to memtable m1 - ingest file s2 with k3@3, ingest file s3 with k4@4 - insert single delete k5@5 in m1 - flush m1 and result in new file s4 of seqno range [2, 5] - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4] - compact s4 and result in new file s6 of seqno range [2] due to single delete - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno` Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways: - In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more. - In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption. **Summary:** - Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`. - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`) - Compaction output file is assigned with the minimum `epoch_number` among input files' - Refit level: reuse refitted file's epoch_number - Other paths needing `epoch_number` treatment: - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo` - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`. - Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair). - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder. - Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag` - Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`. - Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR. - Misc: - update existing tests with `epoch_number` so make check will pass - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber() Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922 Test Plan: - `make check` - New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc` - Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930 - [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox` - [Ongoing] normal db stress test - [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 Reviewed By: ajkr Differential Revision: D41063187 Pulled By: hx235 fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2022-12-13 21:29:37 +00:00
kUnknownFileCreationTime, epoch_number, kUnknownFileChecksum,
Include estimated bytes deleted by range tombstones in compensated file size (#10734) Summary: compensate file sizes in compaction picking so files with range tombstones are preferred, such that they get compacted down earlier as they tend to delete a lot of data. This PR adds a `compensated_range_deletion_size` field in FileMeta that is computed during Flush/Compaction and persisted in MANIFEST. This value is added to `compensated_file_size` which will be used for compaction picking. Currently, for a file in level L, `compensated_range_deletion_size` is set to the estimated bytes deleted by range tombstone of this file in all levels > L. This helps to reduce space amp when data in older levels are covered by range tombstones in level L. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10734 Test Plan: - Added unit tests. - benchmark to check if the above definition `compensated_range_deletion_size` is reducing space amp as intended, without affecting write amp too much. The experiment set up favorable for this optimization: large range tombstone issued infrequently. Command used: ``` ./db_bench -benchmarks=fillrandom,waitforcompaction,stats,levelstats -use_existing_db=false -avoid_flush_during_recovery=true -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -max_bytes_for_level_base=134217728 -target_file_size_base=33554432 -writes_per_range_tombstone=500000 -range_tombstone_width=5000000 -num=50000000 -benchmark_write_rate_limit=8388608 -threads=16 -duration=1800 --max_num_range_tombstones=1000000000 ``` In this experiment, each thread wrote 16 range tombstones over the duration of 30 minutes, each range tombstone has width 5M that is the 10% of the key space width. Results shows this PR generates a smaller DB size. Compaction stats from this PR: ``` Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 2/0 31.54 MB 0.5 0.0 0.0 0.0 8.4 8.4 0.0 1.0 0.0 63.4 135.56 110.94 544 0.249 0 0 0.0 0.0 L4 3/0 96.55 MB 0.8 18.5 6.7 11.8 18.4 6.6 0.0 2.7 65.3 64.9 290.08 284.03 108 2.686 284M 1957K 0.0 0.0 L5 15/0 404.41 MB 1.0 19.1 7.7 11.4 18.8 7.4 0.3 2.5 66.6 65.7 292.93 285.34 220 1.332 293M 3808K 0.0 0.0 L6 143/0 4.12 GB 0.0 45.0 7.5 37.5 41.6 4.1 0.0 5.5 71.2 65.9 647.00 632.66 251 2.578 739M 47M 0.0 0.0 Sum 163/0 4.64 GB 0.0 82.6 21.9 60.7 87.2 26.5 0.3 10.4 61.9 65.4 1365.58 1312.97 1123 1.216 1318M 52M 0.0 0.0 ``` Compaction stats from main: ``` Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 0/0 0.00 KB 0.0 0.0 0.0 0.0 8.4 8.4 0.0 1.0 0.0 60.5 142.12 115.89 569 0.250 0 0 0.0 0.0 L4 3/0 85.68 MB 1.0 17.7 6.8 10.9 17.6 6.7 0.0 2.6 62.7 62.3 289.05 281.79 112 2.581 272M 2309K 0.0 0.0 L5 11/0 293.73 MB 1.0 18.8 7.5 11.2 18.5 7.2 0.5 2.5 64.9 63.9 296.07 288.50 220 1.346 288M 4365K 0.0 0.0 L6 130/0 3.94 GB 0.0 51.5 7.6 43.9 47.9 3.9 0.0 6.3 67.2 62.4 784.95 765.92 258 3.042 848M 51M 0.0 0.0 Sum 144/0 4.31 GB 0.0 88.0 21.9 66.0 92.3 26.3 0.5 11.0 59.6 62.5 1512.19 1452.09 1159 1.305 1409M 58M 0.0 0.0``` Reviewed By: ajkr Differential Revision: D39834713 Pulled By: cbi42 fbshipit-source-id: fe9341040b8704a8fbb10cad5cf5c43e962c7e6b
2022-12-29 21:28:24 +00:00
kUnknownFileChecksumFuncName, kNullUniqueId64x2, 0);
f->compensated_file_size =
(compensated_file_size != 0) ? compensated_file_size : file_size;
f->oldest_ancester_time = oldest_ancestor_time;
vstorage->AddFile(level, f);
files_.emplace_back(f);
file_map_.insert({file_number, {f, level}});
Fix overlapping check by excluding timestamp (#10615) Summary: With user-defined timestamp, checking overlapping should exclude timestamp part from key. This has already been done for range checking for files in sstableKeyCompare(), but not yet done when checking with concurrent compactions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10615 Test Plan: (Will add more tests) make check (Repro seems easier with this commit sha: git checkout 78bbdef530bd36fa299d496bd1013cf39d8e203a) rm -rf /dev/shm/rocksdb/* && mkdir /dev/shm/rocksdb/rocksdb_crashtest_expected && ./db_stress --allow_data_in_errors=True --clear_column_family_one_in=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5 --delrangepercent=0 --expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected --iterpercent=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1 --ops_per_thread=1000000 --paranoid_file_checks=1 --partition_filters=0 --prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0 --snapshot_hold_ops=100000 --subcompactions=1 --compaction_pri=3 --target_file_size_base=65536 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1 --user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1 --write_buffer_size=65536 --writepercent=60 -disable_wal=1 Reviewed By: akankshamahajan15 Differential Revision: D39146797 Pulled By: riversand963 fbshipit-source-id: 7fca800026ca6219220100b8b6cf84d907828163
2022-09-08 20:03:07 +00:00
delete[] smallest_key_buf;
delete[] largest_key_buf;
}
void SetCompactionInputFilesLevels(int level_count, int start_level) {
input_files_.resize(level_count);
for (int i = 0; i < level_count; ++i) {
input_files_[i].level = start_level + i;
}
compaction_level_start_ = start_level;
}
void AddToCompactionFiles(uint32_t file_number) {
auto iter = file_map_.find(file_number);
assert(iter != file_map_.end());
int level = iter->second.second;
assert(level < vstorage_->num_levels());
input_files_[level - compaction_level_start_].files.emplace_back(
iter->second.first);
}
void UpdateVersionStorageInfo() {
if (temp_vstorage_) {
VersionBuilder builder(FileOptions(), &ioptions_, nullptr,
vstorage_.get(), nullptr);
ASSERT_OK(builder.SaveTo(temp_vstorage_.get()));
vstorage_ = std::move(temp_vstorage_);
}
Clean up VersionStorageInfo a bit (#9494) Summary: The patch does some cleanup in and around `VersionStorageInfo`: * Renames the method `PrepareApply` to `PrepareAppend` in `Version` to make it clear that it is to be called before appending the `Version` to `VersionSet` (via `AppendVersion`), not before applying any `VersionEdit`s. * Introduces a helper method `VersionStorageInfo::PrepareForVersionAppend` (called by `Version::PrepareAppend`) that encapsulates the population of the various derived data structures in `VersionStorageInfo`, and turns the methods computing the derived structures (`UpdateNumNonEmptyLevels`, `CalculateBaseBytes` etc.) into private helpers. * Changes `Version::PrepareAppend` so it only calls `UpdateAccumulatedStats` if the `update_stats` flag is set. (Earlier, this was checked by the callee.) Related to this, it also moves the call to `ComputeCompensatedSizes` to `VersionStorageInfo::PrepareForVersionAppend`. * Updates and cleans up `version_builder_test`, `version_set_test`, and `compaction_picker_test` so `PrepareForVersionAppend` is called anytime a new `VersionStorageInfo` is set up or saved. This cleanup also involves splitting `VersionStorageInfoTest.MaxBytesForLevelDynamic` into multiple smaller test cases. * Fixes up a bunch of comments that were outdated or just plain incorrect. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9494 Test Plan: Ran `make check` and the crash test script for a while. Reviewed By: riversand963 Differential Revision: D33971666 Pulled By: ltamasi fbshipit-source-id: fda52faac7783041126e4f8dec0fe01bdcadf65a
2022-02-04 16:18:18 +00:00
vstorage_->PrepareForVersionAppend(ioptions_, mutable_cf_options_);
vstorage_->ComputeCompactionScore(ioptions_, mutable_cf_options_);
vstorage_->SetFinalized();
}
private:
Fix overlapping check by excluding timestamp (#10615) Summary: With user-defined timestamp, checking overlapping should exclude timestamp part from key. This has already been done for range checking for files in sstableKeyCompare(), but not yet done when checking with concurrent compactions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10615 Test Plan: (Will add more tests) make check (Repro seems easier with this commit sha: git checkout 78bbdef530bd36fa299d496bd1013cf39d8e203a) rm -rf /dev/shm/rocksdb/* && mkdir /dev/shm/rocksdb/rocksdb_crashtest_expected && ./db_stress --allow_data_in_errors=True --clear_column_family_one_in=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5 --delrangepercent=0 --expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected --iterpercent=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1 --ops_per_thread=1000000 --paranoid_file_checks=1 --partition_filters=0 --prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0 --snapshot_hold_ops=100000 --subcompactions=1 --compaction_pri=3 --target_file_size_base=65536 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1 --user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1 --write_buffer_size=65536 --writepercent=60 -disable_wal=1 Reviewed By: akankshamahajan15 Differential Revision: D39146797 Pulled By: riversand963 fbshipit-source-id: 7fca800026ca6219220100b8b6cf84d907828163
2022-09-08 20:03:07 +00:00
Options CreateOptions(const Comparator* ucmp) const {
Options opts;
opts.comparator = ucmp;
return opts;
}
std::unique_ptr<VersionStorageInfo> temp_vstorage_;
};
Fix overlapping check by excluding timestamp (#10615) Summary: With user-defined timestamp, checking overlapping should exclude timestamp part from key. This has already been done for range checking for files in sstableKeyCompare(), but not yet done when checking with concurrent compactions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10615 Test Plan: (Will add more tests) make check (Repro seems easier with this commit sha: git checkout 78bbdef530bd36fa299d496bd1013cf39d8e203a) rm -rf /dev/shm/rocksdb/* && mkdir /dev/shm/rocksdb/rocksdb_crashtest_expected && ./db_stress --allow_data_in_errors=True --clear_column_family_one_in=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5 --delrangepercent=0 --expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected --iterpercent=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1 --ops_per_thread=1000000 --paranoid_file_checks=1 --partition_filters=0 --prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0 --snapshot_hold_ops=100000 --subcompactions=1 --compaction_pri=3 --target_file_size_base=65536 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1 --user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1 --write_buffer_size=65536 --writepercent=60 -disable_wal=1 Reviewed By: akankshamahajan15 Differential Revision: D39146797 Pulled By: riversand963 fbshipit-source-id: 7fca800026ca6219220100b8b6cf84d907828163
2022-09-08 20:03:07 +00:00
class CompactionPickerTest : public CompactionPickerTestBase {
public:
explicit CompactionPickerTest()
: CompactionPickerTestBase(BytewiseComparator()) {}
~CompactionPickerTest() override {}
};
class CompactionPickerU64TsTest : public CompactionPickerTestBase {
public:
explicit CompactionPickerU64TsTest()
: CompactionPickerTestBase(test::BytewiseComparatorWithU64TsWrapper()) {}
~CompactionPickerU64TsTest() override {}
};
rocksdb: switch to gtest Summary: Our existing test notation is very similar to what is used in gtest. It makes it easy to adopt what is different. In this diff I modify existing [[ https://code.google.com/p/googletest/wiki/Primer#Test_Fixtures:_Using_the_Same_Data_Configuration_for_Multiple_Te | test fixture ]] classes to inherit from `testing::Test`. Also for unit tests that use fixture class, `TEST` is replaced with `TEST_F` as required in gtest. There are several custom `main` functions in our existing tests. To make this transition easier, I modify all `main` functions to fallow gtest notation. But eventually we can remove them and use implementation of `main` that gtest provides. ```lang=bash % cat ~/transform #!/bin/sh files=$(git ls-files '*test\.cc') for file in $files do if grep -q "rocksdb::test::RunAllTests()" $file then if grep -Eq '^class \w+Test {' $file then perl -pi -e 's/^(class \w+Test) {/${1}: public testing::Test {/g' $file perl -pi -e 's/^(TEST)/${1}_F/g' $file fi perl -pi -e 's/(int main.*\{)/${1}::testing::InitGoogleTest(&argc, argv);/g' $file perl -pi -e 's/rocksdb::test::RunAllTests/RUN_ALL_TESTS/g' $file fi done % sh ~/transform % make format ``` Second iteration of this diff contains only scripted changes. Third iteration contains manual changes to fix last errors and make it compilable. Test Plan: Build and notice no errors. ```lang=bash % USE_CLANG=1 make check -j55 ``` Tests are still testing. Reviewers: meyering, sdong, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35157
2015-03-17 21:08:00 +00:00
TEST_F(CompactionPickerTest, Empty) {
NewVersionStorage(6, kCompactionStyleLevel);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() == nullptr);
}
rocksdb: switch to gtest Summary: Our existing test notation is very similar to what is used in gtest. It makes it easy to adopt what is different. In this diff I modify existing [[ https://code.google.com/p/googletest/wiki/Primer#Test_Fixtures:_Using_the_Same_Data_Configuration_for_Multiple_Te | test fixture ]] classes to inherit from `testing::Test`. Also for unit tests that use fixture class, `TEST` is replaced with `TEST_F` as required in gtest. There are several custom `main` functions in our existing tests. To make this transition easier, I modify all `main` functions to fallow gtest notation. But eventually we can remove them and use implementation of `main` that gtest provides. ```lang=bash % cat ~/transform #!/bin/sh files=$(git ls-files '*test\.cc') for file in $files do if grep -q "rocksdb::test::RunAllTests()" $file then if grep -Eq '^class \w+Test {' $file then perl -pi -e 's/^(class \w+Test) {/${1}: public testing::Test {/g' $file perl -pi -e 's/^(TEST)/${1}_F/g' $file fi perl -pi -e 's/(int main.*\{)/${1}::testing::InitGoogleTest(&argc, argv);/g' $file perl -pi -e 's/rocksdb::test::RunAllTests/RUN_ALL_TESTS/g' $file fi done % sh ~/transform % make format ``` Second iteration of this diff contains only scripted changes. Third iteration contains manual changes to fix last errors and make it compilable. Test Plan: Build and notice no errors. ```lang=bash % USE_CLANG=1 make check -j55 ``` Tests are still testing. Reviewers: meyering, sdong, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35157
2015-03-17 21:08:00 +00:00
TEST_F(CompactionPickerTest, Single) {
NewVersionStorage(6, kCompactionStyleLevel);
mutable_cf_options_.level0_file_num_compaction_trigger = 2;
Add(0, 1U, "p", "q");
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() == nullptr);
}
rocksdb: switch to gtest Summary: Our existing test notation is very similar to what is used in gtest. It makes it easy to adopt what is different. In this diff I modify existing [[ https://code.google.com/p/googletest/wiki/Primer#Test_Fixtures:_Using_the_Same_Data_Configuration_for_Multiple_Te | test fixture ]] classes to inherit from `testing::Test`. Also for unit tests that use fixture class, `TEST` is replaced with `TEST_F` as required in gtest. There are several custom `main` functions in our existing tests. To make this transition easier, I modify all `main` functions to fallow gtest notation. But eventually we can remove them and use implementation of `main` that gtest provides. ```lang=bash % cat ~/transform #!/bin/sh files=$(git ls-files '*test\.cc') for file in $files do if grep -q "rocksdb::test::RunAllTests()" $file then if grep -Eq '^class \w+Test {' $file then perl -pi -e 's/^(class \w+Test) {/${1}: public testing::Test {/g' $file perl -pi -e 's/^(TEST)/${1}_F/g' $file fi perl -pi -e 's/(int main.*\{)/${1}::testing::InitGoogleTest(&argc, argv);/g' $file perl -pi -e 's/rocksdb::test::RunAllTests/RUN_ALL_TESTS/g' $file fi done % sh ~/transform % make format ``` Second iteration of this diff contains only scripted changes. Third iteration contains manual changes to fix last errors and make it compilable. Test Plan: Build and notice no errors. ```lang=bash % USE_CLANG=1 make check -j55 ``` Tests are still testing. Reviewers: meyering, sdong, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35157
2015-03-17 21:08:00 +00:00
TEST_F(CompactionPickerTest, Level0Trigger) {
NewVersionStorage(6, kCompactionStyleLevel);
mutable_cf_options_.level0_file_num_compaction_trigger = 2;
Add(0, 1U, "150", "200");
Add(0, 2U, "200", "250");
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(2U, compaction->num_input_files(0));
ASSERT_EQ(1U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(2U, compaction->input(0, 1)->fd.GetNumber());
}
rocksdb: switch to gtest Summary: Our existing test notation is very similar to what is used in gtest. It makes it easy to adopt what is different. In this diff I modify existing [[ https://code.google.com/p/googletest/wiki/Primer#Test_Fixtures:_Using_the_Same_Data_Configuration_for_Multiple_Te | test fixture ]] classes to inherit from `testing::Test`. Also for unit tests that use fixture class, `TEST` is replaced with `TEST_F` as required in gtest. There are several custom `main` functions in our existing tests. To make this transition easier, I modify all `main` functions to fallow gtest notation. But eventually we can remove them and use implementation of `main` that gtest provides. ```lang=bash % cat ~/transform #!/bin/sh files=$(git ls-files '*test\.cc') for file in $files do if grep -q "rocksdb::test::RunAllTests()" $file then if grep -Eq '^class \w+Test {' $file then perl -pi -e 's/^(class \w+Test) {/${1}: public testing::Test {/g' $file perl -pi -e 's/^(TEST)/${1}_F/g' $file fi perl -pi -e 's/(int main.*\{)/${1}::testing::InitGoogleTest(&argc, argv);/g' $file perl -pi -e 's/rocksdb::test::RunAllTests/RUN_ALL_TESTS/g' $file fi done % sh ~/transform % make format ``` Second iteration of this diff contains only scripted changes. Third iteration contains manual changes to fix last errors and make it compilable. Test Plan: Build and notice no errors. ```lang=bash % USE_CLANG=1 make check -j55 ``` Tests are still testing. Reviewers: meyering, sdong, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35157
2015-03-17 21:08:00 +00:00
TEST_F(CompactionPickerTest, Level1Trigger) {
NewVersionStorage(6, kCompactionStyleLevel);
Add(1, 66U, "150", "200", 1000000000U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(1U, compaction->num_input_files(0));
ASSERT_EQ(66U, compaction->input(0, 0)->fd.GetNumber());
}
rocksdb: switch to gtest Summary: Our existing test notation is very similar to what is used in gtest. It makes it easy to adopt what is different. In this diff I modify existing [[ https://code.google.com/p/googletest/wiki/Primer#Test_Fixtures:_Using_the_Same_Data_Configuration_for_Multiple_Te | test fixture ]] classes to inherit from `testing::Test`. Also for unit tests that use fixture class, `TEST` is replaced with `TEST_F` as required in gtest. There are several custom `main` functions in our existing tests. To make this transition easier, I modify all `main` functions to fallow gtest notation. But eventually we can remove them and use implementation of `main` that gtest provides. ```lang=bash % cat ~/transform #!/bin/sh files=$(git ls-files '*test\.cc') for file in $files do if grep -q "rocksdb::test::RunAllTests()" $file then if grep -Eq '^class \w+Test {' $file then perl -pi -e 's/^(class \w+Test) {/${1}: public testing::Test {/g' $file perl -pi -e 's/^(TEST)/${1}_F/g' $file fi perl -pi -e 's/(int main.*\{)/${1}::testing::InitGoogleTest(&argc, argv);/g' $file perl -pi -e 's/rocksdb::test::RunAllTests/RUN_ALL_TESTS/g' $file fi done % sh ~/transform % make format ``` Second iteration of this diff contains only scripted changes. Third iteration contains manual changes to fix last errors and make it compilable. Test Plan: Build and notice no errors. ```lang=bash % USE_CLANG=1 make check -j55 ``` Tests are still testing. Reviewers: meyering, sdong, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35157
2015-03-17 21:08:00 +00:00
TEST_F(CompactionPickerTest, Level1Trigger2) {
mutable_cf_options_.target_file_size_base = 10000000000;
mutable_cf_options_.RefreshDerivedOptions(ioptions_);
NewVersionStorage(6, kCompactionStyleLevel);
Add(1, 66U, "150", "200", 1000000001U);
Add(1, 88U, "201", "300", 1000000000U);
Add(2, 6U, "150", "179", 1000000000U);
Add(2, 7U, "180", "220", 1000000000U);
Add(2, 8U, "221", "300", 1000000000U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(1U, compaction->num_input_files(0));
ASSERT_EQ(2U, compaction->num_input_files(1));
ASSERT_EQ(66U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(6U, compaction->input(1, 0)->fd.GetNumber());
ASSERT_EQ(7U, compaction->input(1, 1)->fd.GetNumber());
ASSERT_EQ(uint64_t{1073741824}, compaction->OutputFilePreallocationSize());
}
rocksdb: switch to gtest Summary: Our existing test notation is very similar to what is used in gtest. It makes it easy to adopt what is different. In this diff I modify existing [[ https://code.google.com/p/googletest/wiki/Primer#Test_Fixtures:_Using_the_Same_Data_Configuration_for_Multiple_Te | test fixture ]] classes to inherit from `testing::Test`. Also for unit tests that use fixture class, `TEST` is replaced with `TEST_F` as required in gtest. There are several custom `main` functions in our existing tests. To make this transition easier, I modify all `main` functions to fallow gtest notation. But eventually we can remove them and use implementation of `main` that gtest provides. ```lang=bash % cat ~/transform #!/bin/sh files=$(git ls-files '*test\.cc') for file in $files do if grep -q "rocksdb::test::RunAllTests()" $file then if grep -Eq '^class \w+Test {' $file then perl -pi -e 's/^(class \w+Test) {/${1}: public testing::Test {/g' $file perl -pi -e 's/^(TEST)/${1}_F/g' $file fi perl -pi -e 's/(int main.*\{)/${1}::testing::InitGoogleTest(&argc, argv);/g' $file perl -pi -e 's/rocksdb::test::RunAllTests/RUN_ALL_TESTS/g' $file fi done % sh ~/transform % make format ``` Second iteration of this diff contains only scripted changes. Third iteration contains manual changes to fix last errors and make it compilable. Test Plan: Build and notice no errors. ```lang=bash % USE_CLANG=1 make check -j55 ``` Tests are still testing. Reviewers: meyering, sdong, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35157
2015-03-17 21:08:00 +00:00
TEST_F(CompactionPickerTest, LevelMaxScore) {
NewVersionStorage(6, kCompactionStyleLevel);
mutable_cf_options_.target_file_size_base = 10000000;
mutable_cf_options_.max_bytes_for_level_base = 10 * 1024 * 1024;
mutable_cf_options_.RefreshDerivedOptions(ioptions_);
Add(0, 1U, "150", "200", 1000000U);
// Level 1 score 1.2
Add(1, 66U, "150", "200", 6000000U);
Add(1, 88U, "201", "300", 6000000U);
// Level 2 score 1.8. File 7 is the largest. Should be picked
Add(2, 6U, "150", "179", 60000000U);
Add(2, 7U, "180", "220", 60000001U);
Add(2, 8U, "221", "300", 60000000U);
// Level 3 score slightly larger than 1
Add(3, 26U, "150", "170", 260000000U);
Add(3, 27U, "171", "179", 260000000U);
Add(3, 28U, "191", "220", 260000000U);
Add(3, 29U, "221", "300", 260000000U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(1U, compaction->num_input_files(0));
ASSERT_EQ(7U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(mutable_cf_options_.target_file_size_base +
mutable_cf_options_.target_file_size_base / 10,
compaction->OutputFilePreallocationSize());
}
rocksdb: switch to gtest Summary: Our existing test notation is very similar to what is used in gtest. It makes it easy to adopt what is different. In this diff I modify existing [[ https://code.google.com/p/googletest/wiki/Primer#Test_Fixtures:_Using_the_Same_Data_Configuration_for_Multiple_Te | test fixture ]] classes to inherit from `testing::Test`. Also for unit tests that use fixture class, `TEST` is replaced with `TEST_F` as required in gtest. There are several custom `main` functions in our existing tests. To make this transition easier, I modify all `main` functions to fallow gtest notation. But eventually we can remove them and use implementation of `main` that gtest provides. ```lang=bash % cat ~/transform #!/bin/sh files=$(git ls-files '*test\.cc') for file in $files do if grep -q "rocksdb::test::RunAllTests()" $file then if grep -Eq '^class \w+Test {' $file then perl -pi -e 's/^(class \w+Test) {/${1}: public testing::Test {/g' $file perl -pi -e 's/^(TEST)/${1}_F/g' $file fi perl -pi -e 's/(int main.*\{)/${1}::testing::InitGoogleTest(&argc, argv);/g' $file perl -pi -e 's/rocksdb::test::RunAllTests/RUN_ALL_TESTS/g' $file fi done % sh ~/transform % make format ``` Second iteration of this diff contains only scripted changes. Third iteration contains manual changes to fix last errors and make it compilable. Test Plan: Build and notice no errors. ```lang=bash % USE_CLANG=1 make check -j55 ``` Tests are still testing. Reviewers: meyering, sdong, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35157
2015-03-17 21:08:00 +00:00
TEST_F(CompactionPickerTest, NeedsCompactionLevel) {
const int kLevels = 6;
const int kFileCount = 20;
for (int level = 0; level < kLevels - 1; ++level) {
NewVersionStorage(kLevels, kCompactionStyleLevel);
uint64_t file_size = vstorage_->MaxBytesForLevel(level) * 2 / kFileCount;
for (int file_count = 1; file_count <= kFileCount; ++file_count) {
// start a brand new version in each test.
NewVersionStorage(kLevels, kCompactionStyleLevel);
for (int i = 0; i < file_count; ++i) {
Add(level, i, std::to_string((i + 100) * 1000).c_str(),
std::to_string((i + 100) * 1000 + 999).c_str(), file_size, 0,
i * 100, i * 100 + 99);
}
UpdateVersionStorageInfo();
ASSERT_EQ(vstorage_->CompactionScoreLevel(0), level);
ASSERT_EQ(level_compaction_picker.NeedsCompaction(vstorage_.get()),
vstorage_->CompactionScore(0) >= 1);
// release the version storage
DeleteVersionStorage();
}
}
}
rocksdb: switch to gtest Summary: Our existing test notation is very similar to what is used in gtest. It makes it easy to adopt what is different. In this diff I modify existing [[ https://code.google.com/p/googletest/wiki/Primer#Test_Fixtures:_Using_the_Same_Data_Configuration_for_Multiple_Te | test fixture ]] classes to inherit from `testing::Test`. Also for unit tests that use fixture class, `TEST` is replaced with `TEST_F` as required in gtest. There are several custom `main` functions in our existing tests. To make this transition easier, I modify all `main` functions to fallow gtest notation. But eventually we can remove them and use implementation of `main` that gtest provides. ```lang=bash % cat ~/transform #!/bin/sh files=$(git ls-files '*test\.cc') for file in $files do if grep -q "rocksdb::test::RunAllTests()" $file then if grep -Eq '^class \w+Test {' $file then perl -pi -e 's/^(class \w+Test) {/${1}: public testing::Test {/g' $file perl -pi -e 's/^(TEST)/${1}_F/g' $file fi perl -pi -e 's/(int main.*\{)/${1}::testing::InitGoogleTest(&argc, argv);/g' $file perl -pi -e 's/rocksdb::test::RunAllTests/RUN_ALL_TESTS/g' $file fi done % sh ~/transform % make format ``` Second iteration of this diff contains only scripted changes. Third iteration contains manual changes to fix last errors and make it compilable. Test Plan: Build and notice no errors. ```lang=bash % USE_CLANG=1 make check -j55 ``` Tests are still testing. Reviewers: meyering, sdong, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35157
2015-03-17 21:08:00 +00:00
TEST_F(CompactionPickerTest, Level0TriggerDynamic) {
int num_levels = ioptions_.num_levels;
ioptions_.level_compaction_dynamic_level_bytes = true;
mutable_cf_options_.level0_file_num_compaction_trigger = 2;
mutable_cf_options_.max_bytes_for_level_base = 200;
mutable_cf_options_.max_bytes_for_level_multiplier = 10;
NewVersionStorage(num_levels, kCompactionStyleLevel);
Add(0, 1U, "150", "200");
Add(0, 2U, "200", "250");
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(2U, compaction->num_input_files(0));
ASSERT_EQ(1U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(2U, compaction->input(0, 1)->fd.GetNumber());
Make Compaction class easier to use Summary: The goal of this diff is to make Compaction class easier to use. This should also make new compaction algorithms easier to write (like CompactFiles from @yhchiang and dynamic leveled and multi-leveled universal from @sdong). Here are couple of things demonstrating that Compaction class is hard to use: 1. we have two constructors of Compaction class 2. there's this thing called grandparents_, but it appears to only be setup for leveled compaction and not compactfiles 3. it's easy to introduce a subtle and dangerous bug like this: D36225 4. SetupBottomMostLevel() is hard to understand and it shouldn't be. See this comment: https://github.com/facebook/rocksdb/blob/afbafeaeaebfd27a0f3e992fee8e0c57d07658fa/db/compaction.cc#L236-L241. It also made it harder for @yhchiang to write CompactFiles, as evidenced by this: https://github.com/facebook/rocksdb/blob/afbafeaeaebfd27a0f3e992fee8e0c57d07658fa/db/compaction_picker.cc#L204-L210 The problem is that we create Compaction object, which holds a lot of state, and then pass it around to some functions. After those functions are done mutating, then we call couple of functions on Compaction object, like SetupBottommostLevel() and MarkFilesBeingCompacted(). It is very hard to see what's happening with all that Compaction's state while it's travelling across different functions. If you're writing a new PickCompaction() function you need to try really hard to understand what are all the functions you need to run on Compaction object and what state you need to setup. My proposed solution is to make important parts of Compaction immutable after construction. PickCompaction() should calculate compaction inputs and then pass them onto Compaction object once they are finalized. That makes it easy to create a new compaction -- just provide all the parameters to the constructor and you're done. No need to call confusing functions after you created your object. This diff doesn't fully achieve that goal, but it comes pretty close. Here are some of the changes: * have one Compaction constructor instead of two. * inputs_ is constant after construction * MarkFilesBeingCompacted() is now private to Compaction class and automatically called on construction/destruction. * SetupBottommostLevel() is gone. Compaction figures it out on its own based on the input. * CompactionPicker's functions are not passing around Compaction object anymore. They are only passing around the state that they need. Test Plan: make check make asan_check make valgrind_check Reviewers: rven, anthony, sdong, yhchiang Reviewed By: yhchiang Subscribers: sdong, yhchiang, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D36687
2015-04-10 22:01:54 +00:00
ASSERT_EQ(1, static_cast<int>(compaction->num_input_levels()));
ASSERT_EQ(num_levels - 1, compaction->output_level());
}
rocksdb: switch to gtest Summary: Our existing test notation is very similar to what is used in gtest. It makes it easy to adopt what is different. In this diff I modify existing [[ https://code.google.com/p/googletest/wiki/Primer#Test_Fixtures:_Using_the_Same_Data_Configuration_for_Multiple_Te | test fixture ]] classes to inherit from `testing::Test`. Also for unit tests that use fixture class, `TEST` is replaced with `TEST_F` as required in gtest. There are several custom `main` functions in our existing tests. To make this transition easier, I modify all `main` functions to fallow gtest notation. But eventually we can remove them and use implementation of `main` that gtest provides. ```lang=bash % cat ~/transform #!/bin/sh files=$(git ls-files '*test\.cc') for file in $files do if grep -q "rocksdb::test::RunAllTests()" $file then if grep -Eq '^class \w+Test {' $file then perl -pi -e 's/^(class \w+Test) {/${1}: public testing::Test {/g' $file perl -pi -e 's/^(TEST)/${1}_F/g' $file fi perl -pi -e 's/(int main.*\{)/${1}::testing::InitGoogleTest(&argc, argv);/g' $file perl -pi -e 's/rocksdb::test::RunAllTests/RUN_ALL_TESTS/g' $file fi done % sh ~/transform % make format ``` Second iteration of this diff contains only scripted changes. Third iteration contains manual changes to fix last errors and make it compilable. Test Plan: Build and notice no errors. ```lang=bash % USE_CLANG=1 make check -j55 ``` Tests are still testing. Reviewers: meyering, sdong, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35157
2015-03-17 21:08:00 +00:00
TEST_F(CompactionPickerTest, Level0TriggerDynamic2) {
int num_levels = ioptions_.num_levels;
ioptions_.level_compaction_dynamic_level_bytes = true;
mutable_cf_options_.level0_file_num_compaction_trigger = 2;
mutable_cf_options_.max_bytes_for_level_base = 200;
mutable_cf_options_.max_bytes_for_level_multiplier = 10;
NewVersionStorage(num_levels, kCompactionStyleLevel);
Add(0, 1U, "150", "200");
Add(0, 2U, "200", "250");
Add(num_levels - 1, 3U, "200", "250", 300U);
UpdateVersionStorageInfo();
ASSERT_EQ(vstorage_->base_level(), num_levels - 2);
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(2U, compaction->num_input_files(0));
ASSERT_EQ(1U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(2U, compaction->input(0, 1)->fd.GetNumber());
Make Compaction class easier to use Summary: The goal of this diff is to make Compaction class easier to use. This should also make new compaction algorithms easier to write (like CompactFiles from @yhchiang and dynamic leveled and multi-leveled universal from @sdong). Here are couple of things demonstrating that Compaction class is hard to use: 1. we have two constructors of Compaction class 2. there's this thing called grandparents_, but it appears to only be setup for leveled compaction and not compactfiles 3. it's easy to introduce a subtle and dangerous bug like this: D36225 4. SetupBottomMostLevel() is hard to understand and it shouldn't be. See this comment: https://github.com/facebook/rocksdb/blob/afbafeaeaebfd27a0f3e992fee8e0c57d07658fa/db/compaction.cc#L236-L241. It also made it harder for @yhchiang to write CompactFiles, as evidenced by this: https://github.com/facebook/rocksdb/blob/afbafeaeaebfd27a0f3e992fee8e0c57d07658fa/db/compaction_picker.cc#L204-L210 The problem is that we create Compaction object, which holds a lot of state, and then pass it around to some functions. After those functions are done mutating, then we call couple of functions on Compaction object, like SetupBottommostLevel() and MarkFilesBeingCompacted(). It is very hard to see what's happening with all that Compaction's state while it's travelling across different functions. If you're writing a new PickCompaction() function you need to try really hard to understand what are all the functions you need to run on Compaction object and what state you need to setup. My proposed solution is to make important parts of Compaction immutable after construction. PickCompaction() should calculate compaction inputs and then pass them onto Compaction object once they are finalized. That makes it easy to create a new compaction -- just provide all the parameters to the constructor and you're done. No need to call confusing functions after you created your object. This diff doesn't fully achieve that goal, but it comes pretty close. Here are some of the changes: * have one Compaction constructor instead of two. * inputs_ is constant after construction * MarkFilesBeingCompacted() is now private to Compaction class and automatically called on construction/destruction. * SetupBottommostLevel() is gone. Compaction figures it out on its own based on the input. * CompactionPicker's functions are not passing around Compaction object anymore. They are only passing around the state that they need. Test Plan: make check make asan_check make valgrind_check Reviewers: rven, anthony, sdong, yhchiang Reviewed By: yhchiang Subscribers: sdong, yhchiang, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D36687
2015-04-10 22:01:54 +00:00
ASSERT_EQ(1, static_cast<int>(compaction->num_input_levels()));
ASSERT_EQ(num_levels - 2, compaction->output_level());
}
rocksdb: switch to gtest Summary: Our existing test notation is very similar to what is used in gtest. It makes it easy to adopt what is different. In this diff I modify existing [[ https://code.google.com/p/googletest/wiki/Primer#Test_Fixtures:_Using_the_Same_Data_Configuration_for_Multiple_Te | test fixture ]] classes to inherit from `testing::Test`. Also for unit tests that use fixture class, `TEST` is replaced with `TEST_F` as required in gtest. There are several custom `main` functions in our existing tests. To make this transition easier, I modify all `main` functions to fallow gtest notation. But eventually we can remove them and use implementation of `main` that gtest provides. ```lang=bash % cat ~/transform #!/bin/sh files=$(git ls-files '*test\.cc') for file in $files do if grep -q "rocksdb::test::RunAllTests()" $file then if grep -Eq '^class \w+Test {' $file then perl -pi -e 's/^(class \w+Test) {/${1}: public testing::Test {/g' $file perl -pi -e 's/^(TEST)/${1}_F/g' $file fi perl -pi -e 's/(int main.*\{)/${1}::testing::InitGoogleTest(&argc, argv);/g' $file perl -pi -e 's/rocksdb::test::RunAllTests/RUN_ALL_TESTS/g' $file fi done % sh ~/transform % make format ``` Second iteration of this diff contains only scripted changes. Third iteration contains manual changes to fix last errors and make it compilable. Test Plan: Build and notice no errors. ```lang=bash % USE_CLANG=1 make check -j55 ``` Tests are still testing. Reviewers: meyering, sdong, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35157
2015-03-17 21:08:00 +00:00
TEST_F(CompactionPickerTest, Level0TriggerDynamic3) {
int num_levels = ioptions_.num_levels;
ioptions_.level_compaction_dynamic_level_bytes = true;
mutable_cf_options_.level0_file_num_compaction_trigger = 2;
mutable_cf_options_.max_bytes_for_level_base = 200;
mutable_cf_options_.max_bytes_for_level_multiplier = 10;
NewVersionStorage(num_levels, kCompactionStyleLevel);
Add(0, 1U, "150", "200");
Add(0, 2U, "200", "250");
Add(num_levels - 1, 3U, "200", "250", 300U);
Add(num_levels - 1, 4U, "300", "350", 3000U);
UpdateVersionStorageInfo();
ASSERT_EQ(vstorage_->base_level(), num_levels - 3);
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(2U, compaction->num_input_files(0));
ASSERT_EQ(1U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(2U, compaction->input(0, 1)->fd.GetNumber());
Make Compaction class easier to use Summary: The goal of this diff is to make Compaction class easier to use. This should also make new compaction algorithms easier to write (like CompactFiles from @yhchiang and dynamic leveled and multi-leveled universal from @sdong). Here are couple of things demonstrating that Compaction class is hard to use: 1. we have two constructors of Compaction class 2. there's this thing called grandparents_, but it appears to only be setup for leveled compaction and not compactfiles 3. it's easy to introduce a subtle and dangerous bug like this: D36225 4. SetupBottomMostLevel() is hard to understand and it shouldn't be. See this comment: https://github.com/facebook/rocksdb/blob/afbafeaeaebfd27a0f3e992fee8e0c57d07658fa/db/compaction.cc#L236-L241. It also made it harder for @yhchiang to write CompactFiles, as evidenced by this: https://github.com/facebook/rocksdb/blob/afbafeaeaebfd27a0f3e992fee8e0c57d07658fa/db/compaction_picker.cc#L204-L210 The problem is that we create Compaction object, which holds a lot of state, and then pass it around to some functions. After those functions are done mutating, then we call couple of functions on Compaction object, like SetupBottommostLevel() and MarkFilesBeingCompacted(). It is very hard to see what's happening with all that Compaction's state while it's travelling across different functions. If you're writing a new PickCompaction() function you need to try really hard to understand what are all the functions you need to run on Compaction object and what state you need to setup. My proposed solution is to make important parts of Compaction immutable after construction. PickCompaction() should calculate compaction inputs and then pass them onto Compaction object once they are finalized. That makes it easy to create a new compaction -- just provide all the parameters to the constructor and you're done. No need to call confusing functions after you created your object. This diff doesn't fully achieve that goal, but it comes pretty close. Here are some of the changes: * have one Compaction constructor instead of two. * inputs_ is constant after construction * MarkFilesBeingCompacted() is now private to Compaction class and automatically called on construction/destruction. * SetupBottommostLevel() is gone. Compaction figures it out on its own based on the input. * CompactionPicker's functions are not passing around Compaction object anymore. They are only passing around the state that they need. Test Plan: make check make asan_check make valgrind_check Reviewers: rven, anthony, sdong, yhchiang Reviewed By: yhchiang Subscribers: sdong, yhchiang, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D36687
2015-04-10 22:01:54 +00:00
ASSERT_EQ(1, static_cast<int>(compaction->num_input_levels()));
ASSERT_EQ(num_levels - 3, compaction->output_level());
}
rocksdb: switch to gtest Summary: Our existing test notation is very similar to what is used in gtest. It makes it easy to adopt what is different. In this diff I modify existing [[ https://code.google.com/p/googletest/wiki/Primer#Test_Fixtures:_Using_the_Same_Data_Configuration_for_Multiple_Te | test fixture ]] classes to inherit from `testing::Test`. Also for unit tests that use fixture class, `TEST` is replaced with `TEST_F` as required in gtest. There are several custom `main` functions in our existing tests. To make this transition easier, I modify all `main` functions to fallow gtest notation. But eventually we can remove them and use implementation of `main` that gtest provides. ```lang=bash % cat ~/transform #!/bin/sh files=$(git ls-files '*test\.cc') for file in $files do if grep -q "rocksdb::test::RunAllTests()" $file then if grep -Eq '^class \w+Test {' $file then perl -pi -e 's/^(class \w+Test) {/${1}: public testing::Test {/g' $file perl -pi -e 's/^(TEST)/${1}_F/g' $file fi perl -pi -e 's/(int main.*\{)/${1}::testing::InitGoogleTest(&argc, argv);/g' $file perl -pi -e 's/rocksdb::test::RunAllTests/RUN_ALL_TESTS/g' $file fi done % sh ~/transform % make format ``` Second iteration of this diff contains only scripted changes. Third iteration contains manual changes to fix last errors and make it compilable. Test Plan: Build and notice no errors. ```lang=bash % USE_CLANG=1 make check -j55 ``` Tests are still testing. Reviewers: meyering, sdong, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35157
2015-03-17 21:08:00 +00:00
TEST_F(CompactionPickerTest, Level0TriggerDynamic4) {
int num_levels = ioptions_.num_levels;
ioptions_.level_compaction_dynamic_level_bytes = true;
mutable_cf_options_.level0_file_num_compaction_trigger = 2;
mutable_cf_options_.max_bytes_for_level_base = 200;
mutable_cf_options_.max_bytes_for_level_multiplier = 10;
NewVersionStorage(num_levels, kCompactionStyleLevel);
Add(0, 1U, "150", "200");
Add(0, 2U, "200", "250");
Add(num_levels - 1, 3U, "200", "250", 300U);
Add(num_levels - 1, 4U, "300", "350", 3000U);
Add(num_levels - 3, 5U, "150", "180", 3U);
Add(num_levels - 3, 6U, "181", "300", 3U);
Add(num_levels - 3, 7U, "400", "450", 3U);
UpdateVersionStorageInfo();
ASSERT_EQ(vstorage_->base_level(), num_levels - 3);
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(2U, compaction->num_input_files(0));
ASSERT_EQ(1U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(2U, compaction->input(0, 1)->fd.GetNumber());
Make Compaction class easier to use Summary: The goal of this diff is to make Compaction class easier to use. This should also make new compaction algorithms easier to write (like CompactFiles from @yhchiang and dynamic leveled and multi-leveled universal from @sdong). Here are couple of things demonstrating that Compaction class is hard to use: 1. we have two constructors of Compaction class 2. there's this thing called grandparents_, but it appears to only be setup for leveled compaction and not compactfiles 3. it's easy to introduce a subtle and dangerous bug like this: D36225 4. SetupBottomMostLevel() is hard to understand and it shouldn't be. See this comment: https://github.com/facebook/rocksdb/blob/afbafeaeaebfd27a0f3e992fee8e0c57d07658fa/db/compaction.cc#L236-L241. It also made it harder for @yhchiang to write CompactFiles, as evidenced by this: https://github.com/facebook/rocksdb/blob/afbafeaeaebfd27a0f3e992fee8e0c57d07658fa/db/compaction_picker.cc#L204-L210 The problem is that we create Compaction object, which holds a lot of state, and then pass it around to some functions. After those functions are done mutating, then we call couple of functions on Compaction object, like SetupBottommostLevel() and MarkFilesBeingCompacted(). It is very hard to see what's happening with all that Compaction's state while it's travelling across different functions. If you're writing a new PickCompaction() function you need to try really hard to understand what are all the functions you need to run on Compaction object and what state you need to setup. My proposed solution is to make important parts of Compaction immutable after construction. PickCompaction() should calculate compaction inputs and then pass them onto Compaction object once they are finalized. That makes it easy to create a new compaction -- just provide all the parameters to the constructor and you're done. No need to call confusing functions after you created your object. This diff doesn't fully achieve that goal, but it comes pretty close. Here are some of the changes: * have one Compaction constructor instead of two. * inputs_ is constant after construction * MarkFilesBeingCompacted() is now private to Compaction class and automatically called on construction/destruction. * SetupBottommostLevel() is gone. Compaction figures it out on its own based on the input. * CompactionPicker's functions are not passing around Compaction object anymore. They are only passing around the state that they need. Test Plan: make check make asan_check make valgrind_check Reviewers: rven, anthony, sdong, yhchiang Reviewed By: yhchiang Subscribers: sdong, yhchiang, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D36687
2015-04-10 22:01:54 +00:00
ASSERT_EQ(2U, compaction->num_input_files(1));
ASSERT_EQ(num_levels - 3, compaction->level(1));
ASSERT_EQ(5U, compaction->input(1, 0)->fd.GetNumber());
ASSERT_EQ(6U, compaction->input(1, 1)->fd.GetNumber());
ASSERT_EQ(2, static_cast<int>(compaction->num_input_levels()));
ASSERT_EQ(num_levels - 3, compaction->output_level());
}
rocksdb: switch to gtest Summary: Our existing test notation is very similar to what is used in gtest. It makes it easy to adopt what is different. In this diff I modify existing [[ https://code.google.com/p/googletest/wiki/Primer#Test_Fixtures:_Using_the_Same_Data_Configuration_for_Multiple_Te | test fixture ]] classes to inherit from `testing::Test`. Also for unit tests that use fixture class, `TEST` is replaced with `TEST_F` as required in gtest. There are several custom `main` functions in our existing tests. To make this transition easier, I modify all `main` functions to fallow gtest notation. But eventually we can remove them and use implementation of `main` that gtest provides. ```lang=bash % cat ~/transform #!/bin/sh files=$(git ls-files '*test\.cc') for file in $files do if grep -q "rocksdb::test::RunAllTests()" $file then if grep -Eq '^class \w+Test {' $file then perl -pi -e 's/^(class \w+Test) {/${1}: public testing::Test {/g' $file perl -pi -e 's/^(TEST)/${1}_F/g' $file fi perl -pi -e 's/(int main.*\{)/${1}::testing::InitGoogleTest(&argc, argv);/g' $file perl -pi -e 's/rocksdb::test::RunAllTests/RUN_ALL_TESTS/g' $file fi done % sh ~/transform % make format ``` Second iteration of this diff contains only scripted changes. Third iteration contains manual changes to fix last errors and make it compilable. Test Plan: Build and notice no errors. ```lang=bash % USE_CLANG=1 make check -j55 ``` Tests are still testing. Reviewers: meyering, sdong, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35157
2015-03-17 21:08:00 +00:00
TEST_F(CompactionPickerTest, LevelTriggerDynamic4) {
int num_levels = ioptions_.num_levels;
ioptions_.level_compaction_dynamic_level_bytes = true;
ioptions_.compaction_pri = kMinOverlappingRatio;
mutable_cf_options_.level0_file_num_compaction_trigger = 2;
mutable_cf_options_.max_bytes_for_level_base = 200;
mutable_cf_options_.max_bytes_for_level_multiplier = 10;
NewVersionStorage(num_levels, kCompactionStyleLevel);
Add(0, 1U, "150", "200");
Add(num_levels - 1, 2U, "200", "250", 300U);
Add(num_levels - 1, 3U, "300", "350", 3000U);
Add(num_levels - 1, 4U, "400", "450", 3U);
Add(num_levels - 2, 5U, "150", "180", 300U);
Add(num_levels - 2, 6U, "181", "350", 500U);
Add(num_levels - 2, 7U, "400", "450", 200U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(1U, compaction->num_input_files(0));
ASSERT_EQ(5U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(0, compaction->num_input_files(1));
ASSERT_EQ(1U, compaction->num_input_levels());
ASSERT_EQ(num_levels - 1, compaction->output_level());
}
// Universal and FIFO Compactions are not supported in ROCKSDB_LITE
#ifndef ROCKSDB_LITE
rocksdb: switch to gtest Summary: Our existing test notation is very similar to what is used in gtest. It makes it easy to adopt what is different. In this diff I modify existing [[ https://code.google.com/p/googletest/wiki/Primer#Test_Fixtures:_Using_the_Same_Data_Configuration_for_Multiple_Te | test fixture ]] classes to inherit from `testing::Test`. Also for unit tests that use fixture class, `TEST` is replaced with `TEST_F` as required in gtest. There are several custom `main` functions in our existing tests. To make this transition easier, I modify all `main` functions to fallow gtest notation. But eventually we can remove them and use implementation of `main` that gtest provides. ```lang=bash % cat ~/transform #!/bin/sh files=$(git ls-files '*test\.cc') for file in $files do if grep -q "rocksdb::test::RunAllTests()" $file then if grep -Eq '^class \w+Test {' $file then perl -pi -e 's/^(class \w+Test) {/${1}: public testing::Test {/g' $file perl -pi -e 's/^(TEST)/${1}_F/g' $file fi perl -pi -e 's/(int main.*\{)/${1}::testing::InitGoogleTest(&argc, argv);/g' $file perl -pi -e 's/rocksdb::test::RunAllTests/RUN_ALL_TESTS/g' $file fi done % sh ~/transform % make format ``` Second iteration of this diff contains only scripted changes. Third iteration contains manual changes to fix last errors and make it compilable. Test Plan: Build and notice no errors. ```lang=bash % USE_CLANG=1 make check -j55 ``` Tests are still testing. Reviewers: meyering, sdong, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35157
2015-03-17 21:08:00 +00:00
TEST_F(CompactionPickerTest, NeedsCompactionUniversal) {
NewVersionStorage(1, kCompactionStyleUniversal);
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
UpdateVersionStorageInfo();
// must return false when there's no files.
ASSERT_EQ(universal_compaction_picker.NeedsCompaction(vstorage_.get()),
false);
// verify the trigger given different number of L0 files.
for (int i = 1;
i <= mutable_cf_options_.level0_file_num_compaction_trigger * 2; ++i) {
NewVersionStorage(1, kCompactionStyleUniversal);
Add(0, i, std::to_string((i + 100) * 1000).c_str(),
std::to_string((i + 100) * 1000 + 999).c_str(), 1000000, 0, i * 100,
i * 100 + 99);
UpdateVersionStorageInfo();
ASSERT_EQ(level_compaction_picker.NeedsCompaction(vstorage_.get()),
vstorage_->CompactionScore(0) >= 1);
}
}
TEST_F(CompactionPickerTest, CompactionUniversalIngestBehindReservedLevel) {
const uint64_t kFileSize = 100000;
NewVersionStorage(1, kCompactionStyleUniversal);
ioptions_.allow_ingest_behind = true;
ioptions_.num_levels = 3;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
UpdateVersionStorageInfo();
// must return false when there's no files.
ASSERT_EQ(universal_compaction_picker.NeedsCompaction(vstorage_.get()),
false);
NewVersionStorage(3, kCompactionStyleUniversal);
Add(0, 1U, "150", "200", kFileSize, 0, 500, 550);
Add(0, 2U, "201", "250", kFileSize, 0, 401, 450);
Add(0, 4U, "260", "300", kFileSize, 0, 260, 300);
Add(1, 5U, "100", "151", kFileSize, 0, 200, 251);
Add(1, 3U, "301", "350", kFileSize, 0, 101, 150);
Add(2, 6U, "120", "200", kFileSize, 0, 20, 100);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
// output level should be the one above the bottom-most
ASSERT_EQ(1, compaction->output_level());
}
// Tests if the files can be trivially moved in multi level
// universal compaction when allow_trivial_move option is set
// In this test as the input files overlaps, they cannot
// be trivially moved.
TEST_F(CompactionPickerTest, CannotTrivialMoveUniversal) {
const uint64_t kFileSize = 100000;
mutable_cf_options_.compaction_options_universal.allow_trivial_move = true;
NewVersionStorage(1, kCompactionStyleUniversal);
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
UpdateVersionStorageInfo();
// must return false when there's no files.
ASSERT_EQ(universal_compaction_picker.NeedsCompaction(vstorage_.get()),
false);
NewVersionStorage(3, kCompactionStyleUniversal);
Add(0, 1U, "150", "200", kFileSize, 0, 500, 550);
Add(0, 2U, "201", "250", kFileSize, 0, 401, 450);
Add(0, 4U, "260", "300", kFileSize, 0, 260, 300);
Add(1, 5U, "100", "151", kFileSize, 0, 200, 251);
Add(1, 3U, "301", "350", kFileSize, 0, 101, 150);
Add(2, 6U, "120", "200", kFileSize, 0, 20, 100);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(!compaction->is_trivial_move());
}
// Tests if the files can be trivially moved in multi level
// universal compaction when allow_trivial_move option is set
// In this test as the input files doesn't overlaps, they should
// be trivially moved.
TEST_F(CompactionPickerTest, AllowsTrivialMoveUniversal) {
const uint64_t kFileSize = 100000;
mutable_cf_options_.compaction_options_universal.allow_trivial_move = true;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
NewVersionStorage(3, kCompactionStyleUniversal);
Add(0, 1U, "150", "200", kFileSize, 0, 500, 550);
Add(0, 2U, "201", "250", kFileSize, 0, 401, 450);
Add(0, 4U, "260", "300", kFileSize, 0, 260, 300);
Add(1, 5U, "010", "080", kFileSize, 0, 200, 251);
Add(2, 3U, "301", "350", kFileSize, 0, 101, 150);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction->is_trivial_move());
}
TEST_F(CompactionPickerTest, UniversalPeriodicCompaction1) {
// The case where universal periodic compaction can be picked
// with some newer files being compacted.
const uint64_t kFileSize = 100000;
mutable_cf_options_.periodic_compaction_seconds = 1000;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
NewVersionStorage(5, kCompactionStyleUniversal);
Add(0, 1U, "150", "200", kFileSize, 0, 500, 550);
Add(0, 2U, "201", "250", kFileSize, 0, 401, 450);
Add(0, 4U, "260", "300", kFileSize, 0, 260, 300);
Add(3, 5U, "010", "080", kFileSize, 0, 200, 251);
Add(4, 3U, "301", "350", kFileSize, 0, 101, 150);
Add(4, 6U, "501", "750", kFileSize, 0, 101, 150);
file_map_[2].first->being_compacted = true;
UpdateVersionStorageInfo();
vstorage_->TEST_AddFileMarkedForPeriodicCompaction(4, file_map_[3].first);
std::unique_ptr<Compaction> compaction(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction);
ASSERT_EQ(4, compaction->output_level());
ASSERT_EQ(0, compaction->start_level());
ASSERT_EQ(1U, compaction->num_input_files(0));
}
TEST_F(CompactionPickerTest, UniversalPeriodicCompaction2) {
// The case where universal periodic compaction does not
// pick up only level to compact if it doesn't cover
// any file marked as periodic compaction.
const uint64_t kFileSize = 100000;
mutable_cf_options_.periodic_compaction_seconds = 1000;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
NewVersionStorage(5, kCompactionStyleUniversal);
Add(0, 1U, "150", "200", kFileSize, 0, 500, 550);
Add(3, 5U, "010", "080", kFileSize, 0, 200, 251);
Add(4, 3U, "301", "350", kFileSize, 0, 101, 150);
Add(4, 6U, "501", "750", kFileSize, 0, 101, 150);
file_map_[5].first->being_compacted = true;
UpdateVersionStorageInfo();
vstorage_->TEST_AddFileMarkedForPeriodicCompaction(0, file_map_[1].first);
std::unique_ptr<Compaction> compaction(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_FALSE(compaction);
}
TEST_F(CompactionPickerTest, UniversalPeriodicCompaction3) {
// The case where universal periodic compaction does not
// pick up only the last sorted run which is an L0 file if it isn't
// marked as periodic compaction.
const uint64_t kFileSize = 100000;
mutable_cf_options_.periodic_compaction_seconds = 1000;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
NewVersionStorage(5, kCompactionStyleUniversal);
Add(0, 1U, "150", "200", kFileSize, 0, 500, 550);
Add(0, 5U, "010", "080", kFileSize, 0, 200, 251);
Add(0, 6U, "501", "750", kFileSize, 0, 101, 150);
file_map_[5].first->being_compacted = true;
UpdateVersionStorageInfo();
vstorage_->TEST_AddFileMarkedForPeriodicCompaction(0, file_map_[1].first);
std::unique_ptr<Compaction> compaction(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_FALSE(compaction);
}
TEST_F(CompactionPickerTest, UniversalPeriodicCompaction4) {
// The case where universal periodic compaction couldn't form
// a compaction that includes any file marked for periodic compaction.
// Right now we form the compaction anyway if it is more than one
// sorted run. Just put the case here to validate that it doesn't
// crash.
const uint64_t kFileSize = 100000;
mutable_cf_options_.periodic_compaction_seconds = 1000;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
NewVersionStorage(5, kCompactionStyleUniversal);
Add(0, 1U, "150", "200", kFileSize, 0, 500, 550);
Add(2, 2U, "010", "080", kFileSize, 0, 200, 251);
Add(3, 5U, "010", "080", kFileSize, 0, 200, 251);
Add(4, 3U, "301", "350", kFileSize, 0, 101, 150);
Add(4, 6U, "501", "750", kFileSize, 0, 101, 150);
file_map_[2].first->being_compacted = true;
UpdateVersionStorageInfo();
vstorage_->TEST_AddFileMarkedForPeriodicCompaction(0, file_map_[2].first);
std::unique_ptr<Compaction> compaction(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(!compaction ||
compaction->start_level() != compaction->output_level());
}
TEST_F(CompactionPickerTest, UniversalPeriodicCompaction5) {
// Test single L0 file periodic compaction triggering.
const uint64_t kFileSize = 100000;
mutable_cf_options_.periodic_compaction_seconds = 1000;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
NewVersionStorage(5, kCompactionStyleUniversal);
Add(0, 6U, "150", "200", kFileSize, 0, 500, 550);
UpdateVersionStorageInfo();
vstorage_->TEST_AddFileMarkedForPeriodicCompaction(0, file_map_[6].first);
std::unique_ptr<Compaction> compaction(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction);
ASSERT_EQ(0, compaction->start_level());
ASSERT_EQ(1U, compaction->num_input_files(0));
ASSERT_EQ(6U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(4, compaction->output_level());
}
TEST_F(CompactionPickerTest, UniversalPeriodicCompaction6) {
// Test single sorted run non-L0 periodic compaction
const uint64_t kFileSize = 100000;
mutable_cf_options_.periodic_compaction_seconds = 1000;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
NewVersionStorage(5, kCompactionStyleUniversal);
Add(4, 5U, "150", "200", kFileSize, 0, 500, 550);
Add(4, 6U, "350", "400", kFileSize, 0, 500, 550);
UpdateVersionStorageInfo();
vstorage_->TEST_AddFileMarkedForPeriodicCompaction(4, file_map_[6].first);
std::unique_ptr<Compaction> compaction(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction);
ASSERT_EQ(4, compaction->start_level());
ASSERT_EQ(2U, compaction->num_input_files(0));
ASSERT_EQ(5U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(6U, compaction->input(0, 1)->fd.GetNumber());
ASSERT_EQ(4, compaction->output_level());
}
Incremental Space Amp Compactions in Universal Style (#8655) Summary: This commit introduces incremental compaction in univeral style for space amplification. This follows the first improvement mentioned in https://rocksdb.org/blog/2021/04/12/universal-improvements.html . The implemention simply picks up files about size of max_compaction_bytes to compact and execute if the penalty is not too big. More optimizations can be done in the future, e.g. prioritizing between this compaction and other types. But for now, the feature is supposed to be functional and can often reduce frequency of full compactions, although it can introduce penalty. In order to add cut files more efficiently so that more files from upper levels can be included, SST file cutting threshold (for current file + overlapping parent level files) is set to 1.5X of target file size. A 2MB target file size will generate files like this: https://gist.github.com/siying/29d2676fba417404f3c95e6c013c7de8 Number of files indeed increases but it is not out of control. Two set of write benchmarks are run: 1. For ingestion rate limited scenario, we can see full compaction is mostly eliminated: https://gist.github.com/siying/959bc1186066906831cf4c808d6e0a19 . The write amp increased from 7.7 to 9.4, as expected. After applying file cutting, the number is improved to 8.9. In another benchmark, the write amp is even better with the incremental approach: https://gist.github.com/siying/d1c16c286d7c59c4d7bba718ca198163 2. For ingestion rate unlimited scenario, incremental compaction turns out to be too expensive most of the time and is not executed, as expected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8655 Test Plan: Add unit tests to the functionality. Reviewed By: ajkr Differential Revision: D31787034 fbshipit-source-id: ce813e63b15a61d5a56e97bf8902a1b28e011beb
2021-10-20 17:03:03 +00:00
TEST_F(CompactionPickerTest, UniversalIncrementalSpace1) {
const uint64_t kFileSize = 100000;
mutable_cf_options_.max_compaction_bytes = 555555;
mutable_cf_options_.compaction_options_universal.incremental = true;
mutable_cf_options_.compaction_options_universal
.max_size_amplification_percent = 30;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
NewVersionStorage(5, kCompactionStyleUniversal);
Add(0, 1U, "150", "200", kFileSize, 0, 500, 550);
Add(2, 2U, "010", "080", kFileSize, 0, 200, 251);
Add(3, 5U, "310", "380", kFileSize, 0, 200, 251);
Add(3, 6U, "410", "880", kFileSize, 0, 200, 251);
Add(3, 7U, "910", "980", 1, 0, 200, 251);
Add(4, 10U, "201", "250", kFileSize, 0, 101, 150);
Add(4, 11U, "301", "350", kFileSize, 0, 101, 150);
Add(4, 12U, "401", "450", kFileSize, 0, 101, 150);
Add(4, 13U, "501", "750", kFileSize, 0, 101, 150);
Add(4, 14U, "801", "850", kFileSize, 0, 101, 150);
Add(4, 15U, "901", "950", kFileSize, 0, 101, 150);
// Add(4, 15U, "960", "970", kFileSize, 0, 101, 150);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
Incremental Space Amp Compactions in Universal Style (#8655) Summary: This commit introduces incremental compaction in univeral style for space amplification. This follows the first improvement mentioned in https://rocksdb.org/blog/2021/04/12/universal-improvements.html . The implemention simply picks up files about size of max_compaction_bytes to compact and execute if the penalty is not too big. More optimizations can be done in the future, e.g. prioritizing between this compaction and other types. But for now, the feature is supposed to be functional and can often reduce frequency of full compactions, although it can introduce penalty. In order to add cut files more efficiently so that more files from upper levels can be included, SST file cutting threshold (for current file + overlapping parent level files) is set to 1.5X of target file size. A 2MB target file size will generate files like this: https://gist.github.com/siying/29d2676fba417404f3c95e6c013c7de8 Number of files indeed increases but it is not out of control. Two set of write benchmarks are run: 1. For ingestion rate limited scenario, we can see full compaction is mostly eliminated: https://gist.github.com/siying/959bc1186066906831cf4c808d6e0a19 . The write amp increased from 7.7 to 9.4, as expected. After applying file cutting, the number is improved to 8.9. In another benchmark, the write amp is even better with the incremental approach: https://gist.github.com/siying/d1c16c286d7c59c4d7bba718ca198163 2. For ingestion rate unlimited scenario, incremental compaction turns out to be too expensive most of the time and is not executed, as expected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8655 Test Plan: Add unit tests to the functionality. Reviewed By: ajkr Differential Revision: D31787034 fbshipit-source-id: ce813e63b15a61d5a56e97bf8902a1b28e011beb
2021-10-20 17:03:03 +00:00
ASSERT_TRUE(compaction);
ASSERT_EQ(4, compaction->output_level());
ASSERT_EQ(3, compaction->start_level());
ASSERT_EQ(2U, compaction->num_input_files(0));
ASSERT_EQ(5U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(6U, compaction->input(0, 1)->fd.GetNumber());
// ASSERT_EQ(4U, compaction->num_input_files(1));
ASSERT_EQ(11U, compaction->input(1, 0)->fd.GetNumber());
ASSERT_EQ(12U, compaction->input(1, 1)->fd.GetNumber());
ASSERT_EQ(13U, compaction->input(1, 2)->fd.GetNumber());
ASSERT_EQ(14U, compaction->input(1, 3)->fd.GetNumber());
}
TEST_F(CompactionPickerTest, UniversalIncrementalSpace2) {
const uint64_t kFileSize = 100000;
mutable_cf_options_.max_compaction_bytes = 400000;
mutable_cf_options_.compaction_options_universal.incremental = true;
mutable_cf_options_.compaction_options_universal
.max_size_amplification_percent = 30;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
NewVersionStorage(5, kCompactionStyleUniversal);
Add(0, 1U, "150", "200", kFileSize, 0, 500, 550);
Add(1, 2U, "010", "080", kFileSize, 0, 200, 251);
Add(2, 5U, "310", "380", kFileSize, 0, 200, 251);
Add(2, 6U, "410", "880", kFileSize, 0, 200, 251);
Add(2, 7U, "910", "980", kFileSize, 0, 200, 251);
Add(4, 10U, "201", "250", kFileSize, 0, 101, 150);
Add(4, 11U, "301", "350", kFileSize, 0, 101, 150);
Add(4, 12U, "401", "450", kFileSize, 0, 101, 150);
Add(4, 13U, "501", "750", kFileSize, 0, 101, 150);
Add(4, 14U, "801", "850", kFileSize, 0, 101, 150);
Add(4, 15U, "901", "950", kFileSize, 0, 101, 150);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
Incremental Space Amp Compactions in Universal Style (#8655) Summary: This commit introduces incremental compaction in univeral style for space amplification. This follows the first improvement mentioned in https://rocksdb.org/blog/2021/04/12/universal-improvements.html . The implemention simply picks up files about size of max_compaction_bytes to compact and execute if the penalty is not too big. More optimizations can be done in the future, e.g. prioritizing between this compaction and other types. But for now, the feature is supposed to be functional and can often reduce frequency of full compactions, although it can introduce penalty. In order to add cut files more efficiently so that more files from upper levels can be included, SST file cutting threshold (for current file + overlapping parent level files) is set to 1.5X of target file size. A 2MB target file size will generate files like this: https://gist.github.com/siying/29d2676fba417404f3c95e6c013c7de8 Number of files indeed increases but it is not out of control. Two set of write benchmarks are run: 1. For ingestion rate limited scenario, we can see full compaction is mostly eliminated: https://gist.github.com/siying/959bc1186066906831cf4c808d6e0a19 . The write amp increased from 7.7 to 9.4, as expected. After applying file cutting, the number is improved to 8.9. In another benchmark, the write amp is even better with the incremental approach: https://gist.github.com/siying/d1c16c286d7c59c4d7bba718ca198163 2. For ingestion rate unlimited scenario, incremental compaction turns out to be too expensive most of the time and is not executed, as expected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8655 Test Plan: Add unit tests to the functionality. Reviewed By: ajkr Differential Revision: D31787034 fbshipit-source-id: ce813e63b15a61d5a56e97bf8902a1b28e011beb
2021-10-20 17:03:03 +00:00
ASSERT_TRUE(compaction);
ASSERT_EQ(4, compaction->output_level());
ASSERT_EQ(2, compaction->start_level());
ASSERT_EQ(1U, compaction->num_input_files(0));
ASSERT_EQ(7U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(1U, compaction->num_input_files(1));
ASSERT_EQ(15U, compaction->input(1, 0)->fd.GetNumber());
}
TEST_F(CompactionPickerTest, UniversalIncrementalSpace3) {
// Test bottom level files falling between gaps between two upper level
// files
const uint64_t kFileSize = 100000;
mutable_cf_options_.max_compaction_bytes = 300000;
mutable_cf_options_.compaction_options_universal.incremental = true;
mutable_cf_options_.compaction_options_universal
.max_size_amplification_percent = 30;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
NewVersionStorage(5, kCompactionStyleUniversal);
Add(0, 1U, "150", "200", kFileSize, 0, 500, 550);
Add(2, 2U, "010", "080", kFileSize, 0, 200, 251);
Add(3, 5U, "000", "180", kFileSize, 0, 200, 251);
Add(3, 6U, "181", "190", kFileSize, 0, 200, 251);
Add(3, 7U, "710", "810", kFileSize, 0, 200, 251);
Add(3, 8U, "820", "830", kFileSize, 0, 200, 251);
Add(3, 9U, "900", "991", kFileSize, 0, 200, 251);
Add(4, 10U, "201", "250", kFileSize, 0, 101, 150);
Add(4, 11U, "301", "350", kFileSize, 0, 101, 150);
Add(4, 12U, "401", "450", kFileSize, 0, 101, 150);
Add(4, 13U, "501", "750", kFileSize, 0, 101, 150);
Add(4, 14U, "801", "850", kFileSize, 0, 101, 150);
Add(4, 15U, "901", "950", kFileSize, 0, 101, 150);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
Incremental Space Amp Compactions in Universal Style (#8655) Summary: This commit introduces incremental compaction in univeral style for space amplification. This follows the first improvement mentioned in https://rocksdb.org/blog/2021/04/12/universal-improvements.html . The implemention simply picks up files about size of max_compaction_bytes to compact and execute if the penalty is not too big. More optimizations can be done in the future, e.g. prioritizing between this compaction and other types. But for now, the feature is supposed to be functional and can often reduce frequency of full compactions, although it can introduce penalty. In order to add cut files more efficiently so that more files from upper levels can be included, SST file cutting threshold (for current file + overlapping parent level files) is set to 1.5X of target file size. A 2MB target file size will generate files like this: https://gist.github.com/siying/29d2676fba417404f3c95e6c013c7de8 Number of files indeed increases but it is not out of control. Two set of write benchmarks are run: 1. For ingestion rate limited scenario, we can see full compaction is mostly eliminated: https://gist.github.com/siying/959bc1186066906831cf4c808d6e0a19 . The write amp increased from 7.7 to 9.4, as expected. After applying file cutting, the number is improved to 8.9. In another benchmark, the write amp is even better with the incremental approach: https://gist.github.com/siying/d1c16c286d7c59c4d7bba718ca198163 2. For ingestion rate unlimited scenario, incremental compaction turns out to be too expensive most of the time and is not executed, as expected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8655 Test Plan: Add unit tests to the functionality. Reviewed By: ajkr Differential Revision: D31787034 fbshipit-source-id: ce813e63b15a61d5a56e97bf8902a1b28e011beb
2021-10-20 17:03:03 +00:00
ASSERT_TRUE(compaction);
ASSERT_EQ(4, compaction->output_level());
ASSERT_EQ(2, compaction->start_level());
ASSERT_EQ(1U, compaction->num_input_files(0));
ASSERT_EQ(2U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(2U, compaction->num_input_files(1));
ASSERT_EQ(5U, compaction->input(1, 0)->fd.GetNumber());
ASSERT_EQ(6U, compaction->input(1, 1)->fd.GetNumber());
ASSERT_EQ(0, compaction->num_input_files(2));
}
TEST_F(CompactionPickerTest, UniversalIncrementalSpace4) {
// Test compaction candidates always cover many files.
const uint64_t kFileSize = 100000;
mutable_cf_options_.max_compaction_bytes = 3200000;
mutable_cf_options_.compaction_options_universal.incremental = true;
mutable_cf_options_.compaction_options_universal
.max_size_amplification_percent = 30;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
NewVersionStorage(5, kCompactionStyleUniversal);
Add(0, 1U, "150", "200", kFileSize, 0, 500, 550);
Add(2, 2U, "010", "080", kFileSize, 0, 200, 251);
// Generate files like following:
// L3: (1101, 1180) (1201, 1280) ... (7901, 7908)
// L4: (1130, 1150) (1160, 1210) (1230, 1250) (1260 1310) ... (7960, 8010)
for (int i = 11; i < 79; i++) {
Add(3, 100 + i * 3, std::to_string(i * 100).c_str(),
std::to_string(i * 100 + 80).c_str(), kFileSize, 0, 200, 251);
Incremental Space Amp Compactions in Universal Style (#8655) Summary: This commit introduces incremental compaction in univeral style for space amplification. This follows the first improvement mentioned in https://rocksdb.org/blog/2021/04/12/universal-improvements.html . The implemention simply picks up files about size of max_compaction_bytes to compact and execute if the penalty is not too big. More optimizations can be done in the future, e.g. prioritizing between this compaction and other types. But for now, the feature is supposed to be functional and can often reduce frequency of full compactions, although it can introduce penalty. In order to add cut files more efficiently so that more files from upper levels can be included, SST file cutting threshold (for current file + overlapping parent level files) is set to 1.5X of target file size. A 2MB target file size will generate files like this: https://gist.github.com/siying/29d2676fba417404f3c95e6c013c7de8 Number of files indeed increases but it is not out of control. Two set of write benchmarks are run: 1. For ingestion rate limited scenario, we can see full compaction is mostly eliminated: https://gist.github.com/siying/959bc1186066906831cf4c808d6e0a19 . The write amp increased from 7.7 to 9.4, as expected. After applying file cutting, the number is improved to 8.9. In another benchmark, the write amp is even better with the incremental approach: https://gist.github.com/siying/d1c16c286d7c59c4d7bba718ca198163 2. For ingestion rate unlimited scenario, incremental compaction turns out to be too expensive most of the time and is not executed, as expected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8655 Test Plan: Add unit tests to the functionality. Reviewed By: ajkr Differential Revision: D31787034 fbshipit-source-id: ce813e63b15a61d5a56e97bf8902a1b28e011beb
2021-10-20 17:03:03 +00:00
// Add a tie breaker
if (i == 66) {
Add(3, 10000U, "6690", "6699", kFileSize, 0, 200, 251);
}
Add(4, 100 + i * 3 + 1, std::to_string(i * 100 + 30).c_str(),
std::to_string(i * 100 + 50).c_str(), kFileSize, 0, 200, 251);
Add(4, 100 + i * 3 + 2, std::to_string(i * 100 + 60).c_str(),
std::to_string(i * 100 + 110).c_str(), kFileSize, 0, 200, 251);
Incremental Space Amp Compactions in Universal Style (#8655) Summary: This commit introduces incremental compaction in univeral style for space amplification. This follows the first improvement mentioned in https://rocksdb.org/blog/2021/04/12/universal-improvements.html . The implemention simply picks up files about size of max_compaction_bytes to compact and execute if the penalty is not too big. More optimizations can be done in the future, e.g. prioritizing between this compaction and other types. But for now, the feature is supposed to be functional and can often reduce frequency of full compactions, although it can introduce penalty. In order to add cut files more efficiently so that more files from upper levels can be included, SST file cutting threshold (for current file + overlapping parent level files) is set to 1.5X of target file size. A 2MB target file size will generate files like this: https://gist.github.com/siying/29d2676fba417404f3c95e6c013c7de8 Number of files indeed increases but it is not out of control. Two set of write benchmarks are run: 1. For ingestion rate limited scenario, we can see full compaction is mostly eliminated: https://gist.github.com/siying/959bc1186066906831cf4c808d6e0a19 . The write amp increased from 7.7 to 9.4, as expected. After applying file cutting, the number is improved to 8.9. In another benchmark, the write amp is even better with the incremental approach: https://gist.github.com/siying/d1c16c286d7c59c4d7bba718ca198163 2. For ingestion rate unlimited scenario, incremental compaction turns out to be too expensive most of the time and is not executed, as expected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8655 Test Plan: Add unit tests to the functionality. Reviewed By: ajkr Differential Revision: D31787034 fbshipit-source-id: ce813e63b15a61d5a56e97bf8902a1b28e011beb
2021-10-20 17:03:03 +00:00
}
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
Incremental Space Amp Compactions in Universal Style (#8655) Summary: This commit introduces incremental compaction in univeral style for space amplification. This follows the first improvement mentioned in https://rocksdb.org/blog/2021/04/12/universal-improvements.html . The implemention simply picks up files about size of max_compaction_bytes to compact and execute if the penalty is not too big. More optimizations can be done in the future, e.g. prioritizing between this compaction and other types. But for now, the feature is supposed to be functional and can often reduce frequency of full compactions, although it can introduce penalty. In order to add cut files more efficiently so that more files from upper levels can be included, SST file cutting threshold (for current file + overlapping parent level files) is set to 1.5X of target file size. A 2MB target file size will generate files like this: https://gist.github.com/siying/29d2676fba417404f3c95e6c013c7de8 Number of files indeed increases but it is not out of control. Two set of write benchmarks are run: 1. For ingestion rate limited scenario, we can see full compaction is mostly eliminated: https://gist.github.com/siying/959bc1186066906831cf4c808d6e0a19 . The write amp increased from 7.7 to 9.4, as expected. After applying file cutting, the number is improved to 8.9. In another benchmark, the write amp is even better with the incremental approach: https://gist.github.com/siying/d1c16c286d7c59c4d7bba718ca198163 2. For ingestion rate unlimited scenario, incremental compaction turns out to be too expensive most of the time and is not executed, as expected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8655 Test Plan: Add unit tests to the functionality. Reviewed By: ajkr Differential Revision: D31787034 fbshipit-source-id: ce813e63b15a61d5a56e97bf8902a1b28e011beb
2021-10-20 17:03:03 +00:00
ASSERT_TRUE(compaction);
ASSERT_EQ(4, compaction->output_level());
ASSERT_EQ(3, compaction->start_level());
ASSERT_EQ(6U, compaction->num_input_files(0));
ASSERT_EQ(100 + 62U * 3, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(10000U, compaction->input(0, 5)->fd.GetNumber());
ASSERT_EQ(11, compaction->num_input_files(1));
}
TEST_F(CompactionPickerTest, UniversalIncrementalSpace5) {
// Test compaction candidates always cover many files with some single
// files larger than size threshold.
const uint64_t kFileSize = 100000;
mutable_cf_options_.max_compaction_bytes = 3200000;
mutable_cf_options_.compaction_options_universal.incremental = true;
mutable_cf_options_.compaction_options_universal
.max_size_amplification_percent = 30;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
NewVersionStorage(5, kCompactionStyleUniversal);
Add(0, 1U, "150", "200", kFileSize, 0, 500, 550);
Add(2, 2U, "010", "080", kFileSize, 0, 200, 251);
// Generate files like following:
// L3: (1101, 1180) (1201, 1280) ... (7901, 7908)
// L4: (1130, 1150) (1160, 1210) (1230, 1250) (1260 1310) ... (7960, 8010)
for (int i = 11; i < 70; i++) {
Add(3, 100 + i * 3, std::to_string(i * 100).c_str(),
std::to_string(i * 100 + 80).c_str(),
Incremental Space Amp Compactions in Universal Style (#8655) Summary: This commit introduces incremental compaction in univeral style for space amplification. This follows the first improvement mentioned in https://rocksdb.org/blog/2021/04/12/universal-improvements.html . The implemention simply picks up files about size of max_compaction_bytes to compact and execute if the penalty is not too big. More optimizations can be done in the future, e.g. prioritizing between this compaction and other types. But for now, the feature is supposed to be functional and can often reduce frequency of full compactions, although it can introduce penalty. In order to add cut files more efficiently so that more files from upper levels can be included, SST file cutting threshold (for current file + overlapping parent level files) is set to 1.5X of target file size. A 2MB target file size will generate files like this: https://gist.github.com/siying/29d2676fba417404f3c95e6c013c7de8 Number of files indeed increases but it is not out of control. Two set of write benchmarks are run: 1. For ingestion rate limited scenario, we can see full compaction is mostly eliminated: https://gist.github.com/siying/959bc1186066906831cf4c808d6e0a19 . The write amp increased from 7.7 to 9.4, as expected. After applying file cutting, the number is improved to 8.9. In another benchmark, the write amp is even better with the incremental approach: https://gist.github.com/siying/d1c16c286d7c59c4d7bba718ca198163 2. For ingestion rate unlimited scenario, incremental compaction turns out to be too expensive most of the time and is not executed, as expected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8655 Test Plan: Add unit tests to the functionality. Reviewed By: ajkr Differential Revision: D31787034 fbshipit-source-id: ce813e63b15a61d5a56e97bf8902a1b28e011beb
2021-10-20 17:03:03 +00:00
i % 10 == 9 ? kFileSize * 100 : kFileSize, 0, 200, 251);
Add(4, 100 + i * 3 + 1, std::to_string(i * 100 + 30).c_str(),
std::to_string(i * 100 + 50).c_str(), kFileSize, 0, 200, 251);
Add(4, 100 + i * 3 + 2, std::to_string(i * 100 + 60).c_str(),
std::to_string(i * 100 + 110).c_str(), kFileSize, 0, 200, 251);
Incremental Space Amp Compactions in Universal Style (#8655) Summary: This commit introduces incremental compaction in univeral style for space amplification. This follows the first improvement mentioned in https://rocksdb.org/blog/2021/04/12/universal-improvements.html . The implemention simply picks up files about size of max_compaction_bytes to compact and execute if the penalty is not too big. More optimizations can be done in the future, e.g. prioritizing between this compaction and other types. But for now, the feature is supposed to be functional and can often reduce frequency of full compactions, although it can introduce penalty. In order to add cut files more efficiently so that more files from upper levels can be included, SST file cutting threshold (for current file + overlapping parent level files) is set to 1.5X of target file size. A 2MB target file size will generate files like this: https://gist.github.com/siying/29d2676fba417404f3c95e6c013c7de8 Number of files indeed increases but it is not out of control. Two set of write benchmarks are run: 1. For ingestion rate limited scenario, we can see full compaction is mostly eliminated: https://gist.github.com/siying/959bc1186066906831cf4c808d6e0a19 . The write amp increased from 7.7 to 9.4, as expected. After applying file cutting, the number is improved to 8.9. In another benchmark, the write amp is even better with the incremental approach: https://gist.github.com/siying/d1c16c286d7c59c4d7bba718ca198163 2. For ingestion rate unlimited scenario, incremental compaction turns out to be too expensive most of the time and is not executed, as expected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8655 Test Plan: Add unit tests to the functionality. Reviewed By: ajkr Differential Revision: D31787034 fbshipit-source-id: ce813e63b15a61d5a56e97bf8902a1b28e011beb
2021-10-20 17:03:03 +00:00
}
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
Incremental Space Amp Compactions in Universal Style (#8655) Summary: This commit introduces incremental compaction in univeral style for space amplification. This follows the first improvement mentioned in https://rocksdb.org/blog/2021/04/12/universal-improvements.html . The implemention simply picks up files about size of max_compaction_bytes to compact and execute if the penalty is not too big. More optimizations can be done in the future, e.g. prioritizing between this compaction and other types. But for now, the feature is supposed to be functional and can often reduce frequency of full compactions, although it can introduce penalty. In order to add cut files more efficiently so that more files from upper levels can be included, SST file cutting threshold (for current file + overlapping parent level files) is set to 1.5X of target file size. A 2MB target file size will generate files like this: https://gist.github.com/siying/29d2676fba417404f3c95e6c013c7de8 Number of files indeed increases but it is not out of control. Two set of write benchmarks are run: 1. For ingestion rate limited scenario, we can see full compaction is mostly eliminated: https://gist.github.com/siying/959bc1186066906831cf4c808d6e0a19 . The write amp increased from 7.7 to 9.4, as expected. After applying file cutting, the number is improved to 8.9. In another benchmark, the write amp is even better with the incremental approach: https://gist.github.com/siying/d1c16c286d7c59c4d7bba718ca198163 2. For ingestion rate unlimited scenario, incremental compaction turns out to be too expensive most of the time and is not executed, as expected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8655 Test Plan: Add unit tests to the functionality. Reviewed By: ajkr Differential Revision: D31787034 fbshipit-source-id: ce813e63b15a61d5a56e97bf8902a1b28e011beb
2021-10-20 17:03:03 +00:00
ASSERT_TRUE(compaction);
ASSERT_EQ(4, compaction->output_level());
ASSERT_EQ(3, compaction->start_level());
ASSERT_EQ(6U, compaction->num_input_files(0));
ASSERT_EQ(100 + 14 * 3, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(100 + 19 * 3, compaction->input(0, 5)->fd.GetNumber());
ASSERT_EQ(13, compaction->num_input_files(1));
}
rocksdb: switch to gtest Summary: Our existing test notation is very similar to what is used in gtest. It makes it easy to adopt what is different. In this diff I modify existing [[ https://code.google.com/p/googletest/wiki/Primer#Test_Fixtures:_Using_the_Same_Data_Configuration_for_Multiple_Te | test fixture ]] classes to inherit from `testing::Test`. Also for unit tests that use fixture class, `TEST` is replaced with `TEST_F` as required in gtest. There are several custom `main` functions in our existing tests. To make this transition easier, I modify all `main` functions to fallow gtest notation. But eventually we can remove them and use implementation of `main` that gtest provides. ```lang=bash % cat ~/transform #!/bin/sh files=$(git ls-files '*test\.cc') for file in $files do if grep -q "rocksdb::test::RunAllTests()" $file then if grep -Eq '^class \w+Test {' $file then perl -pi -e 's/^(class \w+Test) {/${1}: public testing::Test {/g' $file perl -pi -e 's/^(TEST)/${1}_F/g' $file fi perl -pi -e 's/(int main.*\{)/${1}::testing::InitGoogleTest(&argc, argv);/g' $file perl -pi -e 's/rocksdb::test::RunAllTests/RUN_ALL_TESTS/g' $file fi done % sh ~/transform % make format ``` Second iteration of this diff contains only scripted changes. Third iteration contains manual changes to fix last errors and make it compilable. Test Plan: Build and notice no errors. ```lang=bash % USE_CLANG=1 make check -j55 ``` Tests are still testing. Reviewers: meyering, sdong, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35157
2015-03-17 21:08:00 +00:00
TEST_F(CompactionPickerTest, NeedsCompactionFIFO) {
NewVersionStorage(1, kCompactionStyleFIFO);
const int kFileCount =
mutable_cf_options_.level0_file_num_compaction_trigger * 3;
const uint64_t kFileSize = 100000;
const uint64_t kMaxSize = kFileSize * kFileCount / 2;
fifo_options_.max_table_files_size = kMaxSize;
mutable_cf_options_.compaction_options_fifo = fifo_options_;
FIFOCompactionPicker fifo_compaction_picker(ioptions_, &icmp_);
UpdateVersionStorageInfo();
// must return false when there's no files.
ASSERT_EQ(fifo_compaction_picker.NeedsCompaction(vstorage_.get()), false);
// verify whether compaction is needed based on the current
// size of L0 files.
for (int i = 1; i <= kFileCount; ++i) {
NewVersionStorage(1, kCompactionStyleFIFO);
Add(0, i, std::to_string((i + 100) * 1000).c_str(),
std::to_string((i + 100) * 1000 + 999).c_str(), kFileSize, 0, i * 100,
i * 100 + 99);
UpdateVersionStorageInfo();
ASSERT_EQ(fifo_compaction_picker.NeedsCompaction(vstorage_.get()),
vstorage_->CompactionScore(0) >= 1);
}
}
TEST_F(CompactionPickerTest, FIFOToWarm1) {
NewVersionStorage(1, kCompactionStyleFIFO);
const uint64_t kFileSize = 100000;
const uint64_t kMaxSize = kFileSize * 100000;
uint64_t kWarmThreshold = 2000;
fifo_options_.max_table_files_size = kMaxSize;
fifo_options_.age_for_warm = kWarmThreshold;
mutable_cf_options_.compaction_options_fifo = fifo_options_;
mutable_cf_options_.level0_file_num_compaction_trigger = 2;
mutable_cf_options_.max_compaction_bytes = kFileSize * 100;
FIFOCompactionPicker fifo_compaction_picker(ioptions_, &icmp_);
int64_t current_time = 0;
ASSERT_OK(Env::Default()->GetCurrentTime(&current_time));
uint64_t threshold_time =
static_cast<uint64_t>(current_time) - kWarmThreshold;
Add(0, 6U, "240", "290", 2 * kFileSize, 0, 2900, 3000, 0, true,
Temperature::kUnknown, static_cast<uint64_t>(current_time) - 100);
Add(0, 5U, "240", "290", 2 * kFileSize, 0, 2700, 2800, 0, true,
Temperature::kUnknown, threshold_time + 100);
Add(0, 4U, "260", "300", 1 * kFileSize, 0, 2500, 2600, 0, true,
Temperature::kUnknown, threshold_time - 2000);
Add(0, 3U, "200", "300", 4 * kFileSize, 0, 2300, 2400, 0, true,
Temperature::kUnknown, threshold_time - 3000);
UpdateVersionStorageInfo();
ASSERT_EQ(fifo_compaction_picker.NeedsCompaction(vstorage_.get()), true);
std::unique_ptr<Compaction> compaction(fifo_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(1U, compaction->num_input_files(0));
ASSERT_EQ(3U, compaction->input(0, 0)->fd.GetNumber());
}
TEST_F(CompactionPickerTest, FIFOToWarm2) {
NewVersionStorage(1, kCompactionStyleFIFO);
const uint64_t kFileSize = 100000;
const uint64_t kMaxSize = kFileSize * 100000;
uint64_t kWarmThreshold = 2000;
fifo_options_.max_table_files_size = kMaxSize;
fifo_options_.age_for_warm = kWarmThreshold;
mutable_cf_options_.compaction_options_fifo = fifo_options_;
mutable_cf_options_.level0_file_num_compaction_trigger = 2;
mutable_cf_options_.max_compaction_bytes = kFileSize * 100;
FIFOCompactionPicker fifo_compaction_picker(ioptions_, &icmp_);
int64_t current_time = 0;
ASSERT_OK(Env::Default()->GetCurrentTime(&current_time));
uint64_t threshold_time =
static_cast<uint64_t>(current_time) - kWarmThreshold;
Add(0, 6U, "240", "290", 2 * kFileSize, 0, 2900, 3000, 0, true,
Temperature::kUnknown, static_cast<uint64_t>(current_time) - 100);
Add(0, 5U, "240", "290", 2 * kFileSize, 0, 2700, 2800, 0, true,
Temperature::kUnknown, threshold_time + 100);
Add(0, 4U, "260", "300", 1 * kFileSize, 0, 2500, 2600, 0, true,
Temperature::kUnknown, threshold_time - 2000);
Add(0, 3U, "200", "300", 4 * kFileSize, 0, 2300, 2400, 0, true,
Temperature::kUnknown, threshold_time - 3000);
Add(0, 2U, "200", "300", 4 * kFileSize, 0, 2100, 2200, 0, true,
Temperature::kUnknown, threshold_time - 4000);
UpdateVersionStorageInfo();
ASSERT_EQ(fifo_compaction_picker.NeedsCompaction(vstorage_.get()), true);
std::unique_ptr<Compaction> compaction(fifo_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(2U, compaction->num_input_files(0));
ASSERT_EQ(2U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(3U, compaction->input(0, 1)->fd.GetNumber());
}
TEST_F(CompactionPickerTest, FIFOToWarmMaxSize) {
NewVersionStorage(1, kCompactionStyleFIFO);
const uint64_t kFileSize = 100000;
const uint64_t kMaxSize = kFileSize * 100000;
uint64_t kWarmThreshold = 2000;
fifo_options_.max_table_files_size = kMaxSize;
fifo_options_.age_for_warm = kWarmThreshold;
mutable_cf_options_.compaction_options_fifo = fifo_options_;
mutable_cf_options_.level0_file_num_compaction_trigger = 2;
mutable_cf_options_.max_compaction_bytes = kFileSize * 9;
FIFOCompactionPicker fifo_compaction_picker(ioptions_, &icmp_);
int64_t current_time = 0;
ASSERT_OK(Env::Default()->GetCurrentTime(&current_time));
uint64_t threshold_time =
static_cast<uint64_t>(current_time) - kWarmThreshold;
Add(0, 6U, "240", "290", 2 * kFileSize, 0, 2900, 3000, 0, true,
Temperature::kUnknown, static_cast<uint64_t>(current_time) - 100);
Add(0, 5U, "240", "290", 2 * kFileSize, 0, 2700, 2800, 0, true,
Temperature::kUnknown, threshold_time + 100);
Add(0, 4U, "260", "300", 1 * kFileSize, 0, 2500, 2600, 0, true,
Temperature::kUnknown, threshold_time - 2000);
Add(0, 3U, "200", "300", 4 * kFileSize, 0, 2300, 2400, 0, true,
Temperature::kUnknown, threshold_time - 3000);
Add(0, 2U, "200", "300", 4 * kFileSize, 0, 2100, 2200, 0, true,
Temperature::kUnknown, threshold_time - 4000);
Add(0, 1U, "200", "300", 4 * kFileSize, 0, 2000, 2100, 0, true,
Temperature::kUnknown, threshold_time - 5000);
UpdateVersionStorageInfo();
ASSERT_EQ(fifo_compaction_picker.NeedsCompaction(vstorage_.get()), true);
std::unique_ptr<Compaction> compaction(fifo_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(2U, compaction->num_input_files(0));
ASSERT_EQ(1U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(2U, compaction->input(0, 1)->fd.GetNumber());
}
TEST_F(CompactionPickerTest, FIFOToWarmWithExistingWarm) {
NewVersionStorage(1, kCompactionStyleFIFO);
const uint64_t kFileSize = 100000;
const uint64_t kMaxSize = kFileSize * 100000;
uint64_t kWarmThreshold = 2000;
fifo_options_.max_table_files_size = kMaxSize;
fifo_options_.age_for_warm = kWarmThreshold;
mutable_cf_options_.compaction_options_fifo = fifo_options_;
mutable_cf_options_.level0_file_num_compaction_trigger = 2;
mutable_cf_options_.max_compaction_bytes = kFileSize * 100;
FIFOCompactionPicker fifo_compaction_picker(ioptions_, &icmp_);
int64_t current_time = 0;
ASSERT_OK(Env::Default()->GetCurrentTime(&current_time));
uint64_t threshold_time =
static_cast<uint64_t>(current_time) - kWarmThreshold;
Add(0, 6U, "240", "290", 2 * kFileSize, 0, 2900, 3000, 0, true,
Temperature::kUnknown, static_cast<uint64_t>(current_time) - 100);
Add(0, 5U, "240", "290", 2 * kFileSize, 0, 2700, 2800, 0, true,
Temperature::kUnknown, threshold_time + 100);
Add(0, 4U, "260", "300", 1 * kFileSize, 0, 2500, 2600, 0, true,
Temperature::kUnknown, threshold_time - 2000);
Add(0, 3U, "200", "300", 4 * kFileSize, 0, 2300, 2400, 0, true,
Temperature::kUnknown, threshold_time - 3000);
Add(0, 2U, "200", "300", 4 * kFileSize, 0, 2100, 2200, 0, true,
Temperature::kUnknown, threshold_time - 4000);
Add(0, 1U, "200", "300", 4 * kFileSize, 0, 2000, 2100, 0, true,
Temperature::kWarm, threshold_time - 5000);
UpdateVersionStorageInfo();
ASSERT_EQ(fifo_compaction_picker.NeedsCompaction(vstorage_.get()), true);
std::unique_ptr<Compaction> compaction(fifo_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(2U, compaction->num_input_files(0));
ASSERT_EQ(2U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(3U, compaction->input(0, 1)->fd.GetNumber());
}
TEST_F(CompactionPickerTest, FIFOToWarmWithOngoing) {
NewVersionStorage(1, kCompactionStyleFIFO);
const uint64_t kFileSize = 100000;
const uint64_t kMaxSize = kFileSize * 100000;
uint64_t kWarmThreshold = 2000;
fifo_options_.max_table_files_size = kMaxSize;
fifo_options_.age_for_warm = kWarmThreshold;
mutable_cf_options_.compaction_options_fifo = fifo_options_;
mutable_cf_options_.level0_file_num_compaction_trigger = 2;
mutable_cf_options_.max_compaction_bytes = kFileSize * 100;
FIFOCompactionPicker fifo_compaction_picker(ioptions_, &icmp_);
int64_t current_time = 0;
ASSERT_OK(Env::Default()->GetCurrentTime(&current_time));
uint64_t threshold_time =
static_cast<uint64_t>(current_time) - kWarmThreshold;
Add(0, 6U, "240", "290", 2 * kFileSize, 0, 2900, 3000, 0, true,
Temperature::kUnknown, static_cast<uint64_t>(current_time) - 100);
Add(0, 5U, "240", "290", 2 * kFileSize, 0, 2700, 2800, 0, true,
Temperature::kUnknown, threshold_time + 100);
Add(0, 4U, "260", "300", 1 * kFileSize, 0, 2500, 2600, 0, true,
Temperature::kUnknown, threshold_time - 2000);
Add(0, 3U, "200", "300", 4 * kFileSize, 0, 2300, 2400, 0, true,
Temperature::kUnknown, threshold_time - 3000);
Add(0, 2U, "200", "300", 4 * kFileSize, 0, 2100, 2200, 0, true,
Temperature::kUnknown, threshold_time - 4000);
Add(0, 1U, "200", "300", 4 * kFileSize, 0, 2000, 2100, 0, true,
Temperature::kWarm, threshold_time - 5000);
file_map_[2].first->being_compacted = true;
UpdateVersionStorageInfo();
ASSERT_EQ(fifo_compaction_picker.NeedsCompaction(vstorage_.get()), true);
std::unique_ptr<Compaction> compaction(fifo_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
// Stop if a file is being compacted
ASSERT_TRUE(compaction.get() == nullptr);
}
TEST_F(CompactionPickerTest, FIFOToWarmWithHotBetweenWarms) {
NewVersionStorage(1, kCompactionStyleFIFO);
const uint64_t kFileSize = 100000;
const uint64_t kMaxSize = kFileSize * 100000;
uint64_t kWarmThreshold = 2000;
fifo_options_.max_table_files_size = kMaxSize;
fifo_options_.age_for_warm = kWarmThreshold;
mutable_cf_options_.compaction_options_fifo = fifo_options_;
mutable_cf_options_.level0_file_num_compaction_trigger = 2;
mutable_cf_options_.max_compaction_bytes = kFileSize * 100;
FIFOCompactionPicker fifo_compaction_picker(ioptions_, &icmp_);
int64_t current_time = 0;
ASSERT_OK(Env::Default()->GetCurrentTime(&current_time));
uint64_t threshold_time =
static_cast<uint64_t>(current_time) - kWarmThreshold;
Add(0, 6U, "240", "290", 2 * kFileSize, 0, 2900, 3000, 0, true,
Temperature::kUnknown, static_cast<uint64_t>(current_time) - 100);
Add(0, 5U, "240", "290", 2 * kFileSize, 0, 2700, 2800, 0, true,
Temperature::kUnknown, threshold_time + 100);
Add(0, 4U, "260", "300", 1 * kFileSize, 0, 2500, 2600, 0, true,
Temperature::kUnknown, threshold_time - 2000);
Add(0, 3U, "200", "300", 4 * kFileSize, 0, 2300, 2400, 0, true,
Temperature::kWarm, threshold_time - 3000);
Add(0, 2U, "200", "300", 4 * kFileSize, 0, 2100, 2200, 0, true,
Temperature::kUnknown, threshold_time - 4000);
Add(0, 1U, "200", "300", 4 * kFileSize, 0, 2000, 2100, 0, true,
Temperature::kWarm, threshold_time - 5000);
UpdateVersionStorageInfo();
ASSERT_EQ(fifo_compaction_picker.NeedsCompaction(vstorage_.get()), true);
std::unique_ptr<Compaction> compaction(fifo_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
// Stop if a file is being compacted
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(1U, compaction->num_input_files(0));
ASSERT_EQ(2U, compaction->input(0, 0)->fd.GetNumber());
}
#endif // ROCKSDB_LITE
TEST_F(CompactionPickerTest, CompactionPriMinOverlapping1) {
NewVersionStorage(6, kCompactionStyleLevel);
ioptions_.compaction_pri = kMinOverlappingRatio;
mutable_cf_options_.target_file_size_base = 100000000000;
mutable_cf_options_.target_file_size_multiplier = 10;
mutable_cf_options_.max_bytes_for_level_base = 10 * 1024 * 1024;
mutable_cf_options_.RefreshDerivedOptions(ioptions_);
Add(2, 6U, "150", "179", 50000000U);
Add(2, 7U, "180", "220", 50000000U);
Add(2, 8U, "321", "400", 50000000U); // File not overlapping
Add(2, 9U, "721", "800", 50000000U);
Add(3, 26U, "150", "170", 260000000U);
Add(3, 27U, "171", "179", 260000000U);
Add(3, 28U, "191", "220", 260000000U);
Add(3, 29U, "221", "300", 260000000U);
Add(3, 30U, "750", "900", 260000000U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(1U, compaction->num_input_files(0));
// Pick file 8 because it overlaps with 0 files on level 3.
ASSERT_EQ(8U, compaction->input(0, 0)->fd.GetNumber());
// Compaction input size * 1.1
ASSERT_GE(uint64_t{55000000}, compaction->OutputFilePreallocationSize());
}
TEST_F(CompactionPickerTest, CompactionPriMinOverlapping2) {
NewVersionStorage(6, kCompactionStyleLevel);
ioptions_.compaction_pri = kMinOverlappingRatio;
mutable_cf_options_.target_file_size_base = 10000000;
mutable_cf_options_.target_file_size_multiplier = 10;
mutable_cf_options_.max_bytes_for_level_base = 10 * 1024 * 1024;
Add(2, 6U, "150", "175",
60000000U); // Overlaps with file 26, 27, total size 521M
Add(2, 7U, "176", "200", 60000000U); // Overlaps with file 27, 28, total size
// 520M, the smallest overlapping
Add(2, 8U, "201", "300",
60000000U); // Overlaps with file 28, 29, total size 521M
Add(3, 25U, "100", "110", 261000000U);
Add(3, 26U, "150", "170", 261000000U);
Add(3, 27U, "171", "179", 260000000U);
Add(3, 28U, "191", "220", 260000000U);
Add(3, 29U, "221", "300", 261000000U);
Add(3, 30U, "321", "400", 261000000U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(1U, compaction->num_input_files(0));
// Picking file 7 because overlapping ratio is the biggest.
ASSERT_EQ(7U, compaction->input(0, 0)->fd.GetNumber());
}
TEST_F(CompactionPickerTest, CompactionPriMinOverlapping3) {
NewVersionStorage(6, kCompactionStyleLevel);
ioptions_.compaction_pri = kMinOverlappingRatio;
mutable_cf_options_.max_bytes_for_level_base = 10000000;
mutable_cf_options_.max_bytes_for_level_multiplier = 10;
// file 7 and 8 over lap with the same file, but file 8 is smaller so
// it will be picked.
Add(2, 6U, "150", "167", 60000000U); // Overlaps with file 26, 27
Add(2, 7U, "168", "169", 60000000U); // Overlaps with file 27
Add(2, 8U, "201", "300", 61000000U); // Overlaps with file 28, but the file
// itself is larger. Should be picked.
Add(3, 26U, "160", "165", 260000000U);
Add(3, 27U, "166", "170", 260000000U);
Add(3, 28U, "180", "400", 260000000U);
Add(3, 29U, "401", "500", 260000000U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(1U, compaction->num_input_files(0));
// Picking file 8 because overlapping ratio is the biggest.
ASSERT_EQ(8U, compaction->input(0, 0)->fd.GetNumber());
}
TEST_F(CompactionPickerTest, CompactionPriMinOverlapping4) {
NewVersionStorage(6, kCompactionStyleLevel);
ioptions_.compaction_pri = kMinOverlappingRatio;
mutable_cf_options_.max_bytes_for_level_base = 10000000;
mutable_cf_options_.max_bytes_for_level_multiplier = 10;
Ignore max_compaction_bytes for compaction input that are within output key-range (#10835) Summary: When picking compaction input files, we sometimes stop picking a file that is fully included in the output key-range due to hitting max_compaction_bytes. Including these input files can potentially reduce WA at the expense of larger compactions. Larger compaction should be fine as files from input level are usually 10X smaller than files from output level. This PR adds a mutable CF option `ignore_max_compaction_bytes_for_input` that is enabled by default. We can remove this option once we are sure it is safe. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10835 Test Plan: - CI, a unit test on max_compaction_bytes fails before turning this flag off. - Benchmark does not show much difference in WA: `./db_bench --benchmarks=fillrandom,waitforcompaction,stats,levelstats -max_background_jobs=12 -num=2000000000 -target_file_size_base=33554432 --write_buffer_size=33554432` ``` main: ** Compaction Stats [default] ** Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 3/0 91.59 MB 0.8 70.9 0.0 70.9 200.8 129.9 0.0 1.5 25.2 71.2 2886.55 2463.45 9725 0.297 1093M 254K 0.0 0.0 L1 9/0 248.03 MB 1.0 392.0 129.8 262.2 391.7 129.5 0.0 3.0 69.0 68.9 5821.71 5536.90 804 7.241 6029M 5814K 0.0 0.0 L2 87/0 2.50 GB 1.0 537.0 128.5 408.5 533.8 125.2 0.7 4.2 69.5 69.1 7912.24 7323.70 4417 1.791 8299M 36M 0.0 0.0 L3 836/0 24.99 GB 1.0 616.9 118.3 498.7 594.5 95.8 5.2 5.0 66.9 64.5 9442.38 8490.28 4204 2.246 9749M 306M 0.0 0.0 L4 2355/0 62.95 GB 0.3 67.3 37.1 30.2 54.2 24.0 38.9 1.5 72.2 58.2 954.37 821.18 917 1.041 1076M 173M 0.0 0.0 Sum 3290/0 90.77 GB 0.0 1684.2 413.7 1270.5 1775.0 504.5 44.9 13.7 63.8 67.3 27017.25 24635.52 20067 1.346 26G 522M 0.0 0.0 Cumulative compaction: 1774.96 GB write, 154.29 MB/s write, 1684.19 GB read, 146.40 MB/s read, 27017.3 seconds This PR: ** Compaction Stats [default] ** Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 3/0 45.71 MB 0.8 72.9 0.0 72.9 202.8 129.9 0.0 1.6 25.4 70.7 2938.16 2510.36 9741 0.302 1124M 265K 0.0 0.0 L1 8/0 234.54 MB 0.9 384.5 129.8 254.7 384.2 129.6 0.0 3.0 69.0 68.9 5708.08 5424.43 791 7.216 5913M 5753K 0.0 0.0 L2 84/0 2.47 GB 1.0 543.1 128.6 414.5 539.9 125.4 0.7 4.2 69.6 69.2 7989.31 7403.13 4418 1.808 8393M 36M 0.0 0.0 L3 839/0 24.96 GB 1.0 615.6 118.4 497.2 593.2 96.0 5.1 5.0 66.6 64.1 9471.23 8489.31 4193 2.259 9726M 306M 0.0 0.0 L4 2360/0 63.04 GB 0.3 67.6 37.3 30.3 54.4 24.1 38.9 1.5 71.5 57.6 967.30 827.99 907 1.066 1080M 173M 0.0 0.0 Sum 3294/0 90.75 GB 0.0 1683.8 414.2 1269.6 1774.5 504.9 44.8 13.7 63.7 67.1 27074.08 24655.22 20050 1.350 26G 522M 0.0 0.0 Cumulative compaction: 1774.52 GB write, 157.09 MB/s write, 1683.77 GB read, 149.06 MB/s read, 27074.1 seconds ``` Reviewed By: ajkr Differential Revision: D40518319 Pulled By: cbi42 fbshipit-source-id: f4ea614bc0ebefe007ffaf05bb9aec9a8ca25b60
2022-10-21 17:22:41 +00:00
mutable_cf_options_.ignore_max_compaction_bytes_for_input = false;
// file 7 and 8 over lap with the same file, but file 8 is smaller so
// it will be picked.
// Overlaps with file 26, 27. And the file is compensated so will be
// picked up.
Add(2, 6U, "150", "167", 60000000U, 0, 100, 100, 180000000U);
Add(2, 7U, "168", "169", 60000000U); // Overlaps with file 27
Add(2, 8U, "201", "300", 61000000U); // Overlaps with file 28
Add(3, 26U, "160", "165", 60000000U);
// Boosted file size in output level is not considered.
Add(3, 27U, "166", "170", 60000000U, 0, 100, 100, 260000000U);
Add(3, 28U, "180", "400", 60000000U);
Add(3, 29U, "401", "500", 60000000U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(1U, compaction->num_input_files(0));
// Picking file 8 because overlapping ratio is the biggest.
ASSERT_EQ(6U, compaction->input(0, 0)->fd.GetNumber());
}
Add basic kRoundRobin compaction policy (#10107) Summary: Add `kRoundRobin` as a compaction priority. The implementation is as follows. - Define a cursor as the smallest Internal key in the successor of the selected file. Add `vector<InternalKey> compact_cursor_` into `VersionStorageInfo` where each element (`InternalKey`) in `compact_cursor_` represents a cursor. In round-robin compaction policy, we just need to select the first file (assuming files are sorted) and also has the smallest InternalKey larger than/equal to the cursor. After a file is chosen, we create a new `Fsize` vector which puts the selected file is placed at the first position in `temp`, the next cursor is then updated as the smallest InternalKey in successor of the selected file (the above logic is implemented in `SortFileByRoundRobin`). - After a compaction succeeds, typically `InstallCompactionResults()`, we choose the next cursor for the input level and save it to `edit`. When calling `LogAndApply`, we save the next cursor with its level into some local variable and finally apply the change to `vstorage` in `SaveTo` function. - Cursors are persist pair by pair (<level, InternalKey>) in `EncodeTo` so that they can be reconstructed when reopening. An empty cursor will not be encoded to MANIFEST Pull Request resolved: https://github.com/facebook/rocksdb/pull/10107 Test Plan: add unit test (`CompactionPriRoundRobin`) in `compaction_picker_test`, add `kRoundRobin` priority in `CompactionPriTest` from `db_compaction_test`, and add `PersistRoundRobinCompactCursor` in `db_compaction_test` Reviewed By: ajkr Differential Revision: D37316037 Pulled By: littlepig2013 fbshipit-source-id: 9f481748190ace416079139044e00df2968fb1ee
2022-06-21 18:56:53 +00:00
TEST_F(CompactionPickerTest, CompactionPriRoundRobin) {
std::vector<InternalKey> test_cursors = {InternalKey("249", 100, kTypeValue),
InternalKey("600", 100, kTypeValue),
InternalKey()};
std::vector<uint32_t> selected_files = {8U, 6U, 6U};
ioptions_.compaction_pri = kRoundRobin;
Support subcmpct using reserved resources for round-robin priority (#10341) Summary: Earlier implementation of round-robin priority can only pick one file at a time and disallows parallel compactions within the same level. In this PR, round-robin compaction policy will expand towards more input files with respecting some additional constraints, which are summarized as follows: * Constraint 1: We can only pick consecutive files - Constraint 1a: When a file is being compacted (or some input files are being compacted after expanding), we cannot choose it and have to stop choosing more files - Constraint 1b: When we reach the last file (with the largest keys), we cannot choose more files (the next file will be the first one with small keys) * Constraint 2: We should ensure the total compaction bytes (including the overlapped files from the next level) is no more than `mutable_cf_options_.max_compaction_bytes` * Constraint 3: We try our best to pick as many files as possible so that the post-compaction level size can be just less than `MaxBytesForLevel(start_level_)` * Constraint 4: If trivial move is allowed, we reuse the logic of `TryNonL0TrivialMove()` instead of expanding files with Constraint 3 More details can be found in `LevelCompactionBuilder::SetupOtherFilesWithRoundRobinExpansion()`. The above optimization accelerates the process of moving the compaction cursor, in which the write-amp can be further reduced. While a large compaction may lead to high write stall, we break this large compaction into several subcompactions **regardless of** the `max_subcompactions` limit. The number of subcompactions for round-robin compaction priority is determined through the following steps: * Step 1: Initialized against `max_output_file_limit`, the number of input files in the start level, and also the range size limit `ranges.size()` * Step 2: Call `AcquireSubcompactionResources()`when max subcompactions is not sufficient, but we may or may not obtain desired resources, additional number of resources is stored in `extra_num_subcompaction_threads_reserved_`). Subcompaction limit is changed and update `num_planned_subcompactions` with `GetSubcompactionLimit()` * Step 3: Call `ShrinkSubcompactionResources()` to ensure extra resources can be released (extra resources may exist for round-robin compaction when the number of actual number of subcompactions is less than the number of planned subcompactions) More details can be found in `CompactionJob::AcquireSubcompactionResources()`,`CompactionJob::ShrinkSubcompactionResources()`, and `CompactionJob::ReleaseSubcompactionResources()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10341 Test Plan: Add `CompactionPriMultipleFilesRoundRobin[1-3]` unit test in `compaction_picker_test.cc` and `RoundRobinSubcompactionsAgainstResources.SubcompactionsUsingResources/[0-4]`, `RoundRobinSubcompactionsAgainstPressureToken.PressureTokenTest/[0-1]` in `db_compaction_test.cc` Reviewed By: ajkr, hx235 Differential Revision: D37792644 Pulled By: littlepig2013 fbshipit-source-id: 7fecb7c4ffd97b34bbf6e3b760b2c35a772a0657
2022-07-24 18:12:44 +00:00
mutable_cf_options_.max_bytes_for_level_base = 12000000;
Add basic kRoundRobin compaction policy (#10107) Summary: Add `kRoundRobin` as a compaction priority. The implementation is as follows. - Define a cursor as the smallest Internal key in the successor of the selected file. Add `vector<InternalKey> compact_cursor_` into `VersionStorageInfo` where each element (`InternalKey`) in `compact_cursor_` represents a cursor. In round-robin compaction policy, we just need to select the first file (assuming files are sorted) and also has the smallest InternalKey larger than/equal to the cursor. After a file is chosen, we create a new `Fsize` vector which puts the selected file is placed at the first position in `temp`, the next cursor is then updated as the smallest InternalKey in successor of the selected file (the above logic is implemented in `SortFileByRoundRobin`). - After a compaction succeeds, typically `InstallCompactionResults()`, we choose the next cursor for the input level and save it to `edit`. When calling `LogAndApply`, we save the next cursor with its level into some local variable and finally apply the change to `vstorage` in `SaveTo` function. - Cursors are persist pair by pair (<level, InternalKey>) in `EncodeTo` so that they can be reconstructed when reopening. An empty cursor will not be encoded to MANIFEST Pull Request resolved: https://github.com/facebook/rocksdb/pull/10107 Test Plan: add unit test (`CompactionPriRoundRobin`) in `compaction_picker_test`, add `kRoundRobin` priority in `CompactionPriTest` from `db_compaction_test`, and add `PersistRoundRobinCompactCursor` in `db_compaction_test` Reviewed By: ajkr Differential Revision: D37316037 Pulled By: littlepig2013 fbshipit-source-id: 9f481748190ace416079139044e00df2968fb1ee
2022-06-21 18:56:53 +00:00
mutable_cf_options_.max_bytes_for_level_multiplier = 10;
for (size_t i = 0; i < test_cursors.size(); i++) {
// start a brand new version in each test.
NewVersionStorage(6, kCompactionStyleLevel);
vstorage_->ResizeCompactCursors(6);
// Set the cursor
vstorage_->AddCursorForOneLevel(2, test_cursors[i]);
Add(2, 6U, "150", "199", 50000000U); // Overlap with 26U, 27U
Add(2, 7U, "200", "249", 50000000U); // File not overlapping
Add(2, 8U, "300", "600", 50000000U); // Overlap with 28U, 29U
Add(3, 26U, "130", "165", 60000000U);
Add(3, 27U, "166", "170", 60000000U);
Add(3, 28U, "270", "340", 60000000U);
Add(3, 29U, "401", "500", 60000000U);
UpdateVersionStorageInfo();
LevelCompactionPicker local_level_compaction_picker =
LevelCompactionPicker(ioptions_, &icmp_);
std::unique_ptr<Compaction> compaction(
local_level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
Add basic kRoundRobin compaction policy (#10107) Summary: Add `kRoundRobin` as a compaction priority. The implementation is as follows. - Define a cursor as the smallest Internal key in the successor of the selected file. Add `vector<InternalKey> compact_cursor_` into `VersionStorageInfo` where each element (`InternalKey`) in `compact_cursor_` represents a cursor. In round-robin compaction policy, we just need to select the first file (assuming files are sorted) and also has the smallest InternalKey larger than/equal to the cursor. After a file is chosen, we create a new `Fsize` vector which puts the selected file is placed at the first position in `temp`, the next cursor is then updated as the smallest InternalKey in successor of the selected file (the above logic is implemented in `SortFileByRoundRobin`). - After a compaction succeeds, typically `InstallCompactionResults()`, we choose the next cursor for the input level and save it to `edit`. When calling `LogAndApply`, we save the next cursor with its level into some local variable and finally apply the change to `vstorage` in `SaveTo` function. - Cursors are persist pair by pair (<level, InternalKey>) in `EncodeTo` so that they can be reconstructed when reopening. An empty cursor will not be encoded to MANIFEST Pull Request resolved: https://github.com/facebook/rocksdb/pull/10107 Test Plan: add unit test (`CompactionPriRoundRobin`) in `compaction_picker_test`, add `kRoundRobin` priority in `CompactionPriTest` from `db_compaction_test`, and add `PersistRoundRobinCompactCursor` in `db_compaction_test` Reviewed By: ajkr Differential Revision: D37316037 Pulled By: littlepig2013 fbshipit-source-id: 9f481748190ace416079139044e00df2968fb1ee
2022-06-21 18:56:53 +00:00
ASSERT_TRUE(compaction.get() != nullptr);
Support subcmpct using reserved resources for round-robin priority (#10341) Summary: Earlier implementation of round-robin priority can only pick one file at a time and disallows parallel compactions within the same level. In this PR, round-robin compaction policy will expand towards more input files with respecting some additional constraints, which are summarized as follows: * Constraint 1: We can only pick consecutive files - Constraint 1a: When a file is being compacted (or some input files are being compacted after expanding), we cannot choose it and have to stop choosing more files - Constraint 1b: When we reach the last file (with the largest keys), we cannot choose more files (the next file will be the first one with small keys) * Constraint 2: We should ensure the total compaction bytes (including the overlapped files from the next level) is no more than `mutable_cf_options_.max_compaction_bytes` * Constraint 3: We try our best to pick as many files as possible so that the post-compaction level size can be just less than `MaxBytesForLevel(start_level_)` * Constraint 4: If trivial move is allowed, we reuse the logic of `TryNonL0TrivialMove()` instead of expanding files with Constraint 3 More details can be found in `LevelCompactionBuilder::SetupOtherFilesWithRoundRobinExpansion()`. The above optimization accelerates the process of moving the compaction cursor, in which the write-amp can be further reduced. While a large compaction may lead to high write stall, we break this large compaction into several subcompactions **regardless of** the `max_subcompactions` limit. The number of subcompactions for round-robin compaction priority is determined through the following steps: * Step 1: Initialized against `max_output_file_limit`, the number of input files in the start level, and also the range size limit `ranges.size()` * Step 2: Call `AcquireSubcompactionResources()`when max subcompactions is not sufficient, but we may or may not obtain desired resources, additional number of resources is stored in `extra_num_subcompaction_threads_reserved_`). Subcompaction limit is changed and update `num_planned_subcompactions` with `GetSubcompactionLimit()` * Step 3: Call `ShrinkSubcompactionResources()` to ensure extra resources can be released (extra resources may exist for round-robin compaction when the number of actual number of subcompactions is less than the number of planned subcompactions) More details can be found in `CompactionJob::AcquireSubcompactionResources()`,`CompactionJob::ShrinkSubcompactionResources()`, and `CompactionJob::ReleaseSubcompactionResources()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10341 Test Plan: Add `CompactionPriMultipleFilesRoundRobin[1-3]` unit test in `compaction_picker_test.cc` and `RoundRobinSubcompactionsAgainstResources.SubcompactionsUsingResources/[0-4]`, `RoundRobinSubcompactionsAgainstPressureToken.PressureTokenTest/[0-1]` in `db_compaction_test.cc` Reviewed By: ajkr, hx235 Differential Revision: D37792644 Pulled By: littlepig2013 fbshipit-source-id: 7fecb7c4ffd97b34bbf6e3b760b2c35a772a0657
2022-07-24 18:12:44 +00:00
// Since the max bytes for level 2 is 120M, picking one file to compact
// makes the post-compaction level size less than 120M, there is exactly one
// file picked for round-robin compaction
Add basic kRoundRobin compaction policy (#10107) Summary: Add `kRoundRobin` as a compaction priority. The implementation is as follows. - Define a cursor as the smallest Internal key in the successor of the selected file. Add `vector<InternalKey> compact_cursor_` into `VersionStorageInfo` where each element (`InternalKey`) in `compact_cursor_` represents a cursor. In round-robin compaction policy, we just need to select the first file (assuming files are sorted) and also has the smallest InternalKey larger than/equal to the cursor. After a file is chosen, we create a new `Fsize` vector which puts the selected file is placed at the first position in `temp`, the next cursor is then updated as the smallest InternalKey in successor of the selected file (the above logic is implemented in `SortFileByRoundRobin`). - After a compaction succeeds, typically `InstallCompactionResults()`, we choose the next cursor for the input level and save it to `edit`. When calling `LogAndApply`, we save the next cursor with its level into some local variable and finally apply the change to `vstorage` in `SaveTo` function. - Cursors are persist pair by pair (<level, InternalKey>) in `EncodeTo` so that they can be reconstructed when reopening. An empty cursor will not be encoded to MANIFEST Pull Request resolved: https://github.com/facebook/rocksdb/pull/10107 Test Plan: add unit test (`CompactionPriRoundRobin`) in `compaction_picker_test`, add `kRoundRobin` priority in `CompactionPriTest` from `db_compaction_test`, and add `PersistRoundRobinCompactCursor` in `db_compaction_test` Reviewed By: ajkr Differential Revision: D37316037 Pulled By: littlepig2013 fbshipit-source-id: 9f481748190ace416079139044e00df2968fb1ee
2022-06-21 18:56:53 +00:00
ASSERT_EQ(1U, compaction->num_input_files(0));
ASSERT_EQ(selected_files[i], compaction->input(0, 0)->fd.GetNumber());
// release the version storage
DeleteVersionStorage();
}
}
Support subcmpct using reserved resources for round-robin priority (#10341) Summary: Earlier implementation of round-robin priority can only pick one file at a time and disallows parallel compactions within the same level. In this PR, round-robin compaction policy will expand towards more input files with respecting some additional constraints, which are summarized as follows: * Constraint 1: We can only pick consecutive files - Constraint 1a: When a file is being compacted (or some input files are being compacted after expanding), we cannot choose it and have to stop choosing more files - Constraint 1b: When we reach the last file (with the largest keys), we cannot choose more files (the next file will be the first one with small keys) * Constraint 2: We should ensure the total compaction bytes (including the overlapped files from the next level) is no more than `mutable_cf_options_.max_compaction_bytes` * Constraint 3: We try our best to pick as many files as possible so that the post-compaction level size can be just less than `MaxBytesForLevel(start_level_)` * Constraint 4: If trivial move is allowed, we reuse the logic of `TryNonL0TrivialMove()` instead of expanding files with Constraint 3 More details can be found in `LevelCompactionBuilder::SetupOtherFilesWithRoundRobinExpansion()`. The above optimization accelerates the process of moving the compaction cursor, in which the write-amp can be further reduced. While a large compaction may lead to high write stall, we break this large compaction into several subcompactions **regardless of** the `max_subcompactions` limit. The number of subcompactions for round-robin compaction priority is determined through the following steps: * Step 1: Initialized against `max_output_file_limit`, the number of input files in the start level, and also the range size limit `ranges.size()` * Step 2: Call `AcquireSubcompactionResources()`when max subcompactions is not sufficient, but we may or may not obtain desired resources, additional number of resources is stored in `extra_num_subcompaction_threads_reserved_`). Subcompaction limit is changed and update `num_planned_subcompactions` with `GetSubcompactionLimit()` * Step 3: Call `ShrinkSubcompactionResources()` to ensure extra resources can be released (extra resources may exist for round-robin compaction when the number of actual number of subcompactions is less than the number of planned subcompactions) More details can be found in `CompactionJob::AcquireSubcompactionResources()`,`CompactionJob::ShrinkSubcompactionResources()`, and `CompactionJob::ReleaseSubcompactionResources()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10341 Test Plan: Add `CompactionPriMultipleFilesRoundRobin[1-3]` unit test in `compaction_picker_test.cc` and `RoundRobinSubcompactionsAgainstResources.SubcompactionsUsingResources/[0-4]`, `RoundRobinSubcompactionsAgainstPressureToken.PressureTokenTest/[0-1]` in `db_compaction_test.cc` Reviewed By: ajkr, hx235 Differential Revision: D37792644 Pulled By: littlepig2013 fbshipit-source-id: 7fecb7c4ffd97b34bbf6e3b760b2c35a772a0657
2022-07-24 18:12:44 +00:00
TEST_F(CompactionPickerTest, CompactionPriMultipleFilesRoundRobin1) {
ioptions_.compaction_pri = kRoundRobin;
mutable_cf_options_.max_compaction_bytes = 100000000u;
mutable_cf_options_.max_bytes_for_level_base = 120;
mutable_cf_options_.max_bytes_for_level_multiplier = 10;
// start a brand new version in each test.
NewVersionStorage(6, kCompactionStyleLevel);
vstorage_->ResizeCompactCursors(6);
// Set the cursor (file picking should start with 7U)
vstorage_->AddCursorForOneLevel(2, InternalKey("199", 100, kTypeValue));
Add(2, 6U, "150", "199", 500U);
Add(2, 7U, "200", "249", 500U);
Add(2, 8U, "300", "600", 500U);
Add(2, 9U, "700", "800", 500U);
Add(2, 10U, "850", "950", 500U);
Add(3, 26U, "130", "165", 600U);
Add(3, 27U, "166", "170", 600U);
Add(3, 28U, "270", "340", 600U);
Add(3, 29U, "401", "500", 600U);
Add(3, 30U, "601", "800", 600U);
Add(3, 31U, "830", "890", 600U);
UpdateVersionStorageInfo();
LevelCompactionPicker local_level_compaction_picker =
LevelCompactionPicker(ioptions_, &icmp_);
std::unique_ptr<Compaction> compaction(
local_level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
Support subcmpct using reserved resources for round-robin priority (#10341) Summary: Earlier implementation of round-robin priority can only pick one file at a time and disallows parallel compactions within the same level. In this PR, round-robin compaction policy will expand towards more input files with respecting some additional constraints, which are summarized as follows: * Constraint 1: We can only pick consecutive files - Constraint 1a: When a file is being compacted (or some input files are being compacted after expanding), we cannot choose it and have to stop choosing more files - Constraint 1b: When we reach the last file (with the largest keys), we cannot choose more files (the next file will be the first one with small keys) * Constraint 2: We should ensure the total compaction bytes (including the overlapped files from the next level) is no more than `mutable_cf_options_.max_compaction_bytes` * Constraint 3: We try our best to pick as many files as possible so that the post-compaction level size can be just less than `MaxBytesForLevel(start_level_)` * Constraint 4: If trivial move is allowed, we reuse the logic of `TryNonL0TrivialMove()` instead of expanding files with Constraint 3 More details can be found in `LevelCompactionBuilder::SetupOtherFilesWithRoundRobinExpansion()`. The above optimization accelerates the process of moving the compaction cursor, in which the write-amp can be further reduced. While a large compaction may lead to high write stall, we break this large compaction into several subcompactions **regardless of** the `max_subcompactions` limit. The number of subcompactions for round-robin compaction priority is determined through the following steps: * Step 1: Initialized against `max_output_file_limit`, the number of input files in the start level, and also the range size limit `ranges.size()` * Step 2: Call `AcquireSubcompactionResources()`when max subcompactions is not sufficient, but we may or may not obtain desired resources, additional number of resources is stored in `extra_num_subcompaction_threads_reserved_`). Subcompaction limit is changed and update `num_planned_subcompactions` with `GetSubcompactionLimit()` * Step 3: Call `ShrinkSubcompactionResources()` to ensure extra resources can be released (extra resources may exist for round-robin compaction when the number of actual number of subcompactions is less than the number of planned subcompactions) More details can be found in `CompactionJob::AcquireSubcompactionResources()`,`CompactionJob::ShrinkSubcompactionResources()`, and `CompactionJob::ReleaseSubcompactionResources()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10341 Test Plan: Add `CompactionPriMultipleFilesRoundRobin[1-3]` unit test in `compaction_picker_test.cc` and `RoundRobinSubcompactionsAgainstResources.SubcompactionsUsingResources/[0-4]`, `RoundRobinSubcompactionsAgainstPressureToken.PressureTokenTest/[0-1]` in `db_compaction_test.cc` Reviewed By: ajkr, hx235 Differential Revision: D37792644 Pulled By: littlepig2013 fbshipit-source-id: 7fecb7c4ffd97b34bbf6e3b760b2c35a772a0657
2022-07-24 18:12:44 +00:00
ASSERT_TRUE(compaction.get() != nullptr);
// The maximum compaction bytes is very large in this case so we can igore its
// constraint in this test case. The maximum bytes for level 2 is 1200
// bytes, and thus at least 3 files should be picked so that the bytes in
// level 2 is less than the maximum
ASSERT_EQ(3U, compaction->num_input_files(0));
ASSERT_EQ(7U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(8U, compaction->input(0, 1)->fd.GetNumber());
ASSERT_EQ(9U, compaction->input(0, 2)->fd.GetNumber());
// release the version storage
DeleteVersionStorage();
}
TEST_F(CompactionPickerTest, CompactionPriMultipleFilesRoundRobin2) {
ioptions_.compaction_pri = kRoundRobin;
mutable_cf_options_.max_compaction_bytes = 2500u;
mutable_cf_options_.max_bytes_for_level_base = 120;
mutable_cf_options_.max_bytes_for_level_multiplier = 10;
// start a brand new version in each test.
NewVersionStorage(6, kCompactionStyleLevel);
vstorage_->ResizeCompactCursors(6);
// Set the cursor (file picking should start with 6U)
vstorage_->AddCursorForOneLevel(2, InternalKey("1000", 100, kTypeValue));
Add(2, 6U, "150", "199", 500U); // Overlap with 26U, 27U
Add(2, 7U, "200", "249", 500U); // Overlap with 27U
Add(2, 8U, "300", "600", 500U); // Overlap with 28U, 29U
Add(2, 9U, "700", "800", 500U);
Add(2, 10U, "850", "950", 500U);
Add(3, 26U, "130", "165", 600U);
Add(3, 27U, "166", "230", 600U);
Add(3, 28U, "270", "340", 600U);
Add(3, 29U, "401", "500", 600U);
Add(3, 30U, "601", "800", 600U);
Add(3, 31U, "830", "890", 600U);
UpdateVersionStorageInfo();
LevelCompactionPicker local_level_compaction_picker =
LevelCompactionPicker(ioptions_, &icmp_);
std::unique_ptr<Compaction> compaction(
local_level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
Support subcmpct using reserved resources for round-robin priority (#10341) Summary: Earlier implementation of round-robin priority can only pick one file at a time and disallows parallel compactions within the same level. In this PR, round-robin compaction policy will expand towards more input files with respecting some additional constraints, which are summarized as follows: * Constraint 1: We can only pick consecutive files - Constraint 1a: When a file is being compacted (or some input files are being compacted after expanding), we cannot choose it and have to stop choosing more files - Constraint 1b: When we reach the last file (with the largest keys), we cannot choose more files (the next file will be the first one with small keys) * Constraint 2: We should ensure the total compaction bytes (including the overlapped files from the next level) is no more than `mutable_cf_options_.max_compaction_bytes` * Constraint 3: We try our best to pick as many files as possible so that the post-compaction level size can be just less than `MaxBytesForLevel(start_level_)` * Constraint 4: If trivial move is allowed, we reuse the logic of `TryNonL0TrivialMove()` instead of expanding files with Constraint 3 More details can be found in `LevelCompactionBuilder::SetupOtherFilesWithRoundRobinExpansion()`. The above optimization accelerates the process of moving the compaction cursor, in which the write-amp can be further reduced. While a large compaction may lead to high write stall, we break this large compaction into several subcompactions **regardless of** the `max_subcompactions` limit. The number of subcompactions for round-robin compaction priority is determined through the following steps: * Step 1: Initialized against `max_output_file_limit`, the number of input files in the start level, and also the range size limit `ranges.size()` * Step 2: Call `AcquireSubcompactionResources()`when max subcompactions is not sufficient, but we may or may not obtain desired resources, additional number of resources is stored in `extra_num_subcompaction_threads_reserved_`). Subcompaction limit is changed and update `num_planned_subcompactions` with `GetSubcompactionLimit()` * Step 3: Call `ShrinkSubcompactionResources()` to ensure extra resources can be released (extra resources may exist for round-robin compaction when the number of actual number of subcompactions is less than the number of planned subcompactions) More details can be found in `CompactionJob::AcquireSubcompactionResources()`,`CompactionJob::ShrinkSubcompactionResources()`, and `CompactionJob::ReleaseSubcompactionResources()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10341 Test Plan: Add `CompactionPriMultipleFilesRoundRobin[1-3]` unit test in `compaction_picker_test.cc` and `RoundRobinSubcompactionsAgainstResources.SubcompactionsUsingResources/[0-4]`, `RoundRobinSubcompactionsAgainstPressureToken.PressureTokenTest/[0-1]` in `db_compaction_test.cc` Reviewed By: ajkr, hx235 Differential Revision: D37792644 Pulled By: littlepig2013 fbshipit-source-id: 7fecb7c4ffd97b34bbf6e3b760b2c35a772a0657
2022-07-24 18:12:44 +00:00
ASSERT_TRUE(compaction.get() != nullptr);
// The maximum compaction bytes is only 2500 bytes now. Even though we are
// required to choose 3 files so that the post-compaction level size is less
// than 1200 bytes. We cannot pick 3 files to compact since the maximum
// compaction size is 2500. After picking files 6U and 7U, the number of
// compaction bytes has reached 2200, and thus no more space to add another
// input file with 50M bytes.
ASSERT_EQ(2U, compaction->num_input_files(0));
ASSERT_EQ(6U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(7U, compaction->input(0, 1)->fd.GetNumber());
// release the version storage
DeleteVersionStorage();
}
TEST_F(CompactionPickerTest, CompactionPriMultipleFilesRoundRobin3) {
ioptions_.compaction_pri = kRoundRobin;
mutable_cf_options_.max_compaction_bytes = 1000000u;
mutable_cf_options_.max_bytes_for_level_base = 120;
mutable_cf_options_.max_bytes_for_level_multiplier = 10;
// start a brand new version in each test.
NewVersionStorage(6, kCompactionStyleLevel);
vstorage_->ResizeCompactCursors(6);
// Set the cursor (file picking should start with 9U)
vstorage_->AddCursorForOneLevel(2, InternalKey("700", 100, kTypeValue));
Add(2, 6U, "150", "199", 500U);
Add(2, 7U, "200", "249", 500U);
Add(2, 8U, "300", "600", 500U);
Add(2, 9U, "700", "800", 500U);
Add(2, 10U, "850", "950", 500U);
Add(3, 26U, "130", "165", 600U);
Add(3, 27U, "166", "170", 600U);
Add(3, 28U, "270", "340", 600U);
Add(3, 29U, "401", "500", 600U);
Add(3, 30U, "601", "800", 600U);
Add(3, 31U, "830", "890", 600U);
UpdateVersionStorageInfo();
LevelCompactionPicker local_level_compaction_picker =
LevelCompactionPicker(ioptions_, &icmp_);
std::unique_ptr<Compaction> compaction(
local_level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
Support subcmpct using reserved resources for round-robin priority (#10341) Summary: Earlier implementation of round-robin priority can only pick one file at a time and disallows parallel compactions within the same level. In this PR, round-robin compaction policy will expand towards more input files with respecting some additional constraints, which are summarized as follows: * Constraint 1: We can only pick consecutive files - Constraint 1a: When a file is being compacted (or some input files are being compacted after expanding), we cannot choose it and have to stop choosing more files - Constraint 1b: When we reach the last file (with the largest keys), we cannot choose more files (the next file will be the first one with small keys) * Constraint 2: We should ensure the total compaction bytes (including the overlapped files from the next level) is no more than `mutable_cf_options_.max_compaction_bytes` * Constraint 3: We try our best to pick as many files as possible so that the post-compaction level size can be just less than `MaxBytesForLevel(start_level_)` * Constraint 4: If trivial move is allowed, we reuse the logic of `TryNonL0TrivialMove()` instead of expanding files with Constraint 3 More details can be found in `LevelCompactionBuilder::SetupOtherFilesWithRoundRobinExpansion()`. The above optimization accelerates the process of moving the compaction cursor, in which the write-amp can be further reduced. While a large compaction may lead to high write stall, we break this large compaction into several subcompactions **regardless of** the `max_subcompactions` limit. The number of subcompactions for round-robin compaction priority is determined through the following steps: * Step 1: Initialized against `max_output_file_limit`, the number of input files in the start level, and also the range size limit `ranges.size()` * Step 2: Call `AcquireSubcompactionResources()`when max subcompactions is not sufficient, but we may or may not obtain desired resources, additional number of resources is stored in `extra_num_subcompaction_threads_reserved_`). Subcompaction limit is changed and update `num_planned_subcompactions` with `GetSubcompactionLimit()` * Step 3: Call `ShrinkSubcompactionResources()` to ensure extra resources can be released (extra resources may exist for round-robin compaction when the number of actual number of subcompactions is less than the number of planned subcompactions) More details can be found in `CompactionJob::AcquireSubcompactionResources()`,`CompactionJob::ShrinkSubcompactionResources()`, and `CompactionJob::ReleaseSubcompactionResources()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10341 Test Plan: Add `CompactionPriMultipleFilesRoundRobin[1-3]` unit test in `compaction_picker_test.cc` and `RoundRobinSubcompactionsAgainstResources.SubcompactionsUsingResources/[0-4]`, `RoundRobinSubcompactionsAgainstPressureToken.PressureTokenTest/[0-1]` in `db_compaction_test.cc` Reviewed By: ajkr, hx235 Differential Revision: D37792644 Pulled By: littlepig2013 fbshipit-source-id: 7fecb7c4ffd97b34bbf6e3b760b2c35a772a0657
2022-07-24 18:12:44 +00:00
ASSERT_TRUE(compaction.get() != nullptr);
// Cannot pick more files since we reach the last file in level 2
ASSERT_EQ(2U, compaction->num_input_files(0));
ASSERT_EQ(9U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(10U, compaction->input(0, 1)->fd.GetNumber());
// release the version storage
DeleteVersionStorage();
}
TEST_F(CompactionPickerTest, CompactionPriMinOverlappingManyFiles) {
NewVersionStorage(6, kCompactionStyleLevel);
ioptions_.compaction_pri = kMinOverlappingRatio;
mutable_cf_options_.max_bytes_for_level_base = 15000000;
mutable_cf_options_.max_bytes_for_level_multiplier = 10;
// file 7 and 8 over lap with the same file, but file 8 is smaller so
// it will be picked.
Add(2, 13U, "010", "011",
6100U); // Overlaps with a large file. Not picked
Add(2, 14U, "020", "021",
6100U); // Overlaps with a large file. Not picked
Add(2, 15U, "030", "031",
6100U); // Overlaps with a large file. Not picked
Add(2, 16U, "040", "041",
6100U); // Overlaps with a large file. Not picked
Add(2, 17U, "050", "051",
6100U); // Overlaps with a large file. Not picked
Add(2, 18U, "060", "061",
6100U); // Overlaps with a large file. Not picked
Add(2, 19U, "070", "071",
6100U); // Overlaps with a large file. Not picked
Add(2, 20U, "080", "081",
6100U); // Overlaps with a large file. Not picked
Add(2, 6U, "150", "167", 60000000U); // Overlaps with file 26, 27
Add(2, 7U, "168", "169", 60000000U); // Overlaps with file 27
Add(2, 8U, "201", "300", 61000000U); // Overlaps with file 28, but the file
// itself is larger. Should be picked.
Add(2, 9U, "610", "611",
6100U); // Overlaps with a large file. Not picked
Add(2, 10U, "620", "621",
6100U); // Overlaps with a large file. Not picked
Add(2, 11U, "630", "631",
6100U); // Overlaps with a large file. Not picked
Add(2, 12U, "640", "641",
6100U); // Overlaps with a large file. Not picked
Add(3, 31U, "001", "100", 260000000U);
Add(3, 26U, "160", "165", 260000000U);
Add(3, 27U, "166", "170", 260000000U);
Add(3, 28U, "180", "400", 260000000U);
Add(3, 29U, "401", "500", 260000000U);
Add(3, 30U, "601", "700", 260000000U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(1U, compaction->num_input_files(0));
// Picking file 8 because overlapping ratio is the biggest.
ASSERT_EQ(8U, compaction->input(0, 0)->fd.GetNumber());
}
// This test exhibits the bug where we don't properly reset parent_index in
// PickCompaction()
TEST_F(CompactionPickerTest, ParentIndexResetBug) {
int num_levels = ioptions_.num_levels;
mutable_cf_options_.level0_file_num_compaction_trigger = 2;
mutable_cf_options_.max_bytes_for_level_base = 200;
NewVersionStorage(num_levels, kCompactionStyleLevel);
Add(0, 1U, "150", "200"); // <- marked for compaction
Add(1, 3U, "400", "500", 600); // <- this one needs compacting
Add(2, 4U, "150", "200");
Add(2, 5U, "201", "210");
Add(2, 6U, "300", "310");
Add(2, 7U, "400", "500"); // <- being compacted
vstorage_->LevelFiles(2)[3]->being_compacted = true;
vstorage_->LevelFiles(0)[0]->marked_for_compaction = true;
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
}
// This test checks ExpandWhileOverlapping() by having overlapping user keys
// ranges (with different sequence numbers) in the input files.
TEST_F(CompactionPickerTest, OverlappingUserKeys) {
NewVersionStorage(6, kCompactionStyleLevel);
ioptions_.compaction_pri = kByCompensatedSize;
Add(1, 1U, "100", "150", 1U);
// Overlapping user keys
Add(1, 2U, "200", "400", 1U);
Add(1, 3U, "400", "500", 1000000000U, 0, 0);
Add(2, 4U, "600", "700", 1U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(1U, compaction->num_input_levels());
ASSERT_EQ(2U, compaction->num_input_files(0));
ASSERT_EQ(2U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(3U, compaction->input(0, 1)->fd.GetNumber());
}
TEST_F(CompactionPickerTest, OverlappingUserKeys2) {
NewVersionStorage(6, kCompactionStyleLevel);
// Overlapping user keys on same level and output level
Add(1, 1U, "200", "400", 1000000000U);
Add(1, 2U, "400", "500", 1U, 0, 0);
Add(2, 3U, "000", "100", 1U);
Add(2, 4U, "100", "600", 1U, 0, 0);
Add(2, 5U, "600", "700", 1U, 0, 0);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(2U, compaction->num_input_levels());
ASSERT_EQ(2U, compaction->num_input_files(0));
ASSERT_EQ(3U, compaction->num_input_files(1));
ASSERT_EQ(1U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(2U, compaction->input(0, 1)->fd.GetNumber());
ASSERT_EQ(3U, compaction->input(1, 0)->fd.GetNumber());
ASSERT_EQ(4U, compaction->input(1, 1)->fd.GetNumber());
ASSERT_EQ(5U, compaction->input(1, 2)->fd.GetNumber());
}
TEST_F(CompactionPickerTest, OverlappingUserKeys3) {
NewVersionStorage(6, kCompactionStyleLevel);
// Chain of overlapping user key ranges (forces ExpandWhileOverlapping() to
// expand multiple times)
Add(1, 1U, "100", "150", 1U);
Add(1, 2U, "150", "200", 1U, 0, 0);
Add(1, 3U, "200", "250", 1000000000U, 0, 0);
Add(1, 4U, "250", "300", 1U, 0, 0);
Add(1, 5U, "300", "350", 1U, 0, 0);
// Output level overlaps with the beginning and the end of the chain
Add(2, 6U, "050", "100", 1U);
Add(2, 7U, "350", "400", 1U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(2U, compaction->num_input_levels());
ASSERT_EQ(5U, compaction->num_input_files(0));
ASSERT_EQ(2U, compaction->num_input_files(1));
ASSERT_EQ(1U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(2U, compaction->input(0, 1)->fd.GetNumber());
ASSERT_EQ(3U, compaction->input(0, 2)->fd.GetNumber());
ASSERT_EQ(4U, compaction->input(0, 3)->fd.GetNumber());
ASSERT_EQ(5U, compaction->input(0, 4)->fd.GetNumber());
ASSERT_EQ(6U, compaction->input(1, 0)->fd.GetNumber());
ASSERT_EQ(7U, compaction->input(1, 1)->fd.GetNumber());
}
TEST_F(CompactionPickerTest, OverlappingUserKeys4) {
NewVersionStorage(6, kCompactionStyleLevel);
mutable_cf_options_.max_bytes_for_level_base = 1000000;
Add(1, 1U, "100", "150", 1U);
Add(1, 2U, "150", "199", 1U, 0, 0);
Add(1, 3U, "200", "250", 1100000U, 0, 0);
Add(1, 4U, "251", "300", 1U, 0, 0);
Add(1, 5U, "300", "350", 1U, 0, 0);
Add(2, 6U, "100", "115", 1U);
Add(2, 7U, "125", "325", 1U);
Add(2, 8U, "350", "400", 1U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(2U, compaction->num_input_levels());
ASSERT_EQ(1U, compaction->num_input_files(0));
ASSERT_EQ(1U, compaction->num_input_files(1));
ASSERT_EQ(3U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(7U, compaction->input(1, 0)->fd.GetNumber());
}
TEST_F(CompactionPickerTest, OverlappingUserKeys5) {
NewVersionStorage(6, kCompactionStyleLevel);
// Overlapping user keys on same level and output level
Add(1, 1U, "200", "400", 1000000000U);
Add(1, 2U, "400", "500", 1U, 0, 0);
Add(2, 3U, "000", "100", 1U);
Add(2, 4U, "100", "600", 1U, 0, 0);
Add(2, 5U, "600", "700", 1U, 0, 0);
vstorage_->LevelFiles(2)[2]->being_compacted = true;
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() == nullptr);
}
TEST_F(CompactionPickerTest, OverlappingUserKeys6) {
NewVersionStorage(6, kCompactionStyleLevel);
// Overlapping user keys on same level and output level
Add(1, 1U, "200", "400", 1U, 0, 0);
Add(1, 2U, "401", "500", 1U, 0, 0);
Add(2, 3U, "000", "100", 1U);
Add(2, 4U, "100", "300", 1U, 0, 0);
Add(2, 5U, "305", "450", 1U, 0, 0);
Add(2, 6U, "460", "600", 1U, 0, 0);
Add(2, 7U, "600", "700", 1U, 0, 0);
vstorage_->LevelFiles(1)[0]->marked_for_compaction = true;
vstorage_->LevelFiles(1)[1]->marked_for_compaction = true;
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(2U, compaction->num_input_levels());
ASSERT_EQ(1U, compaction->num_input_files(0));
ASSERT_EQ(3U, compaction->num_input_files(1));
}
TEST_F(CompactionPickerTest, OverlappingUserKeys7) {
NewVersionStorage(6, kCompactionStyleLevel);
mutable_cf_options_.max_compaction_bytes = 100000000000u;
// Overlapping user keys on same level and output level
Add(1, 1U, "200", "400", 1U, 0, 0);
Add(1, 2U, "401", "500", 1000000000U, 0, 0);
Add(2, 3U, "100", "250", 1U);
Add(2, 4U, "300", "600", 1U, 0, 0);
Add(2, 5U, "600", "800", 1U, 0, 0);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(2U, compaction->num_input_levels());
ASSERT_GE(1U, compaction->num_input_files(0));
ASSERT_GE(2U, compaction->num_input_files(1));
// File 5 has to be included in the compaction
ASSERT_EQ(5U, compaction->inputs(1)->back()->fd.GetNumber());
}
TEST_F(CompactionPickerTest, OverlappingUserKeys8) {
NewVersionStorage(6, kCompactionStyleLevel);
mutable_cf_options_.max_compaction_bytes = 100000000000u;
// grow the number of inputs in "level" without
// changing the number of "level+1" files we pick up
// Expand input level as much as possible
// no overlapping case
Add(1, 1U, "101", "150", 1U);
Add(1, 2U, "151", "200", 1U);
Add(1, 3U, "201", "300", 1000000000U);
Add(1, 4U, "301", "400", 1U);
Add(1, 5U, "401", "500", 1U);
Add(2, 6U, "150", "200", 1U);
Add(2, 7U, "200", "450", 1U, 0, 0);
Add(2, 8U, "500", "600", 1U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(2U, compaction->num_input_levels());
ASSERT_EQ(3U, compaction->num_input_files(0));
ASSERT_EQ(2U, compaction->num_input_files(1));
ASSERT_EQ(2U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(3U, compaction->input(0, 1)->fd.GetNumber());
ASSERT_EQ(4U, compaction->input(0, 2)->fd.GetNumber());
ASSERT_EQ(6U, compaction->input(1, 0)->fd.GetNumber());
ASSERT_EQ(7U, compaction->input(1, 1)->fd.GetNumber());
}
TEST_F(CompactionPickerTest, OverlappingUserKeys9) {
NewVersionStorage(6, kCompactionStyleLevel);
mutable_cf_options_.max_compaction_bytes = 100000000000u;
// grow the number of inputs in "level" without
// changing the number of "level+1" files we pick up
// Expand input level as much as possible
// overlapping case
Add(1, 1U, "121", "150", 1U);
Add(1, 2U, "151", "200", 1U);
Add(1, 3U, "201", "300", 1000000000U);
Add(1, 4U, "301", "400", 1U);
Add(1, 5U, "401", "500", 1U);
Add(2, 6U, "100", "120", 1U);
Add(2, 7U, "150", "200", 1U);
Add(2, 8U, "200", "450", 1U, 0, 0);
Add(2, 9U, "501", "600", 1U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(2U, compaction->num_input_levels());
ASSERT_EQ(5U, compaction->num_input_files(0));
ASSERT_EQ(2U, compaction->num_input_files(1));
ASSERT_EQ(1U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(2U, compaction->input(0, 1)->fd.GetNumber());
ASSERT_EQ(3U, compaction->input(0, 2)->fd.GetNumber());
ASSERT_EQ(4U, compaction->input(0, 3)->fd.GetNumber());
ASSERT_EQ(7U, compaction->input(1, 0)->fd.GetNumber());
ASSERT_EQ(8U, compaction->input(1, 1)->fd.GetNumber());
}
2017-07-20 03:33:52 +00:00
TEST_F(CompactionPickerTest, OverlappingUserKeys10) {
// Locked file encountered when pulling in extra input-level files with same
// user keys. Verify we pick the next-best file from the same input level.
NewVersionStorage(6, kCompactionStyleLevel);
mutable_cf_options_.max_compaction_bytes = 100000000000u;
// file_number 2U is largest and thus first choice. But it overlaps with
// file_number 1U which is being compacted. So instead we pick the next-
// biggest file, 3U, which is eligible for compaction.
Add(1 /* level */, 1U /* file_number */, "100" /* smallest */,
"150" /* largest */, 1U /* file_size */);
file_map_[1U].first->being_compacted = true;
Add(1 /* level */, 2U /* file_number */, "150" /* smallest */,
"200" /* largest */, 1000000000U /* file_size */, 0 /* smallest_seq */,
0 /* largest_seq */);
Add(1 /* level */, 3U /* file_number */, "201" /* smallest */,
"250" /* largest */, 900000000U /* file_size */);
Add(2 /* level */, 4U /* file_number */, "100" /* smallest */,
"150" /* largest */, 1U /* file_size */);
Add(2 /* level */, 5U /* file_number */, "151" /* smallest */,
"200" /* largest */, 1U /* file_size */);
Add(2 /* level */, 6U /* file_number */, "201" /* smallest */,
"250" /* largest */, 1U /* file_size */);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
2017-07-20 03:33:52 +00:00
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(2U, compaction->num_input_levels());
ASSERT_EQ(1U, compaction->num_input_files(0));
ASSERT_EQ(1U, compaction->num_input_files(1));
ASSERT_EQ(3U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(6U, compaction->input(1, 0)->fd.GetNumber());
}
TEST_F(CompactionPickerTest, OverlappingUserKeys11) {
// Locked file encountered when pulling in extra output-level files with same
// user keys. Expected to skip that compaction and pick the next-best choice.
NewVersionStorage(6, kCompactionStyleLevel);
mutable_cf_options_.max_compaction_bytes = 100000000000u;
// score(L1) = 3.7
// score(L2) = 1.85
// There is no eligible file in L1 to compact since both candidates pull in
// file_number 5U, which overlaps with a file pending compaction (6U). The
// first eligible compaction is from L2->L3.
Add(1 /* level */, 2U /* file_number */, "151" /* smallest */,
"200" /* largest */, 1000000000U /* file_size */);
Add(1 /* level */, 3U /* file_number */, "201" /* smallest */,
"250" /* largest */, 1U /* file_size */);
Add(2 /* level */, 4U /* file_number */, "100" /* smallest */,
"149" /* largest */, 5000000000U /* file_size */);
Add(2 /* level */, 5U /* file_number */, "150" /* smallest */,
"201" /* largest */, 1U /* file_size */);
Add(2 /* level */, 6U /* file_number */, "201" /* smallest */,
"249" /* largest */, 1U /* file_size */, 0 /* smallest_seq */,
0 /* largest_seq */);
file_map_[6U].first->being_compacted = true;
Add(3 /* level */, 7U /* file_number */, "100" /* smallest */,
"149" /* largest */, 1U /* file_size */);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
2017-07-20 03:33:52 +00:00
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(2U, compaction->num_input_levels());
ASSERT_EQ(1U, compaction->num_input_files(0));
ASSERT_EQ(1U, compaction->num_input_files(1));
ASSERT_EQ(4U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(7U, compaction->input(1, 0)->fd.GetNumber());
}
Try to start TTL earlier with kMinOverlappingRatio is used (#8749) Summary: Right now, when options.ttl is set, compactions are triggered around the time when TTL is reached. This might cause extra compactions which are often bursty. This commit tries to mitigate it by picking those files earlier in normal compaction picking process. This is only implemented using kMinOverlappingRatio with Leveled compaction as it is the default value and it is more complicated to change other styles. When a file is aged more than ttl/2, RocksDB starts to boost the compaction priority of files in normal compaction picking process, and hope by the time TTL is reached, very few extra compaction is needed. In order for this to work, another change is made: during a compaction, if an output level file is older than ttl/2, cut output files based on original boundary (if it is not in the last level). This is to make sure that after an old file is moved to the next level, and new data is merged from the upper level, the new data falling into this range isn't reset with old timestamp. Without this change, in many cases, most files from one level will keep having old timestamp, even if they have newer data and we stuck in it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8749 Test Plan: Add a unit test to test the boosting logic. Will add a unit test to test it end-to-end. Reviewed By: jay-zhuang Differential Revision: D30735261 fbshipit-source-id: 503c2d89250b22911eb99e72b379be154de3428e
2021-11-01 21:32:12 +00:00
TEST_F(CompactionPickerTest, FileTtlBooster) {
// Set TTL to 2048
// TTL boosting for all levels starts at 1024,
// Whole TTL range is 2048 * 31 / 32 - 1024 = 1984 - 1024 = 960.
// From second last level (L5), range starts at
// 1024 + 480, 1024 + 240, 1024 + 120 (which is L3).
// Boosting step 124 / 16 = 7.75 -> 7
//
const uint64_t kCurrentTime = 1000000;
FileMetaData meta;
{
FileTtlBooster booster(kCurrentTime, 2048, 7, 3);
// Not triggering if the file is younger than ttl/2
meta.oldest_ancester_time = kCurrentTime - 1023;
ASSERT_EQ(1, booster.GetBoostScore(&meta));
meta.oldest_ancester_time = kCurrentTime - 1024;
ASSERT_EQ(1, booster.GetBoostScore(&meta));
meta.oldest_ancester_time = kCurrentTime + 10;
ASSERT_EQ(1, booster.GetBoostScore(&meta));
// Within one boosting step
meta.oldest_ancester_time = kCurrentTime - (1024 + 120 + 6);
ASSERT_EQ(1, booster.GetBoostScore(&meta));
// One boosting step
meta.oldest_ancester_time = kCurrentTime - (1024 + 120 + 7);
ASSERT_EQ(2, booster.GetBoostScore(&meta));
meta.oldest_ancester_time = kCurrentTime - (1024 + 120 + 8);
ASSERT_EQ(2, booster.GetBoostScore(&meta));
// Multiple boosting steps
meta.oldest_ancester_time = kCurrentTime - (1024 + 120 + 30);
ASSERT_EQ(5, booster.GetBoostScore(&meta));
// Very high boosting steps
meta.oldest_ancester_time = kCurrentTime - (1024 + 120 + 700);
ASSERT_EQ(101, booster.GetBoostScore(&meta));
}
{
// Test second last level
FileTtlBooster booster(kCurrentTime, 2048, 7, 5);
meta.oldest_ancester_time = kCurrentTime - (1024 + 480);
ASSERT_EQ(1, booster.GetBoostScore(&meta));
meta.oldest_ancester_time = kCurrentTime - (1024 + 480 + 60);
ASSERT_EQ(3, booster.GetBoostScore(&meta));
}
{
// Test last level
FileTtlBooster booster(kCurrentTime, 2048, 7, 6);
meta.oldest_ancester_time = kCurrentTime - (1024 + 480);
ASSERT_EQ(1, booster.GetBoostScore(&meta));
meta.oldest_ancester_time = kCurrentTime - (1024 + 480 + 60);
ASSERT_EQ(1, booster.GetBoostScore(&meta));
meta.oldest_ancester_time = kCurrentTime - 3000;
ASSERT_EQ(1, booster.GetBoostScore(&meta));
}
}
TEST_F(CompactionPickerTest, NotScheduleL1IfL0WithHigherPri1) {
NewVersionStorage(6, kCompactionStyleLevel);
mutable_cf_options_.level0_file_num_compaction_trigger = 2;
mutable_cf_options_.max_bytes_for_level_base = 900000000U;
// 6 L0 files, score 3.
Add(0, 1U, "000", "400", 1U);
Add(0, 2U, "001", "400", 1U, 0, 0);
Add(0, 3U, "001", "400", 1000000000U, 0, 0);
Add(0, 31U, "001", "400", 1000000000U, 0, 0);
Add(0, 32U, "001", "400", 1000000000U, 0, 0);
Add(0, 33U, "001", "400", 1000000000U, 0, 0);
// L1 total size 2GB, score 2.2. If one file being compacted, score 1.1.
Add(1, 4U, "050", "300", 1000000000U, 0, 0);
file_map_[4u].first->being_compacted = true;
Add(1, 5U, "301", "350", 1000000000U, 0, 0);
// Output level overlaps with the beginning and the end of the chain
Add(2, 6U, "050", "100", 1U);
Add(2, 7U, "300", "400", 1U);
// No compaction should be scheduled, if L0 has higher priority than L1
// but L0->L1 compaction is blocked by a file in L1 being compacted.
UpdateVersionStorageInfo();
ASSERT_EQ(0, vstorage_->CompactionScoreLevel(0));
ASSERT_EQ(1, vstorage_->CompactionScoreLevel(1));
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() == nullptr);
}
TEST_F(CompactionPickerTest, NotScheduleL1IfL0WithHigherPri2) {
NewVersionStorage(6, kCompactionStyleLevel);
mutable_cf_options_.level0_file_num_compaction_trigger = 2;
mutable_cf_options_.max_bytes_for_level_base = 900000000U;
// 6 L0 files, score 3.
Add(0, 1U, "000", "400", 1U);
Add(0, 2U, "001", "400", 1U, 0, 0);
Add(0, 3U, "001", "400", 1000000000U, 0, 0);
Add(0, 31U, "001", "400", 1000000000U, 0, 0);
Add(0, 32U, "001", "400", 1000000000U, 0, 0);
Add(0, 33U, "001", "400", 1000000000U, 0, 0);
// L1 total size 2GB, score 2.2. If one file being compacted, score 1.1.
Add(1, 4U, "050", "300", 1000000000U, 0, 0);
Add(1, 5U, "301", "350", 1000000000U, 0, 0);
// Output level overlaps with the beginning and the end of the chain
Add(2, 6U, "050", "100", 1U);
Add(2, 7U, "300", "400", 1U);
// If no file in L1 being compacted, L0->L1 compaction will be scheduled.
UpdateVersionStorageInfo(); // being_compacted flag is cleared here.
ASSERT_EQ(0, vstorage_->CompactionScoreLevel(0));
ASSERT_EQ(1, vstorage_->CompactionScoreLevel(1));
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
}
TEST_F(CompactionPickerTest, NotScheduleL1IfL0WithHigherPri3) {
NewVersionStorage(6, kCompactionStyleLevel);
mutable_cf_options_.level0_file_num_compaction_trigger = 2;
mutable_cf_options_.max_bytes_for_level_base = 900000000U;
// 6 L0 files, score 3.
Add(0, 1U, "000", "400", 1U);
Add(0, 2U, "001", "400", 1U, 0, 0);
Add(0, 3U, "001", "400", 1000000000U, 0, 0);
Add(0, 31U, "001", "400", 1000000000U, 0, 0);
Add(0, 32U, "001", "400", 1000000000U, 0, 0);
Add(0, 33U, "001", "400", 1000000000U, 0, 0);
// L1 score more than 6.
Add(1, 4U, "050", "300", 1000000000U, 0, 0);
file_map_[4u].first->being_compacted = true;
Add(1, 5U, "301", "350", 1000000000U, 0, 0);
Add(1, 51U, "351", "400", 6000000000U, 0, 0);
// Output level overlaps with the beginning and the end of the chain
Add(2, 6U, "050", "100", 1U);
Add(2, 7U, "300", "400", 1U);
// If score in L1 is larger than L0, L1 compaction goes through despite
// there is pending L0 compaction.
UpdateVersionStorageInfo();
ASSERT_EQ(1, vstorage_->CompactionScoreLevel(0));
ASSERT_EQ(0, vstorage_->CompactionScoreLevel(1));
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
}
TEST_F(CompactionPickerTest, EstimateCompactionBytesNeeded1) {
int num_levels = ioptions_.num_levels;
ioptions_.level_compaction_dynamic_level_bytes = false;
mutable_cf_options_.level0_file_num_compaction_trigger = 4;
mutable_cf_options_.max_bytes_for_level_base = 1000;
mutable_cf_options_.max_bytes_for_level_multiplier = 10;
NewVersionStorage(num_levels, kCompactionStyleLevel);
Add(0, 1U, "150", "200", 200);
Add(0, 2U, "150", "200", 200);
Add(0, 3U, "150", "200", 200);
// Level 1 is over target by 200
Add(1, 4U, "400", "500", 600);
Add(1, 5U, "600", "700", 600);
// Level 2 is less than target 10000 even added size of level 1
// Size ratio of L2/L1 is 9600 / 1200 = 8
Add(2, 6U, "150", "200", 2500);
Add(2, 7U, "201", "210", 2000);
Add(2, 8U, "300", "310", 2600);
Add(2, 9U, "400", "500", 2500);
// Level 3 exceeds target 100,000 of 1000
Add(3, 10U, "400", "500", 101000);
// Level 4 exceeds target 1,000,000 by 900 after adding size from level 3
// Size ratio L4/L3 is 9.9
// After merge from L3, L4 size is 1000900
Add(4, 11U, "400", "500", 999900);
Add(5, 12U, "400", "500", 8007200);
UpdateVersionStorageInfo();
ASSERT_EQ(200u * 9u + 10900u + 900u * 9,
vstorage_->estimated_compaction_needed_bytes());
}
TEST_F(CompactionPickerTest, EstimateCompactionBytesNeeded2) {
int num_levels = ioptions_.num_levels;
ioptions_.level_compaction_dynamic_level_bytes = false;
mutable_cf_options_.level0_file_num_compaction_trigger = 3;
mutable_cf_options_.max_bytes_for_level_base = 1000;
mutable_cf_options_.max_bytes_for_level_multiplier = 10;
NewVersionStorage(num_levels, kCompactionStyleLevel);
Add(0, 1U, "150", "200", 200);
Add(0, 2U, "150", "200", 200);
Add(0, 4U, "150", "200", 200);
Add(0, 5U, "150", "200", 200);
Add(0, 6U, "150", "200", 200);
// Level 1 size will be 1400 after merging with L0
Add(1, 7U, "400", "500", 200);
Add(1, 8U, "600", "700", 200);
// Level 2 is less than target 10000 even added size of level 1
Add(2, 9U, "150", "200", 9100);
// Level 3 over the target, but since level 4 is empty, we assume it will be
// a trivial move.
Add(3, 10U, "400", "500", 101000);
UpdateVersionStorageInfo();
// estimated L1->L2 merge: 400 * (9100.0 / 1400.0 + 1.0)
ASSERT_EQ(1400u + 3000u, vstorage_->estimated_compaction_needed_bytes());
}
TEST_F(CompactionPickerTest, EstimateCompactionBytesNeeded3) {
int num_levels = ioptions_.num_levels;
ioptions_.level_compaction_dynamic_level_bytes = false;
mutable_cf_options_.level0_file_num_compaction_trigger = 3;
mutable_cf_options_.max_bytes_for_level_base = 1000;
mutable_cf_options_.max_bytes_for_level_multiplier = 10;
NewVersionStorage(num_levels, kCompactionStyleLevel);
Add(0, 1U, "150", "200", 2000);
Add(0, 2U, "150", "200", 2000);
Add(0, 4U, "150", "200", 2000);
Add(0, 5U, "150", "200", 2000);
Add(0, 6U, "150", "200", 1000);
// Level 1 size will be 10000 after merging with L0
Add(1, 7U, "400", "500", 500);
Add(1, 8U, "600", "700", 500);
Add(2, 9U, "150", "200", 10000);
UpdateVersionStorageInfo();
ASSERT_EQ(10000u + 18000u, vstorage_->estimated_compaction_needed_bytes());
}
TEST_F(CompactionPickerTest, EstimateCompactionBytesNeededDynamicLevel) {
int num_levels = ioptions_.num_levels;
ioptions_.level_compaction_dynamic_level_bytes = true;
mutable_cf_options_.level0_file_num_compaction_trigger = 3;
mutable_cf_options_.max_bytes_for_level_base = 1000;
mutable_cf_options_.max_bytes_for_level_multiplier = 10;
NewVersionStorage(num_levels, kCompactionStyleLevel);
// Set Last level size 50000
// num_levels - 1 target 5000
// num_levels - 2 is base level with target 1000 (rounded up to
// max_bytes_for_level_base).
Add(num_levels - 1, 10U, "400", "500", 50000);
Add(0, 1U, "150", "200", 200);
Add(0, 2U, "150", "200", 200);
Add(0, 4U, "150", "200", 200);
Add(0, 5U, "150", "200", 200);
Add(0, 6U, "150", "200", 200);
// num_levels - 3 is over target by 100 + 1000
Add(num_levels - 3, 7U, "400", "500", 550);
Add(num_levels - 3, 8U, "600", "700", 550);
// num_levels - 2 is over target by 1100 + 200
Add(num_levels - 2, 9U, "150", "200", 5200);
UpdateVersionStorageInfo();
// Merging to the second last level: (5200 / 2100 + 1) * 1100
// Merging to the last level: (50000 / 6300 + 1) * 1300
ASSERT_EQ(2100u + 3823u + 11617u,
vstorage_->estimated_compaction_needed_bytes());
}
TEST_F(CompactionPickerTest, IsBottommostLevelTest) {
// case 1: Higher levels are empty
NewVersionStorage(6, kCompactionStyleLevel);
Add(0, 1U, "a", "m");
Add(0, 2U, "c", "z");
Add(1, 3U, "d", "e");
Add(1, 4U, "l", "p");
Add(2, 5U, "g", "i");
Add(2, 6U, "x", "z");
UpdateVersionStorageInfo();
SetCompactionInputFilesLevels(2, 1);
AddToCompactionFiles(3U);
AddToCompactionFiles(5U);
bool result =
Compaction::TEST_IsBottommostLevel(2, vstorage_.get(), input_files_);
ASSERT_TRUE(result);
// case 2: Higher levels have no overlap
NewVersionStorage(6, kCompactionStyleLevel);
Add(0, 1U, "a", "m");
Add(0, 2U, "c", "z");
Add(1, 3U, "d", "e");
Add(1, 4U, "l", "p");
Add(2, 5U, "g", "i");
Add(2, 6U, "x", "z");
Add(3, 7U, "k", "p");
Add(3, 8U, "t", "w");
Add(4, 9U, "a", "b");
Add(5, 10U, "c", "cc");
UpdateVersionStorageInfo();
SetCompactionInputFilesLevels(2, 1);
AddToCompactionFiles(3U);
AddToCompactionFiles(5U);
result = Compaction::TEST_IsBottommostLevel(2, vstorage_.get(), input_files_);
ASSERT_TRUE(result);
// case 3.1: Higher levels (level 3) have overlap
NewVersionStorage(6, kCompactionStyleLevel);
Add(0, 1U, "a", "m");
Add(0, 2U, "c", "z");
Add(1, 3U, "d", "e");
Add(1, 4U, "l", "p");
Add(2, 5U, "g", "i");
Add(2, 6U, "x", "z");
Add(3, 7U, "e", "g");
Add(3, 8U, "h", "k");
Add(4, 9U, "a", "b");
Add(5, 10U, "c", "cc");
UpdateVersionStorageInfo();
SetCompactionInputFilesLevels(2, 1);
AddToCompactionFiles(3U);
AddToCompactionFiles(5U);
result = Compaction::TEST_IsBottommostLevel(2, vstorage_.get(), input_files_);
ASSERT_FALSE(result);
// case 3.2: Higher levels (level 5) have overlap
DeleteVersionStorage();
NewVersionStorage(6, kCompactionStyleLevel);
Add(0, 1U, "a", "m");
Add(0, 2U, "c", "z");
Add(1, 3U, "d", "e");
Add(1, 4U, "l", "p");
Add(2, 5U, "g", "i");
Add(2, 6U, "x", "z");
Add(3, 7U, "j", "k");
Add(3, 8U, "l", "m");
Add(4, 9U, "a", "b");
Add(5, 10U, "c", "cc");
Add(5, 11U, "h", "k");
Add(5, 12U, "y", "yy");
Add(5, 13U, "z", "zz");
UpdateVersionStorageInfo();
SetCompactionInputFilesLevels(2, 1);
AddToCompactionFiles(3U);
AddToCompactionFiles(5U);
result = Compaction::TEST_IsBottommostLevel(2, vstorage_.get(), input_files_);
ASSERT_FALSE(result);
// case 3.3: Higher levels (level 5) have overlap, but it's only overlapping
// one key ("d")
NewVersionStorage(6, kCompactionStyleLevel);
Add(0, 1U, "a", "m");
Add(0, 2U, "c", "z");
Add(1, 3U, "d", "e");
Add(1, 4U, "l", "p");
Add(2, 5U, "g", "i");
Add(2, 6U, "x", "z");
Add(3, 7U, "j", "k");
Add(3, 8U, "l", "m");
Add(4, 9U, "a", "b");
Add(5, 10U, "c", "cc");
Add(5, 11U, "ccc", "d");
Add(5, 12U, "y", "yy");
Add(5, 13U, "z", "zz");
UpdateVersionStorageInfo();
SetCompactionInputFilesLevels(2, 1);
AddToCompactionFiles(3U);
AddToCompactionFiles(5U);
result = Compaction::TEST_IsBottommostLevel(2, vstorage_.get(), input_files_);
ASSERT_FALSE(result);
// Level 0 files overlap
NewVersionStorage(6, kCompactionStyleLevel);
Add(0, 1U, "s", "t");
Add(0, 2U, "a", "m");
Add(0, 3U, "b", "z");
Add(0, 4U, "e", "f");
Add(5, 10U, "y", "z");
UpdateVersionStorageInfo();
SetCompactionInputFilesLevels(1, 0);
AddToCompactionFiles(1U);
AddToCompactionFiles(2U);
AddToCompactionFiles(3U);
AddToCompactionFiles(4U);
result = Compaction::TEST_IsBottommostLevel(2, vstorage_.get(), input_files_);
ASSERT_FALSE(result);
// Level 0 files don't overlap
NewVersionStorage(6, kCompactionStyleLevel);
Add(0, 1U, "s", "t");
Add(0, 2U, "a", "m");
Add(0, 3U, "b", "k");
Add(0, 4U, "e", "f");
Add(5, 10U, "y", "z");
UpdateVersionStorageInfo();
SetCompactionInputFilesLevels(1, 0);
AddToCompactionFiles(1U);
AddToCompactionFiles(2U);
AddToCompactionFiles(3U);
AddToCompactionFiles(4U);
result = Compaction::TEST_IsBottommostLevel(2, vstorage_.get(), input_files_);
ASSERT_TRUE(result);
// Level 1 files overlap
NewVersionStorage(6, kCompactionStyleLevel);
Add(0, 1U, "s", "t");
Add(0, 2U, "a", "m");
Add(0, 3U, "b", "k");
Add(0, 4U, "e", "f");
Add(1, 5U, "a", "m");
Add(1, 6U, "n", "o");
Add(1, 7U, "w", "y");
Add(5, 10U, "y", "z");
UpdateVersionStorageInfo();
SetCompactionInputFilesLevels(2, 0);
AddToCompactionFiles(1U);
AddToCompactionFiles(2U);
AddToCompactionFiles(3U);
AddToCompactionFiles(4U);
AddToCompactionFiles(5U);
AddToCompactionFiles(6U);
AddToCompactionFiles(7U);
result = Compaction::TEST_IsBottommostLevel(2, vstorage_.get(), input_files_);
ASSERT_FALSE(result);
DeleteVersionStorage();
}
TEST_F(CompactionPickerTest, MaxCompactionBytesHit) {
mutable_cf_options_.max_bytes_for_level_base = 1000000u;
mutable_cf_options_.max_compaction_bytes = 800000u;
Ignore max_compaction_bytes for compaction input that are within output key-range (#10835) Summary: When picking compaction input files, we sometimes stop picking a file that is fully included in the output key-range due to hitting max_compaction_bytes. Including these input files can potentially reduce WA at the expense of larger compactions. Larger compaction should be fine as files from input level are usually 10X smaller than files from output level. This PR adds a mutable CF option `ignore_max_compaction_bytes_for_input` that is enabled by default. We can remove this option once we are sure it is safe. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10835 Test Plan: - CI, a unit test on max_compaction_bytes fails before turning this flag off. - Benchmark does not show much difference in WA: `./db_bench --benchmarks=fillrandom,waitforcompaction,stats,levelstats -max_background_jobs=12 -num=2000000000 -target_file_size_base=33554432 --write_buffer_size=33554432` ``` main: ** Compaction Stats [default] ** Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 3/0 91.59 MB 0.8 70.9 0.0 70.9 200.8 129.9 0.0 1.5 25.2 71.2 2886.55 2463.45 9725 0.297 1093M 254K 0.0 0.0 L1 9/0 248.03 MB 1.0 392.0 129.8 262.2 391.7 129.5 0.0 3.0 69.0 68.9 5821.71 5536.90 804 7.241 6029M 5814K 0.0 0.0 L2 87/0 2.50 GB 1.0 537.0 128.5 408.5 533.8 125.2 0.7 4.2 69.5 69.1 7912.24 7323.70 4417 1.791 8299M 36M 0.0 0.0 L3 836/0 24.99 GB 1.0 616.9 118.3 498.7 594.5 95.8 5.2 5.0 66.9 64.5 9442.38 8490.28 4204 2.246 9749M 306M 0.0 0.0 L4 2355/0 62.95 GB 0.3 67.3 37.1 30.2 54.2 24.0 38.9 1.5 72.2 58.2 954.37 821.18 917 1.041 1076M 173M 0.0 0.0 Sum 3290/0 90.77 GB 0.0 1684.2 413.7 1270.5 1775.0 504.5 44.9 13.7 63.8 67.3 27017.25 24635.52 20067 1.346 26G 522M 0.0 0.0 Cumulative compaction: 1774.96 GB write, 154.29 MB/s write, 1684.19 GB read, 146.40 MB/s read, 27017.3 seconds This PR: ** Compaction Stats [default] ** Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 3/0 45.71 MB 0.8 72.9 0.0 72.9 202.8 129.9 0.0 1.6 25.4 70.7 2938.16 2510.36 9741 0.302 1124M 265K 0.0 0.0 L1 8/0 234.54 MB 0.9 384.5 129.8 254.7 384.2 129.6 0.0 3.0 69.0 68.9 5708.08 5424.43 791 7.216 5913M 5753K 0.0 0.0 L2 84/0 2.47 GB 1.0 543.1 128.6 414.5 539.9 125.4 0.7 4.2 69.6 69.2 7989.31 7403.13 4418 1.808 8393M 36M 0.0 0.0 L3 839/0 24.96 GB 1.0 615.6 118.4 497.2 593.2 96.0 5.1 5.0 66.6 64.1 9471.23 8489.31 4193 2.259 9726M 306M 0.0 0.0 L4 2360/0 63.04 GB 0.3 67.6 37.3 30.3 54.4 24.1 38.9 1.5 71.5 57.6 967.30 827.99 907 1.066 1080M 173M 0.0 0.0 Sum 3294/0 90.75 GB 0.0 1683.8 414.2 1269.6 1774.5 504.9 44.8 13.7 63.7 67.1 27074.08 24655.22 20050 1.350 26G 522M 0.0 0.0 Cumulative compaction: 1774.52 GB write, 157.09 MB/s write, 1683.77 GB read, 149.06 MB/s read, 27074.1 seconds ``` Reviewed By: ajkr Differential Revision: D40518319 Pulled By: cbi42 fbshipit-source-id: f4ea614bc0ebefe007ffaf05bb9aec9a8ca25b60
2022-10-21 17:22:41 +00:00
mutable_cf_options_.ignore_max_compaction_bytes_for_input = false;
ioptions_.level_compaction_dynamic_level_bytes = false;
NewVersionStorage(6, kCompactionStyleLevel);
// A compaction should be triggered and pick file 2 and 5.
// It can expand because adding file 1 and 3, the compaction size will
// exceed mutable_cf_options_.max_bytes_for_level_base.
Add(1, 1U, "100", "150", 300000U);
Add(1, 2U, "151", "200", 300001U, 0, 0);
Add(1, 3U, "201", "250", 300000U, 0, 0);
Add(1, 4U, "251", "300", 300000U, 0, 0);
Add(2, 5U, "100", "256", 1U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(2U, compaction->num_input_levels());
ASSERT_EQ(1U, compaction->num_input_files(0));
ASSERT_EQ(1U, compaction->num_input_files(1));
ASSERT_EQ(2U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(5U, compaction->input(1, 0)->fd.GetNumber());
}
TEST_F(CompactionPickerTest, MaxCompactionBytesNotHit) {
mutable_cf_options_.max_bytes_for_level_base = 800000u;
mutable_cf_options_.max_compaction_bytes = 1000000u;
Ignore max_compaction_bytes for compaction input that are within output key-range (#10835) Summary: When picking compaction input files, we sometimes stop picking a file that is fully included in the output key-range due to hitting max_compaction_bytes. Including these input files can potentially reduce WA at the expense of larger compactions. Larger compaction should be fine as files from input level are usually 10X smaller than files from output level. This PR adds a mutable CF option `ignore_max_compaction_bytes_for_input` that is enabled by default. We can remove this option once we are sure it is safe. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10835 Test Plan: - CI, a unit test on max_compaction_bytes fails before turning this flag off. - Benchmark does not show much difference in WA: `./db_bench --benchmarks=fillrandom,waitforcompaction,stats,levelstats -max_background_jobs=12 -num=2000000000 -target_file_size_base=33554432 --write_buffer_size=33554432` ``` main: ** Compaction Stats [default] ** Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 3/0 91.59 MB 0.8 70.9 0.0 70.9 200.8 129.9 0.0 1.5 25.2 71.2 2886.55 2463.45 9725 0.297 1093M 254K 0.0 0.0 L1 9/0 248.03 MB 1.0 392.0 129.8 262.2 391.7 129.5 0.0 3.0 69.0 68.9 5821.71 5536.90 804 7.241 6029M 5814K 0.0 0.0 L2 87/0 2.50 GB 1.0 537.0 128.5 408.5 533.8 125.2 0.7 4.2 69.5 69.1 7912.24 7323.70 4417 1.791 8299M 36M 0.0 0.0 L3 836/0 24.99 GB 1.0 616.9 118.3 498.7 594.5 95.8 5.2 5.0 66.9 64.5 9442.38 8490.28 4204 2.246 9749M 306M 0.0 0.0 L4 2355/0 62.95 GB 0.3 67.3 37.1 30.2 54.2 24.0 38.9 1.5 72.2 58.2 954.37 821.18 917 1.041 1076M 173M 0.0 0.0 Sum 3290/0 90.77 GB 0.0 1684.2 413.7 1270.5 1775.0 504.5 44.9 13.7 63.8 67.3 27017.25 24635.52 20067 1.346 26G 522M 0.0 0.0 Cumulative compaction: 1774.96 GB write, 154.29 MB/s write, 1684.19 GB read, 146.40 MB/s read, 27017.3 seconds This PR: ** Compaction Stats [default] ** Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 3/0 45.71 MB 0.8 72.9 0.0 72.9 202.8 129.9 0.0 1.6 25.4 70.7 2938.16 2510.36 9741 0.302 1124M 265K 0.0 0.0 L1 8/0 234.54 MB 0.9 384.5 129.8 254.7 384.2 129.6 0.0 3.0 69.0 68.9 5708.08 5424.43 791 7.216 5913M 5753K 0.0 0.0 L2 84/0 2.47 GB 1.0 543.1 128.6 414.5 539.9 125.4 0.7 4.2 69.6 69.2 7989.31 7403.13 4418 1.808 8393M 36M 0.0 0.0 L3 839/0 24.96 GB 1.0 615.6 118.4 497.2 593.2 96.0 5.1 5.0 66.6 64.1 9471.23 8489.31 4193 2.259 9726M 306M 0.0 0.0 L4 2360/0 63.04 GB 0.3 67.6 37.3 30.3 54.4 24.1 38.9 1.5 71.5 57.6 967.30 827.99 907 1.066 1080M 173M 0.0 0.0 Sum 3294/0 90.75 GB 0.0 1683.8 414.2 1269.6 1774.5 504.9 44.8 13.7 63.7 67.1 27074.08 24655.22 20050 1.350 26G 522M 0.0 0.0 Cumulative compaction: 1774.52 GB write, 157.09 MB/s write, 1683.77 GB read, 149.06 MB/s read, 27074.1 seconds ``` Reviewed By: ajkr Differential Revision: D40518319 Pulled By: cbi42 fbshipit-source-id: f4ea614bc0ebefe007ffaf05bb9aec9a8ca25b60
2022-10-21 17:22:41 +00:00
mutable_cf_options_.ignore_max_compaction_bytes_for_input = false;
ioptions_.level_compaction_dynamic_level_bytes = false;
NewVersionStorage(6, kCompactionStyleLevel);
// A compaction should be triggered and pick file 2 and 5.
// and it expands to file 1 and 3 too.
Add(1, 1U, "100", "150", 300000U);
Add(1, 2U, "151", "200", 300001U, 0, 0);
Add(1, 3U, "201", "250", 300000U, 0, 0);
Add(1, 4U, "251", "300", 300000U, 0, 0);
Add(2, 5U, "000", "251", 1U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(2U, compaction->num_input_levels());
ASSERT_EQ(3U, compaction->num_input_files(0));
ASSERT_EQ(1U, compaction->num_input_files(1));
ASSERT_EQ(1U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(2U, compaction->input(0, 1)->fd.GetNumber());
ASSERT_EQ(3U, compaction->input(0, 2)->fd.GetNumber());
ASSERT_EQ(5U, compaction->input(1, 0)->fd.GetNumber());
}
TEST_F(CompactionPickerTest, IsTrivialMoveOn) {
mutable_cf_options_.max_bytes_for_level_base = 10000u;
mutable_cf_options_.max_compaction_bytes = 10001u;
ioptions_.level_compaction_dynamic_level_bytes = false;
NewVersionStorage(6, kCompactionStyleLevel);
// A compaction should be triggered and pick file 2
Add(1, 1U, "100", "150", 3000U);
Add(1, 2U, "151", "200", 3001U);
Add(1, 3U, "201", "250", 3000U);
Add(1, 4U, "251", "300", 3000U);
Add(3, 5U, "120", "130", 7000U);
Add(3, 6U, "170", "180", 7000U);
Add(3, 7U, "220", "230", 7000U);
Add(3, 8U, "270", "280", 7000U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_TRUE(compaction->IsTrivialMove());
}
TEST_F(CompactionPickerTest, L0TrivialMove1) {
mutable_cf_options_.max_bytes_for_level_base = 10000000u;
mutable_cf_options_.level0_file_num_compaction_trigger = 4;
mutable_cf_options_.max_compaction_bytes = 10000000u;
ioptions_.level_compaction_dynamic_level_bytes = false;
NewVersionStorage(6, kCompactionStyleLevel);
Add(0, 1U, "100", "150", 3000U, 0, 710, 800);
Add(0, 2U, "151", "200", 3001U, 0, 610, 700);
Add(0, 3U, "301", "350", 3000U, 0, 510, 600);
Add(0, 4U, "451", "400", 3000U, 0, 410, 500);
Add(1, 5U, "120", "130", 7000U);
Add(1, 6U, "170", "180", 7000U);
Add(1, 7U, "220", "230", 7000U);
Add(1, 8U, "270", "280", 7000U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(1, compaction->num_input_levels());
ASSERT_EQ(2, compaction->num_input_files(0));
ASSERT_EQ(3, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(4, compaction->input(0, 1)->fd.GetNumber());
ASSERT_TRUE(compaction->IsTrivialMove());
}
TEST_F(CompactionPickerTest, L0TrivialMoveOneFile) {
mutable_cf_options_.max_bytes_for_level_base = 10000000u;
mutable_cf_options_.level0_file_num_compaction_trigger = 4;
mutable_cf_options_.max_compaction_bytes = 10000000u;
ioptions_.level_compaction_dynamic_level_bytes = false;
NewVersionStorage(6, kCompactionStyleLevel);
Add(0, 1U, "100", "150", 3000U, 0, 710, 800);
Add(0, 2U, "551", "600", 3001U, 0, 610, 700);
Add(0, 3U, "101", "150", 3000U, 0, 510, 600);
Add(0, 4U, "451", "400", 3000U, 0, 410, 500);
Add(1, 5U, "120", "130", 7000U);
Add(1, 6U, "170", "180", 7000U);
Add(1, 7U, "220", "230", 7000U);
Add(1, 8U, "270", "280", 7000U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(1, compaction->num_input_levels());
ASSERT_EQ(1, compaction->num_input_files(0));
ASSERT_EQ(4, compaction->input(0, 0)->fd.GetNumber());
ASSERT_TRUE(compaction->IsTrivialMove());
}
TEST_F(CompactionPickerTest, L0TrivialMoveWholeL0) {
mutable_cf_options_.max_bytes_for_level_base = 10000000u;
mutable_cf_options_.level0_file_num_compaction_trigger = 4;
mutable_cf_options_.max_compaction_bytes = 10000000u;
ioptions_.level_compaction_dynamic_level_bytes = false;
NewVersionStorage(6, kCompactionStyleLevel);
Add(0, 1U, "300", "350", 3000U, 0, 710, 800);
Add(0, 2U, "651", "600", 3001U, 0, 610, 700);
Add(0, 3U, "501", "550", 3000U, 0, 510, 600);
Add(0, 4U, "451", "400", 3000U, 0, 410, 500);
Add(1, 5U, "120", "130", 7000U);
Add(1, 6U, "970", "980", 7000U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(1, compaction->num_input_levels());
ASSERT_EQ(4, compaction->num_input_files(0));
ASSERT_EQ(1, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(4, compaction->input(0, 1)->fd.GetNumber());
ASSERT_EQ(3, compaction->input(0, 2)->fd.GetNumber());
ASSERT_EQ(2, compaction->input(0, 3)->fd.GetNumber());
ASSERT_TRUE(compaction->IsTrivialMove());
}
TEST_F(CompactionPickerTest, IsTrivialMoveOffSstPartitioned) {
mutable_cf_options_.max_bytes_for_level_base = 10000u;
mutable_cf_options_.max_compaction_bytes = 10001u;
ioptions_.level_compaction_dynamic_level_bytes = false;
ioptions_.sst_partitioner_factory = NewSstPartitionerFixedPrefixFactory(1);
NewVersionStorage(6, kCompactionStyleLevel);
// A compaction should be triggered and pick file 2
Add(1, 1U, "100", "150", 3000U);
Add(1, 2U, "151", "200", 3001U);
Add(1, 3U, "201", "250", 3000U);
Add(1, 4U, "251", "300", 3000U);
Add(3, 5U, "120", "130", 7000U);
Add(3, 6U, "170", "180", 7000U);
Add(3, 7U, "220", "230", 7000U);
Add(3, 8U, "270", "280", 7000U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
// No trivial move, because partitioning is applied
ASSERT_TRUE(!compaction->IsTrivialMove());
}
TEST_F(CompactionPickerTest, IsTrivialMoveOff) {
mutable_cf_options_.max_bytes_for_level_base = 1000000u;
mutable_cf_options_.max_compaction_bytes = 10000u;
ioptions_.level_compaction_dynamic_level_bytes = false;
NewVersionStorage(6, kCompactionStyleLevel);
// A compaction should be triggered and pick all files from level 1
Add(1, 1U, "100", "150", 300000U, 0, 0);
Add(1, 2U, "150", "200", 300000U, 0, 0);
Add(1, 3U, "200", "250", 300000U, 0, 0);
Add(1, 4U, "250", "300", 300000U, 0, 0);
Add(3, 5U, "120", "130", 6000U);
Add(3, 6U, "140", "150", 6000U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_FALSE(compaction->IsTrivialMove());
}
TEST_F(CompactionPickerTest, TrivialMoveMultipleFiles1) {
mutable_cf_options_.max_bytes_for_level_base = 1000u;
mutable_cf_options_.max_compaction_bytes = 10000001u;
ioptions_.level_compaction_dynamic_level_bytes = false;
ioptions_.compaction_pri = kMinOverlappingRatio;
NewVersionStorage(6, kCompactionStyleLevel);
Add(2, 1U, "100", "150", 3000U);
Add(2, 2U, "151", "200", 3001U);
Add(2, 3U, "301", "350", 3000U);
Add(2, 4U, "451", "400", 3000U);
Add(2, 5U, "551", "500", 3000U);
Add(2, 6U, "651", "600", 3000U);
Add(2, 7U, "751", "700", 3000U);
Add(2, 8U, "851", "900", 3000U);
Add(3, 15U, "120", "130", 700U);
Add(3, 16U, "170", "180", 700U);
Add(3, 17U, "220", "230", 700U);
Add(3, 18U, "870", "880", 700U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_TRUE(compaction->IsTrivialMove());
ASSERT_EQ(1, compaction->num_input_levels());
ASSERT_EQ(4, compaction->num_input_files(0));
ASSERT_EQ(3, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(4, compaction->input(0, 1)->fd.GetNumber());
ASSERT_EQ(5, compaction->input(0, 2)->fd.GetNumber());
ASSERT_EQ(6, compaction->input(0, 3)->fd.GetNumber());
}
TEST_F(CompactionPickerTest, TrivialMoveMultipleFiles2) {
mutable_cf_options_.max_bytes_for_level_base = 1000u;
mutable_cf_options_.max_compaction_bytes = 10000001u;
ioptions_.level_compaction_dynamic_level_bytes = false;
ioptions_.compaction_pri = kMinOverlappingRatio;
NewVersionStorage(6, kCompactionStyleLevel);
Add(2, 1U, "100", "150", 3000U);
Add(2, 2U, "151", "160", 3001U);
Add(2, 3U, "161", "179", 3000U);
Add(2, 4U, "220", "400", 3000U);
Add(2, 5U, "551", "500", 3000U);
Add(2, 6U, "651", "600", 3000U);
Add(2, 7U, "751", "700", 3000U);
Add(2, 8U, "851", "900", 3000U);
Add(3, 15U, "120", "130", 700U);
Add(3, 17U, "220", "230", 700U);
Add(3, 18U, "870", "880", 700U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_TRUE(compaction->IsTrivialMove());
ASSERT_EQ(1, compaction->num_input_levels());
ASSERT_EQ(2, compaction->num_input_files(0));
ASSERT_EQ(2, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(3, compaction->input(0, 1)->fd.GetNumber());
}
TEST_F(CompactionPickerTest, TrivialMoveMultipleFiles3) {
mutable_cf_options_.max_bytes_for_level_base = 1000u;
mutable_cf_options_.max_compaction_bytes = 10000001u;
ioptions_.level_compaction_dynamic_level_bytes = false;
ioptions_.compaction_pri = kMinOverlappingRatio;
NewVersionStorage(6, kCompactionStyleLevel);
// Even if consecutive files can be trivial moved, we don't pick them
// since in case trivial move can't be issued for a reason, we cannot
// fall back to normal compactions.
Add(2, 1U, "100", "150", 3000U);
Add(2, 2U, "151", "160", 3001U);
Add(2, 5U, "551", "500", 3000U);
Add(2, 6U, "651", "600", 3000U);
Add(2, 7U, "751", "700", 3000U);
Add(2, 8U, "851", "900", 3000U);
Add(3, 15U, "120", "130", 700U);
Add(3, 17U, "220", "230", 700U);
Add(3, 18U, "870", "880", 700U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_TRUE(compaction->IsTrivialMove());
ASSERT_EQ(1, compaction->num_input_levels());
ASSERT_EQ(1, compaction->num_input_files(0));
ASSERT_EQ(2, compaction->input(0, 0)->fd.GetNumber());
}
TEST_F(CompactionPickerTest, TrivialMoveMultipleFiles4) {
mutable_cf_options_.max_bytes_for_level_base = 1000u;
mutable_cf_options_.max_compaction_bytes = 10000001u;
ioptions_.level_compaction_dynamic_level_bytes = false;
ioptions_.compaction_pri = kMinOverlappingRatio;
NewVersionStorage(6, kCompactionStyleLevel);
Add(2, 1U, "100", "150", 4000U);
Add(2, 2U, "151", "160", 4001U);
Add(2, 3U, "161", "179", 4000U);
Add(3, 15U, "120", "130", 700U);
Add(3, 17U, "220", "230", 700U);
Add(3, 18U, "870", "880", 700U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_TRUE(compaction->IsTrivialMove());
ASSERT_EQ(1, compaction->num_input_levels());
ASSERT_EQ(2, compaction->num_input_files(0));
ASSERT_EQ(2, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(3, compaction->input(0, 1)->fd.GetNumber());
}
TEST_F(CompactionPickerTest, TrivialMoveMultipleFiles5) {
mutable_cf_options_.max_bytes_for_level_base = 1000u;
mutable_cf_options_.max_compaction_bytes = 10000001u;
ioptions_.level_compaction_dynamic_level_bytes = false;
ioptions_.compaction_pri = kMinOverlappingRatio;
NewVersionStorage(6, kCompactionStyleLevel);
// File 4 and 5 aren't clean cut, so only 2 and 3 are picked.
Add(2, 1U, "100", "150", 4000U);
Add(2, 2U, "151", "160", 4001U);
Add(2, 3U, "161", "179", 4000U);
Add(2, 4U, "180", "185", 4000U);
Add(2, 5U, "185", "190", 4000U);
Add(3, 15U, "120", "130", 700U);
Add(3, 17U, "220", "230", 700U);
Add(3, 18U, "870", "880", 700U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_TRUE(compaction->IsTrivialMove());
ASSERT_EQ(1, compaction->num_input_levels());
ASSERT_EQ(2, compaction->num_input_files(0));
ASSERT_EQ(2, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(3, compaction->input(0, 1)->fd.GetNumber());
}
TEST_F(CompactionPickerTest, TrivialMoveMultipleFiles6) {
mutable_cf_options_.max_bytes_for_level_base = 1000u;
mutable_cf_options_.max_compaction_bytes = 10000001u;
ioptions_.level_compaction_dynamic_level_bytes = false;
ioptions_.compaction_pri = kMinOverlappingRatio;
NewVersionStorage(6, kCompactionStyleLevel);
Add(2, 1U, "100", "150", 3000U);
Add(2, 2U, "151", "200", 3001U);
Add(2, 3U, "301", "350", 3000U);
Add(2, 4U, "451", "400", 3000U);
Add(2, 5U, "551", "500", 3000U);
file_map_[5U].first->being_compacted = true;
Add(2, 6U, "651", "600", 3000U);
Add(2, 7U, "751", "700", 3000U);
Add(2, 8U, "851", "900", 3000U);
Add(3, 15U, "120", "130", 700U);
Add(3, 16U, "170", "180", 700U);
Add(3, 17U, "220", "230", 700U);
Add(3, 18U, "870", "880", 700U);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_TRUE(compaction->IsTrivialMove());
ASSERT_EQ(1, compaction->num_input_levels());
// Since the next file is being compacted. Stopping at 3 and 4.
ASSERT_EQ(2, compaction->num_input_files(0));
ASSERT_EQ(3, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(4, compaction->input(0, 1)->fd.GetNumber());
}
TEST_F(CompactionPickerTest, CacheNextCompactionIndex) {
NewVersionStorage(6, kCompactionStyleLevel);
mutable_cf_options_.max_compaction_bytes = 100000000000u;
Add(1 /* level */, 1U /* file_number */, "100" /* smallest */,
"149" /* largest */, 1000000000U /* file_size */);
file_map_[1U].first->being_compacted = true;
Add(1 /* level */, 2U /* file_number */, "150" /* smallest */,
"199" /* largest */, 900000000U /* file_size */);
Add(1 /* level */, 3U /* file_number */, "200" /* smallest */,
"249" /* largest */, 800000000U /* file_size */);
Add(1 /* level */, 4U /* file_number */, "250" /* smallest */,
"299" /* largest */, 700000000U /* file_size */);
Add(2 /* level */, 5U /* file_number */, "150" /* smallest */,
"199" /* largest */, 100U /* file_size */);
Add(2 /* level */, 6U /* file_number */, "200" /* smallest */,
"240" /* largest */, 1U /* file_size */);
Add(2 /* level */, 7U /* file_number */, "260" /* smallest */,
"270" /* largest */, 1U /* file_size */);
file_map_[5U].first->being_compacted = true;
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(2U, compaction->num_input_levels());
ASSERT_EQ(1U, compaction->num_input_files(0));
ASSERT_EQ(1U, compaction->num_input_files(1));
ASSERT_EQ(3U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(2, vstorage_->NextCompactionIndex(1 /* level */));
compaction.reset(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(2U, compaction->num_input_levels());
ASSERT_EQ(1U, compaction->num_input_files(0));
ASSERT_EQ(1U, compaction->num_input_files(1));
ASSERT_EQ(4U, compaction->input(0, 0)->fd.GetNumber());
ASSERT_EQ(3, vstorage_->NextCompactionIndex(1 /* level */));
compaction.reset(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() == nullptr);
ASSERT_EQ(4, vstorage_->NextCompactionIndex(1 /* level */));
}
TEST_F(CompactionPickerTest, IntraL0MaxCompactionBytesNotHit) {
// Intra L0 compaction triggers only if there are at least
// level0_file_num_compaction_trigger + 2 L0 files.
mutable_cf_options_.level0_file_num_compaction_trigger = 3;
mutable_cf_options_.max_compaction_bytes = 1000000u;
NewVersionStorage(6, kCompactionStyleLevel);
// All 5 L0 files will be picked for intra L0 compaction. The one L1 file
// spans entire L0 key range and is marked as being compacted to avoid
// L0->L1 compaction.
Fix corruption with intra-L0 on ingested files (#5958) Summary: ## Problem Description Our process was abort when it call `CheckConsistency`. And the information in `stderr` show that "`L0 files seqno 3001491972 3004797440 vs. 3002875611 3004524421` ". Here are the causes of the accident I investigated. * RocksDB will call `CheckConsistency` whenever `MANIFEST` file is update. It will check sequence number interval of every file, except files which were ingested. * When one file is ingested into RocksDB, it will be assigned the value of global sequence number, and the minimum and maximum seqno of this file are equal, which are both equal to global sequence number. * `CheckConsistency` determines whether the file is ingested by whether the smallest and largest seqno of an sstable file are equal. * If IntraL0Compaction picks one sst which was ingested just now and compacted it into another sst, the `smallest_seqno` of this new file will be smaller than his `largest_seqno`. * If more than one ingested file was ingested before memtable schedule flush, and they all compact into one new sstable file by `IntraL0Compaction`. The sequence interval of this new file will be included in the interval of the memtable. So `CheckConsistency` will return a `Corruption`. * If a sstable was ingested after the memtable was schedule to flush, which would assign a larger seqno to it than memtable. Then the file was compacted with other files (these files were all flushed before the memtable) in L0 into one file. This compaction start before the flush job of memtable start, but completed after the flush job finish. So this new file produced by the compaction (we call it s1) would have a larger interval of sequence number than the file produced by flush (we call it s2). **But there was still some data in s1 written into RocksDB before the s2, so it's possible that some data in s2 was cover by old data in s1.** Of course, it would also make a `Corruption` because of overlap of seqno. There is the relationship of the files: > s1.smallest_seqno < s2.smallest_seqno < s2.largest_seqno < s1.largest_seqno So I skip pick sst file which was ingested in function `FindIntraL0Compaction ` ## Reason Here is my bug report: https://github.com/facebook/rocksdb/issues/5913 There are two situations that can cause the check to fail. ### First situation: - First we ingest five external sst into Rocksdb, and they happened to be ingested in L0. and there had been some data in memtable, which make the smallest sequence number of memtable is less than which of sst that we ingest. - If there had been one compaction job which compacted sst from L0 to L1, `LevelCompactionPicker` would trigger a `IntraL0Compaction` which would compact this five sst from L0 to L0. We call this sst A, which was merged from five ingested sst. - Then some data was put into memtable, and memtable was flushed to L0. We called this sst B. - RocksDB check consistency , and find the `smallest_seqno` of B is less than that of A and crash. Because A was merged from five sst, the smallest sequence number of it was less than the biggest sequece number of itself, so RocksDB could not tell if A was produce by ingested. ### Secondary situaion - First we have flushed many sst in L0, we call them [s1, s2, s3]. - There is an immutable memtable request to be flushed, but because flush thread is busy, so it has not been picked. we call it m1. And at the moment, one sst is ingested into L0. We call it s4. Because s4 is ingested after m1 became immutable memtable, so it has a larger log sequence number than m1. - m1 is flushed in L0. because it is small, this flush job finish quickly. we call it s5. - [s1, s2, s3, s4] are compacted into one sst to L0, by IntraL0Compaction. We call it s6. - compacted 4@0 files to L0 - When s6 is added into manifest, the corruption happened. because the largest sequence number of s6 is equal to s4, and they are both larger than that of s5. But because s1 is older than m1, so the smallest sequence number of s6 is smaller than that of s5. - s6.smallest_seqno < s5.smallest_seqno < s5.largest_seqno < s6.largest_seqno Pull Request resolved: https://github.com/facebook/rocksdb/pull/5958 Differential Revision: D18601316 fbshipit-source-id: 5fe54b3c9af52a2e1400728f565e895cde1c7267
2019-11-19 23:07:49 +00:00
Add(0, 1U, "100", "150", 200000U, 0, 100, 101);
Add(0, 2U, "151", "200", 200000U, 0, 102, 103);
Add(0, 3U, "201", "250", 200000U, 0, 104, 105);
Add(0, 4U, "251", "300", 200000U, 0, 106, 107);
Add(0, 5U, "301", "350", 200000U, 0, 108, 109);
Add(1, 6U, "100", "350", 200000U, 0, 110, 111);
vstorage_->LevelFiles(1)[0]->being_compacted = true;
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(1U, compaction->num_input_levels());
ASSERT_EQ(5U, compaction->num_input_files(0));
ASSERT_EQ(CompactionReason::kLevelL0FilesNum,
compaction->compaction_reason());
ASSERT_EQ(0, compaction->output_level());
}
TEST_F(CompactionPickerTest, IntraL0MaxCompactionBytesHit) {
// Intra L0 compaction triggers only if there are at least
// level0_file_num_compaction_trigger + 2 L0 files.
mutable_cf_options_.level0_file_num_compaction_trigger = 3;
mutable_cf_options_.max_compaction_bytes = 999999u;
NewVersionStorage(6, kCompactionStyleLevel);
// 4 out of 5 L0 files will be picked for intra L0 compaction due to
// max_compaction_bytes limit (the minimum number of files for triggering
// intra L0 compaction is 4). The one L1 file spans entire L0 key range and
// is marked as being compacted to avoid L0->L1 compaction.
Fix corruption with intra-L0 on ingested files (#5958) Summary: ## Problem Description Our process was abort when it call `CheckConsistency`. And the information in `stderr` show that "`L0 files seqno 3001491972 3004797440 vs. 3002875611 3004524421` ". Here are the causes of the accident I investigated. * RocksDB will call `CheckConsistency` whenever `MANIFEST` file is update. It will check sequence number interval of every file, except files which were ingested. * When one file is ingested into RocksDB, it will be assigned the value of global sequence number, and the minimum and maximum seqno of this file are equal, which are both equal to global sequence number. * `CheckConsistency` determines whether the file is ingested by whether the smallest and largest seqno of an sstable file are equal. * If IntraL0Compaction picks one sst which was ingested just now and compacted it into another sst, the `smallest_seqno` of this new file will be smaller than his `largest_seqno`. * If more than one ingested file was ingested before memtable schedule flush, and they all compact into one new sstable file by `IntraL0Compaction`. The sequence interval of this new file will be included in the interval of the memtable. So `CheckConsistency` will return a `Corruption`. * If a sstable was ingested after the memtable was schedule to flush, which would assign a larger seqno to it than memtable. Then the file was compacted with other files (these files were all flushed before the memtable) in L0 into one file. This compaction start before the flush job of memtable start, but completed after the flush job finish. So this new file produced by the compaction (we call it s1) would have a larger interval of sequence number than the file produced by flush (we call it s2). **But there was still some data in s1 written into RocksDB before the s2, so it's possible that some data in s2 was cover by old data in s1.** Of course, it would also make a `Corruption` because of overlap of seqno. There is the relationship of the files: > s1.smallest_seqno < s2.smallest_seqno < s2.largest_seqno < s1.largest_seqno So I skip pick sst file which was ingested in function `FindIntraL0Compaction ` ## Reason Here is my bug report: https://github.com/facebook/rocksdb/issues/5913 There are two situations that can cause the check to fail. ### First situation: - First we ingest five external sst into Rocksdb, and they happened to be ingested in L0. and there had been some data in memtable, which make the smallest sequence number of memtable is less than which of sst that we ingest. - If there had been one compaction job which compacted sst from L0 to L1, `LevelCompactionPicker` would trigger a `IntraL0Compaction` which would compact this five sst from L0 to L0. We call this sst A, which was merged from five ingested sst. - Then some data was put into memtable, and memtable was flushed to L0. We called this sst B. - RocksDB check consistency , and find the `smallest_seqno` of B is less than that of A and crash. Because A was merged from five sst, the smallest sequence number of it was less than the biggest sequece number of itself, so RocksDB could not tell if A was produce by ingested. ### Secondary situaion - First we have flushed many sst in L0, we call them [s1, s2, s3]. - There is an immutable memtable request to be flushed, but because flush thread is busy, so it has not been picked. we call it m1. And at the moment, one sst is ingested into L0. We call it s4. Because s4 is ingested after m1 became immutable memtable, so it has a larger log sequence number than m1. - m1 is flushed in L0. because it is small, this flush job finish quickly. we call it s5. - [s1, s2, s3, s4] are compacted into one sst to L0, by IntraL0Compaction. We call it s6. - compacted 4@0 files to L0 - When s6 is added into manifest, the corruption happened. because the largest sequence number of s6 is equal to s4, and they are both larger than that of s5. But because s1 is older than m1, so the smallest sequence number of s6 is smaller than that of s5. - s6.smallest_seqno < s5.smallest_seqno < s5.largest_seqno < s6.largest_seqno Pull Request resolved: https://github.com/facebook/rocksdb/pull/5958 Differential Revision: D18601316 fbshipit-source-id: 5fe54b3c9af52a2e1400728f565e895cde1c7267
2019-11-19 23:07:49 +00:00
Add(0, 1U, "100", "150", 200000U, 0, 100, 101);
Add(0, 2U, "151", "200", 200000U, 0, 102, 103);
Add(0, 3U, "201", "250", 200000U, 0, 104, 105);
Add(0, 4U, "251", "300", 200000U, 0, 106, 107);
Add(0, 5U, "301", "350", 200000U, 0, 108, 109);
Add(1, 6U, "100", "350", 200000U, 0, 109, 110);
vstorage_->LevelFiles(1)[0]->being_compacted = true;
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(level_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction.get() != nullptr);
ASSERT_EQ(1U, compaction->num_input_levels());
ASSERT_EQ(4U, compaction->num_input_files(0));
ASSERT_EQ(CompactionReason::kLevelL0FilesNum,
compaction->compaction_reason());
ASSERT_EQ(0, compaction->output_level());
}
#ifndef ROCKSDB_LITE
TEST_F(CompactionPickerTest, UniversalMarkedCompactionFullOverlap) {
const uint64_t kFileSize = 100000;
ioptions_.compaction_style = kCompactionStyleUniversal;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
// This test covers the case where a "regular" universal compaction is
// scheduled first, followed by a delete triggered compaction. The latter
// should fail
NewVersionStorage(5, kCompactionStyleUniversal);
Sort L0 files by newly introduced epoch_num (#10922) Summary: **Context:** Sorting L0 files by `largest_seqno` has at least two inconvenience: - File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap. - For example, consider the following sequence of events ("key@n" indicates key at seqno "n") - insert k1@1 to memtable m1 - ingest file s1 with k2@2, ingest file s2 with k3@3 - insert k4@4 to m1 - compact files s1, s2 and result in new file s3 of seqno range [2, 3] - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1 - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption. - Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example) - an existing SST s1 contains only k1@1 - insert k1@2 to memtable m1 - ingest file s2 with k3@3, ingest file s3 with k4@4 - insert single delete k5@5 in m1 - flush m1 and result in new file s4 of seqno range [2, 5] - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4] - compact s4 and result in new file s6 of seqno range [2] due to single delete - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno` Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways: - In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more. - In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption. **Summary:** - Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`. - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`) - Compaction output file is assigned with the minimum `epoch_number` among input files' - Refit level: reuse refitted file's epoch_number - Other paths needing `epoch_number` treatment: - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo` - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`. - Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair). - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder. - Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag` - Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`. - Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR. - Misc: - update existing tests with `epoch_number` so make check will pass - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber() Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922 Test Plan: - `make check` - New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc` - Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930 - [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox` - [Ongoing] normal db stress test - [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 Reviewed By: ajkr Differential Revision: D41063187 Pulled By: hx235 fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2022-12-13 21:29:37 +00:00
Add(0, 1U, "150", "200", kFileSize, 0, 500, 550, /*compensated_file_size*/ 0,
/*marked_for_compact*/ false, /* temperature*/ Temperature::kUnknown,
/*oldest_ancestor_time*/ kUnknownOldestAncesterTime,
/*ts_of_smallest*/ Slice(), /*ts_of_largest*/ Slice(),
/*epoch_number*/ 3);
Add(0, 2U, "201", "250", 2 * kFileSize, 0, 401, 450,
/*compensated_file_size*/ 0, /*marked_for_compact*/ false,
/* temperature*/ Temperature::kUnknown,
/*oldest_ancestor_time*/ kUnknownOldestAncesterTime,
/*ts_of_smallest*/ Slice(), /*ts_of_largest*/ Slice(),
/*epoch_number*/ 2);
Add(0, 4U, "260", "300", 4 * kFileSize, 0, 260, 300,
/*compensated_file_size*/ 0, /*marked_for_compact*/ false,
/* temperature*/ Temperature::kUnknown,
/*oldest_ancestor_time*/ kUnknownOldestAncesterTime,
/*ts_of_smallest*/ Slice(), /*ts_of_largest*/ Slice(),
/*epoch_number*/ 1);
Add(3, 5U, "010", "080", 8 * kFileSize, 0, 200, 251);
Add(4, 3U, "301", "350", 8 * kFileSize, 0, 101, 150);
Add(4, 6U, "501", "750", 8 * kFileSize, 0, 101, 150);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction);
// Validate that its a compaction to reduce sorted runs
ASSERT_EQ(CompactionReason::kUniversalSortedRunNum,
compaction->compaction_reason());
ASSERT_EQ(0, compaction->output_level());
ASSERT_EQ(0, compaction->start_level());
ASSERT_EQ(2U, compaction->num_input_files(0));
AddVersionStorage();
// Simulate a flush and mark the file for compaction
Sort L0 files by newly introduced epoch_num (#10922) Summary: **Context:** Sorting L0 files by `largest_seqno` has at least two inconvenience: - File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap. - For example, consider the following sequence of events ("key@n" indicates key at seqno "n") - insert k1@1 to memtable m1 - ingest file s1 with k2@2, ingest file s2 with k3@3 - insert k4@4 to m1 - compact files s1, s2 and result in new file s3 of seqno range [2, 3] - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1 - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption. - Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example) - an existing SST s1 contains only k1@1 - insert k1@2 to memtable m1 - ingest file s2 with k3@3, ingest file s3 with k4@4 - insert single delete k5@5 in m1 - flush m1 and result in new file s4 of seqno range [2, 5] - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4] - compact s4 and result in new file s6 of seqno range [2] due to single delete - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno` Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways: - In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more. - In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption. **Summary:** - Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`. - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`) - Compaction output file is assigned with the minimum `epoch_number` among input files' - Refit level: reuse refitted file's epoch_number - Other paths needing `epoch_number` treatment: - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo` - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`. - Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair). - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder. - Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag` - Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`. - Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR. - Misc: - update existing tests with `epoch_number` so make check will pass - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber() Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922 Test Plan: - `make check` - New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc` - Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930 - [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox` - [Ongoing] normal db stress test - [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 Reviewed By: ajkr Differential Revision: D41063187 Pulled By: hx235 fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2022-12-13 21:29:37 +00:00
Add(0, 7U, "150", "200", kFileSize, 0, 551, 600, 0, true,
/* temperature*/ Temperature::kUnknown,
/*oldest_ancestor_time*/ kUnknownOldestAncesterTime,
/*ts_of_smallest*/ Slice(), /*ts_of_largest*/ Slice(),
/*epoch_number*/ 4);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction2(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_FALSE(compaction2);
}
TEST_F(CompactionPickerTest, UniversalMarkedCompactionFullOverlap2) {
const uint64_t kFileSize = 100000;
ioptions_.compaction_style = kCompactionStyleUniversal;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
// This test covers the case where a delete triggered compaction is
// scheduled first, followed by a "regular" compaction. The latter
// should fail
NewVersionStorage(5, kCompactionStyleUniversal);
// Mark file number 4 for compaction
Sort L0 files by newly introduced epoch_num (#10922) Summary: **Context:** Sorting L0 files by `largest_seqno` has at least two inconvenience: - File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap. - For example, consider the following sequence of events ("key@n" indicates key at seqno "n") - insert k1@1 to memtable m1 - ingest file s1 with k2@2, ingest file s2 with k3@3 - insert k4@4 to m1 - compact files s1, s2 and result in new file s3 of seqno range [2, 3] - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1 - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption. - Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example) - an existing SST s1 contains only k1@1 - insert k1@2 to memtable m1 - ingest file s2 with k3@3, ingest file s3 with k4@4 - insert single delete k5@5 in m1 - flush m1 and result in new file s4 of seqno range [2, 5] - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4] - compact s4 and result in new file s6 of seqno range [2] due to single delete - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno` Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways: - In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more. - In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption. **Summary:** - Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`. - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`) - Compaction output file is assigned with the minimum `epoch_number` among input files' - Refit level: reuse refitted file's epoch_number - Other paths needing `epoch_number` treatment: - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo` - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`. - Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair). - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder. - Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag` - Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`. - Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR. - Misc: - update existing tests with `epoch_number` so make check will pass - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber() Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922 Test Plan: - `make check` - New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc` - Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930 - [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox` - [Ongoing] normal db stress test - [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 Reviewed By: ajkr Differential Revision: D41063187 Pulled By: hx235 fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2022-12-13 21:29:37 +00:00
Add(0, 4U, "260", "300", 4 * kFileSize, 0, 260, 300, 0, true,
/* temperature*/ Temperature::kUnknown,
/*oldest_ancestor_time*/ kUnknownOldestAncesterTime,
/*ts_of_smallest*/ Slice(), /*ts_of_largest*/ Slice(),
/*epoch_number*/ 1);
Add(3, 5U, "240", "290", 8 * kFileSize, 0, 201, 250);
Add(4, 3U, "301", "350", 8 * kFileSize, 0, 101, 150);
Add(4, 6U, "501", "750", 8 * kFileSize, 0, 101, 150);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction);
// Validate that its a delete triggered compaction
ASSERT_EQ(CompactionReason::kFilesMarkedForCompaction,
compaction->compaction_reason());
ASSERT_EQ(3, compaction->output_level());
ASSERT_EQ(0, compaction->start_level());
ASSERT_EQ(1U, compaction->num_input_files(0));
ASSERT_EQ(1U, compaction->num_input_files(1));
AddVersionStorage();
Sort L0 files by newly introduced epoch_num (#10922) Summary: **Context:** Sorting L0 files by `largest_seqno` has at least two inconvenience: - File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap. - For example, consider the following sequence of events ("key@n" indicates key at seqno "n") - insert k1@1 to memtable m1 - ingest file s1 with k2@2, ingest file s2 with k3@3 - insert k4@4 to m1 - compact files s1, s2 and result in new file s3 of seqno range [2, 3] - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1 - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption. - Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example) - an existing SST s1 contains only k1@1 - insert k1@2 to memtable m1 - ingest file s2 with k3@3, ingest file s3 with k4@4 - insert single delete k5@5 in m1 - flush m1 and result in new file s4 of seqno range [2, 5] - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4] - compact s4 and result in new file s6 of seqno range [2] due to single delete - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno` Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways: - In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more. - In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption. **Summary:** - Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`. - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`) - Compaction output file is assigned with the minimum `epoch_number` among input files' - Refit level: reuse refitted file's epoch_number - Other paths needing `epoch_number` treatment: - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo` - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`. - Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair). - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder. - Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag` - Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`. - Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR. - Misc: - update existing tests with `epoch_number` so make check will pass - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber() Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922 Test Plan: - `make check` - New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc` - Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930 - [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox` - [Ongoing] normal db stress test - [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 Reviewed By: ajkr Differential Revision: D41063187 Pulled By: hx235 fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2022-12-13 21:29:37 +00:00
Add(0, 1U, "150", "200", kFileSize, 0, 500, 550, /*compensated_file_size*/ 0,
/*marked_for_compact*/ false, /* temperature*/ Temperature::kUnknown,
/*oldest_ancestor_time*/ kUnknownOldestAncesterTime,
/*ts_of_smallest*/ Slice(), /*ts_of_largest*/ Slice(),
/*epoch_number*/ 3);
Add(0, 2U, "201", "250", 2 * kFileSize, 0, 401, 450,
/*compensated_file_size*/ 0, /*marked_for_compact*/ false,
/* temperature*/ Temperature::kUnknown,
/*oldest_ancestor_time*/ kUnknownOldestAncesterTime,
/*ts_of_smallest*/ Slice(), /*ts_of_largest*/ Slice(),
/*epoch_number*/ 2);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction2(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_FALSE(compaction2);
}
TEST_F(CompactionPickerTest, UniversalMarkedCompactionStartOutputOverlap) {
// The case where universal periodic compaction can be picked
// with some newer files being compacted.
const uint64_t kFileSize = 100000;
ioptions_.compaction_style = kCompactionStyleUniversal;
bool input_level_overlap = false;
bool output_level_overlap = false;
// Let's mark 2 files in 2 different levels for compaction. The
// compaction picker will randomly pick one, so use the sync point to
// ensure a deterministic order. Loop until both cases are covered
size_t random_index = 0;
SyncPoint::GetInstance()->SetCallBack(
"CompactionPicker::PickFilesMarkedForCompaction", [&](void* arg) {
size_t* index = static_cast<size_t*>(arg);
*index = random_index;
});
SyncPoint::GetInstance()->EnableProcessing();
while (!input_level_overlap || !output_level_overlap) {
// Ensure that the L0 file gets picked first
random_index = !input_level_overlap ? 0 : 1;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
NewVersionStorage(5, kCompactionStyleUniversal);
Add(0, 1U, "260", "300", 4 * kFileSize, 0, 260, 300, 0, true);
Add(3, 2U, "010", "020", 2 * kFileSize, 0, 201, 248);
Add(3, 3U, "250", "270", 2 * kFileSize, 0, 202, 249);
Add(3, 4U, "290", "310", 2 * kFileSize, 0, 203, 250);
Add(3, 5U, "310", "320", 2 * kFileSize, 0, 204, 251, 0, true);
Add(4, 6U, "301", "350", 8 * kFileSize, 0, 101, 150);
Add(4, 7U, "501", "750", 8 * kFileSize, 0, 101, 150);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction);
// Validate that its a delete triggered compaction
ASSERT_EQ(CompactionReason::kFilesMarkedForCompaction,
compaction->compaction_reason());
ASSERT_TRUE(compaction->start_level() == 0 ||
compaction->start_level() == 3);
if (compaction->start_level() == 0) {
// The L0 file was picked. The next compaction will detect an
// overlap on its input level
input_level_overlap = true;
ASSERT_EQ(3, compaction->output_level());
ASSERT_EQ(1U, compaction->num_input_files(0));
ASSERT_EQ(3U, compaction->num_input_files(1));
} else {
// The level 3 file was picked. The next compaction will pick
// the L0 file and will detect overlap when adding output
// level inputs
output_level_overlap = true;
ASSERT_EQ(4, compaction->output_level());
ASSERT_EQ(2U, compaction->num_input_files(0));
ASSERT_EQ(1U, compaction->num_input_files(1));
}
vstorage_->ComputeCompactionScore(ioptions_, mutable_cf_options_);
// After recomputing the compaction score, only one marked file will remain
random_index = 0;
std::unique_ptr<Compaction> compaction2(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_FALSE(compaction2);
DeleteVersionStorage();
}
}
TEST_F(CompactionPickerTest, UniversalMarkedL0NoOverlap) {
const uint64_t kFileSize = 100000;
ioptions_.compaction_style = kCompactionStyleUniversal;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
// This test covers the case where a delete triggered compaction is
// scheduled and should result in a full compaction
NewVersionStorage(1, kCompactionStyleUniversal);
// Mark file number 4 for compaction
Add(0, 4U, "260", "300", 1 * kFileSize, 0, 260, 300, 0, true);
Add(0, 5U, "240", "290", 2 * kFileSize, 0, 201, 250);
Add(0, 3U, "301", "350", 4 * kFileSize, 0, 101, 150);
Add(0, 6U, "501", "750", 8 * kFileSize, 0, 50, 100);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction);
// Validate that its a delete triggered compaction
ASSERT_EQ(CompactionReason::kFilesMarkedForCompaction,
compaction->compaction_reason());
ASSERT_EQ(0, compaction->output_level());
ASSERT_EQ(0, compaction->start_level());
ASSERT_EQ(4U, compaction->num_input_files(0));
ASSERT_TRUE(file_map_[4].first->being_compacted);
ASSERT_TRUE(file_map_[5].first->being_compacted);
ASSERT_TRUE(file_map_[3].first->being_compacted);
ASSERT_TRUE(file_map_[6].first->being_compacted);
}
TEST_F(CompactionPickerTest, UniversalMarkedL0WithOverlap) {
const uint64_t kFileSize = 100000;
ioptions_.compaction_style = kCompactionStyleUniversal;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
// This test covers the case where a file is being compacted, and a
// delete triggered compaction is then scheduled. The latter should stop
// at the first file being compacted
NewVersionStorage(1, kCompactionStyleUniversal);
// Mark file number 4 for compaction
Add(0, 4U, "260", "300", 1 * kFileSize, 0, 260, 300, 0, true);
Add(0, 5U, "240", "290", 2 * kFileSize, 0, 201, 250);
Add(0, 3U, "301", "350", 4 * kFileSize, 0, 101, 150);
Add(0, 6U, "501", "750", 8 * kFileSize, 0, 50, 100);
UpdateVersionStorageInfo();
file_map_[3].first->being_compacted = true;
std::unique_ptr<Compaction> compaction(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction);
// Validate that its a delete triggered compaction
ASSERT_EQ(CompactionReason::kFilesMarkedForCompaction,
compaction->compaction_reason());
ASSERT_EQ(0, compaction->output_level());
ASSERT_EQ(0, compaction->start_level());
ASSERT_EQ(2U, compaction->num_input_files(0));
ASSERT_TRUE(file_map_[4].first->being_compacted);
ASSERT_TRUE(file_map_[5].first->being_compacted);
}
TEST_F(CompactionPickerTest, UniversalMarkedL0Overlap2) {
const uint64_t kFileSize = 100000;
ioptions_.compaction_style = kCompactionStyleUniversal;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
// This test covers the case where a delete triggered compaction is
// scheduled first, followed by a "regular" compaction. The latter
// should fail
NewVersionStorage(1, kCompactionStyleUniversal);
// Mark file number 5 for compaction
Sort L0 files by newly introduced epoch_num (#10922) Summary: **Context:** Sorting L0 files by `largest_seqno` has at least two inconvenience: - File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap. - For example, consider the following sequence of events ("key@n" indicates key at seqno "n") - insert k1@1 to memtable m1 - ingest file s1 with k2@2, ingest file s2 with k3@3 - insert k4@4 to m1 - compact files s1, s2 and result in new file s3 of seqno range [2, 3] - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1 - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption. - Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example) - an existing SST s1 contains only k1@1 - insert k1@2 to memtable m1 - ingest file s2 with k3@3, ingest file s3 with k4@4 - insert single delete k5@5 in m1 - flush m1 and result in new file s4 of seqno range [2, 5] - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4] - compact s4 and result in new file s6 of seqno range [2] due to single delete - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno` Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways: - In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more. - In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption. **Summary:** - Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`. - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`) - Compaction output file is assigned with the minimum `epoch_number` among input files' - Refit level: reuse refitted file's epoch_number - Other paths needing `epoch_number` treatment: - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo` - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`. - Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair). - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder. - Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag` - Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`. - Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR. - Misc: - update existing tests with `epoch_number` so make check will pass - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber() Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922 Test Plan: - `make check` - New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc` - Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930 - [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox` - [Ongoing] normal db stress test - [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 Reviewed By: ajkr Differential Revision: D41063187 Pulled By: hx235 fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2022-12-13 21:29:37 +00:00
Add(0, 4U, "260", "300", 1 * kFileSize, 0, 260, 300,
/*compensated_file_size*/ 0, /*marked_for_compact*/ false,
/* temperature*/ Temperature::kUnknown,
/*oldest_ancestor_time*/ kUnknownOldestAncesterTime,
/*ts_of_smallest*/ Slice(), /*ts_of_largest*/ Slice(),
/*epoch_number*/ 4);
Add(0, 5U, "240", "290", 2 * kFileSize, 0, 201, 250, 0, true,
/* temperature*/ Temperature::kUnknown,
/*oldest_ancestor_time*/ kUnknownOldestAncesterTime,
/*ts_of_smallest*/ Slice(), /*ts_of_largest*/ Slice(),
/*epoch_number*/ 3);
Add(0, 3U, "301", "350", 4 * kFileSize, 0, 101, 150,
/*compensated_file_size*/ 0, /*marked_for_compact*/ false,
/* temperature*/ Temperature::kUnknown,
/*oldest_ancestor_time*/ kUnknownOldestAncesterTime,
/*ts_of_smallest*/ Slice(), /*ts_of_largest*/ Slice(),
/*epoch_number*/ 2);
Add(0, 6U, "501", "750", 8 * kFileSize, 0, 50, 100,
/*compensated_file_size*/ 0, /*marked_for_compact*/ false,
/* temperature*/ Temperature::kUnknown,
/*oldest_ancestor_time*/ kUnknownOldestAncesterTime,
/*ts_of_smallest*/ Slice(), /*ts_of_largest*/ Slice(),
/*epoch_number*/ 1);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction);
// Validate that its a delete triggered compaction
ASSERT_EQ(CompactionReason::kFilesMarkedForCompaction,
compaction->compaction_reason());
ASSERT_EQ(0, compaction->output_level());
ASSERT_EQ(0, compaction->start_level());
ASSERT_EQ(3U, compaction->num_input_files(0));
ASSERT_TRUE(file_map_[5].first->being_compacted);
ASSERT_TRUE(file_map_[3].first->being_compacted);
ASSERT_TRUE(file_map_[6].first->being_compacted);
AddVersionStorage();
Sort L0 files by newly introduced epoch_num (#10922) Summary: **Context:** Sorting L0 files by `largest_seqno` has at least two inconvenience: - File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap. - For example, consider the following sequence of events ("key@n" indicates key at seqno "n") - insert k1@1 to memtable m1 - ingest file s1 with k2@2, ingest file s2 with k3@3 - insert k4@4 to m1 - compact files s1, s2 and result in new file s3 of seqno range [2, 3] - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1 - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption. - Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example) - an existing SST s1 contains only k1@1 - insert k1@2 to memtable m1 - ingest file s2 with k3@3, ingest file s3 with k4@4 - insert single delete k5@5 in m1 - flush m1 and result in new file s4 of seqno range [2, 5] - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4] - compact s4 and result in new file s6 of seqno range [2] due to single delete - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno` Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways: - In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more. - In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption. **Summary:** - Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`. - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`) - Compaction output file is assigned with the minimum `epoch_number` among input files' - Refit level: reuse refitted file's epoch_number - Other paths needing `epoch_number` treatment: - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo` - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`. - Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair). - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder. - Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag` - Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`. - Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR. - Misc: - update existing tests with `epoch_number` so make check will pass - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber() Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922 Test Plan: - `make check` - New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc` - Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930 - [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox` - [Ongoing] normal db stress test - [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 Reviewed By: ajkr Differential Revision: D41063187 Pulled By: hx235 fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2022-12-13 21:29:37 +00:00
Add(0, 1U, "150", "200", kFileSize, 0, 500, 550, /*compensated_file_size*/ 0,
/*marked_for_compact*/ false,
/* temperature*/ Temperature::kUnknown,
/*oldest_ancestor_time*/ kUnknownOldestAncesterTime,
/*ts_of_smallest*/ Slice(), /*ts_of_largest*/ Slice(),
/*epoch_number*/ 6);
Add(0, 2U, "201", "250", kFileSize, 0, 401, 450, /*compensated_file_size*/ 0,
/*marked_for_compact*/ false,
/* temperature*/ Temperature::kUnknown,
/*oldest_ancestor_time*/ kUnknownOldestAncesterTime,
/*ts_of_smallest*/ Slice(), /*ts_of_largest*/ Slice(),
/*epoch_number*/ 5);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction2(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
ASSERT_TRUE(compaction2);
ASSERT_EQ(3U, compaction->num_input_files(0));
ASSERT_TRUE(file_map_[1].first->being_compacted);
ASSERT_TRUE(file_map_[2].first->being_compacted);
ASSERT_TRUE(file_map_[4].first->being_compacted);
}
TEST_F(CompactionPickerTest, UniversalMarkedManualCompaction) {
const uint64_t kFileSize = 100000;
const int kNumLevels = 7;
// This test makes sure the `files_marked_for_compaction_` is updated after
// creating manual compaction.
ioptions_.compaction_style = kCompactionStyleUniversal;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
NewVersionStorage(kNumLevels, kCompactionStyleUniversal);
// Add 3 files marked for compaction
Add(0, 3U, "301", "350", 4 * kFileSize, 0, 101, 150, 0, true);
Add(0, 4U, "260", "300", 1 * kFileSize, 0, 260, 300, 0, true);
Add(0, 5U, "240", "290", 2 * kFileSize, 0, 201, 250, 0, true);
UpdateVersionStorageInfo();
// All 3 files are marked for compaction
ASSERT_EQ(3U, vstorage_->FilesMarkedForCompaction().size());
bool manual_conflict = false;
InternalKey* manual_end = nullptr;
std::unique_ptr<Compaction> compaction(
universal_compaction_picker.CompactRange(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
ColumnFamilyData::kCompactAllLevels, 6, CompactRangeOptions(),
nullptr, nullptr, &manual_end, &manual_conflict,
std::numeric_limits<uint64_t>::max(), ""));
ASSERT_TRUE(compaction);
ASSERT_EQ(CompactionReason::kManualCompaction,
compaction->compaction_reason());
ASSERT_EQ(kNumLevels - 1, compaction->output_level());
ASSERT_EQ(0, compaction->start_level());
ASSERT_EQ(3U, compaction->num_input_files(0));
ASSERT_TRUE(file_map_[3].first->being_compacted);
ASSERT_TRUE(file_map_[4].first->being_compacted);
ASSERT_TRUE(file_map_[5].first->being_compacted);
// After creating the manual compaction, all files should be cleared from
// `FilesMarkedForCompaction`. So they won't be picked by others.
ASSERT_EQ(0U, vstorage_->FilesMarkedForCompaction().size());
}
TEST_F(CompactionPickerTest, UniversalSizeAmpTierCompactionNonLastLevel) {
// This test make sure size amplification compaction could still be triggered
// if the last sorted run is not the last level.
const uint64_t kFileSize = 100000;
const int kNumLevels = 7;
const int kLastLevel = kNumLevels - 1;
ioptions_.compaction_style = kCompactionStyleUniversal;
ioptions_.preclude_last_level_data_seconds = 1000;
mutable_cf_options_.compaction_options_universal
.max_size_amplification_percent = 200;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
NewVersionStorage(kNumLevels, kCompactionStyleUniversal);
Add(0, 100U, "100", "300", 1 * kFileSize);
Add(0, 101U, "200", "400", 1 * kFileSize);
Add(4, 90U, "100", "600", 4 * kFileSize);
Add(5, 80U, "200", "300", 2 * kFileSize);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
// Make sure it's a size amp compaction and includes all files
ASSERT_EQ(compaction->compaction_reason(),
CompactionReason::kUniversalSizeAmplification);
ASSERT_EQ(compaction->output_level(), kLastLevel);
ASSERT_EQ(compaction->input_levels(0)->num_files, 2);
ASSERT_EQ(compaction->input_levels(4)->num_files, 1);
ASSERT_EQ(compaction->input_levels(5)->num_files, 1);
}
TEST_F(CompactionPickerTest, UniversalSizeRatioTierCompactionLastLevel) {
// This test makes sure the size amp calculation skips the last level (L6), so
// size amp compaction is not triggered, instead a size ratio compaction is
// triggered.
const uint64_t kFileSize = 100000;
const int kNumLevels = 7;
const int kLastLevel = kNumLevels - 1;
const int kPenultimateLevel = kLastLevel - 1;
ioptions_.compaction_style = kCompactionStyleUniversal;
ioptions_.preclude_last_level_data_seconds = 1000;
mutable_cf_options_.compaction_options_universal
.max_size_amplification_percent = 200;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
NewVersionStorage(kNumLevels, kCompactionStyleUniversal);
Add(0, 100U, "100", "300", 1 * kFileSize);
Add(0, 101U, "200", "400", 1 * kFileSize);
Add(5, 90U, "100", "600", 4 * kFileSize);
Add(6, 80U, "200", "300", 2 * kFileSize);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
// Internally, size amp compaction is evaluated before size ratio compaction.
// Here to make sure it's size ratio compaction instead of size amp
ASSERT_EQ(compaction->compaction_reason(),
CompactionReason::kUniversalSizeRatio);
ASSERT_EQ(compaction->output_level(), kPenultimateLevel - 1);
ASSERT_EQ(compaction->input_levels(0)->num_files, 2);
ASSERT_EQ(compaction->input_levels(5)->num_files, 0);
ASSERT_EQ(compaction->input_levels(6)->num_files, 0);
}
TEST_F(CompactionPickerTest, UniversalSizeAmpTierCompactionNotSuport) {
// Tiered compaction only support level_num > 2 (otherwise the penultimate
// level is going to be level 0, which may make thing more complicated), so
// when there's only 2 level, still treating level 1 as the last level for
// size amp compaction
const uint64_t kFileSize = 100000;
const int kNumLevels = 2;
const int kLastLevel = kNumLevels - 1;
ioptions_.compaction_style = kCompactionStyleUniversal;
ioptions_.preclude_last_level_data_seconds = 1000;
mutable_cf_options_.compaction_options_universal
.max_size_amplification_percent = 200;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
NewVersionStorage(kNumLevels, kCompactionStyleUniversal);
Add(0, 100U, "100", "300", 1 * kFileSize);
Add(0, 101U, "200", "400", 1 * kFileSize);
Add(0, 90U, "100", "600", 4 * kFileSize);
Add(1, 80U, "200", "300", 2 * kFileSize);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
// size amp compaction is still triggered even preclude_last_level is set
ASSERT_EQ(compaction->compaction_reason(),
CompactionReason::kUniversalSizeAmplification);
ASSERT_EQ(compaction->output_level(), kLastLevel);
ASSERT_EQ(compaction->input_levels(0)->num_files, 3);
ASSERT_EQ(compaction->input_levels(1)->num_files, 1);
}
TEST_F(CompactionPickerTest, UniversalSizeAmpTierCompactionLastLevel) {
// This test makes sure the size amp compaction for tiered storage could still
// be triggered, but only for non-last-level files
const uint64_t kFileSize = 100000;
const int kNumLevels = 7;
const int kLastLevel = kNumLevels - 1;
const int kPenultimateLevel = kLastLevel - 1;
ioptions_.compaction_style = kCompactionStyleUniversal;
ioptions_.preclude_last_level_data_seconds = 1000;
mutable_cf_options_.compaction_options_universal
.max_size_amplification_percent = 200;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
NewVersionStorage(kNumLevels, kCompactionStyleUniversal);
Add(0, 100U, "100", "300", 3 * kFileSize);
Add(0, 101U, "200", "400", 2 * kFileSize);
Add(5, 90U, "100", "600", 2 * kFileSize);
Add(6, 80U, "200", "300", 2 * kFileSize);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
// It's a Size Amp compaction, but doesn't include the last level file and
// output to the penultimate level.
ASSERT_EQ(compaction->compaction_reason(),
CompactionReason::kUniversalSizeAmplification);
ASSERT_EQ(compaction->output_level(), kPenultimateLevel);
ASSERT_EQ(compaction->input_levels(0)->num_files, 2);
ASSERT_EQ(compaction->input_levels(5)->num_files, 1);
ASSERT_EQ(compaction->input_levels(6)->num_files, 0);
}
Fix overlapping check by excluding timestamp (#10615) Summary: With user-defined timestamp, checking overlapping should exclude timestamp part from key. This has already been done for range checking for files in sstableKeyCompare(), but not yet done when checking with concurrent compactions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10615 Test Plan: (Will add more tests) make check (Repro seems easier with this commit sha: git checkout 78bbdef530bd36fa299d496bd1013cf39d8e203a) rm -rf /dev/shm/rocksdb/* && mkdir /dev/shm/rocksdb/rocksdb_crashtest_expected && ./db_stress --allow_data_in_errors=True --clear_column_family_one_in=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5 --delrangepercent=0 --expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected --iterpercent=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1 --ops_per_thread=1000000 --paranoid_file_checks=1 --partition_filters=0 --prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0 --snapshot_hold_ops=100000 --subcompactions=1 --compaction_pri=3 --target_file_size_base=65536 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1 --user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1 --write_buffer_size=65536 --writepercent=60 -disable_wal=1 Reviewed By: akankshamahajan15 Differential Revision: D39146797 Pulled By: riversand963 fbshipit-source-id: 7fca800026ca6219220100b8b6cf84d907828163
2022-09-08 20:03:07 +00:00
TEST_F(CompactionPickerU64TsTest, Overlap) {
int num_levels = ioptions_.num_levels;
NewVersionStorage(num_levels, kCompactionStyleLevel);
constexpr int level = 0;
constexpr uint64_t file_number = 20ULL;
constexpr char smallest[4] = "500";
constexpr char largest[4] = "600";
constexpr uint64_t ts_of_smallest = 12345ULL;
constexpr uint64_t ts_of_largest = 56789ULL;
{
std::string ts1;
PutFixed64(&ts1, ts_of_smallest);
std::string ts2;
PutFixed64(&ts2, ts_of_largest);
Add(level, file_number, smallest, largest,
/*file_size=*/1U, /*path_id=*/0,
/*smallest_seq=*/100, /*largest_seq=*/100, /*compensated_file_size=*/0,
/*marked_for_compact=*/false, /*temperature=*/Temperature::kUnknown,
/*oldest_ancestor_time=*/kUnknownOldestAncesterTime, ts1, ts2);
UpdateVersionStorageInfo();
}
std::unordered_set<uint64_t> input{file_number};
std::vector<CompactionInputFiles> input_files;
ASSERT_OK(level_compaction_picker.GetCompactionInputsFromFileNumbers(
&input_files, &input, vstorage_.get(), CompactionOptions()));
std::unique_ptr<Compaction> comp1(level_compaction_picker.CompactFiles(
CompactionOptions(), input_files, level, vstorage_.get(),
mutable_cf_options_, mutable_db_options_, /*output_path_id=*/0));
{
// [600, ts=50000] to [600, ts=50000] is the range to check.
// ucmp->Compare(smallest_user_key, c->GetLargestUserKey()) > 0, but
// ucmp->CompareWithoutTimestamp(smallest_user_key,
// c->GetLargestUserKey()) == 0.
// Should still be considered overlapping.
std::string user_key_with_ts1(largest);
PutFixed64(&user_key_with_ts1, ts_of_largest - 1);
std::string user_key_with_ts2(largest);
PutFixed64(&user_key_with_ts2, ts_of_largest - 1);
ASSERT_TRUE(level_compaction_picker.RangeOverlapWithCompaction(
user_key_with_ts1, user_key_with_ts2, level));
}
{
// [500, ts=60000] to [500, ts=60000] is the range to check.
// ucmp->Compare(largest_user_key, c->GetSmallestUserKey()) < 0, but
// ucmp->CompareWithoutTimestamp(largest_user_key,
// c->GetSmallestUserKey()) == 0.
// Should still be considered overlapping.
std::string user_key_with_ts1(smallest);
PutFixed64(&user_key_with_ts1, ts_of_smallest + 1);
std::string user_key_with_ts2(smallest);
PutFixed64(&user_key_with_ts2, ts_of_smallest + 1);
ASSERT_TRUE(level_compaction_picker.RangeOverlapWithCompaction(
user_key_with_ts1, user_key_with_ts2, level));
}
}
TEST_F(CompactionPickerU64TsTest, CannotTrivialMoveUniversal) {
constexpr uint64_t kFileSize = 100000;
mutable_cf_options_.level0_file_num_compaction_trigger = 2;
mutable_cf_options_.compaction_options_universal.allow_trivial_move = true;
NewVersionStorage(1, kCompactionStyleUniversal);
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
UpdateVersionStorageInfo();
// must return false when there's no files.
ASSERT_FALSE(universal_compaction_picker.NeedsCompaction(vstorage_.get()));
std::string ts1;
PutFixed64(&ts1, 9000);
std::string ts2;
PutFixed64(&ts2, 8000);
std::string ts3;
PutFixed64(&ts3, 7000);
std::string ts4;
PutFixed64(&ts4, 6000);
NewVersionStorage(3, kCompactionStyleUniversal);
// A compaction should be triggered and pick file 2
Add(1, 1U, "150", "150", kFileSize, /*path_id=*/0, /*smallest_seq=*/100,
/*largest_seq=*/100, /*compensated_file_size=*/kFileSize,
/*marked_for_compact=*/false, Temperature::kUnknown,
kUnknownOldestAncesterTime, ts1, ts2);
Add(2, 2U, "150", "150", kFileSize, /*path_id=*/0, /*smallest_seq=*/100,
/*largest_seq=*/100, /*compensated_file_size=*/kFileSize,
/*marked_for_compact=*/false, Temperature::kUnknown,
kUnknownOldestAncesterTime, ts3, ts4);
UpdateVersionStorageInfo();
std::unique_ptr<Compaction> compaction(
universal_compaction_picker.PickCompaction(
cf_name_, mutable_cf_options_, mutable_db_options_, vstorage_.get(),
&log_buffer_));
Fix overlapping check by excluding timestamp (#10615) Summary: With user-defined timestamp, checking overlapping should exclude timestamp part from key. This has already been done for range checking for files in sstableKeyCompare(), but not yet done when checking with concurrent compactions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10615 Test Plan: (Will add more tests) make check (Repro seems easier with this commit sha: git checkout 78bbdef530bd36fa299d496bd1013cf39d8e203a) rm -rf /dev/shm/rocksdb/* && mkdir /dev/shm/rocksdb/rocksdb_crashtest_expected && ./db_stress --allow_data_in_errors=True --clear_column_family_one_in=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5 --delrangepercent=0 --expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected --iterpercent=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1 --ops_per_thread=1000000 --paranoid_file_checks=1 --partition_filters=0 --prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0 --snapshot_hold_ops=100000 --subcompactions=1 --compaction_pri=3 --target_file_size_base=65536 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1 --user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1 --write_buffer_size=65536 --writepercent=60 -disable_wal=1 Reviewed By: akankshamahajan15 Differential Revision: D39146797 Pulled By: riversand963 fbshipit-source-id: 7fca800026ca6219220100b8b6cf84d907828163
2022-09-08 20:03:07 +00:00
assert(compaction);
ASSERT_TRUE(!compaction->is_trivial_move());
}
class PerKeyPlacementCompactionPickerTest
: public CompactionPickerTest,
public testing::WithParamInterface<bool> {
public:
PerKeyPlacementCompactionPickerTest() : CompactionPickerTest() {}
void SetUp() override { enable_per_key_placement_ = GetParam(); }
protected:
bool enable_per_key_placement_ = false;
};
TEST_P(PerKeyPlacementCompactionPickerTest, OverlapWithNormalCompaction) {
SyncPoint::GetInstance()->SetCallBack(
"Compaction::SupportsPerKeyPlacement:Enabled", [&](void* arg) {
auto supports_per_key_placement = static_cast<bool*>(arg);
*supports_per_key_placement = enable_per_key_placement_;
});
SyncPoint::GetInstance()->EnableProcessing();
int num_levels = ioptions_.num_levels;
NewVersionStorage(num_levels, kCompactionStyleLevel);
Add(0, 21U, "100", "150", 60000000U);
Add(0, 22U, "300", "350", 60000000U);
Add(5, 40U, "200", "250", 60000000U);
Add(6, 50U, "101", "351", 60000000U);
UpdateVersionStorageInfo();
CompactionOptions comp_options;
std::unordered_set<uint64_t> input_set;
input_set.insert(40);
std::vector<CompactionInputFiles> input_files;
ASSERT_OK(level_compaction_picker.GetCompactionInputsFromFileNumbers(
&input_files, &input_set, vstorage_.get(), comp_options));
std::unique_ptr<Compaction> comp1(level_compaction_picker.CompactFiles(
comp_options, input_files, 5, vstorage_.get(), mutable_cf_options_,
mutable_db_options_, 0));
input_set.clear();
input_files.clear();
input_set.insert(21);
input_set.insert(22);
input_set.insert(50);
ASSERT_OK(level_compaction_picker.GetCompactionInputsFromFileNumbers(
&input_files, &input_set, vstorage_.get(), comp_options));
ASSERT_EQ(enable_per_key_placement_,
level_compaction_picker.FilesRangeOverlapWithCompaction(
input_files, 6,
Compaction::EvaluatePenultimateLevel(vstorage_.get(), ioptions_,
0, 6)));
}
TEST_P(PerKeyPlacementCompactionPickerTest, NormalCompactionOverlap) {
SyncPoint::GetInstance()->SetCallBack(
"Compaction::SupportsPerKeyPlacement:Enabled", [&](void* arg) {
auto supports_per_key_placement = static_cast<bool*>(arg);
*supports_per_key_placement = enable_per_key_placement_;
});
SyncPoint::GetInstance()->EnableProcessing();
int num_levels = ioptions_.num_levels;
NewVersionStorage(num_levels, kCompactionStyleLevel);
Add(0, 21U, "100", "150", 60000000U);
Add(0, 22U, "300", "350", 60000000U);
Add(4, 40U, "200", "220", 60000000U);
Add(4, 41U, "230", "250", 60000000U);
Add(6, 50U, "101", "351", 60000000U);
UpdateVersionStorageInfo();
CompactionOptions comp_options;
std::unordered_set<uint64_t> input_set;
input_set.insert(21);
input_set.insert(22);
input_set.insert(50);
std::vector<CompactionInputFiles> input_files;
ASSERT_OK(level_compaction_picker.GetCompactionInputsFromFileNumbers(
&input_files, &input_set, vstorage_.get(), comp_options));
std::unique_ptr<Compaction> comp1(level_compaction_picker.CompactFiles(
comp_options, input_files, 6, vstorage_.get(), mutable_cf_options_,
mutable_db_options_, 0));
input_set.clear();
input_files.clear();
input_set.insert(40);
input_set.insert(41);
ASSERT_OK(level_compaction_picker.GetCompactionInputsFromFileNumbers(
&input_files, &input_set, vstorage_.get(), comp_options));
ASSERT_EQ(enable_per_key_placement_,
level_compaction_picker.FilesRangeOverlapWithCompaction(
input_files, 5, Compaction::kInvalidLevel));
}
TEST_P(PerKeyPlacementCompactionPickerTest,
OverlapWithNormalCompactionUniveral) {
SyncPoint::GetInstance()->SetCallBack(
"Compaction::SupportsPerKeyPlacement:Enabled", [&](void* arg) {
auto supports_per_key_placement = static_cast<bool*>(arg);
*supports_per_key_placement = enable_per_key_placement_;
});
SyncPoint::GetInstance()->EnableProcessing();
int num_levels = ioptions_.num_levels;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
NewVersionStorage(num_levels, kCompactionStyleUniversal);
Add(0, 21U, "100", "150", 60000000U);
Add(0, 22U, "300", "350", 60000000U);
Add(5, 40U, "200", "250", 60000000U);
Add(6, 50U, "101", "351", 60000000U);
UpdateVersionStorageInfo();
CompactionOptions comp_options;
std::unordered_set<uint64_t> input_set;
input_set.insert(40);
std::vector<CompactionInputFiles> input_files;
ASSERT_OK(universal_compaction_picker.GetCompactionInputsFromFileNumbers(
&input_files, &input_set, vstorage_.get(), comp_options));
std::unique_ptr<Compaction> comp1(universal_compaction_picker.CompactFiles(
comp_options, input_files, 5, vstorage_.get(), mutable_cf_options_,
mutable_db_options_, 0));
input_set.clear();
input_files.clear();
input_set.insert(21);
input_set.insert(22);
input_set.insert(50);
ASSERT_OK(universal_compaction_picker.GetCompactionInputsFromFileNumbers(
&input_files, &input_set, vstorage_.get(), comp_options));
ASSERT_EQ(enable_per_key_placement_,
universal_compaction_picker.FilesRangeOverlapWithCompaction(
input_files, 6,
Compaction::EvaluatePenultimateLevel(vstorage_.get(), ioptions_,
0, 6)));
}
TEST_P(PerKeyPlacementCompactionPickerTest, NormalCompactionOverlapUniversal) {
SyncPoint::GetInstance()->SetCallBack(
"Compaction::SupportsPerKeyPlacement:Enabled", [&](void* arg) {
auto supports_per_key_placement = static_cast<bool*>(arg);
*supports_per_key_placement = enable_per_key_placement_;
});
SyncPoint::GetInstance()->EnableProcessing();
int num_levels = ioptions_.num_levels;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
NewVersionStorage(num_levels, kCompactionStyleUniversal);
Add(0, 21U, "100", "150", 60000000U);
Add(0, 22U, "300", "350", 60000000U);
Add(4, 40U, "200", "220", 60000000U);
Add(4, 41U, "230", "250", 60000000U);
Add(6, 50U, "101", "351", 60000000U);
UpdateVersionStorageInfo();
CompactionOptions comp_options;
std::unordered_set<uint64_t> input_set;
input_set.insert(21);
input_set.insert(22);
input_set.insert(50);
std::vector<CompactionInputFiles> input_files;
ASSERT_OK(universal_compaction_picker.GetCompactionInputsFromFileNumbers(
&input_files, &input_set, vstorage_.get(), comp_options));
std::unique_ptr<Compaction> comp1(universal_compaction_picker.CompactFiles(
comp_options, input_files, 6, vstorage_.get(), mutable_cf_options_,
mutable_db_options_, 0));
input_set.clear();
input_files.clear();
input_set.insert(40);
input_set.insert(41);
ASSERT_OK(universal_compaction_picker.GetCompactionInputsFromFileNumbers(
&input_files, &input_set, vstorage_.get(), comp_options));
ASSERT_EQ(enable_per_key_placement_,
universal_compaction_picker.FilesRangeOverlapWithCompaction(
input_files, 5, Compaction::kInvalidLevel));
}
TEST_P(PerKeyPlacementCompactionPickerTest, PenultimateOverlapUniversal) {
// This test is make sure the Tiered compaction would lock whole range of
// both output level and penultimate level
if (enable_per_key_placement_) {
ioptions_.preclude_last_level_data_seconds = 10000;
}
int num_levels = ioptions_.num_levels;
ioptions_.compaction_style = kCompactionStyleUniversal;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
NewVersionStorage(num_levels, kCompactionStyleUniversal);
// L4: [200, 220] [230, 250] [360, 380]
// L5:
// L6: [101, 351]
Add(4, 40U, "200", "220", 60000000U);
Add(4, 41U, "230", "250", 60000000U);
Add(4, 42U, "360", "380", 60000000U);
Add(6, 60U, "101", "351", 60000000U);
UpdateVersionStorageInfo();
// the existing compaction is the 1st L4 file + L6 file
// then compaction of the 2nd L4 file to L5 (penultimate level) is overlapped
// when the tiered compaction feature is on.
CompactionOptions comp_options;
std::unordered_set<uint64_t> input_set;
input_set.insert(40);
input_set.insert(60);
std::vector<CompactionInputFiles> input_files;
ASSERT_OK(universal_compaction_picker.GetCompactionInputsFromFileNumbers(
&input_files, &input_set, vstorage_.get(), comp_options));
std::unique_ptr<Compaction> comp1(universal_compaction_picker.CompactFiles(
comp_options, input_files, 6, vstorage_.get(), mutable_cf_options_,
mutable_db_options_, 0));
input_set.clear();
input_files.clear();
input_set.insert(41);
ASSERT_OK(universal_compaction_picker.GetCompactionInputsFromFileNumbers(
&input_files, &input_set, vstorage_.get(), comp_options));
ASSERT_EQ(enable_per_key_placement_,
universal_compaction_picker.FilesRangeOverlapWithCompaction(
input_files, 5, Compaction::kInvalidLevel));
// compacting the 3rd L4 file is always safe:
input_set.clear();
input_files.clear();
input_set.insert(42);
ASSERT_OK(universal_compaction_picker.GetCompactionInputsFromFileNumbers(
&input_files, &input_set, vstorage_.get(), comp_options));
ASSERT_FALSE(universal_compaction_picker.FilesRangeOverlapWithCompaction(
input_files, 5, Compaction::kInvalidLevel));
}
TEST_P(PerKeyPlacementCompactionPickerTest, LastLevelOnlyOverlapUniversal) {
if (enable_per_key_placement_) {
ioptions_.preclude_last_level_data_seconds = 10000;
}
int num_levels = ioptions_.num_levels;
ioptions_.compaction_style = kCompactionStyleUniversal;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
NewVersionStorage(num_levels, kCompactionStyleUniversal);
// L4: [200, 220] [230, 250] [360, 380]
// L5:
// L6: [101, 351]
Add(4, 40U, "200", "220", 60000000U);
Add(4, 41U, "230", "250", 60000000U);
Add(4, 42U, "360", "380", 60000000U);
Add(6, 60U, "101", "351", 60000000U);
UpdateVersionStorageInfo();
CompactionOptions comp_options;
std::unordered_set<uint64_t> input_set;
input_set.insert(60);
std::vector<CompactionInputFiles> input_files;
ASSERT_OK(universal_compaction_picker.GetCompactionInputsFromFileNumbers(
&input_files, &input_set, vstorage_.get(), comp_options));
std::unique_ptr<Compaction> comp1(universal_compaction_picker.CompactFiles(
comp_options, input_files, 6, vstorage_.get(), mutable_cf_options_,
mutable_db_options_, 0));
// cannot compact file 41 if the preclude_last_level feature is on, otherwise
// compact file 41 is okay.
input_set.clear();
input_files.clear();
input_set.insert(41);
ASSERT_OK(universal_compaction_picker.GetCompactionInputsFromFileNumbers(
&input_files, &input_set, vstorage_.get(), comp_options));
ASSERT_EQ(enable_per_key_placement_,
universal_compaction_picker.FilesRangeOverlapWithCompaction(
input_files, 5, Compaction::kInvalidLevel));
// compacting the 3rd L4 file is always safe:
input_set.clear();
input_files.clear();
input_set.insert(42);
ASSERT_OK(universal_compaction_picker.GetCompactionInputsFromFileNumbers(
&input_files, &input_set, vstorage_.get(), comp_options));
ASSERT_FALSE(universal_compaction_picker.FilesRangeOverlapWithCompaction(
input_files, 5, Compaction::kInvalidLevel));
}
TEST_P(PerKeyPlacementCompactionPickerTest,
LastLevelOnlyFailPenultimateUniversal) {
// This is to test last_level only compaction still unable to do the
// penultimate level compaction if there's already a file in the penultimate
// level.
// This should rarely happen in universal compaction, as the non-empty L5
// should be included in the compaction.
if (enable_per_key_placement_) {
ioptions_.preclude_last_level_data_seconds = 10000;
}
int num_levels = ioptions_.num_levels;
ioptions_.compaction_style = kCompactionStyleUniversal;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
NewVersionStorage(num_levels, kCompactionStyleUniversal);
// L4: [200, 220]
// L5: [230, 250]
// L6: [101, 351]
Add(4, 40U, "200", "220", 60000000U);
Add(5, 50U, "230", "250", 60000000U);
Add(6, 60U, "101", "351", 60000000U);
UpdateVersionStorageInfo();
CompactionOptions comp_options;
std::unordered_set<uint64_t> input_set;
input_set.insert(60);
std::vector<CompactionInputFiles> input_files;
ASSERT_OK(universal_compaction_picker.GetCompactionInputsFromFileNumbers(
&input_files, &input_set, vstorage_.get(), comp_options));
std::unique_ptr<Compaction> comp1(universal_compaction_picker.CompactFiles(
comp_options, input_files, 6, vstorage_.get(), mutable_cf_options_,
mutable_db_options_, 0));
ASSERT_TRUE(comp1);
ASSERT_EQ(comp1->GetPenultimateLevel(), Compaction::kInvalidLevel);
// As comp1 cannot be output to the penultimate level, compacting file 40 to
// L5 is always safe.
input_set.clear();
input_files.clear();
input_set.insert(40);
ASSERT_OK(universal_compaction_picker.GetCompactionInputsFromFileNumbers(
&input_files, &input_set, vstorage_.get(), comp_options));
ASSERT_FALSE(universal_compaction_picker.FilesRangeOverlapWithCompaction(
input_files, 5, Compaction::kInvalidLevel));
std::unique_ptr<Compaction> comp2(universal_compaction_picker.CompactFiles(
comp_options, input_files, 5, vstorage_.get(), mutable_cf_options_,
mutable_db_options_, 0));
ASSERT_TRUE(comp2);
ASSERT_EQ(Compaction::kInvalidLevel, comp2->GetPenultimateLevel());
}
TEST_P(PerKeyPlacementCompactionPickerTest,
LastLevelOnlyConflictWithOngoingUniversal) {
// This is to test last_level only compaction still unable to do the
// penultimate level compaction if there's already an ongoing compaction to
// the penultimate level
if (enable_per_key_placement_) {
ioptions_.preclude_last_level_data_seconds = 10000;
}
int num_levels = ioptions_.num_levels;
ioptions_.compaction_style = kCompactionStyleUniversal;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
NewVersionStorage(num_levels, kCompactionStyleUniversal);
// L4: [200, 220] [230, 250] [360, 380]
// L5:
// L6: [101, 351]
Add(4, 40U, "200", "220", 60000000U);
Add(4, 41U, "230", "250", 60000000U);
Add(4, 42U, "360", "380", 60000000U);
Add(6, 60U, "101", "351", 60000000U);
UpdateVersionStorageInfo();
// create an ongoing compaction to L5 (penultimate level)
CompactionOptions comp_options;
std::unordered_set<uint64_t> input_set;
input_set.insert(40);
std::vector<CompactionInputFiles> input_files;
ASSERT_OK(universal_compaction_picker.GetCompactionInputsFromFileNumbers(
&input_files, &input_set, vstorage_.get(), comp_options));
std::unique_ptr<Compaction> comp1(universal_compaction_picker.CompactFiles(
comp_options, input_files, 5, vstorage_.get(), mutable_cf_options_,
mutable_db_options_, 0));
ASSERT_TRUE(comp1);
ASSERT_EQ(comp1->GetPenultimateLevel(), Compaction::kInvalidLevel);
input_set.clear();
input_files.clear();
input_set.insert(60);
ASSERT_OK(universal_compaction_picker.GetCompactionInputsFromFileNumbers(
&input_files, &input_set, vstorage_.get(), comp_options));
ASSERT_EQ(enable_per_key_placement_,
universal_compaction_picker.FilesRangeOverlapWithCompaction(
input_files, 6,
Compaction::EvaluatePenultimateLevel(vstorage_.get(), ioptions_,
6, 6)));
if (!enable_per_key_placement_) {
std::unique_ptr<Compaction> comp2(universal_compaction_picker.CompactFiles(
comp_options, input_files, 6, vstorage_.get(), mutable_cf_options_,
mutable_db_options_, 0));
ASSERT_TRUE(comp2);
ASSERT_EQ(Compaction::kInvalidLevel, comp2->GetPenultimateLevel());
}
}
TEST_P(PerKeyPlacementCompactionPickerTest,
LastLevelOnlyNoConflictWithOngoingUniversal) {
// This is similar to `LastLevelOnlyConflictWithOngoingUniversal`, the only
// change is the ongoing compaction to L5 has no overlap with the last level
// compaction, so it's safe to move data from the last level to the
// penultimate level.
if (enable_per_key_placement_) {
ioptions_.preclude_last_level_data_seconds = 10000;
}
int num_levels = ioptions_.num_levels;
ioptions_.compaction_style = kCompactionStyleUniversal;
UniversalCompactionPicker universal_compaction_picker(ioptions_, &icmp_);
NewVersionStorage(num_levels, kCompactionStyleUniversal);
// L4: [200, 220] [230, 250] [360, 380]
// L5:
// L6: [101, 351]
Add(4, 40U, "200", "220", 60000000U);
Add(4, 41U, "230", "250", 60000000U);
Add(4, 42U, "360", "380", 60000000U);
Add(6, 60U, "101", "351", 60000000U);
UpdateVersionStorageInfo();
// create an ongoing compaction to L5 (penultimate level)
CompactionOptions comp_options;
std::unordered_set<uint64_t> input_set;
input_set.insert(42);
std::vector<CompactionInputFiles> input_files;
ASSERT_OK(universal_compaction_picker.GetCompactionInputsFromFileNumbers(
&input_files, &input_set, vstorage_.get(), comp_options));
std::unique_ptr<Compaction> comp1(universal_compaction_picker.CompactFiles(
comp_options, input_files, 5, vstorage_.get(), mutable_cf_options_,
mutable_db_options_, 0));
ASSERT_TRUE(comp1);
ASSERT_EQ(comp1->GetPenultimateLevel(), Compaction::kInvalidLevel);
input_set.clear();
input_files.clear();
input_set.insert(60);
ASSERT_OK(universal_compaction_picker.GetCompactionInputsFromFileNumbers(
&input_files, &input_set, vstorage_.get(), comp_options));
// always safe to move data up
ASSERT_FALSE(universal_compaction_picker.FilesRangeOverlapWithCompaction(
input_files, 6,
Compaction::EvaluatePenultimateLevel(vstorage_.get(), ioptions_, 6, 6)));
// 2 compactions can be run in parallel
std::unique_ptr<Compaction> comp2(universal_compaction_picker.CompactFiles(
comp_options, input_files, 6, vstorage_.get(), mutable_cf_options_,
mutable_db_options_, 0));
ASSERT_TRUE(comp2);
if (enable_per_key_placement_) {
ASSERT_NE(Compaction::kInvalidLevel, comp2->GetPenultimateLevel());
} else {
ASSERT_EQ(Compaction::kInvalidLevel, comp2->GetPenultimateLevel());
}
}
INSTANTIATE_TEST_CASE_P(PerKeyPlacementCompactionPickerTest,
PerKeyPlacementCompactionPickerTest, ::testing::Bool());
#endif // ROCKSDB_LITE
} // namespace ROCKSDB_NAMESPACE
rocksdb: switch to gtest Summary: Our existing test notation is very similar to what is used in gtest. It makes it easy to adopt what is different. In this diff I modify existing [[ https://code.google.com/p/googletest/wiki/Primer#Test_Fixtures:_Using_the_Same_Data_Configuration_for_Multiple_Te | test fixture ]] classes to inherit from `testing::Test`. Also for unit tests that use fixture class, `TEST` is replaced with `TEST_F` as required in gtest. There are several custom `main` functions in our existing tests. To make this transition easier, I modify all `main` functions to fallow gtest notation. But eventually we can remove them and use implementation of `main` that gtest provides. ```lang=bash % cat ~/transform #!/bin/sh files=$(git ls-files '*test\.cc') for file in $files do if grep -q "rocksdb::test::RunAllTests()" $file then if grep -Eq '^class \w+Test {' $file then perl -pi -e 's/^(class \w+Test) {/${1}: public testing::Test {/g' $file perl -pi -e 's/^(TEST)/${1}_F/g' $file fi perl -pi -e 's/(int main.*\{)/${1}::testing::InitGoogleTest(&argc, argv);/g' $file perl -pi -e 's/rocksdb::test::RunAllTests/RUN_ALL_TESTS/g' $file fi done % sh ~/transform % make format ``` Second iteration of this diff contains only scripted changes. Third iteration contains manual changes to fix last errors and make it compilable. Test Plan: Build and notice no errors. ```lang=bash % USE_CLANG=1 make check -j55 ``` Tests are still testing. Reviewers: meyering, sdong, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35157
2015-03-17 21:08:00 +00:00
int main(int argc, char** argv) {
ROCKSDB_NAMESPACE::port::InstallStackTraceHandler();
rocksdb: switch to gtest Summary: Our existing test notation is very similar to what is used in gtest. It makes it easy to adopt what is different. In this diff I modify existing [[ https://code.google.com/p/googletest/wiki/Primer#Test_Fixtures:_Using_the_Same_Data_Configuration_for_Multiple_Te | test fixture ]] classes to inherit from `testing::Test`. Also for unit tests that use fixture class, `TEST` is replaced with `TEST_F` as required in gtest. There are several custom `main` functions in our existing tests. To make this transition easier, I modify all `main` functions to fallow gtest notation. But eventually we can remove them and use implementation of `main` that gtest provides. ```lang=bash % cat ~/transform #!/bin/sh files=$(git ls-files '*test\.cc') for file in $files do if grep -q "rocksdb::test::RunAllTests()" $file then if grep -Eq '^class \w+Test {' $file then perl -pi -e 's/^(class \w+Test) {/${1}: public testing::Test {/g' $file perl -pi -e 's/^(TEST)/${1}_F/g' $file fi perl -pi -e 's/(int main.*\{)/${1}::testing::InitGoogleTest(&argc, argv);/g' $file perl -pi -e 's/rocksdb::test::RunAllTests/RUN_ALL_TESTS/g' $file fi done % sh ~/transform % make format ``` Second iteration of this diff contains only scripted changes. Third iteration contains manual changes to fix last errors and make it compilable. Test Plan: Build and notice no errors. ```lang=bash % USE_CLANG=1 make check -j55 ``` Tests are still testing. Reviewers: meyering, sdong, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35157
2015-03-17 21:08:00 +00:00
::testing::InitGoogleTest(&argc, argv);
return RUN_ALL_TESTS();
}