Fix manual compaction max_compaction_bytes under-calculated issue (#8269)

Summary:
Fix a bug that for manual compaction, `max_compaction_bytes` is only
limit the SST files from input level, but not overlapped files on output
level.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8269

Test Plan: `make check`

Reviewed By: ajkr

Differential Revision: D28231044

Pulled By: jay-zhuang

fbshipit-source-id: 9d7d03004f30cc4b1b9819830141436907554b7c
This commit is contained in:
Jay Zhuang 2021-05-21 14:02:37 -07:00 committed by Facebook GitHub Bot
parent bd3d080ef8
commit 6c86543590
4 changed files with 117 additions and 5 deletions

View file

@ -8,6 +8,7 @@
* Fixed a bug where `GetLiveFiles()` output included a non-existent file called "OPTIONS-000000". Backups and checkpoints, which use `GetLiveFiles()`, failed on DBs impacted by this bug. Read-write DBs were impacted when the latest OPTIONS file failed to write and `fail_if_options_file_error == false`. Read-only DBs were impacted when no OPTIONS files existed. * Fixed a bug where `GetLiveFiles()` output included a non-existent file called "OPTIONS-000000". Backups and checkpoints, which use `GetLiveFiles()`, failed on DBs impacted by this bug. Read-write DBs were impacted when the latest OPTIONS file failed to write and `fail_if_options_file_error == false`. Read-only DBs were impacted when no OPTIONS files existed.
* Handle return code by io_uring_submit_and_wait() and io_uring_wait_cqe(). * Handle return code by io_uring_submit_and_wait() and io_uring_wait_cqe().
* In the IngestExternalFile() API, only try to sync the ingested file if the file is linked and the FileSystem/Env supports reopening a writable file. * In the IngestExternalFile() API, only try to sync the ingested file if the file is linked and the FileSystem/Env supports reopening a writable file.
* Fixed a bug that `AdvancedColumnFamilyOptions.max_compaction_bytes` is under-calculated for manual compaction (`CompactRange()`). Manual compaction is split to multiple compactions if the compaction size exceed the `max_compaction_bytes`. The bug creates much larger compaction which size exceed the user setting. On the other hand, larger manual compaction size can increase the subcompaction parallelism, you can tune that by setting `max_compaction_bytes`.
### Behavior Changes ### Behavior Changes
* Due to the fix of false-postive alert of "SST file is ahead of WAL", all the CFs with no SST file (CF empty) will bypass the consistency check. We fixed a false-positive, but introduced a very rare true-negative which will be triggered in the following conditions: A CF with some delete operations in the last a few queries which will result in an empty CF (those are flushed to SST file and a compaction triggered which combines this file and all other SST files and generates an empty CF, or there is another reason to write a manifest entry for this CF after a flush that generates no SST file from an empty CF). The deletion entries are logged in a WAL and this WAL was corrupted, while the CF's log number points to the next WAL (due to the flush). Therefore, the DB can only recover to the point without these trailing deletions and cause the inconsistent DB status. * Due to the fix of false-postive alert of "SST file is ahead of WAL", all the CFs with no SST file (CF empty) will bypass the consistency check. We fixed a false-positive, but introduced a very rare true-negative which will be triggered in the following conditions: A CF with some delete operations in the last a few queries which will result in an empty CF (those are flushed to SST file and a compaction triggered which combines this file and all other SST files and generates an empty CF, or there is another reason to write a manifest entry for this CF after a flush that generates no SST file from an empty CF). The deletion entries are logged in a WAL and this WAL was corrupted, while the CF's log number points to the next WAL (due to the flush). Therefore, the DB can only recover to the point without these trailing deletions and cause the inconsistent DB status.

View file

@ -670,17 +670,41 @@ Compaction* CompactionPicker::CompactRange(
// two files overlap. // two files overlap.
if (input_level > 0) { if (input_level > 0) {
const uint64_t limit = mutable_cf_options.max_compaction_bytes; const uint64_t limit = mutable_cf_options.max_compaction_bytes;
uint64_t total = 0; uint64_t input_level_total = 0;
int hint_index = -1;
InternalKey* smallest = nullptr;
InternalKey* largest = nullptr;
for (size_t i = 0; i + 1 < inputs.size(); ++i) { for (size_t i = 0; i + 1 < inputs.size(); ++i) {
if (!smallest) {
smallest = &inputs[i]->smallest;
}
largest = &inputs[i]->largest;
uint64_t s = inputs[i]->compensated_file_size; uint64_t s = inputs[i]->compensated_file_size;
total += s; uint64_t output_level_total = 0;
if (total >= limit) { if (output_level < vstorage->num_non_empty_levels()) {
std::vector<FileMetaData*> files;
vstorage->GetOverlappingInputsRangeBinarySearch(
output_level, smallest, largest, &files, hint_index, &hint_index);
for (const auto& file : files) {
output_level_total += file->compensated_file_size;
}
}
input_level_total += s;
if (input_level_total + output_level_total >= limit) {
covering_the_whole_range = false; covering_the_whole_range = false;
// still include the current file, so the compaction could be larger
// than max_compaction_bytes, which is also to make sure the compaction
// can make progress even `max_compaction_bytes` is small (e.g. smaller
// than an SST file).
inputs.files.resize(i + 1); inputs.files.resize(i + 1);
break; break;
} }
} }
} }
assert(compact_range_options.target_path_id < assert(compact_range_options.target_path_id <
static_cast<uint32_t>(ioptions_.cf_paths.size())); static_cast<uint32_t>(ioptions_.cf_paths.size()));

View file

@ -5153,6 +5153,92 @@ TEST_F(DBCompactionTest, ManualCompactionBottomLevelOptimized) {
ASSERT_EQ(num, 0); ASSERT_EQ(num, 0);
} }
TEST_F(DBCompactionTest, ManualCompactionMax) {
uint64_t l1_avg_size, l2_avg_size;
auto generate_sst_func = [&]() {
Random rnd(301);
for (auto i = 0; i < 100; i++) {
for (auto j = 0; j < 10; j++) {
ASSERT_OK(Put(Key(i * 10 + j), rnd.RandomString(1024)));
}
ASSERT_OK(Flush());
}
MoveFilesToLevel(2);
for (auto i = 0; i < 10; i++) {
for (auto j = 0; j < 10; j++) {
ASSERT_OK(Put(Key(i * 100 + j * 10), rnd.RandomString(1024)));
}
ASSERT_OK(Flush());
}
MoveFilesToLevel(1);
std::vector<std::vector<FileMetaData>> level_to_files;
dbfull()->TEST_GetFilesMetaData(dbfull()->DefaultColumnFamily(),
&level_to_files);
uint64_t total = 0;
for (const auto& file : level_to_files[1]) {
total += file.compensated_file_size;
}
l1_avg_size = total / level_to_files[1].size();
total = 0;
for (const auto& file : level_to_files[2]) {
total += file.compensated_file_size;
}
l2_avg_size = total / level_to_files[2].size();
};
std::atomic_int num_compactions(0);
SyncPoint::GetInstance()->SetCallBack(
"DBImpl::BGWorkCompaction", [&](void* /*arg*/) { ++num_compactions; });
SyncPoint::GetInstance()->EnableProcessing();
Options opts = CurrentOptions();
opts.disable_auto_compactions = true;
// with default setting (1.6G by default), it should cover all files in 1
// compaction
DestroyAndReopen(opts);
generate_sst_func();
num_compactions.store(0);
CompactRangeOptions cro;
ASSERT_OK(db_->CompactRange(cro, nullptr, nullptr));
ASSERT_TRUE(num_compactions.load() == 1);
// split the compaction to 5
uint64_t total = (l1_avg_size + l2_avg_size * 10) * 10;
int num_split = 5;
opts.max_compaction_bytes = total / num_split;
DestroyAndReopen(opts);
generate_sst_func();
num_compactions.store(0);
ASSERT_OK(db_->CompactRange(cro, nullptr, nullptr));
ASSERT_TRUE(num_compactions.load() == num_split);
// very small max_compaction_bytes, it should still move forward
opts.max_compaction_bytes = l1_avg_size / 2;
DestroyAndReopen(opts);
generate_sst_func();
num_compactions.store(0);
ASSERT_OK(db_->CompactRange(cro, nullptr, nullptr));
ASSERT_TRUE(num_compactions.load() > 10);
// dynamically set the option
num_split = 2;
opts.max_compaction_bytes = 0;
DestroyAndReopen(opts);
generate_sst_func();
Status s = db_->SetOptions(
{{"max_compaction_bytes", std::to_string(total / num_split)}});
ASSERT_OK(s);
num_compactions.store(0);
ASSERT_OK(db_->CompactRange(cro, nullptr, nullptr));
ASSERT_TRUE(num_compactions.load() == num_split);
}
TEST_F(DBCompactionTest, CompactionDuringShutdown) { TEST_F(DBCompactionTest, CompactionDuringShutdown) {
Options opts = CurrentOptions(); Options opts = CurrentOptions();
opts.level0_file_num_compaction_trigger = 2; opts.level0_file_num_compaction_trigger = 2;

View file

@ -1685,10 +1685,11 @@ TEST_F(DBRangeDelTest, OverlappedKeys) {
true /* disallow_trivial_move */)); true /* disallow_trivial_move */));
ASSERT_EQ(3, NumTableFilesAtLevel(1)); ASSERT_EQ(3, NumTableFilesAtLevel(1));
std::vector<std::vector<FileMetaData>> files;
ASSERT_OK(dbfull()->TEST_CompactRange(1, nullptr, nullptr, nullptr, ASSERT_OK(dbfull()->TEST_CompactRange(1, nullptr, nullptr, nullptr,
true /* disallow_trivial_move */)); true /* disallow_trivial_move */));
ASSERT_EQ(1, NumTableFilesAtLevel(2)); ASSERT_EQ(
3, NumTableFilesAtLevel(
2)); // L1->L2 compaction size is limited to max_compaction_bytes
ASSERT_EQ(0, NumTableFilesAtLevel(1)); ASSERT_EQ(0, NumTableFilesAtLevel(1));
} }