// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).

#pragma once
#include <string>
#include <unordered_set>
#include <vector>

#include "db/column_family.h"
#include "db/internal_stats.h"
#include "db/snapshot_impl.h"
Add missing range conflict check between file ingestion and RefitLevel() (#10988)
Summary:
**Context:**
File ingestion never checks whether the key range it acts on overlaps with an ongoing RefitLevel() (used in `CompactRange()` with `change_level=true`). That's because RefitLevel() doesn't register and make its key range known to file ingestion. Though it checks overlapping with other compactions by https://github.com/facebook/rocksdb/blob/7.8.fb/db/external_sst_file_ingestion_job.cc#L998.
RefitLevel() (used in `CompactRange()` with `change_level=true`) doesn't check whether the key range it acts on overlaps with an ongoing file ingestion. That's because file ingestion does not register and make its key range known to other compactions.
- Note that non-refitlevel-compaction (e.g, manual compaction w/o RefitLevel() or general compaction) also does not check key range overlap with ongoing file ingestion for the same reason.
- But it's fine. Credited to cbi42's discovery, `WaitForIngestFile` was called by background and foreground compactions. They were introduced in https://github.com/facebook/rocksdb/commit/0f88160f67d36ea30e3aca3a3cef924c3a009be6, https://github.com/facebook/rocksdb/commit/5c64fb67d2fc198f1a73ff3ae543749a6a41f513 and https://github.com/facebook/rocksdb/commit/87dfc1d23e0e16ff73e15f63c6fa0fb3b3fc8c8c.
- Regardless, this PR registers file ingestion like a compaction is a general approach that will also add range conflict check between file ingestion and non-refitlevel-compaction, though it has not been the issue motivated this PR.
Above are bugs resulting in two bad consequences:
- If file ingestion and RefitLevel() create files in the same level, then range-overlapped files will be created at that level and caught as corruption by `force_consistency_checks=true`
- If file ingestion and RefitLevel() create files in different levels, then with one further compaction on the ingested file, it can result in two identical keys, both with seqno 0, in two different levels. Then with the iterator's [optimization](https://github.com/facebook/rocksdb/blame/c62f3221698fd273b673d4f7e54eabb8329a4369/db/db_iter.cc#L342-L343) that assumes no two identical keys both have seqno 0, it will either break this assertion in debug builds or, even worse, in release builds return the value of this key for the key after it, which is the wrong value.
Therefore we decide to introduce range conflict check for file ingestion and RefitLevel() inspired from the existing range conflict check among compactions.
**Summary:**
- Treat file ingestion job and RefitLevel() as `Compaction` of new compaction reasons: `CompactionReason::kExternalSstIngestion` and `CompactionReason::kRefitLevel` and register/unregister them. File ingestion is treated as compaction from L0 to different levels and RefitLevel() as compaction from source level to target level.
- Check for `RangeOverlapWithCompaction` with other ongoing compactions, `RegisterCompaction()` on this "compaction" before changing the LSM state in `VersionStorageInfo`, and `UnregisterCompaction()` after changing.
- Replace scattered fixes (https://github.com/facebook/rocksdb/commit/0f88160f67d36ea30e3aca3a3cef924c3a009be6, https://github.com/facebook/rocksdb/commit/5c64fb67d2fc198f1a73ff3ae543749a6a41f513 and https://github.com/facebook/rocksdb/commit/87dfc1d23e0e16ff73e15f63c6fa0fb3b3fc8c8c.) that prevents overlapping between file ingestion and non-refit-level compaction with this fix cuz those practices are easy to overlook.
- Misc: logic cleanup, see PR comments
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10988
Test Plan:
- New unit test `DBCompactionTestWithOngoingFileIngestionParam*` that failed pre-fix and passed afterwards.
- Made compatible with existing tests, see PR comments
- make check
- [Ongoing] Stress test rehearsal with normal value and aggressive CI value https://github.com/facebook/rocksdb/pull/10761
Reviewed By: cbi42
Differential Revision: D41535685
Pulled By: hx235
fbshipit-source-id: 549833a577ba1496d20a870583d4caa737da1258
2022-12-29 23:05:36 +00:00
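The register / check / unregister flow described in the summary above can be sketched roughly as follows. This is a simplified, self-contained model and not RocksDB code: `KeyRange`, `RangeRegistry`, `Conflicts`, `TryRegister`, and `Unregister` are hypothetical names, keys are plain strings compared bytewise, and a conflict is assumed to mean two in-progress jobs writing intersecting key ranges to the same level.

```cpp
#include <cassert>
#include <list>
#include <string>

// Hypothetical stand-in for the key range an in-progress job will write.
struct KeyRange {
  std::string smallest;
  std::string largest;
  int output_level;
};

// Hypothetical registry of in-progress jobs (compaction, ingestion, refit).
class RangeRegistry {
 public:
  using Handle = std::list<KeyRange>::iterator;

  // True if `r` intersects a registered range on the same output level.
  bool Conflicts(const KeyRange& r) const {
    for (const KeyRange& other : in_progress_) {
      if (other.output_level == r.output_level &&
          other.smallest <= r.largest && r.smallest <= other.largest) {
        return true;
      }
    }
    return false;
  }

  // Register `r` before touching the LSM state; refuse on conflict.
  bool TryRegister(const KeyRange& r, Handle* handle) {
    assert(handle != nullptr);
    if (Conflicts(r)) {
      return false;
    }
    *handle = in_progress_.insert(in_progress_.end(), r);
    return true;
  }

  // Unregister once the LSM state change has been applied.
  void Unregister(Handle handle) { in_progress_.erase(handle); }

 private:
  std::list<KeyRange> in_progress_;
};
```

In this model an ingestion job and a refit job would each call `TryRegister` before modifying the LSM state and `Unregister` afterwards, so whichever arrives second sees the other's range.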
#include "db/version_edit.h"
#include "env/file_system_tracer.h"
#include "logging/event_logger.h"
#include "options/db_options.h"
#include "rocksdb/db.h"
#include "rocksdb/file_system.h"
#include "rocksdb/sst_file_writer.h"
#include "util/autovector.h"

namespace ROCKSDB_NAMESPACE {

class Directories;
class SystemClock;

struct KeyRangeInfo {
  // Smallest internal key in an external file or for a batch of external files.
  InternalKey smallest_internal_key;
  // Largest internal key in an external file or for a batch of external files.
  InternalKey largest_internal_key;

  bool empty() const {
    return smallest_internal_key.size() == 0 &&
           largest_internal_key.size() == 0;
  }
};

// Helper class to apply SST file key range checks to the external files.
class ExternalFileRangeChecker {
 public:
  explicit ExternalFileRangeChecker(const Comparator* ucmp) : ucmp_(ucmp) {}

  // Operator used for sorting ranges.
  bool operator()(const KeyRangeInfo* prev_range,
                  const KeyRangeInfo* range) const {
    assert(prev_range);
    assert(range);
    return sstableKeyCompare(ucmp_, prev_range->smallest_internal_key,
                             range->smallest_internal_key) < 0;
  }

  // Check whether `range` overlaps with `prev_range`. `ranges_sorted` can be
  // set to true when the inputs are already sorted based on the sorting logic
  // provided by this checker's operator(), which can help simplify the check.
  bool OverlapsWithPrev(const KeyRangeInfo* prev_range,
                        const KeyRangeInfo* range,
                        bool ranges_sorted = false) const {
    assert(prev_range);
    assert(range);
    if (prev_range->empty() || range->empty()) {
      return false;
    }
    if (ranges_sorted) {
      return sstableKeyCompare(ucmp_, prev_range->largest_internal_key,
                               range->smallest_internal_key) >= 0;
    }

    return sstableKeyCompare(ucmp_, prev_range->largest_internal_key,
                             range->smallest_internal_key) >= 0 &&
           sstableKeyCompare(ucmp_, prev_range->smallest_internal_key,
                             range->largest_internal_key) <= 0;
  }

  void MaybeUpdateRange(const InternalKey& start_key,
                        const InternalKey& end_key, KeyRangeInfo* range) const {
    assert(range);
    if (range->smallest_internal_key.size() == 0 ||
        sstableKeyCompare(ucmp_, start_key, range->smallest_internal_key) < 0) {
      range->smallest_internal_key = start_key;
    }
    if (range->largest_internal_key.size() == 0 ||
        sstableKeyCompare(ucmp_, end_key, range->largest_internal_key) > 0) {
      range->largest_internal_key = end_key;
    }
  }

 private:
  const Comparator* ucmp_;
};

struct IngestedFileInfo : public KeyRangeInfo {
  // External file path
  std::string external_file_path;
Use extended file boundary for key range overlap check during file ingestion (#12735)
Summary:
When https://github.com/facebook/rocksdb/issues/12343 added support to bulk load external files while column family enables user-defined timestamps, it's a requirement that the external file doesn't overlap with the DB in key ranges. More specifically, the external file should not contain a user key (without timestamp) that already have some entries in the DB.
All the `*Overlap*` functions like `RangeOverlapWithMemtable`, `RangeOverlapWithCompaction` are using `CompareWithoutTimestamp` to check for overlap already. One thing that is missing here is we need to extend the external file's user key boundary for this check to avoid missing the checks for the boundary user keys. For example, with the current way of checking things where `external_file_info.smallest.user_key()` is used as the left boundary, and `external_file_info.largest.user_key()` is used as the right boundary, a file with this entry: (b, 40) can fit into a DB with these two entries: (b, 30), (c, 20).
To avoid this, we extend the user key boundaries used for overlap check, by updating the left boundary with the maximum timestamp and the right boundary with the minimum timestamp.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12735
Test Plan: Added unit test
Reviewed By: ltamasi
Differential Revision: D58152117
Pulled By: jowlyzhang
fbshipit-source-id: 9cba61e7357f6d76ad44c258381c35073ebbf347
2024-06-04 20:39:51 +00:00
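The boundary extension described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the RocksDB comparator: `TsKey`, `Less`, `RangeContainsAny`, `ExtendStart`, and `ExtendLimit` are hypothetical names, timestamps are plain `uint64_t` values, and keys are assumed to order by user key ascending and then by timestamp descending (newer first).

```cpp
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Hypothetical timestamped user key: (user key, timestamp).
struct TsKey {
  std::string user_key;
  uint64_t ts;
};

// Assumed ordering: user key ascending, then timestamp descending.
bool Less(const TsKey& a, const TsKey& b) {
  if (a.user_key != b.user_key) {
    return a.user_key < b.user_key;
  }
  return a.ts > b.ts;
}

// True if any key in `db_keys` falls inside the inclusive range [start, limit].
bool RangeContainsAny(const TsKey& start, const TsKey& limit,
                      const std::vector<TsKey>& db_keys) {
  for (const TsKey& k : db_keys) {
    if (!Less(k, start) && !Less(limit, k)) {
      return true;
    }
  }
  return false;
}

// Left boundary: same user key with the maximum timestamp (sorts first).
TsKey ExtendStart(const TsKey& smallest) {
  return {smallest.user_key, std::numeric_limits<uint64_t>::max()};
}

// Right boundary: same user key with the minimum timestamp (sorts last).
TsKey ExtendLimit(const TsKey& largest) {
  return {largest.user_key, 0};
}
```

With the DB entries (b, 30) and (c, 20) from the example and a file holding only (b, 40), the raw range [(b, 40), (b, 40)] contains no DB key in this model, while the extended range [(b, max), (b, 0)] contains (b, 30), so the overlap on user key `b` is detected.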
  // NOTE: use below two fields for all `*Overlap*` types of checks instead of
  // smallest_internal_key.user_key() and largest_internal_key.user_key().
  // The smallest / largest user key contained in the file for key range checks.
  // These could be different from smallest_internal_key.user_key(), and
  // largest_internal_key.user_key() when user-defined timestamps are enabled,
  // because the check is about making sure the user key without timestamps part
  // does not overlap. To achieve that, the smallest user key will be updated
  // with the maximum timestamp while the largest user key will be updated with
  // the min timestamp. It's otherwise the same.
  std::string start_ukey;
  std::string limit_ukey;
  // Sequence number for keys in external file
  SequenceNumber original_seqno;
  // Offset of the global sequence number field in the file, will
  // be zero if version is 1 (global seqno is not supported)
  size_t global_seqno_offset;
  // External file size
  uint64_t file_size;
  // total number of keys in external file
  uint64_t num_entries;
  // total number of range deletions in external file
  uint64_t num_range_deletions;
  // Id of column family this file should be ingested into
  uint32_t cf_id;
  // TableProperties read from external file
  TableProperties table_properties;
  // Version of external file
  int version;

  // FileDescriptor for the file inside the DB
  FileDescriptor fd;
  // file path that we picked for file inside the DB
  std::string internal_file_path;
  // Global sequence number that we picked for the file inside the DB
  SequenceNumber assigned_seqno = 0;
  // Level inside the DB we picked for the external file.
  int picked_level = 0;
  // Whether to copy or link the external sst file. copy_file will be set to
  // false if ingestion_options.move_files is true and underlying FS
  // supports link operation. Need to provide a default value to make the
  // undefined-behavior sanity check of llvm happy. Since
  // ingestion_options.move_files is false by default, thus copy_file is true
  // by default.
  bool copy_file = true;
Ingest SST files with checksum information (#6891)
Summary:
Application can ingest SST files with file checksum information, such that during ingestion, DB is able to check data integrity and identity of the SST file. The PR introduces generate_and_verify_file_checksum to IngestExternalFileOption to control if the ingested checksum information should be verified with the generated checksum.
1. If the generate_and_verify_file_checksum option is *FALSE*: *1)* if DB does not enable SST file checksum, the checksum information ingested will be ignored; *2)* if DB enables the SST file checksum and the checksum function name matches the checksum function name in DB, we trust the ingested checksum and store it in Manifest. If the checksum function name does not match, we treat that as an error and fail the IngestExternalFile() call.
2. If the generate_and_verify_file_checksum option is *TRUE*: *1)* if DB does not enable SST file checksum, the checksum information ingested will be ignored; *2)* if DB enables the SST file checksum, we will use the checksum generator from DB to calculate the checksum for each ingested SST file after it is copied or moved. Then, compare the checksum results with the ingested checksum information: _A)_ if the checksum function name does not match, _verification always reports true_ and we store the DB generated checksum information in Manifest. _B)_ if the checksum function name matches and the checksums match, ingestion continues and stores the checksum information in the Manifest. Otherwise, terminate file ingestion and report file corruption.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6891
Test Plan: added unit test, pass make asan_check
Reviewed By: pdillinger
Differential Revision: D21935988
Pulled By: zhichao-cao
fbshipit-source-id: 7b55f486632db467e76d72602218d0658aa7f6ed
2020-06-11 21:25:01 +00:00
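The decision rules listed in the summary above can be condensed into one function. This is a minimal sketch of those rules as written in the commit message, not the actual RocksDB implementation: `ChecksumDecision`, `DecideChecksum`, and all parameter names are hypothetical, and checksums are modeled as plain strings.

```cpp
#include <string>

// Hypothetical outcomes for one ingested file.
enum class ChecksumDecision {
  kIgnoreIngested,        // DB has SST file checksums disabled
  kStoreIngested,         // ingested checksum is trusted or was verified
  kStoreGenerated,        // function names differ; keep the DB's checksum
  kFailFuncNameMismatch,  // not verifying, and function names differ
  kFailCorruption,        // verifying, names match, checksums differ
};

ChecksumDecision DecideChecksum(bool generate_and_verify,
                                bool db_checksum_enabled,
                                const std::string& db_func_name,
                                const std::string& ingested_func_name,
                                const std::string& ingested_checksum,
                                const std::string& generated_checksum) {
  if (!db_checksum_enabled) {
    // Cases 1.1 and 2.1: ingested checksum information is ignored.
    return ChecksumDecision::kIgnoreIngested;
  }
  const bool names_match = (db_func_name == ingested_func_name);
  if (!generate_and_verify) {
    // Case 1.2: trust the ingested checksum only if the names match.
    return names_match ? ChecksumDecision::kStoreIngested
                       : ChecksumDecision::kFailFuncNameMismatch;
  }
  if (!names_match) {
    // Case 2.2.A: nothing to compare against; store the generated checksum.
    return ChecksumDecision::kStoreGenerated;
  }
  // Case 2.2.B: names match, so the two checksums must agree.
  return ingested_checksum == generated_checksum
             ? ChecksumDecision::kStoreIngested
             : ChecksumDecision::kFailCorruption;
}
```

In this sketch `generated_checksum` is only consulted when `generate_and_verify` is true, mirroring the description that the DB recomputes the checksum only in that mode.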
  // The checksum of ingested file
  std::string file_checksum;
  // The name of checksum function that generates the checksum
  std::string file_checksum_func_name;
  // The temperature of the file to be ingested
  Temperature file_temperature = Temperature::kUnknown;
  // Unique id of the file to be ingested
  UniqueId64x2 unique_id{};
  // Whether the external file should be treated as if it has user-defined
  // timestamps or not. If this flag is false, and the column family enables
  // UDT feature, the file will have min-timestamp artificially padded to its
  // user keys when it's read. Since it will affect how `TableReader` reads a
  // table file, it's defaulted to optimize for the majority of the case where
  // the user key's format in the external file matches the column family's
  // setting.
  bool user_defined_timestamps_persisted = true;
};

// A batch of files.
struct FileBatchInfo : public KeyRangeInfo {
  autovector<IngestedFileInfo*> files;
  // When true, `smallest_internal_key` and `largest_internal_key` will be
  // tracked and updated as new files get added via `AddFile`. When false, we
  // bypass this tracking. This is used when all the input external files
  // are already checked and not overlapping, and they just need to be added
  // into one default batch.
  bool track_batch_range;

  void AddFile(IngestedFileInfo* file,
               const ExternalFileRangeChecker& key_range_checker) {
    assert(file);
    files.push_back(file);
    if (track_batch_range) {
      key_range_checker.MaybeUpdateRange(file->smallest_internal_key,
                                         file->largest_internal_key, this);
    }
  }

  explicit FileBatchInfo(bool _track_batch_range)
      : track_batch_range(_track_batch_range) {}
};

class ExternalSstFileIngestionJob {
 public:
  ExternalSstFileIngestionJob(
      VersionSet* versions, ColumnFamilyData* cfd,
      const ImmutableDBOptions& db_options,
      const MutableDBOptions& mutable_db_options, const EnvOptions& env_options,
      SnapshotList* db_snapshots,
      const IngestExternalFileOptions& ingestion_options,
      Directories* directories, EventLogger* event_logger,
      const std::shared_ptr<IOTracer>& io_tracer)
      : clock_(db_options.clock),
        fs_(db_options.fs, io_tracer),
        versions_(versions),
        cfd_(cfd),
        ucmp_(cfd ? cfd->user_comparator() : nullptr),
        file_range_checker_(ucmp_),
        db_options_(db_options),
        mutable_db_options_(mutable_db_options),
        env_options_(env_options),
        db_snapshots_(db_snapshots),
        ingestion_options_(ingestion_options),
        directories_(directories),
        event_logger_(event_logger),
        job_start_time_(clock_->NowMicros()),
        consumed_seqno_count_(0),
        io_tracer_(io_tracer) {
    assert(directories != nullptr);
    assert(cfd_);
    assert(ucmp_);
  }

  ~ExternalSstFileIngestionJob() { UnregisterRange(); }

  // Prepare the job by copying external files into the DB.
  Status Prepare(const std::vector<std::string>& external_files_paths,
|
|
|
const std::vector<std::string>& files_checksums,
|
|
|
|
const std::vector<std::string>& files_checksum_func_names,
|
2021-10-08 17:29:06 +00:00
|
|
|
const Temperature& file_temperature, uint64_t next_file_number,
|
|
|
|
SuperVersion* sv);
|
2016-10-21 00:05:32 +00:00
|
|
|
|
|
|
|
// Check if we need to flush the memtable before running the ingestion job
|
|
|
|
// This will be true if the files we are ingesting are overlapping with any
|
|
|
|
// key range in the memtable.
|
2018-02-28 01:08:34 +00:00
|
|
|
//
|
|
|
|
// @param super_version A referenced SuperVersion that will be held for the
|
|
|
|
// duration of this function.
|
|
|
|
//
|
|
|
|
// Thread-safe
|
|
|
|
Status NeedsFlush(bool* flush_needed, SuperVersion* super_version);
|
2016-10-21 00:05:32 +00:00
|
|
|
|
|
|
|
// Will execute the ingestion job and prepare edit() to be applied.
|
|
|
|
// REQUIRES: Mutex held
|
|
|
|
Status Run();
|
|
|
|
|
2022-12-29 23:05:36 +00:00
|
|
|
// Register key range involved in this ingestion job
|
|
|
|
// to prevent key range conflict with other ongoing compaction/file ingestion
|
|
|
|
// REQUIRES: Mutex held
|
|
|
|
void RegisterRange();
|
|
|
|
|
|
|
|
// Unregister key range registered for this ingestion job
|
|
|
|
// REQUIRES: Mutex held
|
|
|
|
void UnregisterRange();
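The register/unregister pair above follows a simple pattern: refuse to start if the key range conflicts with a range that is already registered, and release the range when the job is done. A minimal sketch of that pattern, assuming inclusive key ranges compared bytewise; `KeyRange` and `RangeRegistry` are hypothetical names, and the real job registers `Compaction` objects per level and uses the column family's comparator.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical inclusive key range [smallest, largest].
struct KeyRange {
  std::string smallest;
  std::string largest;
};

// Not thread-safe on its own: like the methods above, callers are
// assumed to hold the DB mutex.
class RangeRegistry {
 public:
  bool Overlaps(const KeyRange& r) const {
    return std::any_of(ranges_.begin(), ranges_.end(),
                       [&r](const KeyRange& other) {
                         return !(r.largest < other.smallest ||
                                  other.largest < r.smallest);
                       });
  }

  // Returns false and registers nothing if `r` conflicts with an
  // already registered range.
  bool Register(const KeyRange& r) {
    if (Overlaps(r)) {
      return false;
    }
    ranges_.push_back(r);
    return true;
  }

  // Removes one registration equal to `r`, if present.
  void Unregister(const KeyRange& r) {
    auto it = std::find_if(ranges_.begin(), ranges_.end(),
                           [&r](const KeyRange& other) {
                             return other.smallest == r.smallest &&
                                    other.largest == r.largest;
                           });
    if (it != ranges_.end()) {
      ranges_.erase(it);
    }
  }

 private:
  std::vector<KeyRange> ranges_;
};
```

With this sketch, registering `{"a", "f"}` succeeds, a later `{"e", "k"}` is rejected until the first range is unregistered, and a disjoint `{"g", "k"}` is accepted at any time.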
|
|
|
|
|
2016-10-21 00:05:32 +00:00
|
|
|
// Update column family stats.
|
|
|
|
// REQUIRES: Mutex held
|
|
|
|
void UpdateStats();
|
|
|
|
|
2017-06-05 18:23:31 +00:00
|
|
|
// Cleanup after successful/failed job
|
2016-10-21 00:05:32 +00:00
|
|
|
void Cleanup(const Status& status);
|
|
|
|
|
|
|
|
VersionEdit* edit() { return &edit_; }
|
|
|
|
|
2016-12-06 21:56:17 +00:00
|
|
|
const autovector<IngestedFileInfo>& files_to_ingest() const {
|
|
|
|
return files_to_ingest_;
|
|
|
|
}
|
|
|
|
|
2024-02-03 02:07:57 +00:00
|
|
|
// How many sequence numbers did we consume as part of the ingestion job?
|
2019-09-13 21:48:18 +00:00
|
|
|
int ConsumedSequenceNumbersCount() const { return consumed_seqno_count_; }
|
2019-02-13 03:07:25 +00:00
|
|
|
|
2016-10-21 00:05:32 +00:00
|
|
|
private:
|
2024-02-21 23:41:53 +00:00
|
|
|
Status ResetTableReader(const std::string& external_file,
|
|
|
|
uint64_t new_file_number,
|
|
|
|
bool user_defined_timestamps_persisted,
|
|
|
|
SuperVersion* sv, IngestedFileInfo* file_to_ingest,
|
|
|
|
std::unique_ptr<TableReader>* table_reader);
|
|
|
|
|
|
|
|
// Read the external file's table properties to do various sanity checks and
|
|
|
|
// populate certain fields in `IngestedFileInfo` according to some table
|
|
|
|
// properties.
|
|
|
|
// In some cases when sanity check passes, `table_reader` could be reset with
|
|
|
|
// different options. For example: when external file does not contain
|
|
|
|
// timestamps while the column family enables the UDT-in-memtables-only feature.
|
|
|
|
Status SanityCheckTableProperties(const std::string& external_file,
|
|
|
|
uint64_t new_file_number, SuperVersion* sv,
|
|
|
|
IngestedFileInfo* file_to_ingest,
|
|
|
|
std::unique_ptr<TableReader>* table_reader);
|
|
|
|
|
2016-10-21 00:05:32 +00:00
|
|
|
// Open the external file and populate `file_to_ingest` with all the
|
|
|
|
// external information we need to ingest this file.
|
|
|
|
Status GetIngestedFileInfo(const std::string& external_file,
|
New stable, fixed-length cache keys (#9126)
Summary:
This change standardizes on a new 16-byte cache key format for
block cache (incl compressed and secondary) and persistent cache (but
not table cache and row cache).
The goal is a really fast cache key with practically ideal stability and
uniqueness properties without external dependencies (e.g. from FileSystem).
A fixed key size of 16 bytes should enable future optimizations to the
concurrent hash table for block cache, which is a heavy CPU user /
bottleneck, but there appears to be measurable performance improvement
even with no changes to LRUCache.
This change replaces a lot of disjointed and ugly code handling cache
keys with calls to a simple, clean new internal API (cache_key.h).
(Preserving the old cache key logic under an option would be very ugly
and likely negate the performance gain of the new approach. Complete
replacement carries some inherent risk, but I think that's acceptable
with sufficient analysis and testing.)
The scheme for encoding new cache keys is complicated but explained
in cache_key.cc.
Also: EndianSwapValue is moved to math.h to be next to other bit
operations. (Explains some new include "math.h".) ReverseBits operation
added and unit tests added to hash_test for both.
Fixes https://github.com/facebook/rocksdb/issues/7405 (presuming a root cause)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9126
Test Plan:
### Basic correctness
Several tests needed updates to work with the new functionality, mostly
because we are no longer relying on filesystem for stable cache keys
so table builders & readers need more context info to agree on cache
keys. This functionality is so core, a huge number of existing tests
exercise the cache key functionality.
### Performance
Create db with
`TEST_TMPDIR=/dev/shm ./db_bench -bloom_bits=10 -benchmarks=fillrandom -num=3000000 -partition_index_and_filters`
And test performance with
`TEST_TMPDIR=/dev/shm ./db_bench -readonly -use_existing_db -bloom_bits=10 -benchmarks=readrandom -num=3000000 -duration=30 -cache_index_and_filter_blocks -cache_size=250000 -threads=4`
using DEBUG_LEVEL=0 and simultaneous before & after runs.
Before ops/sec, avg over 100 runs: 121924
After ops/sec, avg over 100 runs: 125385 (+2.8%)
### Collision probability
I have built a tool, ./cache_bench -stress_cache_key to broadly simulate host-wide cache activity
over many months, by making some pessimistic simplifying assumptions:
* Every generated file has a cache entry for every byte offset in the file (contiguous range of cache keys)
* All of every file is cached for its entire lifetime
We use a simple table with skewed address assignment and replacement on address collision
to simulate files coming & going, with quite a variance (super-Poisson) in ages. Some output
with `./cache_bench -stress_cache_key -sck_keep_bits=40`:
```
Total cache or DBs size: 32TiB Writing 925.926 MiB/s or 76.2939TiB/day
Multiply by 9.22337e+18 to correct for simulation losses (but still assume whole file cached)
```
These come from default settings of 2.5M files per day of 32 MB each, and
`-sck_keep_bits=40` means that to represent a single file, we are only keeping 40 bits of
the 128-bit cache key. With file size of 2\*\*25 contiguous keys (pessimistic), our simulation
is about 2\*\*(128-40-25) or about 9 billion billion times more prone to collision than reality.
More default assumptions, relatively pessimistic:
* 100 DBs in same process (doesn't matter much)
* Re-open DB in same process (new session ID related to old session ID) on average
every 100 files generated
* Restart process (all new session IDs unrelated to old) 24 times per day
After enough data, we get a result at the end:
```
(keep 40 bits) 17 collisions after 2 x 90 days, est 10.5882 days between (9.76592e+19 corrected)
```
If we believe the (pessimistic) simulation and the mathematical generalization, we would need to run a billion machines all for 97 billion days to expect a cache key collision. To help verify that our generalization ("corrected") is robust, we can make our simulation more precise with `-sck_keep_bits=41` and `42`, which takes more running time to get enough data:
```
(keep 41 bits) 16 collisions after 4 x 90 days, est 22.5 days between (1.03763e+20 corrected)
(keep 42 bits) 19 collisions after 10 x 90 days, est 47.3684 days between (1.09224e+20 corrected)
```
The generalized prediction still holds. With the `-sck_randomize` option, we can see that we are beating "random" cache keys (except offsets still non-randomized) by a modest amount (roughly 20x less collision prone than random), which should make us reasonably comfortable even in "degenerate" cases:
```
197 collisions after 1 x 90 days, est 0.456853 days between (4.21372e+18 corrected)
```
I've run other tests to validate other conditions behave as expected, never behaving "worse than random" unless we start chopping off structured data.
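The "corrected" figures in the outputs above can be reproduced by hand. The sketch below assumes, as stated earlier in this summary, a 128-bit cache key and 2\*\*25 contiguous keys per file; the function names are made up for illustration and are not part of `cache_bench`.

```cpp
#include <cassert>
#include <cmath>

// Average simulated days between observed collisions.
double SimulatedDaysBetweenCollisions(int collisions, double simulated_days) {
  assert(collisions > 0);
  return simulated_days / collisions;
}

// Scales the simulated interval by 2^(128 - keep_bits - 25), the factor by
// which the simulation is more collision prone than reality when only
// `keep_bits` bits of the 128-bit key represent a file.
double CorrectedDaysBetweenCollisions(int collisions, double simulated_days,
                                      int keep_bits) {
  const int kCacheKeyBits = 128;
  const int kKeysPerFileBits = 25;  // 2^25 contiguous keys per file
  const double correction =
      std::ldexp(1.0, kCacheKeyBits - keep_bits - kKeysPerFileBits);
  return SimulatedDaysBetweenCollisions(collisions, simulated_days) *
         correction;
}
```

For the 40-bit run, 17 collisions over 2 x 90 days gives 180 / 17 ≈ 10.5882 days, and multiplying by 2^63 ≈ 9.22337e+18 gives about 9.76592e+19, which matches the reported value. The 41-bit and 42-bit runs work out the same way with 2^62 and 2^61.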
Reviewed By: zhichao-cao
Differential Revision: D33171746
Pulled By: pdillinger
fbshipit-source-id: f16a57e369ed37be5e7e33525ace848d0537c88f
2021-12-17 01:13:55 +00:00
|
|
|
uint64_t new_file_number,
|
2018-05-21 21:33:55 +00:00
|
|
|
IngestedFileInfo* file_to_ingest,
|
|
|
|
SuperVersion* sv);
|
2016-10-21 00:05:32 +00:00
|
|
|
|
2024-10-16 00:22:01 +00:00
|
|
|
// If the input files' key ranges overlap one another, this function divides
|
|
|
|
// them, in the user-specified order, into multiple batches. Files within a
|
|
|
|
// batch do not overlap with each other, but key ranges could overlap
|
|
|
|
// between batches.
|
|
|
|
// If the input files' key ranges don't overlap one another, they always
|
|
|
|
// make just one batch.
|
|
|
|
void DivideInputFilesIntoBatches();
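One way to satisfy the contract described above is a greedy pass over the files in their given order: keep adding files to the current batch until one overlaps a file already in it, then start a new batch. This is a sketch under that assumption, not the actual implementation; `FileRange`, `RangesOverlap`, and `DivideIntoBatches` are hypothetical names, and keys are compared bytewise instead of with the user comparator.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical inclusive key range of one input file.
struct FileRange {
  std::string smallest;
  std::string largest;
};

inline bool RangesOverlap(const FileRange& a, const FileRange& b) {
  return !(a.largest < b.smallest || b.largest < a.smallest);
}

// Returns batches of indices into `files`, preserving the input order.
// Files inside one batch never overlap; files in different batches may.
std::vector<std::vector<std::size_t>> DivideIntoBatches(
    const std::vector<FileRange>& files) {
  std::vector<std::vector<std::size_t>> batches;
  for (std::size_t i = 0; i < files.size(); ++i) {
    bool start_new_batch = batches.empty();
    if (!start_new_batch) {
      for (std::size_t j : batches.back()) {
        if (RangesOverlap(files[i], files[j])) {
          start_new_batch = true;
          break;
        }
      }
    }
    if (start_new_batch) {
      batches.emplace_back();
    }
    batches.back().push_back(i);
  }
  return batches;
}
```

For input ranges `[a, c]`, `[d, f]`, `[b, e]`, `[g, h]` in that order, the sketch produces two batches, `{0, 1}` and `{2, 3}`. Input without any overlap always ends up in a single batch.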
|
|
|
|
|
|
|
|
// Assign level for the files in one batch. The files within one batch are not
|
|
|
|
// overlapping, and we assign level to each file one after another.
|
|
|
|
// If `prev_batch_uppermost_level` is specified, all files in this batch will
|
|
|
|
// be assigned to levels that are higher than `prev_batch_uppermost_level`.
|
|
|
|
// The uppermost level used by this batch of files is tracked too, so that it
|
|
|
|
// can be used by the next batch.
|
|
|
|
// REQUIRES: Mutex held
|
|
|
|
Status AssignLevelsForOneBatch(FileBatchInfo& batch,
|
|
|
|
SuperVersion* super_version,
|
|
|
|
bool force_global_seqno,
|
|
|
|
SequenceNumber* last_seqno,
|
|
|
|
int* batch_uppermost_level,
|
|
|
|
std::optional<int> prev_batch_uppermost_level);
|
|
|
|
|
2021-10-13 03:38:36 +00:00
|
|
|
// Assign `file_to_ingest` the appropriate sequence number and the lowest
|
2017-04-26 20:28:39 +00:00
|
|
|
// possible level that it can be ingested to according to compaction_style.
|
2024-10-16 00:22:01 +00:00
|
|
|
// If `prev_batch_uppermost_level` is specified, the file will only be
|
|
|
|
// assigned to levels that are higher than `prev_batch_uppermost_level`.
|
2016-10-21 00:05:32 +00:00
|
|
|
// REQUIRES: Mutex held
|
2024-10-16 00:22:01 +00:00
|
|
|
Status AssignLevelAndSeqnoForIngestedFile(
|
|
|
|
SuperVersion* sv, bool force_global_seqno,
|
|
|
|
CompactionStyle compaction_style, SequenceNumber last_seqno,
|
|
|
|
IngestedFileInfo* file_to_ingest, SequenceNumber* assigned_seqno,
|
|
|
|
std::optional<int> prev_batch_uppermost_level);
|
2016-10-21 00:05:32 +00:00
|
|
|
|
2017-05-17 18:32:26 +00:00
|
|
|
// File that we want to ingest behind always goes to the lowest level;
|
|
|
|
// we just check that it fits in the level, that DB allows ingest_behind,
|
|
|
|
// and that we don't have 0 seqnums at the upper levels.
|
|
|
|
// REQUIRES: Mutex held
|
|
|
|
Status CheckLevelForIngestedBehindFile(IngestedFileInfo* file_to_ingest);
|
|
|
|
|
2016-10-21 00:05:32 +00:00
|
|
|
// Set the file global sequence number to `seqno`
|
|
|
|
Status AssignGlobalSeqnoForIngestedFile(IngestedFileInfo* file_to_ingest,
|
|
|
|
SequenceNumber seqno);
|
2020-06-11 21:25:01 +00:00
|
|
|
// Generate the file checksum and store in the IngestedFileInfo
|
|
|
|
IOStatus GenerateChecksumForIngestedFile(IngestedFileInfo* file_to_ingest);
|
2016-10-21 00:05:32 +00:00
|
|
|
|
|
|
|
// Check if `file_to_ingest` can fit in level `level`
|
|
|
|
// REQUIRES: Mutex held
|
|
|
|
bool IngestedFileFitInLevel(const IngestedFileInfo* file_to_ingest,
|
|
|
|
int level);
|
|
|
|
|
2019-06-21 17:12:29 +00:00
|
|
|
// Helper method to sync given file.
|
|
|
|
template <typename TWritableFile>
|
|
|
|
Status SyncIngestedFile(TWritableFile* file);
|
|
|
|
|
2022-12-29 23:05:36 +00:00
|
|
|
// Create `Compaction` objects equivalent to this file ingestion job,
|
|
|
|
// which will be used to check for range conflicts with other ongoing
|
|
|
|
// compactions.
|
|
|
|
void CreateEquivalentFileIngestingCompactions();
|
|
|
|
|
2024-02-03 02:07:57 +00:00
|
|
|
// Remove all the internal files created, called when ingestion job fails.
|
|
|
|
void DeleteInternalFiles();
|
|
|
|
|
2021-03-15 11:32:24 +00:00
|
|
|
SystemClock* clock_;
|
2020-08-13 00:28:10 +00:00
|
|
|
FileSystemPtr fs_;
|
2016-10-21 00:05:32 +00:00
|
|
|
VersionSet* versions_;
|
|
|
|
ColumnFamilyData* cfd_;
|
2024-10-16 00:22:01 +00:00
|
|
|
const Comparator* ucmp_;
|
|
|
|
ExternalFileRangeChecker file_range_checker_;
|
2016-10-21 00:05:32 +00:00
|
|
|
const ImmutableDBOptions& db_options_;
|
2022-12-29 23:05:36 +00:00
|
|
|
const MutableDBOptions& mutable_db_options_;
|
2016-10-21 00:05:32 +00:00
|
|
|
const EnvOptions& env_options_;
|
|
|
|
SnapshotList* db_snapshots_;
|
|
|
|
autovector<IngestedFileInfo> files_to_ingest_;
|
2024-10-16 00:22:01 +00:00
|
|
|
std::vector<FileBatchInfo> file_batches_to_ingest_;
|
2016-10-21 00:05:32 +00:00
|
|
|
const IngestExternalFileOptions& ingestion_options_;
|
2019-06-21 17:12:29 +00:00
|
|
|
Directories* directories_;
|
2019-09-13 21:48:18 +00:00
|
|
|
EventLogger* event_logger_;
|
2016-10-21 00:05:32 +00:00
|
|
|
VersionEdit edit_;
|
|
|
|
uint64_t job_start_time_;
|
2019-09-13 21:48:18 +00:00
|
|
|
int consumed_seqno_count_;
|
|
|
|
// Set in ExternalSstFileIngestionJob::Prepare(), if true all files are
|
|
|
|
// ingested in L0
|
|
|
|
bool files_overlap_{false};
|
2020-06-11 21:25:01 +00:00
|
|
|
// Set in ExternalSstFileIngestionJob::Prepare(), if true and DB
|
|
|
|
// file_checksum_gen_factory is set, DB will generate a checksum for each file.
|
|
|
|
bool need_generate_file_checksum_{true};
|
2020-08-18 23:19:22 +00:00
|
|
|
std::shared_ptr<IOTracer> io_tracer_;
|
Add missing range conflict check between file ingestion and RefitLevel() (#10988)
Summary:
**Context:**
File ingestion never checks whether the key range it acts on overlaps with an ongoing RefitLevel() (used in `CompactRange()` with `change_level=true`). That's because RefitLevel() doesn't register and make its key range known to file ingestion. Though it checks overlapping with other compactions by https://github.com/facebook/rocksdb/blob/7.8.fb/db/external_sst_file_ingestion_job.cc#L998.
RefitLevel() (used in `CompactRange()` with `change_level=true`) doesn't check whether the key range it acts on overlaps with an ongoing file ingestion. That's because file ingestion does not register and make its key range known to other compactions.
- Note that non-refitlevel-compaction (e.g, manual compaction w/o RefitLevel() or general compaction) also does not check key range overlap with ongoing file ingestion for the same reason.
- But it's fine. Credited to cbi42's discovery, `WaitForIngestFile` was called by background and foreground compactions. They were introduced in https://github.com/facebook/rocksdb/commit/0f88160f67d36ea30e3aca3a3cef924c3a009be6, https://github.com/facebook/rocksdb/commit/5c64fb67d2fc198f1a73ff3ae543749a6a41f513 and https://github.com/facebook/rocksdb/commit/87dfc1d23e0e16ff73e15f63c6fa0fb3b3fc8c8c.
- Regardless, this PR registers file ingestion like a compaction is a general approach that will also add range conflict check between file ingestion and non-refitlevel-compaction, though it has not been the issue motivated this PR.
The bugs above have two bad consequences:
- If file ingestion and RefitLevel() create files in the same level, then range-overlapped files will be created at that level and caught as corruption by `force_consistency_checks=true`.
- If file ingestion and RefitLevel() create files in different levels, then one further compaction on the ingested file can leave two identical keys, both with seqno 0, in two different levels. The iterator's [optimization](https://github.com/facebook/rocksdb/blame/c62f3221698fd273b673d4f7e54eabb8329a4369/db/db_iter.cc#L342-L343) assumes that no two identical keys both have seqno 0, so this either breaks the assertion in a debug build or, even worse, returns the value of this key for the key after it (the wrong value) in a release build.
Therefore we decided to introduce a range conflict check for file ingestion and RefitLevel(), inspired by the existing range conflict check among compactions.
**Summary:**
- Treat the file ingestion job and RefitLevel() as `Compaction`s with new compaction reasons, `CompactionReason::kExternalSstIngestion` and `CompactionReason::kRefitLevel`, and register/unregister them. File ingestion is treated as a compaction from L0 to different levels, and RefitLevel() as a compaction from the source level to the target level.
- Check `RangeOverlapWithCompaction` against other ongoing compactions, call `RegisterCompaction()` on this "compaction" before changing the LSM state in `VersionStorageInfo`, and call `UnregisterCompaction()` after the change.
- Replace the scattered fixes (https://github.com/facebook/rocksdb/commit/0f88160f67d36ea30e3aca3a3cef924c3a009be6, https://github.com/facebook/rocksdb/commit/5c64fb67d2fc198f1a73ff3ae543749a6a41f513 and https://github.com/facebook/rocksdb/commit/87dfc1d23e0e16ff73e15f63c6fa0fb3b3fc8c8c) that prevent overlap between file ingestion and non-refit-level compaction with this fix, because those practices are easy to overlook.
- Misc: logic cleanup, see PR comments
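The check-register-unregister pattern described in the list above can be sketched as follows. This is a minimal, self-contained illustration and not the RocksDB implementation: `RangeTask`, `RangeRegistry`, `Overlaps`, `TryRegister` and `Unregister` are hypothetical names, keys are compared as plain `std::string`s instead of through a user comparator, and a conflict is assumed to mean "overlapping key range with the same output level".

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a registered "compaction": an inclusive user-key
// range together with the level it writes to.
struct RangeTask {
  std::string smallest;
  std::string largest;
  int output_level;
};

// Hypothetical registry of ongoing range-modifying jobs (compaction, file
// ingestion, refit). Assumption: the caller holds whatever lock protects it.
class RangeRegistry {
 public:
  // True if [smallest, largest] overlaps a registered task that writes to the
  // same level.
  bool Overlaps(const std::string& smallest, const std::string& largest,
                int level) const {
    for (const RangeTask& t : tasks_) {
      const bool disjoint = largest < t.smallest || t.largest < smallest;
      if (t.output_level == level && !disjoint) {
        return true;
      }
    }
    return false;
  }

  // Check for a conflict first; register only if there is none. This must
  // happen before the job changes the LSM state.
  bool TryRegister(const RangeTask& task) {
    if (Overlaps(task.smallest, task.largest, task.output_level)) {
      return false;
    }
    tasks_.push_back(task);
    return true;
  }

  // Remove a previously registered task once the LSM state has been changed.
  void Unregister(const RangeTask& task) {
    auto it = std::find_if(tasks_.begin(), tasks_.end(),
                           [&task](const RangeTask& t) {
                             return t.smallest == task.smallest &&
                                    t.largest == task.largest &&
                                    t.output_level == task.output_level;
                           });
    assert(it != tasks_.end());
    tasks_.erase(it);
  }

 private:
  std::vector<RangeTask> tasks_;
};
```

In this sketch an ingestion job and a refit job that target the same level with overlapping ranges cannot both be registered at once; the second one has to wait or fail until the first calls `Unregister`.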
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10988
Test Plan:
- New unit test `DBCompactionTestWithOngoingFileIngestionParam*` that failed pre-fix and passed afterwards.
- Made compatible with existing tests, see PR comments
- make check
- [Ongoing] Stress test rehearsal with normal value and aggressive CI value https://github.com/facebook/rocksdb/pull/10761
Reviewed By: cbi42
Differential Revision: D41535685
Pulled By: hx235
fbshipit-source-id: 549833a577ba1496d20a870583d4caa737da1258
2022-12-29 23:05:36 +00:00
|
|
|
|
|
|
|
  // Below are variables used in (un)registering range for this ingestion job
  //
  // FileMetaData used in inputs of compactions equivalent to this ingestion
  // job
  std::vector<FileMetaData*> compaction_input_metdatas_;
  // Compactions equivalent to this ingestion job
  std::vector<Compaction*> file_ingesting_compactions_;
|
2016-10-21 00:05:32 +00:00
|
|
|
};
|
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
} // namespace ROCKSDB_NAMESPACE
|