2017-04-06 00:14:05 +00:00
|
|
|
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
|
2017-07-15 23:03:42 +00:00
|
|
|
// This source code is licensed under both the GPLv2 (found in the
|
|
|
|
// COPYING file in the root directory) and Apache 2.0 License
|
|
|
|
// (found in the LICENSE.Apache file in the root directory).
|
2017-04-06 00:14:05 +00:00
|
|
|
//
|
|
|
|
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
|
|
|
|
// Use of this source code is governed by a BSD-style license that can be
|
|
|
|
// found in the LICENSE file. See the AUTHORS file for names of contributors.
|
2019-06-06 20:52:39 +00:00
|
|
|
#include <cinttypes>
|
2021-08-19 00:39:00 +00:00
|
|
|
#include <deque>
|
2017-04-06 00:14:05 +00:00
|
|
|
|
|
|
|
#include "db/builder.h"
|
2020-08-17 18:52:23 +00:00
|
|
|
#include "db/db_impl/db_impl.h"
|
2018-06-28 19:23:57 +00:00
|
|
|
#include "db/error_handler.h"
|
2017-06-23 02:30:39 +00:00
|
|
|
#include "db/event_helpers.h"
|
2019-05-30 03:44:08 +00:00
|
|
|
#include "file/sst_file_manager_impl.h"
|
2021-09-29 11:01:57 +00:00
|
|
|
#include "logging/logging.h"
|
2017-04-06 02:02:00 +00:00
|
|
|
#include "monitoring/iostats_context_imp.h"
|
|
|
|
#include "monitoring/perf_context_imp.h"
|
|
|
|
#include "monitoring/thread_status_updater.h"
|
|
|
|
#include "monitoring/thread_status_util.h"
|
Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
|
|
|
#include "rocksdb/file_system.h"
|
|
|
|
#include "rocksdb/io_status.h"
|
|
|
|
#include "rocksdb/options.h"
|
|
|
|
#include "rocksdb/table.h"
|
2019-05-30 18:21:38 +00:00
|
|
|
#include "test_util/sync_point.h"
|
2019-08-27 17:57:28 +00:00
|
|
|
#include "util/cast_util.h"
|
2023-07-26 23:25:06 +00:00
|
|
|
#include "util/coding.h"
|
2019-05-31 00:39:43 +00:00
|
|
|
#include "util/concurrent_task_limiter_impl.h"
|
2023-08-30 20:42:04 +00:00
|
|
|
#include "util/udt_util.h"
|
2017-04-06 00:14:05 +00:00
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
namespace ROCKSDB_NAMESPACE {
|
2018-02-09 20:09:55 +00:00
|
|
|
|
2018-04-03 02:53:19 +00:00
|
|
|
bool DBImpl::EnoughRoomForCompaction(
|
Auto recovery from out of space errors (#4164)
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
2018-09-15 20:36:19 +00:00
|
|
|
ColumnFamilyData* cfd, const std::vector<CompactionInputFiles>& inputs,
|
2018-04-03 02:53:19 +00:00
|
|
|
bool* sfm_reserved_compact_space, LogBuffer* log_buffer) {
|
|
|
|
// Check if we have enough room to do the compaction
|
|
|
|
bool enough_room = true;
|
|
|
|
auto sfm = static_cast<SstFileManagerImpl*>(
|
|
|
|
immutable_db_options_.sst_file_manager.get());
|
|
|
|
if (sfm) {
|
Auto recovery from out of space errors (#4164)
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
2018-09-15 20:36:19 +00:00
|
|
|
// Pass the current bg_error_ to SFM so it can decide what checks to
|
|
|
|
// perform. If this DB instance hasn't seen any error yet, the SFM can be
|
|
|
|
// optimistic and not do disk space checks
|
2021-01-06 22:14:01 +00:00
|
|
|
Status bg_error = error_handler_.GetBGError();
|
|
|
|
enough_room = sfm->EnoughRoomForCompaction(cfd, inputs, bg_error);
|
|
|
|
bg_error.PermitUncheckedError(); // bg_error is just a copy of the Status
|
|
|
|
// from the error_handler_
|
2018-04-03 02:53:19 +00:00
|
|
|
if (enough_room) {
|
|
|
|
*sfm_reserved_compact_space = true;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (!enough_room) {
|
|
|
|
// Just in case tests want to change the value of enough_room
|
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"DBImpl::BackgroundCompaction():CancelledCompaction", &enough_room);
|
|
|
|
ROCKS_LOG_BUFFER(log_buffer,
|
|
|
|
"Cancelled compaction because not enough room");
|
|
|
|
RecordTick(stats_, COMPACTION_CANCELLED, 1);
|
|
|
|
}
|
|
|
|
return enough_room;
|
|
|
|
}
|
|
|
|
|
2019-01-02 17:56:39 +00:00
|
|
|
bool DBImpl::RequestCompactionToken(ColumnFamilyData* cfd, bool force,
|
Concurrent task limiter for compaction thread control (#4332)
Summary:
The PR is targeting to resolve the issue of:
https://github.com/facebook/rocksdb/issues/3972#issue-330771918
We have a rocksdb created with leveled-compaction with multiple column families (CFs), some of CFs are using HDD to store big and less frequently accessed data and others are using SSD.
When there are continuously write traffics going on to all CFs, the compaction thread pool is mostly occupied by those slow HDD compactions, which blocks fully utilize SSD bandwidth.
Since atomic write and transaction is needed across CFs, so splitting it to multiple rocksdb instance is not an option for us.
With the compaction thread control, we got 30%+ HDD write throughput gain, and also a lot smooth SSD write since less write stall happening.
ConcurrentTaskLimiter can be shared with multi-CFs across rocksdb instances, so the feature does not only work for multi-CFs scenarios, but also for multi-rocksdbs scenarios, who need disk IO resource control per tenant.
The usage is straight forward:
e.g.:
//
// Enable compaction thread limiter thru ColumnFamilyOptions
//
std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = ctl;
...
//
// Compaction thread limiter can be tuned or disabled on-the-fly
//
ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks
...
ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter
ctl->SetMaxOutstandingTask(-1); // Same as above
...
ctl->SetMaxOutstandingTask(0); // full throttle (0 task)
//
// Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue)
//
std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8));
std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt_ssd1(options);
ColumnFamilyOptions cf_opt_ssd2(options);
ColumnFamilyOptions cf_opt_hdd1(options);
ColumnFamilyOptions cf_opt_hdd2(options);
ColumnFamilyOptions cf_opt_hdd3(options);
// SSD CFs
cf_opt_ssd1.compaction_thread_limiter = ctl_ssd;
cf_opt_ssd2.compaction_thread_limiter = ctl_ssd;
// HDD CFs
cf_opt_hdd1.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd2.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd3.compaction_thread_limiter = ctl_hdd;
...
//
// The limiter is disabled by default (or set to nullptr explicitly)
//
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = nullptr;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332
Differential Revision: D13226590
Pulled By: siying
fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d
2018-12-13 21:16:04 +00:00
|
|
|
std::unique_ptr<TaskLimiterToken>* token,
|
|
|
|
LogBuffer* log_buffer) {
|
|
|
|
assert(*token == nullptr);
|
|
|
|
auto limiter = static_cast<ConcurrentTaskLimiterImpl*>(
|
|
|
|
cfd->ioptions()->compaction_thread_limiter.get());
|
|
|
|
if (limiter == nullptr) {
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
*token = limiter->GetToken(force);
|
|
|
|
if (*token != nullptr) {
|
|
|
|
ROCKS_LOG_BUFFER(log_buffer,
|
2019-01-02 17:56:39 +00:00
|
|
|
"Thread limiter [%s] increase [%s] compaction task, "
|
|
|
|
"force: %s, tasks after: %d",
|
|
|
|
limiter->GetName().c_str(), cfd->GetName().c_str(),
|
|
|
|
force ? "true" : "false", limiter->GetOutstandingTask());
|
Concurrent task limiter for compaction thread control (#4332)
Summary:
The PR is targeting to resolve the issue of:
https://github.com/facebook/rocksdb/issues/3972#issue-330771918
We have a rocksdb created with leveled-compaction with multiple column families (CFs), some of CFs are using HDD to store big and less frequently accessed data and others are using SSD.
When there are continuously write traffics going on to all CFs, the compaction thread pool is mostly occupied by those slow HDD compactions, which blocks fully utilize SSD bandwidth.
Since atomic write and transaction is needed across CFs, so splitting it to multiple rocksdb instance is not an option for us.
With the compaction thread control, we got 30%+ HDD write throughput gain, and also a lot smooth SSD write since less write stall happening.
ConcurrentTaskLimiter can be shared with multi-CFs across rocksdb instances, so the feature does not only work for multi-CFs scenarios, but also for multi-rocksdbs scenarios, who need disk IO resource control per tenant.
The usage is straight forward:
e.g.:
//
// Enable compaction thread limiter thru ColumnFamilyOptions
//
std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = ctl;
...
//
// Compaction thread limiter can be tuned or disabled on-the-fly
//
ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks
...
ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter
ctl->SetMaxOutstandingTask(-1); // Same as above
...
ctl->SetMaxOutstandingTask(0); // full throttle (0 task)
//
// Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue)
//
std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8));
std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt_ssd1(options);
ColumnFamilyOptions cf_opt_ssd2(options);
ColumnFamilyOptions cf_opt_hdd1(options);
ColumnFamilyOptions cf_opt_hdd2(options);
ColumnFamilyOptions cf_opt_hdd3(options);
// SSD CFs
cf_opt_ssd1.compaction_thread_limiter = ctl_ssd;
cf_opt_ssd2.compaction_thread_limiter = ctl_ssd;
// HDD CFs
cf_opt_hdd1.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd2.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd3.compaction_thread_limiter = ctl_hdd;
...
//
// The limiter is disabled by default (or set to nullptr explicitly)
//
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = nullptr;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332
Differential Revision: D13226590
Pulled By: siying
fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d
2018-12-13 21:16:04 +00:00
|
|
|
return true;
|
|
|
|
}
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2023-07-26 23:25:06 +00:00
|
|
|
bool DBImpl::ShouldRescheduleFlushRequestToRetainUDT(
|
|
|
|
const FlushRequest& flush_req) {
|
|
|
|
mutex_.AssertHeld();
|
|
|
|
assert(flush_req.cfd_to_max_mem_id_to_persist.size() == 1);
|
|
|
|
ColumnFamilyData* cfd = flush_req.cfd_to_max_mem_id_to_persist.begin()->first;
|
|
|
|
uint64_t max_memtable_id =
|
|
|
|
flush_req.cfd_to_max_mem_id_to_persist.begin()->second;
|
|
|
|
if (cfd->IsDropped() ||
|
|
|
|
!cfd->ShouldPostponeFlushToRetainUDT(max_memtable_id)) {
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
// Check if holding on the flush will cause entering write stall mode.
|
|
|
|
// Write stall entered because of the accumulation of write buffers can be
|
|
|
|
// alleviated if we continue with the flush instead of postponing it.
|
|
|
|
const auto& mutable_cf_options = *cfd->GetLatestMutableCFOptions();
|
|
|
|
|
|
|
|
// Taking the status of the active Memtable into consideration so that we are
|
|
|
|
// not just checking if DB is currently already in write stall mode.
|
|
|
|
int mem_to_flush = cfd->mem()->ApproximateMemoryUsageFast() >=
|
|
|
|
cfd->mem()->write_buffer_size() / 2
|
|
|
|
? 1
|
|
|
|
: 0;
|
|
|
|
WriteStallCondition write_stall =
|
|
|
|
ColumnFamilyData::GetWriteStallConditionAndCause(
|
|
|
|
cfd->imm()->NumNotFlushed() + mem_to_flush, /*num_l0_files=*/0,
|
|
|
|
/*num_compaction_needed_bytes=*/0, mutable_cf_options,
|
|
|
|
*cfd->ioptions())
|
|
|
|
.first;
|
|
|
|
if (write_stall != WriteStallCondition::kNormal) {
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
|
|
|
IOStatus DBImpl::SyncClosedLogs(const WriteOptions& write_options,
|
|
|
|
JobContext* job_context,
|
2023-10-12 23:55:25 +00:00
|
|
|
VersionEdit* synced_wals,
|
|
|
|
bool error_recovery_in_prog) {
|
2017-04-06 00:14:05 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::SyncClosedLogs:Start");
|
2022-07-21 20:35:36 +00:00
|
|
|
InstrumentedMutexLock l(&log_write_mutex_);
|
2017-04-06 00:14:05 +00:00
|
|
|
autovector<log::Writer*, 1> logs_to_sync;
|
|
|
|
uint64_t current_log_number = logfile_number_;
|
|
|
|
while (logs_.front().number < current_log_number &&
|
2022-06-17 23:45:28 +00:00
|
|
|
logs_.front().IsSyncing()) {
|
2017-04-06 00:14:05 +00:00
|
|
|
log_sync_cv_.Wait();
|
|
|
|
}
|
|
|
|
for (auto it = logs_.begin();
|
|
|
|
it != logs_.end() && it->number < current_log_number; ++it) {
|
|
|
|
auto& log = *it;
|
2022-06-17 23:45:28 +00:00
|
|
|
log.PrepareForSync();
|
2017-04-06 00:14:05 +00:00
|
|
|
logs_to_sync.push_back(log.writer);
|
|
|
|
}
|
|
|
|
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-27 23:03:05 +00:00
|
|
|
IOStatus io_s;
|
2017-04-06 00:14:05 +00:00
|
|
|
if (!logs_to_sync.empty()) {
|
2022-07-21 20:35:36 +00:00
|
|
|
log_write_mutex_.Unlock();
|
2017-04-06 00:14:05 +00:00
|
|
|
|
2021-11-19 17:55:10 +00:00
|
|
|
assert(job_context);
|
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
for (log::Writer* log : logs_to_sync) {
|
|
|
|
ROCKS_LOG_INFO(immutable_db_options_.info_log,
|
|
|
|
"[JOB %d] Syncing log #%" PRIu64, job_context->job_id,
|
|
|
|
log->get_log_number());
|
2023-10-12 23:55:25 +00:00
|
|
|
if (error_recovery_in_prog) {
|
2022-08-10 17:19:20 +00:00
|
|
|
log->file()->reset_seen_error();
|
|
|
|
}
|
Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
|
|
|
|
|
|
|
IOOptions io_options;
|
|
|
|
io_s = WritableFileWriter::PrepareIOOptions(write_options, io_options);
|
|
|
|
if (!io_s.ok()) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
io_s = log->file()->Sync(io_options, immutable_db_options_.use_fsync);
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-27 23:03:05 +00:00
|
|
|
if (!io_s.ok()) {
|
2018-04-19 21:00:52 +00:00
|
|
|
break;
|
|
|
|
}
|
2019-06-07 22:31:40 +00:00
|
|
|
|
|
|
|
if (immutable_db_options_.recycle_log_file_num > 0) {
|
2023-10-12 23:55:25 +00:00
|
|
|
if (error_recovery_in_prog) {
|
2022-08-10 17:19:20 +00:00
|
|
|
log->file()->reset_seen_error();
|
|
|
|
}
|
Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
|
|
|
// TODO: plumb Env::IOActivity, Env::IOPriority
|
|
|
|
io_s = log->Close(WriteOptions());
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-27 23:03:05 +00:00
|
|
|
if (!io_s.ok()) {
|
2019-06-07 22:31:40 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-27 23:03:05 +00:00
|
|
|
if (io_s.ok()) {
|
Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
|
|
|
IOOptions io_options;
|
|
|
|
io_s = WritableFileWriter::PrepareIOOptions(write_options, io_options);
|
|
|
|
if (io_s.ok()) {
|
|
|
|
io_s = directories_.GetWalDir()->FsyncWithDirOptions(
|
|
|
|
io_options, nullptr,
|
|
|
|
DirFsyncOptions(DirFsyncOptions::FsyncReason::kNewFileSynced));
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
2021-11-19 17:55:10 +00:00
|
|
|
TEST_SYNC_POINT_CALLBACK("DBImpl::SyncClosedLogs:BeforeReLock",
|
|
|
|
/*arg=*/nullptr);
|
2022-07-21 20:35:36 +00:00
|
|
|
log_write_mutex_.Lock();
|
2017-04-06 00:14:05 +00:00
|
|
|
|
|
|
|
// "number <= current_log_number - 1" is equivalent to
|
|
|
|
// "number < current_log_number".
|
2020-11-07 00:30:44 +00:00
|
|
|
if (io_s.ok()) {
|
2022-07-21 20:35:36 +00:00
|
|
|
MarkLogsSynced(current_log_number - 1, true, synced_wals);
|
2020-11-07 00:30:44 +00:00
|
|
|
} else {
|
|
|
|
MarkLogsNotSynced(current_log_number - 1);
|
|
|
|
}
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-27 23:03:05 +00:00
|
|
|
if (!io_s.ok()) {
|
2017-04-06 00:14:05 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::SyncClosedLogs:Failed");
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-27 23:03:05 +00:00
|
|
|
return io_s;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
}
|
2021-03-18 21:31:30 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::SyncClosedLogs:end");
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-27 23:03:05 +00:00
|
|
|
return io_s;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
Status DBImpl::FlushMemTableToOutputFile(
|
|
|
|
ColumnFamilyData* cfd, const MutableCFOptions& mutable_cf_options,
|
2023-01-24 17:54:04 +00:00
|
|
|
bool* made_progress, JobContext* job_context, FlushReason flush_reason,
|
2019-01-31 19:53:29 +00:00
|
|
|
SuperVersionContext* superversion_context,
|
|
|
|
std::vector<SequenceNumber>& snapshot_seqs,
|
|
|
|
SequenceNumber earliest_write_conflict_snapshot,
|
2019-03-20 00:24:09 +00:00
|
|
|
SnapshotChecker* snapshot_checker, LogBuffer* log_buffer,
|
|
|
|
Env::Priority thread_pri) {
|
2017-04-06 00:14:05 +00:00
|
|
|
mutex_.AssertHeld();
|
2020-09-15 04:10:09 +00:00
|
|
|
assert(cfd);
|
2021-11-19 17:55:10 +00:00
|
|
|
assert(cfd->imm());
|
2017-04-06 00:14:05 +00:00
|
|
|
assert(cfd->imm()->NumNotFlushed() != 0);
|
|
|
|
assert(cfd->imm()->IsFlushPending());
|
2021-11-19 17:55:10 +00:00
|
|
|
assert(versions_);
|
|
|
|
assert(versions_->GetColumnFamilySet());
|
Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
|
|
|
const ReadOptions read_options(Env::IOActivity::kFlush);
|
|
|
|
const WriteOptions write_options(Env::IOActivity::kFlush);
|
2021-11-19 17:55:10 +00:00
|
|
|
// If there are more than one column families, we need to make sure that
|
|
|
|
// all the log files except the most recent one are synced. Otherwise if
|
|
|
|
// the host crashes after flushing and before WAL is persistent, the
|
|
|
|
// flushed SST may contain data from write batches whose updates to
|
|
|
|
// other (unflushed) column families are missing.
|
|
|
|
const bool needs_to_sync_closed_wals =
|
|
|
|
logfile_number_ > 0 &&
|
|
|
|
versions_->GetColumnFamilySet()->NumberOfColumnFamilies() > 1;
|
2022-03-03 05:03:14 +00:00
|
|
|
|
2021-11-19 17:55:10 +00:00
|
|
|
// If needs_to_sync_closed_wals is true, we need to record the current
|
|
|
|
// maximum memtable ID of this column family so that a later PickMemtables()
|
|
|
|
// call will not pick memtables whose IDs are higher. This is due to the fact
|
|
|
|
// that SyncClosedLogs() may release the db mutex, and memtable switch can
|
|
|
|
// happen for this column family in the meantime. The newly created memtables
|
|
|
|
// have their data backed by unsynced WALs, thus they cannot be included in
|
|
|
|
// this flush job.
|
2022-03-03 05:03:14 +00:00
|
|
|
// Another reason why we must record the current maximum memtable ID of this
|
|
|
|
// column family: SyncClosedLogs() may release db mutex, thus it's possible
|
|
|
|
// for application to continue to insert into memtables increasing db's
|
|
|
|
// sequence number. The application may take a snapshot, but this snapshot is
|
|
|
|
// not included in `snapshot_seqs` which will be passed to flush job because
|
|
|
|
// `snapshot_seqs` has already been computed before this function starts.
|
|
|
|
// Recording the max memtable ID ensures that the flush job does not flush
|
|
|
|
// a memtable without knowing such snapshot(s).
|
2023-10-02 23:26:24 +00:00
|
|
|
uint64_t max_memtable_id =
|
|
|
|
needs_to_sync_closed_wals
|
|
|
|
? cfd->imm()->GetLatestMemTableID(false /* for_atomic_flush */)
|
|
|
|
: std::numeric_limits<uint64_t>::max();
|
2022-03-03 05:03:14 +00:00
|
|
|
|
|
|
|
// If needs_to_sync_closed_wals is false, then the flush job will pick ALL
|
|
|
|
// existing memtables of the column family when PickMemTable() is called
|
|
|
|
// later. Although we won't call SyncClosedLogs() in this case, we may still
|
|
|
|
// call the callbacks of the listeners, i.e. NotifyOnFlushBegin() which also
|
|
|
|
// releases and re-acquires the db mutex. In the meantime, the application
|
|
|
|
// can still insert into the memtables and increase the db's sequence number.
|
|
|
|
// The application can take a snapshot, hoping that the latest visible state
|
2023-10-10 02:10:06 +00:00
|
|
|
// to this snapshot is preserved. This is hard to guarantee since db mutex
|
2022-03-03 05:03:14 +00:00
|
|
|
// not held. This newly-created snapshot is not included in `snapshot_seqs`
|
|
|
|
// and the flush job is unaware of its presence. Consequently, the flush job
|
|
|
|
// may drop certain keys when generating the L0, causing incorrect data to be
|
|
|
|
// returned for snapshot read using this snapshot.
|
|
|
|
// To address this, we make sure NotifyOnFlushBegin() executes after memtable
|
|
|
|
// picking so that no new snapshot can be taken between the two functions.
|
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
FlushJob flush_job(
|
2021-11-19 17:55:10 +00:00
|
|
|
dbname_, cfd, immutable_db_options_, mutable_cf_options, max_memtable_id,
|
|
|
|
file_options_for_compaction_, versions_.get(), &mutex_, &shutting_down_,
|
|
|
|
snapshot_seqs, earliest_write_conflict_snapshot, snapshot_checker,
|
2023-01-24 17:54:04 +00:00
|
|
|
job_context, flush_reason, log_buffer, directories_.GetDbDir(),
|
|
|
|
GetDataDir(cfd, 0U),
|
2017-04-06 00:14:05 +00:00
|
|
|
GetCompressionFlush(*cfd->ioptions(), mutable_cf_options), stats_,
|
2018-10-16 02:59:20 +00:00
|
|
|
&event_logger_, mutable_cf_options.report_bg_io_stats,
|
2020-06-17 17:55:42 +00:00
|
|
|
true /* sync_output_directory */, true /* write_manifest */, thread_pri,
|
Refactor, clean up, fixes, and more testing for SeqnoToTimeMapping (#11905)
Summary:
This change is before a planned DBImpl change to ensure all sufficiently recent sequence numbers since Open are covered by SeqnoToTimeMapping (bug fix with existing test work-arounds). **Intended follow-up**
However, I found enough issues with SeqnoToTimeMapping to warrant this PR first, including very small fixes in DB implementation related to API contract of SeqnoToTimeMapping.
Functional fixes / changes:
* This fixes some mishandling of boundary cases. For example, if the user decides to stop writing to DB, the last written sequence number would perpetually have its write time updated to "now" and would always be ineligible for migration to cold tier. Part of the problem is that the SeqnoToTimeMapping would return a seqno known to have been written before (immediately or otherwise) the requested time, but compaction_job.cc would include that seqno in the preserve/exclude set. That is fixed (in part) by adding one in compaction_job.cc
* That problem was worse because a whole range of seqnos could be updated perpetually with new times in SeqnoToTimeMapping::Append (if no writes to DB). That logic was apparently optimized for GetOldestApproximateTime (now GetProximalTimeBeforeSeqno), which is not used in production, to the detriment of GetOldestSequenceNum (now GetProximalSeqnoBeforeTime), which is used in production. (Perhaps plans changed during development?) This is fixed in Append to optimize for accuracy of GetProximalSeqnoBeforeTime. (Unit tests added and updated.)
* Related: SeqnoToTimeMapping did not have a clear contract about the relationships between seqnos and times, just the idea of a rough correspondence. Now the class description makes it clear that the write time of each recorded seqno comes before or at the associated time, to support getting best results for GetProximalSeqnoBeforeTime. And this makes it easier to make clear the contract of each API function.
* Update `DBImpl::RecordSeqnoToTimeMapping()` to follow this ordering in gathering samples.
Some part of these changes has required an expanded test work-around for the problem (see intended follow-up above) that the DB does not immediately ensure recent seqnos are covered by its mapping. These work-arounds will be removed with that planned work.
An apparent compaction bug is revealed in
PrecludeLastLevelTest::RangeDelsCauseFileEndpointsToOverlap, so that test is disabled. Filed GitHub issue #11909
Cosmetic / code safety things (not exhaustive):
* Fix some confusing names.
* `seqno_time_mapping` was used inconsistently in places. Now just `seqno_to_time_mapping` to correspond to class name.
* Rename confusing `GetOldestSequenceNum` -> `GetProximalSeqnoBeforeTime` and `GetOldestApproximateTime` -> `GetProximalTimeBeforeSeqno`. Part of the motivation is that our times and seqnos here have the same underlying type, so we want to be clear about which is expected where to avoid mixing.
* Rename `kUnknownSeqnoTime` to `kUnknownTimeBeforeAll` because the value is a bad choice for unknown if we ever add ProximalAfterBlah functions.
* Arithmetic on SeqnoTimePair doesn't make sense except for delta encoding, so use better names / APIs with that in mind.
* (OMG) Don't allow direct comparison between SeqnoTimePair and SequenceNumber. (There is no checking that it isn't compared against time by accident.)
* A field name essentially matching the containing class name is a confusing pattern (`seqno_time_mapping_`).
* Wrap calls to confusing (but useful) upper_bound and lower_bound functions to have clearer names and more code reuse.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11905
Test Plan: GetOldestSequenceNum (now GetProximalSeqnoBeforeTime) and TruncateOldEntries were lacking unit tests, despite both being used in production (experimental feature). Added those and expanded others.
Reviewed By: jowlyzhang
Differential Revision: D49755592
Pulled By: pdillinger
fbshipit-source-id: f72a3baac74d24b963c77e538bba89a7fc8dce51
2023-09-29 18:21:59 +00:00
|
|
|
io_tracer_, seqno_to_time_mapping_, db_id_, db_session_id_,
|
2022-07-15 04:49:34 +00:00
|
|
|
cfd->GetFullHistoryTsLow(), &blob_callback_);
|
2017-04-06 00:14:05 +00:00
|
|
|
FileMetaData file_meta;
|
|
|
|
|
|
|
|
Status s;
|
2021-03-18 21:31:30 +00:00
|
|
|
bool need_cancel = false;
|
|
|
|
IOStatus log_io_s = IOStatus::OK();
|
2021-11-19 17:55:10 +00:00
|
|
|
if (needs_to_sync_closed_wals) {
|
2022-07-21 20:35:36 +00:00
|
|
|
// SyncClosedLogs() may unlock and re-lock the log_write_mutex multiple
|
|
|
|
// times.
|
|
|
|
VersionEdit synced_wals;
|
2023-10-12 23:55:25 +00:00
|
|
|
bool error_recovery_in_prog = error_handler_.IsRecoveryInProgress();
|
2022-07-21 20:35:36 +00:00
|
|
|
mutex_.Unlock();
|
Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
|
|
|
log_io_s = SyncClosedLogs(write_options, job_context, &synced_wals,
|
|
|
|
error_recovery_in_prog);
|
2022-07-21 20:35:36 +00:00
|
|
|
mutex_.Lock();
|
|
|
|
if (log_io_s.ok() && synced_wals.IsWalAddition()) {
|
Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
|
|
|
log_io_s = status_to_io_status(
|
|
|
|
ApplyWALToManifest(read_options, write_options, &synced_wals));
|
2022-08-04 19:14:28 +00:00
|
|
|
TEST_SYNC_POINT_CALLBACK("DBImpl::FlushMemTableToOutputFile:CommitWal:1",
|
|
|
|
nullptr);
|
2022-07-21 20:35:36 +00:00
|
|
|
}
|
|
|
|
|
2021-03-18 21:31:30 +00:00
|
|
|
if (!log_io_s.ok() && !log_io_s.IsShutdownInProgress() &&
|
|
|
|
!log_io_s.IsColumnFamilyDropped()) {
|
|
|
|
error_handler_.SetBGError(log_io_s, BackgroundErrorReason::kFlush);
|
2020-09-18 03:22:35 +00:00
|
|
|
}
|
2018-11-13 19:27:32 +00:00
|
|
|
} else {
|
|
|
|
TEST_SYNC_POINT("DBImpl::SyncClosedLogs:Skip");
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
2021-03-19 04:51:21 +00:00
|
|
|
s = log_io_s;
|
2017-04-06 00:14:05 +00:00
|
|
|
|
2021-03-18 21:31:30 +00:00
|
|
|
// If the log sync failed, we do not need to pick memtable. Otherwise,
|
|
|
|
// num_flush_not_started_ needs to be rollback.
|
|
|
|
TEST_SYNC_POINT("DBImpl::FlushMemTableToOutputFile:BeforePickMemtables");
|
Rollback other pending memtable flushes when a flush fails (#11865)
Summary:
when atomic_flush=false, there are certain cases where we try to install memtable results with already deleted SST files. This can happen when the following sequence events happen:
```
Start Flush0 for memtable M0 to SST0
Start Flush1 for memtable M1 to SST1
Flush 1 returns OK, but don't install to MANIFEST and let whoever flushes M0 to take care of it
Flush0 finishes with a retryable IOError, it rollbacks M0, (incorrectly) does not rollback M1, and deletes SST0 and SST1
Starts Flush2 for M0, it does not pick up M1 since it thought M1 is flushed
Flush2 writes SST2 and finishes OK, tries to install SST2 and SST1
Error opening SST1 since it's already deleted with an error message like the following:
IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_3577_4230653031040984171/000011.sst: No such file or directory
```
This happens since:
1. We currently only rollback the memtables that we are flushing in a flush job when atomic_flush=false.
2. Pending output SSTs from previous flushes are deleted since a pending file number is released whenever a flush job is finished no matter of flush status: https://github.com/facebook/rocksdb/blob/f42e70bf561d4be9b6bbe7316d1c2c0c8a3818e6/db/db_impl/db_impl_compaction_flush.cc#L3161
This PR fixes the issue by rollback these pending flushes.
There is another issue where if a new flush for new memtable starts and finishes after Flush0 finishes. Its output may also be deleted (see more in unit test). It is fixed by checking bg error status before installing a memtable result, and rollback if there is an error.
There is a more efficient fix where we just don't release the pending file output number for flushes that delegate installation. It is more efficient since it does not have to rewrite the flush output file. With the fix in this PR, we can end up with a giant file if a lot of memtables are being flushed together. However, the more efficient fix is a bit more complicated to implement (requires associating such pending file numbers with flush job/memtables) and is more risky since it changes normal flush code path.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11865
Test Plan: * Added repro unit tests.
Reviewed By: anand1976
Differential Revision: D49484922
Pulled By: cbi42
fbshipit-source-id: 25b536c08f4e02e7f1d0f86571663737d2b5d53d
2023-09-21 22:31:29 +00:00
|
|
|
// Exit a flush due to bg error should not set bg error again.
|
|
|
|
bool skip_set_bg_error = false;
|
2023-09-22 23:43:50 +00:00
|
|
|
if (s.ok() && !error_handler_.GetBGError().ok() &&
|
|
|
|
error_handler_.IsBGWorkStopped() &&
|
|
|
|
flush_reason != FlushReason::kErrorRecovery &&
|
|
|
|
flush_reason != FlushReason::kErrorRecoveryRetryFlush) {
|
Rollback other pending memtable flushes when a flush fails (#11865)
Summary:
when atomic_flush=false, there are certain cases where we try to install memtable results with already deleted SST files. This can happen when the following sequence events happen:
```
Start Flush0 for memtable M0 to SST0
Start Flush1 for memtable M1 to SST1
Flush 1 returns OK, but don't install to MANIFEST and let whoever flushes M0 to take care of it
Flush0 finishes with a retryable IOError, it rollbacks M0, (incorrectly) does not rollback M1, and deletes SST0 and SST1
Starts Flush2 for M0, it does not pick up M1 since it thought M1 is flushed
Flush2 writes SST2 and finishes OK, tries to install SST2 and SST1
Error opening SST1 since it's already deleted with an error message like the following:
IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_3577_4230653031040984171/000011.sst: No such file or directory
```
This happens since:
1. We currently only rollback the memtables that we are flushing in a flush job when atomic_flush=false.
2. Pending output SSTs from previous flushes are deleted since a pending file number is released whenever a flush job is finished no matter of flush status: https://github.com/facebook/rocksdb/blob/f42e70bf561d4be9b6bbe7316d1c2c0c8a3818e6/db/db_impl/db_impl_compaction_flush.cc#L3161
This PR fixes the issue by rollback these pending flushes.
There is another issue where if a new flush for new memtable starts and finishes after Flush0 finishes. Its output may also be deleted (see more in unit test). It is fixed by checking bg error status before installing a memtable result, and rollback if there is an error.
There is a more efficient fix where we just don't release the pending file output number for flushes that delegate installation. It is more efficient since it does not have to rewrite the flush output file. With the fix in this PR, we can end up with a giant file if a lot of memtables are being flushed together. However, the more efficient fix is a bit more complicated to implement (requires associating such pending file numbers with flush job/memtables) and is more risky since it changes normal flush code path.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11865
Test Plan: * Added repro unit tests.
Reviewed By: anand1976
Differential Revision: D49484922
Pulled By: cbi42
fbshipit-source-id: 25b536c08f4e02e7f1d0f86571663737d2b5d53d
2023-09-21 22:31:29 +00:00
|
|
|
// Error recovery in progress, should not pick memtable which excludes
|
|
|
|
// them from being picked up by recovery flush.
|
|
|
|
// This ensures that when bg error is set, no new flush can pick
|
|
|
|
// memtables.
|
|
|
|
skip_set_bg_error = true;
|
|
|
|
s = error_handler_.GetBGError();
|
|
|
|
assert(!s.ok());
|
2023-09-22 23:43:50 +00:00
|
|
|
ROCKS_LOG_BUFFER(log_buffer,
|
|
|
|
"[JOB %d] Skip flush due to background error %s",
|
|
|
|
job_context->job_id, s.ToString().c_str());
|
Rollback other pending memtable flushes when a flush fails (#11865)
Summary:
when atomic_flush=false, there are certain cases where we try to install memtable results with already deleted SST files. This can happen when the following sequence events happen:
```
Start Flush0 for memtable M0 to SST0
Start Flush1 for memtable M1 to SST1
Flush 1 returns OK, but don't install to MANIFEST and let whoever flushes M0 to take care of it
Flush0 finishes with a retryable IOError, it rollbacks M0, (incorrectly) does not rollback M1, and deletes SST0 and SST1
Starts Flush2 for M0, it does not pick up M1 since it thought M1 is flushed
Flush2 writes SST2 and finishes OK, tries to install SST2 and SST1
Error opening SST1 since it's already deleted with an error message like the following:
IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_3577_4230653031040984171/000011.sst: No such file or directory
```
This happens since:
1. We currently only rollback the memtables that we are flushing in a flush job when atomic_flush=false.
2. Pending output SSTs from previous flushes are deleted since a pending file number is released whenever a flush job is finished no matter of flush status: https://github.com/facebook/rocksdb/blob/f42e70bf561d4be9b6bbe7316d1c2c0c8a3818e6/db/db_impl/db_impl_compaction_flush.cc#L3161
This PR fixes the issue by rollback these pending flushes.
There is another issue where if a new flush for new memtable starts and finishes after Flush0 finishes. Its output may also be deleted (see more in unit test). It is fixed by checking bg error status before installing a memtable result, and rollback if there is an error.
There is a more efficient fix where we just don't release the pending file output number for flushes that delegate installation. It is more efficient since it does not have to rewrite the flush output file. With the fix in this PR, we can end up with a giant file if a lot of memtables are being flushed together. However, the more efficient fix is a bit more complicated to implement (requires associating such pending file numbers with flush job/memtables) and is more risky since it changes normal flush code path.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11865
Test Plan: * Added repro unit tests.
Reviewed By: anand1976
Differential Revision: D49484922
Pulled By: cbi42
fbshipit-source-id: 25b536c08f4e02e7f1d0f86571663737d2b5d53d
2023-09-21 22:31:29 +00:00
|
|
|
}
|
|
|
|
|
2021-03-18 21:31:30 +00:00
|
|
|
if (s.ok()) {
|
|
|
|
flush_job.PickMemTable();
|
|
|
|
need_cancel = true;
|
|
|
|
}
|
2021-11-19 17:55:10 +00:00
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"DBImpl::FlushMemTableToOutputFile:AfterPickMemtables", &flush_job);
|
2022-03-03 05:03:14 +00:00
|
|
|
|
|
|
|
// may temporarily unlock and lock the mutex.
|
2023-01-24 17:54:04 +00:00
|
|
|
NotifyOnFlushBegin(cfd, &file_meta, mutable_cf_options, job_context->job_id,
|
|
|
|
flush_reason);
|
2022-03-03 05:03:14 +00:00
|
|
|
|
2021-08-19 00:39:00 +00:00
|
|
|
bool switched_to_mempurge = false;
|
2017-04-06 00:14:05 +00:00
|
|
|
// Within flush_job.Run, rocksdb may call event listener to notify
|
|
|
|
// file creation and deletion.
|
|
|
|
//
|
|
|
|
// Note that flush_job.Run will unlock and lock the db_mutex,
|
|
|
|
// and EventListener callback will be called when the db_mutex
|
|
|
|
// is unlocked by the current thread.
|
|
|
|
if (s.ok()) {
|
2021-08-19 00:39:00 +00:00
|
|
|
s = flush_job.Run(&logs_with_prep_tracker_, &file_meta,
|
Rollback other pending memtable flushes when a flush fails (#11865)
Summary:
when atomic_flush=false, there are certain cases where we try to install memtable results with already deleted SST files. This can happen when the following sequence events happen:
```
Start Flush0 for memtable M0 to SST0
Start Flush1 for memtable M1 to SST1
Flush 1 returns OK, but don't install to MANIFEST and let whoever flushes M0 to take care of it
Flush0 finishes with a retryable IOError, it rollbacks M0, (incorrectly) does not rollback M1, and deletes SST0 and SST1
Starts Flush2 for M0, it does not pick up M1 since it thought M1 is flushed
Flush2 writes SST2 and finishes OK, tries to install SST2 and SST1
Error opening SST1 since it's already deleted with an error message like the following:
IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_3577_4230653031040984171/000011.sst: No such file or directory
```
This happens since:
1. We currently only rollback the memtables that we are flushing in a flush job when atomic_flush=false.
2. Pending output SSTs from previous flushes are deleted since a pending file number is released whenever a flush job is finished no matter of flush status: https://github.com/facebook/rocksdb/blob/f42e70bf561d4be9b6bbe7316d1c2c0c8a3818e6/db/db_impl/db_impl_compaction_flush.cc#L3161
This PR fixes the issue by rollback these pending flushes.
There is another issue where if a new flush for new memtable starts and finishes after Flush0 finishes. Its output may also be deleted (see more in unit test). It is fixed by checking bg error status before installing a memtable result, and rollback if there is an error.
There is a more efficient fix where we just don't release the pending file output number for flushes that delegate installation. It is more efficient since it does not have to rewrite the flush output file. With the fix in this PR, we can end up with a giant file if a lot of memtables are being flushed together. However, the more efficient fix is a bit more complicated to implement (requires associating such pending file numbers with flush job/memtables) and is more risky since it changes normal flush code path.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11865
Test Plan: * Added repro unit tests.
Reviewed By: anand1976
Differential Revision: D49484922
Pulled By: cbi42
fbshipit-source-id: 25b536c08f4e02e7f1d0f86571663737d2b5d53d
2023-09-21 22:31:29 +00:00
|
|
|
&switched_to_mempurge, &skip_set_bg_error,
|
|
|
|
&error_handler_);
|
2021-03-18 21:31:30 +00:00
|
|
|
need_cancel = false;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!s.ok() && need_cancel) {
|
2017-04-06 00:14:05 +00:00
|
|
|
flush_job.Cancel();
|
|
|
|
}
|
|
|
|
|
|
|
|
if (s.ok()) {
|
2018-08-24 20:17:29 +00:00
|
|
|
InstallSuperVersionAndScheduleWork(cfd, superversion_context,
|
|
|
|
mutable_cf_options);
|
2017-04-06 00:14:05 +00:00
|
|
|
if (made_progress) {
|
2018-10-16 02:59:20 +00:00
|
|
|
*made_progress = true;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
2020-09-15 04:10:09 +00:00
|
|
|
|
|
|
|
const std::string& column_family_name = cfd->GetName();
|
|
|
|
|
|
|
|
Version* const current = cfd->current();
|
|
|
|
assert(current);
|
|
|
|
|
|
|
|
const VersionStorageInfo* const storage_info = current->storage_info();
|
|
|
|
assert(storage_info);
|
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
VersionStorageInfo::LevelSummaryStorage tmp;
|
|
|
|
ROCKS_LOG_BUFFER(log_buffer, "[%s] Level summary: %s\n",
|
2020-09-15 04:10:09 +00:00
|
|
|
column_family_name.c_str(),
|
|
|
|
storage_info->LevelSummary(&tmp));
|
|
|
|
|
|
|
|
const auto& blob_files = storage_info->GetBlobFiles();
|
|
|
|
if (!blob_files.empty()) {
|
Use a sorted vector instead of a map to store blob file metadata (#9526)
Summary:
The patch replaces `std::map` with a sorted `std::vector` for
`VersionStorageInfo::blob_files_` and preallocates the space
for the `vector` before saving the `BlobFileMetaData` into the
new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`.
These changes reduce the time the DB mutex is held while
saving new `Version`s, and using a sorted `vector` also makes
lookups faster thanks to better memory locality.
In addition, the patch introduces helper methods
`VersionStorageInfo::GetBlobFileMetaData` and
`VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by
clients to perform lookups in the `vector`, and does some general
cleanup in the parts of code where blob file metadata are used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526
Test Plan:
Ran `make check` and the crash test script for a while.
Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced:
```
numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value>
```
Final statistics before the patch:
```
Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s
Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s
```
With the patch:
```
Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s
Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s
```
Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs.
Reviewed By: riversand963
Differential Revision: D34082728
Pulled By: ltamasi
fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
2022-02-09 20:35:39 +00:00
|
|
|
assert(blob_files.front());
|
|
|
|
assert(blob_files.back());
|
|
|
|
|
|
|
|
ROCKS_LOG_BUFFER(
|
|
|
|
log_buffer,
|
|
|
|
"[%s] Blob file summary: head=%" PRIu64 ", tail=%" PRIu64 "\n",
|
|
|
|
column_family_name.c_str(), blob_files.front()->GetBlobFileNumber(),
|
|
|
|
blob_files.back()->GetBlobFileNumber());
|
2020-09-15 04:10:09 +00:00
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
Rollback other pending memtable flushes when a flush fails (#11865)
Summary:
when atomic_flush=false, there are certain cases where we try to install memtable results with already deleted SST files. This can happen when the following sequence events happen:
```
Start Flush0 for memtable M0 to SST0
Start Flush1 for memtable M1 to SST1
Flush 1 returns OK, but don't install to MANIFEST and let whoever flushes M0 to take care of it
Flush0 finishes with a retryable IOError, it rollbacks M0, (incorrectly) does not rollback M1, and deletes SST0 and SST1
Starts Flush2 for M0, it does not pick up M1 since it thought M1 is flushed
Flush2 writes SST2 and finishes OK, tries to install SST2 and SST1
Error opening SST1 since it's already deleted with an error message like the following:
IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_3577_4230653031040984171/000011.sst: No such file or directory
```
This happens since:
1. We currently only rollback the memtables that we are flushing in a flush job when atomic_flush=false.
2. Pending output SSTs from previous flushes are deleted since a pending file number is released whenever a flush job is finished no matter of flush status: https://github.com/facebook/rocksdb/blob/f42e70bf561d4be9b6bbe7316d1c2c0c8a3818e6/db/db_impl/db_impl_compaction_flush.cc#L3161
This PR fixes the issue by rollback these pending flushes.
There is another issue where if a new flush for new memtable starts and finishes after Flush0 finishes. Its output may also be deleted (see more in unit test). It is fixed by checking bg error status before installing a memtable result, and rollback if there is an error.
There is a more efficient fix where we just don't release the pending file output number for flushes that delegate installation. It is more efficient since it does not have to rewrite the flush output file. With the fix in this PR, we can end up with a giant file if a lot of memtables are being flushed together. However, the more efficient fix is a bit more complicated to implement (requires associating such pending file numbers with flush job/memtables) and is more risky since it changes normal flush code path.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11865
Test Plan: * Added repro unit tests.
Reviewed By: anand1976
Differential Revision: D49484922
Pulled By: cbi42
fbshipit-source-id: 25b536c08f4e02e7f1d0f86571663737d2b5d53d
2023-09-21 22:31:29 +00:00
|
|
|
if (!s.ok() && !s.IsShutdownInProgress() && !s.IsColumnFamilyDropped() &&
|
|
|
|
!skip_set_bg_error) {
|
2022-03-15 21:45:34 +00:00
|
|
|
if (log_io_s.ok()) {
|
First step towards handling MANIFEST write error (#6949)
Summary:
This PR provides preliminary support for handling IO error during MANIFEST write.
File write/sync is not guaranteed to be atomic. If we encounter an IOError while writing/syncing to the MANIFEST file, we cannot be sure about the state of the MANIFEST file. The version edits may or may not have reached the file. During cleanup, if we delete the newly-generated SST files referenced by the pending version edit(s), but the version edit(s) actually are persistent in the MANIFEST, then next recovery attempt will process the version edits(s) and then fail since the SST files have already been deleted.
One approach is to truncate the MANIFEST after write/sync error, so that it is safe to delete the SST files. However, file truncation may not be supported on certain file systems. Therefore, we take the following approach.
If an IOError is detected during MANIFEST write/sync, we disable file deletions for the faulty database. Depending on whether the IOError is retryable (set by underlying file system), either RocksDB or application can call `DB::Resume()`, or simply shutdown and restart. During `Resume()`, RocksDB will try to switch to a new MANIFEST and write all existing in-memory version storage in the new file. If this succeeds, then RocksDB may proceed. If all recovery is completed, then file deletions will be re-enabled.
Note that multiple threads can call `LogAndApply()` at the same time, though only one of them will be going through the process MANIFEST write, possibly batching the version edits of other threads. When the leading MANIFEST writer finishes, all of the MANIFEST writing threads in this batch will have the same IOError. They will all call `ErrorHandler::SetBGError()` in which file deletion will be disabled.
Possible future directions:
- Add an `ErrorContext` structure so that it is easier to pass more info to `ErrorHandler`. Currently, as in this example, a new `BackgroundErrorReason` has to be added.
Test plan (dev server):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6949
Reviewed By: anand1976
Differential Revision: D22026020
Pulled By: riversand963
fbshipit-source-id: f3c68a2ef45d9b505d0d625c7c5e0c88495b91c8
2020-06-25 02:05:47 +00:00
|
|
|
// Error while writing to MANIFEST.
|
|
|
|
// In fact, versions_->io_status() can also be the result of renaming
|
|
|
|
// CURRENT file. With current code, it's just difficult to tell. So just
|
|
|
|
// be pessimistic and try write to a new MANIFEST.
|
|
|
|
// TODO: distinguish between MANIFEST write and CURRENT renaming
|
2020-09-18 03:22:35 +00:00
|
|
|
if (!versions_->io_status().ok()) {
|
2021-03-26 04:41:44 +00:00
|
|
|
// If WAL sync is successful (either WAL size is 0 or there is no IO
|
|
|
|
// error), all the Manifest write will be map to soft error.
|
|
|
|
// TODO: kManifestWriteNoWAL and kFlushNoWAL are misleading. Refactor is
|
|
|
|
// needed.
|
2022-03-15 21:45:34 +00:00
|
|
|
error_handler_.SetBGError(s,
|
2021-03-26 04:41:44 +00:00
|
|
|
BackgroundErrorReason::kManifestWriteNoWAL);
|
2020-09-18 03:22:35 +00:00
|
|
|
} else {
|
2021-03-26 04:41:44 +00:00
|
|
|
// If WAL sync is successful (either WAL size is 0 or there is no IO
|
|
|
|
// error), all the other SST file write errors will be set as
|
|
|
|
// kFlushNoWAL.
|
2022-03-15 21:45:34 +00:00
|
|
|
error_handler_.SetBGError(s, BackgroundErrorReason::kFlushNoWAL);
|
2020-09-18 03:22:35 +00:00
|
|
|
}
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-27 23:03:05 +00:00
|
|
|
} else {
|
2022-03-15 21:45:34 +00:00
|
|
|
assert(s == log_io_s);
|
|
|
|
Status new_bg_error = s;
|
|
|
|
error_handler_.SetBGError(new_bg_error, BackgroundErrorReason::kFlush);
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-27 23:03:05 +00:00
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
2021-08-19 00:39:00 +00:00
|
|
|
// If flush ran smoothly and no mempurge happened
|
|
|
|
// install new SST file path.
|
|
|
|
if (s.ok() && (!switched_to_mempurge)) {
|
2017-04-06 00:14:05 +00:00
|
|
|
// may temporarily unlock and lock the mutex.
|
2019-10-16 17:39:00 +00:00
|
|
|
NotifyOnFlushCompleted(cfd, mutable_cf_options,
|
|
|
|
flush_job.GetCommittedFlushJobsInfo());
|
2017-04-06 00:14:05 +00:00
|
|
|
auto sfm = static_cast<SstFileManagerImpl*>(
|
|
|
|
immutable_db_options_.sst_file_manager.get());
|
|
|
|
if (sfm) {
|
|
|
|
// Notify sst_file_manager that a new file was added
|
|
|
|
std::string file_path = MakeTableFileName(
|
2018-04-06 02:49:06 +00:00
|
|
|
cfd->ioptions()->cf_paths[0].path, file_meta.fd.GetNumber());
|
2021-01-04 19:07:10 +00:00
|
|
|
// TODO (PR7798). We should only add the file to the FileManager if it
|
|
|
|
// exists. Otherwise, some tests may fail. Ignore the error in the
|
|
|
|
// interim.
|
|
|
|
sfm->OnAddFile(file_path).PermitUncheckedError();
|
2018-06-28 19:23:57 +00:00
|
|
|
if (sfm->IsMaxAllowedSpaceReached()) {
|
2019-03-27 23:13:08 +00:00
|
|
|
Status new_bg_error =
|
|
|
|
Status::SpaceLimit("Max allowed space was reached");
|
2017-04-06 00:14:05 +00:00
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"DBImpl::FlushMemTableToOutputFile:MaxAllowedSpaceReached",
|
2017-06-23 02:30:39 +00:00
|
|
|
&new_bg_error);
|
2020-12-08 04:09:55 +00:00
|
|
|
error_handler_.SetBGError(new_bg_error, BackgroundErrorReason::kFlush);
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2019-10-16 17:39:00 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::FlushMemTableToOutputFile:Finish");
|
2017-04-06 00:14:05 +00:00
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
2018-08-24 20:17:29 +00:00
|
|
|
Status DBImpl::FlushMemTablesToOutputFiles(
|
|
|
|
const autovector<BGFlushArg>& bg_flush_args, bool* made_progress,
|
2019-03-20 00:24:09 +00:00
|
|
|
JobContext* job_context, LogBuffer* log_buffer, Env::Priority thread_pri) {
|
2018-11-12 20:22:10 +00:00
|
|
|
if (immutable_db_options_.atomic_flush) {
|
2019-03-27 23:13:08 +00:00
|
|
|
return AtomicFlushMemTablesToOutputFiles(
|
|
|
|
bg_flush_args, made_progress, job_context, log_buffer, thread_pri);
|
2018-10-26 22:06:44 +00:00
|
|
|
}
|
2020-12-02 17:29:50 +00:00
|
|
|
assert(bg_flush_args.size() == 1);
|
2019-01-31 19:53:29 +00:00
|
|
|
std::vector<SequenceNumber> snapshot_seqs;
|
|
|
|
SequenceNumber earliest_write_conflict_snapshot;
|
|
|
|
SnapshotChecker* snapshot_checker;
|
|
|
|
GetSnapshotContext(job_context, &snapshot_seqs,
|
|
|
|
&earliest_write_conflict_snapshot, &snapshot_checker);
|
2020-12-02 17:29:50 +00:00
|
|
|
const auto& bg_flush_arg = bg_flush_args[0];
|
|
|
|
ColumnFamilyData* cfd = bg_flush_arg.cfd_;
|
2022-03-08 19:26:40 +00:00
|
|
|
// intentional infrequent copy for each flush
|
|
|
|
MutableCFOptions mutable_cf_options_copy = *cfd->GetLatestMutableCFOptions();
|
2020-12-02 17:29:50 +00:00
|
|
|
SuperVersionContext* superversion_context =
|
|
|
|
bg_flush_arg.superversion_context_;
|
2023-01-24 17:54:04 +00:00
|
|
|
FlushReason flush_reason = bg_flush_arg.flush_reason_;
|
2020-12-02 17:29:50 +00:00
|
|
|
Status s = FlushMemTableToOutputFile(
|
2023-01-24 17:54:04 +00:00
|
|
|
cfd, mutable_cf_options_copy, made_progress, job_context, flush_reason,
|
2022-03-08 19:26:40 +00:00
|
|
|
superversion_context, snapshot_seqs, earliest_write_conflict_snapshot,
|
|
|
|
snapshot_checker, log_buffer, thread_pri);
|
2020-12-02 17:29:50 +00:00
|
|
|
return s;
|
2018-08-24 20:17:29 +00:00
|
|
|
}
|
|
|
|
|
2018-10-16 02:59:20 +00:00
|
|
|
/*
|
|
|
|
* Atomically flushes multiple column families.
|
|
|
|
*
|
|
|
|
* For each column family, all memtables with ID smaller than or equal to the
|
|
|
|
* ID specified in bg_flush_args will be flushed. Only after all column
|
|
|
|
* families finish flush will this function commit to MANIFEST. If any of the
|
|
|
|
* column families are not flushed successfully, this function does not have
|
|
|
|
* any side-effect on the state of the database.
|
|
|
|
*/
|
|
|
|
Status DBImpl::AtomicFlushMemTablesToOutputFiles(
|
|
|
|
const autovector<BGFlushArg>& bg_flush_args, bool* made_progress,
|
2019-03-20 00:24:09 +00:00
|
|
|
JobContext* job_context, LogBuffer* log_buffer, Env::Priority thread_pri) {
|
2018-10-16 02:59:20 +00:00
|
|
|
mutex_.AssertHeld();
|
Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
|
|
|
const ReadOptions read_options(Env::IOActivity::kFlush);
|
|
|
|
const WriteOptions write_options(Env::IOActivity::kFlush);
|
2018-10-16 02:59:20 +00:00
|
|
|
|
|
|
|
autovector<ColumnFamilyData*> cfds;
|
|
|
|
for (const auto& arg : bg_flush_args) {
|
|
|
|
cfds.emplace_back(arg.cfd_);
|
|
|
|
}
|
|
|
|
|
|
|
|
#ifndef NDEBUG
|
|
|
|
for (const auto cfd : cfds) {
|
|
|
|
assert(cfd->imm()->NumNotFlushed() != 0);
|
|
|
|
assert(cfd->imm()->IsFlushPending());
|
2023-01-24 17:54:04 +00:00
|
|
|
}
|
2023-02-22 13:44:03 +00:00
|
|
|
for (const auto& bg_flush_arg : bg_flush_args) {
|
2023-01-24 17:54:04 +00:00
|
|
|
assert(bg_flush_arg.flush_reason_ == bg_flush_args[0].flush_reason_);
|
2018-10-16 02:59:20 +00:00
|
|
|
}
|
|
|
|
#endif /* !NDEBUG */
|
|
|
|
|
2019-01-16 05:32:15 +00:00
|
|
|
std::vector<SequenceNumber> snapshot_seqs;
|
2018-10-16 02:59:20 +00:00
|
|
|
SequenceNumber earliest_write_conflict_snapshot;
|
2019-01-16 05:32:15 +00:00
|
|
|
SnapshotChecker* snapshot_checker;
|
|
|
|
GetSnapshotContext(job_context, &snapshot_seqs,
|
|
|
|
&earliest_write_conflict_snapshot, &snapshot_checker);
|
2018-10-16 02:59:20 +00:00
|
|
|
|
2020-03-03 00:14:00 +00:00
|
|
|
autovector<FSDirectory*> distinct_output_dirs;
|
2019-02-12 20:01:55 +00:00
|
|
|
autovector<std::string> distinct_output_dir_paths;
|
2019-10-16 17:39:00 +00:00
|
|
|
std::vector<std::unique_ptr<FlushJob>> jobs;
|
2019-01-12 01:40:44 +00:00
|
|
|
std::vector<MutableCFOptions> all_mutable_cf_options;
|
2018-10-16 02:59:20 +00:00
|
|
|
int num_cfs = static_cast<int>(cfds.size());
|
2019-01-12 01:40:44 +00:00
|
|
|
all_mutable_cf_options.reserve(num_cfs);
|
2018-10-16 02:59:20 +00:00
|
|
|
for (int i = 0; i < num_cfs; ++i) {
|
|
|
|
auto cfd = cfds[i];
|
2020-03-03 00:14:00 +00:00
|
|
|
FSDirectory* data_dir = GetDataDir(cfd, 0U);
|
2019-02-12 20:01:55 +00:00
|
|
|
const std::string& curr_path = cfd->ioptions()->cf_paths[0].path;
|
2018-10-16 02:59:20 +00:00
|
|
|
|
|
|
|
// Add to distinct output directories if eligible. Use linear search. Since
|
|
|
|
// the number of elements in the vector is not large, performance should be
|
|
|
|
// tolerable.
|
|
|
|
bool found = false;
|
2019-02-12 20:01:55 +00:00
|
|
|
for (const auto& path : distinct_output_dir_paths) {
|
|
|
|
if (path == curr_path) {
|
2018-10-16 02:59:20 +00:00
|
|
|
found = true;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (!found) {
|
2019-02-12 20:01:55 +00:00
|
|
|
distinct_output_dir_paths.emplace_back(curr_path);
|
2018-10-16 02:59:20 +00:00
|
|
|
distinct_output_dirs.emplace_back(data_dir);
|
|
|
|
}
|
|
|
|
|
2019-01-12 01:40:44 +00:00
|
|
|
all_mutable_cf_options.emplace_back(*cfd->GetLatestMutableCFOptions());
|
|
|
|
const MutableCFOptions& mutable_cf_options = all_mutable_cf_options.back();
|
2020-12-02 17:29:50 +00:00
|
|
|
uint64_t max_memtable_id = bg_flush_args[i].max_memtable_id_;
|
2023-01-24 17:54:04 +00:00
|
|
|
FlushReason flush_reason = bg_flush_args[i].flush_reason_;
|
2019-10-16 17:39:00 +00:00
|
|
|
jobs.emplace_back(new FlushJob(
|
2019-02-12 20:01:55 +00:00
|
|
|
dbname_, cfd, immutable_db_options_, mutable_cf_options,
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
2019-12-13 22:47:08 +00:00
|
|
|
max_memtable_id, file_options_for_compaction_, versions_.get(), &mutex_,
|
2018-10-16 02:59:20 +00:00
|
|
|
&shutting_down_, snapshot_seqs, earliest_write_conflict_snapshot,
|
2023-01-24 17:54:04 +00:00
|
|
|
snapshot_checker, job_context, flush_reason, log_buffer,
|
|
|
|
directories_.GetDbDir(), data_dir,
|
|
|
|
GetCompressionFlush(*cfd->ioptions(), mutable_cf_options), stats_,
|
|
|
|
&event_logger_, mutable_cf_options.report_bg_io_stats,
|
2019-03-20 00:24:09 +00:00
|
|
|
false /* sync_output_directory */, false /* write_manifest */,
|
Refactor, clean up, fixes, and more testing for SeqnoToTimeMapping (#11905)
Summary:
This change is before a planned DBImpl change to ensure all sufficiently recent sequence numbers since Open are covered by SeqnoToTimeMapping (bug fix with existing test work-arounds). **Intended follow-up**
However, I found enough issues with SeqnoToTimeMapping to warrant this PR first, including very small fixes in DB implementation related to API contract of SeqnoToTimeMapping.
Functional fixes / changes:
* This fixes some mishandling of boundary cases. For example, if the user decides to stop writing to DB, the last written sequence number would perpetually have its write time updated to "now" and would always be ineligible for migration to cold tier. Part of the problem is that the SeqnoToTimeMapping would return a seqno known to have been written before (immediately or otherwise) the requested time, but compaction_job.cc would include that seqno in the preserve/exclude set. That is fixed (in part) by adding one in compaction_job.cc
* That problem was worse because a whole range of seqnos could be updated perpetually with new times in SeqnoToTimeMapping::Append (if no writes to DB). That logic was apparently optimized for GetOldestApproximateTime (now GetProximalTimeBeforeSeqno), which is not used in production, to the detriment of GetOldestSequenceNum (now GetProximalSeqnoBeforeTime), which is used in production. (Perhaps plans changed during development?) This is fixed in Append to optimize for accuracy of GetProximalSeqnoBeforeTime. (Unit tests added and updated.)
* Related: SeqnoToTimeMapping did not have a clear contract about the relationships between seqnos and times, just the idea of a rough correspondence. Now the class description makes it clear that the write time of each recorded seqno comes before or at the associated time, to support getting best results for GetProximalSeqnoBeforeTime. And this makes it easier to make clear the contract of each API function.
* Update `DBImpl::RecordSeqnoToTimeMapping()` to follow this ordering in gathering samples.
Some part of these changes has required an expanded test work-around for the problem (see intended follow-up above) that the DB does not immediately ensure recent seqnos are covered by its mapping. These work-arounds will be removed with that planned work.
An apparent compaction bug is revealed in
PrecludeLastLevelTest::RangeDelsCauseFileEndpointsToOverlap, so that test is disabled. Filed GitHub issue #11909
Cosmetic / code safety things (not exhaustive):
* Fix some confusing names.
* `seqno_time_mapping` was used inconsistently in places. Now just `seqno_to_time_mapping` to correspond to class name.
* Rename confusing `GetOldestSequenceNum` -> `GetProximalSeqnoBeforeTime` and `GetOldestApproximateTime` -> `GetProximalTimeBeforeSeqno`. Part of the motivation is that our times and seqnos here have the same underlying type, so we want to be clear about which is expected where to avoid mixing.
* Rename `kUnknownSeqnoTime` to `kUnknownTimeBeforeAll` because the value is a bad choice for unknown if we ever add ProximalAfterBlah functions.
* Arithmetic on SeqnoTimePair doesn't make sense except for delta encoding, so use better names / APIs with that in mind.
* (OMG) Don't allow direct comparison between SeqnoTimePair and SequenceNumber. (There is no checking that it isn't compared against time by accident.)
* A field name essentially matching the containing class name is a confusing pattern (`seqno_time_mapping_`).
* Wrap calls to confusing (but useful) upper_bound and lower_bound functions to have clearer names and more code reuse.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11905
Test Plan: GetOldestSequenceNum (now GetProximalSeqnoBeforeTime) and TruncateOldEntries were lacking unit tests, despite both being used in production (experimental feature). Added those and expanded others.
Reviewed By: jowlyzhang
Differential Revision: D49755592
Pulled By: pdillinger
fbshipit-source-id: f72a3baac74d24b963c77e538bba89a7fc8dce51
2023-09-29 18:21:59 +00:00
|
|
|
thread_pri, io_tracer_, seqno_to_time_mapping_, db_id_, db_session_id_,
|
2021-08-20 18:37:53 +00:00
|
|
|
cfd->GetFullHistoryTsLow(), &blob_callback_));
|
2018-10-16 02:59:20 +00:00
|
|
|
}
|
|
|
|
|
2019-01-31 22:28:53 +00:00
|
|
|
std::vector<FileMetaData> file_meta(num_cfs);
|
2021-08-19 00:39:00 +00:00
|
|
|
// Use of deque<bool> because vector<bool>
|
|
|
|
// is specific and doesn't allow &v[i].
|
|
|
|
std::deque<bool> switched_to_mempurge(num_cfs, false);
|
2018-10-16 02:59:20 +00:00
|
|
|
Status s;
|
2021-03-18 21:31:30 +00:00
|
|
|
IOStatus log_io_s = IOStatus::OK();
|
2018-10-16 02:59:20 +00:00
|
|
|
assert(num_cfs == static_cast<int>(jobs.size()));
|
|
|
|
|
2019-01-31 22:28:53 +00:00
|
|
|
for (int i = 0; i != num_cfs; ++i) {
|
|
|
|
const MutableCFOptions& mutable_cf_options = all_mutable_cf_options.at(i);
|
2018-10-16 02:59:20 +00:00
|
|
|
// may temporarily unlock and lock the mutex.
|
2023-01-24 17:54:04 +00:00
|
|
|
FlushReason flush_reason = bg_flush_args[i].flush_reason_;
|
2018-10-16 02:59:20 +00:00
|
|
|
NotifyOnFlushBegin(cfds[i], &file_meta[i], mutable_cf_options,
|
2023-01-24 17:54:04 +00:00
|
|
|
job_context->job_id, flush_reason);
|
2018-10-16 02:59:20 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
if (logfile_number_ > 0) {
|
|
|
|
// TODO (yanqin) investigate whether we should sync the closed logs for
|
|
|
|
// single column family case.
|
2022-07-21 20:35:36 +00:00
|
|
|
VersionEdit synced_wals;
|
2023-10-12 23:55:25 +00:00
|
|
|
bool error_recovery_in_prog = error_handler_.IsRecoveryInProgress();
|
2022-07-21 20:35:36 +00:00
|
|
|
mutex_.Unlock();
|
Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
|
|
|
log_io_s = SyncClosedLogs(write_options, job_context, &synced_wals,
|
|
|
|
error_recovery_in_prog);
|
2022-07-21 20:35:36 +00:00
|
|
|
mutex_.Lock();
|
|
|
|
if (log_io_s.ok() && synced_wals.IsWalAddition()) {
|
Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
|
|
|
log_io_s = status_to_io_status(
|
|
|
|
ApplyWALToManifest(read_options, write_options, &synced_wals));
|
2022-07-21 20:35:36 +00:00
|
|
|
}
|
|
|
|
|
2021-03-18 21:31:30 +00:00
|
|
|
if (!log_io_s.ok() && !log_io_s.IsShutdownInProgress() &&
|
|
|
|
!log_io_s.IsColumnFamilyDropped()) {
|
|
|
|
if (total_log_size_ > 0) {
|
|
|
|
error_handler_.SetBGError(log_io_s, BackgroundErrorReason::kFlush);
|
|
|
|
} else {
|
|
|
|
// If the WAL is empty, we use different error reason
|
|
|
|
error_handler_.SetBGError(log_io_s, BackgroundErrorReason::kFlushNoWAL);
|
|
|
|
}
|
|
|
|
}
|
2018-10-16 02:59:20 +00:00
|
|
|
}
|
2021-03-19 04:51:21 +00:00
|
|
|
s = log_io_s;
|
2018-10-16 02:59:20 +00:00
|
|
|
|
2018-10-26 22:06:44 +00:00
|
|
|
// exec_status stores the execution status of flush_jobs as
|
|
|
|
// <bool /* executed */, Status /* status code */>
|
|
|
|
autovector<std::pair<bool, Status>> exec_status;
|
2021-03-18 21:31:30 +00:00
|
|
|
std::vector<bool> pick_status;
|
2018-10-26 22:06:44 +00:00
|
|
|
for (int i = 0; i != num_cfs; ++i) {
|
|
|
|
// Initially all jobs are not executed, with status OK.
|
2018-12-13 23:10:16 +00:00
|
|
|
exec_status.emplace_back(false, Status::OK());
|
2021-03-18 21:31:30 +00:00
|
|
|
pick_status.push_back(false);
|
|
|
|
}
|
|
|
|
|
2023-09-22 23:43:50 +00:00
|
|
|
bool flush_for_recovery =
|
|
|
|
bg_flush_args[0].flush_reason_ == FlushReason::kErrorRecovery ||
|
|
|
|
bg_flush_args[0].flush_reason_ == FlushReason::kErrorRecoveryRetryFlush;
|
|
|
|
bool skip_set_bg_error = false;
|
|
|
|
|
|
|
|
if (s.ok() && !error_handler_.GetBGError().ok() &&
|
|
|
|
error_handler_.IsBGWorkStopped() && !flush_for_recovery) {
|
|
|
|
s = error_handler_.GetBGError();
|
|
|
|
skip_set_bg_error = true;
|
|
|
|
assert(!s.ok());
|
|
|
|
ROCKS_LOG_BUFFER(log_buffer,
|
|
|
|
"[JOB %d] Skip flush due to background error %s",
|
|
|
|
job_context->job_id, s.ToString().c_str());
|
|
|
|
}
|
|
|
|
|
2021-03-18 21:31:30 +00:00
|
|
|
if (s.ok()) {
|
|
|
|
for (int i = 0; i != num_cfs; ++i) {
|
|
|
|
jobs[i]->PickMemTable();
|
|
|
|
pick_status[i] = true;
|
|
|
|
}
|
2018-10-26 22:06:44 +00:00
|
|
|
}
|
|
|
|
|
2018-10-16 02:59:20 +00:00
|
|
|
if (s.ok()) {
|
2021-08-19 00:39:00 +00:00
|
|
|
assert(switched_to_mempurge.size() ==
|
|
|
|
static_cast<long unsigned int>(num_cfs));
|
2018-10-16 02:59:20 +00:00
|
|
|
// TODO (yanqin): parallelize jobs with threads.
|
2018-11-15 04:52:21 +00:00
|
|
|
for (int i = 1; i != num_cfs; ++i) {
|
2018-10-26 22:06:44 +00:00
|
|
|
exec_status[i].second =
|
2021-08-19 00:39:00 +00:00
|
|
|
jobs[i]->Run(&logs_with_prep_tracker_, &file_meta[i],
|
|
|
|
&(switched_to_mempurge.at(i)));
|
2018-10-26 22:06:44 +00:00
|
|
|
exec_status[i].first = true;
|
2018-10-16 02:59:20 +00:00
|
|
|
}
|
2018-11-15 04:52:21 +00:00
|
|
|
if (num_cfs > 1) {
|
|
|
|
TEST_SYNC_POINT(
|
|
|
|
"DBImpl::AtomicFlushMemTablesToOutputFiles:SomeFlushJobsComplete:1");
|
|
|
|
TEST_SYNC_POINT(
|
|
|
|
"DBImpl::AtomicFlushMemTablesToOutputFiles:SomeFlushJobsComplete:2");
|
|
|
|
}
|
2019-10-16 17:39:00 +00:00
|
|
|
assert(exec_status.size() > 0);
|
|
|
|
assert(!file_meta.empty());
|
2021-08-19 00:39:00 +00:00
|
|
|
exec_status[0].second = jobs[0]->Run(
|
|
|
|
&logs_with_prep_tracker_, file_meta.data() /* &file_meta[0] */,
|
|
|
|
switched_to_mempurge.empty() ? nullptr : &(switched_to_mempurge.at(0)));
|
2018-12-13 23:10:16 +00:00
|
|
|
exec_status[0].first = true;
|
|
|
|
|
|
|
|
Status error_status;
|
|
|
|
for (const auto& e : exec_status) {
|
|
|
|
if (!e.second.ok()) {
|
|
|
|
s = e.second;
|
2019-05-20 17:37:37 +00:00
|
|
|
if (!e.second.IsShutdownInProgress() &&
|
|
|
|
!e.second.IsColumnFamilyDropped()) {
|
2018-12-13 23:10:16 +00:00
|
|
|
// If a flush job did not return OK, and the CF is not dropped, and
|
|
|
|
// the DB is not shutting down, then we have to return this result to
|
|
|
|
// caller later.
|
|
|
|
error_status = e.second;
|
|
|
|
}
|
2018-11-15 04:52:21 +00:00
|
|
|
}
|
|
|
|
}
|
2018-12-13 23:10:16 +00:00
|
|
|
|
|
|
|
s = error_status.ok() ? s : error_status;
|
2018-10-16 02:59:20 +00:00
|
|
|
}
|
|
|
|
|
2019-05-20 17:37:37 +00:00
|
|
|
if (s.IsColumnFamilyDropped()) {
|
2019-04-29 19:29:57 +00:00
|
|
|
s = Status::OK();
|
|
|
|
}
|
|
|
|
|
2020-02-07 18:50:17 +00:00
|
|
|
if (s.ok() || s.IsShutdownInProgress()) {
|
2018-10-16 02:59:20 +00:00
|
|
|
// Sync on all distinct output directories.
|
|
|
|
for (auto dir : distinct_output_dirs) {
|
|
|
|
if (dir != nullptr) {
|
Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
|
|
|
IOOptions io_options;
|
|
|
|
Status error_status =
|
|
|
|
WritableFileWriter::PrepareIOOptions(write_options, io_options);
|
|
|
|
if (error_status.ok()) {
|
|
|
|
error_status = dir->FsyncWithDirOptions(
|
|
|
|
io_options, nullptr,
|
|
|
|
DirFsyncOptions(DirFsyncOptions::FsyncReason::kNewFileSynced));
|
|
|
|
}
|
2019-04-29 19:29:57 +00:00
|
|
|
if (!error_status.ok()) {
|
|
|
|
s = error_status;
|
2018-10-16 02:59:20 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2023-09-22 23:43:50 +00:00
|
|
|
} else if (!skip_set_bg_error) {
|
|
|
|
// When `skip_set_bg_error` is true, no memtable is picked so
|
|
|
|
// there is no need to call Cancel() or RollbackMemtableFlush().
|
|
|
|
//
|
2020-02-07 18:50:17 +00:00
|
|
|
// Need to undo atomic flush if something went wrong, i.e. s is not OK and
|
|
|
|
// it is not because of CF drop.
|
|
|
|
// Have to cancel the flush jobs that have NOT executed because we need to
|
|
|
|
// unref the versions.
|
|
|
|
for (int i = 0; i != num_cfs; ++i) {
|
2021-03-18 21:31:30 +00:00
|
|
|
if (pick_status[i] && !exec_status[i].first) {
|
2020-02-07 18:50:17 +00:00
|
|
|
jobs[i]->Cancel();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
for (int i = 0; i != num_cfs; ++i) {
|
2021-03-19 04:51:21 +00:00
|
|
|
if (exec_status[i].second.ok() && exec_status[i].first) {
|
2020-02-07 18:50:17 +00:00
|
|
|
auto& mems = jobs[i]->GetMemTables();
|
Rollback other pending memtable flushes when a flush fails (#11865)
Summary:
when atomic_flush=false, there are certain cases where we try to install memtable results with already deleted SST files. This can happen when the following sequence events happen:
```
Start Flush0 for memtable M0 to SST0
Start Flush1 for memtable M1 to SST1
Flush 1 returns OK, but don't install to MANIFEST and let whoever flushes M0 to take care of it
Flush0 finishes with a retryable IOError, it rollbacks M0, (incorrectly) does not rollback M1, and deletes SST0 and SST1
Starts Flush2 for M0, it does not pick up M1 since it thought M1 is flushed
Flush2 writes SST2 and finishes OK, tries to install SST2 and SST1
Error opening SST1 since it's already deleted with an error message like the following:
IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_3577_4230653031040984171/000011.sst: No such file or directory
```
This happens since:
1. We currently only rollback the memtables that we are flushing in a flush job when atomic_flush=false.
2. Pending output SSTs from previous flushes are deleted since a pending file number is released whenever a flush job is finished no matter of flush status: https://github.com/facebook/rocksdb/blob/f42e70bf561d4be9b6bbe7316d1c2c0c8a3818e6/db/db_impl/db_impl_compaction_flush.cc#L3161
This PR fixes the issue by rollback these pending flushes.
There is another issue where if a new flush for new memtable starts and finishes after Flush0 finishes. Its output may also be deleted (see more in unit test). It is fixed by checking bg error status before installing a memtable result, and rollback if there is an error.
There is a more efficient fix where we just don't release the pending file output number for flushes that delegate installation. It is more efficient since it does not have to rewrite the flush output file. With the fix in this PR, we can end up with a giant file if a lot of memtables are being flushed together. However, the more efficient fix is a bit more complicated to implement (requires associating such pending file numbers with flush job/memtables) and is more risky since it changes normal flush code path.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11865
Test Plan: * Added repro unit tests.
Reviewed By: anand1976
Differential Revision: D49484922
Pulled By: cbi42
fbshipit-source-id: 25b536c08f4e02e7f1d0f86571663737d2b5d53d
2023-09-21 22:31:29 +00:00
|
|
|
cfds[i]->imm()->RollbackMemtableFlush(
|
|
|
|
mems, /*rollback_succeeding_memtables=*/false);
|
2020-02-07 18:50:17 +00:00
|
|
|
}
|
|
|
|
}
|
2019-01-04 04:53:52 +00:00
|
|
|
}
|
2018-10-16 02:59:20 +00:00
|
|
|
|
2019-01-04 04:53:52 +00:00
|
|
|
if (s.ok()) {
|
Fix atomic flush waiting forever for MANIFEST write (#9034)
Summary:
In atomic flush, concurrent background flush threads will commit to the MANIFEST
one by one, in the order of the IDs of their picked memtables for all included column
families. Each time, a background flush thread decides whether to wait based on two
criteria:
- Is db stopped? If so, don't wait.
- Am I the one to commit the currently earliest memtable? If so, don't wait and ready to go.
When atomic flush was implemented, error writing to or syncing the MANIFEST would
cause the db to be stopped. Therefore, this background thread does not have to check
for the background error while waiting. If there has been such an error, `DBStopped()`
would have been true, and this thread will **not** wait forever.
After we improved error handling, RocksDB may map an IOError while writing to MANIFEST
to a soft error, if there is no WAL. This requires the background threads to check for
background error while waiting. Otherwise, a background flush thread may wait forever.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9034
Test Plan: make check
Reviewed By: zhichao-cao
Differential Revision: D31639225
Pulled By: riversand963
fbshipit-source-id: e9ab07c4d8f2eade238adeefe3e42dd9a5a3ebbd
2021-10-21 04:33:32 +00:00
|
|
|
const auto wait_to_install_func =
|
|
|
|
[&]() -> std::pair<Status, bool /*continue to wait*/> {
|
|
|
|
if (!versions_->io_status().ok()) {
|
|
|
|
// Something went wrong elsewhere, we cannot count on waiting for our
|
|
|
|
// turn to write/sync to MANIFEST or CURRENT. Just return.
|
|
|
|
return std::make_pair(versions_->io_status(), false);
|
|
|
|
} else if (shutting_down_.load(std::memory_order_acquire)) {
|
|
|
|
return std::make_pair(Status::ShutdownInProgress(), false);
|
|
|
|
}
|
2019-01-04 04:53:52 +00:00
|
|
|
bool ready = true;
|
|
|
|
for (size_t i = 0; i != cfds.size(); ++i) {
|
2019-10-16 17:39:00 +00:00
|
|
|
const auto& mems = jobs[i]->GetMemTables();
|
2019-01-04 04:53:52 +00:00
|
|
|
if (cfds[i]->IsDropped()) {
|
|
|
|
// If the column family is dropped, then do not wait.
|
2018-12-13 23:10:16 +00:00
|
|
|
continue;
|
2019-01-04 04:53:52 +00:00
|
|
|
} else if (!mems.empty() &&
|
|
|
|
cfds[i]->imm()->GetEarliestMemTableID() < mems[0]->GetID()) {
|
|
|
|
// If a flush job needs to install the flush result for mems and
|
|
|
|
// mems[0] is not the earliest memtable, it means another thread must
|
|
|
|
// be installing flush results for the same column family, then the
|
|
|
|
// current thread needs to wait.
|
|
|
|
ready = false;
|
|
|
|
break;
|
|
|
|
} else if (mems.empty() && cfds[i]->imm()->GetEarliestMemTableID() <=
|
|
|
|
bg_flush_args[i].max_memtable_id_) {
|
|
|
|
// If a flush job does not need to install flush results, then it has
|
|
|
|
// to wait until all memtables up to max_memtable_id_ (inclusive) are
|
|
|
|
// installed.
|
|
|
|
ready = false;
|
|
|
|
break;
|
2018-12-13 23:10:16 +00:00
|
|
|
}
|
2018-10-16 02:59:20 +00:00
|
|
|
}
|
Fix atomic flush waiting forever for MANIFEST write (#9034)
Summary:
In atomic flush, concurrent background flush threads will commit to the MANIFEST
one by one, in the order of the IDs of their picked memtables for all included column
families. Each time, a background flush thread decides whether to wait based on two
criteria:
- Is db stopped? If so, don't wait.
- Am I the one to commit the currently earliest memtable? If so, don't wait and ready to go.
When atomic flush was implemented, error writing to or syncing the MANIFEST would
cause the db to be stopped. Therefore, this background thread does not have to check
for the background error while waiting. If there has been such an error, `DBStopped()`
would have been true, and this thread will **not** wait forever.
After we improved error handling, RocksDB may map an IOError while writing to MANIFEST
to a soft error, if there is no WAL. This requires the background threads to check for
background error while waiting. Otherwise, a background flush thread may wait forever.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9034
Test Plan: make check
Reviewed By: zhichao-cao
Differential Revision: D31639225
Pulled By: riversand963
fbshipit-source-id: e9ab07c4d8f2eade238adeefe3e42dd9a5a3ebbd
2021-10-21 04:33:32 +00:00
|
|
|
return std::make_pair(Status::OK(), !ready);
|
2019-01-04 04:53:52 +00:00
|
|
|
};
|
|
|
|
|
2022-03-15 21:45:34 +00:00
|
|
|
bool resuming_from_bg_err =
|
2023-09-22 23:43:50 +00:00
|
|
|
error_handler_.IsDBStopped() || flush_for_recovery;
|
Fix atomic flush waiting forever for MANIFEST write (#9034)
Summary:
In atomic flush, concurrent background flush threads will commit to the MANIFEST
one by one, in the order of the IDs of their picked memtables for all included column
families. Each time, a background flush thread decides whether to wait based on two
criteria:
- Is db stopped? If so, don't wait.
- Am I the one to commit the currently earliest memtable? If so, don't wait and ready to go.
When atomic flush was implemented, error writing to or syncing the MANIFEST would
cause the db to be stopped. Therefore, this background thread does not have to check
for the background error while waiting. If there has been such an error, `DBStopped()`
would have been true, and this thread will **not** wait forever.
After we improved error handling, RocksDB may map an IOError while writing to MANIFEST
to a soft error, if there is no WAL. This requires the background threads to check for
background error while waiting. Otherwise, a background flush thread may wait forever.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9034
Test Plan: make check
Reviewed By: zhichao-cao
Differential Revision: D31639225
Pulled By: riversand963
fbshipit-source-id: e9ab07c4d8f2eade238adeefe3e42dd9a5a3ebbd
2021-10-21 04:33:32 +00:00
|
|
|
while ((!resuming_from_bg_err || error_handler_.GetRecoveryError().ok())) {
|
|
|
|
std::pair<Status, bool> res = wait_to_install_func();
|
|
|
|
|
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"DBImpl::AtomicFlushMemTablesToOutputFiles:WaitToCommit", &res);
|
|
|
|
|
|
|
|
if (!res.first.ok()) {
|
|
|
|
s = res.first;
|
|
|
|
break;
|
|
|
|
} else if (!res.second) {
|
2023-09-22 23:43:50 +00:00
|
|
|
// we are the oldest immutable memtable
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
// We are not the oldest immutable memtable
|
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"DBImpl::AtomicFlushMemTablesToOutputFiles:WaitCV", &res);
|
|
|
|
//
|
|
|
|
// If bg work is stopped, recovery thread first calls
|
|
|
|
// WaitForBackgroundWork() before proceeding to flush for recovery. This
|
|
|
|
// flush can block WaitForBackgroundWork() while waiting for recovery
|
|
|
|
// flush to install result. To avoid this deadlock, we should abort here
|
|
|
|
// if there is background error.
|
|
|
|
if (!flush_for_recovery && error_handler_.IsBGWorkStopped() &&
|
|
|
|
!error_handler_.GetBGError().ok()) {
|
|
|
|
s = error_handler_.GetBGError();
|
|
|
|
assert(!s.ok());
|
Fix atomic flush waiting forever for MANIFEST write (#9034)
Summary:
In atomic flush, concurrent background flush threads will commit to the MANIFEST
one by one, in the order of the IDs of their picked memtables for all included column
families. Each time, a background flush thread decides whether to wait based on two
criteria:
- Is db stopped? If so, don't wait.
- Am I the one to commit the currently earliest memtable? If so, don't wait and ready to go.
When atomic flush was implemented, error writing to or syncing the MANIFEST would
cause the db to be stopped. Therefore, this background thread does not have to check
for the background error while waiting. If there has been such an error, `DBStopped()`
would have been true, and this thread will **not** wait forever.
After we improved error handling, RocksDB may map an IOError while writing to MANIFEST
to a soft error, if there is no WAL. This requires the background threads to check for
background error while waiting. Otherwise, a background flush thread may wait forever.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9034
Test Plan: make check
Reviewed By: zhichao-cao
Differential Revision: D31639225
Pulled By: riversand963
fbshipit-source-id: e9ab07c4d8f2eade238adeefe3e42dd9a5a3ebbd
2021-10-21 04:33:32 +00:00
|
|
|
break;
|
|
|
|
}
|
2019-01-04 04:53:52 +00:00
|
|
|
atomic_flush_install_cv_.Wait();
|
Fix atomic flush waiting forever for MANIFEST write (#9034)
Summary:
In atomic flush, concurrent background flush threads will commit to the MANIFEST
one by one, in the order of the IDs of their picked memtables for all included column
families. Each time, a background flush thread decides whether to wait based on two
criteria:
- Is db stopped? If so, don't wait.
- Am I the one to commit the currently earliest memtable? If so, don't wait and ready to go.
When atomic flush was implemented, error writing to or syncing the MANIFEST would
cause the db to be stopped. Therefore, this background thread does not have to check
for the background error while waiting. If there has been such an error, `DBStopped()`
would have been true, and this thread will **not** wait forever.
After we improved error handling, RocksDB may map an IOError while writing to MANIFEST
to a soft error, if there is no WAL. This requires the background threads to check for
background error while waiting. Otherwise, a background flush thread may wait forever.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9034
Test Plan: make check
Reviewed By: zhichao-cao
Differential Revision: D31639225
Pulled By: riversand963
fbshipit-source-id: e9ab07c4d8f2eade238adeefe3e42dd9a5a3ebbd
2021-10-21 04:33:32 +00:00
|
|
|
|
2023-09-22 23:43:50 +00:00
|
|
|
resuming_from_bg_err = error_handler_.IsDBStopped() || flush_for_recovery;
|
2019-01-04 04:53:52 +00:00
|
|
|
}
|
|
|
|
|
Fix atomic flush waiting forever for MANIFEST write (#9034)
Summary:
In atomic flush, concurrent background flush threads will commit to the MANIFEST
one by one, in the order of the IDs of their picked memtables for all included column
families. Each time, a background flush thread decides whether to wait based on two
criteria:
- Is db stopped? If so, don't wait.
- Am I the one to commit the currently earliest memtable? If so, don't wait and ready to go.
When atomic flush was implemented, error writing to or syncing the MANIFEST would
cause the db to be stopped. Therefore, this background thread does not have to check
for the background error while waiting. If there has been such an error, `DBStopped()`
would have been true, and this thread will **not** wait forever.
After we improved error handling, RocksDB may map an IOError while writing to MANIFEST
to a soft error, if there is no WAL. This requires the background threads to check for
background error while waiting. Otherwise, a background flush thread may wait forever.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9034
Test Plan: make check
Reviewed By: zhichao-cao
Differential Revision: D31639225
Pulled By: riversand963
fbshipit-source-id: e9ab07c4d8f2eade238adeefe3e42dd9a5a3ebbd
2021-10-21 04:33:32 +00:00
|
|
|
if (!resuming_from_bg_err) {
|
|
|
|
// If not resuming from bg err, then we determine future action based on
|
|
|
|
// whether we hit background error.
|
|
|
|
if (s.ok()) {
|
|
|
|
s = error_handler_.GetBGError();
|
|
|
|
}
|
|
|
|
} else if (s.ok()) {
|
|
|
|
// If resuming from bg err, we still rely on wait_to_install_func()'s
|
|
|
|
// result to determine future action. If wait_to_install_func() returns
|
|
|
|
// non-ok already, then we should not proceed to flush result
|
|
|
|
// installation.
|
|
|
|
s = error_handler_.GetRecoveryError();
|
|
|
|
}
|
2023-09-22 23:43:50 +00:00
|
|
|
// Since we are not installing these memtables, need to rollback
|
|
|
|
// to allow future flush job to pick up these memtables.
|
|
|
|
if (!s.ok()) {
|
|
|
|
for (int i = 0; i != num_cfs; ++i) {
|
|
|
|
assert(exec_status[i].first);
|
|
|
|
assert(exec_status[i].second.ok());
|
|
|
|
auto& mems = jobs[i]->GetMemTables();
|
|
|
|
cfds[i]->imm()->RollbackMemtableFlush(
|
|
|
|
mems, /*rollback_succeeding_memtables=*/false);
|
|
|
|
}
|
|
|
|
}
|
2019-01-04 04:53:52 +00:00
|
|
|
}
|
2018-10-16 02:59:20 +00:00
|
|
|
|
2019-01-04 04:53:52 +00:00
|
|
|
if (s.ok()) {
|
|
|
|
autovector<ColumnFamilyData*> tmp_cfds;
|
|
|
|
autovector<const autovector<MemTable*>*> mems_list;
|
|
|
|
autovector<const MutableCFOptions*> mutable_cf_options_list;
|
2019-01-31 22:28:53 +00:00
|
|
|
autovector<FileMetaData*> tmp_file_meta;
|
2021-08-03 20:30:05 +00:00
|
|
|
autovector<std::list<std::unique_ptr<FlushJobInfo>>*>
|
|
|
|
committed_flush_jobs_info;
|
2019-01-04 04:53:52 +00:00
|
|
|
for (int i = 0; i != num_cfs; ++i) {
|
2019-10-16 17:39:00 +00:00
|
|
|
const auto& mems = jobs[i]->GetMemTables();
|
2019-01-04 04:53:52 +00:00
|
|
|
if (!cfds[i]->IsDropped() && !mems.empty()) {
|
|
|
|
tmp_cfds.emplace_back(cfds[i]);
|
|
|
|
mems_list.emplace_back(&mems);
|
2019-01-12 01:40:44 +00:00
|
|
|
mutable_cf_options_list.emplace_back(&all_mutable_cf_options[i]);
|
2019-01-31 22:28:53 +00:00
|
|
|
tmp_file_meta.emplace_back(&file_meta[i]);
|
2021-08-03 20:30:05 +00:00
|
|
|
committed_flush_jobs_info.emplace_back(
|
|
|
|
jobs[i]->GetCommittedFlushJobsInfo());
|
2019-01-04 04:53:52 +00:00
|
|
|
}
|
2018-10-16 02:59:20 +00:00
|
|
|
}
|
2019-01-04 04:53:52 +00:00
|
|
|
|
|
|
|
s = InstallMemtableAtomicFlushResults(
|
|
|
|
nullptr /* imm_lists */, tmp_cfds, mutable_cf_options_list, mems_list,
|
2020-12-04 03:21:08 +00:00
|
|
|
versions_.get(), &logs_with_prep_tracker_, &mutex_, tmp_file_meta,
|
2021-08-03 20:30:05 +00:00
|
|
|
committed_flush_jobs_info, &job_context->memtables_to_free,
|
|
|
|
directories_.GetDbDir(), log_buffer);
|
2018-10-16 02:59:20 +00:00
|
|
|
}
|
|
|
|
|
2019-04-29 19:29:57 +00:00
|
|
|
if (s.ok()) {
|
2018-10-16 02:59:20 +00:00
|
|
|
assert(num_cfs ==
|
|
|
|
static_cast<int>(job_context->superversion_contexts.size()));
|
|
|
|
for (int i = 0; i != num_cfs; ++i) {
|
2020-09-15 04:10:09 +00:00
|
|
|
assert(cfds[i]);
|
|
|
|
|
2018-12-13 23:10:16 +00:00
|
|
|
if (cfds[i]->IsDropped()) {
|
|
|
|
continue;
|
|
|
|
}
|
2018-10-16 02:59:20 +00:00
|
|
|
InstallSuperVersionAndScheduleWork(cfds[i],
|
|
|
|
&job_context->superversion_contexts[i],
|
2019-01-31 22:28:53 +00:00
|
|
|
all_mutable_cf_options[i]);
|
2020-09-15 04:10:09 +00:00
|
|
|
|
|
|
|
const std::string& column_family_name = cfds[i]->GetName();
|
|
|
|
|
|
|
|
Version* const current = cfds[i]->current();
|
|
|
|
assert(current);
|
|
|
|
|
|
|
|
const VersionStorageInfo* const storage_info = current->storage_info();
|
|
|
|
assert(storage_info);
|
|
|
|
|
2018-10-16 02:59:20 +00:00
|
|
|
VersionStorageInfo::LevelSummaryStorage tmp;
|
|
|
|
ROCKS_LOG_BUFFER(log_buffer, "[%s] Level summary: %s\n",
|
2020-09-15 04:10:09 +00:00
|
|
|
column_family_name.c_str(),
|
|
|
|
storage_info->LevelSummary(&tmp));
|
|
|
|
|
|
|
|
const auto& blob_files = storage_info->GetBlobFiles();
|
|
|
|
if (!blob_files.empty()) {
|
Use a sorted vector instead of a map to store blob file metadata (#9526)
Summary:
The patch replaces `std::map` with a sorted `std::vector` for
`VersionStorageInfo::blob_files_` and preallocates the space
for the `vector` before saving the `BlobFileMetaData` into the
new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`.
These changes reduce the time the DB mutex is held while
saving new `Version`s, and using a sorted `vector` also makes
lookups faster thanks to better memory locality.
In addition, the patch introduces helper methods
`VersionStorageInfo::GetBlobFileMetaData` and
`VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by
clients to perform lookups in the `vector`, and does some general
cleanup in the parts of code where blob file metadata are used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526
Test Plan:
Ran `make check` and the crash test script for a while.
Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced:
```
numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value>
```
Final statistics before the patch:
```
Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s
Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s
```
With the patch:
```
Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s
Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s
```
Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs.
Reviewed By: riversand963
Differential Revision: D34082728
Pulled By: ltamasi
fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
2022-02-09 20:35:39 +00:00
|
|
|
assert(blob_files.front());
|
|
|
|
assert(blob_files.back());
|
|
|
|
|
|
|
|
ROCKS_LOG_BUFFER(
|
|
|
|
log_buffer,
|
|
|
|
"[%s] Blob file summary: head=%" PRIu64 ", tail=%" PRIu64 "\n",
|
|
|
|
column_family_name.c_str(), blob_files.front()->GetBlobFileNumber(),
|
|
|
|
blob_files.back()->GetBlobFileNumber());
|
2020-09-15 04:10:09 +00:00
|
|
|
}
|
2018-10-16 02:59:20 +00:00
|
|
|
}
|
|
|
|
if (made_progress) {
|
|
|
|
*made_progress = true;
|
|
|
|
}
|
|
|
|
auto sfm = static_cast<SstFileManagerImpl*>(
|
|
|
|
immutable_db_options_.sst_file_manager.get());
|
2019-10-16 17:39:00 +00:00
|
|
|
assert(all_mutable_cf_options.size() == static_cast<size_t>(num_cfs));
|
2020-12-24 00:54:05 +00:00
|
|
|
for (int i = 0; s.ok() && i != num_cfs; ++i) {
|
2021-08-19 00:39:00 +00:00
|
|
|
// If mempurge happened instead of Flush,
|
|
|
|
// no NotifyOnFlushCompleted call (no SST file created).
|
|
|
|
if (switched_to_mempurge[i]) {
|
|
|
|
continue;
|
|
|
|
}
|
2018-12-13 23:10:16 +00:00
|
|
|
if (cfds[i]->IsDropped()) {
|
|
|
|
continue;
|
|
|
|
}
|
2019-10-16 17:39:00 +00:00
|
|
|
NotifyOnFlushCompleted(cfds[i], all_mutable_cf_options[i],
|
|
|
|
jobs[i]->GetCommittedFlushJobsInfo());
|
2018-10-16 02:59:20 +00:00
|
|
|
if (sfm) {
|
|
|
|
std::string file_path = MakeTableFileName(
|
|
|
|
cfds[i]->ioptions()->cf_paths[0].path, file_meta[i].fd.GetNumber());
|
2021-01-04 19:07:10 +00:00
|
|
|
// TODO (PR7798). We should only add the file to the FileManager if it
|
|
|
|
// exists. Otherwise, some tests may fail. Ignore the error in the
|
|
|
|
// interim.
|
|
|
|
sfm->OnAddFile(file_path).PermitUncheckedError();
|
2018-10-16 02:59:20 +00:00
|
|
|
if (sfm->IsMaxAllowedSpaceReached() &&
|
|
|
|
error_handler_.GetBGError().ok()) {
|
|
|
|
Status new_bg_error =
|
|
|
|
Status::SpaceLimit("Max allowed space was reached");
|
2020-12-08 04:09:55 +00:00
|
|
|
error_handler_.SetBGError(new_bg_error,
|
|
|
|
BackgroundErrorReason::kFlush);
|
2018-10-16 02:59:20 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-27 23:03:05 +00:00
|
|
|
// Need to undo atomic flush if something went wrong, i.e. s is not OK and
|
|
|
|
// it is not because of CF drop.
|
2023-09-22 23:43:50 +00:00
|
|
|
if (!s.ok() && !s.IsColumnFamilyDropped() && !skip_set_bg_error) {
|
2022-03-15 21:45:34 +00:00
|
|
|
if (log_io_s.ok()) {
|
First step towards handling MANIFEST write error (#6949)
Summary:
This PR provides preliminary support for handling IO error during MANIFEST write.
File write/sync is not guaranteed to be atomic. If we encounter an IOError while writing/syncing to the MANIFEST file, we cannot be sure about the state of the MANIFEST file. The version edits may or may not have reached the file. During cleanup, if we delete the newly-generated SST files referenced by the pending version edit(s), but the version edit(s) actually are persistent in the MANIFEST, then next recovery attempt will process the version edits(s) and then fail since the SST files have already been deleted.
One approach is to truncate the MANIFEST after write/sync error, so that it is safe to delete the SST files. However, file truncation may not be supported on certain file systems. Therefore, we take the following approach.
If an IOError is detected during MANIFEST write/sync, we disable file deletions for the faulty database. Depending on whether the IOError is retryable (set by underlying file system), either RocksDB or application can call `DB::Resume()`, or simply shutdown and restart. During `Resume()`, RocksDB will try to switch to a new MANIFEST and write all existing in-memory version storage in the new file. If this succeeds, then RocksDB may proceed. If all recovery is completed, then file deletions will be re-enabled.
Note that multiple threads can call `LogAndApply()` at the same time, though only one of them will be going through the process MANIFEST write, possibly batching the version edits of other threads. When the leading MANIFEST writer finishes, all of the MANIFEST writing threads in this batch will have the same IOError. They will all call `ErrorHandler::SetBGError()` in which file deletion will be disabled.
Possible future directions:
- Add an `ErrorContext` structure so that it is easier to pass more info to `ErrorHandler`. Currently, as in this example, a new `BackgroundErrorReason` has to be added.
Test plan (dev server):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6949
Reviewed By: anand1976
Differential Revision: D22026020
Pulled By: riversand963
fbshipit-source-id: f3c68a2ef45d9b505d0d625c7c5e0c88495b91c8
2020-06-25 02:05:47 +00:00
|
|
|
// Error while writing to MANIFEST.
|
|
|
|
// In fact, versions_->io_status() can also be the result of renaming
|
|
|
|
// CURRENT file. With current code, it's just difficult to tell. So just
|
|
|
|
// be pessimistic and try write to a new MANIFEST.
|
|
|
|
// TODO: distinguish between MANIFEST write and CURRENT renaming
|
2020-09-18 03:22:35 +00:00
|
|
|
if (!versions_->io_status().ok()) {
|
2021-03-26 04:41:44 +00:00
|
|
|
// If WAL sync is successful (either WAL size is 0 or there is no IO
|
|
|
|
// error), all the Manifest write will be map to soft error.
|
|
|
|
// TODO: kManifestWriteNoWAL and kFlushNoWAL are misleading. Refactor
|
|
|
|
// is needed.
|
2022-03-15 21:45:34 +00:00
|
|
|
error_handler_.SetBGError(s,
|
2021-03-26 04:41:44 +00:00
|
|
|
BackgroundErrorReason::kManifestWriteNoWAL);
|
2020-09-18 03:22:35 +00:00
|
|
|
} else {
|
2021-03-26 04:41:44 +00:00
|
|
|
// If WAL sync is successful (either WAL size is 0 or there is no IO
|
|
|
|
// error), all the other SST file write errors will be set as
|
|
|
|
// kFlushNoWAL.
|
2022-03-15 21:45:34 +00:00
|
|
|
error_handler_.SetBGError(s, BackgroundErrorReason::kFlushNoWAL);
|
2020-09-18 03:22:35 +00:00
|
|
|
}
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-27 23:03:05 +00:00
|
|
|
} else {
|
2022-03-15 21:45:34 +00:00
|
|
|
assert(s == log_io_s);
|
|
|
|
Status new_bg_error = s;
|
|
|
|
error_handler_.SetBGError(new_bg_error, BackgroundErrorReason::kFlush);
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-27 23:03:05 +00:00
|
|
|
}
|
2018-10-16 02:59:20 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
2017-04-18 19:00:36 +00:00
|
|
|
void DBImpl::NotifyOnFlushBegin(ColumnFamilyData* cfd, FileMetaData* file_meta,
|
|
|
|
const MutableCFOptions& mutable_cf_options,
|
2023-01-24 17:54:04 +00:00
|
|
|
int job_id, FlushReason flush_reason) {
|
2017-04-18 19:00:36 +00:00
|
|
|
if (immutable_db_options_.listeners.size() == 0U) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
mutex_.AssertHeld();
|
|
|
|
if (shutting_down_.load(std::memory_order_acquire)) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
bool triggered_writes_slowdown =
|
|
|
|
(cfd->current()->storage_info()->NumLevelFiles(0) >=
|
|
|
|
mutable_cf_options.level0_slowdown_writes_trigger);
|
|
|
|
bool triggered_writes_stop =
|
|
|
|
(cfd->current()->storage_info()->NumLevelFiles(0) >=
|
|
|
|
mutable_cf_options.level0_stop_writes_trigger);
|
|
|
|
// release lock while notifying events
|
|
|
|
mutex_.Unlock();
|
|
|
|
{
|
2019-11-01 18:44:59 +00:00
|
|
|
FlushJobInfo info{};
|
2018-12-12 04:25:45 +00:00
|
|
|
info.cf_id = cfd->GetID();
|
2017-04-18 19:00:36 +00:00
|
|
|
info.cf_name = cfd->GetName();
|
|
|
|
// TODO(yhchiang): make db_paths dynamic in case flush does not
|
|
|
|
// go to L0 in the future.
|
2019-10-24 21:42:43 +00:00
|
|
|
const uint64_t file_number = file_meta->fd.GetNumber();
|
|
|
|
info.file_path =
|
|
|
|
MakeTableFileName(cfd->ioptions()->cf_paths[0].path, file_number);
|
|
|
|
info.file_number = file_number;
|
2017-04-18 19:00:36 +00:00
|
|
|
info.thread_id = env_->GetThreadID();
|
|
|
|
info.job_id = job_id;
|
|
|
|
info.triggered_writes_slowdown = triggered_writes_slowdown;
|
|
|
|
info.triggered_writes_stop = triggered_writes_stop;
|
2018-07-27 23:00:26 +00:00
|
|
|
info.smallest_seqno = file_meta->fd.smallest_seqno;
|
|
|
|
info.largest_seqno = file_meta->fd.largest_seqno;
|
2023-01-24 17:54:04 +00:00
|
|
|
info.flush_reason = flush_reason;
|
2024-03-04 18:08:32 +00:00
|
|
|
for (const auto& listener : immutable_db_options_.listeners) {
|
2017-04-18 19:00:36 +00:00
|
|
|
listener->OnFlushBegin(this, info);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
mutex_.Lock();
|
2023-07-19 19:52:39 +00:00
|
|
|
// no need to signal bg_cv_ as it will be signaled at the end of the
|
|
|
|
// flush process.
|
2017-04-18 19:00:36 +00:00
|
|
|
}
|
|
|
|
|
2019-10-16 17:39:00 +00:00
|
|
|
void DBImpl::NotifyOnFlushCompleted(
|
|
|
|
ColumnFamilyData* cfd, const MutableCFOptions& mutable_cf_options,
|
|
|
|
std::list<std::unique_ptr<FlushJobInfo>>* flush_jobs_info) {
|
|
|
|
assert(flush_jobs_info != nullptr);
|
2017-04-06 00:14:05 +00:00
|
|
|
if (immutable_db_options_.listeners.size() == 0U) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
mutex_.AssertHeld();
|
|
|
|
if (shutting_down_.load(std::memory_order_acquire)) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
bool triggered_writes_slowdown =
|
|
|
|
(cfd->current()->storage_info()->NumLevelFiles(0) >=
|
|
|
|
mutable_cf_options.level0_slowdown_writes_trigger);
|
|
|
|
bool triggered_writes_stop =
|
|
|
|
(cfd->current()->storage_info()->NumLevelFiles(0) >=
|
|
|
|
mutable_cf_options.level0_stop_writes_trigger);
|
|
|
|
// release lock while notifying events
|
|
|
|
mutex_.Unlock();
|
|
|
|
{
|
2019-10-16 17:39:00 +00:00
|
|
|
for (auto& info : *flush_jobs_info) {
|
|
|
|
info->triggered_writes_slowdown = triggered_writes_slowdown;
|
|
|
|
info->triggered_writes_stop = triggered_writes_stop;
|
2024-03-04 18:08:32 +00:00
|
|
|
for (const auto& listener : immutable_db_options_.listeners) {
|
2019-10-16 17:39:00 +00:00
|
|
|
listener->OnFlushCompleted(this, *info);
|
|
|
|
}
|
Fix TSAN data race in EventListenerTest.MultiCF (#9528)
Summary:
**Context:**
`EventListenerTest.MultiCF` occasionally failed on TSAN data race as below:
```
WARNING: ThreadSanitizer: data race (pid=2047633)
Read of size 8 at 0x7b6000001440 by main thread:
#0 std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::size() const /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/stl_vector.h:916:40 (listener_test+0x52337c)
https://github.com/facebook/rocksdb/issues/1 rocksdb::EventListenerTest_MultiCF_Test::TestBody() /home/circleci/project/db/listener_test.cc:384:7 (listener_test+0x52337c)
Previous write of size 8 at 0x7b6000001440 by thread T2:
#0 void std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::_M_realloc_insert<rocksdb::DB* const&>(__gnu_cxx::__normal_iterator<rocksdb::DB**, std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> > >, rocksdb::DB* const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/vector.tcc:503:31 (listener_test+0x550654)
https://github.com/facebook/rocksdb/issues/1 std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::push_back(rocksdb::DB* const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/stl_vector.h:1195:4 (listener_test+0x550654)
https://github.com/facebook/rocksdb/issues/2 rocksdb::TestFlushListener::OnFlushCompleted(rocksdb::DB*, rocksdb::FlushJobInfo const&) /home/circleci/project/db/listener_test.cc:255:18 (listener_test+0x550654)
```
After investigation, it is due to the following:
(1) `ASSERT_OK(Flush(i));` before the read `std::vector::size()` is supposed to be [blocked on `DB::Impl::bg_cv_` for memtable flush to finish](https://github.com/facebook/rocksdb/blob/320d9a8e8a1b6998f92934f87fc71ad8bd6d4596/db/db_impl/db_impl_compaction_flush.cc#L2319) and get signaled [at the end of background flush ](https://github.com/facebook/rocksdb/blob/320d9a8e8a1b6998f92934f87fc71ad8bd6d4596/db/db_impl/db_impl_compaction_flush.cc#L2830), which happens after the write `std::vector::push_back()` . So the sequence of execution should have been synchronized as `call flush() -> write -> return from flush() -> read` and would not cause any TSAN data race.
- The subsequent `ASSERT_OK(dbfull()->TEST_WaitForFlushMemTable());` serves a similar purpose based on [the previous attempt to deflake the test.](https://github.com/facebook/rocksdb/pull/9084)
(2) However, there are multiple places in the code can signal this `DB::Impl::bg_cv_` and mistakenly wake up `ASSERT_OK(Flush(i));` (or `ASSERT_OK(dbfull()->TEST_WaitForFlushMemTable());`) too early (and with the lock available to them), resulting in non-synchronized read and write thus a TSAN data race.
- Reproduced by the following, suggested by ajkr:
```
diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc
index 4ff87c1e4..52492e9cf 100644
--- a/db/db_impl/db_impl_compaction_flush.cc
+++ b/db/db_impl/db_impl_compaction_flush.cc
@@ -22,7 +22,7 @@
#include "test_util/sync_point.h"
#include "util/cast_util.h"
#include "util/concurrent_task_limiter_impl.h"
namespace ROCKSDB_NAMESPACE {
bool DBImpl::EnoughRoomForCompaction(
@@ -855,6 +855,7 @@ void DBImpl::NotifyOnFlushCompleted(
mutable_cf_options.level0_stop_writes_trigger);
// release lock while notifying events
mutex_.Unlock();
+ bg_cv_.SignalAll();
```
**Summary:**
- Added synchornization between read and write by ` ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency()` mechanism
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9528
Test Plan:
`./listener_test --gtest_filter=EventListenerTest.MultiCF --gtest_repeat=10`
- pre-fix:
```
Repeating all tests (iteration 3)
Note: Google Test filter = EventListenerTest.MultiCF
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from EventListenerTest
[ RUN ] EventListenerTest.MultiCF
==================
WARNING: ThreadSanitizer: data race (pid=3377137)
Read of size 8 at 0x7b6000000840 by main thread:
#0 std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::size()
https://github.com/facebook/rocksdb/issues/1 rocksdb::EventListenerTest_MultiCF_Test::TestBody() db/listener_test.cc:384 (listener_test+0x4bb300)
Previous write of size 8 at 0x7b6000000840 by thread T2:
#0 void std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::_M_realloc_insert<rocksdb::DB* const&>(__gnu_cxx::__normal_iterator<rocksdb::DB**, std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> > >, rocksdb::DB* const&)
https://github.com/facebook/rocksdb/issues/1 std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::push_back(rocksdb::DB* const&)
https://github.com/facebook/rocksdb/issues/2 rocksdb::TestFlushListener::OnFlushCompleted(rocksdb::DB*, rocksdb::FlushJobInfo const&) db/listener_test.cc:255 (listener_test+0x4e820f)
```
- post-fix: `All passed`
Reviewed By: ajkr
Differential Revision: D34085791
Pulled By: hx235
fbshipit-source-id: f877aa687ea1d5cb6f31ef8c4772625d22868e8b
2022-02-10 18:17:53 +00:00
|
|
|
TEST_SYNC_POINT(
|
|
|
|
"DBImpl::NotifyOnFlushCompleted::PostAllOnFlushCompleted");
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
2019-10-16 17:39:00 +00:00
|
|
|
flush_jobs_info->clear();
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
mutex_.Lock();
|
|
|
|
// no need to signal bg_cv_ as it will be signaled at the end of the
|
|
|
|
// flush process.
|
|
|
|
}
|
|
|
|
|
|
|
|
Status DBImpl::CompactRange(const CompactRangeOptions& options,
|
|
|
|
ColumnFamilyHandle* column_family,
|
2020-12-02 20:59:23 +00:00
|
|
|
const Slice* begin_without_ts,
|
|
|
|
const Slice* end_without_ts) {
|
2021-04-21 22:23:04 +00:00
|
|
|
if (manual_compaction_paused_.load(std::memory_order_acquire) > 0) {
|
|
|
|
return Status::Incomplete(Status::SubCode::kManualCompactionPaused);
|
|
|
|
}
|
|
|
|
|
2021-06-07 18:40:31 +00:00
|
|
|
if (options.canceled && options.canceled->load(std::memory_order_acquire)) {
|
|
|
|
return Status::Incomplete(Status::SubCode::kManualCompactionPaused);
|
|
|
|
}
|
|
|
|
|
2020-12-02 20:59:23 +00:00
|
|
|
const Comparator* const ucmp = column_family->GetComparator();
|
|
|
|
assert(ucmp);
|
|
|
|
size_t ts_sz = ucmp->timestamp_size();
|
|
|
|
if (ts_sz == 0) {
|
|
|
|
return CompactRangeInternal(options, column_family, begin_without_ts,
|
2022-03-12 00:13:23 +00:00
|
|
|
end_without_ts, "" /*trim_ts*/);
|
2020-12-02 20:59:23 +00:00
|
|
|
}
|
|
|
|
|
2023-08-30 20:42:04 +00:00
|
|
|
std::string begin_str, end_str;
|
|
|
|
auto [begin, end] =
|
|
|
|
MaybeAddTimestampsToRange(begin_without_ts, end_without_ts, ts_sz,
|
|
|
|
&begin_str, &end_str, false /*exclusive_end*/);
|
2020-12-02 20:59:23 +00:00
|
|
|
|
2023-08-30 20:42:04 +00:00
|
|
|
return CompactRangeInternal(
|
|
|
|
options, column_family, begin.has_value() ? &begin.value() : nullptr,
|
|
|
|
end.has_value() ? &end.value() : nullptr, "" /*trim_ts*/);
|
2020-12-02 20:59:23 +00:00
|
|
|
}
|
|
|
|
|
2021-12-23 19:02:43 +00:00
|
|
|
Status DBImpl::IncreaseFullHistoryTsLow(ColumnFamilyHandle* column_family,
|
2021-02-08 21:43:23 +00:00
|
|
|
std::string ts_low) {
|
2021-12-23 19:02:43 +00:00
|
|
|
ColumnFamilyData* cfd = nullptr;
|
|
|
|
if (column_family == nullptr) {
|
|
|
|
cfd = default_cf_handle_->cfd();
|
|
|
|
} else {
|
|
|
|
auto cfh = static_cast_with_check<ColumnFamilyHandleImpl>(column_family);
|
|
|
|
assert(cfh != nullptr);
|
|
|
|
cfd = cfh->cfd();
|
|
|
|
}
|
|
|
|
assert(cfd != nullptr && cfd->user_comparator() != nullptr);
|
|
|
|
if (cfd->user_comparator()->timestamp_size() == 0) {
|
|
|
|
return Status::InvalidArgument(
|
|
|
|
"Timestamp is not enabled in this column family");
|
|
|
|
}
|
|
|
|
if (cfd->user_comparator()->timestamp_size() != ts_low.size()) {
|
|
|
|
return Status::InvalidArgument("ts_low size mismatch");
|
|
|
|
}
|
|
|
|
return IncreaseFullHistoryTsLowImpl(cfd, ts_low);
|
|
|
|
}
|
|
|
|
|
|
|
|
Status DBImpl::IncreaseFullHistoryTsLowImpl(ColumnFamilyData* cfd,
|
|
|
|
std::string ts_low) {
|
2021-02-08 21:43:23 +00:00
|
|
|
VersionEdit edit;
|
|
|
|
edit.SetColumnFamily(cfd->GetID());
|
|
|
|
edit.SetFullHistoryTsLow(ts_low);
|
2023-04-21 16:07:18 +00:00
|
|
|
|
Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
|
|
|
// TODO: plumb Env::IOActivity, Env::IOPriority
|
2023-04-21 16:07:18 +00:00
|
|
|
const ReadOptions read_options;
|
Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
|
|
|
const WriteOptions write_options;
|
|
|
|
|
2022-06-10 15:21:08 +00:00
|
|
|
TEST_SYNC_POINT_CALLBACK("DBImpl::IncreaseFullHistoryTsLowImpl:BeforeEdit",
|
|
|
|
&edit);
|
2021-02-08 21:43:23 +00:00
|
|
|
|
|
|
|
InstrumentedMutexLock l(&mutex_);
|
|
|
|
std::string current_ts_low = cfd->GetFullHistoryTsLow();
|
|
|
|
const Comparator* ucmp = cfd->user_comparator();
|
2021-12-23 19:02:43 +00:00
|
|
|
assert(ucmp->timestamp_size() == ts_low.size() && !ts_low.empty());
|
2021-02-08 21:43:23 +00:00
|
|
|
if (!current_ts_low.empty() &&
|
|
|
|
ucmp->CompareTimestamp(ts_low, current_ts_low) < 0) {
|
2022-06-10 15:21:08 +00:00
|
|
|
return Status::InvalidArgument("Cannot decrease full_history_ts_low");
|
2021-02-08 21:43:23 +00:00
|
|
|
}
|
|
|
|
|
2022-06-10 15:21:08 +00:00
|
|
|
Status s = versions_->LogAndApply(cfd, *cfd->GetLatestMutableCFOptions(),
|
Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
|
|
|
read_options, write_options, &edit, &mutex_,
|
2023-04-21 16:07:18 +00:00
|
|
|
directories_.GetDbDir());
|
2022-06-10 15:21:08 +00:00
|
|
|
if (!s.ok()) {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
current_ts_low = cfd->GetFullHistoryTsLow();
|
|
|
|
if (!current_ts_low.empty() &&
|
|
|
|
ucmp->CompareTimestamp(current_ts_low, ts_low) > 0) {
|
|
|
|
std::stringstream oss;
|
|
|
|
oss << "full_history_ts_low: " << Slice(current_ts_low).ToString(true)
|
|
|
|
<< " is set to be higher than the requested "
|
|
|
|
"timestamp: "
|
|
|
|
<< Slice(ts_low).ToString(true) << std::endl;
|
|
|
|
return Status::TryAgain(oss.str());
|
|
|
|
}
|
|
|
|
return Status::OK();
|
2021-02-08 21:43:23 +00:00
|
|
|
}
|
|
|
|
|
2020-12-02 20:59:23 +00:00
|
|
|
Status DBImpl::CompactRangeInternal(const CompactRangeOptions& options,
|
|
|
|
ColumnFamilyHandle* column_family,
|
2022-03-12 00:13:23 +00:00
|
|
|
const Slice* begin, const Slice* end,
|
|
|
|
const std::string& trim_ts) {
|
2020-07-03 02:24:25 +00:00
|
|
|
auto cfh = static_cast_with_check<ColumnFamilyHandleImpl>(column_family);
|
2018-04-06 02:49:06 +00:00
|
|
|
auto cfd = cfh->cfd();
|
|
|
|
|
|
|
|
if (options.target_path_id >= cfd->ioptions()->cf_paths.size()) {
|
2017-04-06 00:14:05 +00:00
|
|
|
return Status::InvalidArgument("Invalid target path ID");
|
|
|
|
}
|
|
|
|
|
2018-02-12 23:34:39 +00:00
|
|
|
bool flush_needed = true;
|
2021-02-08 21:43:23 +00:00
|
|
|
|
|
|
|
// Update full_history_ts_low if it's set
|
|
|
|
if (options.full_history_ts_low != nullptr &&
|
|
|
|
!options.full_history_ts_low->empty()) {
|
|
|
|
std::string ts_low = options.full_history_ts_low->ToString();
|
|
|
|
if (begin != nullptr || end != nullptr) {
|
|
|
|
return Status::InvalidArgument(
|
|
|
|
"Cannot specify compaction range with full_history_ts_low");
|
|
|
|
}
|
2021-12-23 19:02:43 +00:00
|
|
|
Status s = IncreaseFullHistoryTsLowImpl(cfd, ts_low);
|
2021-02-08 21:43:23 +00:00
|
|
|
if (!s.ok()) {
|
|
|
|
LogFlush(immutable_db_options_.info_log);
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-12-24 00:54:05 +00:00
|
|
|
Status s;
|
2018-02-28 01:08:34 +00:00
|
|
|
if (begin != nullptr && end != nullptr) {
|
|
|
|
// TODO(ajkr): We could also optimize away the flush in certain cases where
|
|
|
|
// one/both sides of the interval are unbounded. But it requires more
|
|
|
|
// changes to RangesOverlapWithMemtables.
|
2024-02-07 02:35:36 +00:00
|
|
|
UserKeyRange range(*begin, *end);
|
2019-12-17 21:20:42 +00:00
|
|
|
SuperVersion* super_version = cfd->GetReferencedSuperVersion(this);
|
2020-12-24 00:54:05 +00:00
|
|
|
s = cfd->RangesOverlapWithMemtables(
|
|
|
|
{range}, super_version, immutable_db_options_.allow_data_in_errors,
|
|
|
|
&flush_needed);
|
2018-02-28 01:08:34 +00:00
|
|
|
CleanupSuperVersion(super_version);
|
|
|
|
}
|
|
|
|
|
2020-12-24 00:54:05 +00:00
|
|
|
if (s.ok() && flush_needed) {
|
2018-08-29 18:58:13 +00:00
|
|
|
FlushOptions fo;
|
|
|
|
fo.allow_write_stall = options.allow_write_stall;
|
2018-11-12 20:22:10 +00:00
|
|
|
if (immutable_db_options_.atomic_flush) {
|
Fix bug of prematurely excluded CF in atomic flush contains unflushed data that should've been included in the atomic flush (#11148)
Summary:
**Context:**
Atomic flush should guarantee recoverability of all data of seqno up to the max seqno of the flush. It achieves this by ensuring all such data are flushed by the time this atomic flush finishes through `SelectColumnFamiliesForAtomicFlush()`. However, our crash test exposed the following case where an excluded CF from an atomic flush contains unflushed data of seqno less than the max seqno of that atomic flush and loses its data with `WriteOptions::DisableWAL=true` in face of a crash right after the atomic flush finishes .
```
./db_stress --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
pid=$!
sleep 0.2
sleep 10
kill $pid
sleep 0.2
./db_stress --ops_per_thread=1 --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
pid=$!
sleep 0.2
sleep 40
kill $pid
sleep 0.2
Verification failed for column family 6 key 0000000000000239000000000000012B0000000000000138 (56622): value_from_db: , value_from_expected: 4A6331754E4F4C4D42434041464744455A5B58595E5F5C5D5253505156575455, msg: Value not found: NotFound:
Crash-recovery verification failed :(
No writes or ops?
Verification failed :(
```
The bug is due to the following:
- When atomic flush is used, an empty CF is legally [excluded](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L39) in `SelectColumnFamiliesForAtomicFlush` as the first step of `DBImpl::FlushForGetLiveFiles` before [passing](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L42) the included CFDs to `AtomicFlushMemTables`.
- But [later](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2133) in `AtomicFlushMemTables`, `WaitUntilFlushWouldNotStallWrites` will [release the db mutex](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2403), during which data@seqno N can be inserted into the excluded CF and data@seqno M can be inserted into one of the included CFs, where M > N.
- However, data@seqno N in an already-excluded CF is thus excluded from this atomic flush while we seqno N is less than seqno M.
**Summary:**
- Replace `SelectColumnFamiliesForAtomicFlush()`-before-`AtomicFlushMemTables()` with `SelectColumnFamiliesForAtomicFlush()`-after-wait-within-`AtomicFlushMemTables()` so we ensure no write affecting the recoverability of this atomic job (i.e, change to max seqno of this atomic flush or insertion of data with less seqno than the max seqno of the atomic flush to excluded CF) can happen after calling `SelectColumnFamiliesForAtomicFlush()`.
- For above, refactored and clarified comments on `SelectColumnFamiliesForAtomicFlush()` and `AtomicFlushMemTables()` for clearer semantics of passed-in CFDs to atomic-flush
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11148
Test Plan:
- New unit test failed before the fix and passes after
- Make check
- Rehearsal stress test
Reviewed By: ajkr
Differential Revision: D42799871
Pulled By: hx235
fbshipit-source-id: 13636b63e9c25c5895857afc36ea580d57f6d644
2023-03-14 23:53:20 +00:00
|
|
|
s = AtomicFlushMemTables(fo, FlushReason::kManualCompaction);
|
2018-10-26 22:06:44 +00:00
|
|
|
} else {
|
Fix bug of prematurely excluded CF in atomic flush contains unflushed data that should've been included in the atomic flush (#11148)
Summary:
**Context:**
Atomic flush should guarantee recoverability of all data of seqno up to the max seqno of the flush. It achieves this by ensuring all such data are flushed by the time this atomic flush finishes through `SelectColumnFamiliesForAtomicFlush()`. However, our crash test exposed the following case where an excluded CF from an atomic flush contains unflushed data of seqno less than the max seqno of that atomic flush and loses its data with `WriteOptions::DisableWAL=true` in face of a crash right after the atomic flush finishes .
```
./db_stress --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
pid=$!
sleep 0.2
sleep 10
kill $pid
sleep 0.2
./db_stress --ops_per_thread=1 --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
pid=$!
sleep 0.2
sleep 40
kill $pid
sleep 0.2
Verification failed for column family 6 key 0000000000000239000000000000012B0000000000000138 (56622): value_from_db: , value_from_expected: 4A6331754E4F4C4D42434041464744455A5B58595E5F5C5D5253505156575455, msg: Value not found: NotFound:
Crash-recovery verification failed :(
No writes or ops?
Verification failed :(
```
The bug is due to the following:
- When atomic flush is used, an empty CF is legally [excluded](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L39) in `SelectColumnFamiliesForAtomicFlush` as the first step of `DBImpl::FlushForGetLiveFiles` before [passing](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L42) the included CFDs to `AtomicFlushMemTables`.
- But [later](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2133) in `AtomicFlushMemTables`, `WaitUntilFlushWouldNotStallWrites` will [release the db mutex](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2403), during which data@seqno N can be inserted into the excluded CF and data@seqno M can be inserted into one of the included CFs, where M > N.
- However, data@seqno N in an already-excluded CF is thus excluded from this atomic flush while we seqno N is less than seqno M.
**Summary:**
- Replace `SelectColumnFamiliesForAtomicFlush()`-before-`AtomicFlushMemTables()` with `SelectColumnFamiliesForAtomicFlush()`-after-wait-within-`AtomicFlushMemTables()` so we ensure no write affecting the recoverability of this atomic job (i.e, change to max seqno of this atomic flush or insertion of data with less seqno than the max seqno of the atomic flush to excluded CF) can happen after calling `SelectColumnFamiliesForAtomicFlush()`.
- For above, refactored and clarified comments on `SelectColumnFamiliesForAtomicFlush()` and `AtomicFlushMemTables()` for clearer semantics of passed-in CFDs to atomic-flush
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11148
Test Plan:
- New unit test failed before the fix and passes after
- Make check
- Rehearsal stress test
Reviewed By: ajkr
Differential Revision: D42799871
Pulled By: hx235
fbshipit-source-id: 13636b63e9c25c5895857afc36ea580d57f6d644
2023-03-14 23:53:20 +00:00
|
|
|
s = FlushMemTable(cfd, fo, FlushReason::kManualCompaction);
|
2018-10-26 22:06:44 +00:00
|
|
|
}
|
2018-02-12 23:34:39 +00:00
|
|
|
if (!s.ok()) {
|
|
|
|
LogFlush(immutable_db_options_.info_log);
|
|
|
|
return s;
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
2020-03-05 04:12:23 +00:00
|
|
|
constexpr int kInvalidLevel = -1;
|
|
|
|
int final_output_level = kInvalidLevel;
|
|
|
|
bool exclusive = options.exclusive_manual_compaction;
|
2017-04-06 00:14:05 +00:00
|
|
|
if (cfd->ioptions()->compaction_style == kCompactionStyleUniversal &&
|
|
|
|
cfd->NumberLevels() > 1) {
|
|
|
|
// Always compact all files together.
|
2017-05-17 18:32:26 +00:00
|
|
|
final_output_level = cfd->NumberLevels() - 1;
|
|
|
|
// if bottom most level is reserved
|
|
|
|
if (immutable_db_options_.allow_ingest_behind) {
|
|
|
|
final_output_level--;
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
s = RunManualCompaction(cfd, ColumnFamilyData::kCompactAllLevels,
|
2019-04-17 06:29:32 +00:00
|
|
|
final_output_level, options, begin, end, exclusive,
|
Always allow L0->L1 trivial move during manual compaction (#11375)
Summary:
during manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with compacting key range (introduced in https://github.com/facebook/rocksdb/issues/7368 to enforce kForce* contract). This can cause large memory usage due to compaction readahead when number of L0 files is large. This PR allows L0->L1 trivial move in this case, and will do a L1 -> L1 intra-level compaction when needed (`bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 file, and user calls CompactRange(kForce, nullptr, nullptr),
- before this PR, RocksDB does a L0 -> L1 compaction (disallow trivial move),
- after this PR, RocksDB does a L0 -> L1 compaction (allow trivial move), and a L1 -> L1 compaction.
Users can use kForceOptimized to avoid this extra L1->L1 compaction overhead when L0s are overlapping and cannot be trivial moved.
This PR also fixed a bug (see previous discussion in https://github.com/facebook/rocksdb/issues/11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause incorrect level being moved when CompactRangeOptions::change_level is specified.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11375
Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed.
Reviewed By: ajkr
Differential Revision: D44943518
Pulled By: cbi42
fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708
2023-04-20 18:10:48 +00:00
|
|
|
false /* disable_trivial_move */,
|
|
|
|
std::numeric_limits<uint64_t>::max(), trim_ts);
|
2017-04-06 00:14:05 +00:00
|
|
|
} else {
|
2020-03-05 04:12:23 +00:00
|
|
|
int first_overlapped_level = kInvalidLevel;
|
|
|
|
{
|
|
|
|
SuperVersion* super_version = cfd->GetReferencedSuperVersion(this);
|
|
|
|
Version* current_version = super_version->current;
|
2022-12-21 23:41:10 +00:00
|
|
|
|
|
|
|
// Might need to query the partitioner
|
|
|
|
SstPartitionerFactory* partitioner_factory =
|
|
|
|
current_version->cfd()->ioptions()->sst_partitioner_factory.get();
|
|
|
|
std::unique_ptr<SstPartitioner> partitioner;
|
|
|
|
if (partitioner_factory && begin != nullptr && end != nullptr) {
|
|
|
|
SstPartitioner::Context context;
|
|
|
|
context.is_full_compaction = false;
|
|
|
|
context.is_manual_compaction = true;
|
|
|
|
context.output_level = /*unknown*/ -1;
|
|
|
|
// Small lies about compaction range
|
|
|
|
context.smallest_user_key = *begin;
|
|
|
|
context.largest_user_key = *end;
|
|
|
|
partitioner = partitioner_factory->CreatePartitioner(context);
|
|
|
|
}
|
|
|
|
|
2020-03-05 04:12:23 +00:00
|
|
|
ReadOptions ro;
|
|
|
|
ro.total_order_seek = true;
|
2023-04-21 16:07:18 +00:00
|
|
|
ro.io_activity = Env::IOActivity::kCompaction;
|
2020-03-05 04:12:23 +00:00
|
|
|
bool overlap;
|
|
|
|
for (int level = 0;
|
|
|
|
level < current_version->storage_info()->num_non_empty_levels();
|
|
|
|
level++) {
|
|
|
|
overlap = true;
|
2022-12-21 23:41:10 +00:00
|
|
|
|
|
|
|
// Whether to look at specific keys within files for overlap with
|
|
|
|
// compaction range, other than largest and smallest keys of the file
|
|
|
|
// known in Version metadata.
|
|
|
|
bool check_overlap_within_file = false;
|
2020-03-05 04:12:23 +00:00
|
|
|
if (begin != nullptr && end != nullptr) {
|
2022-12-21 23:41:10 +00:00
|
|
|
// Typically checking overlap within files in this case
|
|
|
|
check_overlap_within_file = true;
|
|
|
|
// WART: Not known why we don't check within file in one-sided bound
|
|
|
|
// cases
|
|
|
|
if (partitioner) {
|
|
|
|
// Especially if the partitioner is new, the manual compaction
|
|
|
|
// might be used to enforce the partitioning. Checking overlap
|
|
|
|
// within files might miss cases where compaction is needed to
|
|
|
|
// partition the files, as in this example:
|
|
|
|
// * File has two keys "001" and "111"
|
|
|
|
// * Compaction range is ["011", "101")
|
|
|
|
// * Partition boundary at "100"
|
|
|
|
// In cases like this, file-level overlap with the compaction
|
|
|
|
// range is sufficient to force any partitioning that is needed
|
|
|
|
// within the compaction range.
|
|
|
|
//
|
|
|
|
// But if there's no partitioning boundary within the compaction
|
|
|
|
// range, we can be sure there's no need to fix partitioning
|
|
|
|
// within that range, thus safe to check overlap within file.
|
|
|
|
//
|
|
|
|
// Use a hypothetical trivial move query to check for partition
|
|
|
|
// boundary in range. (NOTE: in defiance of all conventions,
|
|
|
|
// `begin` and `end` here are both INCLUSIVE bounds, which makes
|
|
|
|
// this analogy to CanDoTrivialMove() accurate even when `end` is
|
|
|
|
// the first key in a partition.)
|
|
|
|
if (!partitioner->CanDoTrivialMove(*begin, *end)) {
|
|
|
|
check_overlap_within_file = false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (check_overlap_within_file) {
|
2020-03-05 04:12:23 +00:00
|
|
|
Status status = current_version->OverlapWithLevelIterator(
|
|
|
|
ro, file_options_, *begin, *end, level, &overlap);
|
|
|
|
if (!status.ok()) {
|
2022-12-21 23:41:10 +00:00
|
|
|
check_overlap_within_file = false;
|
2020-03-05 04:12:23 +00:00
|
|
|
}
|
2022-12-21 23:41:10 +00:00
|
|
|
}
|
|
|
|
if (!check_overlap_within_file) {
|
2020-03-05 04:12:23 +00:00
|
|
|
overlap = current_version->storage_info()->OverlapInLevel(level,
|
|
|
|
begin, end);
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
2020-03-05 04:12:23 +00:00
|
|
|
if (overlap) {
|
2023-06-01 22:27:29 +00:00
|
|
|
first_overlapped_level = level;
|
|
|
|
break;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
}
|
2020-03-05 04:12:23 +00:00
|
|
|
CleanupSuperVersion(super_version);
|
|
|
|
}
|
|
|
|
if (s.ok() && first_overlapped_level != kInvalidLevel) {
|
Always allow L0->L1 trivial move during manual compaction (#11375)
Summary:
during manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with compacting key range (introduced in https://github.com/facebook/rocksdb/issues/7368 to enforce kForce* contract). This can cause large memory usage due to compaction readahead when number of L0 files is large. This PR allows L0->L1 trivial move in this case, and will do a L1 -> L1 intra-level compaction when needed (`bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 file, and user calls CompactRange(kForce, nullptr, nullptr),
- before this PR, RocksDB does a L0 -> L1 compaction (disallow trivial move),
- after this PR, RocksDB does a L0 -> L1 compaction (allow trivial move), and a L1 -> L1 compaction.
Users can use kForceOptimized to avoid this extra L1->L1 compaction overhead when L0s are overlapping and cannot be trivial moved.
This PR also fixed a bug (see previous discussion in https://github.com/facebook/rocksdb/issues/11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause incorrect level being moved when CompactRangeOptions::change_level is specified.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11375
Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed.
Reviewed By: ajkr
Differential Revision: D44943518
Pulled By: cbi42
fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708
2023-04-20 18:10:48 +00:00
|
|
|
if (cfd->ioptions()->compaction_style == kCompactionStyleUniversal ||
|
|
|
|
cfd->ioptions()->compaction_style == kCompactionStyleFIFO) {
|
|
|
|
assert(first_overlapped_level == 0);
|
|
|
|
s = RunManualCompaction(
|
|
|
|
cfd, first_overlapped_level, first_overlapped_level, options, begin,
|
|
|
|
end, exclusive, true /* disallow_trivial_move */,
|
|
|
|
std::numeric_limits<uint64_t>::max() /* max_file_num_to_ignore */,
|
|
|
|
trim_ts);
|
2023-06-01 22:27:29 +00:00
|
|
|
final_output_level = first_overlapped_level;
|
Always allow L0->L1 trivial move during manual compaction (#11375)
Summary:
during manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with compacting key range (introduced in https://github.com/facebook/rocksdb/issues/7368 to enforce kForce* contract). This can cause large memory usage due to compaction readahead when number of L0 files is large. This PR allows L0->L1 trivial move in this case, and will do a L1 -> L1 intra-level compaction when needed (`bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 file, and user calls CompactRange(kForce, nullptr, nullptr),
- before this PR, RocksDB does a L0 -> L1 compaction (disallow trivial move),
- after this PR, RocksDB does a L0 -> L1 compaction (allow trivial move), and a L1 -> L1 compaction.
Users can use kForceOptimized to avoid this extra L1->L1 compaction overhead when L0s are overlapping and cannot be trivial moved.
This PR also fixed a bug (see previous discussion in https://github.com/facebook/rocksdb/issues/11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause incorrect level being moved when CompactRangeOptions::change_level is specified.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11375
Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed.
Reviewed By: ajkr
Differential Revision: D44943518
Pulled By: cbi42
fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708
2023-04-20 18:10:48 +00:00
|
|
|
} else {
|
|
|
|
assert(cfd->ioptions()->compaction_style == kCompactionStyleLevel);
|
|
|
|
uint64_t next_file_number = versions_->current_next_file_number();
|
|
|
|
// Start compaction from `first_overlapped_level`, one level down at a
|
|
|
|
// time, until output level >= max_overlapped_level.
|
|
|
|
// When max_overlapped_level == 0, we will still compact from L0 -> L1
|
|
|
|
// (or LBase), and followed by a bottommost level intra-level compaction
|
|
|
|
// at L1 (or LBase), if applicable.
|
|
|
|
int level = first_overlapped_level;
|
|
|
|
final_output_level = level;
|
2023-05-09 22:43:43 +00:00
|
|
|
int output_level = 0, base_level = 0;
|
2023-06-01 22:27:29 +00:00
|
|
|
for (;;) {
|
|
|
|
// Always allow L0 -> L1 compaction
|
|
|
|
if (level > 0) {
|
|
|
|
if (cfd->ioptions()->level_compaction_dynamic_level_bytes) {
|
|
|
|
assert(final_output_level < cfd->ioptions()->num_levels);
|
|
|
|
if (final_output_level + 1 == cfd->ioptions()->num_levels) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
// TODO(cbi): there is still a race condition here where
|
|
|
|
// if a background compaction compacts some file beyond
|
|
|
|
// current()->storage_info()->num_non_empty_levels() right after
|
|
|
|
// the check here.This should happen very infrequently and should
|
|
|
|
// not happen once a user populates the last level of the LSM.
|
|
|
|
InstrumentedMutexLock l(&mutex_);
|
|
|
|
// num_non_empty_levels may be lower after a compaction, so
|
|
|
|
// we check for >= here.
|
|
|
|
if (final_output_level + 1 >=
|
|
|
|
cfd->current()->storage_info()->num_non_empty_levels()) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2020-03-05 04:12:23 +00:00
|
|
|
output_level = level + 1;
|
Always allow L0->L1 trivial move during manual compaction (#11375)
Summary:
during manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with compacting key range (introduced in https://github.com/facebook/rocksdb/issues/7368 to enforce kForce* contract). This can cause large memory usage due to compaction readahead when number of L0 files is large. This PR allows L0->L1 trivial move in this case, and will do a L1 -> L1 intra-level compaction when needed (`bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 file, and user calls CompactRange(kForce, nullptr, nullptr),
- before this PR, RocksDB does a L0 -> L1 compaction (disallow trivial move),
- after this PR, RocksDB does a L0 -> L1 compaction (allow trivial move), and a L1 -> L1 compaction.
Users can use kForceOptimized to avoid this extra L1->L1 compaction overhead when L0s are overlapping and cannot be trivial moved.
This PR also fixed a bug (see previous discussion in https://github.com/facebook/rocksdb/issues/11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause incorrect level being moved when CompactRangeOptions::change_level is specified.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11375
Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed.
Reviewed By: ajkr
Differential Revision: D44943518
Pulled By: cbi42
fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708
2023-04-20 18:10:48 +00:00
|
|
|
if (cfd->ioptions()->level_compaction_dynamic_level_bytes &&
|
2020-03-05 04:12:23 +00:00
|
|
|
level == 0) {
|
|
|
|
output_level = ColumnFamilyData::kCompactToBaseLevel;
|
|
|
|
}
|
Always allow L0->L1 trivial move during manual compaction (#11375)
Summary:
during manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with compacting key range (introduced in https://github.com/facebook/rocksdb/issues/7368 to enforce kForce* contract). This can cause large memory usage due to compaction readahead when number of L0 files is large. This PR allows L0->L1 trivial move in this case, and will do a L1 -> L1 intra-level compaction when needed (`bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 file, and user calls CompactRange(kForce, nullptr, nullptr),
- before this PR, RocksDB does a L0 -> L1 compaction (disallow trivial move),
- after this PR, RocksDB does a L0 -> L1 compaction (allow trivial move), and a L1 -> L1 compaction.
Users can use kForceOptimized to avoid this extra L1->L1 compaction overhead when L0s are overlapping and cannot be trivial moved.
This PR also fixed a bug (see previous discussion in https://github.com/facebook/rocksdb/issues/11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause incorrect level being moved when CompactRangeOptions::change_level is specified.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11375
Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed.
Reviewed By: ajkr
Differential Revision: D44943518
Pulled By: cbi42
fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708
2023-04-20 18:10:48 +00:00
|
|
|
// Use max value for `max_file_num_to_ignore` to always compact
|
|
|
|
// files down.
|
|
|
|
s = RunManualCompaction(
|
|
|
|
cfd, level, output_level, options, begin, end, exclusive,
|
|
|
|
!trim_ts.empty() /* disallow_trivial_move */,
|
|
|
|
std::numeric_limits<uint64_t>::max() /* max_file_num_to_ignore */,
|
|
|
|
trim_ts,
|
|
|
|
output_level == ColumnFamilyData::kCompactToBaseLevel
|
|
|
|
? &base_level
|
|
|
|
: nullptr);
|
|
|
|
if (!s.ok()) {
|
|
|
|
break;
|
2020-10-07 20:16:40 +00:00
|
|
|
}
|
Always allow L0->L1 trivial move during manual compaction (#11375)
Summary:
during manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with compacting key range (introduced in https://github.com/facebook/rocksdb/issues/7368 to enforce kForce* contract). This can cause large memory usage due to compaction readahead when number of L0 files is large. This PR allows L0->L1 trivial move in this case, and will do a L1 -> L1 intra-level compaction when needed (`bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 file, and user calls CompactRange(kForce, nullptr, nullptr),
- before this PR, RocksDB does a L0 -> L1 compaction (disallow trivial move),
- after this PR, RocksDB does a L0 -> L1 compaction (allow trivial move), and a L1 -> L1 compaction.
Users can use kForceOptimized to avoid this extra L1->L1 compaction overhead when L0s are overlapping and cannot be trivial moved.
This PR also fixed a bug (see previous discussion in https://github.com/facebook/rocksdb/issues/11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause incorrect level being moved when CompactRangeOptions::change_level is specified.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11375
Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed.
Reviewed By: ajkr
Differential Revision: D44943518
Pulled By: cbi42
fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708
2023-04-20 18:10:48 +00:00
|
|
|
if (output_level == ColumnFamilyData::kCompactToBaseLevel) {
|
|
|
|
assert(base_level > 0);
|
|
|
|
level = base_level;
|
|
|
|
} else {
|
|
|
|
++level;
|
|
|
|
}
|
|
|
|
final_output_level = level;
|
|
|
|
TEST_SYNC_POINT("DBImpl::RunManualCompaction()::1");
|
|
|
|
TEST_SYNC_POINT("DBImpl::RunManualCompaction()::2");
|
2020-03-05 04:12:23 +00:00
|
|
|
}
|
Always allow L0->L1 trivial move during manual compaction (#11375)
Summary:
during manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with compacting key range (introduced in https://github.com/facebook/rocksdb/issues/7368 to enforce kForce* contract). This can cause large memory usage due to compaction readahead when number of L0 files is large. This PR allows L0->L1 trivial move in this case, and will do a L1 -> L1 intra-level compaction when needed (`bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 file, and user calls CompactRange(kForce, nullptr, nullptr),
- before this PR, RocksDB does a L0 -> L1 compaction (disallow trivial move),
- after this PR, RocksDB does a L0 -> L1 compaction (allow trivial move), and a L1 -> L1 compaction.
Users can use kForceOptimized to avoid this extra L1->L1 compaction overhead when L0s are overlapping and cannot be trivial moved.
This PR also fixed a bug (see previous discussion in https://github.com/facebook/rocksdb/issues/11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause incorrect level being moved when CompactRangeOptions::change_level is specified.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11375
Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed.
Reviewed By: ajkr
Differential Revision: D44943518
Pulled By: cbi42
fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708
2023-04-20 18:10:48 +00:00
|
|
|
if (s.ok()) {
|
|
|
|
assert(final_output_level > 0);
|
|
|
|
// bottommost level intra-level compaction
|
|
|
|
if ((options.bottommost_level_compaction ==
|
|
|
|
BottommostLevelCompaction::kIfHaveCompactionFilter &&
|
|
|
|
(cfd->ioptions()->compaction_filter != nullptr ||
|
|
|
|
cfd->ioptions()->compaction_filter_factory != nullptr)) ||
|
|
|
|
options.bottommost_level_compaction ==
|
|
|
|
BottommostLevelCompaction::kForceOptimized ||
|
|
|
|
options.bottommost_level_compaction ==
|
|
|
|
BottommostLevelCompaction::kForce) {
|
|
|
|
// Use `next_file_number` as `max_file_num_to_ignore` to avoid
|
2023-06-01 22:27:29 +00:00
|
|
|
// rewriting newly compacted files when it is kForceOptimized
|
|
|
|
// or kIfHaveCompactionFilter with compaction filter set.
|
Always allow L0->L1 trivial move during manual compaction (#11375)
Summary:
during manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with compacting key range (introduced in https://github.com/facebook/rocksdb/issues/7368 to enforce kForce* contract). This can cause large memory usage due to compaction readahead when number of L0 files is large. This PR allows L0->L1 trivial move in this case, and will do a L1 -> L1 intra-level compaction when needed (`bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 file, and user calls CompactRange(kForce, nullptr, nullptr),
- before this PR, RocksDB does a L0 -> L1 compaction (disallow trivial move),
- after this PR, RocksDB does a L0 -> L1 compaction (allow trivial move), and a L1 -> L1 compaction.
Users can use kForceOptimized to avoid this extra L1->L1 compaction overhead when L0s are overlapping and cannot be trivial moved.
This PR also fixed a bug (see previous discussion in https://github.com/facebook/rocksdb/issues/11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause incorrect level being moved when CompactRangeOptions::change_level is specified.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11375
Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed.
Reviewed By: ajkr
Differential Revision: D44943518
Pulled By: cbi42
fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708
2023-04-20 18:10:48 +00:00
|
|
|
s = RunManualCompaction(
|
|
|
|
cfd, final_output_level, final_output_level, options, begin,
|
2023-06-01 22:27:29 +00:00
|
|
|
end, exclusive, true /* disallow_trivial_move */,
|
Always allow L0->L1 trivial move during manual compaction (#11375)
Summary:
during manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with compacting key range (introduced in https://github.com/facebook/rocksdb/issues/7368 to enforce kForce* contract). This can cause large memory usage due to compaction readahead when number of L0 files is large. This PR allows L0->L1 trivial move in this case, and will do a L1 -> L1 intra-level compaction when needed (`bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 file, and user calls CompactRange(kForce, nullptr, nullptr),
- before this PR, RocksDB does a L0 -> L1 compaction (disallow trivial move),
- after this PR, RocksDB does a L0 -> L1 compaction (allow trivial move), and a L1 -> L1 compaction.
Users can use kForceOptimized to avoid this extra L1->L1 compaction overhead when L0s are overlapping and cannot be trivial moved.
This PR also fixed a bug (see previous discussion in https://github.com/facebook/rocksdb/issues/11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause incorrect level being moved when CompactRangeOptions::change_level is specified.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11375
Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed.
Reviewed By: ajkr
Differential Revision: D44943518
Pulled By: cbi42
fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708
2023-04-20 18:10:48 +00:00
|
|
|
next_file_number /* max_file_num_to_ignore */, trim_ts);
|
|
|
|
}
|
2020-03-05 04:12:23 +00:00
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2020-03-05 04:12:23 +00:00
|
|
|
if (!s.ok() || final_output_level == kInvalidLevel) {
|
2017-04-06 00:14:05 +00:00
|
|
|
LogFlush(immutable_db_options_.info_log);
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (options.change_level) {
|
2020-08-17 21:20:21 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::CompactRange:BeforeRefit:1");
|
|
|
|
TEST_SYNC_POINT("DBImpl::CompactRange:BeforeRefit:2");
|
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
ROCKS_LOG_INFO(immutable_db_options_.info_log,
|
|
|
|
"[RefitLevel] waiting for background threads to stop");
|
Add missing range conflict check between file ingestion and RefitLevel() (#10988)
Summary:
**Context:**
File ingestion never checks whether the key range it acts on overlaps with an ongoing RefitLevel() (used in `CompactRange()` with `change_level=true`). That's because RefitLevel() doesn't register and make its key range known to file ingestion. Though it checks overlapping with other compactions by https://github.com/facebook/rocksdb/blob/7.8.fb/db/external_sst_file_ingestion_job.cc#L998.
RefitLevel() (used in `CompactRange()` with `change_level=true`) doesn't check whether the key range it acts on overlaps with an ongoing file ingestion. That's because file ingestion does not register and make its key range known to other compactions.
- Note that non-refitlevel-compaction (e.g, manual compaction w/o RefitLevel() or general compaction) also does not check key range overlap with ongoing file ingestion for the same reason.
- But it's fine. Credited to cbi42's discovery, `WaitForIngestFile` was called by background and foreground compactions. They were introduced in https://github.com/facebook/rocksdb/commit/0f88160f67d36ea30e3aca3a3cef924c3a009be6, https://github.com/facebook/rocksdb/commit/5c64fb67d2fc198f1a73ff3ae543749a6a41f513 and https://github.com/facebook/rocksdb/commit/87dfc1d23e0e16ff73e15f63c6fa0fb3b3fc8c8c.
- Regardless, this PR registers file ingestion like a compaction is a general approach that will also add range conflict check between file ingestion and non-refitlevel-compaction, though it has not been the issue motivated this PR.
Above are bugs resulting in two bad consequences:
- If file ingestion and RefitLevel() creates files in the same level, then range-overlapped files will be created at that level and caught as corruption by `force_consistency_checks=true`
- If file ingestion and RefitLevel() creates file in different levels, then with one further compaction on the ingested file, it can result in two same keys both with seqno 0 in two different levels. Then with iterator's [optimization](https://github.com/facebook/rocksdb/blame/c62f3221698fd273b673d4f7e54eabb8329a4369/db/db_iter.cc#L342-L343) that assumes no two same keys both with seqno 0, it will either break this assertion in debug build or, even worst, return value of this same key for the key after it, which is the wrong value to return, in release build.
Therefore we decide to introduce range conflict check for file ingestion and RefitLevel() inspired from the existing range conflict check among compactions.
**Summary:**
- Treat file ingestion job and RefitLevel() as `Compaction` of new compaction reasons: `CompactionReason::kExternalSstIngestion` and `CompactionReason::kRefitLevel` and register/unregister them. File ingestion is treated as compaction from L0 to different levels and RefitLevel() as compaction from source level to target level.
- Check for `RangeOverlapWithCompaction` with other ongoing compactions, `RegisterCompaction()` on this "compaction" before changing the LSM state in `VersionStorageInfo`, and `UnregisterCompaction()` after changing.
- Replace scattered fixes (https://github.com/facebook/rocksdb/commit/0f88160f67d36ea30e3aca3a3cef924c3a009be6, https://github.com/facebook/rocksdb/commit/5c64fb67d2fc198f1a73ff3ae543749a6a41f513 and https://github.com/facebook/rocksdb/commit/87dfc1d23e0e16ff73e15f63c6fa0fb3b3fc8c8c.) that prevents overlapping between file ingestion and non-refit-level compaction with this fix cuz those practices are easy to overlook.
- Misc: logic cleanup, see PR comments
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10988
Test Plan:
- New unit test `DBCompactionTestWithOngoingFileIngestionParam*` that failed pre-fix and passed afterwards.
- Made compatible with existing tests, see PR comments
- make check
- [Ongoing] Stress test rehearsal with normal value and aggressive CI value https://github.com/facebook/rocksdb/pull/10761
Reviewed By: cbi42
Differential Revision: D41535685
Pulled By: hx235
fbshipit-source-id: 549833a577ba1496d20a870583d4caa737da1258
2022-12-29 23:05:36 +00:00
|
|
|
// TODO(hx235): remove `Enable/DisableManualCompaction` and
|
|
|
|
// `Continue/PauseBackgroundWork` once we ensure registering RefitLevel()'s
|
|
|
|
// range is sufficient (if not, what else is needed) for avoiding range
|
|
|
|
// conflicts with other activities (e.g, compaction, flush) that are
|
|
|
|
// currently avoided by `Enable/DisableManualCompaction` and
|
|
|
|
// `Continue/PauseBackgroundWork`.
|
2020-08-14 18:28:12 +00:00
|
|
|
DisableManualCompaction();
|
2017-04-06 00:14:05 +00:00
|
|
|
s = PauseBackgroundWork();
|
|
|
|
if (s.ok()) {
|
2020-08-14 18:28:12 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::CompactRange:PreRefitLevel");
|
2017-04-06 00:14:05 +00:00
|
|
|
s = ReFitLevel(cfd, final_output_level, options.target_level);
|
2020-08-14 18:28:12 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::CompactRange:PostRefitLevel");
|
2020-09-25 04:47:43 +00:00
|
|
|
// ContinueBackgroundWork always return Status::OK().
|
2020-10-22 03:16:30 +00:00
|
|
|
Status temp_s = ContinueBackgroundWork();
|
|
|
|
assert(temp_s.ok());
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
2020-08-14 18:28:12 +00:00
|
|
|
EnableManualCompaction();
|
Prevent corruption with parallel manual compactions and `change_level == true` (#9077)
Summary:
The bug can impact the following scenario. There must be two `CompactRange()`s, call them A and B. Compaction A must have `change_level=true`. Compactions A and B must run in parallel, and new data must be added while they run as well.
Now, on to the details of the race condition. Compaction A must reach the refitting phase while B's next step is to trivial move new data (i.e., data that has been inserted behind A) down to the same level that A's refit targets (`CompactRangeOptions::target_level`). B must be unregistered (i.e., has not yet called `AddManualCompaction()` for the current `RunManualCompaction()`) while A invokes `DisableManualCompaction()`s to prepare for refitting. In the old code, B could still proceed to register a manual compaction, while A had disabled manual compaction.
The next part of the race condition is B picks and schedules a trivial move while A has released the lock in refitting phase in order to persist the LSM state change (i.e., the log phase of `LogAndApply()`). That way, B does not see the refitted data when picking a trivial-move compaction. So it is susceptible to picking one that overlaps.
Finally, B executes the picked trivial-move compaction. Trivial-move compactions are special in that they never check whether manual compaction is disabled. So the picked compaction causing overlap ends up being applied, leading to LSM corruption if `force_consistency_checks=false`, or entering read-only mode with `Status::Corruption` if `force_consistency_checks=true` (the default).
The fix is just to prevent B from registering itself in `RunManualCompaction()` while manual compactions are disabled, consequently preventing any trivial move or other compaction from being picked/scheduled.
Thanks to siying for finding the bug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9077
Test Plan: The test does not go all the way in exposing the bug because it requires a compaction to be picked/scheduled while logging LSM state change for RefitLevel(). But the fix is to make such a compaction not picked/scheduled in the first place, so any repro of that scenario would end up hanging RefitLevel() logging. So instead I just verified no such compaction is registered in the scenario where `RefitLevel()` disables manual compactions.
Reviewed By: siying
Differential Revision: D31921908
Pulled By: ajkr
fbshipit-source-id: 9bb5d0e847ad428211227f40830c685c209fbecb
2021-10-28 06:07:29 +00:00
|
|
|
TEST_SYNC_POINT(
|
|
|
|
"DBImpl::CompactRange:PostRefitLevel:ManualCompactionEnabled");
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
LogFlush(immutable_db_options_.info_log);
|
|
|
|
|
|
|
|
{
|
|
|
|
InstrumentedMutexLock l(&mutex_);
|
|
|
|
// an automatic compaction that has been scheduled might have been
|
|
|
|
// preempted by the manual compactions. Need to schedule it back.
|
|
|
|
MaybeScheduleFlushOrCompaction();
|
|
|
|
}
|
|
|
|
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
2018-04-13 00:55:14 +00:00
|
|
|
Status DBImpl::CompactFiles(const CompactionOptions& compact_options,
|
|
|
|
ColumnFamilyHandle* column_family,
|
|
|
|
const std::vector<std::string>& input_file_names,
|
|
|
|
const int output_level, const int output_path_id,
|
2018-12-13 22:12:02 +00:00
|
|
|
std::vector<std::string>* const output_file_names,
|
|
|
|
CompactionJobInfo* compaction_job_info) {
|
2017-04-06 00:14:05 +00:00
|
|
|
if (column_family == nullptr) {
|
|
|
|
return Status::InvalidArgument("ColumnFamilyHandle must be non-null.");
|
|
|
|
}
|
|
|
|
|
2020-07-03 02:24:25 +00:00
|
|
|
auto cfd =
|
|
|
|
static_cast_with_check<ColumnFamilyHandleImpl>(column_family)->cfd();
|
2017-04-06 00:14:05 +00:00
|
|
|
assert(cfd);
|
|
|
|
|
|
|
|
Status s;
|
2021-05-20 04:40:43 +00:00
|
|
|
JobContext job_context(next_job_id_.fetch_add(1), true);
|
2017-04-06 00:14:05 +00:00
|
|
|
LogBuffer log_buffer(InfoLogLevel::INFO_LEVEL,
|
|
|
|
immutable_db_options_.info_log.get());
|
|
|
|
|
|
|
|
// Perform CompactFiles
|
2018-11-12 22:30:21 +00:00
|
|
|
TEST_SYNC_POINT("TestCompactFiles::IngestExternalFile2");
|
Prefer static_cast in place of most reinterpret_cast (#12308)
Summary:
The following are risks associated with pointer-to-pointer reinterpret_cast:
* Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do.
* Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally.
I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement:
* Going through `void*` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have
`struct Derived: public Base1, public Base2`. If we have `Derived*` -> `void*` -> `Base2*` -> `Derived*` through reinterpret casts, this could plausibly work (though technical UB) assuming the `Base2*` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic.
* Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived*` -> `Base2*` -> `Derived*` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance.
With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain.
A couple of related interventions included here:
* Previously Cache::Handle was not actually derived from in the implementations and just used as a `void*` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle.
* Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse).
Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work.
I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308
Test Plan: existing tests, CI
Reviewed By: ltamasi
Differential Revision: D53204947
Pulled By: pdillinger
fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2
2024-02-07 18:44:11 +00:00
|
|
|
TEST_SYNC_POINT_CALLBACK("TestCompactFiles:PausingManualCompaction:3",
|
|
|
|
static_cast<void*>(const_cast<std::atomic<int>*>(
|
|
|
|
&manual_compaction_paused_)));
|
2017-04-06 00:14:05 +00:00
|
|
|
{
|
|
|
|
InstrumentedMutexLock l(&mutex_);
|
2018-11-12 22:30:21 +00:00
|
|
|
auto* current = cfd->current();
|
|
|
|
current->Ref();
|
|
|
|
|
|
|
|
s = CompactFilesImpl(compact_options, cfd, current, input_file_names,
|
2018-04-13 00:55:14 +00:00
|
|
|
output_file_names, output_level, output_path_id,
|
2018-12-13 22:12:02 +00:00
|
|
|
&job_context, &log_buffer, compaction_job_info);
|
2018-11-12 22:30:21 +00:00
|
|
|
|
|
|
|
current->Unref();
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
// Find and delete obsolete files
|
|
|
|
{
|
|
|
|
InstrumentedMutexLock l(&mutex_);
|
|
|
|
// If !s.ok(), this means that Compaction failed. In that case, we want
|
|
|
|
// to delete all obsolete files we might have created and we force
|
|
|
|
// FindObsoleteFiles(). This is because job_context does not
|
|
|
|
// catch all created files if compaction failed.
|
|
|
|
FindObsoleteFiles(&job_context, !s.ok());
|
|
|
|
} // release the mutex
|
|
|
|
|
|
|
|
// delete unnecessary files if any, this is done outside the mutex
|
2018-01-12 21:16:39 +00:00
|
|
|
if (job_context.HaveSomethingToClean() ||
|
|
|
|
job_context.HaveSomethingToDelete() || !log_buffer.IsEmpty()) {
|
2017-04-06 00:14:05 +00:00
|
|
|
// Have to flush the info logs before bg_compaction_scheduled_--
|
|
|
|
// because if bg_flush_scheduled_ becomes 0 and the lock is
|
|
|
|
// released, the deconstructor of DB can kick in and destroy all the
|
|
|
|
// states of DB so info_log might not be available after that point.
|
|
|
|
// It also applies to access other states that DB owns.
|
|
|
|
log_buffer.FlushBufferToLog();
|
|
|
|
if (job_context.HaveSomethingToDelete()) {
|
|
|
|
// no mutex is locked here. No need to Unlock() and Lock() here.
|
|
|
|
PurgeObsoleteFiles(job_context);
|
|
|
|
}
|
|
|
|
job_context.Clean();
|
|
|
|
}
|
|
|
|
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
|
|
|
Status DBImpl::CompactFilesImpl(
|
|
|
|
const CompactionOptions& compact_options, ColumnFamilyData* cfd,
|
|
|
|
Version* version, const std::vector<std::string>& input_file_names,
|
2018-04-13 00:55:14 +00:00
|
|
|
std::vector<std::string>* const output_file_names, const int output_level,
|
2018-12-13 22:12:02 +00:00
|
|
|
int output_path_id, JobContext* job_context, LogBuffer* log_buffer,
|
|
|
|
CompactionJobInfo* compaction_job_info) {
|
2017-04-06 00:14:05 +00:00
|
|
|
mutex_.AssertHeld();
|
|
|
|
|
|
|
|
if (shutting_down_.load(std::memory_order_acquire)) {
|
|
|
|
return Status::ShutdownInProgress();
|
|
|
|
}
|
2020-08-14 18:28:12 +00:00
|
|
|
if (manual_compaction_paused_.load(std::memory_order_acquire) > 0) {
|
2019-09-17 04:00:13 +00:00
|
|
|
return Status::Incomplete(Status::SubCode::kManualCompactionPaused);
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
|
|
|
|
std::unordered_set<uint64_t> input_set;
|
2018-10-10 00:13:53 +00:00
|
|
|
for (const auto& file_name : input_file_names) {
|
2017-04-06 00:14:05 +00:00
|
|
|
input_set.insert(TableFileNameToNumber(file_name));
|
|
|
|
}
|
|
|
|
|
|
|
|
ColumnFamilyMetaData cf_meta;
|
|
|
|
// TODO(yhchiang): can directly use version here if none of the
|
|
|
|
// following functions call is pluggable to external developers.
|
|
|
|
version->GetColumnFamilyMetaData(&cf_meta);
|
|
|
|
|
|
|
|
if (output_path_id < 0) {
|
2018-04-06 02:49:06 +00:00
|
|
|
if (cfd->ioptions()->cf_paths.size() == 1U) {
|
2017-04-06 00:14:05 +00:00
|
|
|
output_path_id = 0;
|
|
|
|
} else {
|
|
|
|
return Status::NotSupported(
|
|
|
|
"Automatic output path selection is not "
|
|
|
|
"yet supported in CompactFiles()");
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-06-14 18:28:56 +00:00
|
|
|
if (cfd->ioptions()->allow_ingest_behind &&
|
|
|
|
output_level >= cfd->ioptions()->num_levels - 1) {
|
|
|
|
return Status::InvalidArgument(
|
|
|
|
"Exceed the maximum output level defined by "
|
|
|
|
"the current compaction algorithm with ingest_behind --- " +
|
|
|
|
std::to_string(cfd->ioptions()->num_levels - 1));
|
|
|
|
}
|
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
Status s = cfd->compaction_picker()->SanitizeCompactionInputFiles(
|
2022-11-29 18:56:42 +00:00
|
|
|
&input_set, cf_meta, output_level);
|
Add missing range conflict check between file ingestion and RefitLevel() (#10988)
Summary:
**Context:**
File ingestion never checks whether the key range it acts on overlaps with an ongoing RefitLevel() (used in `CompactRange()` with `change_level=true`). That's because RefitLevel() doesn't register and make its key range known to file ingestion. Though it checks overlapping with other compactions by https://github.com/facebook/rocksdb/blob/7.8.fb/db/external_sst_file_ingestion_job.cc#L998.
RefitLevel() (used in `CompactRange()` with `change_level=true`) doesn't check whether the key range it acts on overlaps with an ongoing file ingestion. That's because file ingestion does not register and make its key range known to other compactions.
- Note that non-refitlevel-compaction (e.g, manual compaction w/o RefitLevel() or general compaction) also does not check key range overlap with ongoing file ingestion for the same reason.
- But it's fine. Credited to cbi42's discovery, `WaitForIngestFile` was called by background and foreground compactions. They were introduced in https://github.com/facebook/rocksdb/commit/0f88160f67d36ea30e3aca3a3cef924c3a009be6, https://github.com/facebook/rocksdb/commit/5c64fb67d2fc198f1a73ff3ae543749a6a41f513 and https://github.com/facebook/rocksdb/commit/87dfc1d23e0e16ff73e15f63c6fa0fb3b3fc8c8c.
- Regardless, this PR registers file ingestion like a compaction is a general approach that will also add range conflict check between file ingestion and non-refitlevel-compaction, though it has not been the issue motivated this PR.
Above are bugs resulting in two bad consequences:
- If file ingestion and RefitLevel() creates files in the same level, then range-overlapped files will be created at that level and caught as corruption by `force_consistency_checks=true`
- If file ingestion and RefitLevel() creates file in different levels, then with one further compaction on the ingested file, it can result in two same keys both with seqno 0 in two different levels. Then with iterator's [optimization](https://github.com/facebook/rocksdb/blame/c62f3221698fd273b673d4f7e54eabb8329a4369/db/db_iter.cc#L342-L343) that assumes no two same keys both with seqno 0, it will either break this assertion in debug build or, even worst, return value of this same key for the key after it, which is the wrong value to return, in release build.
Therefore we decide to introduce range conflict check for file ingestion and RefitLevel() inspired from the existing range conflict check among compactions.
**Summary:**
- Treat file ingestion job and RefitLevel() as `Compaction` of new compaction reasons: `CompactionReason::kExternalSstIngestion` and `CompactionReason::kRefitLevel` and register/unregister them. File ingestion is treated as compaction from L0 to different levels and RefitLevel() as compaction from source level to target level.
- Check for `RangeOverlapWithCompaction` with other ongoing compactions, `RegisterCompaction()` on this "compaction" before changing the LSM state in `VersionStorageInfo`, and `UnregisterCompaction()` after changing.
- Replace scattered fixes (https://github.com/facebook/rocksdb/commit/0f88160f67d36ea30e3aca3a3cef924c3a009be6, https://github.com/facebook/rocksdb/commit/5c64fb67d2fc198f1a73ff3ae543749a6a41f513 and https://github.com/facebook/rocksdb/commit/87dfc1d23e0e16ff73e15f63c6fa0fb3b3fc8c8c.) that prevents overlapping between file ingestion and non-refit-level compaction with this fix cuz those practices are easy to overlook.
- Misc: logic cleanup, see PR comments
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10988
Test Plan:
- New unit test `DBCompactionTestWithOngoingFileIngestionParam*` that failed pre-fix and passed afterwards.
- Made compatible with existing tests, see PR comments
- make check
- [Ongoing] Stress test rehearsal with normal value and aggressive CI value https://github.com/facebook/rocksdb/pull/10761
Reviewed By: cbi42
Differential Revision: D41535685
Pulled By: hx235
fbshipit-source-id: 549833a577ba1496d20a870583d4caa737da1258
2022-12-29 23:05:36 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::CompactFilesImpl::PostSanitizeCompactionInputFiles");
|
2017-04-06 00:14:05 +00:00
|
|
|
if (!s.ok()) {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
|
|
|
std::vector<CompactionInputFiles> input_files;
|
|
|
|
s = cfd->compaction_picker()->GetCompactionInputsFromFileNumbers(
|
|
|
|
&input_files, &input_set, version->storage_info(), compact_options);
|
|
|
|
if (!s.ok()) {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
2018-10-10 00:13:53 +00:00
|
|
|
for (const auto& inputs : input_files) {
|
2017-04-07 03:06:34 +00:00
|
|
|
if (cfd->compaction_picker()->AreFilesInCompaction(inputs.files)) {
|
2017-04-06 00:14:05 +00:00
|
|
|
return Status::Aborted(
|
|
|
|
"Some of the necessary compaction input "
|
|
|
|
"files are already being compacted");
|
|
|
|
}
|
|
|
|
}
|
2018-04-03 02:53:19 +00:00
|
|
|
bool sfm_reserved_compact_space = false;
|
|
|
|
// First check if we have enough room to do the compaction
|
|
|
|
bool enough_room = EnoughRoomForCompaction(
|
Auto recovery from out of space errors (#4164)
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
2018-09-15 20:36:19 +00:00
|
|
|
cfd, input_files, &sfm_reserved_compact_space, log_buffer);
|
2018-04-03 02:53:19 +00:00
|
|
|
|
|
|
|
if (!enough_room) {
|
|
|
|
// m's vars will get set properly at the end of this function,
|
|
|
|
// as long as status == CompactionTooLarge
|
|
|
|
return Status::CompactionTooLarge();
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
|
|
|
|
// At this point, CompactFiles will be run.
|
|
|
|
bg_compaction_scheduled_++;
|
|
|
|
|
2018-11-09 19:17:34 +00:00
|
|
|
std::unique_ptr<Compaction> c;
|
2017-04-06 00:14:05 +00:00
|
|
|
assert(cfd->compaction_picker());
|
2017-04-07 03:06:34 +00:00
|
|
|
c.reset(cfd->compaction_picker()->CompactFiles(
|
2017-04-06 00:14:05 +00:00
|
|
|
compact_options, input_files, output_level, version->storage_info(),
|
2020-07-23 01:31:25 +00:00
|
|
|
*cfd->GetLatestMutableCFOptions(), mutable_db_options_, output_path_id));
|
2017-09-13 22:31:34 +00:00
|
|
|
// we already sanitized the set of input files and checked for conflicts
|
|
|
|
// without releasing the lock, so we're guaranteed a compaction can be formed.
|
|
|
|
assert(c != nullptr);
|
|
|
|
|
2023-09-20 20:34:39 +00:00
|
|
|
c->FinalizeInputInfo(version);
|
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
// deletion compaction currently not allowed in CompactFiles.
|
|
|
|
assert(!c->deletion_compaction());
|
|
|
|
|
2019-01-16 05:32:15 +00:00
|
|
|
std::vector<SequenceNumber> snapshot_seqs;
|
2017-04-06 00:14:05 +00:00
|
|
|
SequenceNumber earliest_write_conflict_snapshot;
|
2019-01-16 05:32:15 +00:00
|
|
|
SnapshotChecker* snapshot_checker;
|
|
|
|
GetSnapshotContext(job_context, &snapshot_seqs,
|
|
|
|
&earliest_write_conflict_snapshot, &snapshot_checker);
|
2017-04-06 00:14:05 +00:00
|
|
|
|
2019-10-08 21:18:48 +00:00
|
|
|
std::unique_ptr<std::list<uint64_t>::iterator> pending_outputs_inserted_elem(
|
|
|
|
new std::list<uint64_t>::iterator(
|
|
|
|
CaptureCurrentFileNumberInPendingOutputs()));
|
2017-04-06 00:14:05 +00:00
|
|
|
|
|
|
|
assert(is_snapshot_supported_ || snapshots_.empty());
|
2018-12-13 22:12:02 +00:00
|
|
|
CompactionJobStats compaction_job_stats;
|
2017-04-06 00:14:05 +00:00
|
|
|
CompactionJob compaction_job(
|
2021-05-20 04:40:43 +00:00
|
|
|
job_context->job_id, c.get(), immutable_db_options_, mutable_db_options_,
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
2019-12-13 22:47:08 +00:00
|
|
|
file_options_for_compaction_, versions_.get(), &shutting_down_,
|
2022-04-11 17:26:55 +00:00
|
|
|
log_buffer, directories_.GetDbDir(),
|
2020-10-26 20:50:03 +00:00
|
|
|
GetDataDir(c->column_family_data(), c->output_path_id()),
|
|
|
|
GetDataDir(c->column_family_data(), 0), stats_, &mutex_, &error_handler_,
|
|
|
|
snapshot_seqs, earliest_write_conflict_snapshot, snapshot_checker,
|
CompactionIterator sees consistent view of which keys are committed (#9830)
Summary:
**This PR does not affect the functionality of `DB` and write-committed transactions.**
`CompactionIterator` uses `KeyCommitted(seq)` to determine if a key in the database is committed.
As the name 'write-committed' implies, if write-committed policy is used, a key exists in the database only if
it is committed. In fact, the implementation of `KeyCommitted()` is as follows:
```
inline bool KeyCommitted(SequenceNumber seq) {
// For non-txn-db and write-committed, snapshot_checker_ is always nullptr.
return snapshot_checker_ == nullptr ||
snapshot_checker_->CheckInSnapshot(seq, kMaxSequence) == SnapshotCheckerResult::kInSnapshot;
}
```
With that being said, we focus on write-prepared/write-unprepared transactions.
A few notes:
- A key can exist in the db even if it's uncommitted. Therefore, we rely on `snapshot_checker_` to determine data visibility. We also require that all writes go through transaction API instead of the raw `WriteBatch` + `Write`, thus at most one uncommitted version of one user key can exist in the database.
- `CompactionIterator` outputs a key as long as the key is uncommitted.
Due to the above reasons, it is possible that `CompactionIterator` decides to output an uncommitted key without
doing further checks on the key (`NextFromInput()`). By the time the key is being prepared for output, the key becomes
committed because the `snapshot_checker_(seq, kMaxSequence)` becomes true in the implementation of `KeyCommitted()`.
Then `CompactionIterator` will try to zero its sequence number and hit assertion error if the key is a tombstone.
To fix this issue, we should make the `CompactionIterator` see a consistent view of the input keys. Note that
for write-prepared/write-unprepared, the background flush/compaction jobs already take a "job snapshot" before starting
processing keys. The job snapshot is released only after the entire flush/compaction finishes. We can use this snapshot
to determine whether a key is committed or not with minor change to `KeyCommitted()`.
```
inline bool KeyCommitted(SequenceNumber sequence) {
// For non-txn-db and write-committed, snapshot_checker_ is always nullptr.
return snapshot_checker_ == nullptr ||
snapshot_checker_->CheckInSnapshot(sequence, job_snapshot_) ==
SnapshotCheckerResult::kInSnapshot;
}
```
As a result, whether a key is committed or not will remain a constant throughout compaction, causing no trouble
for `CompactionIterator`s assertions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9830
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D35561162
Pulled By: riversand963
fbshipit-source-id: 0e00d200c195240341cfe6d34cbc86798b315b9f
2022-04-14 18:11:04 +00:00
|
|
|
job_context, table_cache_, &event_logger_,
|
2017-11-17 01:46:43 +00:00
|
|
|
c->mutable_cf_options()->paranoid_file_checks,
|
2017-04-06 00:14:05 +00:00
|
|
|
c->mutable_cf_options()->report_bg_io_stats, dbname_,
|
2020-08-13 00:28:10 +00:00
|
|
|
&compaction_job_stats, Env::Priority::USER, io_tracer_,
|
2022-06-07 01:32:26 +00:00
|
|
|
kManualCompactionCanceledFalse_, db_id_, db_session_id_,
|
2022-03-12 00:13:23 +00:00
|
|
|
c->column_family_data()->GetFullHistoryTsLow(), c->trim_ts(),
|
Support subcmpct using reserved resources for round-robin priority (#10341)
Summary:
Earlier implementation of round-robin priority can only pick one file at a time and disallows parallel compactions within the same level. In this PR, round-robin compaction policy will expand towards more input files with respecting some additional constraints, which are summarized as follows:
* Constraint 1: We can only pick consecutive files
- Constraint 1a: When a file is being compacted (or some input files are being compacted after expanding), we cannot choose it and have to stop choosing more files
- Constraint 1b: When we reach the last file (with the largest keys), we cannot choose more files (the next file will be the first one with small keys)
* Constraint 2: We should ensure the total compaction bytes (including the overlapped files from the next level) is no more than `mutable_cf_options_.max_compaction_bytes`
* Constraint 3: We try our best to pick as many files as possible so that the post-compaction level size can be just less than `MaxBytesForLevel(start_level_)`
* Constraint 4: If trivial move is allowed, we reuse the logic of `TryNonL0TrivialMove()` instead of expanding files with Constraint 3
More details can be found in `LevelCompactionBuilder::SetupOtherFilesWithRoundRobinExpansion()`.
The above optimization accelerates the process of moving the compaction cursor, in which the write-amp can be further reduced. While a large compaction may lead to high write stall, we break this large compaction into several subcompactions **regardless of** the `max_subcompactions` limit. The number of subcompactions for round-robin compaction priority is determined through the following steps:
* Step 1: Initialized against `max_output_file_limit`, the number of input files in the start level, and also the range size limit `ranges.size()`
* Step 2: Call `AcquireSubcompactionResources()`when max subcompactions is not sufficient, but we may or may not obtain desired resources, additional number of resources is stored in `extra_num_subcompaction_threads_reserved_`). Subcompaction limit is changed and update `num_planned_subcompactions` with `GetSubcompactionLimit()`
* Step 3: Call `ShrinkSubcompactionResources()` to ensure extra resources can be released (extra resources may exist for round-robin compaction when the number of actual number of subcompactions is less than the number of planned subcompactions)
More details can be found in `CompactionJob::AcquireSubcompactionResources()`,`CompactionJob::ShrinkSubcompactionResources()`, and `CompactionJob::ReleaseSubcompactionResources()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10341
Test Plan: Add `CompactionPriMultipleFilesRoundRobin[1-3]` unit test in `compaction_picker_test.cc` and `RoundRobinSubcompactionsAgainstResources.SubcompactionsUsingResources/[0-4]`, `RoundRobinSubcompactionsAgainstPressureToken.PressureTokenTest/[0-1]` in `db_compaction_test.cc`
Reviewed By: ajkr, hx235
Differential Revision: D37792644
Pulled By: littlepig2013
fbshipit-source-id: 7fecb7c4ffd97b34bbf6e3b760b2c35a772a0657
2022-07-24 18:12:44 +00:00
|
|
|
&blob_callback_, &bg_compaction_scheduled_,
|
|
|
|
&bg_bottom_compaction_scheduled_);
|
2017-04-06 00:14:05 +00:00
|
|
|
|
|
|
|
// Creating a compaction influences the compaction score because the score
|
|
|
|
// takes running compactions into account (by skipping files that are already
|
|
|
|
// being compacted). Since we just changed compaction score, we recalculate it
|
|
|
|
// here.
|
|
|
|
version->storage_info()->ComputeCompactionScore(*cfd->ioptions(),
|
|
|
|
*c->mutable_cf_options());
|
|
|
|
|
|
|
|
compaction_job.Prepare();
|
|
|
|
|
|
|
|
mutex_.Unlock();
|
|
|
|
TEST_SYNC_POINT("CompactFilesImpl:0");
|
|
|
|
TEST_SYNC_POINT("CompactFilesImpl:1");
|
2020-12-24 00:54:05 +00:00
|
|
|
// Ignore the status here, as it will be checked in the Install down below...
|
|
|
|
compaction_job.Run().PermitUncheckedError();
|
2017-04-06 00:14:05 +00:00
|
|
|
TEST_SYNC_POINT("CompactFilesImpl:2");
|
|
|
|
TEST_SYNC_POINT("CompactFilesImpl:3");
|
|
|
|
mutex_.Lock();
|
|
|
|
|
2023-09-18 20:11:53 +00:00
|
|
|
bool compaction_released = false;
|
|
|
|
Status status =
|
|
|
|
compaction_job.Install(*c->mutable_cf_options(), &compaction_released);
|
|
|
|
if (!compaction_released) {
|
|
|
|
c->ReleaseCompactionFiles(s);
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
if (status.ok()) {
|
2020-09-30 01:21:49 +00:00
|
|
|
assert(compaction_job.io_status().ok());
|
2024-03-04 18:08:32 +00:00
|
|
|
InstallSuperVersionAndScheduleWork(
|
|
|
|
c->column_family_data(), job_context->superversion_contexts.data(),
|
|
|
|
*c->mutable_cf_options());
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
2020-10-06 21:40:37 +00:00
|
|
|
// status above captures any error during compaction_job.Install, so its ok
|
|
|
|
// not check compaction_job.io_status() explicitly if we're not calling
|
|
|
|
// SetBGError
|
|
|
|
compaction_job.io_status().PermitUncheckedError();
|
2018-04-03 02:53:19 +00:00
|
|
|
// Need to make sure SstFileManager does its bookkeeping
|
|
|
|
auto sfm = static_cast<SstFileManagerImpl*>(
|
|
|
|
immutable_db_options_.sst_file_manager.get());
|
|
|
|
if (sfm && sfm_reserved_compact_space) {
|
|
|
|
sfm->OnCompactionCompletion(c.get());
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
|
|
|
|
ReleaseFileNumberFromPendingOutputs(pending_outputs_inserted_elem);
|
|
|
|
|
2018-12-13 22:12:02 +00:00
|
|
|
if (compaction_job_info != nullptr) {
|
|
|
|
BuildCompactionJobInfo(cfd, c.get(), s, compaction_job_stats,
|
2023-07-28 16:47:31 +00:00
|
|
|
job_context->job_id, compaction_job_info);
|
2018-12-13 22:12:02 +00:00
|
|
|
}
|
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
if (status.ok()) {
|
|
|
|
// Done
|
2019-06-04 05:37:40 +00:00
|
|
|
} else if (status.IsColumnFamilyDropped() || status.IsShutdownInProgress()) {
|
2017-04-06 00:14:05 +00:00
|
|
|
// Ignore compaction errors found during shutting down
|
2019-09-17 04:00:13 +00:00
|
|
|
} else if (status.IsManualCompactionPaused()) {
|
|
|
|
// Don't report stopping manual compaction as error
|
|
|
|
ROCKS_LOG_INFO(immutable_db_options_.info_log,
|
|
|
|
"[%s] [JOB %d] Stopping manual compaction",
|
|
|
|
c->column_family_data()->GetName().c_str(),
|
|
|
|
job_context->job_id);
|
2017-04-06 00:14:05 +00:00
|
|
|
} else {
|
|
|
|
ROCKS_LOG_WARN(immutable_db_options_.info_log,
|
|
|
|
"[%s] [JOB %d] Compaction error: %s",
|
|
|
|
c->column_family_data()->GetName().c_str(),
|
|
|
|
job_context->job_id, status.ToString().c_str());
|
2020-07-15 18:02:44 +00:00
|
|
|
IOStatus io_s = compaction_job.io_status();
|
|
|
|
if (!io_s.ok()) {
|
2020-12-08 04:09:55 +00:00
|
|
|
error_handler_.SetBGError(io_s, BackgroundErrorReason::kCompaction);
|
2020-07-15 18:02:44 +00:00
|
|
|
} else {
|
2020-12-08 04:09:55 +00:00
|
|
|
error_handler_.SetBGError(status, BackgroundErrorReason::kCompaction);
|
2020-07-15 18:02:44 +00:00
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
2018-03-15 18:46:16 +00:00
|
|
|
if (output_file_names != nullptr) {
|
2020-03-03 16:39:14 +00:00
|
|
|
for (const auto& newf : c->edit()->GetNewFiles()) {
|
2021-09-17 00:17:40 +00:00
|
|
|
output_file_names->push_back(TableFileName(
|
|
|
|
c->immutable_options()->cf_paths, newf.second.fd.GetNumber(),
|
|
|
|
newf.second.fd.GetPathId()));
|
|
|
|
}
|
|
|
|
|
|
|
|
for (const auto& blob_file : c->edit()->GetBlobFileAdditions()) {
|
|
|
|
output_file_names->push_back(
|
|
|
|
BlobFileName(c->immutable_options()->cf_paths.front().path,
|
|
|
|
blob_file.GetBlobFileNumber()));
|
2018-03-15 18:46:16 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
c.reset();
|
|
|
|
|
|
|
|
bg_compaction_scheduled_--;
|
|
|
|
if (bg_compaction_scheduled_ == 0) {
|
|
|
|
bg_cv_.SignalAll();
|
|
|
|
}
|
2019-02-05 19:20:37 +00:00
|
|
|
MaybeScheduleFlushOrCompaction();
|
2018-04-03 02:53:19 +00:00
|
|
|
TEST_SYNC_POINT("CompactFilesImpl:End");
|
2017-04-06 00:14:05 +00:00
|
|
|
|
|
|
|
return status;
|
|
|
|
}
|
|
|
|
|
|
|
|
Status DBImpl::PauseBackgroundWork() {
|
|
|
|
InstrumentedMutexLock guard_lock(&mutex_);
|
|
|
|
bg_compaction_paused_++;
|
2017-08-03 22:36:28 +00:00
|
|
|
while (bg_bottom_compaction_scheduled_ > 0 || bg_compaction_scheduled_ > 0 ||
|
|
|
|
bg_flush_scheduled_ > 0) {
|
2017-04-06 00:14:05 +00:00
|
|
|
bg_cv_.Wait();
|
|
|
|
}
|
|
|
|
bg_work_paused_++;
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
|
|
|
Status DBImpl::ContinueBackgroundWork() {
|
|
|
|
InstrumentedMutexLock guard_lock(&mutex_);
|
|
|
|
if (bg_work_paused_ == 0) {
|
2024-01-30 00:31:09 +00:00
|
|
|
return Status::InvalidArgument("Background work already unpaused");
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
assert(bg_work_paused_ > 0);
|
|
|
|
assert(bg_compaction_paused_ > 0);
|
|
|
|
bg_compaction_paused_--;
|
|
|
|
bg_work_paused_--;
|
|
|
|
// It's sufficient to check just bg_work_paused_ here since
|
|
|
|
// bg_work_paused_ is always no greater than bg_compaction_paused_
|
|
|
|
if (bg_work_paused_ == 0) {
|
|
|
|
MaybeScheduleFlushOrCompaction();
|
|
|
|
}
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
2019-03-27 23:13:08 +00:00
|
|
|
void DBImpl::NotifyOnCompactionBegin(ColumnFamilyData* cfd, Compaction* c,
|
|
|
|
const Status& st,
|
2018-10-11 00:30:22 +00:00
|
|
|
const CompactionJobStats& job_stats,
|
|
|
|
int job_id) {
|
|
|
|
if (immutable_db_options_.listeners.empty()) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
mutex_.AssertHeld();
|
|
|
|
if (shutting_down_.load(std::memory_order_acquire)) {
|
|
|
|
return;
|
|
|
|
}
|
2019-09-17 04:00:13 +00:00
|
|
|
if (c->is_manual_compaction() &&
|
2020-08-14 18:28:12 +00:00
|
|
|
manual_compaction_paused_.load(std::memory_order_acquire) > 0) {
|
2019-09-17 04:00:13 +00:00
|
|
|
return;
|
|
|
|
}
|
2021-07-02 02:17:21 +00:00
|
|
|
|
|
|
|
c->SetNotifyOnCompactionCompleted();
|
2018-10-11 00:30:22 +00:00
|
|
|
// release lock while notifying events
|
|
|
|
mutex_.Unlock();
|
|
|
|
TEST_SYNC_POINT("DBImpl::NotifyOnCompactionBegin::UnlockMutex");
|
|
|
|
{
|
2019-11-01 18:44:59 +00:00
|
|
|
CompactionJobInfo info{};
|
2023-07-28 16:47:31 +00:00
|
|
|
BuildCompactionJobInfo(cfd, c, st, job_stats, job_id, &info);
|
2024-03-04 18:08:32 +00:00
|
|
|
for (const auto& listener : immutable_db_options_.listeners) {
|
2018-10-11 00:30:22 +00:00
|
|
|
listener->OnCompactionBegin(this, info);
|
|
|
|
}
|
2020-09-25 04:47:43 +00:00
|
|
|
info.status.PermitUncheckedError();
|
2018-10-11 00:30:22 +00:00
|
|
|
}
|
|
|
|
mutex_.Lock();
|
|
|
|
}
|
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
void DBImpl::NotifyOnCompactionCompleted(
|
2018-04-13 00:55:14 +00:00
|
|
|
ColumnFamilyData* cfd, Compaction* c, const Status& st,
|
|
|
|
const CompactionJobStats& compaction_job_stats, const int job_id) {
|
2017-04-06 00:14:05 +00:00
|
|
|
if (immutable_db_options_.listeners.size() == 0U) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
mutex_.AssertHeld();
|
|
|
|
if (shutting_down_.load(std::memory_order_acquire)) {
|
|
|
|
return;
|
|
|
|
}
|
2021-07-02 02:17:21 +00:00
|
|
|
|
|
|
|
if (c->ShouldNotifyOnCompactionCompleted() == false) {
|
2019-09-17 04:00:13 +00:00
|
|
|
return;
|
|
|
|
}
|
2021-06-07 18:40:31 +00:00
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
// release lock while notifying events
|
|
|
|
mutex_.Unlock();
|
|
|
|
TEST_SYNC_POINT("DBImpl::NotifyOnCompactionCompleted::UnlockMutex");
|
|
|
|
{
|
2019-11-01 18:44:59 +00:00
|
|
|
CompactionJobInfo info{};
|
2023-07-28 16:47:31 +00:00
|
|
|
BuildCompactionJobInfo(cfd, c, st, compaction_job_stats, job_id, &info);
|
2024-03-04 18:08:32 +00:00
|
|
|
for (const auto& listener : immutable_db_options_.listeners) {
|
2017-04-06 00:14:05 +00:00
|
|
|
listener->OnCompactionCompleted(this, info);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
mutex_.Lock();
|
|
|
|
// no need to signal bg_cv_ as it will be signaled at the end of the
|
|
|
|
// flush process.
|
|
|
|
}
|
|
|
|
|
|
|
|
// REQUIREMENT: block all background work by calling PauseBackgroundWork()
|
|
|
|
// before calling this function
|
|
|
|
Status DBImpl::ReFitLevel(ColumnFamilyData* cfd, int level, int target_level) {
|
|
|
|
assert(level < cfd->NumberLevels());
|
|
|
|
if (target_level >= cfd->NumberLevels()) {
|
|
|
|
return Status::InvalidArgument("Target level exceeds number of levels");
|
|
|
|
}
|
|
|
|
|
2023-04-21 16:07:18 +00:00
|
|
|
const ReadOptions read_options(Env::IOActivity::kCompaction);
|
Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
|
|
|
const WriteOptions write_options(Env::IOActivity::kCompaction);
|
2023-04-21 16:07:18 +00:00
|
|
|
|
2017-10-06 01:00:38 +00:00
|
|
|
SuperVersionContext sv_context(/* create_superversion */ true);
|
2017-04-06 00:14:05 +00:00
|
|
|
|
|
|
|
InstrumentedMutexLock guard_lock(&mutex_);
|
|
|
|
|
Add missing range conflict check between file ingestion and RefitLevel() (#10988)
Summary:
**Context:**
File ingestion never checks whether the key range it acts on overlaps with an ongoing RefitLevel() (used in `CompactRange()` with `change_level=true`). That's because RefitLevel() doesn't register and make its key range known to file ingestion. Though it checks overlapping with other compactions by https://github.com/facebook/rocksdb/blob/7.8.fb/db/external_sst_file_ingestion_job.cc#L998.
RefitLevel() (used in `CompactRange()` with `change_level=true`) doesn't check whether the key range it acts on overlaps with an ongoing file ingestion. That's because file ingestion does not register and make its key range known to other compactions.
- Note that non-refitlevel-compaction (e.g, manual compaction w/o RefitLevel() or general compaction) also does not check key range overlap with ongoing file ingestion for the same reason.
- But it's fine. Credited to cbi42's discovery, `WaitForIngestFile` was called by background and foreground compactions. They were introduced in https://github.com/facebook/rocksdb/commit/0f88160f67d36ea30e3aca3a3cef924c3a009be6, https://github.com/facebook/rocksdb/commit/5c64fb67d2fc198f1a73ff3ae543749a6a41f513 and https://github.com/facebook/rocksdb/commit/87dfc1d23e0e16ff73e15f63c6fa0fb3b3fc8c8c.
- Regardless, this PR registers file ingestion like a compaction is a general approach that will also add range conflict check between file ingestion and non-refitlevel-compaction, though it has not been the issue motivated this PR.
Above are bugs resulting in two bad consequences:
- If file ingestion and RefitLevel() creates files in the same level, then range-overlapped files will be created at that level and caught as corruption by `force_consistency_checks=true`
- If file ingestion and RefitLevel() creates file in different levels, then with one further compaction on the ingested file, it can result in two same keys both with seqno 0 in two different levels. Then with iterator's [optimization](https://github.com/facebook/rocksdb/blame/c62f3221698fd273b673d4f7e54eabb8329a4369/db/db_iter.cc#L342-L343) that assumes no two same keys both with seqno 0, it will either break this assertion in debug build or, even worst, return value of this same key for the key after it, which is the wrong value to return, in release build.
Therefore we decide to introduce range conflict check for file ingestion and RefitLevel() inspired from the existing range conflict check among compactions.
**Summary:**
- Treat file ingestion job and RefitLevel() as `Compaction` of new compaction reasons: `CompactionReason::kExternalSstIngestion` and `CompactionReason::kRefitLevel` and register/unregister them. File ingestion is treated as compaction from L0 to different levels and RefitLevel() as compaction from source level to target level.
- Check for `RangeOverlapWithCompaction` with other ongoing compactions, `RegisterCompaction()` on this "compaction" before changing the LSM state in `VersionStorageInfo`, and `UnregisterCompaction()` after changing.
- Replace scattered fixes (https://github.com/facebook/rocksdb/commit/0f88160f67d36ea30e3aca3a3cef924c3a009be6, https://github.com/facebook/rocksdb/commit/5c64fb67d2fc198f1a73ff3ae543749a6a41f513 and https://github.com/facebook/rocksdb/commit/87dfc1d23e0e16ff73e15f63c6fa0fb3b3fc8c8c.) that prevents overlapping between file ingestion and non-refit-level compaction with this fix cuz those practices are easy to overlook.
- Misc: logic cleanup, see PR comments
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10988
Test Plan:
- New unit test `DBCompactionTestWithOngoingFileIngestionParam*` that failed pre-fix and passed afterwards.
- Made compatible with existing tests, see PR comments
- make check
- [Ongoing] Stress test rehearsal with normal value and aggressive CI value https://github.com/facebook/rocksdb/pull/10761
Reviewed By: cbi42
Differential Revision: D41535685
Pulled By: hx235
fbshipit-source-id: 549833a577ba1496d20a870583d4caa737da1258
2022-12-29 23:05:36 +00:00
|
|
|
auto* vstorage = cfd->current()->storage_info();
|
|
|
|
if (vstorage->LevelFiles(level).empty()) {
|
|
|
|
return Status::OK();
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
// only allow one thread refitting
|
|
|
|
if (refitting_level_) {
|
|
|
|
ROCKS_LOG_INFO(immutable_db_options_.info_log,
|
|
|
|
"[ReFitLevel] another thread is refitting");
|
|
|
|
return Status::NotSupported("another thread is refitting");
|
|
|
|
}
|
|
|
|
refitting_level_ = true;
|
|
|
|
|
|
|
|
const MutableCFOptions mutable_cf_options = *cfd->GetLatestMutableCFOptions();
|
|
|
|
// move to a smaller level
|
|
|
|
int to_level = target_level;
|
|
|
|
if (target_level < 0) {
|
|
|
|
to_level = FindMinimumEmptyLevelFitting(cfd, mutable_cf_options, level);
|
|
|
|
}
|
|
|
|
|
2020-08-17 21:20:21 +00:00
|
|
|
if (to_level != level) {
|
Add missing range conflict check between file ingestion and RefitLevel() (#10988)
Summary:
**Context:**
File ingestion never checks whether the key range it acts on overlaps with an ongoing RefitLevel() (used in `CompactRange()` with `change_level=true`). That's because RefitLevel() doesn't register and make its key range known to file ingestion. Though it checks overlapping with other compactions by https://github.com/facebook/rocksdb/blob/7.8.fb/db/external_sst_file_ingestion_job.cc#L998.
RefitLevel() (used in `CompactRange()` with `change_level=true`) doesn't check whether the key range it acts on overlaps with an ongoing file ingestion. That's because file ingestion does not register and make its key range known to other compactions.
- Note that non-refitlevel-compaction (e.g, manual compaction w/o RefitLevel() or general compaction) also does not check key range overlap with ongoing file ingestion for the same reason.
- But it's fine. Credited to cbi42's discovery, `WaitForIngestFile` was called by background and foreground compactions. They were introduced in https://github.com/facebook/rocksdb/commit/0f88160f67d36ea30e3aca3a3cef924c3a009be6, https://github.com/facebook/rocksdb/commit/5c64fb67d2fc198f1a73ff3ae543749a6a41f513 and https://github.com/facebook/rocksdb/commit/87dfc1d23e0e16ff73e15f63c6fa0fb3b3fc8c8c.
- Regardless, this PR registers file ingestion like a compaction is a general approach that will also add range conflict check between file ingestion and non-refitlevel-compaction, though it has not been the issue motivated this PR.
Above are bugs resulting in two bad consequences:
- If file ingestion and RefitLevel() creates files in the same level, then range-overlapped files will be created at that level and caught as corruption by `force_consistency_checks=true`
- If file ingestion and RefitLevel() creates file in different levels, then with one further compaction on the ingested file, it can result in two same keys both with seqno 0 in two different levels. Then with iterator's [optimization](https://github.com/facebook/rocksdb/blame/c62f3221698fd273b673d4f7e54eabb8329a4369/db/db_iter.cc#L342-L343) that assumes no two same keys both with seqno 0, it will either break this assertion in debug build or, even worst, return value of this same key for the key after it, which is the wrong value to return, in release build.
Therefore we decide to introduce range conflict check for file ingestion and RefitLevel() inspired from the existing range conflict check among compactions.
**Summary:**
- Treat file ingestion job and RefitLevel() as `Compaction` of new compaction reasons: `CompactionReason::kExternalSstIngestion` and `CompactionReason::kRefitLevel` and register/unregister them. File ingestion is treated as compaction from L0 to different levels and RefitLevel() as compaction from source level to target level.
- Check for `RangeOverlapWithCompaction` with other ongoing compactions, `RegisterCompaction()` on this "compaction" before changing the LSM state in `VersionStorageInfo`, and `UnregisterCompaction()` after changing.
- Replace scattered fixes (https://github.com/facebook/rocksdb/commit/0f88160f67d36ea30e3aca3a3cef924c3a009be6, https://github.com/facebook/rocksdb/commit/5c64fb67d2fc198f1a73ff3ae543749a6a41f513 and https://github.com/facebook/rocksdb/commit/87dfc1d23e0e16ff73e15f63c6fa0fb3b3fc8c8c.) that prevents overlapping between file ingestion and non-refit-level compaction with this fix cuz those practices are easy to overlook.
- Misc: logic cleanup, see PR comments
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10988
Test Plan:
- New unit test `DBCompactionTestWithOngoingFileIngestionParam*` that failed pre-fix and passed afterwards.
- Made compatible with existing tests, see PR comments
- make check
- [Ongoing] Stress test rehearsal with normal value and aggressive CI value https://github.com/facebook/rocksdb/pull/10761
Reviewed By: cbi42
Differential Revision: D41535685
Pulled By: hx235
fbshipit-source-id: 549833a577ba1496d20a870583d4caa737da1258
2022-12-29 23:05:36 +00:00
|
|
|
std::vector<CompactionInputFiles> input(1);
|
|
|
|
input[0].level = level;
|
|
|
|
for (auto& f : vstorage->LevelFiles(level)) {
|
|
|
|
input[0].files.push_back(f);
|
|
|
|
}
|
|
|
|
InternalKey refit_level_smallest;
|
|
|
|
InternalKey refit_level_largest;
|
|
|
|
cfd->compaction_picker()->GetRange(input[0], &refit_level_smallest,
|
|
|
|
&refit_level_largest);
|
2020-08-17 21:20:21 +00:00
|
|
|
if (to_level > level) {
|
|
|
|
if (level == 0) {
|
2020-09-28 18:33:51 +00:00
|
|
|
refitting_level_ = false;
|
2017-04-06 00:14:05 +00:00
|
|
|
return Status::NotSupported(
|
2020-08-17 21:20:21 +00:00
|
|
|
"Cannot change from level 0 to other levels.");
|
|
|
|
}
|
|
|
|
// Check levels are empty for a trivial move
|
|
|
|
for (int l = level + 1; l <= to_level; l++) {
|
|
|
|
if (vstorage->NumLevelFiles(l) > 0) {
|
2020-09-28 18:33:51 +00:00
|
|
|
refitting_level_ = false;
|
2020-08-17 21:20:21 +00:00
|
|
|
return Status::NotSupported(
|
|
|
|
"Levels between source and target are not empty for a move.");
|
|
|
|
}
|
Add missing range conflict check between file ingestion and RefitLevel() (#10988)
Summary:
**Context:**
File ingestion never checks whether the key range it acts on overlaps with an ongoing RefitLevel() (used in `CompactRange()` with `change_level=true`). That's because RefitLevel() doesn't register and make its key range known to file ingestion. Though it checks overlapping with other compactions by https://github.com/facebook/rocksdb/blob/7.8.fb/db/external_sst_file_ingestion_job.cc#L998.
RefitLevel() (used in `CompactRange()` with `change_level=true`) doesn't check whether the key range it acts on overlaps with an ongoing file ingestion. That's because file ingestion does not register and make its key range known to other compactions.
- Note that non-refitlevel-compaction (e.g, manual compaction w/o RefitLevel() or general compaction) also does not check key range overlap with ongoing file ingestion for the same reason.
- But it's fine. Credited to cbi42's discovery, `WaitForIngestFile` was called by background and foreground compactions. They were introduced in https://github.com/facebook/rocksdb/commit/0f88160f67d36ea30e3aca3a3cef924c3a009be6, https://github.com/facebook/rocksdb/commit/5c64fb67d2fc198f1a73ff3ae543749a6a41f513 and https://github.com/facebook/rocksdb/commit/87dfc1d23e0e16ff73e15f63c6fa0fb3b3fc8c8c.
- Regardless, this PR registers file ingestion like a compaction is a general approach that will also add range conflict check between file ingestion and non-refitlevel-compaction, though it has not been the issue motivated this PR.
Above are bugs resulting in two bad consequences:
- If file ingestion and RefitLevel() creates files in the same level, then range-overlapped files will be created at that level and caught as corruption by `force_consistency_checks=true`
- If file ingestion and RefitLevel() creates file in different levels, then with one further compaction on the ingested file, it can result in two same keys both with seqno 0 in two different levels. Then with iterator's [optimization](https://github.com/facebook/rocksdb/blame/c62f3221698fd273b673d4f7e54eabb8329a4369/db/db_iter.cc#L342-L343) that assumes no two same keys both with seqno 0, it will either break this assertion in debug build or, even worst, return value of this same key for the key after it, which is the wrong value to return, in release build.
Therefore we decide to introduce range conflict check for file ingestion and RefitLevel() inspired from the existing range conflict check among compactions.
**Summary:**
- Treat file ingestion job and RefitLevel() as `Compaction` of new compaction reasons: `CompactionReason::kExternalSstIngestion` and `CompactionReason::kRefitLevel` and register/unregister them. File ingestion is treated as compaction from L0 to different levels and RefitLevel() as compaction from source level to target level.
- Check for `RangeOverlapWithCompaction` with other ongoing compactions, `RegisterCompaction()` on this "compaction" before changing the LSM state in `VersionStorageInfo`, and `UnregisterCompaction()` after changing.
- Replace scattered fixes (https://github.com/facebook/rocksdb/commit/0f88160f67d36ea30e3aca3a3cef924c3a009be6, https://github.com/facebook/rocksdb/commit/5c64fb67d2fc198f1a73ff3ae543749a6a41f513 and https://github.com/facebook/rocksdb/commit/87dfc1d23e0e16ff73e15f63c6fa0fb3b3fc8c8c.) that prevents overlapping between file ingestion and non-refit-level compaction with this fix cuz those practices are easy to overlook.
- Misc: logic cleanup, see PR comments
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10988
Test Plan:
- New unit test `DBCompactionTestWithOngoingFileIngestionParam*` that failed pre-fix and passed afterwards.
- Made compatible with existing tests, see PR comments
- make check
- [Ongoing] Stress test rehearsal with normal value and aggressive CI value https://github.com/facebook/rocksdb/pull/10761
Reviewed By: cbi42
Differential Revision: D41535685
Pulled By: hx235
fbshipit-source-id: 549833a577ba1496d20a870583d4caa737da1258
2022-12-29 23:05:36 +00:00
|
|
|
if (cfd->RangeOverlapWithCompaction(refit_level_smallest.user_key(),
|
|
|
|
refit_level_largest.user_key(),
|
|
|
|
l)) {
|
|
|
|
refitting_level_ = false;
|
|
|
|
return Status::NotSupported(
|
|
|
|
"Levels between source and target "
|
|
|
|
"will have some ongoing compaction's output.");
|
|
|
|
}
|
2020-08-17 21:20:21 +00:00
|
|
|
}
|
|
|
|
} else {
|
|
|
|
// to_level < level
|
|
|
|
// Check levels are empty for a trivial move
|
|
|
|
for (int l = to_level; l < level; l++) {
|
|
|
|
if (vstorage->NumLevelFiles(l) > 0) {
|
2020-09-28 18:33:51 +00:00
|
|
|
refitting_level_ = false;
|
2020-08-17 21:20:21 +00:00
|
|
|
return Status::NotSupported(
|
|
|
|
"Levels between source and target are not empty for a move.");
|
|
|
|
}
|
Add missing range conflict check between file ingestion and RefitLevel() (#10988)
Summary:
**Context:**
File ingestion never checks whether the key range it acts on overlaps with an ongoing RefitLevel() (used in `CompactRange()` with `change_level=true`). That's because RefitLevel() doesn't register and make its key range known to file ingestion. Though it checks overlapping with other compactions by https://github.com/facebook/rocksdb/blob/7.8.fb/db/external_sst_file_ingestion_job.cc#L998.
RefitLevel() (used in `CompactRange()` with `change_level=true`) doesn't check whether the key range it acts on overlaps with an ongoing file ingestion. That's because file ingestion does not register and make its key range known to other compactions.
- Note that non-refitlevel-compaction (e.g, manual compaction w/o RefitLevel() or general compaction) also does not check key range overlap with ongoing file ingestion for the same reason.
- But it's fine. Credited to cbi42's discovery, `WaitForIngestFile` was called by background and foreground compactions. They were introduced in https://github.com/facebook/rocksdb/commit/0f88160f67d36ea30e3aca3a3cef924c3a009be6, https://github.com/facebook/rocksdb/commit/5c64fb67d2fc198f1a73ff3ae543749a6a41f513 and https://github.com/facebook/rocksdb/commit/87dfc1d23e0e16ff73e15f63c6fa0fb3b3fc8c8c.
- Regardless, this PR registers file ingestion like a compaction is a general approach that will also add range conflict check between file ingestion and non-refitlevel-compaction, though it has not been the issue motivated this PR.
Above are bugs resulting in two bad consequences:
- If file ingestion and RefitLevel() creates files in the same level, then range-overlapped files will be created at that level and caught as corruption by `force_consistency_checks=true`
- If file ingestion and RefitLevel() creates file in different levels, then with one further compaction on the ingested file, it can result in two same keys both with seqno 0 in two different levels. Then with iterator's [optimization](https://github.com/facebook/rocksdb/blame/c62f3221698fd273b673d4f7e54eabb8329a4369/db/db_iter.cc#L342-L343) that assumes no two same keys both with seqno 0, it will either break this assertion in debug build or, even worst, return value of this same key for the key after it, which is the wrong value to return, in release build.
Therefore we decide to introduce range conflict check for file ingestion and RefitLevel() inspired from the existing range conflict check among compactions.
**Summary:**
- Treat file ingestion job and RefitLevel() as `Compaction` of new compaction reasons: `CompactionReason::kExternalSstIngestion` and `CompactionReason::kRefitLevel` and register/unregister them. File ingestion is treated as compaction from L0 to different levels and RefitLevel() as compaction from source level to target level.
- Check for `RangeOverlapWithCompaction` with other ongoing compactions, `RegisterCompaction()` on this "compaction" before changing the LSM state in `VersionStorageInfo`, and `UnregisterCompaction()` after changing.
- Replace scattered fixes (https://github.com/facebook/rocksdb/commit/0f88160f67d36ea30e3aca3a3cef924c3a009be6, https://github.com/facebook/rocksdb/commit/5c64fb67d2fc198f1a73ff3ae543749a6a41f513 and https://github.com/facebook/rocksdb/commit/87dfc1d23e0e16ff73e15f63c6fa0fb3b3fc8c8c.) that prevents overlapping between file ingestion and non-refit-level compaction with this fix cuz those practices are easy to overlook.
- Misc: logic cleanup, see PR comments
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10988
Test Plan:
- New unit test `DBCompactionTestWithOngoingFileIngestionParam*` that failed pre-fix and passed afterwards.
- Made compatible with existing tests, see PR comments
- make check
- [Ongoing] Stress test rehearsal with normal value and aggressive CI value https://github.com/facebook/rocksdb/pull/10761
Reviewed By: cbi42
Differential Revision: D41535685
Pulled By: hx235
fbshipit-source-id: 549833a577ba1496d20a870583d4caa737da1258
2022-12-29 23:05:36 +00:00
|
|
|
if (cfd->RangeOverlapWithCompaction(refit_level_smallest.user_key(),
|
|
|
|
refit_level_largest.user_key(),
|
|
|
|
l)) {
|
|
|
|
refitting_level_ = false;
|
|
|
|
return Status::NotSupported(
|
|
|
|
"Levels between source and target "
|
|
|
|
"will have some ongoing compaction's output.");
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
ROCKS_LOG_DEBUG(immutable_db_options_.info_log,
|
|
|
|
"[%s] Before refitting:\n%s", cfd->GetName().c_str(),
|
|
|
|
cfd->current()->DebugString().data());
|
|
|
|
|
Add missing range conflict check between file ingestion and RefitLevel() (#10988)
Summary:
**Context:**
File ingestion never checks whether the key range it acts on overlaps with an ongoing RefitLevel() (used in `CompactRange()` with `change_level=true`). That's because RefitLevel() doesn't register and make its key range known to file ingestion. Though it checks overlapping with other compactions by https://github.com/facebook/rocksdb/blob/7.8.fb/db/external_sst_file_ingestion_job.cc#L998.
RefitLevel() (used in `CompactRange()` with `change_level=true`) doesn't check whether the key range it acts on overlaps with an ongoing file ingestion. That's because file ingestion does not register and make its key range known to other compactions.
- Note that non-refitlevel-compaction (e.g, manual compaction w/o RefitLevel() or general compaction) also does not check key range overlap with ongoing file ingestion for the same reason.
- But it's fine. Credited to cbi42's discovery, `WaitForIngestFile` was called by background and foreground compactions. They were introduced in https://github.com/facebook/rocksdb/commit/0f88160f67d36ea30e3aca3a3cef924c3a009be6, https://github.com/facebook/rocksdb/commit/5c64fb67d2fc198f1a73ff3ae543749a6a41f513 and https://github.com/facebook/rocksdb/commit/87dfc1d23e0e16ff73e15f63c6fa0fb3b3fc8c8c.
- Regardless, this PR registers file ingestion like a compaction is a general approach that will also add range conflict check between file ingestion and non-refitlevel-compaction, though it has not been the issue motivated this PR.
Above are bugs resulting in two bad consequences:
- If file ingestion and RefitLevel() creates files in the same level, then range-overlapped files will be created at that level and caught as corruption by `force_consistency_checks=true`
- If file ingestion and RefitLevel() creates file in different levels, then with one further compaction on the ingested file, it can result in two same keys both with seqno 0 in two different levels. Then with iterator's [optimization](https://github.com/facebook/rocksdb/blame/c62f3221698fd273b673d4f7e54eabb8329a4369/db/db_iter.cc#L342-L343) that assumes no two same keys both with seqno 0, it will either break this assertion in debug build or, even worst, return value of this same key for the key after it, which is the wrong value to return, in release build.
Therefore we decide to introduce range conflict check for file ingestion and RefitLevel() inspired from the existing range conflict check among compactions.
**Summary:**
- Treat file ingestion job and RefitLevel() as `Compaction` of new compaction reasons: `CompactionReason::kExternalSstIngestion` and `CompactionReason::kRefitLevel` and register/unregister them. File ingestion is treated as compaction from L0 to different levels and RefitLevel() as compaction from source level to target level.
- Check for `RangeOverlapWithCompaction` with other ongoing compactions, `RegisterCompaction()` on this "compaction" before changing the LSM state in `VersionStorageInfo`, and `UnregisterCompaction()` after changing.
- Replace scattered fixes (https://github.com/facebook/rocksdb/commit/0f88160f67d36ea30e3aca3a3cef924c3a009be6, https://github.com/facebook/rocksdb/commit/5c64fb67d2fc198f1a73ff3ae543749a6a41f513 and https://github.com/facebook/rocksdb/commit/87dfc1d23e0e16ff73e15f63c6fa0fb3b3fc8c8c.) that prevents overlapping between file ingestion and non-refit-level compaction with this fix cuz those practices are easy to overlook.
- Misc: logic cleanup, see PR comments
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10988
Test Plan:
- New unit test `DBCompactionTestWithOngoingFileIngestionParam*` that failed pre-fix and passed afterwards.
- Made compatible with existing tests, see PR comments
- make check
- [Ongoing] Stress test rehearsal with normal value and aggressive CI value https://github.com/facebook/rocksdb/pull/10761
Reviewed By: cbi42
Differential Revision: D41535685
Pulled By: hx235
fbshipit-source-id: 549833a577ba1496d20a870583d4caa737da1258
2022-12-29 23:05:36 +00:00
|
|
|
std::unique_ptr<Compaction> c(new Compaction(
|
|
|
|
vstorage, *cfd->ioptions(), mutable_cf_options, mutable_db_options_,
|
|
|
|
{input}, to_level,
|
|
|
|
MaxFileSizeForLevel(
|
|
|
|
mutable_cf_options, to_level,
|
|
|
|
cfd->ioptions()
|
|
|
|
->compaction_style) /* output file size limit, not applicable */
|
|
|
|
,
|
|
|
|
LLONG_MAX /* max compaction bytes, not applicable */,
|
|
|
|
0 /* output path ID, not applicable */, mutable_cf_options.compression,
|
2024-02-28 22:36:13 +00:00
|
|
|
mutable_cf_options.compression_opts,
|
|
|
|
mutable_cf_options.default_write_temperature,
|
Add missing range conflict check between file ingestion and RefitLevel() (#10988)
Summary:
**Context:**
File ingestion never checks whether the key range it acts on overlaps with an ongoing RefitLevel() (used in `CompactRange()` with `change_level=true`). That's because RefitLevel() doesn't register and make its key range known to file ingestion. Though it checks overlapping with other compactions by https://github.com/facebook/rocksdb/blob/7.8.fb/db/external_sst_file_ingestion_job.cc#L998.
RefitLevel() (used in `CompactRange()` with `change_level=true`) doesn't check whether the key range it acts on overlaps with an ongoing file ingestion. That's because file ingestion does not register and make its key range known to other compactions.
- Note that non-refitlevel-compaction (e.g, manual compaction w/o RefitLevel() or general compaction) also does not check key range overlap with ongoing file ingestion for the same reason.
- But it's fine. Credited to cbi42's discovery, `WaitForIngestFile` was called by background and foreground compactions. They were introduced in https://github.com/facebook/rocksdb/commit/0f88160f67d36ea30e3aca3a3cef924c3a009be6, https://github.com/facebook/rocksdb/commit/5c64fb67d2fc198f1a73ff3ae543749a6a41f513 and https://github.com/facebook/rocksdb/commit/87dfc1d23e0e16ff73e15f63c6fa0fb3b3fc8c8c.
- Regardless, this PR registers file ingestion like a compaction is a general approach that will also add range conflict check between file ingestion and non-refitlevel-compaction, though it has not been the issue motivated this PR.
Above are bugs resulting in two bad consequences:
- If file ingestion and RefitLevel() creates files in the same level, then range-overlapped files will be created at that level and caught as corruption by `force_consistency_checks=true`
- If file ingestion and RefitLevel() creates file in different levels, then with one further compaction on the ingested file, it can result in two same keys both with seqno 0 in two different levels. Then with iterator's [optimization](https://github.com/facebook/rocksdb/blame/c62f3221698fd273b673d4f7e54eabb8329a4369/db/db_iter.cc#L342-L343) that assumes no two same keys both with seqno 0, it will either break this assertion in debug build or, even worst, return value of this same key for the key after it, which is the wrong value to return, in release build.
Therefore we decide to introduce range conflict check for file ingestion and RefitLevel() inspired from the existing range conflict check among compactions.
**Summary:**
- Treat file ingestion job and RefitLevel() as `Compaction` of new compaction reasons: `CompactionReason::kExternalSstIngestion` and `CompactionReason::kRefitLevel` and register/unregister them. File ingestion is treated as compaction from L0 to different levels and RefitLevel() as compaction from source level to target level.
- Check for `RangeOverlapWithCompaction` with other ongoing compactions, `RegisterCompaction()` on this "compaction" before changing the LSM state in `VersionStorageInfo`, and `UnregisterCompaction()` after changing.
- Replace scattered fixes (https://github.com/facebook/rocksdb/commit/0f88160f67d36ea30e3aca3a3cef924c3a009be6, https://github.com/facebook/rocksdb/commit/5c64fb67d2fc198f1a73ff3ae543749a6a41f513 and https://github.com/facebook/rocksdb/commit/87dfc1d23e0e16ff73e15f63c6fa0fb3b3fc8c8c.) that prevents overlapping between file ingestion and non-refit-level compaction with this fix cuz those practices are easy to overlook.
- Misc: logic cleanup, see PR comments
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10988
Test Plan:
- New unit test `DBCompactionTestWithOngoingFileIngestionParam*` that failed pre-fix and passed afterwards.
- Made compatible with existing tests, see PR comments
- make check
- [Ongoing] Stress test rehearsal with normal value and aggressive CI value https://github.com/facebook/rocksdb/pull/10761
Reviewed By: cbi42
Differential Revision: D41535685
Pulled By: hx235
fbshipit-source-id: 549833a577ba1496d20a870583d4caa737da1258
2022-12-29 23:05:36 +00:00
|
|
|
0 /* max_subcompactions, not applicable */,
|
|
|
|
{} /* grandparents, not applicable */, false /* is manual */,
|
|
|
|
"" /* trim_ts */, -1 /* score, not applicable */,
|
|
|
|
false /* is deletion compaction, not applicable */,
|
|
|
|
false /* l0_files_might_overlap, not applicable */,
|
|
|
|
CompactionReason::kRefitLevel));
|
|
|
|
cfd->compaction_picker()->RegisterCompaction(c.get());
|
|
|
|
TEST_SYNC_POINT("DBImpl::ReFitLevel:PostRegisterCompaction");
|
2017-04-06 00:14:05 +00:00
|
|
|
VersionEdit edit;
|
|
|
|
edit.SetColumnFamily(cfd->GetID());
|
Sort L0 files by newly introduced epoch_num (#10922)
Summary:
**Context:**
Sorting L0 files by `largest_seqno` has at least two inconvenience:
- File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap.
- For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
- insert k1@1 to memtable m1
- ingest file s1 with k2@2, ingest file s2 with k3@3
- insert k4@4 to m1
- compact files s1, s2 and result in new file s3 of seqno range [2, 3]
- flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1
- However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption.
- Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption
- For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example)
- an existing SST s1 contains only k1@1
- insert k1@2 to memtable m1
- ingest file s2 with k3@3, ingest file s3 with k4@4
- insert single delete k5@5 in m1
- flush m1 and result in new file s4 of seqno range [2, 5]
- compact s1, s2, s3 and result in new file s5 of seqno range [1, 4]
- compact s4 and result in new file s6 of seqno range [2] due to single delete
- By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno`
Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways:
- In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more.
- In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption.
**Summary:**
- Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`.
- `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`)
- Compaction output file is assigned with the minimum `epoch_number` among input files'
- Refit level: reuse refitted file's epoch_number
- Other paths needing `epoch_number` treatment:
- Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`
- Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`.
- Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair).
- Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder.
- Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery
- Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more
- Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag`
- Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above
- Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`.
- Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR.
- Misc:
- update existing tests with `epoch_number` so make check will pass
- update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases
- assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922
Test Plan:
- `make check`
- New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
- Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930
- [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox`
- [Ongoing] normal db stress test
- [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761
Reviewed By: ajkr
Differential Revision: D41063187
Pulled By: hx235
fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2022-12-13 21:29:37 +00:00
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
for (const auto& f : vstorage->LevelFiles(level)) {
|
|
|
|
edit.DeleteFile(level, f->fd.GetNumber());
|
2021-11-10 18:47:53 +00:00
|
|
|
edit.AddFile(
|
|
|
|
to_level, f->fd.GetNumber(), f->fd.GetPathId(), f->fd.GetFileSize(),
|
|
|
|
f->smallest, f->largest, f->fd.smallest_seqno, f->fd.largest_seqno,
|
2021-12-03 22:42:05 +00:00
|
|
|
f->marked_for_compaction, f->temperature, f->oldest_blob_file_number,
|
Sort L0 files by newly introduced epoch_num (#10922)
Summary:
**Context:**
Sorting L0 files by `largest_seqno` has at least two inconvenience:
- File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap.
- For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
- insert k1@1 to memtable m1
- ingest file s1 with k2@2, ingest file s2 with k3@3
- insert k4@4 to m1
- compact files s1, s2 and result in new file s3 of seqno range [2, 3]
- flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1
- However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption.
- Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption
- For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example)
- an existing SST s1 contains only k1@1
- insert k1@2 to memtable m1
- ingest file s2 with k3@3, ingest file s3 with k4@4
- insert single delete k5@5 in m1
- flush m1 and result in new file s4 of seqno range [2, 5]
- compact s1, s2, s3 and result in new file s5 of seqno range [1, 4]
- compact s4 and result in new file s6 of seqno range [2] due to single delete
- By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno`
Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways:
- In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more.
- In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption.
**Summary:**
- Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`.
- `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`)
- Compaction output file is assigned with the minimum `epoch_number` among input files'
- Refit level: reuse refitted file's epoch_number
- Other paths needing `epoch_number` treatment:
- Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`
- Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`.
- Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair).
- Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder.
- Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery
- Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more
- Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag`
- Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above
- Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`.
- Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR.
- Misc:
- update existing tests with `epoch_number` so make check will pass
- update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases
- assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922
Test Plan:
- `make check`
- New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
- Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930
- [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox`
- [Ongoing] normal db stress test
- [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761
Reviewed By: ajkr
Differential Revision: D41063187
Pulled By: hx235
fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2022-12-13 21:29:37 +00:00
|
|
|
f->oldest_ancester_time, f->file_creation_time, f->epoch_number,
|
2022-12-29 21:28:24 +00:00
|
|
|
f->file_checksum, f->file_checksum_func_name, f->unique_id,
|
2023-06-22 04:49:01 +00:00
|
|
|
f->compensated_range_deletion_size, f->tail_size,
|
|
|
|
f->user_defined_timestamps_persisted);
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
ROCKS_LOG_DEBUG(immutable_db_options_.info_log,
|
|
|
|
"[%s] Apply version edit:\n%s", cfd->GetName().c_str(),
|
|
|
|
edit.DebugString().data());
|
|
|
|
|
Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
|
|
|
Status status = versions_->LogAndApply(cfd, mutable_cf_options,
|
|
|
|
read_options, write_options, &edit,
|
|
|
|
&mutex_, directories_.GetDbDir());
|
2020-12-23 07:44:44 +00:00
|
|
|
|
Add missing range conflict check between file ingestion and RefitLevel() (#10988)
Summary:
**Context:**
File ingestion never checks whether the key range it acts on overlaps with an ongoing RefitLevel() (used in `CompactRange()` with `change_level=true`). That's because RefitLevel() doesn't register and make its key range known to file ingestion. Though it checks overlapping with other compactions by https://github.com/facebook/rocksdb/blob/7.8.fb/db/external_sst_file_ingestion_job.cc#L998.
RefitLevel() (used in `CompactRange()` with `change_level=true`) doesn't check whether the key range it acts on overlaps with an ongoing file ingestion. That's because file ingestion does not register and make its key range known to other compactions.
- Note that non-refitlevel-compaction (e.g, manual compaction w/o RefitLevel() or general compaction) also does not check key range overlap with ongoing file ingestion for the same reason.
- But it's fine. Credited to cbi42's discovery, `WaitForIngestFile` was called by background and foreground compactions. They were introduced in https://github.com/facebook/rocksdb/commit/0f88160f67d36ea30e3aca3a3cef924c3a009be6, https://github.com/facebook/rocksdb/commit/5c64fb67d2fc198f1a73ff3ae543749a6a41f513 and https://github.com/facebook/rocksdb/commit/87dfc1d23e0e16ff73e15f63c6fa0fb3b3fc8c8c.
- Regardless, this PR registers file ingestion like a compaction is a general approach that will also add range conflict check between file ingestion and non-refitlevel-compaction, though it has not been the issue motivated this PR.
Above are bugs resulting in two bad consequences:
- If file ingestion and RefitLevel() creates files in the same level, then range-overlapped files will be created at that level and caught as corruption by `force_consistency_checks=true`
- If file ingestion and RefitLevel() creates file in different levels, then with one further compaction on the ingested file, it can result in two same keys both with seqno 0 in two different levels. Then with iterator's [optimization](https://github.com/facebook/rocksdb/blame/c62f3221698fd273b673d4f7e54eabb8329a4369/db/db_iter.cc#L342-L343) that assumes no two same keys both with seqno 0, it will either break this assertion in debug build or, even worst, return value of this same key for the key after it, which is the wrong value to return, in release build.
Therefore we decide to introduce range conflict check for file ingestion and RefitLevel() inspired from the existing range conflict check among compactions.
**Summary:**
- Treat file ingestion job and RefitLevel() as `Compaction` of new compaction reasons: `CompactionReason::kExternalSstIngestion` and `CompactionReason::kRefitLevel` and register/unregister them. File ingestion is treated as compaction from L0 to different levels and RefitLevel() as compaction from source level to target level.
- Check for `RangeOverlapWithCompaction` with other ongoing compactions, `RegisterCompaction()` on this "compaction" before changing the LSM state in `VersionStorageInfo`, and `UnregisterCompaction()` after changing.
- Replace scattered fixes (https://github.com/facebook/rocksdb/commit/0f88160f67d36ea30e3aca3a3cef924c3a009be6, https://github.com/facebook/rocksdb/commit/5c64fb67d2fc198f1a73ff3ae543749a6a41f513 and https://github.com/facebook/rocksdb/commit/87dfc1d23e0e16ff73e15f63c6fa0fb3b3fc8c8c.) that prevents overlapping between file ingestion and non-refit-level compaction with this fix cuz those practices are easy to overlook.
- Misc: logic cleanup, see PR comments
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10988
Test Plan:
- New unit test `DBCompactionTestWithOngoingFileIngestionParam*` that failed pre-fix and passed afterwards.
- Made compatible with existing tests, see PR comments
- make check
- [Ongoing] Stress test rehearsal with normal value and aggressive CI value https://github.com/facebook/rocksdb/pull/10761
Reviewed By: cbi42
Differential Revision: D41535685
Pulled By: hx235
fbshipit-source-id: 549833a577ba1496d20a870583d4caa737da1258
2022-12-29 23:05:36 +00:00
|
|
|
cfd->compaction_picker()->UnregisterCompaction(c.get());
|
|
|
|
c.reset();
|
|
|
|
|
2017-10-06 01:00:38 +00:00
|
|
|
InstallSuperVersionAndScheduleWork(cfd, &sv_context, mutable_cf_options);
|
2017-04-06 00:14:05 +00:00
|
|
|
|
|
|
|
ROCKS_LOG_DEBUG(immutable_db_options_.info_log, "[%s] LogAndApply: %s\n",
|
|
|
|
cfd->GetName().c_str(), status.ToString().data());
|
|
|
|
|
|
|
|
if (status.ok()) {
|
|
|
|
ROCKS_LOG_DEBUG(immutable_db_options_.info_log,
|
|
|
|
"[%s] After refitting:\n%s", cfd->GetName().c_str(),
|
|
|
|
cfd->current()->DebugString().data());
|
|
|
|
}
|
2020-12-23 07:44:44 +00:00
|
|
|
sv_context.Clean();
|
|
|
|
refitting_level_ = false;
|
|
|
|
|
|
|
|
return status;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
refitting_level_ = false;
|
2020-12-23 07:44:44 +00:00
|
|
|
return Status::OK();
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
int DBImpl::NumberLevels(ColumnFamilyHandle* column_family) {
|
2020-07-03 02:24:25 +00:00
|
|
|
auto cfh = static_cast_with_check<ColumnFamilyHandleImpl>(column_family);
|
2017-04-06 00:14:05 +00:00
|
|
|
return cfh->cfd()->NumberLevels();
|
|
|
|
}
|
|
|
|
|
2018-03-05 21:08:17 +00:00
|
|
|
int DBImpl::MaxMemCompactionLevel(ColumnFamilyHandle* /*column_family*/) {
|
2017-04-06 00:14:05 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
int DBImpl::Level0StopWriteTrigger(ColumnFamilyHandle* column_family) {
|
2020-07-03 02:24:25 +00:00
|
|
|
auto cfh = static_cast_with_check<ColumnFamilyHandleImpl>(column_family);
|
2017-04-06 00:14:05 +00:00
|
|
|
InstrumentedMutexLock l(&mutex_);
|
2018-04-13 00:55:14 +00:00
|
|
|
return cfh->cfd()
|
|
|
|
->GetSuperVersion()
|
|
|
|
->mutable_cf_options.level0_stop_writes_trigger;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
2023-05-31 19:53:51 +00:00
|
|
|
Status DBImpl::FlushAllColumnFamilies(const FlushOptions& flush_options,
|
|
|
|
FlushReason flush_reason) {
|
|
|
|
mutex_.AssertHeld();
|
|
|
|
Status status;
|
|
|
|
if (immutable_db_options_.atomic_flush) {
|
|
|
|
mutex_.Unlock();
|
|
|
|
status = AtomicFlushMemTables(flush_options, flush_reason);
|
|
|
|
if (status.IsColumnFamilyDropped()) {
|
|
|
|
status = Status::OK();
|
|
|
|
}
|
|
|
|
mutex_.Lock();
|
|
|
|
} else {
|
|
|
|
for (auto cfd : versions_->GetRefedColumnFamilySet()) {
|
|
|
|
if (cfd->IsDropped()) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
mutex_.Unlock();
|
|
|
|
status = FlushMemTable(cfd, flush_options, flush_reason);
|
|
|
|
TEST_SYNC_POINT("DBImpl::FlushAllColumnFamilies:1");
|
|
|
|
TEST_SYNC_POINT("DBImpl::FlushAllColumnFamilies:2");
|
|
|
|
mutex_.Lock();
|
|
|
|
if (!status.ok() && !status.IsColumnFamilyDropped()) {
|
|
|
|
break;
|
|
|
|
} else if (status.IsColumnFamilyDropped()) {
|
|
|
|
status = Status::OK();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return status;
|
|
|
|
}
|
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
Status DBImpl::Flush(const FlushOptions& flush_options,
|
|
|
|
ColumnFamilyHandle* column_family) {
|
2020-07-03 02:24:25 +00:00
|
|
|
auto cfh = static_cast_with_check<ColumnFamilyHandleImpl>(column_family);
|
2018-01-19 01:32:50 +00:00
|
|
|
ROCKS_LOG_INFO(immutable_db_options_.info_log, "[%s] Manual flush start.",
|
|
|
|
cfh->GetName().c_str());
|
2018-10-26 22:06:44 +00:00
|
|
|
Status s;
|
2018-11-12 20:22:10 +00:00
|
|
|
if (immutable_db_options_.atomic_flush) {
|
Fix bug of prematurely excluded CF in atomic flush contains unflushed data that should've been included in the atomic flush (#11148)
Summary:
**Context:**
Atomic flush should guarantee recoverability of all data of seqno up to the max seqno of the flush. It achieves this by ensuring all such data are flushed by the time this atomic flush finishes through `SelectColumnFamiliesForAtomicFlush()`. However, our crash test exposed the following case where an excluded CF from an atomic flush contains unflushed data of seqno less than the max seqno of that atomic flush and loses its data with `WriteOptions::DisableWAL=true` in face of a crash right after the atomic flush finishes .
```
./db_stress --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
pid=$!
sleep 0.2
sleep 10
kill $pid
sleep 0.2
./db_stress --ops_per_thread=1 --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
pid=$!
sleep 0.2
sleep 40
kill $pid
sleep 0.2
Verification failed for column family 6 key 0000000000000239000000000000012B0000000000000138 (56622): value_from_db: , value_from_expected: 4A6331754E4F4C4D42434041464744455A5B58595E5F5C5D5253505156575455, msg: Value not found: NotFound:
Crash-recovery verification failed :(
No writes or ops?
Verification failed :(
```
The bug is due to the following:
- When atomic flush is used, an empty CF is legally [excluded](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L39) in `SelectColumnFamiliesForAtomicFlush` as the first step of `DBImpl::FlushForGetLiveFiles` before [passing](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L42) the included CFDs to `AtomicFlushMemTables`.
- But [later](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2133) in `AtomicFlushMemTables`, `WaitUntilFlushWouldNotStallWrites` will [release the db mutex](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2403), during which data@seqno N can be inserted into the excluded CF and data@seqno M can be inserted into one of the included CFs, where M > N.
- However, data@seqno N in an already-excluded CF is thus excluded from this atomic flush while we seqno N is less than seqno M.
**Summary:**
- Replace `SelectColumnFamiliesForAtomicFlush()`-before-`AtomicFlushMemTables()` with `SelectColumnFamiliesForAtomicFlush()`-after-wait-within-`AtomicFlushMemTables()` so we ensure no write affecting the recoverability of this atomic job (i.e, change to max seqno of this atomic flush or insertion of data with less seqno than the max seqno of the atomic flush to excluded CF) can happen after calling `SelectColumnFamiliesForAtomicFlush()`.
- For above, refactored and clarified comments on `SelectColumnFamiliesForAtomicFlush()` and `AtomicFlushMemTables()` for clearer semantics of passed-in CFDs to atomic-flush
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11148
Test Plan:
- New unit test failed before the fix and passes after
- Make check
- Rehearsal stress test
Reviewed By: ajkr
Differential Revision: D42799871
Pulled By: hx235
fbshipit-source-id: 13636b63e9c25c5895857afc36ea580d57f6d644
2023-03-14 23:53:20 +00:00
|
|
|
s = AtomicFlushMemTables(flush_options, FlushReason::kManualFlush,
|
|
|
|
{cfh->cfd()});
|
2018-10-26 22:06:44 +00:00
|
|
|
} else {
|
2022-05-13 19:31:30 +00:00
|
|
|
s = FlushMemTable(cfh->cfd(), flush_options, FlushReason::kManualFlush);
|
2018-10-26 22:06:44 +00:00
|
|
|
}
|
|
|
|
|
2018-01-19 01:32:50 +00:00
|
|
|
ROCKS_LOG_INFO(immutable_db_options_.info_log,
|
|
|
|
"[%s] Manual flush finished, status: %s\n",
|
|
|
|
cfh->GetName().c_str(), s.ToString().c_str());
|
|
|
|
return s;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
2018-10-26 22:06:44 +00:00
|
|
|
Status DBImpl::Flush(const FlushOptions& flush_options,
|
|
|
|
const std::vector<ColumnFamilyHandle*>& column_families) {
|
Auto recovery from out of space errors (#4164)
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
2018-09-15 20:36:19 +00:00
|
|
|
Status s;
|
2018-11-12 20:22:10 +00:00
|
|
|
if (!immutable_db_options_.atomic_flush) {
|
2018-10-26 22:06:44 +00:00
|
|
|
for (auto cfh : column_families) {
|
|
|
|
s = Flush(flush_options, cfh);
|
|
|
|
if (!s.ok()) {
|
|
|
|
break;
|
|
|
|
}
|
Auto recovery from out of space errors (#4164)
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
2018-09-15 20:36:19 +00:00
|
|
|
}
|
2018-10-26 22:06:44 +00:00
|
|
|
} else {
|
|
|
|
ROCKS_LOG_INFO(immutable_db_options_.info_log,
|
|
|
|
"Manual atomic flush start.\n"
|
|
|
|
"=====Column families:=====");
|
|
|
|
for (auto cfh : column_families) {
|
|
|
|
auto cfhi = static_cast<ColumnFamilyHandleImpl*>(cfh);
|
|
|
|
ROCKS_LOG_INFO(immutable_db_options_.info_log, "%s",
|
|
|
|
cfhi->GetName().c_str());
|
Auto recovery from out of space errors (#4164)
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
2018-09-15 20:36:19 +00:00
|
|
|
}
|
2018-10-26 22:06:44 +00:00
|
|
|
ROCKS_LOG_INFO(immutable_db_options_.info_log,
|
|
|
|
"=====End of column families list=====");
|
|
|
|
autovector<ColumnFamilyData*> cfds;
|
|
|
|
std::for_each(column_families.begin(), column_families.end(),
|
|
|
|
[&cfds](ColumnFamilyHandle* elem) {
|
|
|
|
auto cfh = static_cast<ColumnFamilyHandleImpl*>(elem);
|
|
|
|
cfds.emplace_back(cfh->cfd());
|
|
|
|
});
|
Fix bug of prematurely excluded CF in atomic flush contains unflushed data that should've been included in the atomic flush (#11148)
Summary:
**Context:**
Atomic flush should guarantee recoverability of all data of seqno up to the max seqno of the flush. It achieves this by ensuring all such data are flushed by the time this atomic flush finishes through `SelectColumnFamiliesForAtomicFlush()`. However, our crash test exposed the following case where an excluded CF from an atomic flush contains unflushed data of seqno less than the max seqno of that atomic flush and loses its data with `WriteOptions::DisableWAL=true` in face of a crash right after the atomic flush finishes .
```
./db_stress --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
pid=$!
sleep 0.2
sleep 10
kill $pid
sleep 0.2
./db_stress --ops_per_thread=1 --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
pid=$!
sleep 0.2
sleep 40
kill $pid
sleep 0.2
Verification failed for column family 6 key 0000000000000239000000000000012B0000000000000138 (56622): value_from_db: , value_from_expected: 4A6331754E4F4C4D42434041464744455A5B58595E5F5C5D5253505156575455, msg: Value not found: NotFound:
Crash-recovery verification failed :(
No writes or ops?
Verification failed :(
```
The bug is due to the following:
- When atomic flush is used, an empty CF is legally [excluded](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L39) in `SelectColumnFamiliesForAtomicFlush` as the first step of `DBImpl::FlushForGetLiveFiles` before [passing](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L42) the included CFDs to `AtomicFlushMemTables`.
- But [later](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2133) in `AtomicFlushMemTables`, `WaitUntilFlushWouldNotStallWrites` will [release the db mutex](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2403), during which data@seqno N can be inserted into the excluded CF and data@seqno M can be inserted into one of the included CFs, where M > N.
- However, data@seqno N in an already-excluded CF is thus excluded from this atomic flush while we seqno N is less than seqno M.
**Summary:**
- Replace `SelectColumnFamiliesForAtomicFlush()`-before-`AtomicFlushMemTables()` with `SelectColumnFamiliesForAtomicFlush()`-after-wait-within-`AtomicFlushMemTables()` so we ensure no write affecting the recoverability of this atomic job (i.e, change to max seqno of this atomic flush or insertion of data with less seqno than the max seqno of the atomic flush to excluded CF) can happen after calling `SelectColumnFamiliesForAtomicFlush()`.
- For above, refactored and clarified comments on `SelectColumnFamiliesForAtomicFlush()` and `AtomicFlushMemTables()` for clearer semantics of passed-in CFDs to atomic-flush
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11148
Test Plan:
- New unit test failed before the fix and passes after
- Make check
- Rehearsal stress test
Reviewed By: ajkr
Differential Revision: D42799871
Pulled By: hx235
fbshipit-source-id: 13636b63e9c25c5895857afc36ea580d57f6d644
2023-03-14 23:53:20 +00:00
|
|
|
s = AtomicFlushMemTables(flush_options, FlushReason::kManualFlush, cfds);
|
2018-10-26 22:06:44 +00:00
|
|
|
ROCKS_LOG_INFO(immutable_db_options_.info_log,
|
2019-04-04 19:05:42 +00:00
|
|
|
"Manual atomic flush finished, status: %s\n"
|
|
|
|
"=====Column families:=====",
|
|
|
|
s.ToString().c_str());
|
2018-10-26 22:06:44 +00:00
|
|
|
for (auto cfh : column_families) {
|
|
|
|
auto cfhi = static_cast<ColumnFamilyHandleImpl*>(cfh);
|
|
|
|
ROCKS_LOG_INFO(immutable_db_options_.info_log, "%s",
|
|
|
|
cfhi->GetName().c_str());
|
Auto recovery from out of space errors (#4164)
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
2018-09-15 20:36:19 +00:00
|
|
|
}
|
2018-10-26 22:06:44 +00:00
|
|
|
ROCKS_LOG_INFO(immutable_db_options_.info_log,
|
|
|
|
"=====End of column families list=====");
|
Auto recovery from out of space errors (#4164)
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
2018-09-15 20:36:19 +00:00
|
|
|
}
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
2019-04-17 06:29:32 +00:00
|
|
|
Status DBImpl::RunManualCompaction(
|
|
|
|
ColumnFamilyData* cfd, int input_level, int output_level,
|
|
|
|
const CompactRangeOptions& compact_range_options, const Slice* begin,
|
|
|
|
const Slice* end, bool exclusive, bool disallow_trivial_move,
|
Always allow L0->L1 trivial move during manual compaction (#11375)
Summary:
during manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with compacting key range (introduced in https://github.com/facebook/rocksdb/issues/7368 to enforce kForce* contract). This can cause large memory usage due to compaction readahead when number of L0 files is large. This PR allows L0->L1 trivial move in this case, and will do a L1 -> L1 intra-level compaction when needed (`bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 file, and user calls CompactRange(kForce, nullptr, nullptr),
- before this PR, RocksDB does a L0 -> L1 compaction (disallow trivial move),
- after this PR, RocksDB does a L0 -> L1 compaction (allow trivial move), and a L1 -> L1 compaction.
Users can use kForceOptimized to avoid this extra L1->L1 compaction overhead when L0s are overlapping and cannot be trivial moved.
This PR also fixed a bug (see previous discussion in https://github.com/facebook/rocksdb/issues/11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause incorrect level being moved when CompactRangeOptions::change_level is specified.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11375
Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed.
Reviewed By: ajkr
Differential Revision: D44943518
Pulled By: cbi42
fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708
2023-04-20 18:10:48 +00:00
|
|
|
uint64_t max_file_num_to_ignore, const std::string& trim_ts,
|
|
|
|
int* final_output_level) {
|
2017-04-06 00:14:05 +00:00
|
|
|
assert(input_level == ColumnFamilyData::kCompactAllLevels ||
|
|
|
|
input_level >= 0);
|
|
|
|
|
|
|
|
InternalKey begin_storage, end_storage;
|
2022-02-16 01:59:31 +00:00
|
|
|
CompactionArg* ca = nullptr;
|
2017-04-06 00:14:05 +00:00
|
|
|
|
|
|
|
bool scheduled = false;
|
2022-03-15 19:31:14 +00:00
|
|
|
bool unscheduled = false;
|
2022-03-02 21:43:00 +00:00
|
|
|
Env::Priority thread_pool_priority = Env::Priority::TOTAL;
|
2017-04-06 00:14:05 +00:00
|
|
|
bool manual_conflict = false;
|
2022-03-13 04:07:04 +00:00
|
|
|
|
2022-03-15 19:31:14 +00:00
|
|
|
ManualCompactionState manual(
|
2022-03-13 04:07:04 +00:00
|
|
|
cfd, input_level, output_level, compact_range_options.target_path_id,
|
|
|
|
exclusive, disallow_trivial_move, compact_range_options.canceled);
|
2017-04-06 00:14:05 +00:00
|
|
|
// For universal compaction, we enforce every manual compaction to compact
|
|
|
|
// all files.
|
|
|
|
if (begin == nullptr ||
|
|
|
|
cfd->ioptions()->compaction_style == kCompactionStyleUniversal ||
|
|
|
|
cfd->ioptions()->compaction_style == kCompactionStyleFIFO) {
|
2022-03-15 19:31:14 +00:00
|
|
|
manual.begin = nullptr;
|
2017-04-06 00:14:05 +00:00
|
|
|
} else {
|
2017-09-13 00:16:44 +00:00
|
|
|
begin_storage.SetMinPossibleForUserKey(*begin);
|
2022-03-15 19:31:14 +00:00
|
|
|
manual.begin = &begin_storage;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
if (end == nullptr ||
|
|
|
|
cfd->ioptions()->compaction_style == kCompactionStyleUniversal ||
|
|
|
|
cfd->ioptions()->compaction_style == kCompactionStyleFIFO) {
|
2022-03-15 19:31:14 +00:00
|
|
|
manual.end = nullptr;
|
2017-04-06 00:14:05 +00:00
|
|
|
} else {
|
2017-09-13 00:16:44 +00:00
|
|
|
end_storage.SetMaxPossibleForUserKey(*end);
|
2022-03-15 19:31:14 +00:00
|
|
|
manual.end = &end_storage;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
TEST_SYNC_POINT("DBImpl::RunManualCompaction:0");
|
|
|
|
TEST_SYNC_POINT("DBImpl::RunManualCompaction:1");
|
|
|
|
InstrumentedMutexLock l(&mutex_);
|
|
|
|
|
Prevent corruption with parallel manual compactions and `change_level == true` (#9077)
Summary:
The bug can impact the following scenario. There must be two `CompactRange()`s, call them A and B. Compaction A must have `change_level=true`. Compactions A and B must run in parallel, and new data must be added while they run as well.
Now, on to the details of the race condition. Compaction A must reach the refitting phase while B's next step is to trivial move new data (i.e., data that has been inserted behind A) down to the same level that A's refit targets (`CompactRangeOptions::target_level`). B must be unregistered (i.e., has not yet called `AddManualCompaction()` for the current `RunManualCompaction()`) while A invokes `DisableManualCompaction()`s to prepare for refitting. In the old code, B could still proceed to register a manual compaction, while A had disabled manual compaction.
The next part of the race condition is B picks and schedules a trivial move while A has released the lock in refitting phase in order to persist the LSM state change (i.e., the log phase of `LogAndApply()`). That way, B does not see the refitted data when picking a trivial-move compaction. So it is susceptible to picking one that overlaps.
Finally, B executes the picked trivial-move compaction. Trivial-move compactions are special in that they never check whether manual compaction is disabled. So the picked compaction causing overlap ends up being applied, leading to LSM corruption if `force_consistency_checks=false`, or entering read-only mode with `Status::Corruption` if `force_consistency_checks=true` (the default).
The fix is just to prevent B from registering itself in `RunManualCompaction()` while manual compactions are disabled, consequently preventing any trivial move or other compaction from being picked/scheduled.
Thanks to siying for finding the bug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9077
Test Plan: The test does not go all the way in exposing the bug because it requires a compaction to be picked/scheduled while logging LSM state change for RefitLevel(). But the fix is to make such a compaction not picked/scheduled in the first place, so any repro of that scenario would end up hanging RefitLevel() logging. So instead I just verified no such compaction is registered in the scenario where `RefitLevel()` disables manual compactions.
Reviewed By: siying
Differential Revision: D31921908
Pulled By: ajkr
fbshipit-source-id: 9bb5d0e847ad428211227f40830c685c209fbecb
2021-10-28 06:07:29 +00:00
|
|
|
if (manual_compaction_paused_ > 0) {
|
|
|
|
// Does not make sense to `AddManualCompaction()` in this scenario since
|
|
|
|
// `DisableManualCompaction()` just waited for the manual compaction queue
|
|
|
|
// to drain. So return immediately.
|
|
|
|
TEST_SYNC_POINT("DBImpl::RunManualCompaction:PausedAtStart");
|
2022-03-15 19:31:14 +00:00
|
|
|
manual.status =
|
Prevent corruption with parallel manual compactions and `change_level == true` (#9077)
Summary:
The bug can impact the following scenario. There must be two `CompactRange()`s, call them A and B. Compaction A must have `change_level=true`. Compactions A and B must run in parallel, and new data must be added while they run as well.
Now, on to the details of the race condition. Compaction A must reach the refitting phase while B's next step is to trivial move new data (i.e., data that has been inserted behind A) down to the same level that A's refit targets (`CompactRangeOptions::target_level`). B must be unregistered (i.e., has not yet called `AddManualCompaction()` for the current `RunManualCompaction()`) while A invokes `DisableManualCompaction()`s to prepare for refitting. In the old code, B could still proceed to register a manual compaction, while A had disabled manual compaction.
The next part of the race condition is B picks and schedules a trivial move while A has released the lock in refitting phase in order to persist the LSM state change (i.e., the log phase of `LogAndApply()`). That way, B does not see the refitted data when picking a trivial-move compaction. So it is susceptible to picking one that overlaps.
Finally, B executes the picked trivial-move compaction. Trivial-move compactions are special in that they never check whether manual compaction is disabled. So the picked compaction causing overlap ends up being applied, leading to LSM corruption if `force_consistency_checks=false`, or entering read-only mode with `Status::Corruption` if `force_consistency_checks=true` (the default).
The fix is just to prevent B from registering itself in `RunManualCompaction()` while manual compactions are disabled, consequently preventing any trivial move or other compaction from being picked/scheduled.
Thanks to siying for finding the bug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9077
Test Plan: The test does not go all the way in exposing the bug because it requires a compaction to be picked/scheduled while logging LSM state change for RefitLevel(). But the fix is to make such a compaction not picked/scheduled in the first place, so any repro of that scenario would end up hanging RefitLevel() logging. So instead I just verified no such compaction is registered in the scenario where `RefitLevel()` disables manual compactions.
Reviewed By: siying
Differential Revision: D31921908
Pulled By: ajkr
fbshipit-source-id: 9bb5d0e847ad428211227f40830c685c209fbecb
2021-10-28 06:07:29 +00:00
|
|
|
Status::Incomplete(Status::SubCode::kManualCompactionPaused);
|
2022-03-15 19:31:14 +00:00
|
|
|
manual.done = true;
|
|
|
|
return manual.status;
|
Prevent corruption with parallel manual compactions and `change_level == true` (#9077)
Summary:
The bug can impact the following scenario. There must be two `CompactRange()`s, call them A and B. Compaction A must have `change_level=true`. Compactions A and B must run in parallel, and new data must be added while they run as well.
Now, on to the details of the race condition. Compaction A must reach the refitting phase while B's next step is to trivial move new data (i.e., data that has been inserted behind A) down to the same level that A's refit targets (`CompactRangeOptions::target_level`). B must be unregistered (i.e., has not yet called `AddManualCompaction()` for the current `RunManualCompaction()`) while A invokes `DisableManualCompaction()`s to prepare for refitting. In the old code, B could still proceed to register a manual compaction, while A had disabled manual compaction.
The next part of the race condition is B picks and schedules a trivial move while A has released the lock in refitting phase in order to persist the LSM state change (i.e., the log phase of `LogAndApply()`). That way, B does not see the refitted data when picking a trivial-move compaction. So it is susceptible to picking one that overlaps.
Finally, B executes the picked trivial-move compaction. Trivial-move compactions are special in that they never check whether manual compaction is disabled. So the picked compaction causing overlap ends up being applied, leading to LSM corruption if `force_consistency_checks=false`, or entering read-only mode with `Status::Corruption` if `force_consistency_checks=true` (the default).
The fix is just to prevent B from registering itself in `RunManualCompaction()` while manual compactions are disabled, consequently preventing any trivial move or other compaction from being picked/scheduled.
Thanks to siying for finding the bug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9077
Test Plan: The test does not go all the way in exposing the bug because it requires a compaction to be picked/scheduled while logging LSM state change for RefitLevel(). But the fix is to make such a compaction not picked/scheduled in the first place, so any repro of that scenario would end up hanging RefitLevel() logging. So instead I just verified no such compaction is registered in the scenario where `RefitLevel()` disables manual compactions.
Reviewed By: siying
Differential Revision: D31921908
Pulled By: ajkr
fbshipit-source-id: 9bb5d0e847ad428211227f40830c685c209fbecb
2021-10-28 06:07:29 +00:00
|
|
|
}
|
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
// When a manual compaction arrives, temporarily disable scheduling of
|
|
|
|
// non-manual compactions and wait until the number of scheduled compaction
|
2021-10-07 22:22:34 +00:00
|
|
|
// jobs drops to zero. This used to be needed to ensure that this manual
|
|
|
|
// compaction can compact any range of keys/files. Now it is optional
|
|
|
|
// (see `CompactRangeOptions::exclusive_manual_compaction`). The use case for
|
2022-07-29 00:07:36 +00:00
|
|
|
// `exclusive_manual_compaction=true` is unclear beyond not trusting the code.
|
2017-04-06 00:14:05 +00:00
|
|
|
//
|
|
|
|
// HasPendingManualCompaction() is true when at least one thread is inside
|
|
|
|
// RunManualCompaction(), i.e. during that time no other compaction will
|
|
|
|
// get scheduled (see MaybeScheduleFlushOrCompaction).
|
|
|
|
//
|
|
|
|
// Note that the following loop doesn't stop more that one thread calling
|
|
|
|
// RunManualCompaction() from getting to the second while loop below.
|
|
|
|
// However, only one of them will actually schedule compaction, while
|
|
|
|
// others will wait on a condition variable until it completes.
|
|
|
|
|
2022-03-15 19:31:14 +00:00
|
|
|
AddManualCompaction(&manual);
|
2017-04-06 00:14:05 +00:00
|
|
|
TEST_SYNC_POINT_CALLBACK("DBImpl::RunManualCompaction:NotScheduled", &mutex_);
|
|
|
|
if (exclusive) {
|
2021-10-07 22:22:34 +00:00
|
|
|
// Limitation: there's no way to wake up the below loop when user sets
|
|
|
|
// `*manual.canceled`. So `CompactRangeOptions::exclusive_manual_compaction`
|
|
|
|
// and `CompactRangeOptions::canceled` might not work well together.
|
2017-08-03 22:36:28 +00:00
|
|
|
while (bg_bottom_compaction_scheduled_ > 0 ||
|
|
|
|
bg_compaction_scheduled_ > 0) {
|
2022-06-07 01:32:26 +00:00
|
|
|
if (manual_compaction_paused_ > 0 || manual.canceled == true) {
|
2021-10-07 22:22:34 +00:00
|
|
|
// Pretend the error came from compaction so the below cleanup/error
|
|
|
|
// handling code can process it.
|
2022-03-15 19:31:14 +00:00
|
|
|
manual.done = true;
|
|
|
|
manual.status =
|
2021-10-07 22:22:34 +00:00
|
|
|
Status::Incomplete(Status::SubCode::kManualCompactionPaused);
|
|
|
|
break;
|
|
|
|
}
|
2017-05-02 22:01:07 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::RunManualCompaction:WaitScheduled");
|
2017-04-06 00:14:05 +00:00
|
|
|
ROCKS_LOG_INFO(
|
|
|
|
immutable_db_options_.info_log,
|
|
|
|
"[%s] Manual compaction waiting for all other scheduled background "
|
|
|
|
"compactions to finish",
|
|
|
|
cfd->GetName().c_str());
|
|
|
|
bg_cv_.Wait();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Concurrent task limiter for compaction thread control (#4332)
Summary:
The PR is targeting to resolve the issue of:
https://github.com/facebook/rocksdb/issues/3972#issue-330771918
We have a rocksdb created with leveled-compaction with multiple column families (CFs), some of CFs are using HDD to store big and less frequently accessed data and others are using SSD.
When there are continuously write traffics going on to all CFs, the compaction thread pool is mostly occupied by those slow HDD compactions, which blocks fully utilize SSD bandwidth.
Since atomic write and transaction is needed across CFs, so splitting it to multiple rocksdb instance is not an option for us.
With the compaction thread control, we got 30%+ HDD write throughput gain, and also a lot smooth SSD write since less write stall happening.
ConcurrentTaskLimiter can be shared with multi-CFs across rocksdb instances, so the feature does not only work for multi-CFs scenarios, but also for multi-rocksdbs scenarios, who need disk IO resource control per tenant.
The usage is straight forward:
e.g.:
//
// Enable compaction thread limiter thru ColumnFamilyOptions
//
std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = ctl;
...
//
// Compaction thread limiter can be tuned or disabled on-the-fly
//
ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks
...
ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter
ctl->SetMaxOutstandingTask(-1); // Same as above
...
ctl->SetMaxOutstandingTask(0); // full throttle (0 task)
//
// Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue)
//
std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8));
std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt_ssd1(options);
ColumnFamilyOptions cf_opt_ssd2(options);
ColumnFamilyOptions cf_opt_hdd1(options);
ColumnFamilyOptions cf_opt_hdd2(options);
ColumnFamilyOptions cf_opt_hdd3(options);
// SSD CFs
cf_opt_ssd1.compaction_thread_limiter = ctl_ssd;
cf_opt_ssd2.compaction_thread_limiter = ctl_ssd;
// HDD CFs
cf_opt_hdd1.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd2.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd3.compaction_thread_limiter = ctl_hdd;
...
//
// The limiter is disabled by default (or set to nullptr explicitly)
//
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = nullptr;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332
Differential Revision: D13226590
Pulled By: siying
fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d
2018-12-13 21:16:04 +00:00
|
|
|
LogBuffer log_buffer(InfoLogLevel::INFO_LEVEL,
|
|
|
|
immutable_db_options_.info_log.get());
|
2022-05-23 17:09:37 +00:00
|
|
|
|
|
|
|
ROCKS_LOG_BUFFER(&log_buffer, "[%s] Manual compaction starting",
|
|
|
|
cfd->GetName().c_str());
|
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
// We don't check bg_error_ here, because if we get the error in compaction,
|
|
|
|
// the compaction will set manual.status to bg_error_ and set manual.done to
|
|
|
|
// true.
|
2022-03-15 19:31:14 +00:00
|
|
|
while (!manual.done) {
|
2017-04-06 00:14:05 +00:00
|
|
|
assert(HasPendingManualCompaction());
|
|
|
|
manual_conflict = false;
|
2017-10-19 17:48:47 +00:00
|
|
|
Compaction* compaction = nullptr;
|
2022-03-15 19:31:14 +00:00
|
|
|
if (ShouldntRunManualCompaction(&manual) || (manual.in_progress == true) ||
|
|
|
|
scheduled ||
|
|
|
|
(((manual.manual_end = &manual.tmp_storage1) != nullptr) &&
|
|
|
|
((compaction = manual.cfd->CompactRange(
|
|
|
|
*manual.cfd->GetLatestMutableCFOptions(), mutable_db_options_,
|
|
|
|
manual.input_level, manual.output_level, compact_range_options,
|
|
|
|
manual.begin, manual.end, &manual.manual_end, &manual_conflict,
|
|
|
|
max_file_num_to_ignore, trim_ts)) == nullptr &&
|
2018-04-13 00:55:14 +00:00
|
|
|
manual_conflict))) {
|
2023-02-01 00:57:49 +00:00
|
|
|
if (!scheduled) {
|
|
|
|
// There is a conflicting compaction
|
|
|
|
if (manual_compaction_paused_ > 0 || manual.canceled == true) {
|
|
|
|
// Stop waiting since it was canceled. Pretend the error came from
|
|
|
|
// compaction so the below cleanup/error handling code can process it.
|
|
|
|
manual.done = true;
|
|
|
|
manual.status =
|
|
|
|
Status::Incomplete(Status::SubCode::kManualCompactionPaused);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (!manual.done) {
|
|
|
|
bg_cv_.Wait();
|
|
|
|
}
|
2022-03-15 19:31:14 +00:00
|
|
|
if (manual_compaction_paused_ > 0 && scheduled && !unscheduled) {
|
|
|
|
assert(thread_pool_priority != Env::Priority::TOTAL);
|
|
|
|
// unschedule all manual compactions
|
|
|
|
auto unscheduled_task_num = env_->UnSchedule(
|
|
|
|
GetTaskTag(TaskType::kManualCompaction), thread_pool_priority);
|
|
|
|
if (unscheduled_task_num > 0) {
|
|
|
|
ROCKS_LOG_INFO(
|
|
|
|
immutable_db_options_.info_log,
|
|
|
|
"[%s] Unscheduled %d number of manual compactions from the "
|
|
|
|
"thread-pool",
|
|
|
|
cfd->GetName().c_str(), unscheduled_task_num);
|
|
|
|
// it may unschedule other manual compactions, notify others.
|
|
|
|
bg_cv_.SignalAll();
|
2022-03-13 04:07:04 +00:00
|
|
|
}
|
2022-03-15 19:31:14 +00:00
|
|
|
unscheduled = true;
|
|
|
|
TEST_SYNC_POINT("DBImpl::RunManualCompaction:Unscheduled");
|
2022-02-16 01:59:31 +00:00
|
|
|
}
|
2022-03-15 19:31:14 +00:00
|
|
|
if (scheduled && manual.incomplete == true) {
|
|
|
|
assert(!manual.in_progress);
|
2017-04-06 00:14:05 +00:00
|
|
|
scheduled = false;
|
2022-03-15 19:31:14 +00:00
|
|
|
manual.incomplete = false;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
} else if (!scheduled) {
|
2017-08-03 22:36:28 +00:00
|
|
|
if (compaction == nullptr) {
|
2022-03-15 19:31:14 +00:00
|
|
|
manual.done = true;
|
Always allow L0->L1 trivial move during manual compaction (#11375)
Summary:
during manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with compacting key range (introduced in https://github.com/facebook/rocksdb/issues/7368 to enforce kForce* contract). This can cause large memory usage due to compaction readahead when number of L0 files is large. This PR allows L0->L1 trivial move in this case, and will do a L1 -> L1 intra-level compaction when needed (`bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 file, and user calls CompactRange(kForce, nullptr, nullptr),
- before this PR, RocksDB does a L0 -> L1 compaction (disallow trivial move),
- after this PR, RocksDB does a L0 -> L1 compaction (allow trivial move), and a L1 -> L1 compaction.
Users can use kForceOptimized to avoid this extra L1->L1 compaction overhead when L0s are overlapping and cannot be trivial moved.
This PR also fixed a bug (see previous discussion in https://github.com/facebook/rocksdb/issues/11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause incorrect level being moved when CompactRangeOptions::change_level is specified.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11375
Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed.
Reviewed By: ajkr
Differential Revision: D44943518
Pulled By: cbi42
fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708
2023-04-20 18:10:48 +00:00
|
|
|
if (final_output_level) {
|
|
|
|
// No compaction needed or there is a conflicting compaction.
|
|
|
|
// Still set `final_output_level` to the level where we would
|
|
|
|
// have compacted to.
|
|
|
|
*final_output_level = output_level;
|
|
|
|
if (output_level == ColumnFamilyData::kCompactToBaseLevel) {
|
|
|
|
*final_output_level = cfd->current()->storage_info()->base_level();
|
|
|
|
}
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
bg_cv_.SignalAll();
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
ca = new CompactionArg;
|
|
|
|
ca->db = this;
|
2017-08-03 22:36:28 +00:00
|
|
|
ca->prepicked_compaction = new PrepickedCompaction;
|
2022-03-15 19:31:14 +00:00
|
|
|
ca->prepicked_compaction->manual_compaction_state = &manual;
|
2017-08-03 22:36:28 +00:00
|
|
|
ca->prepicked_compaction->compaction = compaction;
|
2019-01-02 17:56:39 +00:00
|
|
|
if (!RequestCompactionToken(
|
|
|
|
cfd, true, &ca->prepicked_compaction->task_token, &log_buffer)) {
|
|
|
|
// Don't throttle manual compaction, only count outstanding tasks.
|
|
|
|
assert(false);
|
Concurrent task limiter for compaction thread control (#4332)
Summary:
The PR is targeting to resolve the issue of:
https://github.com/facebook/rocksdb/issues/3972#issue-330771918
We have a rocksdb created with leveled-compaction with multiple column families (CFs), some of CFs are using HDD to store big and less frequently accessed data and others are using SSD.
When there are continuously write traffics going on to all CFs, the compaction thread pool is mostly occupied by those slow HDD compactions, which blocks fully utilize SSD bandwidth.
Since atomic write and transaction is needed across CFs, so splitting it to multiple rocksdb instance is not an option for us.
With the compaction thread control, we got 30%+ HDD write throughput gain, and also a lot smooth SSD write since less write stall happening.
ConcurrentTaskLimiter can be shared with multi-CFs across rocksdb instances, so the feature does not only work for multi-CFs scenarios, but also for multi-rocksdbs scenarios, who need disk IO resource control per tenant.
The usage is straight forward:
e.g.:
//
// Enable compaction thread limiter thru ColumnFamilyOptions
//
std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = ctl;
...
//
// Compaction thread limiter can be tuned or disabled on-the-fly
//
ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks
...
ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter
ctl->SetMaxOutstandingTask(-1); // Same as above
...
ctl->SetMaxOutstandingTask(0); // full throttle (0 task)
//
// Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue)
//
std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8));
std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt_ssd1(options);
ColumnFamilyOptions cf_opt_ssd2(options);
ColumnFamilyOptions cf_opt_hdd1(options);
ColumnFamilyOptions cf_opt_hdd2(options);
ColumnFamilyOptions cf_opt_hdd3(options);
// SSD CFs
cf_opt_ssd1.compaction_thread_limiter = ctl_ssd;
cf_opt_ssd2.compaction_thread_limiter = ctl_ssd;
// HDD CFs
cf_opt_hdd1.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd2.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd3.compaction_thread_limiter = ctl_hdd;
...
//
// The limiter is disabled by default (or set to nullptr explicitly)
//
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = nullptr;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332
Differential Revision: D13226590
Pulled By: siying
fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d
2018-12-13 21:16:04 +00:00
|
|
|
}
|
2022-03-15 19:31:14 +00:00
|
|
|
manual.incomplete = false;
|
2020-03-25 03:20:33 +00:00
|
|
|
if (compaction->bottommost_level() &&
|
|
|
|
env_->GetBackgroundThreads(Env::Priority::BOTTOM) > 0) {
|
2021-11-19 01:26:39 +00:00
|
|
|
bg_bottom_compaction_scheduled_++;
|
|
|
|
ca->compaction_pri_ = Env::Priority::BOTTOM;
|
|
|
|
env_->Schedule(&DBImpl::BGWorkBottomCompaction, ca,
|
2022-03-02 21:43:00 +00:00
|
|
|
Env::Priority::BOTTOM,
|
|
|
|
GetTaskTag(TaskType::kManualCompaction),
|
2021-11-19 01:26:39 +00:00
|
|
|
&DBImpl::UnscheduleCompactionCallback);
|
2022-03-02 21:43:00 +00:00
|
|
|
thread_pool_priority = Env::Priority::BOTTOM;
|
2021-11-19 01:26:39 +00:00
|
|
|
} else {
|
|
|
|
bg_compaction_scheduled_++;
|
|
|
|
ca->compaction_pri_ = Env::Priority::LOW;
|
2022-03-02 21:43:00 +00:00
|
|
|
env_->Schedule(&DBImpl::BGWorkCompaction, ca, Env::Priority::LOW,
|
|
|
|
GetTaskTag(TaskType::kManualCompaction),
|
2021-11-19 01:26:39 +00:00
|
|
|
&DBImpl::UnscheduleCompactionCallback);
|
2022-03-02 21:43:00 +00:00
|
|
|
thread_pool_priority = Env::Priority::LOW;
|
2020-03-25 03:20:33 +00:00
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
scheduled = true;
|
2022-02-16 01:59:31 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::RunManualCompaction:Scheduled");
|
Always allow L0->L1 trivial move during manual compaction (#11375)
Summary:
during manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with compacting key range (introduced in https://github.com/facebook/rocksdb/issues/7368 to enforce kForce* contract). This can cause large memory usage due to compaction readahead when number of L0 files is large. This PR allows L0->L1 trivial move in this case, and will do a L1 -> L1 intra-level compaction when needed (`bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 file, and user calls CompactRange(kForce, nullptr, nullptr),
- before this PR, RocksDB does a L0 -> L1 compaction (disallow trivial move),
- after this PR, RocksDB does a L0 -> L1 compaction (allow trivial move), and a L1 -> L1 compaction.
Users can use kForceOptimized to avoid this extra L1->L1 compaction overhead when L0s are overlapping and cannot be trivial moved.
This PR also fixed a bug (see previous discussion in https://github.com/facebook/rocksdb/issues/11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause incorrect level being moved when CompactRangeOptions::change_level is specified.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11375
Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed.
Reviewed By: ajkr
Differential Revision: D44943518
Pulled By: cbi42
fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708
2023-04-20 18:10:48 +00:00
|
|
|
if (final_output_level) {
|
|
|
|
*final_output_level = compaction->output_level();
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Concurrent task limiter for compaction thread control (#4332)
Summary:
The PR is targeting to resolve the issue of:
https://github.com/facebook/rocksdb/issues/3972#issue-330771918
We have a rocksdb created with leveled-compaction with multiple column families (CFs), some of CFs are using HDD to store big and less frequently accessed data and others are using SSD.
When there are continuously write traffics going on to all CFs, the compaction thread pool is mostly occupied by those slow HDD compactions, which blocks fully utilize SSD bandwidth.
Since atomic write and transaction is needed across CFs, so splitting it to multiple rocksdb instance is not an option for us.
With the compaction thread control, we got 30%+ HDD write throughput gain, and also a lot smooth SSD write since less write stall happening.
ConcurrentTaskLimiter can be shared with multi-CFs across rocksdb instances, so the feature does not only work for multi-CFs scenarios, but also for multi-rocksdbs scenarios, who need disk IO resource control per tenant.
The usage is straight forward:
e.g.:
//
// Enable compaction thread limiter thru ColumnFamilyOptions
//
std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = ctl;
...
//
// Compaction thread limiter can be tuned or disabled on-the-fly
//
ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks
...
ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter
ctl->SetMaxOutstandingTask(-1); // Same as above
...
ctl->SetMaxOutstandingTask(0); // full throttle (0 task)
//
// Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue)
//
std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8));
std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt_ssd1(options);
ColumnFamilyOptions cf_opt_ssd2(options);
ColumnFamilyOptions cf_opt_hdd1(options);
ColumnFamilyOptions cf_opt_hdd2(options);
ColumnFamilyOptions cf_opt_hdd3(options);
// SSD CFs
cf_opt_ssd1.compaction_thread_limiter = ctl_ssd;
cf_opt_ssd2.compaction_thread_limiter = ctl_ssd;
// HDD CFs
cf_opt_hdd1.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd2.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd3.compaction_thread_limiter = ctl_hdd;
...
//
// The limiter is disabled by default (or set to nullptr explicitly)
//
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = nullptr;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332
Differential Revision: D13226590
Pulled By: siying
fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d
2018-12-13 21:16:04 +00:00
|
|
|
log_buffer.FlushBufferToLog();
|
2022-03-15 19:31:14 +00:00
|
|
|
assert(!manual.in_progress);
|
2017-04-06 00:14:05 +00:00
|
|
|
assert(HasPendingManualCompaction());
|
2022-03-15 19:31:14 +00:00
|
|
|
RemoveManualCompaction(&manual);
|
2022-03-02 21:43:00 +00:00
|
|
|
// if the manual job is unscheduled, try schedule other jobs in case there's
|
|
|
|
// any unscheduled compaction job which was blocked by exclusive manual
|
|
|
|
// compaction.
|
2022-03-15 19:31:14 +00:00
|
|
|
if (manual.status.IsIncomplete() &&
|
|
|
|
manual.status.subcode() == Status::SubCode::kManualCompactionPaused) {
|
2022-03-02 21:43:00 +00:00
|
|
|
MaybeScheduleFlushOrCompaction();
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
bg_cv_.SignalAll();
|
2022-03-15 19:31:14 +00:00
|
|
|
return manual.status;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
2018-10-26 22:06:44 +00:00
|
|
|
void DBImpl::GenerateFlushRequest(const autovector<ColumnFamilyData*>& cfds,
|
2023-01-24 17:54:04 +00:00
|
|
|
FlushReason flush_reason, FlushRequest* req) {
|
2018-10-26 22:06:44 +00:00
|
|
|
assert(req != nullptr);
|
2023-01-24 17:54:04 +00:00
|
|
|
req->flush_reason = flush_reason;
|
|
|
|
req->cfd_to_max_mem_id_to_persist.reserve(cfds.size());
|
2018-10-26 22:06:44 +00:00
|
|
|
for (const auto cfd : cfds) {
|
|
|
|
if (nullptr == cfd) {
|
|
|
|
// cfd may be null, see DBImpl::ScheduleFlushes
|
|
|
|
continue;
|
|
|
|
}
|
2023-10-02 23:26:24 +00:00
|
|
|
uint64_t max_memtable_id = cfd->imm()->GetLatestMemTableID(
|
|
|
|
immutable_db_options_.atomic_flush /* for_atomic_flush */);
|
2023-01-24 17:54:04 +00:00
|
|
|
req->cfd_to_max_mem_id_to_persist.emplace(cfd, max_memtable_id);
|
2018-10-26 22:06:44 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
Status DBImpl::FlushMemTable(ColumnFamilyData* cfd,
|
|
|
|
const FlushOptions& flush_options,
|
2022-10-04 23:43:01 +00:00
|
|
|
FlushReason flush_reason,
|
|
|
|
bool entered_write_thread) {
|
2020-12-02 17:29:50 +00:00
|
|
|
// This method should not be called if atomic_flush is true.
|
|
|
|
assert(!immutable_db_options_.atomic_flush);
|
2022-10-04 23:43:01 +00:00
|
|
|
if (!flush_options.wait && write_controller_.IsStopped()) {
|
|
|
|
std::ostringstream oss;
|
|
|
|
oss << "Writes have been stopped, thus unable to perform manual flush. "
|
|
|
|
"Please try again later after writes are resumed";
|
|
|
|
return Status::TryAgain(oss.str());
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
Status s;
|
2018-08-29 18:58:13 +00:00
|
|
|
if (!flush_options.allow_write_stall) {
|
|
|
|
bool flush_needed = true;
|
|
|
|
s = WaitUntilFlushWouldNotStallWrites(cfd, &flush_needed);
|
|
|
|
TEST_SYNC_POINT("DBImpl::FlushMemTable:StallWaitDone");
|
|
|
|
if (!s.ok() || !flush_needed) {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
}
|
2020-09-18 03:22:35 +00:00
|
|
|
|
2022-10-04 23:43:01 +00:00
|
|
|
const bool needs_to_join_write_thread = !entered_write_thread;
|
2020-12-02 17:29:50 +00:00
|
|
|
autovector<FlushRequest> flush_reqs;
|
|
|
|
autovector<uint64_t> memtable_ids_to_wait;
|
2017-04-06 00:14:05 +00:00
|
|
|
{
|
|
|
|
WriteContext context;
|
|
|
|
InstrumentedMutexLock guard_lock(&mutex_);
|
|
|
|
|
|
|
|
WriteThread::Writer w;
|
2019-11-15 21:59:03 +00:00
|
|
|
WriteThread::Writer nonmem_w;
|
2022-10-04 23:43:01 +00:00
|
|
|
if (needs_to_join_write_thread) {
|
2017-04-06 00:14:05 +00:00
|
|
|
write_thread_.EnterUnbatched(&w, &mutex_);
|
2019-11-15 21:59:03 +00:00
|
|
|
if (two_write_queues_) {
|
|
|
|
nonmem_write_thread_.EnterUnbatched(&nonmem_w, &mutex_);
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
2019-12-12 22:05:48 +00:00
|
|
|
WaitForPendingWrites();
|
2017-04-06 00:14:05 +00:00
|
|
|
|
2023-10-02 23:26:24 +00:00
|
|
|
if (!cfd->mem()->IsEmpty() || !cached_recoverable_state_empty_.load()) {
|
Fix assert(cfd->imm()->NumNotFlushed() > 0) in FlushMemtable (#7744)
Summary:
In current code base, in FlushMemtable, when `(Flush_reason == FlushReason::kErrorRecoveryRetryFlush && (!cfd->mem()->IsEmpty() || !cached_recoverable_state_empty_.load()))`, we assert that cfd->imm()->NumNotFlushed() > 0. However, there are some corner cases that can fail this assert: 1) if there are multiple CFs, some CF has immutable memtable, some CFs don't. In ResumeImpl, all CFs will call FlushMemtable, which will hit the assert. 2) Regular flush is scheduled and running, the resume thread is waiting. New KVs are inserted and SchedulePendingFlush is called. Regular flush will continue call MaybeScheduleFlushAndCompaction until all the immutable memtables are flushed. When regular flush ends and auto resume thread starts to schedule new flushes, cfd->imm()->NumNotFlushed() can be 0.
Remove the assert and added the comments.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7744
Test Plan: make check and pass the stress test
Reviewed By: riversand963
Differential Revision: D25340573
Pulled By: zhichao-cao
fbshipit-source-id: eac357bdace660247c197f01a9ff6857e3c97672
2020-12-05 04:30:23 +00:00
|
|
|
s = SwitchMemtable(cfd, &context);
|
2018-12-19 00:43:12 +00:00
|
|
|
}
|
2022-05-05 20:08:21 +00:00
|
|
|
const uint64_t flush_memtable_id = std::numeric_limits<uint64_t>::max();
|
2018-12-19 00:43:12 +00:00
|
|
|
if (s.ok()) {
|
|
|
|
if (cfd->imm()->NumNotFlushed() != 0 || !cfd->mem()->IsEmpty() ||
|
|
|
|
!cached_recoverable_state_empty_.load()) {
|
2023-01-24 17:54:04 +00:00
|
|
|
FlushRequest req{flush_reason, {{cfd, flush_memtable_id}}};
|
2020-12-02 17:29:50 +00:00
|
|
|
flush_reqs.emplace_back(std::move(req));
|
2023-10-02 23:26:24 +00:00
|
|
|
memtable_ids_to_wait.emplace_back(
|
|
|
|
cfd->imm()->GetLatestMemTableID(false /* for_atomic_flush */));
|
2018-12-19 00:43:12 +00:00
|
|
|
}
|
2023-10-02 23:26:24 +00:00
|
|
|
if (immutable_db_options_.persist_stats_to_disk) {
|
2019-07-01 18:53:25 +00:00
|
|
|
ColumnFamilyData* cfd_stats =
|
|
|
|
versions_->GetColumnFamilySet()->GetColumnFamily(
|
|
|
|
kPersistentStatsColumnFamilyName);
|
|
|
|
if (cfd_stats != nullptr && cfd_stats != cfd &&
|
|
|
|
!cfd_stats->mem()->IsEmpty()) {
|
|
|
|
// only force flush stats CF when it will be the only CF lagging
|
|
|
|
// behind after the current flush
|
|
|
|
bool stats_cf_flush_needed = true;
|
|
|
|
for (auto* loop_cfd : *versions_->GetColumnFamilySet()) {
|
|
|
|
if (loop_cfd == cfd_stats || loop_cfd == cfd) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
if (loop_cfd->GetLogNumber() <= cfd_stats->GetLogNumber()) {
|
|
|
|
stats_cf_flush_needed = false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (stats_cf_flush_needed) {
|
|
|
|
ROCKS_LOG_INFO(immutable_db_options_.info_log,
|
|
|
|
"Force flushing stats CF with manual flush of %s "
|
2019-09-20 19:00:55 +00:00
|
|
|
"to avoid holding old logs",
|
|
|
|
cfd->GetName().c_str());
|
2019-07-01 18:53:25 +00:00
|
|
|
s = SwitchMemtable(cfd_stats, &context);
|
2023-01-24 17:54:04 +00:00
|
|
|
FlushRequest req{flush_reason, {{cfd_stats, flush_memtable_id}}};
|
2020-12-02 17:29:50 +00:00
|
|
|
flush_reqs.emplace_back(std::move(req));
|
|
|
|
memtable_ids_to_wait.emplace_back(
|
2023-10-02 23:26:24 +00:00
|
|
|
cfd_stats->imm()->GetLatestMemTableID(
|
|
|
|
false /* for_atomic_flush */));
|
2019-07-01 18:53:25 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2018-08-24 20:17:29 +00:00
|
|
|
}
|
2020-12-02 17:29:50 +00:00
|
|
|
|
|
|
|
if (s.ok() && !flush_reqs.empty()) {
|
|
|
|
for (const auto& req : flush_reqs) {
|
2023-01-24 17:54:04 +00:00
|
|
|
assert(req.cfd_to_max_mem_id_to_persist.size() == 1);
|
|
|
|
ColumnFamilyData* loop_cfd =
|
|
|
|
req.cfd_to_max_mem_id_to_persist.begin()->first;
|
2018-08-24 20:17:29 +00:00
|
|
|
loop_cfd->imm()->FlushRequested();
|
|
|
|
}
|
2019-07-01 21:04:10 +00:00
|
|
|
// If the caller wants to wait for this flush to complete, it indicates
|
|
|
|
// that the caller expects the ColumnFamilyData not to be free'ed by
|
|
|
|
// other threads which may drop the column family concurrently.
|
|
|
|
// Therefore, we increase the cfd's ref count.
|
|
|
|
if (flush_options.wait) {
|
2020-12-02 17:29:50 +00:00
|
|
|
for (const auto& req : flush_reqs) {
|
2023-01-24 17:54:04 +00:00
|
|
|
assert(req.cfd_to_max_mem_id_to_persist.size() == 1);
|
|
|
|
ColumnFamilyData* loop_cfd =
|
|
|
|
req.cfd_to_max_mem_id_to_persist.begin()->first;
|
2019-07-01 21:04:10 +00:00
|
|
|
loop_cfd->Ref();
|
|
|
|
}
|
|
|
|
}
|
2020-12-02 17:29:50 +00:00
|
|
|
for (const auto& req : flush_reqs) {
|
2023-01-24 17:54:04 +00:00
|
|
|
SchedulePendingFlush(req);
|
2020-12-02 17:29:50 +00:00
|
|
|
}
|
2018-08-24 20:17:29 +00:00
|
|
|
MaybeScheduleFlushOrCompaction();
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
|
2022-10-04 23:43:01 +00:00
|
|
|
if (needs_to_join_write_thread) {
|
2017-04-06 00:14:05 +00:00
|
|
|
write_thread_.ExitUnbatched(&w);
|
2019-11-15 21:59:03 +00:00
|
|
|
if (two_write_queues_) {
|
|
|
|
nonmem_write_thread_.ExitUnbatched(&nonmem_w);
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
}
|
2019-07-01 21:04:10 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::FlushMemTable:AfterScheduleFlush");
|
|
|
|
TEST_SYNC_POINT("DBImpl::FlushMemTable:BeforeWaitForBgFlush");
|
2017-04-06 00:14:05 +00:00
|
|
|
if (s.ok() && flush_options.wait) {
|
2018-08-24 20:17:29 +00:00
|
|
|
autovector<ColumnFamilyData*> cfds;
|
|
|
|
autovector<const uint64_t*> flush_memtable_ids;
|
2020-12-02 17:29:50 +00:00
|
|
|
assert(flush_reqs.size() == memtable_ids_to_wait.size());
|
|
|
|
for (size_t i = 0; i < flush_reqs.size(); ++i) {
|
2023-01-24 17:54:04 +00:00
|
|
|
assert(flush_reqs[i].cfd_to_max_mem_id_to_persist.size() == 1);
|
|
|
|
cfds.push_back(flush_reqs[i].cfd_to_max_mem_id_to_persist.begin()->first);
|
2020-12-02 17:29:50 +00:00
|
|
|
flush_memtable_ids.push_back(&(memtable_ids_to_wait[i]));
|
2018-08-24 20:17:29 +00:00
|
|
|
}
|
2020-09-18 03:22:35 +00:00
|
|
|
s = WaitForFlushMemTables(
|
|
|
|
cfds, flush_memtable_ids,
|
2023-10-02 23:26:24 +00:00
|
|
|
flush_reason == FlushReason::kErrorRecovery /* resuming_from_bg_err */);
|
2019-12-13 03:02:51 +00:00
|
|
|
InstrumentedMutexLock lock_guard(&mutex_);
|
2019-07-01 21:04:10 +00:00
|
|
|
for (auto* tmp_cfd : cfds) {
|
2019-12-13 03:02:51 +00:00
|
|
|
tmp_cfd->UnrefAndTryDelete();
|
2019-07-01 21:04:10 +00:00
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
2019-12-12 22:05:48 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::FlushMemTable:FlushMemTableFinished");
|
2017-04-06 00:14:05 +00:00
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
2018-10-26 22:06:44 +00:00
|
|
|
Status DBImpl::AtomicFlushMemTables(
|
|
|
|
const FlushOptions& flush_options, FlushReason flush_reason,
|
Fix bug of prematurely excluded CF in atomic flush contains unflushed data that should've been included in the atomic flush (#11148)
Summary:
**Context:**
Atomic flush should guarantee recoverability of all data of seqno up to the max seqno of the flush. It achieves this by ensuring all such data are flushed by the time this atomic flush finishes through `SelectColumnFamiliesForAtomicFlush()`. However, our crash test exposed the following case where an excluded CF from an atomic flush contains unflushed data of seqno less than the max seqno of that atomic flush and loses its data with `WriteOptions::DisableWAL=true` in face of a crash right after the atomic flush finishes .
```
./db_stress --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
pid=$!
sleep 0.2
sleep 10
kill $pid
sleep 0.2
./db_stress --ops_per_thread=1 --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
pid=$!
sleep 0.2
sleep 40
kill $pid
sleep 0.2
Verification failed for column family 6 key 0000000000000239000000000000012B0000000000000138 (56622): value_from_db: , value_from_expected: 4A6331754E4F4C4D42434041464744455A5B58595E5F5C5D5253505156575455, msg: Value not found: NotFound:
Crash-recovery verification failed :(
No writes or ops?
Verification failed :(
```
The bug is due to the following:
- When atomic flush is used, an empty CF is legally [excluded](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L39) in `SelectColumnFamiliesForAtomicFlush` as the first step of `DBImpl::FlushForGetLiveFiles` before [passing](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L42) the included CFDs to `AtomicFlushMemTables`.
- But [later](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2133) in `AtomicFlushMemTables`, `WaitUntilFlushWouldNotStallWrites` will [release the db mutex](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2403), during which data@seqno N can be inserted into the excluded CF and data@seqno M can be inserted into one of the included CFs, where M > N.
- However, data@seqno N in an already-excluded CF is thus excluded from this atomic flush while we seqno N is less than seqno M.
**Summary:**
- Replace `SelectColumnFamiliesForAtomicFlush()`-before-`AtomicFlushMemTables()` with `SelectColumnFamiliesForAtomicFlush()`-after-wait-within-`AtomicFlushMemTables()` so we ensure no write affecting the recoverability of this atomic job (i.e, change to max seqno of this atomic flush or insertion of data with less seqno than the max seqno of the atomic flush to excluded CF) can happen after calling `SelectColumnFamiliesForAtomicFlush()`.
- For above, refactored and clarified comments on `SelectColumnFamiliesForAtomicFlush()` and `AtomicFlushMemTables()` for clearer semantics of passed-in CFDs to atomic-flush
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11148
Test Plan:
- New unit test failed before the fix and passes after
- Make check
- Rehearsal stress test
Reviewed By: ajkr
Differential Revision: D42799871
Pulled By: hx235
fbshipit-source-id: 13636b63e9c25c5895857afc36ea580d57f6d644
2023-03-14 23:53:20 +00:00
|
|
|
const autovector<ColumnFamilyData*>& provided_candidate_cfds,
|
2022-10-04 23:43:01 +00:00
|
|
|
bool entered_write_thread) {
|
|
|
|
assert(immutable_db_options_.atomic_flush);
|
|
|
|
if (!flush_options.wait && write_controller_.IsStopped()) {
|
|
|
|
std::ostringstream oss;
|
|
|
|
oss << "Writes have been stopped, thus unable to perform manual flush. "
|
|
|
|
"Please try again later after writes are resumed";
|
|
|
|
return Status::TryAgain(oss.str());
|
|
|
|
}
|
2018-10-26 22:06:44 +00:00
|
|
|
Status s;
|
Fix bug of prematurely excluded CF in atomic flush contains unflushed data that should've been included in the atomic flush (#11148)
Summary:
**Context:**
Atomic flush should guarantee recoverability of all data of seqno up to the max seqno of the flush. It achieves this by ensuring all such data are flushed by the time this atomic flush finishes through `SelectColumnFamiliesForAtomicFlush()`. However, our crash test exposed the following case where an excluded CF from an atomic flush contains unflushed data of seqno less than the max seqno of that atomic flush and loses its data with `WriteOptions::DisableWAL=true` in face of a crash right after the atomic flush finishes .
```
./db_stress --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
pid=$!
sleep 0.2
sleep 10
kill $pid
sleep 0.2
./db_stress --ops_per_thread=1 --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
pid=$!
sleep 0.2
sleep 40
kill $pid
sleep 0.2
Verification failed for column family 6 key 0000000000000239000000000000012B0000000000000138 (56622): value_from_db: , value_from_expected: 4A6331754E4F4C4D42434041464744455A5B58595E5F5C5D5253505156575455, msg: Value not found: NotFound:
Crash-recovery verification failed :(
No writes or ops?
Verification failed :(
```
The bug is due to the following:
- When atomic flush is used, an empty CF is legally [excluded](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L39) in `SelectColumnFamiliesForAtomicFlush` as the first step of `DBImpl::FlushForGetLiveFiles` before [passing](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L42) the included CFDs to `AtomicFlushMemTables`.
- But [later](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2133) in `AtomicFlushMemTables`, `WaitUntilFlushWouldNotStallWrites` will [release the db mutex](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2403), during which data@seqno N can be inserted into the excluded CF and data@seqno M can be inserted into one of the included CFs, where M > N.
- However, data@seqno N in an already-excluded CF is thus excluded from this atomic flush while we seqno N is less than seqno M.
**Summary:**
- Replace `SelectColumnFamiliesForAtomicFlush()`-before-`AtomicFlushMemTables()` with `SelectColumnFamiliesForAtomicFlush()`-after-wait-within-`AtomicFlushMemTables()` so we ensure no write affecting the recoverability of this atomic job (i.e, change to max seqno of this atomic flush or insertion of data with less seqno than the max seqno of the atomic flush to excluded CF) can happen after calling `SelectColumnFamiliesForAtomicFlush()`.
- For above, refactored and clarified comments on `SelectColumnFamiliesForAtomicFlush()` and `AtomicFlushMemTables()` for clearer semantics of passed-in CFDs to atomic-flush
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11148
Test Plan:
- New unit test failed before the fix and passes after
- Make check
- Rehearsal stress test
Reviewed By: ajkr
Differential Revision: D42799871
Pulled By: hx235
fbshipit-source-id: 13636b63e9c25c5895857afc36ea580d57f6d644
2023-03-14 23:53:20 +00:00
|
|
|
autovector<ColumnFamilyData*> candidate_cfds;
|
|
|
|
if (provided_candidate_cfds.empty()) {
|
|
|
|
// Generate candidate cfds if not provided
|
|
|
|
{
|
|
|
|
InstrumentedMutexLock l(&mutex_);
|
|
|
|
for (ColumnFamilyData* cfd : *versions_->GetColumnFamilySet()) {
|
|
|
|
if (!cfd->IsDropped() && cfd->initialized()) {
|
|
|
|
cfd->Ref();
|
|
|
|
candidate_cfds.push_back(cfd);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
candidate_cfds = provided_candidate_cfds;
|
|
|
|
}
|
|
|
|
|
2018-10-26 22:06:44 +00:00
|
|
|
if (!flush_options.allow_write_stall) {
|
|
|
|
int num_cfs_to_flush = 0;
|
Fix bug of prematurely excluded CF in atomic flush contains unflushed data that should've been included in the atomic flush (#11148)
Summary:
**Context:**
Atomic flush should guarantee recoverability of all data of seqno up to the max seqno of the flush. It achieves this by ensuring all such data are flushed by the time this atomic flush finishes through `SelectColumnFamiliesForAtomicFlush()`. However, our crash test exposed the following case where an excluded CF from an atomic flush contains unflushed data of seqno less than the max seqno of that atomic flush and loses its data with `WriteOptions::DisableWAL=true` in face of a crash right after the atomic flush finishes .
```
./db_stress --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
pid=$!
sleep 0.2
sleep 10
kill $pid
sleep 0.2
./db_stress --ops_per_thread=1 --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
pid=$!
sleep 0.2
sleep 40
kill $pid
sleep 0.2
Verification failed for column family 6 key 0000000000000239000000000000012B0000000000000138 (56622): value_from_db: , value_from_expected: 4A6331754E4F4C4D42434041464744455A5B58595E5F5C5D5253505156575455, msg: Value not found: NotFound:
Crash-recovery verification failed :(
No writes or ops?
Verification failed :(
```
The bug is due to the following:
- When atomic flush is used, an empty CF is legally [excluded](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L39) in `SelectColumnFamiliesForAtomicFlush` as the first step of `DBImpl::FlushForGetLiveFiles` before [passing](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L42) the included CFDs to `AtomicFlushMemTables`.
- But [later](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2133) in `AtomicFlushMemTables`, `WaitUntilFlushWouldNotStallWrites` will [release the db mutex](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2403), during which data@seqno N can be inserted into the excluded CF and data@seqno M can be inserted into one of the included CFs, where M > N.
- However, data@seqno N in an already-excluded CF is thus excluded from this atomic flush while we seqno N is less than seqno M.
**Summary:**
- Replace `SelectColumnFamiliesForAtomicFlush()`-before-`AtomicFlushMemTables()` with `SelectColumnFamiliesForAtomicFlush()`-after-wait-within-`AtomicFlushMemTables()` so we ensure no write affecting the recoverability of this atomic job (i.e, change to max seqno of this atomic flush or insertion of data with less seqno than the max seqno of the atomic flush to excluded CF) can happen after calling `SelectColumnFamiliesForAtomicFlush()`.
- For above, refactored and clarified comments on `SelectColumnFamiliesForAtomicFlush()` and `AtomicFlushMemTables()` for clearer semantics of passed-in CFDs to atomic-flush
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11148
Test Plan:
- New unit test failed before the fix and passes after
- Make check
- Rehearsal stress test
Reviewed By: ajkr
Differential Revision: D42799871
Pulled By: hx235
fbshipit-source-id: 13636b63e9c25c5895857afc36ea580d57f6d644
2023-03-14 23:53:20 +00:00
|
|
|
for (auto cfd : candidate_cfds) {
|
2018-10-26 22:06:44 +00:00
|
|
|
bool flush_needed = true;
|
|
|
|
s = WaitUntilFlushWouldNotStallWrites(cfd, &flush_needed);
|
|
|
|
if (!s.ok()) {
|
Fix bug of prematurely excluded CF in atomic flush contains unflushed data that should've been included in the atomic flush (#11148)
Summary:
**Context:**
Atomic flush should guarantee recoverability of all data of seqno up to the max seqno of the flush. It achieves this by ensuring all such data are flushed by the time this atomic flush finishes through `SelectColumnFamiliesForAtomicFlush()`. However, our crash test exposed the following case where an excluded CF from an atomic flush contains unflushed data of seqno less than the max seqno of that atomic flush and loses its data with `WriteOptions::DisableWAL=true` in face of a crash right after the atomic flush finishes .
```
./db_stress --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
pid=$!
sleep 0.2
sleep 10
kill $pid
sleep 0.2
./db_stress --ops_per_thread=1 --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
pid=$!
sleep 0.2
sleep 40
kill $pid
sleep 0.2
Verification failed for column family 6 key 0000000000000239000000000000012B0000000000000138 (56622): value_from_db: , value_from_expected: 4A6331754E4F4C4D42434041464744455A5B58595E5F5C5D5253505156575455, msg: Value not found: NotFound:
Crash-recovery verification failed :(
No writes or ops?
Verification failed :(
```
The bug is due to the following:
- When atomic flush is used, an empty CF is legally [excluded](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L39) in `SelectColumnFamiliesForAtomicFlush` as the first step of `DBImpl::FlushForGetLiveFiles` before [passing](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L42) the included CFDs to `AtomicFlushMemTables`.
- But [later](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2133) in `AtomicFlushMemTables`, `WaitUntilFlushWouldNotStallWrites` will [release the db mutex](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2403), during which data@seqno N can be inserted into the excluded CF and data@seqno M can be inserted into one of the included CFs, where M > N.
- However, data@seqno N in an already-excluded CF is thus excluded from this atomic flush while we seqno N is less than seqno M.
**Summary:**
- Replace `SelectColumnFamiliesForAtomicFlush()`-before-`AtomicFlushMemTables()` with `SelectColumnFamiliesForAtomicFlush()`-after-wait-within-`AtomicFlushMemTables()` so we ensure no write affecting the recoverability of this atomic job (i.e, change to max seqno of this atomic flush or insertion of data with less seqno than the max seqno of the atomic flush to excluded CF) can happen after calling `SelectColumnFamiliesForAtomicFlush()`.
- For above, refactored and clarified comments on `SelectColumnFamiliesForAtomicFlush()` and `AtomicFlushMemTables()` for clearer semantics of passed-in CFDs to atomic-flush
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11148
Test Plan:
- New unit test failed before the fix and passes after
- Make check
- Rehearsal stress test
Reviewed By: ajkr
Differential Revision: D42799871
Pulled By: hx235
fbshipit-source-id: 13636b63e9c25c5895857afc36ea580d57f6d644
2023-03-14 23:53:20 +00:00
|
|
|
// Unref the newly generated candidate cfds (when not provided) in
|
|
|
|
// `candidate_cfds`
|
|
|
|
if (provided_candidate_cfds.empty()) {
|
|
|
|
for (auto candidate_cfd : candidate_cfds) {
|
|
|
|
candidate_cfd->UnrefAndTryDelete();
|
|
|
|
}
|
|
|
|
}
|
2018-10-26 22:06:44 +00:00
|
|
|
return s;
|
|
|
|
} else if (flush_needed) {
|
|
|
|
++num_cfs_to_flush;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (0 == num_cfs_to_flush) {
|
Fix bug of prematurely excluded CF in atomic flush contains unflushed data that should've been included in the atomic flush (#11148)
Summary:
**Context:**
Atomic flush should guarantee recoverability of all data of seqno up to the max seqno of the flush. It achieves this by ensuring all such data are flushed by the time this atomic flush finishes through `SelectColumnFamiliesForAtomicFlush()`. However, our crash test exposed the following case where an excluded CF from an atomic flush contains unflushed data of seqno less than the max seqno of that atomic flush and loses its data with `WriteOptions::DisableWAL=true` in face of a crash right after the atomic flush finishes .
```
./db_stress --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
pid=$!
sleep 0.2
sleep 10
kill $pid
sleep 0.2
./db_stress --ops_per_thread=1 --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
pid=$!
sleep 0.2
sleep 40
kill $pid
sleep 0.2
Verification failed for column family 6 key 0000000000000239000000000000012B0000000000000138 (56622): value_from_db: , value_from_expected: 4A6331754E4F4C4D42434041464744455A5B58595E5F5C5D5253505156575455, msg: Value not found: NotFound:
Crash-recovery verification failed :(
No writes or ops?
Verification failed :(
```
The bug is due to the following:
- When atomic flush is used, an empty CF is legally [excluded](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L39) in `SelectColumnFamiliesForAtomicFlush` as the first step of `DBImpl::FlushForGetLiveFiles` before [passing](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L42) the included CFDs to `AtomicFlushMemTables`.
- But [later](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2133) in `AtomicFlushMemTables`, `WaitUntilFlushWouldNotStallWrites` will [release the db mutex](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2403), during which data@seqno N can be inserted into the excluded CF and data@seqno M can be inserted into one of the included CFs, where M > N.
- However, data@seqno N in an already-excluded CF is thus excluded from this atomic flush while we seqno N is less than seqno M.
**Summary:**
- Replace `SelectColumnFamiliesForAtomicFlush()`-before-`AtomicFlushMemTables()` with `SelectColumnFamiliesForAtomicFlush()`-after-wait-within-`AtomicFlushMemTables()` so we ensure no write affecting the recoverability of this atomic job (i.e, change to max seqno of this atomic flush or insertion of data with less seqno than the max seqno of the atomic flush to excluded CF) can happen after calling `SelectColumnFamiliesForAtomicFlush()`.
- For above, refactored and clarified comments on `SelectColumnFamiliesForAtomicFlush()` and `AtomicFlushMemTables()` for clearer semantics of passed-in CFDs to atomic-flush
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11148
Test Plan:
- New unit test failed before the fix and passes after
- Make check
- Rehearsal stress test
Reviewed By: ajkr
Differential Revision: D42799871
Pulled By: hx235
fbshipit-source-id: 13636b63e9c25c5895857afc36ea580d57f6d644
2023-03-14 23:53:20 +00:00
|
|
|
// Unref the newly generated candidate cfds (when not provided) in
|
|
|
|
// `candidate_cfds`
|
|
|
|
if (provided_candidate_cfds.empty()) {
|
|
|
|
for (auto candidate_cfd : candidate_cfds) {
|
|
|
|
candidate_cfd->UnrefAndTryDelete();
|
|
|
|
}
|
|
|
|
}
|
2018-10-26 22:06:44 +00:00
|
|
|
return s;
|
|
|
|
}
|
|
|
|
}
|
2022-10-04 23:43:01 +00:00
|
|
|
const bool needs_to_join_write_thread = !entered_write_thread;
|
2018-10-26 22:06:44 +00:00
|
|
|
FlushRequest flush_req;
|
|
|
|
autovector<ColumnFamilyData*> cfds;
|
|
|
|
{
|
|
|
|
WriteContext context;
|
|
|
|
InstrumentedMutexLock guard_lock(&mutex_);
|
|
|
|
|
|
|
|
WriteThread::Writer w;
|
2019-11-15 21:59:03 +00:00
|
|
|
WriteThread::Writer nonmem_w;
|
2022-10-04 23:43:01 +00:00
|
|
|
if (needs_to_join_write_thread) {
|
2018-10-26 22:06:44 +00:00
|
|
|
write_thread_.EnterUnbatched(&w, &mutex_);
|
2019-11-15 21:59:03 +00:00
|
|
|
if (two_write_queues_) {
|
|
|
|
nonmem_write_thread_.EnterUnbatched(&nonmem_w, &mutex_);
|
|
|
|
}
|
2018-10-26 22:06:44 +00:00
|
|
|
}
|
2019-12-12 22:05:48 +00:00
|
|
|
WaitForPendingWrites();
|
2018-10-26 22:06:44 +00:00
|
|
|
|
Fix bug of prematurely excluded CF in atomic flush contains unflushed data that should've been included in the atomic flush (#11148)
Summary:
**Context:**
Atomic flush should guarantee recoverability of all data of seqno up to the max seqno of the flush. It achieves this by ensuring all such data are flushed by the time this atomic flush finishes through `SelectColumnFamiliesForAtomicFlush()`. However, our crash test exposed the following case where an excluded CF from an atomic flush contains unflushed data of seqno less than the max seqno of that atomic flush and loses its data with `WriteOptions::DisableWAL=true` in face of a crash right after the atomic flush finishes .
```
./db_stress --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
pid=$!
sleep 0.2
sleep 10
kill $pid
sleep 0.2
./db_stress --ops_per_thread=1 --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
pid=$!
sleep 0.2
sleep 40
kill $pid
sleep 0.2
Verification failed for column family 6 key 0000000000000239000000000000012B0000000000000138 (56622): value_from_db: , value_from_expected: 4A6331754E4F4C4D42434041464744455A5B58595E5F5C5D5253505156575455, msg: Value not found: NotFound:
Crash-recovery verification failed :(
No writes or ops?
Verification failed :(
```
The bug is due to the following:
- When atomic flush is used, an empty CF is legally [excluded](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L39) in `SelectColumnFamiliesForAtomicFlush` as the first step of `DBImpl::FlushForGetLiveFiles` before [passing](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L42) the included CFDs to `AtomicFlushMemTables`.
- But [later](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2133) in `AtomicFlushMemTables`, `WaitUntilFlushWouldNotStallWrites` will [release the db mutex](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2403), during which data@seqno N can be inserted into the excluded CF and data@seqno M can be inserted into one of the included CFs, where M > N.
- However, data@seqno N in an already-excluded CF is thus excluded from this atomic flush while we seqno N is less than seqno M.
**Summary:**
- Replace `SelectColumnFamiliesForAtomicFlush()`-before-`AtomicFlushMemTables()` with `SelectColumnFamiliesForAtomicFlush()`-after-wait-within-`AtomicFlushMemTables()` so we ensure no write affecting the recoverability of this atomic job (i.e, change to max seqno of this atomic flush or insertion of data with less seqno than the max seqno of the atomic flush to excluded CF) can happen after calling `SelectColumnFamiliesForAtomicFlush()`.
- For above, refactored and clarified comments on `SelectColumnFamiliesForAtomicFlush()` and `AtomicFlushMemTables()` for clearer semantics of passed-in CFDs to atomic-flush
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11148
Test Plan:
- New unit test failed before the fix and passes after
- Make check
- Rehearsal stress test
Reviewed By: ajkr
Differential Revision: D42799871
Pulled By: hx235
fbshipit-source-id: 13636b63e9c25c5895857afc36ea580d57f6d644
2023-03-14 23:53:20 +00:00
|
|
|
SelectColumnFamiliesForAtomicFlush(&cfds, candidate_cfds);
|
|
|
|
|
|
|
|
// Unref the newly generated candidate cfds (when not provided) in
|
|
|
|
// `candidate_cfds`
|
|
|
|
if (provided_candidate_cfds.empty()) {
|
|
|
|
for (auto candidate_cfd : candidate_cfds) {
|
|
|
|
candidate_cfd->UnrefAndTryDelete();
|
2018-10-26 22:06:44 +00:00
|
|
|
}
|
|
|
|
}
|
Fix bug of prematurely excluded CF in atomic flush contains unflushed data that should've been included in the atomic flush (#11148)
Summary:
**Context:**
Atomic flush should guarantee recoverability of all data of seqno up to the max seqno of the flush. It achieves this by ensuring all such data are flushed by the time this atomic flush finishes through `SelectColumnFamiliesForAtomicFlush()`. However, our crash test exposed the following case where an excluded CF from an atomic flush contains unflushed data of seqno less than the max seqno of that atomic flush and loses its data with `WriteOptions::DisableWAL=true` in face of a crash right after the atomic flush finishes .
```
./db_stress --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
pid=$!
sleep 0.2
sleep 10
kill $pid
sleep 0.2
./db_stress --ops_per_thread=1 --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
pid=$!
sleep 0.2
sleep 40
kill $pid
sleep 0.2
Verification failed for column family 6 key 0000000000000239000000000000012B0000000000000138 (56622): value_from_db: , value_from_expected: 4A6331754E4F4C4D42434041464744455A5B58595E5F5C5D5253505156575455, msg: Value not found: NotFound:
Crash-recovery verification failed :(
No writes or ops?
Verification failed :(
```
The bug is due to the following:
- When atomic flush is used, an empty CF is legally [excluded](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L39) in `SelectColumnFamiliesForAtomicFlush` as the first step of `DBImpl::FlushForGetLiveFiles` before [passing](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L42) the included CFDs to `AtomicFlushMemTables`.
- But [later](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2133) in `AtomicFlushMemTables`, `WaitUntilFlushWouldNotStallWrites` will [release the db mutex](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2403), during which data@seqno N can be inserted into the excluded CF and data@seqno M can be inserted into one of the included CFs, where M > N.
- However, data@seqno N in an already-excluded CF is thus excluded from this atomic flush while we seqno N is less than seqno M.
**Summary:**
- Replace `SelectColumnFamiliesForAtomicFlush()`-before-`AtomicFlushMemTables()` with `SelectColumnFamiliesForAtomicFlush()`-after-wait-within-`AtomicFlushMemTables()` so we ensure no write affecting the recoverability of this atomic job (i.e, change to max seqno of this atomic flush or insertion of data with less seqno than the max seqno of the atomic flush to excluded CF) can happen after calling `SelectColumnFamiliesForAtomicFlush()`.
- For above, refactored and clarified comments on `SelectColumnFamiliesForAtomicFlush()` and `AtomicFlushMemTables()` for clearer semantics of passed-in CFDs to atomic-flush
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11148
Test Plan:
- New unit test failed before the fix and passes after
- Make check
- Rehearsal stress test
Reviewed By: ajkr
Differential Revision: D42799871
Pulled By: hx235
fbshipit-source-id: 13636b63e9c25c5895857afc36ea580d57f6d644
2023-03-14 23:53:20 +00:00
|
|
|
|
2018-10-26 22:06:44 +00:00
|
|
|
for (auto cfd : cfds) {
|
2023-10-02 23:26:24 +00:00
|
|
|
if (cfd->mem()->IsEmpty() && cached_recoverable_state_empty_.load()) {
|
2018-12-19 00:43:12 +00:00
|
|
|
continue;
|
|
|
|
}
|
2018-10-26 22:06:44 +00:00
|
|
|
cfd->Ref();
|
|
|
|
s = SwitchMemtable(cfd, &context);
|
2019-12-13 03:02:51 +00:00
|
|
|
cfd->UnrefAndTryDelete();
|
2018-10-26 22:06:44 +00:00
|
|
|
if (!s.ok()) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (s.ok()) {
|
|
|
|
AssignAtomicFlushSeq(cfds);
|
|
|
|
for (auto cfd : cfds) {
|
|
|
|
cfd->imm()->FlushRequested();
|
|
|
|
}
|
2019-07-01 21:04:10 +00:00
|
|
|
// If the caller wants to wait for this flush to complete, it indicates
|
|
|
|
// that the caller expects the ColumnFamilyData not to be free'ed by
|
|
|
|
// other threads which may drop the column family concurrently.
|
|
|
|
// Therefore, we increase the cfd's ref count.
|
|
|
|
if (flush_options.wait) {
|
|
|
|
for (auto cfd : cfds) {
|
|
|
|
cfd->Ref();
|
|
|
|
}
|
|
|
|
}
|
2023-01-24 17:54:04 +00:00
|
|
|
GenerateFlushRequest(cfds, flush_reason, &flush_req);
|
|
|
|
SchedulePendingFlush(flush_req);
|
2018-10-26 22:06:44 +00:00
|
|
|
MaybeScheduleFlushOrCompaction();
|
|
|
|
}
|
|
|
|
|
2022-10-04 23:43:01 +00:00
|
|
|
if (needs_to_join_write_thread) {
|
2018-10-26 22:06:44 +00:00
|
|
|
write_thread_.ExitUnbatched(&w);
|
2019-11-15 21:59:03 +00:00
|
|
|
if (two_write_queues_) {
|
|
|
|
nonmem_write_thread_.ExitUnbatched(&nonmem_w);
|
|
|
|
}
|
2018-10-26 22:06:44 +00:00
|
|
|
}
|
|
|
|
}
|
2018-12-13 23:10:16 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::AtomicFlushMemTables:AfterScheduleFlush");
|
2019-07-01 21:04:10 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::AtomicFlushMemTables:BeforeWaitForBgFlush");
|
2018-10-26 22:06:44 +00:00
|
|
|
if (s.ok() && flush_options.wait) {
|
|
|
|
autovector<const uint64_t*> flush_memtable_ids;
|
2023-01-24 17:54:04 +00:00
|
|
|
for (auto& iter : flush_req.cfd_to_max_mem_id_to_persist) {
|
2018-10-26 22:06:44 +00:00
|
|
|
flush_memtable_ids.push_back(&(iter.second));
|
|
|
|
}
|
2020-09-18 03:22:35 +00:00
|
|
|
s = WaitForFlushMemTables(
|
|
|
|
cfds, flush_memtable_ids,
|
2023-10-02 23:26:24 +00:00
|
|
|
flush_reason == FlushReason::kErrorRecovery /* resuming_from_bg_err */);
|
2019-12-13 03:02:51 +00:00
|
|
|
InstrumentedMutexLock lock_guard(&mutex_);
|
2019-07-01 21:04:10 +00:00
|
|
|
for (auto* cfd : cfds) {
|
2019-12-13 03:02:51 +00:00
|
|
|
cfd->UnrefAndTryDelete();
|
2019-07-01 21:04:10 +00:00
|
|
|
}
|
2023-10-02 23:26:24 +00:00
|
|
|
}
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
|
|
|
Status DBImpl::RetryFlushesForErrorRecovery(FlushReason flush_reason,
|
|
|
|
bool wait) {
|
|
|
|
mutex_.AssertHeld();
|
|
|
|
assert(flush_reason == FlushReason::kErrorRecoveryRetryFlush ||
|
|
|
|
flush_reason == FlushReason::kCatchUpAfterErrorRecovery);
|
|
|
|
|
|
|
|
// Collect referenced CFDs.
|
|
|
|
autovector<ColumnFamilyData*> cfds;
|
|
|
|
for (ColumnFamilyData* cfd : *versions_->GetColumnFamilySet()) {
|
|
|
|
if (!cfd->IsDropped() && cfd->initialized() &&
|
|
|
|
cfd->imm()->NumNotFlushed() != 0) {
|
|
|
|
cfd->Ref();
|
|
|
|
cfd->imm()->FlushRequested();
|
|
|
|
cfds.push_back(cfd);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// Submit flush requests for all immutable memtables needing flush.
|
|
|
|
// `flush_memtable_ids` will be populated such that all immutable
|
|
|
|
// memtables eligible for flush are waited on before this function
|
|
|
|
// returns.
|
|
|
|
autovector<uint64_t> flush_memtable_ids;
|
|
|
|
if (immutable_db_options_.atomic_flush) {
|
|
|
|
FlushRequest flush_req;
|
|
|
|
GenerateFlushRequest(cfds, flush_reason, &flush_req);
|
|
|
|
SchedulePendingFlush(flush_req);
|
|
|
|
for (auto& iter : flush_req.cfd_to_max_mem_id_to_persist) {
|
|
|
|
flush_memtable_ids.push_back(iter.second);
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
for (auto cfd : cfds) {
|
|
|
|
flush_memtable_ids.push_back(
|
|
|
|
cfd->imm()->GetLatestMemTableID(false /* for_atomic_flush */));
|
|
|
|
// Impose no bound on the highest memtable ID flushed. There is no
|
|
|
|
// reason to do so outside of atomic flush.
|
|
|
|
FlushRequest flush_req{
|
|
|
|
flush_reason,
|
|
|
|
{{cfd,
|
|
|
|
std::numeric_limits<uint64_t>::max() /* max_mem_id_to_persist */}}};
|
|
|
|
SchedulePendingFlush(flush_req);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
MaybeScheduleFlushOrCompaction();
|
|
|
|
|
|
|
|
Status s;
|
|
|
|
if (wait) {
|
|
|
|
mutex_.Unlock();
|
|
|
|
autovector<const uint64_t*> flush_memtable_id_ptrs;
|
|
|
|
for (auto& flush_memtable_id : flush_memtable_ids) {
|
|
|
|
flush_memtable_id_ptrs.push_back(&flush_memtable_id);
|
|
|
|
}
|
|
|
|
s = WaitForFlushMemTables(cfds, flush_memtable_id_ptrs,
|
|
|
|
true /* resuming_from_bg_err */);
|
|
|
|
mutex_.Lock();
|
|
|
|
}
|
|
|
|
|
|
|
|
for (auto* cfd : cfds) {
|
|
|
|
cfd->UnrefAndTryDelete();
|
2018-10-26 22:06:44 +00:00
|
|
|
}
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
2018-08-29 18:58:13 +00:00
|
|
|
// Calling FlushMemTable(), whether from DB::Flush() or from Backup Engine, can
|
|
|
|
// cause write stall, for example if one memtable is being flushed already.
|
|
|
|
// This method tries to avoid write stall (similar to CompactRange() behavior)
|
|
|
|
// it emulates how the SuperVersion / LSM would change if flush happens, checks
|
|
|
|
// it against various constrains and delays flush if it'd cause write stall.
|
2022-06-22 22:45:21 +00:00
|
|
|
// Caller should check status and flush_needed to see if flush already happened.
|
2018-08-29 18:58:13 +00:00
|
|
|
Status DBImpl::WaitUntilFlushWouldNotStallWrites(ColumnFamilyData* cfd,
|
2019-03-27 23:13:08 +00:00
|
|
|
bool* flush_needed) {
|
2018-08-29 18:58:13 +00:00
|
|
|
{
|
|
|
|
*flush_needed = true;
|
|
|
|
InstrumentedMutexLock l(&mutex_);
|
|
|
|
uint64_t orig_active_memtable_id = cfd->mem()->GetID();
|
|
|
|
WriteStallCondition write_stall_condition = WriteStallCondition::kNormal;
|
|
|
|
do {
|
|
|
|
if (write_stall_condition != WriteStallCondition::kNormal) {
|
2018-11-01 22:23:20 +00:00
|
|
|
// Same error handling as user writes: Don't wait if there's a
|
|
|
|
// background error, even if it's a soft error. We might wait here
|
|
|
|
// indefinitely as the pending flushes/compactions may never finish
|
|
|
|
// successfully, resulting in the stall condition lasting indefinitely
|
|
|
|
if (error_handler_.IsBGWorkStopped()) {
|
|
|
|
return error_handler_.GetBGError();
|
|
|
|
}
|
|
|
|
|
2018-08-29 18:58:13 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::WaitUntilFlushWouldNotStallWrites:StallWait");
|
|
|
|
ROCKS_LOG_INFO(immutable_db_options_.info_log,
|
|
|
|
"[%s] WaitUntilFlushWouldNotStallWrites"
|
|
|
|
" waiting on stall conditions to clear",
|
|
|
|
cfd->GetName().c_str());
|
|
|
|
bg_cv_.Wait();
|
|
|
|
}
|
2019-05-20 17:37:37 +00:00
|
|
|
if (cfd->IsDropped()) {
|
|
|
|
return Status::ColumnFamilyDropped();
|
|
|
|
}
|
|
|
|
if (shutting_down_.load(std::memory_order_acquire)) {
|
2018-08-29 18:58:13 +00:00
|
|
|
return Status::ShutdownInProgress();
|
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t earliest_memtable_id =
|
|
|
|
std::min(cfd->mem()->GetID(), cfd->imm()->GetEarliestMemTableID());
|
|
|
|
if (earliest_memtable_id > orig_active_memtable_id) {
|
|
|
|
// We waited so long that the memtable we were originally waiting on was
|
|
|
|
// flushed.
|
|
|
|
*flush_needed = false;
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
|
|
|
const auto& mutable_cf_options = *cfd->GetLatestMutableCFOptions();
|
|
|
|
const auto* vstorage = cfd->current()->storage_info();
|
|
|
|
|
|
|
|
// Skip stalling check if we're below auto-flush and auto-compaction
|
|
|
|
// triggers. If it stalled in these conditions, that'd mean the stall
|
|
|
|
// triggers are so low that stalling is needed for any background work. In
|
|
|
|
// that case we shouldn't wait since background work won't be scheduled.
|
|
|
|
if (cfd->imm()->NumNotFlushed() <
|
|
|
|
cfd->ioptions()->min_write_buffer_number_to_merge &&
|
|
|
|
vstorage->l0_delay_trigger_count() <
|
|
|
|
mutable_cf_options.level0_file_num_compaction_trigger) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
// check whether one extra immutable memtable or an extra L0 file would
|
|
|
|
// cause write stalling mode to be entered. It could still enter stall
|
|
|
|
// mode due to pending compaction bytes, but that's less common
|
2023-07-26 23:25:06 +00:00
|
|
|
// No extra immutable Memtable will be created if the current Memtable is
|
|
|
|
// empty.
|
|
|
|
int mem_to_flush = cfd->mem()->IsEmpty() ? 0 : 1;
|
2021-03-09 10:19:28 +00:00
|
|
|
write_stall_condition = ColumnFamilyData::GetWriteStallConditionAndCause(
|
2023-07-26 23:25:06 +00:00
|
|
|
cfd->imm()->NumNotFlushed() + mem_to_flush,
|
2021-03-09 10:19:28 +00:00
|
|
|
vstorage->l0_delay_trigger_count() + 1,
|
|
|
|
vstorage->estimated_compaction_needed_bytes(),
|
|
|
|
mutable_cf_options, *cfd->ioptions())
|
|
|
|
.first;
|
2018-08-29 18:58:13 +00:00
|
|
|
} while (write_stall_condition != WriteStallCondition::kNormal);
|
|
|
|
}
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
2018-08-24 20:17:29 +00:00
|
|
|
// Wait for memtables to be flushed for multiple column families.
|
|
|
|
// let N = cfds.size()
|
|
|
|
// for i in [0, N),
|
|
|
|
// 1) if flush_memtable_ids[i] is not null, then the memtables with lower IDs
|
|
|
|
// have to be flushed for THIS column family;
|
|
|
|
// 2) if flush_memtable_ids[i] is null, then all memtables in THIS column
|
|
|
|
// family have to be flushed.
|
|
|
|
// Finish waiting when ALL column families finish flushing memtables.
|
2018-10-26 22:06:44 +00:00
|
|
|
// resuming_from_bg_err indicates whether the caller is trying to resume from
|
|
|
|
// background error or in normal processing.
|
2018-08-24 20:17:29 +00:00
|
|
|
Status DBImpl::WaitForFlushMemTables(
|
|
|
|
const autovector<ColumnFamilyData*>& cfds,
|
2018-10-26 22:06:44 +00:00
|
|
|
const autovector<const uint64_t*>& flush_memtable_ids,
|
|
|
|
bool resuming_from_bg_err) {
|
2018-08-24 20:17:29 +00:00
|
|
|
int num = static_cast<int>(cfds.size());
|
2017-04-06 00:14:05 +00:00
|
|
|
// Wait until the compaction completes
|
|
|
|
InstrumentedMutexLock l(&mutex_);
|
2022-01-12 21:20:46 +00:00
|
|
|
Status s;
|
2018-10-26 22:06:44 +00:00
|
|
|
// If the caller is trying to resume from bg error, then
|
|
|
|
// error_handler_.IsDBStopped() is true.
|
|
|
|
while (resuming_from_bg_err || !error_handler_.IsDBStopped()) {
|
2017-04-06 00:14:05 +00:00
|
|
|
if (shutting_down_.load(std::memory_order_acquire)) {
|
2022-01-12 21:20:46 +00:00
|
|
|
s = Status::ShutdownInProgress();
|
|
|
|
return s;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
2018-10-26 22:06:44 +00:00
|
|
|
// If an error has occurred during resumption, then no need to wait.
|
2022-01-12 21:20:46 +00:00
|
|
|
// But flush operation may fail because of this error, so need to
|
|
|
|
// return the status.
|
2018-10-26 22:06:44 +00:00
|
|
|
if (!error_handler_.GetRecoveryError().ok()) {
|
2022-01-12 21:20:46 +00:00
|
|
|
s = error_handler_.GetRecoveryError();
|
2018-10-26 22:06:44 +00:00
|
|
|
break;
|
|
|
|
}
|
2020-09-18 03:22:35 +00:00
|
|
|
// If BGWorkStopped, which indicate that there is a BG error and
|
|
|
|
// 1) soft error but requires no BG work, 2) no in auto_recovery_
|
|
|
|
if (!resuming_from_bg_err && error_handler_.IsBGWorkStopped() &&
|
|
|
|
error_handler_.GetBGError().severity() < Status::Severity::kHardError) {
|
2022-01-12 21:20:46 +00:00
|
|
|
s = error_handler_.GetBGError();
|
|
|
|
return s;
|
2020-09-18 03:22:35 +00:00
|
|
|
}
|
|
|
|
|
2018-08-24 20:17:29 +00:00
|
|
|
// Number of column families that have been dropped.
|
|
|
|
int num_dropped = 0;
|
|
|
|
// Number of column families that have finished flush.
|
|
|
|
int num_finished = 0;
|
|
|
|
for (int i = 0; i < num; ++i) {
|
|
|
|
if (cfds[i]->IsDropped()) {
|
|
|
|
++num_dropped;
|
|
|
|
} else if (cfds[i]->imm()->NumNotFlushed() == 0 ||
|
|
|
|
(flush_memtable_ids[i] != nullptr &&
|
|
|
|
cfds[i]->imm()->GetEarliestMemTableID() >
|
|
|
|
*flush_memtable_ids[i])) {
|
|
|
|
++num_finished;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (1 == num_dropped && 1 == num) {
|
2022-01-12 21:20:46 +00:00
|
|
|
s = Status::ColumnFamilyDropped();
|
|
|
|
return s;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
2018-08-24 20:17:29 +00:00
|
|
|
// Column families involved in this flush request have either been dropped
|
|
|
|
// or finished flush. Then it's time to finish waiting.
|
|
|
|
if (num_dropped + num_finished == num) {
|
|
|
|
break;
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
bg_cv_.Wait();
|
|
|
|
}
|
2018-10-26 22:06:44 +00:00
|
|
|
// If not resuming from bg error, and an error has caused the DB to stop,
|
|
|
|
// then report the bg error to caller.
|
|
|
|
if (!resuming_from_bg_err && error_handler_.IsDBStopped()) {
|
2018-06-28 19:23:57 +00:00
|
|
|
s = error_handler_.GetBGError();
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
|
|
|
Status DBImpl::EnableAutoCompaction(
|
|
|
|
const std::vector<ColumnFamilyHandle*>& column_family_handles) {
|
|
|
|
Status s;
|
|
|
|
for (auto cf_ptr : column_family_handles) {
|
|
|
|
Status status =
|
|
|
|
this->SetOptions(cf_ptr, {{"disable_auto_compactions", "false"}});
|
|
|
|
if (!status.ok()) {
|
|
|
|
s = status;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
2022-06-07 01:32:26 +00:00
|
|
|
// NOTE: Calling DisableManualCompaction() may overwrite the
|
|
|
|
// user-provided canceled variable in CompactRangeOptions
|
2019-09-17 04:00:13 +00:00
|
|
|
void DBImpl::DisableManualCompaction() {
|
2020-08-14 18:28:12 +00:00
|
|
|
InstrumentedMutexLock l(&mutex_);
|
|
|
|
manual_compaction_paused_.fetch_add(1, std::memory_order_release);
|
2021-10-07 22:22:34 +00:00
|
|
|
|
2022-06-07 01:32:26 +00:00
|
|
|
// Mark the canceled as true when the cancellation is triggered by
|
|
|
|
// manual_compaction_paused (may overwrite user-provided `canceled`)
|
|
|
|
for (const auto& manual_compaction : manual_compaction_dequeue_) {
|
|
|
|
manual_compaction->canceled = true;
|
|
|
|
}
|
|
|
|
|
2021-10-07 22:22:34 +00:00
|
|
|
// Wake up manual compactions waiting to start.
|
|
|
|
bg_cv_.SignalAll();
|
|
|
|
|
2020-08-14 18:28:12 +00:00
|
|
|
// Wait for any pending manual compactions to finish (typically through
|
|
|
|
// failing with `Status::Incomplete`) prior to returning. This way we are
|
|
|
|
// guaranteed no pending manual compaction will commit while manual
|
|
|
|
// compactions are "disabled".
|
|
|
|
while (HasPendingManualCompaction()) {
|
|
|
|
bg_cv_.Wait();
|
|
|
|
}
|
2019-09-17 04:00:13 +00:00
|
|
|
}
|
|
|
|
|
2022-06-07 01:32:26 +00:00
|
|
|
// NOTE: In contrast to DisableManualCompaction(), calling
|
|
|
|
// EnableManualCompaction() does NOT overwrite the user-provided *canceled
|
|
|
|
// variable to be false since there is NO CHANCE a canceled compaction
|
|
|
|
// is uncanceled. In other words, a canceled compaction must have been
|
|
|
|
// dropped out of the manual compaction queue, when we disable it.
|
2019-09-17 04:00:13 +00:00
|
|
|
void DBImpl::EnableManualCompaction() {
|
2020-08-14 18:28:12 +00:00
|
|
|
InstrumentedMutexLock l(&mutex_);
|
|
|
|
assert(manual_compaction_paused_ > 0);
|
|
|
|
manual_compaction_paused_.fetch_sub(1, std::memory_order_release);
|
2019-09-17 04:00:13 +00:00
|
|
|
}
|
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
void DBImpl::MaybeScheduleFlushOrCompaction() {
|
|
|
|
mutex_.AssertHeld();
|
2023-10-27 22:56:48 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::MaybeScheduleFlushOrCompaction:Start");
|
2017-04-06 00:14:05 +00:00
|
|
|
if (!opened_successfully_) {
|
|
|
|
// Compaction may introduce data race to DB open
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
if (bg_work_paused_ > 0) {
|
|
|
|
// we paused the background work
|
|
|
|
return;
|
Auto recovery from out of space errors (#4164)
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
2018-09-15 20:36:19 +00:00
|
|
|
} else if (error_handler_.IsBGWorkStopped() &&
|
2019-03-27 23:13:08 +00:00
|
|
|
!error_handler_.IsRecoveryInProgress()) {
|
Auto recovery from out of space errors (#4164)
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
2018-09-15 20:36:19 +00:00
|
|
|
// There has been a hard error and this call is not part of the recovery
|
|
|
|
// sequence. Bail out here so we don't get into an endless loop of
|
|
|
|
// scheduling BG work which will again call this function
|
2023-09-22 23:43:50 +00:00
|
|
|
//
|
|
|
|
// Note that a non-recovery flush can still be scheduled if
|
|
|
|
// error_handler_.IsRecoveryInProgress() returns true. We rely on
|
|
|
|
// BackgroundCallFlush() to check flush reason and drop non-recovery
|
|
|
|
// flushes.
|
Auto recovery from out of space errors (#4164)
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
2018-09-15 20:36:19 +00:00
|
|
|
return;
|
2017-04-06 00:14:05 +00:00
|
|
|
} else if (shutting_down_.load(std::memory_order_acquire)) {
|
|
|
|
// DB is being deleted; no more background compactions
|
|
|
|
return;
|
|
|
|
}
|
2017-05-24 18:25:38 +00:00
|
|
|
auto bg_job_limits = GetBGJobLimits();
|
2017-05-23 18:04:25 +00:00
|
|
|
bool is_flush_pool_empty =
|
2018-04-13 00:55:14 +00:00
|
|
|
env_->GetBackgroundThreads(Env::Priority::HIGH) == 0;
|
2017-05-23 18:04:25 +00:00
|
|
|
while (!is_flush_pool_empty && unscheduled_flushes_ > 0 &&
|
2017-05-24 18:25:38 +00:00
|
|
|
bg_flush_scheduled_ < bg_job_limits.max_flushes) {
|
2023-05-26 00:25:51 +00:00
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"DBImpl::MaybeScheduleFlushOrCompaction:BeforeSchedule",
|
|
|
|
&unscheduled_flushes_);
|
2017-04-06 00:14:05 +00:00
|
|
|
bg_flush_scheduled_++;
|
2019-03-20 00:24:09 +00:00
|
|
|
FlushThreadArg* fta = new FlushThreadArg;
|
|
|
|
fta->db_ = this;
|
|
|
|
fta->thread_pri_ = Env::Priority::HIGH;
|
|
|
|
env_->Schedule(&DBImpl::BGWorkFlush, fta, Env::Priority::HIGH, this,
|
|
|
|
&DBImpl::UnscheduleFlushCallback);
|
2019-11-27 22:46:38 +00:00
|
|
|
--unscheduled_flushes_;
|
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"DBImpl::MaybeScheduleFlushOrCompaction:AfterSchedule:0",
|
|
|
|
&unscheduled_flushes_);
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
2017-05-23 18:04:25 +00:00
|
|
|
// special case -- if high-pri (flush) thread pool is empty, then schedule
|
|
|
|
// flushes in low-pri (compaction) thread pool.
|
|
|
|
if (is_flush_pool_empty) {
|
2017-04-06 00:14:05 +00:00
|
|
|
while (unscheduled_flushes_ > 0 &&
|
|
|
|
bg_flush_scheduled_ + bg_compaction_scheduled_ <
|
2017-05-24 18:25:38 +00:00
|
|
|
bg_job_limits.max_flushes) {
|
2017-04-06 00:14:05 +00:00
|
|
|
bg_flush_scheduled_++;
|
2019-03-20 00:24:09 +00:00
|
|
|
FlushThreadArg* fta = new FlushThreadArg;
|
|
|
|
fta->db_ = this;
|
|
|
|
fta->thread_pri_ = Env::Priority::LOW;
|
|
|
|
env_->Schedule(&DBImpl::BGWorkFlush, fta, Env::Priority::LOW, this,
|
|
|
|
&DBImpl::UnscheduleFlushCallback);
|
2019-11-27 22:46:38 +00:00
|
|
|
--unscheduled_flushes_;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (bg_compaction_paused_ > 0) {
|
|
|
|
// we paused the background compaction
|
|
|
|
return;
|
Auto recovery from out of space errors (#4164)
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
2018-09-15 20:36:19 +00:00
|
|
|
} else if (error_handler_.IsBGWorkStopped()) {
|
|
|
|
// Compaction is not part of the recovery sequence from a hard error. We
|
|
|
|
// might get here because recovery might do a flush and install a new
|
|
|
|
// super version, which will try to schedule pending compactions. Bail
|
|
|
|
// out here and let the higher level recovery handle compactions
|
|
|
|
return;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
if (HasExclusiveManualCompaction()) {
|
|
|
|
// only manual compactions are allowed to run. don't schedule automatic
|
|
|
|
// compactions
|
2018-01-17 06:56:47 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::MaybeScheduleFlushOrCompaction:Conflict");
|
2017-04-06 00:14:05 +00:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2021-11-19 01:26:39 +00:00
|
|
|
while (bg_compaction_scheduled_ + bg_bottom_compaction_scheduled_ <
|
|
|
|
bg_job_limits.max_compactions &&
|
2017-04-06 00:14:05 +00:00
|
|
|
unscheduled_compactions_ > 0) {
|
|
|
|
CompactionArg* ca = new CompactionArg;
|
|
|
|
ca->db = this;
|
Fix possible hang issue in ~DBImpl() when flush is scheduled in LOW pool (#8125)
Summary:
In DBImpl::CloseHelper, we wait for bg_compaction_scheduled_
and bg_flush_scheduled_ to drop to 0. Unschedule is called prior
to cancel any unscheduled flushes/compactions. It is assumed that
anything in the high priority is a flush, and anything in the low
priority pool is a compaction. This assumption, however, is broken when
the high-pri pool is full.
As a result, bg_compaction_scheduled_ can go < 0 and bg_flush_scheduled_
will remain > 0 and DB can be in hang state.
The fix is, we decrement the `bg_{flush,compaction,bottom_compaction}_scheduled_`
inside the `Unschedule{Flush,Compaction,BottomCompaction}Callback()`s. DB
`mutex_` will make the counts atomic in `Unschedule`.
Related discussion: https://github.com/facebook/rocksdb/issues/7928
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8125
Test Plan: Added new test case which hangs without the fix.
Reviewed By: jay-zhuang
Differential Revision: D27390043
Pulled By: ajkr
fbshipit-source-id: 78a367fba9a59ac5607ad24bd1c46dc16d5ec110
2021-03-31 01:34:11 +00:00
|
|
|
ca->compaction_pri_ = Env::Priority::LOW;
|
2017-08-03 22:36:28 +00:00
|
|
|
ca->prepicked_compaction = nullptr;
|
2017-04-06 00:14:05 +00:00
|
|
|
bg_compaction_scheduled_++;
|
|
|
|
unscheduled_compactions_--;
|
|
|
|
env_->Schedule(&DBImpl::BGWorkCompaction, ca, Env::Priority::LOW, this,
|
2019-03-20 00:24:09 +00:00
|
|
|
&DBImpl::UnscheduleCompactionCallback);
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-05-24 18:25:38 +00:00
|
|
|
DBImpl::BGJobLimits DBImpl::GetBGJobLimits() const {
|
2017-04-06 00:14:05 +00:00
|
|
|
mutex_.AssertHeld();
|
2020-04-20 23:17:25 +00:00
|
|
|
return GetBGJobLimits(mutable_db_options_.max_background_flushes,
|
2017-05-24 18:25:38 +00:00
|
|
|
mutable_db_options_.max_background_compactions,
|
|
|
|
mutable_db_options_.max_background_jobs,
|
|
|
|
write_controller_.NeedSpeedupCompaction());
|
|
|
|
}
|
|
|
|
|
|
|
|
DBImpl::BGJobLimits DBImpl::GetBGJobLimits(int max_background_flushes,
|
|
|
|
int max_background_compactions,
|
|
|
|
int max_background_jobs,
|
|
|
|
bool parallelize_compactions) {
|
|
|
|
BGJobLimits res;
|
|
|
|
if (max_background_flushes == -1 && max_background_compactions == -1) {
|
|
|
|
// for our first stab implementing max_background_jobs, simply allocate a
|
|
|
|
// quarter of the threads to flushes.
|
|
|
|
res.max_flushes = std::max(1, max_background_jobs / 4);
|
|
|
|
res.max_compactions = std::max(1, max_background_jobs - res.max_flushes);
|
2017-04-06 00:14:05 +00:00
|
|
|
} else {
|
2017-05-24 18:25:38 +00:00
|
|
|
// compatibility code in case users haven't migrated to max_background_jobs,
|
|
|
|
// which automatically computes flush/compaction limits
|
|
|
|
res.max_flushes = std::max(1, max_background_flushes);
|
|
|
|
res.max_compactions = std::max(1, max_background_compactions);
|
|
|
|
}
|
|
|
|
if (!parallelize_compactions) {
|
|
|
|
// throttle background compactions until we deem necessary
|
|
|
|
res.max_compactions = 1;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
2017-05-24 18:25:38 +00:00
|
|
|
return res;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void DBImpl::AddToCompactionQueue(ColumnFamilyData* cfd) {
|
2018-04-27 18:11:12 +00:00
|
|
|
assert(!cfd->queued_for_compaction());
|
2017-04-06 00:14:05 +00:00
|
|
|
cfd->Ref();
|
|
|
|
compaction_queue_.push_back(cfd);
|
2018-04-27 18:11:12 +00:00
|
|
|
cfd->set_queued_for_compaction(true);
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
ColumnFamilyData* DBImpl::PopFirstFromCompactionQueue() {
|
|
|
|
assert(!compaction_queue_.empty());
|
|
|
|
auto cfd = *compaction_queue_.begin();
|
|
|
|
compaction_queue_.pop_front();
|
2018-04-27 18:11:12 +00:00
|
|
|
assert(cfd->queued_for_compaction());
|
|
|
|
cfd->set_queued_for_compaction(false);
|
2017-04-06 00:14:05 +00:00
|
|
|
return cfd;
|
|
|
|
}
|
|
|
|
|
2018-08-24 20:17:29 +00:00
|
|
|
DBImpl::FlushRequest DBImpl::PopFirstFromFlushQueue() {
|
2017-04-06 00:14:05 +00:00
|
|
|
assert(!flush_queue_.empty());
|
2023-07-19 19:52:39 +00:00
|
|
|
FlushRequest flush_req = std::move(flush_queue_.front());
|
2017-04-06 00:14:05 +00:00
|
|
|
flush_queue_.pop_front();
|
2020-12-02 17:29:50 +00:00
|
|
|
if (!immutable_db_options_.atomic_flush) {
|
2023-01-24 17:54:04 +00:00
|
|
|
assert(flush_req.cfd_to_max_mem_id_to_persist.size() == 1);
|
2020-12-02 17:29:50 +00:00
|
|
|
}
|
2023-01-24 17:54:04 +00:00
|
|
|
for (const auto& elem : flush_req.cfd_to_max_mem_id_to_persist) {
|
2020-12-02 17:29:50 +00:00
|
|
|
if (!immutable_db_options_.atomic_flush) {
|
|
|
|
ColumnFamilyData* cfd = elem.first;
|
|
|
|
assert(cfd);
|
|
|
|
assert(cfd->queued_for_flush());
|
|
|
|
cfd->set_queued_for_flush(false);
|
|
|
|
}
|
|
|
|
}
|
2018-08-24 20:17:29 +00:00
|
|
|
return flush_req;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
Concurrent task limiter for compaction thread control (#4332)
Summary:
The PR is targeting to resolve the issue of:
https://github.com/facebook/rocksdb/issues/3972#issue-330771918
We have a rocksdb created with leveled-compaction with multiple column families (CFs), some of CFs are using HDD to store big and less frequently accessed data and others are using SSD.
When there are continuously write traffics going on to all CFs, the compaction thread pool is mostly occupied by those slow HDD compactions, which blocks fully utilize SSD bandwidth.
Since atomic write and transaction is needed across CFs, so splitting it to multiple rocksdb instance is not an option for us.
With the compaction thread control, we got 30%+ HDD write throughput gain, and also a lot smooth SSD write since less write stall happening.
ConcurrentTaskLimiter can be shared with multi-CFs across rocksdb instances, so the feature does not only work for multi-CFs scenarios, but also for multi-rocksdbs scenarios, who need disk IO resource control per tenant.
The usage is straight forward:
e.g.:
//
// Enable compaction thread limiter thru ColumnFamilyOptions
//
std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = ctl;
...
//
// Compaction thread limiter can be tuned or disabled on-the-fly
//
ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks
...
ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter
ctl->SetMaxOutstandingTask(-1); // Same as above
...
ctl->SetMaxOutstandingTask(0); // full throttle (0 task)
//
// Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue)
//
std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8));
std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt_ssd1(options);
ColumnFamilyOptions cf_opt_ssd2(options);
ColumnFamilyOptions cf_opt_hdd1(options);
ColumnFamilyOptions cf_opt_hdd2(options);
ColumnFamilyOptions cf_opt_hdd3(options);
// SSD CFs
cf_opt_ssd1.compaction_thread_limiter = ctl_ssd;
cf_opt_ssd2.compaction_thread_limiter = ctl_ssd;
// HDD CFs
cf_opt_hdd1.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd2.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd3.compaction_thread_limiter = ctl_hdd;
...
//
// The limiter is disabled by default (or set to nullptr explicitly)
//
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = nullptr;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332
Differential Revision: D13226590
Pulled By: siying
fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d
2018-12-13 21:16:04 +00:00
|
|
|
ColumnFamilyData* DBImpl::PickCompactionFromQueue(
|
|
|
|
std::unique_ptr<TaskLimiterToken>* token, LogBuffer* log_buffer) {
|
|
|
|
assert(!compaction_queue_.empty());
|
|
|
|
assert(*token == nullptr);
|
|
|
|
autovector<ColumnFamilyData*> throttled_candidates;
|
|
|
|
ColumnFamilyData* cfd = nullptr;
|
|
|
|
while (!compaction_queue_.empty()) {
|
|
|
|
auto first_cfd = *compaction_queue_.begin();
|
|
|
|
compaction_queue_.pop_front();
|
|
|
|
assert(first_cfd->queued_for_compaction());
|
|
|
|
if (!RequestCompactionToken(first_cfd, false, token, log_buffer)) {
|
|
|
|
throttled_candidates.push_back(first_cfd);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
cfd = first_cfd;
|
|
|
|
cfd->set_queued_for_compaction(false);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
// Add throttled compaction candidates back to queue in the original order.
|
2019-01-02 17:56:39 +00:00
|
|
|
for (auto iter = throttled_candidates.rbegin();
|
|
|
|
iter != throttled_candidates.rend(); ++iter) {
|
Concurrent task limiter for compaction thread control (#4332)
Summary:
The PR is targeting to resolve the issue of:
https://github.com/facebook/rocksdb/issues/3972#issue-330771918
We have a rocksdb created with leveled-compaction with multiple column families (CFs), some of CFs are using HDD to store big and less frequently accessed data and others are using SSD.
When there are continuously write traffics going on to all CFs, the compaction thread pool is mostly occupied by those slow HDD compactions, which blocks fully utilize SSD bandwidth.
Since atomic write and transaction is needed across CFs, so splitting it to multiple rocksdb instance is not an option for us.
With the compaction thread control, we got 30%+ HDD write throughput gain, and also a lot smooth SSD write since less write stall happening.
ConcurrentTaskLimiter can be shared with multi-CFs across rocksdb instances, so the feature does not only work for multi-CFs scenarios, but also for multi-rocksdbs scenarios, who need disk IO resource control per tenant.
The usage is straight forward:
e.g.:
//
// Enable compaction thread limiter thru ColumnFamilyOptions
//
std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = ctl;
...
//
// Compaction thread limiter can be tuned or disabled on-the-fly
//
ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks
...
ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter
ctl->SetMaxOutstandingTask(-1); // Same as above
...
ctl->SetMaxOutstandingTask(0); // full throttle (0 task)
//
// Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue)
//
std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8));
std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt_ssd1(options);
ColumnFamilyOptions cf_opt_ssd2(options);
ColumnFamilyOptions cf_opt_hdd1(options);
ColumnFamilyOptions cf_opt_hdd2(options);
ColumnFamilyOptions cf_opt_hdd3(options);
// SSD CFs
cf_opt_ssd1.compaction_thread_limiter = ctl_ssd;
cf_opt_ssd2.compaction_thread_limiter = ctl_ssd;
// HDD CFs
cf_opt_hdd1.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd2.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd3.compaction_thread_limiter = ctl_hdd;
...
//
// The limiter is disabled by default (or set to nullptr explicitly)
//
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = nullptr;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332
Differential Revision: D13226590
Pulled By: siying
fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d
2018-12-13 21:16:04 +00:00
|
|
|
compaction_queue_.push_front(*iter);
|
|
|
|
}
|
|
|
|
return cfd;
|
|
|
|
}
|
|
|
|
|
2023-01-24 17:54:04 +00:00
|
|
|
void DBImpl::SchedulePendingFlush(const FlushRequest& flush_req) {
|
2020-12-02 17:29:50 +00:00
|
|
|
mutex_.AssertHeld();
|
2023-08-11 19:30:48 +00:00
|
|
|
if (reject_new_background_jobs_) {
|
|
|
|
return;
|
|
|
|
}
|
2023-01-24 17:54:04 +00:00
|
|
|
if (flush_req.cfd_to_max_mem_id_to_persist.empty()) {
|
2018-08-24 20:17:29 +00:00
|
|
|
return;
|
|
|
|
}
|
2020-12-02 17:29:50 +00:00
|
|
|
if (!immutable_db_options_.atomic_flush) {
|
|
|
|
// For the non-atomic flush case, we never schedule multiple column
|
|
|
|
// families in the same flush request.
|
2023-01-24 17:54:04 +00:00
|
|
|
assert(flush_req.cfd_to_max_mem_id_to_persist.size() == 1);
|
|
|
|
ColumnFamilyData* cfd =
|
|
|
|
flush_req.cfd_to_max_mem_id_to_persist.begin()->first;
|
2020-12-02 17:29:50 +00:00
|
|
|
assert(cfd);
|
2022-06-23 16:42:18 +00:00
|
|
|
|
2020-12-02 17:29:50 +00:00
|
|
|
if (!cfd->queued_for_flush() && cfd->imm()->IsFlushPending()) {
|
|
|
|
cfd->Ref();
|
|
|
|
cfd->set_queued_for_flush(true);
|
|
|
|
++unscheduled_flushes_;
|
|
|
|
flush_queue_.push_back(flush_req);
|
|
|
|
}
|
|
|
|
} else {
|
2023-01-24 17:54:04 +00:00
|
|
|
for (auto& iter : flush_req.cfd_to_max_mem_id_to_persist) {
|
2020-12-02 17:29:50 +00:00
|
|
|
ColumnFamilyData* cfd = iter.first;
|
|
|
|
cfd->Ref();
|
|
|
|
}
|
|
|
|
++unscheduled_flushes_;
|
|
|
|
flush_queue_.push_back(flush_req);
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void DBImpl::SchedulePendingCompaction(ColumnFamilyData* cfd) {
|
2020-12-02 17:29:50 +00:00
|
|
|
mutex_.AssertHeld();
|
2023-08-11 19:30:48 +00:00
|
|
|
if (reject_new_background_jobs_) {
|
|
|
|
return;
|
|
|
|
}
|
2018-04-27 18:11:12 +00:00
|
|
|
if (!cfd->queued_for_compaction() && cfd->NeedsCompaction()) {
|
2017-04-06 00:14:05 +00:00
|
|
|
AddToCompactionQueue(cfd);
|
|
|
|
++unscheduled_compactions_;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-04-26 20:51:39 +00:00
|
|
|
void DBImpl::SchedulePendingPurge(std::string fname, std::string dir_to_sync,
|
|
|
|
FileType type, uint64_t number, int job_id) {
|
2017-04-06 00:14:05 +00:00
|
|
|
mutex_.AssertHeld();
|
2023-08-11 19:30:48 +00:00
|
|
|
if (reject_new_background_jobs_) {
|
|
|
|
return;
|
|
|
|
}
|
2018-04-26 20:51:39 +00:00
|
|
|
PurgeFileInfo file_info(fname, dir_to_sync, type, number, job_id);
|
2019-09-17 23:43:07 +00:00
|
|
|
purge_files_.insert({{number, std::move(file_info)}});
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
2019-03-20 00:24:09 +00:00
|
|
|
void DBImpl::BGWorkFlush(void* arg) {
|
Prefer static_cast in place of most reinterpret_cast (#12308)
Summary:
The following are risks associated with pointer-to-pointer reinterpret_cast:
* Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do.
* Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally.
I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement:
* Going through `void*` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have
`struct Derived: public Base1, public Base2`. If we have `Derived*` -> `void*` -> `Base2*` -> `Derived*` through reinterpret casts, this could plausibly work (though technical UB) assuming the `Base2*` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic.
* Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived*` -> `Base2*` -> `Derived*` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance.
With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain.
A couple of related interventions included here:
* Previously Cache::Handle was not actually derived from in the implementations and just used as a `void*` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle.
* Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse).
Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work.
I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308
Test Plan: existing tests, CI
Reviewed By: ltamasi
Differential Revision: D53204947
Pulled By: pdillinger
fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2
2024-02-07 18:44:11 +00:00
|
|
|
FlushThreadArg fta = *(static_cast<FlushThreadArg*>(arg));
|
|
|
|
delete static_cast<FlushThreadArg*>(arg);
|
2019-03-20 00:24:09 +00:00
|
|
|
|
|
|
|
IOSTATS_SET_THREAD_POOL_ID(fta.thread_pri_);
|
2017-04-06 00:14:05 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::BGWorkFlush");
|
2020-04-29 20:06:27 +00:00
|
|
|
static_cast_with_check<DBImpl>(fta.db_)->BackgroundCallFlush(fta.thread_pri_);
|
2017-04-06 00:14:05 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::BGWorkFlush:done");
|
|
|
|
}
|
|
|
|
|
|
|
|
void DBImpl::BGWorkCompaction(void* arg) {
|
Prefer static_cast in place of most reinterpret_cast (#12308)
Summary:
The following are risks associated with pointer-to-pointer reinterpret_cast:
* Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do.
* Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally.
I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement:
* Going through `void*` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have
`struct Derived: public Base1, public Base2`. If we have `Derived*` -> `void*` -> `Base2*` -> `Derived*` through reinterpret casts, this could plausibly work (though technical UB) assuming the `Base2*` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic.
* Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived*` -> `Base2*` -> `Derived*` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance.
With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain.
A couple of related interventions included here:
* Previously Cache::Handle was not actually derived from in the implementations and just used as a `void*` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle.
* Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse).
Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work.
I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308
Test Plan: existing tests, CI
Reviewed By: ltamasi
Differential Revision: D53204947
Pulled By: pdillinger
fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2
2024-02-07 18:44:11 +00:00
|
|
|
CompactionArg ca = *(static_cast<CompactionArg*>(arg));
|
|
|
|
delete static_cast<CompactionArg*>(arg);
|
2017-04-06 00:14:05 +00:00
|
|
|
IOSTATS_SET_THREAD_POOL_ID(Env::Priority::LOW);
|
|
|
|
TEST_SYNC_POINT("DBImpl::BGWorkCompaction");
|
2017-08-03 22:36:28 +00:00
|
|
|
auto prepicked_compaction =
|
|
|
|
static_cast<PrepickedCompaction*>(ca.prepicked_compaction);
|
2020-04-29 20:06:27 +00:00
|
|
|
static_cast_with_check<DBImpl>(ca.db)->BackgroundCallCompaction(
|
2017-08-03 22:36:28 +00:00
|
|
|
prepicked_compaction, Env::Priority::LOW);
|
|
|
|
delete prepicked_compaction;
|
|
|
|
}
|
|
|
|
|
|
|
|
void DBImpl::BGWorkBottomCompaction(void* arg) {
|
|
|
|
CompactionArg ca = *(static_cast<CompactionArg*>(arg));
|
|
|
|
delete static_cast<CompactionArg*>(arg);
|
|
|
|
IOSTATS_SET_THREAD_POOL_ID(Env::Priority::BOTTOM);
|
|
|
|
TEST_SYNC_POINT("DBImpl::BGWorkBottomCompaction");
|
|
|
|
auto* prepicked_compaction = ca.prepicked_compaction;
|
2021-11-19 01:26:39 +00:00
|
|
|
assert(prepicked_compaction && prepicked_compaction->compaction);
|
2017-08-03 22:36:28 +00:00
|
|
|
ca.db->BackgroundCallCompaction(prepicked_compaction, Env::Priority::BOTTOM);
|
|
|
|
delete prepicked_compaction;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void DBImpl::BGWorkPurge(void* db) {
|
|
|
|
IOSTATS_SET_THREAD_POOL_ID(Env::Priority::HIGH);
|
|
|
|
TEST_SYNC_POINT("DBImpl::BGWorkPurge:start");
|
Prefer static_cast in place of most reinterpret_cast (#12308)
Summary:
The following are risks associated with pointer-to-pointer reinterpret_cast:
* Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do.
* Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally.
I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement:
* Going through `void*` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have
`struct Derived: public Base1, public Base2`. If we have `Derived*` -> `void*` -> `Base2*` -> `Derived*` through reinterpret casts, this could plausibly work (though technical UB) assuming the `Base2*` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic.
* Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived*` -> `Base2*` -> `Derived*` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance.
With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain.
A couple of related interventions included here:
* Previously Cache::Handle was not actually derived from in the implementations and just used as a `void*` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle.
* Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse).
Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work.
I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308
Test Plan: existing tests, CI
Reviewed By: ltamasi
Differential Revision: D53204947
Pulled By: pdillinger
fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2
2024-02-07 18:44:11 +00:00
|
|
|
static_cast<DBImpl*>(db)->BackgroundCallPurge();
|
2017-04-06 00:14:05 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::BGWorkPurge:end");
|
|
|
|
}
|
|
|
|
|
2019-03-20 00:24:09 +00:00
|
|
|
void DBImpl::UnscheduleCompactionCallback(void* arg) {
|
Prefer static_cast in place of most reinterpret_cast (#12308)
Summary:
The following are risks associated with pointer-to-pointer reinterpret_cast:
* Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do.
* Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally.
I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement:
* Going through `void*` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have
`struct Derived: public Base1, public Base2`. If we have `Derived*` -> `void*` -> `Base2*` -> `Derived*` through reinterpret casts, this could plausibly work (though technical UB) assuming the `Base2*` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic.
* Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived*` -> `Base2*` -> `Derived*` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance.
With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain.
A couple of related interventions included here:
* Previously Cache::Handle was not actually derived from in the implementations and just used as a `void*` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle.
* Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse).
Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work.
I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308
Test Plan: existing tests, CI
Reviewed By: ltamasi
Differential Revision: D53204947
Pulled By: pdillinger
fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2
2024-02-07 18:44:11 +00:00
|
|
|
CompactionArg* ca_ptr = static_cast<CompactionArg*>(arg);
|
Fix possible hang issue in ~DBImpl() when flush is scheduled in LOW pool (#8125)
Summary:
In DBImpl::CloseHelper, we wait for bg_compaction_scheduled_
and bg_flush_scheduled_ to drop to 0. Unschedule is called prior
to cancel any unscheduled flushes/compactions. It is assumed that
anything in the high priority is a flush, and anything in the low
priority pool is a compaction. This assumption, however, is broken when
the high-pri pool is full.
As a result, bg_compaction_scheduled_ can go < 0 and bg_flush_scheduled_
will remain > 0 and DB can be in hang state.
The fix is, we decrement the `bg_{flush,compaction,bottom_compaction}_scheduled_`
inside the `Unschedule{Flush,Compaction,BottomCompaction}Callback()`s. DB
`mutex_` will make the counts atomic in `Unschedule`.
Related discussion: https://github.com/facebook/rocksdb/issues/7928
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8125
Test Plan: Added new test case which hangs without the fix.
Reviewed By: jay-zhuang
Differential Revision: D27390043
Pulled By: ajkr
fbshipit-source-id: 78a367fba9a59ac5607ad24bd1c46dc16d5ec110
2021-03-31 01:34:11 +00:00
|
|
|
Env::Priority compaction_pri = ca_ptr->compaction_pri_;
|
|
|
|
if (Env::Priority::BOTTOM == compaction_pri) {
|
|
|
|
// Decrement bg_bottom_compaction_scheduled_ if priority is BOTTOM
|
|
|
|
ca_ptr->db->bg_bottom_compaction_scheduled_--;
|
|
|
|
} else if (Env::Priority::LOW == compaction_pri) {
|
|
|
|
// Decrement bg_compaction_scheduled_ if priority is LOW
|
|
|
|
ca_ptr->db->bg_compaction_scheduled_--;
|
|
|
|
}
|
|
|
|
CompactionArg ca = *(ca_ptr);
|
Prefer static_cast in place of most reinterpret_cast (#12308)
Summary:
The following are risks associated with pointer-to-pointer reinterpret_cast:
* Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do.
* Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally.
I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement:
* Going through `void*` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have
`struct Derived: public Base1, public Base2`. If we have `Derived*` -> `void*` -> `Base2*` -> `Derived*` through reinterpret casts, this could plausibly work (though technical UB) assuming the `Base2*` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic.
* Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived*` -> `Base2*` -> `Derived*` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance.
With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain.
A couple of related interventions included here:
* Previously Cache::Handle was not actually derived from in the implementations and just used as a `void*` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle.
* Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse).
Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work.
I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308
Test Plan: existing tests, CI
Reviewed By: ltamasi
Differential Revision: D53204947
Pulled By: pdillinger
fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2
2024-02-07 18:44:11 +00:00
|
|
|
delete static_cast<CompactionArg*>(arg);
|
2017-08-03 22:36:28 +00:00
|
|
|
if (ca.prepicked_compaction != nullptr) {
|
2022-03-15 19:31:14 +00:00
|
|
|
// if it's a manual compaction, set status to ManualCompactionPaused
|
|
|
|
if (ca.prepicked_compaction->manual_compaction_state) {
|
|
|
|
ca.prepicked_compaction->manual_compaction_state->done = true;
|
|
|
|
ca.prepicked_compaction->manual_compaction_state->status =
|
|
|
|
Status::Incomplete(Status::SubCode::kManualCompactionPaused);
|
|
|
|
}
|
2017-08-03 22:36:28 +00:00
|
|
|
if (ca.prepicked_compaction->compaction != nullptr) {
|
2022-03-02 21:43:00 +00:00
|
|
|
ca.prepicked_compaction->compaction->ReleaseCompactionFiles(
|
|
|
|
Status::Incomplete(Status::SubCode::kManualCompactionPaused));
|
2017-08-03 22:36:28 +00:00
|
|
|
delete ca.prepicked_compaction->compaction;
|
|
|
|
}
|
|
|
|
delete ca.prepicked_compaction;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
2019-03-20 00:24:09 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::UnscheduleCompactionCallback");
|
|
|
|
}
|
|
|
|
|
|
|
|
void DBImpl::UnscheduleFlushCallback(void* arg) {
|
Fix possible hang issue in ~DBImpl() when flush is scheduled in LOW pool (#8125)
Summary:
In DBImpl::CloseHelper, we wait for bg_compaction_scheduled_
and bg_flush_scheduled_ to drop to 0. Unschedule is called prior
to cancel any unscheduled flushes/compactions. It is assumed that
anything in the high priority is a flush, and anything in the low
priority pool is a compaction. This assumption, however, is broken when
the high-pri pool is full.
As a result, bg_compaction_scheduled_ can go < 0 and bg_flush_scheduled_
will remain > 0 and DB can be in hang state.
The fix is, we decrement the `bg_{flush,compaction,bottom_compaction}_scheduled_`
inside the `Unschedule{Flush,Compaction,BottomCompaction}Callback()`s. DB
`mutex_` will make the counts atomic in `Unschedule`.
Related discussion: https://github.com/facebook/rocksdb/issues/7928
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8125
Test Plan: Added new test case which hangs without the fix.
Reviewed By: jay-zhuang
Differential Revision: D27390043
Pulled By: ajkr
fbshipit-source-id: 78a367fba9a59ac5607ad24bd1c46dc16d5ec110
2021-03-31 01:34:11 +00:00
|
|
|
// Decrement bg_flush_scheduled_ in flush callback
|
Prefer static_cast in place of most reinterpret_cast (#12308)
Summary:
The following are risks associated with pointer-to-pointer reinterpret_cast:
* Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do.
* Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally.
I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement:
* Going through `void*` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have
`struct Derived: public Base1, public Base2`. If we have `Derived*` -> `void*` -> `Base2*` -> `Derived*` through reinterpret casts, this could plausibly work (though technical UB) assuming the `Base2*` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic.
* Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived*` -> `Base2*` -> `Derived*` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance.
With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain.
A couple of related interventions included here:
* Previously Cache::Handle was not actually derived from in the implementations and just used as a `void*` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle.
* Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse).
Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work.
I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308
Test Plan: existing tests, CI
Reviewed By: ltamasi
Differential Revision: D53204947
Pulled By: pdillinger
fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2
2024-02-07 18:44:11 +00:00
|
|
|
static_cast<FlushThreadArg*>(arg)->db_->bg_flush_scheduled_--;
|
|
|
|
Env::Priority flush_pri = static_cast<FlushThreadArg*>(arg)->thread_pri_;
|
Fix possible hang issue in ~DBImpl() when flush is scheduled in LOW pool (#8125)
Summary:
In DBImpl::CloseHelper, we wait for bg_compaction_scheduled_
and bg_flush_scheduled_ to drop to 0. Unschedule is called prior
to cancel any unscheduled flushes/compactions. It is assumed that
anything in the high priority is a flush, and anything in the low
priority pool is a compaction. This assumption, however, is broken when
the high-pri pool is full.
As a result, bg_compaction_scheduled_ can go < 0 and bg_flush_scheduled_
will remain > 0 and DB can be in hang state.
The fix is, we decrement the `bg_{flush,compaction,bottom_compaction}_scheduled_`
inside the `Unschedule{Flush,Compaction,BottomCompaction}Callback()`s. DB
`mutex_` will make the counts atomic in `Unschedule`.
Related discussion: https://github.com/facebook/rocksdb/issues/7928
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8125
Test Plan: Added new test case which hangs without the fix.
Reviewed By: jay-zhuang
Differential Revision: D27390043
Pulled By: ajkr
fbshipit-source-id: 78a367fba9a59ac5607ad24bd1c46dc16d5ec110
2021-03-31 01:34:11 +00:00
|
|
|
if (Env::Priority::LOW == flush_pri) {
|
|
|
|
TEST_SYNC_POINT("DBImpl::UnscheduleLowFlushCallback");
|
|
|
|
} else if (Env::Priority::HIGH == flush_pri) {
|
|
|
|
TEST_SYNC_POINT("DBImpl::UnscheduleHighFlushCallback");
|
|
|
|
}
|
Prefer static_cast in place of most reinterpret_cast (#12308)
Summary:
The following are risks associated with pointer-to-pointer reinterpret_cast:
* Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do.
* Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally.
I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement:
* Going through `void*` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have
`struct Derived: public Base1, public Base2`. If we have `Derived*` -> `void*` -> `Base2*` -> `Derived*` through reinterpret casts, this could plausibly work (though technical UB) assuming the `Base2*` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic.
* Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived*` -> `Base2*` -> `Derived*` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance.
With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain.
A couple of related interventions included here:
* Previously Cache::Handle was not actually derived from in the implementations and just used as a `void*` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle.
* Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse).
Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work.
I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308
Test Plan: existing tests, CI
Reviewed By: ltamasi
Differential Revision: D53204947
Pulled By: pdillinger
fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2
2024-02-07 18:44:11 +00:00
|
|
|
delete static_cast<FlushThreadArg*>(arg);
|
2019-03-20 00:24:09 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::UnscheduleFlushCallback");
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
Status DBImpl::BackgroundFlush(bool* made_progress, JobContext* job_context,
|
2019-03-20 00:24:09 +00:00
|
|
|
LogBuffer* log_buffer, FlushReason* reason,
|
2023-07-26 23:25:06 +00:00
|
|
|
bool* flush_rescheduled_to_retain_udt,
|
2019-03-20 00:24:09 +00:00
|
|
|
Env::Priority thread_pri) {
|
2017-04-06 00:14:05 +00:00
|
|
|
mutex_.AssertHeld();
|
|
|
|
|
2018-06-28 19:23:57 +00:00
|
|
|
Status status;
|
Auto recovery from out of space errors (#4164)
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
2018-09-15 20:36:19 +00:00
|
|
|
*reason = FlushReason::kOthers;
|
|
|
|
// If BG work is stopped due to an error, but a recovery is in progress,
|
|
|
|
// that means this flush is part of the recovery. So allow it to go through
|
2018-06-28 19:23:57 +00:00
|
|
|
if (!error_handler_.IsBGWorkStopped()) {
|
|
|
|
if (shutting_down_.load(std::memory_order_acquire)) {
|
|
|
|
status = Status::ShutdownInProgress();
|
|
|
|
}
|
Auto recovery from out of space errors (#4164)
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
2018-09-15 20:36:19 +00:00
|
|
|
} else if (!error_handler_.IsRecoveryInProgress()) {
|
2018-06-28 19:23:57 +00:00
|
|
|
status = error_handler_.GetBGError();
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
if (!status.ok()) {
|
|
|
|
return status;
|
|
|
|
}
|
|
|
|
|
2018-08-24 20:17:29 +00:00
|
|
|
autovector<BGFlushArg> bg_flush_args;
|
|
|
|
std::vector<SuperVersionContext>& superversion_contexts =
|
|
|
|
job_context->superversion_contexts;
|
2019-05-11 00:53:41 +00:00
|
|
|
autovector<ColumnFamilyData*> column_families_not_to_flush;
|
2017-04-06 00:14:05 +00:00
|
|
|
while (!flush_queue_.empty()) {
|
|
|
|
// This cfd is already referenced
|
2023-07-26 23:25:06 +00:00
|
|
|
FlushRequest flush_req = PopFirstFromFlushQueue();
|
|
|
|
FlushReason flush_reason = flush_req.flush_reason;
|
2023-09-22 23:43:50 +00:00
|
|
|
if (!error_handler_.GetBGError().ok() && error_handler_.IsBGWorkStopped() &&
|
|
|
|
flush_reason != FlushReason::kErrorRecovery &&
|
|
|
|
flush_reason != FlushReason::kErrorRecoveryRetryFlush) {
|
|
|
|
// Stop non-recovery flush when bg work is stopped
|
|
|
|
// Note that we drop the flush request here.
|
|
|
|
// Recovery thread should schedule further flushes after bg error
|
|
|
|
// is cleared.
|
|
|
|
status = error_handler_.GetBGError();
|
|
|
|
assert(!status.ok());
|
|
|
|
ROCKS_LOG_BUFFER(log_buffer,
|
|
|
|
"[JOB %d] Abort flush due to background error %s",
|
|
|
|
job_context->job_id, status.ToString().c_str());
|
|
|
|
*reason = flush_reason;
|
|
|
|
for (auto item : flush_req.cfd_to_max_mem_id_to_persist) {
|
|
|
|
item.first->UnrefAndTryDelete();
|
|
|
|
}
|
|
|
|
return status;
|
|
|
|
}
|
2023-07-26 23:25:06 +00:00
|
|
|
if (!immutable_db_options_.atomic_flush &&
|
|
|
|
ShouldRescheduleFlushRequestToRetainUDT(flush_req)) {
|
|
|
|
assert(flush_req.cfd_to_max_mem_id_to_persist.size() == 1);
|
|
|
|
ColumnFamilyData* cfd =
|
|
|
|
flush_req.cfd_to_max_mem_id_to_persist.begin()->first;
|
|
|
|
if (cfd->UnrefAndTryDelete()) {
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
ROCKS_LOG_BUFFER(log_buffer,
|
|
|
|
"FlushRequest for column family %s is re-scheduled to "
|
|
|
|
"retain user-defined timestamps.",
|
|
|
|
cfd->GetName().c_str());
|
|
|
|
// Reschedule the `FlushRequest` as is without checking dropped column
|
|
|
|
// family etc. The follow-up job will do the check anyways, so save the
|
|
|
|
// duplication. Column family is deduplicated by `SchdulePendingFlush` and
|
|
|
|
// `PopFirstFromFlushQueue` contains at flush request enqueueing and
|
|
|
|
// dequeueing time.
|
|
|
|
// This flush request is rescheduled right after it's popped from the
|
|
|
|
// queue while the db mutex is held, so there should be no other
|
|
|
|
// FlushRequest for the same column family with higher `max_memtable_id`
|
|
|
|
// in the queue to block the reschedule from succeeding.
|
|
|
|
#ifndef NDEBUG
|
|
|
|
flush_req.reschedule_count += 1;
|
|
|
|
#endif /* !NDEBUG */
|
|
|
|
SchedulePendingFlush(flush_req);
|
|
|
|
*reason = flush_reason;
|
|
|
|
*flush_rescheduled_to_retain_udt = true;
|
|
|
|
return Status::TryAgain();
|
|
|
|
}
|
2018-08-24 20:17:29 +00:00
|
|
|
superversion_contexts.clear();
|
2023-07-26 23:25:06 +00:00
|
|
|
superversion_contexts.reserve(
|
|
|
|
flush_req.cfd_to_max_mem_id_to_persist.size());
|
2018-08-24 20:17:29 +00:00
|
|
|
|
2023-07-26 23:25:06 +00:00
|
|
|
for (const auto& [cfd, max_memtable_id] :
|
|
|
|
flush_req.cfd_to_max_mem_id_to_persist) {
|
2022-06-23 16:42:18 +00:00
|
|
|
if (cfd->GetMempurgeUsed()) {
|
|
|
|
// If imm() contains silent memtables (e.g.: because
|
|
|
|
// MemPurge was activated), requesting a flush will
|
|
|
|
// mark the imm_needed as true.
|
2021-07-16 00:48:17 +00:00
|
|
|
cfd->imm()->FlushRequested();
|
|
|
|
}
|
2022-06-23 16:42:18 +00:00
|
|
|
|
2018-08-24 20:17:29 +00:00
|
|
|
if (cfd->IsDropped() || !cfd->imm()->IsFlushPending()) {
|
|
|
|
// can't flush this CF, try next one
|
2019-05-11 00:53:41 +00:00
|
|
|
column_families_not_to_flush.push_back(cfd);
|
2018-08-24 20:17:29 +00:00
|
|
|
continue;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
2024-03-04 18:08:32 +00:00
|
|
|
superversion_contexts.emplace_back(true);
|
2023-07-19 19:52:39 +00:00
|
|
|
bg_flush_args.emplace_back(cfd, max_memtable_id,
|
2023-01-24 17:54:04 +00:00
|
|
|
&(superversion_contexts.back()), flush_reason);
|
2018-08-24 20:17:29 +00:00
|
|
|
}
|
2023-07-26 23:25:06 +00:00
|
|
|
// `MaybeScheduleFlushOrCompaction` schedules as many `BackgroundCallFlush`
|
|
|
|
// jobs as the number of `FlushRequest` in the `flush_queue_`, a.k.a
|
|
|
|
// `unscheduled_flushes_`. So it's sufficient to make each `BackgroundFlush`
|
|
|
|
// handle one `FlushRequest` and each have a Status returned.
|
|
|
|
if (!bg_flush_args.empty() || !column_families_not_to_flush.empty()) {
|
|
|
|
TEST_SYNC_POINT_CALLBACK("DBImpl::BackgroundFlush:CheckFlushRequest:cb",
|
|
|
|
const_cast<int*>(&flush_req.reschedule_count));
|
2018-08-24 20:17:29 +00:00
|
|
|
break;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-08-24 20:17:29 +00:00
|
|
|
if (!bg_flush_args.empty()) {
|
2017-05-24 18:25:38 +00:00
|
|
|
auto bg_job_limits = GetBGJobLimits();
|
2018-08-24 20:17:29 +00:00
|
|
|
for (const auto& arg : bg_flush_args) {
|
|
|
|
ColumnFamilyData* cfd = arg.cfd_;
|
|
|
|
ROCKS_LOG_BUFFER(
|
|
|
|
log_buffer,
|
|
|
|
"Calling FlushMemTableToOutputFile with column "
|
|
|
|
"family [%s], flush slots available %d, compaction slots available "
|
|
|
|
"%d, "
|
|
|
|
"flush slots scheduled %d, compaction slots scheduled %d",
|
|
|
|
cfd->GetName().c_str(), bg_job_limits.max_flushes,
|
|
|
|
bg_job_limits.max_compactions, bg_flush_scheduled_,
|
|
|
|
bg_compaction_scheduled_);
|
|
|
|
}
|
|
|
|
status = FlushMemTablesToOutputFiles(bg_flush_args, made_progress,
|
2019-03-20 00:24:09 +00:00
|
|
|
job_context, log_buffer, thread_pri);
|
2019-07-01 21:04:10 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::BackgroundFlush:BeforeFlush");
|
2023-01-24 17:54:04 +00:00
|
|
|
// All the CFD/bg_flush_arg in the FlushReq must have the same flush reason, so
|
|
|
|
// just grab the first one
|
|
|
|
#ifndef NDEBUG
|
2023-02-22 13:44:03 +00:00
|
|
|
for (const auto& bg_flush_arg : bg_flush_args) {
|
2023-01-24 17:54:04 +00:00
|
|
|
assert(bg_flush_arg.flush_reason_ == bg_flush_args[0].flush_reason_);
|
|
|
|
}
|
|
|
|
#endif /* !NDEBUG */
|
|
|
|
*reason = bg_flush_args[0].flush_reason_;
|
2018-08-24 20:17:29 +00:00
|
|
|
for (auto& arg : bg_flush_args) {
|
|
|
|
ColumnFamilyData* cfd = arg.cfd_;
|
2019-12-13 03:02:51 +00:00
|
|
|
if (cfd->UnrefAndTryDelete()) {
|
2018-08-24 20:17:29 +00:00
|
|
|
arg.cfd_ = nullptr;
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
}
|
2019-05-11 00:53:41 +00:00
|
|
|
for (auto cfd : column_families_not_to_flush) {
|
2019-12-13 03:02:51 +00:00
|
|
|
cfd->UnrefAndTryDelete();
|
2019-05-11 00:53:41 +00:00
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
return status;
|
|
|
|
}
|
|
|
|
|
2019-03-20 00:24:09 +00:00
|
|
|
void DBImpl::BackgroundCallFlush(Env::Priority thread_pri) {
|
2017-04-06 00:14:05 +00:00
|
|
|
bool made_progress = false;
|
|
|
|
JobContext job_context(next_job_id_.fetch_add(1), true);
|
|
|
|
|
Fix atomic flush waiting forever for MANIFEST write (#9034)
Summary:
In atomic flush, concurrent background flush threads will commit to the MANIFEST
one by one, in the order of the IDs of their picked memtables for all included column
families. Each time, a background flush thread decides whether to wait based on two
criteria:
- Is db stopped? If so, don't wait.
- Am I the one to commit the currently earliest memtable? If so, don't wait and ready to go.
When atomic flush was implemented, error writing to or syncing the MANIFEST would
cause the db to be stopped. Therefore, this background thread does not have to check
for the background error while waiting. If there has been such an error, `DBStopped()`
would have been true, and this thread will **not** wait forever.
After we improved error handling, RocksDB may map an IOError while writing to MANIFEST
to a soft error, if there is no WAL. This requires the background threads to check for
background error while waiting. Otherwise, a background flush thread may wait forever.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9034
Test Plan: make check
Reviewed By: zhichao-cao
Differential Revision: D31639225
Pulled By: riversand963
fbshipit-source-id: e9ab07c4d8f2eade238adeefe3e42dd9a5a3ebbd
2021-10-21 04:33:32 +00:00
|
|
|
TEST_SYNC_POINT_CALLBACK("DBImpl::BackgroundCallFlush:start", nullptr);
|
2017-04-06 00:14:05 +00:00
|
|
|
|
|
|
|
LogBuffer log_buffer(InfoLogLevel::INFO_LEVEL,
|
|
|
|
immutable_db_options_.info_log.get());
|
Fix the false positive alert of CF consistency check in WAL recovery (#8207)
Summary:
In current RocksDB, in recover the information form WAL, we do the consistency check for each column family when one WAL file is corrupted and PointInTimeRecovery is set. However, it will report a false positive alert on "SST file is ahead of WALs" when one of the CF current log number is greater than the corrupted WAL number (CF contains the data beyond the corrupted WAl) due to a new column family creation during flush. In this case, a new WAL is created (it is empty) during a flush. Also, due to some reason (e.g., storage issue or crash happens before SyncCloseLog is called), the old WAL is corrupted. The new CF has no data, therefore, it does not have the consistency issue.
Fix: when checking cfd->GetLogNumber() > corrupted_wal_number also check cfd->GetLiveSstFilesSize() > 0. So the CFs with no SST file data will skip the check here.
Note potential ignored inconsistency caused due to fix: empty CF can also be caused by write+delete. In this case, after flush, there is no SST files being generated. However, this CF still have the log in the WAL. When the WAL is corrupted, the DB might be inconsistent.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8207
Test Plan: added unit test, make crash_test
Reviewed By: riversand963
Differential Revision: D27898839
Pulled By: zhichao-cao
fbshipit-source-id: 931fc2d8b92dd00b4169bf84b94e712fd688a83e
2021-04-22 17:27:56 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::BackgroundCallFlush:Start:1");
|
|
|
|
TEST_SYNC_POINT("DBImpl::BackgroundCallFlush:Start:2");
|
2017-04-06 00:14:05 +00:00
|
|
|
{
|
|
|
|
InstrumentedMutexLock l(&mutex_);
|
2017-06-20 23:43:09 +00:00
|
|
|
assert(bg_flush_scheduled_);
|
2017-04-06 00:14:05 +00:00
|
|
|
num_running_flushes_++;
|
|
|
|
|
2019-10-08 21:18:48 +00:00
|
|
|
std::unique_ptr<std::list<uint64_t>::iterator>
|
|
|
|
pending_outputs_inserted_elem(new std::list<uint64_t>::iterator(
|
|
|
|
CaptureCurrentFileNumberInPendingOutputs()));
|
Auto recovery from out of space errors (#4164)
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
2018-09-15 20:36:19 +00:00
|
|
|
FlushReason reason;
|
2023-07-26 23:25:06 +00:00
|
|
|
bool flush_rescheduled_to_retain_udt = false;
|
|
|
|
Status s =
|
|
|
|
BackgroundFlush(&made_progress, &job_context, &log_buffer, &reason,
|
|
|
|
&flush_rescheduled_to_retain_udt, thread_pri);
|
|
|
|
if (s.IsTryAgain() && flush_rescheduled_to_retain_udt) {
|
|
|
|
bg_cv_.SignalAll(); // In case a waiter can proceed despite the error
|
|
|
|
mutex_.Unlock();
|
|
|
|
TEST_SYNC_POINT_CALLBACK("DBImpl::AfterRetainUDTReschedule:cb", nullptr);
|
|
|
|
immutable_db_options_.clock->SleepForMicroseconds(
|
|
|
|
100000); // prevent hot loop
|
|
|
|
mutex_.Lock();
|
|
|
|
} else if (!s.ok() && !s.IsShutdownInProgress() &&
|
|
|
|
!s.IsColumnFamilyDropped() &&
|
|
|
|
reason != FlushReason::kErrorRecovery) {
|
2017-04-06 00:14:05 +00:00
|
|
|
// Wait a little bit before retrying background flush in
|
|
|
|
// case this is an environmental problem and we do not want to
|
|
|
|
// chew up resources for failed flushes for the duration of
|
|
|
|
// the problem.
|
|
|
|
uint64_t error_cnt =
|
2018-04-13 00:55:14 +00:00
|
|
|
default_cf_internal_stats_->BumpAndGetBackgroundErrorCount();
|
2017-04-06 00:14:05 +00:00
|
|
|
bg_cv_.SignalAll(); // In case a waiter can proceed despite the error
|
|
|
|
mutex_.Unlock();
|
|
|
|
ROCKS_LOG_ERROR(immutable_db_options_.info_log,
|
2023-09-22 23:43:50 +00:00
|
|
|
"[JOB %d] Waiting after background flush error: %s"
|
2017-04-06 00:14:05 +00:00
|
|
|
"Accumulated background error counts: %" PRIu64,
|
2023-09-22 23:43:50 +00:00
|
|
|
job_context.job_id, s.ToString().c_str(), error_cnt);
|
2017-04-06 00:14:05 +00:00
|
|
|
log_buffer.FlushBufferToLog();
|
|
|
|
LogFlush(immutable_db_options_.info_log);
|
2021-03-15 11:32:24 +00:00
|
|
|
immutable_db_options_.clock->SleepForMicroseconds(1000000);
|
2017-04-06 00:14:05 +00:00
|
|
|
mutex_.Lock();
|
|
|
|
}
|
|
|
|
|
2018-10-26 22:06:44 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::BackgroundCallFlush:FlushFinish:0");
|
2017-04-06 00:14:05 +00:00
|
|
|
ReleaseFileNumberFromPendingOutputs(pending_outputs_inserted_elem);
|
2023-11-29 19:35:59 +00:00
|
|
|
// There is no need to find obsolete files if the flush job is rescheduled
|
2023-07-26 23:25:06 +00:00
|
|
|
// to retain user-defined timestamps because the job doesn't get to the
|
|
|
|
// stage of actually flushing the MemTables.
|
|
|
|
if (!flush_rescheduled_to_retain_udt) {
|
|
|
|
// If flush failed, we want to delete all temporary files that we might
|
|
|
|
// have created. Thus, we force full scan in FindObsoleteFiles()
|
|
|
|
FindObsoleteFiles(&job_context, !s.ok() && !s.IsShutdownInProgress() &&
|
|
|
|
!s.IsColumnFamilyDropped());
|
2023-11-29 19:35:59 +00:00
|
|
|
}
|
|
|
|
// delete unnecessary files if any, this is done outside the mutex
|
|
|
|
if (job_context.HaveSomethingToClean() ||
|
|
|
|
job_context.HaveSomethingToDelete() || !log_buffer.IsEmpty()) {
|
|
|
|
mutex_.Unlock();
|
|
|
|
TEST_SYNC_POINT("DBImpl::BackgroundCallFlush:FilesFound");
|
|
|
|
// Have to flush the info logs before bg_flush_scheduled_--
|
|
|
|
// because if bg_flush_scheduled_ becomes 0 and the lock is
|
|
|
|
// released, the deconstructor of DB can kick in and destroy all the
|
|
|
|
// states of DB so info_log might not be available after that point.
|
|
|
|
// It also applies to access other states that DB owns.
|
|
|
|
log_buffer.FlushBufferToLog();
|
|
|
|
if (job_context.HaveSomethingToDelete()) {
|
|
|
|
PurgeObsoleteFiles(job_context);
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
2023-11-29 19:35:59 +00:00
|
|
|
job_context.Clean();
|
|
|
|
mutex_.Lock();
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
2023-11-29 19:35:59 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::BackgroundCallFlush:ContextCleanedUp");
|
2017-04-06 00:14:05 +00:00
|
|
|
|
|
|
|
assert(num_running_flushes_ > 0);
|
|
|
|
num_running_flushes_--;
|
|
|
|
bg_flush_scheduled_--;
|
|
|
|
// See if there's more work to be done
|
|
|
|
MaybeScheduleFlushOrCompaction();
|
2019-01-04 04:53:52 +00:00
|
|
|
atomic_flush_install_cv_.SignalAll();
|
2017-04-06 00:14:05 +00:00
|
|
|
bg_cv_.SignalAll();
|
|
|
|
// IMPORTANT: there should be no code after calling SignalAll. This call may
|
|
|
|
// signal the DB destructor that it's OK to proceed with destruction. In
|
|
|
|
// that case, all DB variables will be dealloacated and referencing them
|
|
|
|
// will cause trouble.
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-08-03 22:36:28 +00:00
|
|
|
void DBImpl::BackgroundCallCompaction(PrepickedCompaction* prepicked_compaction,
|
|
|
|
Env::Priority bg_thread_pri) {
|
2017-04-06 00:14:05 +00:00
|
|
|
bool made_progress = false;
|
2022-03-02 21:43:00 +00:00
|
|
|
JobContext job_context(next_job_id_.fetch_add(1), true);
|
2017-04-06 00:14:05 +00:00
|
|
|
TEST_SYNC_POINT("BackgroundCallCompaction:0");
|
|
|
|
LogBuffer log_buffer(InfoLogLevel::INFO_LEVEL,
|
|
|
|
immutable_db_options_.info_log.get());
|
|
|
|
{
|
|
|
|
InstrumentedMutexLock l(&mutex_);
|
|
|
|
|
2022-03-02 21:43:00 +00:00
|
|
|
num_running_compactions_++;
|
|
|
|
|
|
|
|
std::unique_ptr<std::list<uint64_t>::iterator>
|
|
|
|
pending_outputs_inserted_elem(new std::list<uint64_t>::iterator(
|
|
|
|
CaptureCurrentFileNumberInPendingOutputs()));
|
|
|
|
|
|
|
|
assert((bg_thread_pri == Env::Priority::BOTTOM &&
|
|
|
|
bg_bottom_compaction_scheduled_) ||
|
|
|
|
(bg_thread_pri == Env::Priority::LOW && bg_compaction_scheduled_));
|
|
|
|
Status s = BackgroundCompaction(&made_progress, &job_context, &log_buffer,
|
|
|
|
prepicked_compaction, bg_thread_pri);
|
|
|
|
TEST_SYNC_POINT("BackgroundCallCompaction:1");
|
|
|
|
if (s.IsBusy()) {
|
|
|
|
bg_cv_.SignalAll(); // In case a waiter can proceed despite the error
|
|
|
|
mutex_.Unlock();
|
|
|
|
immutable_db_options_.clock->SleepForMicroseconds(
|
|
|
|
10000); // prevent hot loop
|
|
|
|
mutex_.Lock();
|
|
|
|
} else if (!s.ok() && !s.IsShutdownInProgress() &&
|
|
|
|
!s.IsManualCompactionPaused() && !s.IsColumnFamilyDropped()) {
|
|
|
|
// Wait a little bit before retrying background compaction in
|
|
|
|
// case this is an environmental problem and we do not want to
|
|
|
|
// chew up resources for failed compactions for the duration of
|
|
|
|
// the problem.
|
|
|
|
uint64_t error_cnt =
|
|
|
|
default_cf_internal_stats_->BumpAndGetBackgroundErrorCount();
|
|
|
|
bg_cv_.SignalAll(); // In case a waiter can proceed despite the error
|
|
|
|
mutex_.Unlock();
|
|
|
|
log_buffer.FlushBufferToLog();
|
|
|
|
ROCKS_LOG_ERROR(immutable_db_options_.info_log,
|
|
|
|
"Waiting after background compaction error: %s, "
|
|
|
|
"Accumulated background error counts: %" PRIu64,
|
|
|
|
s.ToString().c_str(), error_cnt);
|
|
|
|
LogFlush(immutable_db_options_.info_log);
|
|
|
|
immutable_db_options_.clock->SleepForMicroseconds(1000000);
|
|
|
|
mutex_.Lock();
|
|
|
|
} else if (s.IsManualCompactionPaused()) {
|
|
|
|
assert(prepicked_compaction);
|
2022-03-15 19:31:14 +00:00
|
|
|
ManualCompactionState* m = prepicked_compaction->manual_compaction_state;
|
2022-03-02 21:43:00 +00:00
|
|
|
assert(m);
|
|
|
|
ROCKS_LOG_BUFFER(&log_buffer, "[%s] [JOB %d] Manual compaction paused",
|
|
|
|
m->cfd->GetName().c_str(), job_context.job_id);
|
|
|
|
}
|
|
|
|
|
|
|
|
ReleaseFileNumberFromPendingOutputs(pending_outputs_inserted_elem);
|
2022-02-16 01:59:31 +00:00
|
|
|
|
2022-03-02 21:43:00 +00:00
|
|
|
// If compaction failed, we want to delete all temporary files that we
|
|
|
|
// might have created (they might not be all recorded in job_context in
|
|
|
|
// case of a failure). Thus, we force full scan in FindObsoleteFiles()
|
|
|
|
FindObsoleteFiles(&job_context, !s.ok() && !s.IsShutdownInProgress() &&
|
|
|
|
!s.IsManualCompactionPaused() &&
|
|
|
|
!s.IsColumnFamilyDropped() &&
|
|
|
|
!s.IsBusy());
|
|
|
|
TEST_SYNC_POINT("DBImpl::BackgroundCallCompaction:FoundObsoleteFiles");
|
|
|
|
|
|
|
|
// delete unnecessary files if any, this is done outside the mutex
|
|
|
|
if (job_context.HaveSomethingToClean() ||
|
|
|
|
job_context.HaveSomethingToDelete() || !log_buffer.IsEmpty()) {
|
|
|
|
mutex_.Unlock();
|
|
|
|
// Have to flush the info logs before bg_compaction_scheduled_--
|
|
|
|
// because if bg_flush_scheduled_ becomes 0 and the lock is
|
|
|
|
// released, the deconstructor of DB can kick in and destroy all the
|
|
|
|
// states of DB so info_log might not be available after that point.
|
|
|
|
// It also applies to access other states that DB owns.
|
|
|
|
log_buffer.FlushBufferToLog();
|
|
|
|
if (job_context.HaveSomethingToDelete()) {
|
|
|
|
PurgeObsoleteFiles(job_context);
|
|
|
|
TEST_SYNC_POINT("DBImpl::BackgroundCallCompaction:PurgedObsoleteFiles");
|
|
|
|
}
|
|
|
|
job_context.Clean();
|
|
|
|
mutex_.Lock();
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
2022-03-02 21:43:00 +00:00
|
|
|
assert(num_running_compactions_ > 0);
|
|
|
|
num_running_compactions_--;
|
|
|
|
|
2017-08-03 22:36:28 +00:00
|
|
|
if (bg_thread_pri == Env::Priority::LOW) {
|
|
|
|
bg_compaction_scheduled_--;
|
|
|
|
} else {
|
|
|
|
assert(bg_thread_pri == Env::Priority::BOTTOM);
|
|
|
|
bg_bottom_compaction_scheduled_--;
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
|
|
|
|
// See if there's more work to be done
|
|
|
|
MaybeScheduleFlushOrCompaction();
|
2021-05-05 00:26:23 +00:00
|
|
|
|
|
|
|
if (prepicked_compaction != nullptr &&
|
|
|
|
prepicked_compaction->task_token != nullptr) {
|
2021-07-22 00:36:48 +00:00
|
|
|
// Releasing task tokens affects (and asserts on) the DB state, so
|
|
|
|
// must be done before we potentially signal the DB close process to
|
|
|
|
// proceed below.
|
|
|
|
prepicked_compaction->task_token.reset();
|
2021-05-05 00:26:23 +00:00
|
|
|
}
|
|
|
|
|
2017-08-03 22:36:28 +00:00
|
|
|
if (made_progress ||
|
|
|
|
(bg_compaction_scheduled_ == 0 &&
|
|
|
|
bg_bottom_compaction_scheduled_ == 0) ||
|
2018-03-07 00:13:05 +00:00
|
|
|
HasPendingManualCompaction() || unscheduled_compactions_ == 0) {
|
2017-04-06 00:14:05 +00:00
|
|
|
// signal if
|
|
|
|
// * made_progress -- need to wakeup DelayWrite
|
2017-08-03 22:36:28 +00:00
|
|
|
// * bg_{bottom,}_compaction_scheduled_ == 0 -- need to wakeup ~DBImpl
|
2017-04-06 00:14:05 +00:00
|
|
|
// * HasPendingManualCompaction -- need to wakeup RunManualCompaction
|
|
|
|
// If none of this is true, there is no need to signal since nobody is
|
|
|
|
// waiting for it
|
|
|
|
bg_cv_.SignalAll();
|
|
|
|
}
|
|
|
|
// IMPORTANT: there should be no code after calling SignalAll. This call may
|
|
|
|
// signal the DB destructor that it's OK to proceed with destruction. In
|
|
|
|
// that case, all DB variables will be dealloacated and referencing them
|
|
|
|
// will cause trouble.
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
Status DBImpl::BackgroundCompaction(bool* made_progress,
|
|
|
|
JobContext* job_context,
|
2017-08-03 22:36:28 +00:00
|
|
|
LogBuffer* log_buffer,
|
2019-03-20 00:24:09 +00:00
|
|
|
PrepickedCompaction* prepicked_compaction,
|
|
|
|
Env::Priority thread_pri) {
|
2022-03-15 19:31:14 +00:00
|
|
|
ManualCompactionState* manual_compaction =
|
2017-08-03 22:36:28 +00:00
|
|
|
prepicked_compaction == nullptr
|
|
|
|
? nullptr
|
|
|
|
: prepicked_compaction->manual_compaction_state;
|
2017-04-06 00:14:05 +00:00
|
|
|
*made_progress = false;
|
|
|
|
mutex_.AssertHeld();
|
|
|
|
TEST_SYNC_POINT("DBImpl::BackgroundCompaction:Start");
|
|
|
|
|
2023-04-21 16:07:18 +00:00
|
|
|
const ReadOptions read_options(Env::IOActivity::kCompaction);
|
Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
|
|
|
const WriteOptions write_options(Env::IOActivity::kCompaction);
|
2023-04-21 16:07:18 +00:00
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
bool is_manual = (manual_compaction != nullptr);
|
2018-11-09 19:17:34 +00:00
|
|
|
std::unique_ptr<Compaction> c;
|
2017-08-03 22:36:28 +00:00
|
|
|
if (prepicked_compaction != nullptr &&
|
|
|
|
prepicked_compaction->compaction != nullptr) {
|
|
|
|
c.reset(prepicked_compaction->compaction);
|
|
|
|
}
|
|
|
|
bool is_prepicked = is_manual || c;
|
2017-04-06 00:14:05 +00:00
|
|
|
|
|
|
|
// (manual_compaction->in_progress == false);
|
|
|
|
bool trivial_move_disallowed =
|
|
|
|
is_manual && manual_compaction->disallow_trivial_move;
|
|
|
|
|
|
|
|
CompactionJobStats compaction_job_stats;
|
2018-06-28 19:23:57 +00:00
|
|
|
Status status;
|
|
|
|
if (!error_handler_.IsBGWorkStopped()) {
|
|
|
|
if (shutting_down_.load(std::memory_order_acquire)) {
|
|
|
|
status = Status::ShutdownInProgress();
|
2019-09-17 04:00:13 +00:00
|
|
|
} else if (is_manual &&
|
2022-06-07 01:32:26 +00:00
|
|
|
manual_compaction->canceled.load(std::memory_order_acquire)) {
|
2021-06-07 18:40:31 +00:00
|
|
|
status = Status::Incomplete(Status::SubCode::kManualCompactionPaused);
|
2018-06-28 19:23:57 +00:00
|
|
|
}
|
|
|
|
} else {
|
|
|
|
status = error_handler_.GetBGError();
|
Auto recovery from out of space errors (#4164)
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
2018-09-15 20:36:19 +00:00
|
|
|
// If we get here, it means a hard error happened after this compaction
|
|
|
|
// was scheduled by MaybeScheduleFlushOrCompaction(), but before it got
|
|
|
|
// a chance to execute. Since we didn't pop a cfd from the compaction
|
|
|
|
// queue, increment unscheduled_compactions_
|
|
|
|
unscheduled_compactions_++;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
if (!status.ok()) {
|
|
|
|
if (is_manual) {
|
|
|
|
manual_compaction->status = status;
|
|
|
|
manual_compaction->done = true;
|
|
|
|
manual_compaction->in_progress = false;
|
|
|
|
manual_compaction = nullptr;
|
|
|
|
}
|
2018-11-01 00:22:23 +00:00
|
|
|
if (c) {
|
|
|
|
c->ReleaseCompactionFiles(status);
|
|
|
|
c.reset();
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
return status;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (is_manual) {
|
|
|
|
// another thread cannot pick up the same work
|
|
|
|
manual_compaction->in_progress = true;
|
|
|
|
}
|
|
|
|
|
2022-03-15 19:31:14 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::BackgroundCompaction:InProgress");
|
|
|
|
|
Concurrent task limiter for compaction thread control (#4332)
Summary:
The PR is targeting to resolve the issue of:
https://github.com/facebook/rocksdb/issues/3972#issue-330771918
We have a rocksdb created with leveled-compaction with multiple column families (CFs), some of CFs are using HDD to store big and less frequently accessed data and others are using SSD.
When there are continuously write traffics going on to all CFs, the compaction thread pool is mostly occupied by those slow HDD compactions, which blocks fully utilize SSD bandwidth.
Since atomic write and transaction is needed across CFs, so splitting it to multiple rocksdb instance is not an option for us.
With the compaction thread control, we got 30%+ HDD write throughput gain, and also a lot smooth SSD write since less write stall happening.
ConcurrentTaskLimiter can be shared with multi-CFs across rocksdb instances, so the feature does not only work for multi-CFs scenarios, but also for multi-rocksdbs scenarios, who need disk IO resource control per tenant.
The usage is straight forward:
e.g.:
//
// Enable compaction thread limiter thru ColumnFamilyOptions
//
std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = ctl;
...
//
// Compaction thread limiter can be tuned or disabled on-the-fly
//
ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks
...
ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter
ctl->SetMaxOutstandingTask(-1); // Same as above
...
ctl->SetMaxOutstandingTask(0); // full throttle (0 task)
//
// Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue)
//
std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8));
std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt_ssd1(options);
ColumnFamilyOptions cf_opt_ssd2(options);
ColumnFamilyOptions cf_opt_hdd1(options);
ColumnFamilyOptions cf_opt_hdd2(options);
ColumnFamilyOptions cf_opt_hdd3(options);
// SSD CFs
cf_opt_ssd1.compaction_thread_limiter = ctl_ssd;
cf_opt_ssd2.compaction_thread_limiter = ctl_ssd;
// HDD CFs
cf_opt_hdd1.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd2.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd3.compaction_thread_limiter = ctl_hdd;
...
//
// The limiter is disabled by default (or set to nullptr explicitly)
//
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = nullptr;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332
Differential Revision: D13226590
Pulled By: siying
fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d
2018-12-13 21:16:04 +00:00
|
|
|
std::unique_ptr<TaskLimiterToken> task_token;
|
|
|
|
|
2018-04-03 02:53:19 +00:00
|
|
|
bool sfm_reserved_compact_space = false;
|
2017-04-06 00:14:05 +00:00
|
|
|
if (is_manual) {
|
2022-03-15 19:31:14 +00:00
|
|
|
ManualCompactionState* m = manual_compaction;
|
2017-04-06 00:14:05 +00:00
|
|
|
assert(m->in_progress);
|
|
|
|
if (!c) {
|
|
|
|
m->done = true;
|
|
|
|
m->manual_end = nullptr;
|
2020-08-17 18:52:23 +00:00
|
|
|
ROCKS_LOG_BUFFER(
|
|
|
|
log_buffer,
|
|
|
|
"[%s] Manual compaction from level-%d from %s .. "
|
|
|
|
"%s; nothing to do\n",
|
|
|
|
m->cfd->GetName().c_str(), m->input_level,
|
|
|
|
(m->begin ? m->begin->DebugString(true).c_str() : "(begin)"),
|
|
|
|
(m->end ? m->end->DebugString(true).c_str() : "(end)"));
|
2017-04-06 00:14:05 +00:00
|
|
|
} else {
|
2018-04-03 02:53:19 +00:00
|
|
|
// First check if we have enough room to do the compaction
|
|
|
|
bool enough_room = EnoughRoomForCompaction(
|
Auto recovery from out of space errors (#4164)
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
2018-09-15 20:36:19 +00:00
|
|
|
m->cfd, *(c->inputs()), &sfm_reserved_compact_space, log_buffer);
|
2018-04-03 02:53:19 +00:00
|
|
|
|
|
|
|
if (!enough_room) {
|
|
|
|
// Then don't do the compaction
|
|
|
|
c->ReleaseCompactionFiles(status);
|
|
|
|
c.reset();
|
|
|
|
// m's vars will get set properly at the end of this function,
|
|
|
|
// as long as status == CompactionTooLarge
|
|
|
|
status = Status::CompactionTooLarge();
|
|
|
|
} else {
|
|
|
|
ROCKS_LOG_BUFFER(
|
|
|
|
log_buffer,
|
|
|
|
"[%s] Manual compaction from level-%d to level-%d from %s .. "
|
|
|
|
"%s; will stop at %s\n",
|
|
|
|
m->cfd->GetName().c_str(), m->input_level, c->output_level(),
|
2020-04-07 00:38:59 +00:00
|
|
|
(m->begin ? m->begin->DebugString(true).c_str() : "(begin)"),
|
|
|
|
(m->end ? m->end->DebugString(true).c_str() : "(end)"),
|
2018-04-03 02:53:19 +00:00
|
|
|
((m->done || m->manual_end == nullptr)
|
|
|
|
? "(end)"
|
2020-04-07 00:38:59 +00:00
|
|
|
: m->manual_end->DebugString(true).c_str()));
|
2018-04-03 02:53:19 +00:00
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
2017-08-03 22:36:28 +00:00
|
|
|
} else if (!is_prepicked && !compaction_queue_.empty()) {
|
2018-01-16 21:10:34 +00:00
|
|
|
if (HasExclusiveManualCompaction()) {
|
2017-05-02 22:01:07 +00:00
|
|
|
// Can't compact right now, but try again later
|
|
|
|
TEST_SYNC_POINT("DBImpl::BackgroundCompaction()::Conflict");
|
|
|
|
|
2017-05-18 06:03:54 +00:00
|
|
|
// Stay in the compaction queue.
|
2017-05-02 22:01:07 +00:00
|
|
|
unscheduled_compactions_++;
|
|
|
|
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
Concurrent task limiter for compaction thread control (#4332)
Summary:
The PR is targeting to resolve the issue of:
https://github.com/facebook/rocksdb/issues/3972#issue-330771918
We have a rocksdb created with leveled-compaction with multiple column families (CFs), some of CFs are using HDD to store big and less frequently accessed data and others are using SSD.
When there are continuously write traffics going on to all CFs, the compaction thread pool is mostly occupied by those slow HDD compactions, which blocks fully utilize SSD bandwidth.
Since atomic write and transaction is needed across CFs, so splitting it to multiple rocksdb instance is not an option for us.
With the compaction thread control, we got 30%+ HDD write throughput gain, and also a lot smooth SSD write since less write stall happening.
ConcurrentTaskLimiter can be shared with multi-CFs across rocksdb instances, so the feature does not only work for multi-CFs scenarios, but also for multi-rocksdbs scenarios, who need disk IO resource control per tenant.
The usage is straight forward:
e.g.:
//
// Enable compaction thread limiter thru ColumnFamilyOptions
//
std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = ctl;
...
//
// Compaction thread limiter can be tuned or disabled on-the-fly
//
ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks
...
ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter
ctl->SetMaxOutstandingTask(-1); // Same as above
...
ctl->SetMaxOutstandingTask(0); // full throttle (0 task)
//
// Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue)
//
std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8));
std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt_ssd1(options);
ColumnFamilyOptions cf_opt_ssd2(options);
ColumnFamilyOptions cf_opt_hdd1(options);
ColumnFamilyOptions cf_opt_hdd2(options);
ColumnFamilyOptions cf_opt_hdd3(options);
// SSD CFs
cf_opt_ssd1.compaction_thread_limiter = ctl_ssd;
cf_opt_ssd2.compaction_thread_limiter = ctl_ssd;
// HDD CFs
cf_opt_hdd1.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd2.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd3.compaction_thread_limiter = ctl_hdd;
...
//
// The limiter is disabled by default (or set to nullptr explicitly)
//
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = nullptr;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332
Differential Revision: D13226590
Pulled By: siying
fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d
2018-12-13 21:16:04 +00:00
|
|
|
auto cfd = PickCompactionFromQueue(&task_token, log_buffer);
|
|
|
|
if (cfd == nullptr) {
|
|
|
|
// Can't find any executable task from the compaction queue.
|
|
|
|
// All tasks have been throttled by compaction thread limiter.
|
|
|
|
++unscheduled_compactions_;
|
|
|
|
return Status::Busy();
|
|
|
|
}
|
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
// We unreference here because the following code will take a Ref() on
|
|
|
|
// this cfd if it is going to use it (Compaction class holds a
|
|
|
|
// reference).
|
|
|
|
// This will all happen under a mutex so we don't have to be afraid of
|
|
|
|
// somebody else deleting it.
|
2019-12-13 03:02:51 +00:00
|
|
|
if (cfd->UnrefAndTryDelete()) {
|
2017-04-06 00:14:05 +00:00
|
|
|
// This was the last reference of the column family, so no need to
|
|
|
|
// compact.
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
|
|
|
// Pick up latest mutable CF Options and use it throughout the
|
|
|
|
// compaction job
|
|
|
|
// Compaction makes a copy of the latest MutableCFOptions. It should be used
|
|
|
|
// throughout the compaction procedure to make sure consistency. It will
|
|
|
|
// eventually be installed into SuperVersion
|
|
|
|
auto* mutable_cf_options = cfd->GetLatestMutableCFOptions();
|
|
|
|
if (!mutable_cf_options->disable_auto_compactions && !cfd->IsDropped()) {
|
|
|
|
// NOTE: try to avoid unnecessary copy of MutableCFOptions if
|
|
|
|
// compaction is not necessary. Need to make sure mutex is held
|
|
|
|
// until we make a copy in the following code
|
|
|
|
TEST_SYNC_POINT("DBImpl::BackgroundCompaction():BeforePickCompaction");
|
2020-07-23 01:31:25 +00:00
|
|
|
c.reset(cfd->PickCompaction(*mutable_cf_options, mutable_db_options_,
|
|
|
|
log_buffer));
|
2017-04-06 00:14:05 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::BackgroundCompaction():AfterPickCompaction");
|
2018-03-07 00:13:05 +00:00
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
if (c != nullptr) {
|
2018-04-03 02:53:19 +00:00
|
|
|
bool enough_room = EnoughRoomForCompaction(
|
Auto recovery from out of space errors (#4164)
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
2018-09-15 20:36:19 +00:00
|
|
|
cfd, *(c->inputs()), &sfm_reserved_compact_space, log_buffer);
|
2018-04-03 02:53:19 +00:00
|
|
|
|
2018-03-07 00:13:05 +00:00
|
|
|
if (!enough_room) {
|
|
|
|
// Then don't do the compaction
|
|
|
|
c->ReleaseCompactionFiles(status);
|
|
|
|
c->column_family_data()
|
|
|
|
->current()
|
|
|
|
->storage_info()
|
2021-06-16 23:50:43 +00:00
|
|
|
->ComputeCompactionScore(*(c->immutable_options()),
|
2018-03-07 00:13:05 +00:00
|
|
|
*(c->mutable_cf_options()));
|
2017-04-06 00:14:05 +00:00
|
|
|
AddToCompactionQueue(cfd);
|
|
|
|
++unscheduled_compactions_;
|
2018-03-07 00:13:05 +00:00
|
|
|
|
|
|
|
c.reset();
|
|
|
|
// Don't need to sleep here, because BackgroundCallCompaction
|
|
|
|
// will sleep if !s.ok()
|
|
|
|
status = Status::CompactionTooLarge();
|
|
|
|
} else {
|
|
|
|
// update statistics
|
2021-11-01 19:56:25 +00:00
|
|
|
size_t num_files = 0;
|
|
|
|
for (auto& each_level : *c->inputs()) {
|
|
|
|
num_files += each_level.files.size();
|
|
|
|
}
|
|
|
|
RecordInHistogram(stats_, NUM_FILES_IN_SINGLE_COMPACTION, num_files);
|
|
|
|
|
2018-03-07 00:13:05 +00:00
|
|
|
// There are three things that can change compaction score:
|
|
|
|
// 1) When flush or compaction finish. This case is covered by
|
|
|
|
// InstallSuperVersionAndScheduleWork
|
|
|
|
// 2) When MutableCFOptions changes. This case is also covered by
|
|
|
|
// InstallSuperVersionAndScheduleWork, because this is when the new
|
|
|
|
// options take effect.
|
|
|
|
// 3) When we Pick a new compaction, we "remove" those files being
|
|
|
|
// compacted from the calculation, which then influences compaction
|
|
|
|
// score. Here we check if we need the new compaction even without the
|
|
|
|
// files that are currently being compacted. If we need another
|
|
|
|
// compaction, we might be able to execute it in parallel, so we add
|
|
|
|
// it to the queue and schedule a new thread.
|
|
|
|
if (cfd->NeedsCompaction()) {
|
|
|
|
// Yes, we need more compactions!
|
|
|
|
AddToCompactionQueue(cfd);
|
|
|
|
++unscheduled_compactions_;
|
|
|
|
MaybeScheduleFlushOrCompaction();
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-27 23:03:05 +00:00
|
|
|
IOStatus io_s;
|
2023-09-18 20:11:53 +00:00
|
|
|
bool compaction_released = false;
|
2017-04-06 00:14:05 +00:00
|
|
|
if (!c) {
|
|
|
|
// Nothing to do
|
|
|
|
ROCKS_LOG_BUFFER(log_buffer, "Compaction nothing to do");
|
|
|
|
} else if (c->deletion_compaction()) {
|
|
|
|
// TODO(icanadi) Do we want to honor snapshots here? i.e. not delete old
|
|
|
|
// file if there is alive snapshot pointing to it
|
Concurrent task limiter for compaction thread control (#4332)
Summary:
The PR is targeting to resolve the issue of:
https://github.com/facebook/rocksdb/issues/3972#issue-330771918
We have a rocksdb created with leveled-compaction with multiple column families (CFs), some of CFs are using HDD to store big and less frequently accessed data and others are using SSD.
When there are continuously write traffics going on to all CFs, the compaction thread pool is mostly occupied by those slow HDD compactions, which blocks fully utilize SSD bandwidth.
Since atomic write and transaction is needed across CFs, so splitting it to multiple rocksdb instance is not an option for us.
With the compaction thread control, we got 30%+ HDD write throughput gain, and also a lot smooth SSD write since less write stall happening.
ConcurrentTaskLimiter can be shared with multi-CFs across rocksdb instances, so the feature does not only work for multi-CFs scenarios, but also for multi-rocksdbs scenarios, who need disk IO resource control per tenant.
The usage is straight forward:
e.g.:
//
// Enable compaction thread limiter thru ColumnFamilyOptions
//
std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = ctl;
...
//
// Compaction thread limiter can be tuned or disabled on-the-fly
//
ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks
...
ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter
ctl->SetMaxOutstandingTask(-1); // Same as above
...
ctl->SetMaxOutstandingTask(0); // full throttle (0 task)
//
// Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue)
//
std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8));
std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt_ssd1(options);
ColumnFamilyOptions cf_opt_ssd2(options);
ColumnFamilyOptions cf_opt_hdd1(options);
ColumnFamilyOptions cf_opt_hdd2(options);
ColumnFamilyOptions cf_opt_hdd3(options);
// SSD CFs
cf_opt_ssd1.compaction_thread_limiter = ctl_ssd;
cf_opt_ssd2.compaction_thread_limiter = ctl_ssd;
// HDD CFs
cf_opt_hdd1.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd2.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd3.compaction_thread_limiter = ctl_hdd;
...
//
// The limiter is disabled by default (or set to nullptr explicitly)
//
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = nullptr;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332
Differential Revision: D13226590
Pulled By: siying
fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d
2018-12-13 21:16:04 +00:00
|
|
|
TEST_SYNC_POINT_CALLBACK("DBImpl::BackgroundCompaction:BeforeCompaction",
|
|
|
|
c->column_family_data());
|
2017-04-06 00:14:05 +00:00
|
|
|
assert(c->num_input_files(1) == 0);
|
|
|
|
assert(c->column_family_data()->ioptions()->compaction_style ==
|
|
|
|
kCompactionStyleFIFO);
|
|
|
|
|
|
|
|
compaction_job_stats.num_input_files = c->num_input_files(0);
|
|
|
|
|
2018-10-11 00:30:22 +00:00
|
|
|
NotifyOnCompactionBegin(c->column_family_data(), c.get(), status,
|
|
|
|
compaction_job_stats, job_context->job_id);
|
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
for (const auto& f : *c->inputs(0)) {
|
|
|
|
c->edit()->DeleteFile(c->level(), f->fd.GetNumber());
|
|
|
|
}
|
2023-04-21 16:07:18 +00:00
|
|
|
status = versions_->LogAndApply(
|
|
|
|
c->column_family_data(), *c->mutable_cf_options(), read_options,
|
Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
|
|
|
write_options, c->edit(), &mutex_, directories_.GetDbDir(),
|
2023-09-18 20:11:53 +00:00
|
|
|
/*new_descriptor_log=*/false, /*column_family_options=*/nullptr,
|
|
|
|
[&c, &compaction_released](const Status& s) {
|
|
|
|
c->ReleaseCompactionFiles(s);
|
|
|
|
compaction_released = true;
|
|
|
|
});
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-27 23:03:05 +00:00
|
|
|
io_s = versions_->io_status();
|
2024-03-04 18:08:32 +00:00
|
|
|
InstallSuperVersionAndScheduleWork(
|
|
|
|
c->column_family_data(), job_context->superversion_contexts.data(),
|
|
|
|
*c->mutable_cf_options());
|
2017-04-06 00:14:05 +00:00
|
|
|
ROCKS_LOG_BUFFER(log_buffer, "[%s] Deleted %d files\n",
|
|
|
|
c->column_family_data()->GetName().c_str(),
|
|
|
|
c->num_input_files(0));
|
2023-10-19 01:00:07 +00:00
|
|
|
if (status.ok() && io_s.ok()) {
|
|
|
|
UpdateDeletionCompactionStats(c);
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
*made_progress = true;
|
Concurrent task limiter for compaction thread control (#4332)
Summary:
The PR is targeting to resolve the issue of:
https://github.com/facebook/rocksdb/issues/3972#issue-330771918
We have a rocksdb created with leveled-compaction with multiple column families (CFs), some of CFs are using HDD to store big and less frequently accessed data and others are using SSD.
When there are continuously write traffics going on to all CFs, the compaction thread pool is mostly occupied by those slow HDD compactions, which blocks fully utilize SSD bandwidth.
Since atomic write and transaction is needed across CFs, so splitting it to multiple rocksdb instance is not an option for us.
With the compaction thread control, we got 30%+ HDD write throughput gain, and also a lot smooth SSD write since less write stall happening.
ConcurrentTaskLimiter can be shared with multi-CFs across rocksdb instances, so the feature does not only work for multi-CFs scenarios, but also for multi-rocksdbs scenarios, who need disk IO resource control per tenant.
The usage is straight forward:
e.g.:
//
// Enable compaction thread limiter thru ColumnFamilyOptions
//
std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = ctl;
...
//
// Compaction thread limiter can be tuned or disabled on-the-fly
//
ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks
...
ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter
ctl->SetMaxOutstandingTask(-1); // Same as above
...
ctl->SetMaxOutstandingTask(0); // full throttle (0 task)
//
// Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue)
//
std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8));
std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt_ssd1(options);
ColumnFamilyOptions cf_opt_ssd2(options);
ColumnFamilyOptions cf_opt_hdd1(options);
ColumnFamilyOptions cf_opt_hdd2(options);
ColumnFamilyOptions cf_opt_hdd3(options);
// SSD CFs
cf_opt_ssd1.compaction_thread_limiter = ctl_ssd;
cf_opt_ssd2.compaction_thread_limiter = ctl_ssd;
// HDD CFs
cf_opt_hdd1.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd2.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd3.compaction_thread_limiter = ctl_hdd;
...
//
// The limiter is disabled by default (or set to nullptr explicitly)
//
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = nullptr;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332
Differential Revision: D13226590
Pulled By: siying
fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d
2018-12-13 21:16:04 +00:00
|
|
|
TEST_SYNC_POINT_CALLBACK("DBImpl::BackgroundCompaction:AfterCompaction",
|
|
|
|
c->column_family_data());
|
2017-04-06 00:14:05 +00:00
|
|
|
} else if (!trivial_move_disallowed && c->IsTrivialMove()) {
|
|
|
|
TEST_SYNC_POINT("DBImpl::BackgroundCompaction:TrivialMove");
|
Concurrent task limiter for compaction thread control (#4332)
Summary:
The PR is targeting to resolve the issue of:
https://github.com/facebook/rocksdb/issues/3972#issue-330771918
We have a rocksdb created with leveled-compaction with multiple column families (CFs), some of CFs are using HDD to store big and less frequently accessed data and others are using SSD.
When there are continuously write traffics going on to all CFs, the compaction thread pool is mostly occupied by those slow HDD compactions, which blocks fully utilize SSD bandwidth.
Since atomic write and transaction is needed across CFs, so splitting it to multiple rocksdb instance is not an option for us.
With the compaction thread control, we got 30%+ HDD write throughput gain, and also a lot smooth SSD write since less write stall happening.
ConcurrentTaskLimiter can be shared with multi-CFs across rocksdb instances, so the feature does not only work for multi-CFs scenarios, but also for multi-rocksdbs scenarios, who need disk IO resource control per tenant.
The usage is straight forward:
e.g.:
//
// Enable compaction thread limiter thru ColumnFamilyOptions
//
std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = ctl;
...
//
// Compaction thread limiter can be tuned or disabled on-the-fly
//
ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks
...
ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter
ctl->SetMaxOutstandingTask(-1); // Same as above
...
ctl->SetMaxOutstandingTask(0); // full throttle (0 task)
//
// Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue)
//
std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8));
std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt_ssd1(options);
ColumnFamilyOptions cf_opt_ssd2(options);
ColumnFamilyOptions cf_opt_hdd1(options);
ColumnFamilyOptions cf_opt_hdd2(options);
ColumnFamilyOptions cf_opt_hdd3(options);
// SSD CFs
cf_opt_ssd1.compaction_thread_limiter = ctl_ssd;
cf_opt_ssd2.compaction_thread_limiter = ctl_ssd;
// HDD CFs
cf_opt_hdd1.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd2.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd3.compaction_thread_limiter = ctl_hdd;
...
//
// The limiter is disabled by default (or set to nullptr explicitly)
//
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = nullptr;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332
Differential Revision: D13226590
Pulled By: siying
fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d
2018-12-13 21:16:04 +00:00
|
|
|
TEST_SYNC_POINT_CALLBACK("DBImpl::BackgroundCompaction:BeforeCompaction",
|
|
|
|
c->column_family_data());
|
2017-04-06 00:14:05 +00:00
|
|
|
// Instrument for event update
|
|
|
|
// TODO(yhchiang): add op details for showing trivial-move.
|
2023-04-21 16:07:18 +00:00
|
|
|
ThreadStatusUtil::SetColumnFamily(c->column_family_data());
|
2017-04-06 00:14:05 +00:00
|
|
|
ThreadStatusUtil::SetThreadOperation(ThreadStatus::OP_COMPACTION);
|
|
|
|
|
|
|
|
compaction_job_stats.num_input_files = c->num_input_files(0);
|
|
|
|
|
2018-10-11 00:30:22 +00:00
|
|
|
NotifyOnCompactionBegin(c->column_family_data(), c.get(), status,
|
|
|
|
compaction_job_stats, job_context->job_id);
|
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
// Move files to next level
|
|
|
|
int32_t moved_files = 0;
|
|
|
|
int64_t moved_bytes = 0;
|
|
|
|
for (unsigned int l = 0; l < c->num_input_levels(); l++) {
|
|
|
|
if (c->level(l) == c->output_level()) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
for (size_t i = 0; i < c->num_input_files(l); i++) {
|
|
|
|
FileMetaData* f = c->input(l, i);
|
|
|
|
c->edit()->DeleteFile(c->level(l), f->fd.GetNumber());
|
2021-11-10 18:47:53 +00:00
|
|
|
c->edit()->AddFile(
|
|
|
|
c->output_level(), f->fd.GetNumber(), f->fd.GetPathId(),
|
|
|
|
f->fd.GetFileSize(), f->smallest, f->largest, f->fd.smallest_seqno,
|
2021-12-03 22:42:05 +00:00
|
|
|
f->fd.largest_seqno, f->marked_for_compaction, f->temperature,
|
2021-11-10 18:47:53 +00:00
|
|
|
f->oldest_blob_file_number, f->oldest_ancester_time,
|
Sort L0 files by newly introduced epoch_num (#10922)
Summary:
**Context:**
Sorting L0 files by `largest_seqno` has at least two inconvenience:
- File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap.
- For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
- insert k1@1 to memtable m1
- ingest file s1 with k2@2, ingest file s2 with k3@3
- insert k4@4 to m1
- compact files s1, s2 and result in new file s3 of seqno range [2, 3]
- flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1
- However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption.
- Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption
- For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example)
- an existing SST s1 contains only k1@1
- insert k1@2 to memtable m1
- ingest file s2 with k3@3, ingest file s3 with k4@4
- insert single delete k5@5 in m1
- flush m1 and result in new file s4 of seqno range [2, 5]
- compact s1, s2, s3 and result in new file s5 of seqno range [1, 4]
- compact s4 and result in new file s6 of seqno range [2] due to single delete
- By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno`
Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways:
- In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more.
- In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption.
**Summary:**
- Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`.
- `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`)
- Compaction output file is assigned with the minimum `epoch_number` among input files'
- Refit level: reuse refitted file's epoch_number
- Other paths needing `epoch_number` treatment:
- Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`
- Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`.
- Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair).
- Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder.
- Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery
- Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more
- Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag`
- Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above
- Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`.
- Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR.
- Misc:
- update existing tests with `epoch_number` so make check will pass
- update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases
- assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922
Test Plan:
- `make check`
- New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
- Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930
- [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox`
- [Ongoing] normal db stress test
- [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761
Reviewed By: ajkr
Differential Revision: D41063187
Pulled By: hx235
fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2022-12-13 21:29:37 +00:00
|
|
|
f->file_creation_time, f->epoch_number, f->file_checksum,
|
2022-12-29 21:28:24 +00:00
|
|
|
f->file_checksum_func_name, f->unique_id,
|
2023-06-22 04:49:01 +00:00
|
|
|
f->compensated_range_deletion_size, f->tail_size,
|
|
|
|
f->user_defined_timestamps_persisted);
|
2017-04-06 00:14:05 +00:00
|
|
|
|
2018-04-13 00:55:14 +00:00
|
|
|
ROCKS_LOG_BUFFER(
|
|
|
|
log_buffer,
|
|
|
|
"[%s] Moving #%" PRIu64 " to level-%d %" PRIu64 " bytes\n",
|
|
|
|
c->column_family_data()->GetName().c_str(), f->fd.GetNumber(),
|
|
|
|
c->output_level(), f->fd.GetFileSize());
|
2017-04-06 00:14:05 +00:00
|
|
|
++moved_files;
|
|
|
|
moved_bytes += f->fd.GetFileSize();
|
|
|
|
}
|
|
|
|
}
|
2022-06-21 18:56:53 +00:00
|
|
|
if (c->compaction_reason() == CompactionReason::kLevelMaxLevelSize &&
|
|
|
|
c->immutable_options()->compaction_pri == kRoundRobin) {
|
|
|
|
int start_level = c->start_level();
|
|
|
|
if (start_level > 0) {
|
|
|
|
auto vstorage = c->input_version()->storage_info();
|
|
|
|
c->edit()->AddCompactCursor(
|
Support subcmpct using reserved resources for round-robin priority (#10341)
Summary:
Earlier implementation of round-robin priority can only pick one file at a time and disallows parallel compactions within the same level. In this PR, round-robin compaction policy will expand towards more input files with respecting some additional constraints, which are summarized as follows:
* Constraint 1: We can only pick consecutive files
- Constraint 1a: When a file is being compacted (or some input files are being compacted after expanding), we cannot choose it and have to stop choosing more files
- Constraint 1b: When we reach the last file (with the largest keys), we cannot choose more files (the next file will be the first one with small keys)
* Constraint 2: We should ensure the total compaction bytes (including the overlapped files from the next level) is no more than `mutable_cf_options_.max_compaction_bytes`
* Constraint 3: We try our best to pick as many files as possible so that the post-compaction level size can be just less than `MaxBytesForLevel(start_level_)`
* Constraint 4: If trivial move is allowed, we reuse the logic of `TryNonL0TrivialMove()` instead of expanding files with Constraint 3
More details can be found in `LevelCompactionBuilder::SetupOtherFilesWithRoundRobinExpansion()`.
The above optimization accelerates the process of moving the compaction cursor, in which the write-amp can be further reduced. While a large compaction may lead to high write stall, we break this large compaction into several subcompactions **regardless of** the `max_subcompactions` limit. The number of subcompactions for round-robin compaction priority is determined through the following steps:
* Step 1: Initialized against `max_output_file_limit`, the number of input files in the start level, and also the range size limit `ranges.size()`
* Step 2: Call `AcquireSubcompactionResources()`when max subcompactions is not sufficient, but we may or may not obtain desired resources, additional number of resources is stored in `extra_num_subcompaction_threads_reserved_`). Subcompaction limit is changed and update `num_planned_subcompactions` with `GetSubcompactionLimit()`
* Step 3: Call `ShrinkSubcompactionResources()` to ensure extra resources can be released (extra resources may exist for round-robin compaction when the number of actual number of subcompactions is less than the number of planned subcompactions)
More details can be found in `CompactionJob::AcquireSubcompactionResources()`,`CompactionJob::ShrinkSubcompactionResources()`, and `CompactionJob::ReleaseSubcompactionResources()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10341
Test Plan: Add `CompactionPriMultipleFilesRoundRobin[1-3]` unit test in `compaction_picker_test.cc` and `RoundRobinSubcompactionsAgainstResources.SubcompactionsUsingResources/[0-4]`, `RoundRobinSubcompactionsAgainstPressureToken.PressureTokenTest/[0-1]` in `db_compaction_test.cc`
Reviewed By: ajkr, hx235
Differential Revision: D37792644
Pulled By: littlepig2013
fbshipit-source-id: 7fecb7c4ffd97b34bbf6e3b760b2c35a772a0657
2022-07-24 18:12:44 +00:00
|
|
|
start_level,
|
|
|
|
vstorage->GetNextCompactCursor(start_level, c->num_input_files(0)));
|
2022-06-21 18:56:53 +00:00
|
|
|
}
|
|
|
|
}
|
2023-04-21 16:07:18 +00:00
|
|
|
status = versions_->LogAndApply(
|
|
|
|
c->column_family_data(), *c->mutable_cf_options(), read_options,
|
Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
|
|
|
write_options, c->edit(), &mutex_, directories_.GetDbDir(),
|
2023-09-18 20:11:53 +00:00
|
|
|
/*new_descriptor_log=*/false, /*column_family_options=*/nullptr,
|
|
|
|
[&c, &compaction_released](const Status& s) {
|
|
|
|
c->ReleaseCompactionFiles(s);
|
|
|
|
compaction_released = true;
|
|
|
|
});
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-27 23:03:05 +00:00
|
|
|
io_s = versions_->io_status();
|
2017-04-06 00:14:05 +00:00
|
|
|
// Use latest MutableCFOptions
|
2024-03-04 18:08:32 +00:00
|
|
|
InstallSuperVersionAndScheduleWork(
|
|
|
|
c->column_family_data(), job_context->superversion_contexts.data(),
|
|
|
|
*c->mutable_cf_options());
|
2017-04-06 00:14:05 +00:00
|
|
|
|
|
|
|
VersionStorageInfo::LevelSummaryStorage tmp;
|
|
|
|
c->column_family_data()->internal_stats()->IncBytesMoved(c->output_level(),
|
|
|
|
moved_bytes);
|
|
|
|
{
|
|
|
|
event_logger_.LogToBuffer(log_buffer)
|
|
|
|
<< "job" << job_context->job_id << "event"
|
|
|
|
<< "trivial_move"
|
|
|
|
<< "destination_level" << c->output_level() << "files" << moved_files
|
|
|
|
<< "total_files_size" << moved_bytes;
|
|
|
|
}
|
|
|
|
ROCKS_LOG_BUFFER(
|
|
|
|
log_buffer,
|
|
|
|
"[%s] Moved #%d files to level-%d %" PRIu64 " bytes %s: %s\n",
|
|
|
|
c->column_family_data()->GetName().c_str(), moved_files,
|
|
|
|
c->output_level(), moved_bytes, status.ToString().c_str(),
|
|
|
|
c->column_family_data()->current()->storage_info()->LevelSummary(&tmp));
|
|
|
|
*made_progress = true;
|
|
|
|
|
|
|
|
// Clear Instrument
|
|
|
|
ThreadStatusUtil::ResetThreadStatus();
|
Concurrent task limiter for compaction thread control (#4332)
Summary:
The PR is targeting to resolve the issue of:
https://github.com/facebook/rocksdb/issues/3972#issue-330771918
We have a rocksdb created with leveled-compaction with multiple column families (CFs), some of CFs are using HDD to store big and less frequently accessed data and others are using SSD.
When there are continuously write traffics going on to all CFs, the compaction thread pool is mostly occupied by those slow HDD compactions, which blocks fully utilize SSD bandwidth.
Since atomic write and transaction is needed across CFs, so splitting it to multiple rocksdb instance is not an option for us.
With the compaction thread control, we got 30%+ HDD write throughput gain, and also a lot smooth SSD write since less write stall happening.
ConcurrentTaskLimiter can be shared with multi-CFs across rocksdb instances, so the feature does not only work for multi-CFs scenarios, but also for multi-rocksdbs scenarios, who need disk IO resource control per tenant.
The usage is straight forward:
e.g.:
//
// Enable compaction thread limiter thru ColumnFamilyOptions
//
std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = ctl;
...
//
// Compaction thread limiter can be tuned or disabled on-the-fly
//
ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks
...
ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter
ctl->SetMaxOutstandingTask(-1); // Same as above
...
ctl->SetMaxOutstandingTask(0); // full throttle (0 task)
//
// Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue)
//
std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8));
std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt_ssd1(options);
ColumnFamilyOptions cf_opt_ssd2(options);
ColumnFamilyOptions cf_opt_hdd1(options);
ColumnFamilyOptions cf_opt_hdd2(options);
ColumnFamilyOptions cf_opt_hdd3(options);
// SSD CFs
cf_opt_ssd1.compaction_thread_limiter = ctl_ssd;
cf_opt_ssd2.compaction_thread_limiter = ctl_ssd;
// HDD CFs
cf_opt_hdd1.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd2.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd3.compaction_thread_limiter = ctl_hdd;
...
//
// The limiter is disabled by default (or set to nullptr explicitly)
//
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = nullptr;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332
Differential Revision: D13226590
Pulled By: siying
fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d
2018-12-13 21:16:04 +00:00
|
|
|
TEST_SYNC_POINT_CALLBACK("DBImpl::BackgroundCompaction:AfterCompaction",
|
|
|
|
c->column_family_data());
|
2018-05-14 21:44:04 +00:00
|
|
|
} else if (!is_prepicked && c->output_level() > 0 &&
|
2017-08-03 22:36:28 +00:00
|
|
|
c->output_level() ==
|
|
|
|
c->column_family_data()
|
|
|
|
->current()
|
|
|
|
->storage_info()
|
|
|
|
->MaxOutputLevel(
|
|
|
|
immutable_db_options_.allow_ingest_behind) &&
|
|
|
|
env_->GetBackgroundThreads(Env::Priority::BOTTOM) > 0) {
|
2018-05-14 21:44:04 +00:00
|
|
|
// Forward compactions involving last level to the bottom pool if it exists,
|
|
|
|
// such that compactions unlikely to contribute to write stalls can be
|
|
|
|
// delayed or deprioritized.
|
2017-08-03 22:36:28 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::BackgroundCompaction:ForwardToBottomPriPool");
|
|
|
|
CompactionArg* ca = new CompactionArg;
|
|
|
|
ca->db = this;
|
Fix possible hang issue in ~DBImpl() when flush is scheduled in LOW pool (#8125)
Summary:
In DBImpl::CloseHelper, we wait for bg_compaction_scheduled_
and bg_flush_scheduled_ to drop to 0. Unschedule is called prior
to cancel any unscheduled flushes/compactions. It is assumed that
anything in the high priority is a flush, and anything in the low
priority pool is a compaction. This assumption, however, is broken when
the high-pri pool is full.
As a result, bg_compaction_scheduled_ can go < 0 and bg_flush_scheduled_
will remain > 0 and DB can be in hang state.
The fix is, we decrement the `bg_{flush,compaction,bottom_compaction}_scheduled_`
inside the `Unschedule{Flush,Compaction,BottomCompaction}Callback()`s. DB
`mutex_` will make the counts atomic in `Unschedule`.
Related discussion: https://github.com/facebook/rocksdb/issues/7928
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8125
Test Plan: Added new test case which hangs without the fix.
Reviewed By: jay-zhuang
Differential Revision: D27390043
Pulled By: ajkr
fbshipit-source-id: 78a367fba9a59ac5607ad24bd1c46dc16d5ec110
2021-03-31 01:34:11 +00:00
|
|
|
ca->compaction_pri_ = Env::Priority::BOTTOM;
|
2017-08-03 22:36:28 +00:00
|
|
|
ca->prepicked_compaction = new PrepickedCompaction;
|
|
|
|
ca->prepicked_compaction->compaction = c.release();
|
|
|
|
ca->prepicked_compaction->manual_compaction_state = nullptr;
|
Concurrent task limiter for compaction thread control (#4332)
Summary:
The PR is targeting to resolve the issue of:
https://github.com/facebook/rocksdb/issues/3972#issue-330771918
We have a rocksdb created with leveled-compaction with multiple column families (CFs), some of CFs are using HDD to store big and less frequently accessed data and others are using SSD.
When there are continuously write traffics going on to all CFs, the compaction thread pool is mostly occupied by those slow HDD compactions, which blocks fully utilize SSD bandwidth.
Since atomic write and transaction is needed across CFs, so splitting it to multiple rocksdb instance is not an option for us.
With the compaction thread control, we got 30%+ HDD write throughput gain, and also a lot smooth SSD write since less write stall happening.
ConcurrentTaskLimiter can be shared with multi-CFs across rocksdb instances, so the feature does not only work for multi-CFs scenarios, but also for multi-rocksdbs scenarios, who need disk IO resource control per tenant.
The usage is straight forward:
e.g.:
//
// Enable compaction thread limiter thru ColumnFamilyOptions
//
std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = ctl;
...
//
// Compaction thread limiter can be tuned or disabled on-the-fly
//
ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks
...
ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter
ctl->SetMaxOutstandingTask(-1); // Same as above
...
ctl->SetMaxOutstandingTask(0); // full throttle (0 task)
//
// Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue)
//
std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8));
std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt_ssd1(options);
ColumnFamilyOptions cf_opt_ssd2(options);
ColumnFamilyOptions cf_opt_hdd1(options);
ColumnFamilyOptions cf_opt_hdd2(options);
ColumnFamilyOptions cf_opt_hdd3(options);
// SSD CFs
cf_opt_ssd1.compaction_thread_limiter = ctl_ssd;
cf_opt_ssd2.compaction_thread_limiter = ctl_ssd;
// HDD CFs
cf_opt_hdd1.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd2.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd3.compaction_thread_limiter = ctl_hdd;
...
//
// The limiter is disabled by default (or set to nullptr explicitly)
//
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = nullptr;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332
Differential Revision: D13226590
Pulled By: siying
fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d
2018-12-13 21:16:04 +00:00
|
|
|
// Transfer requested token, so it doesn't need to do it again.
|
|
|
|
ca->prepicked_compaction->task_token = std::move(task_token);
|
2017-08-03 22:36:28 +00:00
|
|
|
++bg_bottom_compaction_scheduled_;
|
2023-09-18 20:11:53 +00:00
|
|
|
assert(c == nullptr);
|
2017-08-03 22:36:28 +00:00
|
|
|
env_->Schedule(&DBImpl::BGWorkBottomCompaction, ca, Env::Priority::BOTTOM,
|
2019-03-20 00:24:09 +00:00
|
|
|
this, &DBImpl::UnscheduleCompactionCallback);
|
2017-04-06 00:14:05 +00:00
|
|
|
} else {
|
Concurrent task limiter for compaction thread control (#4332)
Summary:
The PR is targeting to resolve the issue of:
https://github.com/facebook/rocksdb/issues/3972#issue-330771918
We have a rocksdb created with leveled-compaction with multiple column families (CFs), some of CFs are using HDD to store big and less frequently accessed data and others are using SSD.
When there are continuously write traffics going on to all CFs, the compaction thread pool is mostly occupied by those slow HDD compactions, which blocks fully utilize SSD bandwidth.
Since atomic write and transaction is needed across CFs, so splitting it to multiple rocksdb instance is not an option for us.
With the compaction thread control, we got 30%+ HDD write throughput gain, and also a lot smooth SSD write since less write stall happening.
ConcurrentTaskLimiter can be shared with multi-CFs across rocksdb instances, so the feature does not only work for multi-CFs scenarios, but also for multi-rocksdbs scenarios, who need disk IO resource control per tenant.
The usage is straight forward:
e.g.:
//
// Enable compaction thread limiter thru ColumnFamilyOptions
//
std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = ctl;
...
//
// Compaction thread limiter can be tuned or disabled on-the-fly
//
ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks
...
ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter
ctl->SetMaxOutstandingTask(-1); // Same as above
...
ctl->SetMaxOutstandingTask(0); // full throttle (0 task)
//
// Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue)
//
std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8));
std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt_ssd1(options);
ColumnFamilyOptions cf_opt_ssd2(options);
ColumnFamilyOptions cf_opt_hdd1(options);
ColumnFamilyOptions cf_opt_hdd2(options);
ColumnFamilyOptions cf_opt_hdd3(options);
// SSD CFs
cf_opt_ssd1.compaction_thread_limiter = ctl_ssd;
cf_opt_ssd2.compaction_thread_limiter = ctl_ssd;
// HDD CFs
cf_opt_hdd1.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd2.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd3.compaction_thread_limiter = ctl_hdd;
...
//
// The limiter is disabled by default (or set to nullptr explicitly)
//
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = nullptr;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332
Differential Revision: D13226590
Pulled By: siying
fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d
2018-12-13 21:16:04 +00:00
|
|
|
TEST_SYNC_POINT_CALLBACK("DBImpl::BackgroundCompaction:BeforeCompaction",
|
|
|
|
c->column_family_data());
|
2018-04-13 00:55:14 +00:00
|
|
|
int output_level __attribute__((__unused__));
|
2017-10-23 21:20:53 +00:00
|
|
|
output_level = c->output_level();
|
2017-04-06 00:14:05 +00:00
|
|
|
TEST_SYNC_POINT_CALLBACK("DBImpl::BackgroundCompaction:NonTrivial",
|
|
|
|
&output_level);
|
2019-01-16 05:32:15 +00:00
|
|
|
std::vector<SequenceNumber> snapshot_seqs;
|
2017-04-06 00:14:05 +00:00
|
|
|
SequenceNumber earliest_write_conflict_snapshot;
|
2019-01-16 05:32:15 +00:00
|
|
|
SnapshotChecker* snapshot_checker;
|
|
|
|
GetSnapshotContext(job_context, &snapshot_seqs,
|
|
|
|
&earliest_write_conflict_snapshot, &snapshot_checker);
|
2017-04-06 00:14:05 +00:00
|
|
|
assert(is_snapshot_supported_ || snapshots_.empty());
|
2022-06-07 01:32:26 +00:00
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
CompactionJob compaction_job(
|
2017-09-28 00:37:08 +00:00
|
|
|
job_context->job_id, c.get(), immutable_db_options_,
|
2021-05-20 04:40:43 +00:00
|
|
|
mutable_db_options_, file_options_for_compaction_, versions_.get(),
|
2022-04-11 17:26:55 +00:00
|
|
|
&shutting_down_, log_buffer, directories_.GetDbDir(),
|
2020-10-26 20:50:03 +00:00
|
|
|
GetDataDir(c->column_family_data(), c->output_path_id()),
|
|
|
|
GetDataDir(c->column_family_data(), 0), stats_, &mutex_,
|
|
|
|
&error_handler_, snapshot_seqs, earliest_write_conflict_snapshot,
|
CompactionIterator sees consistent view of which keys are committed (#9830)
Summary:
**This PR does not affect the functionality of `DB` and write-committed transactions.**
`CompactionIterator` uses `KeyCommitted(seq)` to determine if a key in the database is committed.
As the name 'write-committed' implies, if write-committed policy is used, a key exists in the database only if
it is committed. In fact, the implementation of `KeyCommitted()` is as follows:
```
inline bool KeyCommitted(SequenceNumber seq) {
// For non-txn-db and write-committed, snapshot_checker_ is always nullptr.
return snapshot_checker_ == nullptr ||
snapshot_checker_->CheckInSnapshot(seq, kMaxSequence) == SnapshotCheckerResult::kInSnapshot;
}
```
With that being said, we focus on write-prepared/write-unprepared transactions.
A few notes:
- A key can exist in the db even if it's uncommitted. Therefore, we rely on `snapshot_checker_` to determine data visibility. We also require that all writes go through transaction API instead of the raw `WriteBatch` + `Write`, thus at most one uncommitted version of one user key can exist in the database.
- `CompactionIterator` outputs a key as long as the key is uncommitted.
Due to the above reasons, it is possible that `CompactionIterator` decides to output an uncommitted key without
doing further checks on the key (`NextFromInput()`). By the time the key is being prepared for output, the key becomes
committed because the `snapshot_checker_(seq, kMaxSequence)` becomes true in the implementation of `KeyCommitted()`.
Then `CompactionIterator` will try to zero its sequence number and hit assertion error if the key is a tombstone.
To fix this issue, we should make the `CompactionIterator` see a consistent view of the input keys. Note that
for write-prepared/write-unprepared, the background flush/compaction jobs already take a "job snapshot" before starting
processing keys. The job snapshot is released only after the entire flush/compaction finishes. We can use this snapshot
to determine whether a key is committed or not with minor change to `KeyCommitted()`.
```
inline bool KeyCommitted(SequenceNumber sequence) {
// For non-txn-db and write-committed, snapshot_checker_ is always nullptr.
return snapshot_checker_ == nullptr ||
snapshot_checker_->CheckInSnapshot(sequence, job_snapshot_) ==
SnapshotCheckerResult::kInSnapshot;
}
```
As a result, whether a key is committed or not will remain a constant throughout compaction, causing no trouble
for `CompactionIterator`s assertions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9830
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D35561162
Pulled By: riversand963
fbshipit-source-id: 0e00d200c195240341cfe6d34cbc86798b315b9f
2022-04-14 18:11:04 +00:00
|
|
|
snapshot_checker, job_context, table_cache_, &event_logger_,
|
2020-10-26 20:50:03 +00:00
|
|
|
c->mutable_cf_options()->paranoid_file_checks,
|
2017-04-06 00:14:05 +00:00
|
|
|
c->mutable_cf_options()->report_bg_io_stats, dbname_,
|
2020-08-13 00:28:10 +00:00
|
|
|
&compaction_job_stats, thread_pri, io_tracer_,
|
2022-06-07 01:32:26 +00:00
|
|
|
is_manual ? manual_compaction->canceled
|
|
|
|
: kManualCompactionCanceledFalse_,
|
|
|
|
db_id_, db_session_id_, c->column_family_data()->GetFullHistoryTsLow(),
|
Support subcmpct using reserved resources for round-robin priority (#10341)
Summary:
Earlier implementation of round-robin priority can only pick one file at a time and disallows parallel compactions within the same level. In this PR, round-robin compaction policy will expand towards more input files with respecting some additional constraints, which are summarized as follows:
* Constraint 1: We can only pick consecutive files
- Constraint 1a: When a file is being compacted (or some input files are being compacted after expanding), we cannot choose it and have to stop choosing more files
- Constraint 1b: When we reach the last file (with the largest keys), we cannot choose more files (the next file will be the first one with small keys)
* Constraint 2: We should ensure the total compaction bytes (including the overlapped files from the next level) is no more than `mutable_cf_options_.max_compaction_bytes`
* Constraint 3: We try our best to pick as many files as possible so that the post-compaction level size can be just less than `MaxBytesForLevel(start_level_)`
* Constraint 4: If trivial move is allowed, we reuse the logic of `TryNonL0TrivialMove()` instead of expanding files with Constraint 3
More details can be found in `LevelCompactionBuilder::SetupOtherFilesWithRoundRobinExpansion()`.
The above optimization accelerates the process of moving the compaction cursor, in which the write-amp can be further reduced. While a large compaction may lead to high write stall, we break this large compaction into several subcompactions **regardless of** the `max_subcompactions` limit. The number of subcompactions for round-robin compaction priority is determined through the following steps:
* Step 1: Initialized against `max_output_file_limit`, the number of input files in the start level, and also the range size limit `ranges.size()`
* Step 2: Call `AcquireSubcompactionResources()`when max subcompactions is not sufficient, but we may or may not obtain desired resources, additional number of resources is stored in `extra_num_subcompaction_threads_reserved_`). Subcompaction limit is changed and update `num_planned_subcompactions` with `GetSubcompactionLimit()`
* Step 3: Call `ShrinkSubcompactionResources()` to ensure extra resources can be released (extra resources may exist for round-robin compaction when the number of actual number of subcompactions is less than the number of planned subcompactions)
More details can be found in `CompactionJob::AcquireSubcompactionResources()`,`CompactionJob::ShrinkSubcompactionResources()`, and `CompactionJob::ReleaseSubcompactionResources()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10341
Test Plan: Add `CompactionPriMultipleFilesRoundRobin[1-3]` unit test in `compaction_picker_test.cc` and `RoundRobinSubcompactionsAgainstResources.SubcompactionsUsingResources/[0-4]`, `RoundRobinSubcompactionsAgainstPressureToken.PressureTokenTest/[0-1]` in `db_compaction_test.cc`
Reviewed By: ajkr, hx235
Differential Revision: D37792644
Pulled By: littlepig2013
fbshipit-source-id: 7fecb7c4ffd97b34bbf6e3b760b2c35a772a0657
2022-07-24 18:12:44 +00:00
|
|
|
c->trim_ts(), &blob_callback_, &bg_compaction_scheduled_,
|
|
|
|
&bg_bottom_compaction_scheduled_);
|
2017-04-06 00:14:05 +00:00
|
|
|
compaction_job.Prepare();
|
|
|
|
|
2018-10-11 00:30:22 +00:00
|
|
|
NotifyOnCompactionBegin(c->column_family_data(), c.get(), status,
|
|
|
|
compaction_job_stats, job_context->job_id);
|
2017-04-06 00:14:05 +00:00
|
|
|
mutex_.Unlock();
|
2019-06-04 05:37:40 +00:00
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"DBImpl::BackgroundCompaction:NonTrivial:BeforeRun", nullptr);
|
2023-05-03 18:12:20 +00:00
|
|
|
// Should handle error?
|
2020-09-29 16:47:33 +00:00
|
|
|
compaction_job.Run().PermitUncheckedError();
|
2017-04-06 00:14:05 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::BackgroundCompaction:NonTrivial:AfterRun");
|
|
|
|
mutex_.Lock();
|
2023-09-18 20:11:53 +00:00
|
|
|
status =
|
|
|
|
compaction_job.Install(*c->mutable_cf_options(), &compaction_released);
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-27 23:03:05 +00:00
|
|
|
io_s = compaction_job.io_status();
|
2017-04-06 00:14:05 +00:00
|
|
|
if (status.ok()) {
|
2024-03-04 18:08:32 +00:00
|
|
|
InstallSuperVersionAndScheduleWork(
|
|
|
|
c->column_family_data(), job_context->superversion_contexts.data(),
|
|
|
|
*c->mutable_cf_options());
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
*made_progress = true;
|
Concurrent task limiter for compaction thread control (#4332)
Summary:
The PR is targeting to resolve the issue of:
https://github.com/facebook/rocksdb/issues/3972#issue-330771918
We have a rocksdb created with leveled-compaction with multiple column families (CFs), some of CFs are using HDD to store big and less frequently accessed data and others are using SSD.
When there are continuously write traffics going on to all CFs, the compaction thread pool is mostly occupied by those slow HDD compactions, which blocks fully utilize SSD bandwidth.
Since atomic write and transaction is needed across CFs, so splitting it to multiple rocksdb instance is not an option for us.
With the compaction thread control, we got 30%+ HDD write throughput gain, and also a lot smooth SSD write since less write stall happening.
ConcurrentTaskLimiter can be shared with multi-CFs across rocksdb instances, so the feature does not only work for multi-CFs scenarios, but also for multi-rocksdbs scenarios, who need disk IO resource control per tenant.
The usage is straight forward:
e.g.:
//
// Enable compaction thread limiter thru ColumnFamilyOptions
//
std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = ctl;
...
//
// Compaction thread limiter can be tuned or disabled on-the-fly
//
ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks
...
ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter
ctl->SetMaxOutstandingTask(-1); // Same as above
...
ctl->SetMaxOutstandingTask(0); // full throttle (0 task)
//
// Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue)
//
std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8));
std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt_ssd1(options);
ColumnFamilyOptions cf_opt_ssd2(options);
ColumnFamilyOptions cf_opt_hdd1(options);
ColumnFamilyOptions cf_opt_hdd2(options);
ColumnFamilyOptions cf_opt_hdd3(options);
// SSD CFs
cf_opt_ssd1.compaction_thread_limiter = ctl_ssd;
cf_opt_ssd2.compaction_thread_limiter = ctl_ssd;
// HDD CFs
cf_opt_hdd1.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd2.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd3.compaction_thread_limiter = ctl_hdd;
...
//
// The limiter is disabled by default (or set to nullptr explicitly)
//
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = nullptr;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332
Differential Revision: D13226590
Pulled By: siying
fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d
2018-12-13 21:16:04 +00:00
|
|
|
TEST_SYNC_POINT_CALLBACK("DBImpl::BackgroundCompaction:AfterCompaction",
|
|
|
|
c->column_family_data());
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-27 23:03:05 +00:00
|
|
|
|
|
|
|
if (status.ok() && !io_s.ok()) {
|
|
|
|
status = io_s;
|
2020-09-29 16:47:33 +00:00
|
|
|
} else {
|
|
|
|
io_s.PermitUncheckedError();
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-27 23:03:05 +00:00
|
|
|
}
|
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
if (c != nullptr) {
|
2023-09-18 20:11:53 +00:00
|
|
|
if (!compaction_released) {
|
|
|
|
c->ReleaseCompactionFiles(status);
|
|
|
|
} else {
|
|
|
|
#ifndef NDEBUG
|
|
|
|
// Sanity checking that compaction files are freed.
|
|
|
|
for (size_t i = 0; i < c->num_input_levels(); i++) {
|
|
|
|
for (size_t j = 0; j < c->inputs(i)->size(); j++) {
|
|
|
|
assert(!c->input(i, j)->being_compacted);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
std::unordered_set<Compaction*>* cip = c->column_family_data()
|
|
|
|
->compaction_picker()
|
|
|
|
->compactions_in_progress();
|
|
|
|
assert(cip->find(c.get()) == cip->end());
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
*made_progress = true;
|
2018-03-07 00:13:05 +00:00
|
|
|
|
|
|
|
// Need to make sure SstFileManager does its bookkeeping
|
|
|
|
auto sfm = static_cast<SstFileManagerImpl*>(
|
|
|
|
immutable_db_options_.sst_file_manager.get());
|
2018-04-03 02:53:19 +00:00
|
|
|
if (sfm && sfm_reserved_compact_space) {
|
2018-03-07 00:13:05 +00:00
|
|
|
sfm->OnCompactionCompletion(c.get());
|
|
|
|
}
|
|
|
|
|
2018-04-13 00:55:14 +00:00
|
|
|
NotifyOnCompactionCompleted(c->column_family_data(), c.get(), status,
|
|
|
|
compaction_job_stats, job_context->job_id);
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
2019-09-17 04:00:13 +00:00
|
|
|
if (status.ok() || status.IsCompactionTooLarge() ||
|
|
|
|
status.IsManualCompactionPaused()) {
|
2017-04-06 00:14:05 +00:00
|
|
|
// Done
|
2019-06-04 05:37:40 +00:00
|
|
|
} else if (status.IsColumnFamilyDropped() || status.IsShutdownInProgress()) {
|
2017-04-06 00:14:05 +00:00
|
|
|
// Ignore compaction errors found during shutting down
|
|
|
|
} else {
|
|
|
|
ROCKS_LOG_WARN(immutable_db_options_.info_log, "Compaction error: %s",
|
|
|
|
status.ToString().c_str());
|
2020-03-29 02:05:54 +00:00
|
|
|
if (!io_s.ok()) {
|
First step towards handling MANIFEST write error (#6949)
Summary:
This PR provides preliminary support for handling IO error during MANIFEST write.
File write/sync is not guaranteed to be atomic. If we encounter an IOError while writing/syncing to the MANIFEST file, we cannot be sure about the state of the MANIFEST file. The version edits may or may not have reached the file. During cleanup, if we delete the newly-generated SST files referenced by the pending version edit(s), but the version edit(s) actually are persistent in the MANIFEST, then next recovery attempt will process the version edits(s) and then fail since the SST files have already been deleted.
One approach is to truncate the MANIFEST after write/sync error, so that it is safe to delete the SST files. However, file truncation may not be supported on certain file systems. Therefore, we take the following approach.
If an IOError is detected during MANIFEST write/sync, we disable file deletions for the faulty database. Depending on whether the IOError is retryable (set by underlying file system), either RocksDB or application can call `DB::Resume()`, or simply shutdown and restart. During `Resume()`, RocksDB will try to switch to a new MANIFEST and write all existing in-memory version storage in the new file. If this succeeds, then RocksDB may proceed. If all recovery is completed, then file deletions will be re-enabled.
Note that multiple threads can call `LogAndApply()` at the same time, though only one of them will be going through the process MANIFEST write, possibly batching the version edits of other threads. When the leading MANIFEST writer finishes, all of the MANIFEST writing threads in this batch will have the same IOError. They will all call `ErrorHandler::SetBGError()` in which file deletion will be disabled.
Possible future directions:
- Add an `ErrorContext` structure so that it is easier to pass more info to `ErrorHandler`. Currently, as in this example, a new `BackgroundErrorReason` has to be added.
Test plan (dev server):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6949
Reviewed By: anand1976
Differential Revision: D22026020
Pulled By: riversand963
fbshipit-source-id: f3c68a2ef45d9b505d0d625c7c5e0c88495b91c8
2020-06-25 02:05:47 +00:00
|
|
|
// Error while writing to MANIFEST.
|
|
|
|
// In fact, versions_->io_status() can also be the result of renaming
|
|
|
|
// CURRENT file. With current code, it's just difficult to tell. So just
|
|
|
|
// be pessimistic and try write to a new MANIFEST.
|
|
|
|
// TODO: distinguish between MANIFEST write and CURRENT renaming
|
|
|
|
auto err_reason = versions_->io_status().ok()
|
|
|
|
? BackgroundErrorReason::kCompaction
|
|
|
|
: BackgroundErrorReason::kManifestWrite;
|
2020-12-08 04:09:55 +00:00
|
|
|
error_handler_.SetBGError(io_s, err_reason);
|
2020-03-29 02:05:54 +00:00
|
|
|
} else {
|
2020-12-08 04:09:55 +00:00
|
|
|
error_handler_.SetBGError(status, BackgroundErrorReason::kCompaction);
|
2020-03-29 02:05:54 +00:00
|
|
|
}
|
Auto recovery from out of space errors (#4164)
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
2018-09-15 20:36:19 +00:00
|
|
|
if (c != nullptr && !is_manual && !error_handler_.IsBGWorkStopped()) {
|
|
|
|
// Put this cfd back in the compaction queue so we can retry after some
|
|
|
|
// time
|
|
|
|
auto cfd = c->column_family_data();
|
|
|
|
assert(cfd != nullptr);
|
|
|
|
// Since this compaction failed, we need to recompute the score so it
|
|
|
|
// takes the original input files into account
|
|
|
|
c->column_family_data()
|
|
|
|
->current()
|
|
|
|
->storage_info()
|
2021-06-16 23:50:43 +00:00
|
|
|
->ComputeCompactionScore(*(c->immutable_options()),
|
Auto recovery from out of space errors (#4164)
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
2018-09-15 20:36:19 +00:00
|
|
|
*(c->mutable_cf_options()));
|
|
|
|
if (!cfd->queued_for_compaction()) {
|
|
|
|
AddToCompactionQueue(cfd);
|
|
|
|
++unscheduled_compactions_;
|
|
|
|
}
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
Auto recovery from out of space errors (#4164)
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
2018-09-15 20:36:19 +00:00
|
|
|
// this will unref its input_version and column_family_data
|
|
|
|
c.reset();
|
2017-04-06 00:14:05 +00:00
|
|
|
|
|
|
|
if (is_manual) {
|
2022-03-15 19:31:14 +00:00
|
|
|
ManualCompactionState* m = manual_compaction;
|
2017-04-06 00:14:05 +00:00
|
|
|
if (!status.ok()) {
|
|
|
|
m->status = status;
|
|
|
|
m->done = true;
|
|
|
|
}
|
|
|
|
// For universal compaction:
|
|
|
|
// Because universal compaction always happens at level 0, so one
|
|
|
|
// compaction will pick up all overlapped files. No files will be
|
|
|
|
// filtered out due to size limit and left for a successive compaction.
|
|
|
|
// So we can safely conclude the current compaction.
|
|
|
|
//
|
|
|
|
// Also note that, if we don't stop here, then the current compaction
|
|
|
|
// writes a new file back to level 0, which will be used in successive
|
|
|
|
// compaction. Hence the manual compaction will never finish.
|
|
|
|
//
|
|
|
|
// Stop the compaction if manual_end points to nullptr -- this means
|
|
|
|
// that we compacted the whole range. manual_end should always point
|
|
|
|
// to nullptr in case of universal compaction
|
|
|
|
if (m->manual_end == nullptr) {
|
|
|
|
m->done = true;
|
|
|
|
}
|
|
|
|
if (!m->done) {
|
|
|
|
// We only compacted part of the requested range. Update *m
|
|
|
|
// to the range that is left to be compacted.
|
|
|
|
// Universal and FIFO compactions should always compact the whole range
|
|
|
|
assert(m->cfd->ioptions()->compaction_style !=
|
|
|
|
kCompactionStyleUniversal ||
|
|
|
|
m->cfd->ioptions()->num_levels > 1);
|
|
|
|
assert(m->cfd->ioptions()->compaction_style != kCompactionStyleFIFO);
|
|
|
|
m->tmp_storage = *m->manual_end;
|
|
|
|
m->begin = &m->tmp_storage;
|
|
|
|
m->incomplete = true;
|
|
|
|
}
|
2018-04-13 00:55:14 +00:00
|
|
|
m->in_progress = false; // not being processed anymore
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
TEST_SYNC_POINT("DBImpl::BackgroundCompaction:Finish");
|
|
|
|
return status;
|
|
|
|
}
|
|
|
|
|
|
|
|
bool DBImpl::HasPendingManualCompaction() {
|
|
|
|
return (!manual_compaction_dequeue_.empty());
|
|
|
|
}
|
|
|
|
|
2017-08-03 22:36:28 +00:00
|
|
|
void DBImpl::AddManualCompaction(DBImpl::ManualCompactionState* m) {
|
Prevent corruption with parallel manual compactions and `change_level == true` (#9077)
Summary:
The bug can impact the following scenario. There must be two `CompactRange()`s, call them A and B. Compaction A must have `change_level=true`. Compactions A and B must run in parallel, and new data must be added while they run as well.
Now, on to the details of the race condition. Compaction A must reach the refitting phase while B's next step is to trivial move new data (i.e., data that has been inserted behind A) down to the same level that A's refit targets (`CompactRangeOptions::target_level`). B must be unregistered (i.e., has not yet called `AddManualCompaction()` for the current `RunManualCompaction()`) while A invokes `DisableManualCompaction()`s to prepare for refitting. In the old code, B could still proceed to register a manual compaction, while A had disabled manual compaction.
The next part of the race condition is B picks and schedules a trivial move while A has released the lock in refitting phase in order to persist the LSM state change (i.e., the log phase of `LogAndApply()`). That way, B does not see the refitted data when picking a trivial-move compaction. So it is susceptible to picking one that overlaps.
Finally, B executes the picked trivial-move compaction. Trivial-move compactions are special in that they never check whether manual compaction is disabled. So the picked compaction causing overlap ends up being applied, leading to LSM corruption if `force_consistency_checks=false`, or entering read-only mode with `Status::Corruption` if `force_consistency_checks=true` (the default).
The fix is just to prevent B from registering itself in `RunManualCompaction()` while manual compactions are disabled, consequently preventing any trivial move or other compaction from being picked/scheduled.
Thanks to siying for finding the bug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9077
Test Plan: The test does not go all the way in exposing the bug because it requires a compaction to be picked/scheduled while logging LSM state change for RefitLevel(). But the fix is to make such a compaction not picked/scheduled in the first place, so any repro of that scenario would end up hanging RefitLevel() logging. So instead I just verified no such compaction is registered in the scenario where `RefitLevel()` disables manual compactions.
Reviewed By: siying
Differential Revision: D31921908
Pulled By: ajkr
fbshipit-source-id: 9bb5d0e847ad428211227f40830c685c209fbecb
2021-10-28 06:07:29 +00:00
|
|
|
assert(manual_compaction_paused_ == 0);
|
2017-04-06 00:14:05 +00:00
|
|
|
manual_compaction_dequeue_.push_back(m);
|
|
|
|
}
|
|
|
|
|
2017-08-03 22:36:28 +00:00
|
|
|
void DBImpl::RemoveManualCompaction(DBImpl::ManualCompactionState* m) {
|
2017-04-06 00:14:05 +00:00
|
|
|
// Remove from queue
|
2017-08-03 22:36:28 +00:00
|
|
|
std::deque<ManualCompactionState*>::iterator it =
|
2017-04-06 00:14:05 +00:00
|
|
|
manual_compaction_dequeue_.begin();
|
|
|
|
while (it != manual_compaction_dequeue_.end()) {
|
|
|
|
if (m == (*it)) {
|
|
|
|
it = manual_compaction_dequeue_.erase(it);
|
|
|
|
return;
|
|
|
|
}
|
2019-05-15 20:14:18 +00:00
|
|
|
++it;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
assert(false);
|
|
|
|
}
|
|
|
|
|
2017-08-03 22:36:28 +00:00
|
|
|
bool DBImpl::ShouldntRunManualCompaction(ManualCompactionState* m) {
|
2017-04-06 00:14:05 +00:00
|
|
|
if (m->exclusive) {
|
2017-08-03 22:36:28 +00:00
|
|
|
return (bg_bottom_compaction_scheduled_ > 0 ||
|
|
|
|
bg_compaction_scheduled_ > 0);
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
2017-08-03 22:36:28 +00:00
|
|
|
std::deque<ManualCompactionState*>::iterator it =
|
2017-04-06 00:14:05 +00:00
|
|
|
manual_compaction_dequeue_.begin();
|
|
|
|
bool seen = false;
|
|
|
|
while (it != manual_compaction_dequeue_.end()) {
|
|
|
|
if (m == (*it)) {
|
2019-05-15 20:14:18 +00:00
|
|
|
++it;
|
2017-04-06 00:14:05 +00:00
|
|
|
seen = true;
|
|
|
|
continue;
|
|
|
|
} else if (MCOverlap(m, (*it)) && (!seen && !(*it)->in_progress)) {
|
|
|
|
// Consider the other manual compaction *it, conflicts if:
|
|
|
|
// overlaps with m
|
|
|
|
// and (*it) is ahead in the queue and is not yet in progress
|
|
|
|
return true;
|
|
|
|
}
|
2019-05-15 20:14:18 +00:00
|
|
|
++it;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
bool DBImpl::HaveManualCompaction(ColumnFamilyData* cfd) {
|
|
|
|
// Remove from priority queue
|
2017-08-03 22:36:28 +00:00
|
|
|
std::deque<ManualCompactionState*>::iterator it =
|
2017-04-06 00:14:05 +00:00
|
|
|
manual_compaction_dequeue_.begin();
|
|
|
|
while (it != manual_compaction_dequeue_.end()) {
|
|
|
|
if ((*it)->exclusive) {
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
if ((cfd == (*it)->cfd) && (!((*it)->in_progress || (*it)->done))) {
|
|
|
|
// Allow automatic compaction if manual compaction is
|
2017-06-13 23:46:17 +00:00
|
|
|
// in progress
|
2017-04-06 00:14:05 +00:00
|
|
|
return true;
|
|
|
|
}
|
2019-05-15 20:14:18 +00:00
|
|
|
++it;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
bool DBImpl::HasExclusiveManualCompaction() {
|
|
|
|
// Remove from priority queue
|
2017-08-03 22:36:28 +00:00
|
|
|
std::deque<ManualCompactionState*>::iterator it =
|
2017-04-06 00:14:05 +00:00
|
|
|
manual_compaction_dequeue_.begin();
|
|
|
|
while (it != manual_compaction_dequeue_.end()) {
|
|
|
|
if ((*it)->exclusive) {
|
|
|
|
return true;
|
|
|
|
}
|
2019-05-15 20:14:18 +00:00
|
|
|
++it;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2017-08-03 22:36:28 +00:00
|
|
|
bool DBImpl::MCOverlap(ManualCompactionState* m, ManualCompactionState* m1) {
|
2017-04-06 00:14:05 +00:00
|
|
|
if ((m->exclusive) || (m1->exclusive)) {
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
if (m->cfd != m1->cfd) {
|
|
|
|
return false;
|
|
|
|
}
|
2021-05-20 04:40:43 +00:00
|
|
|
return false;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
2023-10-19 01:00:07 +00:00
|
|
|
void DBImpl::UpdateDeletionCompactionStats(
|
|
|
|
const std::unique_ptr<Compaction>& c) {
|
|
|
|
if (c == nullptr) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
CompactionReason reason = c->compaction_reason();
|
|
|
|
|
|
|
|
switch (reason) {
|
|
|
|
case CompactionReason::kFIFOMaxSize:
|
|
|
|
RecordTick(stats_, FIFO_MAX_SIZE_COMPACTIONS);
|
|
|
|
break;
|
|
|
|
case CompactionReason::kFIFOTtl:
|
|
|
|
RecordTick(stats_, FIFO_TTL_COMPACTIONS);
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
assert(false);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-12-13 22:12:02 +00:00
|
|
|
void DBImpl::BuildCompactionJobInfo(
|
|
|
|
const ColumnFamilyData* cfd, Compaction* c, const Status& st,
|
|
|
|
const CompactionJobStats& compaction_job_stats, const int job_id,
|
2023-07-28 16:47:31 +00:00
|
|
|
CompactionJobInfo* compaction_job_info) const {
|
2018-12-13 22:12:02 +00:00
|
|
|
assert(compaction_job_info != nullptr);
|
|
|
|
compaction_job_info->cf_id = cfd->GetID();
|
|
|
|
compaction_job_info->cf_name = cfd->GetName();
|
|
|
|
compaction_job_info->status = st;
|
|
|
|
compaction_job_info->thread_id = env_->GetThreadID();
|
|
|
|
compaction_job_info->job_id = job_id;
|
|
|
|
compaction_job_info->base_input_level = c->start_level();
|
|
|
|
compaction_job_info->output_level = c->output_level();
|
|
|
|
compaction_job_info->stats = compaction_job_stats;
|
2023-09-20 20:34:39 +00:00
|
|
|
const auto& input_table_properties = c->GetInputTableProperties();
|
|
|
|
const auto& output_table_properties = c->GetOutputTableProperties();
|
|
|
|
compaction_job_info->table_properties.insert(input_table_properties.begin(),
|
|
|
|
input_table_properties.end());
|
|
|
|
compaction_job_info->table_properties.insert(output_table_properties.begin(),
|
|
|
|
output_table_properties.end());
|
2018-12-13 22:12:02 +00:00
|
|
|
compaction_job_info->compaction_reason = c->compaction_reason();
|
|
|
|
compaction_job_info->compression = c->output_compression();
|
2023-04-21 16:07:18 +00:00
|
|
|
|
|
|
|
const ReadOptions read_options(Env::IOActivity::kCompaction);
|
2018-12-13 22:12:02 +00:00
|
|
|
for (size_t i = 0; i < c->num_input_levels(); ++i) {
|
|
|
|
for (const auto fmd : *c->inputs(i)) {
|
2019-10-24 21:42:43 +00:00
|
|
|
const FileDescriptor& desc = fmd->fd;
|
|
|
|
const uint64_t file_number = desc.GetNumber();
|
2021-06-16 23:50:43 +00:00
|
|
|
auto fn = TableFileName(c->immutable_options()->cf_paths, file_number,
|
2019-10-24 21:42:43 +00:00
|
|
|
desc.GetPathId());
|
2018-12-13 22:12:02 +00:00
|
|
|
compaction_job_info->input_files.push_back(fn);
|
2019-10-24 21:42:43 +00:00
|
|
|
compaction_job_info->input_file_infos.push_back(CompactionFileInfo{
|
|
|
|
static_cast<int>(i), file_number, fmd->oldest_blob_file_number});
|
2018-12-13 22:12:02 +00:00
|
|
|
}
|
|
|
|
}
|
2023-07-28 16:47:31 +00:00
|
|
|
|
2018-12-13 22:12:02 +00:00
|
|
|
for (const auto& newf : c->edit()->GetNewFiles()) {
|
2019-10-24 21:42:43 +00:00
|
|
|
const FileMetaData& meta = newf.second;
|
|
|
|
const FileDescriptor& desc = meta.fd;
|
|
|
|
const uint64_t file_number = desc.GetNumber();
|
|
|
|
compaction_job_info->output_files.push_back(TableFileName(
|
2021-06-16 23:50:43 +00:00
|
|
|
c->immutable_options()->cf_paths, file_number, desc.GetPathId()));
|
2019-10-24 21:42:43 +00:00
|
|
|
compaction_job_info->output_file_infos.push_back(CompactionFileInfo{
|
|
|
|
newf.first, file_number, meta.oldest_blob_file_number});
|
2018-12-13 22:12:02 +00:00
|
|
|
}
|
2021-09-17 00:17:40 +00:00
|
|
|
compaction_job_info->blob_compression_type =
|
|
|
|
c->mutable_cf_options()->blob_compression_type;
|
|
|
|
|
|
|
|
// Update BlobFilesInfo.
|
|
|
|
for (const auto& blob_file : c->edit()->GetBlobFileAdditions()) {
|
|
|
|
BlobFileAdditionInfo blob_file_addition_info(
|
|
|
|
BlobFileName(c->immutable_options()->cf_paths.front().path,
|
|
|
|
blob_file.GetBlobFileNumber()) /*blob_file_path*/,
|
|
|
|
blob_file.GetBlobFileNumber(), blob_file.GetTotalBlobCount(),
|
|
|
|
blob_file.GetTotalBlobBytes());
|
|
|
|
compaction_job_info->blob_file_addition_infos.emplace_back(
|
|
|
|
std::move(blob_file_addition_info));
|
|
|
|
}
|
|
|
|
|
|
|
|
// Update BlobFilesGarbageInfo.
|
|
|
|
for (const auto& blob_file : c->edit()->GetBlobFileGarbages()) {
|
|
|
|
BlobFileGarbageInfo blob_file_garbage_info(
|
|
|
|
BlobFileName(c->immutable_options()->cf_paths.front().path,
|
|
|
|
blob_file.GetBlobFileNumber()) /*blob_file_path*/,
|
|
|
|
blob_file.GetBlobFileNumber(), blob_file.GetGarbageBlobCount(),
|
|
|
|
blob_file.GetGarbageBlobBytes());
|
|
|
|
compaction_job_info->blob_file_garbage_infos.emplace_back(
|
|
|
|
std::move(blob_file_garbage_info));
|
|
|
|
}
|
2018-12-13 22:12:02 +00:00
|
|
|
}
|
|
|
|
|
2017-10-06 01:00:38 +00:00
|
|
|
// SuperVersionContext gets created and destructed outside of the lock --
|
2018-04-10 22:47:54 +00:00
|
|
|
// we use this conveniently to:
|
2017-04-06 00:14:05 +00:00
|
|
|
// * malloc one SuperVersion() outside of the lock -- new_superversion
|
|
|
|
// * delete SuperVersion()s outside of the lock -- superversions_to_free
|
|
|
|
//
|
|
|
|
// However, if InstallSuperVersionAndScheduleWork() gets called twice with the
|
2017-10-06 01:00:38 +00:00
|
|
|
// same sv_context, we can't reuse the SuperVersion() that got
|
2017-04-06 00:14:05 +00:00
|
|
|
// malloced because
|
|
|
|
// first call already used it. In that rare case, we take a hit and create a
|
|
|
|
// new SuperVersion() inside of the mutex. We do similar thing
|
|
|
|
// for superversion_to_free
|
|
|
|
|
2017-10-06 01:00:38 +00:00
|
|
|
void DBImpl::InstallSuperVersionAndScheduleWork(
|
|
|
|
ColumnFamilyData* cfd, SuperVersionContext* sv_context,
|
Make mempurge a background process (equivalent to in-memory compaction). (#8505)
Summary:
In https://github.com/facebook/rocksdb/issues/8454, I introduced a new process baptized `MemPurge` (memtable garbage collection). This new PR is built upon this past mempurge prototype.
In this PR, I made the `mempurge` process a background task, which provides superior performance since the mempurge process does not cling on the db_mutex anymore, and addresses severe restrictions from the past iteration (including a scenario where the past mempurge was failling, when a memtable was mempurged but was still referred to by an iterator/snapshot/...).
Now the mempurge process ressembles an in-memory compaction process: the stack of immutable memtables is filtered out, and the useful payload is used to populate an output memtable. If the output memtable is filled at more than 60% capacity (arbitrary heuristic) the mempurge process is aborted and a regular flush process takes place, else the output memtable is kept in the immutable memtable stack. Note that adding this output memtable to the `imm()` memtable stack does not trigger another flush process, so that the flush thread can go to sleep at the end of a successful mempurge.
MemPurge is activated by making the `experimental_allow_mempurge` flag `true`. When activated, the `MemPurge` process will always happen when the flush reason is `kWriteBufferFull`.
The 3 unit tests confirm that this process supports `Put`, `Get`, `Delete`, `DeleteRange` operators and is compatible with `Iterators` and `CompactionFilters`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8505
Reviewed By: pdillinger
Differential Revision: D29619283
Pulled By: bjlemaire
fbshipit-source-id: 8a99bee76b63a8211bff1a00e0ae32360aaece95
2021-07-10 00:16:00 +00:00
|
|
|
const MutableCFOptions& mutable_cf_options) {
|
2017-04-06 00:14:05 +00:00
|
|
|
mutex_.AssertHeld();
|
|
|
|
|
|
|
|
// Update max_total_in_memory_state_
|
|
|
|
size_t old_memtable_size = 0;
|
|
|
|
auto* old_sv = cfd->GetSuperVersion();
|
|
|
|
if (old_sv) {
|
|
|
|
old_memtable_size = old_sv->mutable_cf_options.write_buffer_size *
|
|
|
|
old_sv->mutable_cf_options.max_write_buffer_number;
|
|
|
|
}
|
|
|
|
|
2018-07-27 21:02:07 +00:00
|
|
|
// this branch is unlikely to step in
|
|
|
|
if (UNLIKELY(sv_context->new_superversion == nullptr)) {
|
2017-10-06 01:00:38 +00:00
|
|
|
sv_context->NewSuperVersion();
|
|
|
|
}
|
Fix a race in ColumnFamilyData::UnrefAndTryDelete (#8605)
Summary:
The `ColumnFamilyData::UnrefAndTryDelete` code currently on the trunk
unlocks the DB mutex before destroying the `ThreadLocalPtr` holding
the per-thread `SuperVersion` pointers when the only remaining reference
is the back reference from `super_version_`. The idea behind this was to
break the circular dependency between `ColumnFamilyData` and `SuperVersion`:
when the penultimate reference goes away, `ColumnFamilyData` can clean up
the `SuperVersion`, which can in turn clean up `ColumnFamilyData`. (Assuming there
is a `SuperVersion` and it is not referenced by anything else.) However,
unlocking the mutex throws a wrench in this plan by making it possible for another thread
to jump in and take another reference to the `ColumnFamilyData`, keeping the
object alive in a zombie `ThreadLocalPtr`-less state. This can cause issues like
https://github.com/facebook/rocksdb/issues/8440 ,
https://github.com/facebook/rocksdb/issues/8382 ,
and might also explain the `was_last_ref` assertion failures from the `ColumnFamilySet`
destructor we sometimes observe during close in our stress tests.
Digging through the archives, this unlocking goes way back to 2014 (or earlier). The original
rationale was that `SuperVersionUnrefHandle` used to lock the mutex so it can call
`SuperVersion::Cleanup`; however, this logic turned out to be deadlock-prone.
https://github.com/facebook/rocksdb/pull/3510 fixed the deadlock but left the
unlocking in place. https://github.com/facebook/rocksdb/pull/6147 then introduced
the circular dependency and associated cleanup logic described above (in order
to enable iterators to keep the `ColumnFamilyData` for dropped column families alive),
and moved the unlocking-relocking snippet to its present location in `UnrefAndTryDelete`.
Finally, https://github.com/facebook/rocksdb/pull/7749 fixed a memory leak but
apparently exacerbated the race by (otherwise correctly) switching to `UnrefAndTryDelete`
in `SuperVersion::Cleanup`.
The patch simply eliminates the unlocking and relocking, which has been unnecessary
ever since https://github.com/facebook/rocksdb/issues/3510 made `SuperVersionUnrefHandle` lock-free.
This closes the window during which another thread could increase the reference count,
and hopefully fixes the issues above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8605
Test Plan: Ran `make check` and stress tests locally.
Reviewed By: pdillinger
Differential Revision: D30051035
Pulled By: ltamasi
fbshipit-source-id: 8fe559e4b4ad69fc142579f8bc393ef525918528
2021-08-03 01:10:57 +00:00
|
|
|
cfd->InstallSuperVersion(sv_context, mutable_cf_options);
|
2017-04-06 00:14:05 +00:00
|
|
|
|
2019-03-26 02:14:04 +00:00
|
|
|
// There may be a small data race here. The snapshot tricking bottommost
|
|
|
|
// compaction may already be released here. But assuming there will always be
|
|
|
|
// newer snapshot created and released frequently, the compaction will be
|
|
|
|
// triggered soon anyway.
|
|
|
|
bottommost_files_mark_threshold_ = kMaxSequenceNumber;
|
|
|
|
for (auto* my_cfd : *versions_->GetColumnFamilySet()) {
|
2022-10-05 16:27:14 +00:00
|
|
|
if (!my_cfd->ioptions()->allow_ingest_behind) {
|
|
|
|
bottommost_files_mark_threshold_ = std::min(
|
|
|
|
bottommost_files_mark_threshold_,
|
|
|
|
my_cfd->current()->storage_info()->bottommost_files_mark_threshold());
|
|
|
|
}
|
2019-03-26 02:14:04 +00:00
|
|
|
}
|
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
// Whenever we install new SuperVersion, we might need to issue new flushes or
|
|
|
|
// compactions.
|
|
|
|
SchedulePendingCompaction(cfd);
|
|
|
|
MaybeScheduleFlushOrCompaction();
|
|
|
|
|
|
|
|
// Update max_total_in_memory_state_
|
2018-04-13 00:55:14 +00:00
|
|
|
max_total_in_memory_state_ = max_total_in_memory_state_ - old_memtable_size +
|
|
|
|
mutable_cf_options.write_buffer_size *
|
|
|
|
mutable_cf_options.max_write_buffer_number;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
2017-10-06 17:26:38 +00:00
|
|
|
|
2018-03-28 17:23:31 +00:00
|
|
|
// ShouldPurge is called by FindObsoleteFiles when doing a full scan,
|
2019-09-17 23:43:07 +00:00
|
|
|
// and db mutex (mutex_) should already be held.
|
2018-03-28 17:23:31 +00:00
|
|
|
// Actually, the current implementation of FindObsoleteFiles with
|
|
|
|
// full_scan=true can issue I/O requests to obtain list of files in
|
|
|
|
// directories, e.g. env_->getChildren while holding db mutex.
|
|
|
|
bool DBImpl::ShouldPurge(uint64_t file_number) const {
|
2019-09-17 23:43:07 +00:00
|
|
|
return files_grabbed_for_purge_.find(file_number) ==
|
|
|
|
files_grabbed_for_purge_.end() &&
|
|
|
|
purge_files_.find(file_number) == purge_files_.end();
|
2018-03-28 17:23:31 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
// MarkAsGrabbedForPurge is called by FindObsoleteFiles, and db mutex
|
|
|
|
// (mutex_) should already be held.
|
|
|
|
void DBImpl::MarkAsGrabbedForPurge(uint64_t file_number) {
|
2019-09-17 23:43:07 +00:00
|
|
|
files_grabbed_for_purge_.insert(file_number);
|
2018-03-28 17:23:31 +00:00
|
|
|
}
|
|
|
|
|
2017-10-06 17:26:38 +00:00
|
|
|
void DBImpl::SetSnapshotChecker(SnapshotChecker* snapshot_checker) {
|
|
|
|
InstrumentedMutexLock l(&mutex_);
|
|
|
|
// snapshot_checker_ should only set once. If we need to set it multiple
|
|
|
|
// times, we need to make sure the old one is not deleted while it is still
|
|
|
|
// using by a compaction job.
|
|
|
|
assert(!snapshot_checker_);
|
|
|
|
snapshot_checker_.reset(snapshot_checker);
|
|
|
|
}
|
2019-01-16 05:32:15 +00:00
|
|
|
|
|
|
|
void DBImpl::GetSnapshotContext(
|
|
|
|
JobContext* job_context, std::vector<SequenceNumber>* snapshot_seqs,
|
|
|
|
SequenceNumber* earliest_write_conflict_snapshot,
|
|
|
|
SnapshotChecker** snapshot_checker_ptr) {
|
|
|
|
mutex_.AssertHeld();
|
|
|
|
assert(job_context != nullptr);
|
|
|
|
assert(snapshot_seqs != nullptr);
|
|
|
|
assert(earliest_write_conflict_snapshot != nullptr);
|
|
|
|
assert(snapshot_checker_ptr != nullptr);
|
|
|
|
|
|
|
|
*snapshot_checker_ptr = snapshot_checker_.get();
|
|
|
|
if (use_custom_gc_ && *snapshot_checker_ptr == nullptr) {
|
|
|
|
*snapshot_checker_ptr = DisableGCSnapshotChecker::Instance();
|
|
|
|
}
|
|
|
|
if (*snapshot_checker_ptr != nullptr) {
|
|
|
|
// If snapshot_checker is used, that means the flush/compaction may
|
|
|
|
// contain values not visible to snapshot taken after
|
|
|
|
// flush/compaction job starts. Take a snapshot and it will appear
|
|
|
|
// in snapshot_seqs and force compaction iterator to consider such
|
|
|
|
// snapshots.
|
|
|
|
const Snapshot* job_snapshot =
|
|
|
|
GetSnapshotImpl(false /*write_conflict_boundary*/, false /*lock*/);
|
|
|
|
job_context->job_snapshot.reset(new ManagedSnapshot(this, job_snapshot));
|
|
|
|
}
|
|
|
|
*snapshot_seqs = snapshots_.GetAll(earliest_write_conflict_snapshot);
|
|
|
|
}
|
2022-02-01 17:00:46 +00:00
|
|
|
|
2023-05-26 00:25:51 +00:00
|
|
|
Status DBImpl::WaitForCompact(
|
|
|
|
const WaitForCompactOptions& wait_for_compact_options) {
|
2022-02-01 17:00:46 +00:00
|
|
|
InstrumentedMutexLock l(&mutex_);
|
2023-05-31 19:53:51 +00:00
|
|
|
if (wait_for_compact_options.flush) {
|
|
|
|
Status s = DBImpl::FlushAllColumnFamilies(FlushOptions(),
|
|
|
|
FlushReason::kManualFlush);
|
|
|
|
if (!s.ok()) {
|
|
|
|
return s;
|
|
|
|
}
|
2023-08-11 19:30:48 +00:00
|
|
|
} else if (wait_for_compact_options.close_db &&
|
|
|
|
has_unpersisted_data_.load(std::memory_order_relaxed) &&
|
|
|
|
!mutable_db_options_.avoid_flush_during_shutdown) {
|
|
|
|
Status s =
|
|
|
|
DBImpl::FlushAllColumnFamilies(FlushOptions(), FlushReason::kShutDown);
|
|
|
|
if (!s.ok()) {
|
|
|
|
return s;
|
|
|
|
}
|
2023-05-31 19:53:51 +00:00
|
|
|
}
|
|
|
|
TEST_SYNC_POINT("DBImpl::WaitForCompact:StartWaiting");
|
2023-08-18 18:21:45 +00:00
|
|
|
const auto deadline = immutable_db_options_.clock->NowMicros() +
|
|
|
|
wait_for_compact_options.timeout.count();
|
Remove wait_unscheduled from waitForCompact internal API (#11443)
Summary:
Context:
In pull request https://github.com/facebook/rocksdb/issues/11436, we are introducing a new public API `waitForCompact(const WaitForCompactOptions& wait_for_compact_options)`. This API invokes the internal implementation `waitForCompact(bool wait_unscheduled=false)`. The unscheduled parameter indicates the compactions that are not yet scheduled but are required to process items in the queue.
In certain cases, we are unable to wait for compactions, such as during a shutdown or when background jobs are paused. It is important to return the appropriate status in these scenarios. For all other cases, we should wait for all compaction and flush jobs, including the unscheduled ones. The primary purpose of this new API is to wait until the system has resolved its compaction debt. Currently, the usage of `wait_unscheduled` is limited to test code.
This pull request eliminates the usage of wait_unscheduled. The internal `waitForCompact()` API now waits for unscheduled compactions unless the db is undergoing a shutdown. In the event of a shutdown, the API returns `Status::ShutdownInProgress()`.
Additionally, a new parameter, `abort_on_pause`, has been introduced with a default value of `false`. This parameter addresses the possibility of waiting indefinitely for unscheduled jobs if `PauseBackgroundWork()` was called before `waitForCompact()` is invoked. By setting `abort_on_pause` to `true`, the API will immediately return `Status::Aborted`.
Furthermore, all tests that previously called `waitForCompact(true)` have been fixed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11443
Test Plan:
Existing tests that involve a shutdown in progress:
- DBCompactionTest::CompactRangeShutdownWhileDelayed
- DBTestWithParam::PreShutdownMultipleCompaction
- DBTestWithParam::PreShutdownCompactionMiddle
Reviewed By: pdillinger
Differential Revision: D45923426
Pulled By: jaykorean
fbshipit-source-id: 7dc93fe6a6841a7d9d2d72866fa647090dba8eae
2023-05-18 01:13:50 +00:00
|
|
|
for (;;) {
|
|
|
|
if (shutting_down_.load(std::memory_order_acquire)) {
|
|
|
|
return Status::ShutdownInProgress();
|
|
|
|
}
|
2023-05-26 00:25:51 +00:00
|
|
|
if (bg_work_paused_ && wait_for_compact_options.abort_on_pause) {
|
Remove wait_unscheduled from waitForCompact internal API (#11443)
Summary:
Context:
In pull request https://github.com/facebook/rocksdb/issues/11436, we are introducing a new public API `waitForCompact(const WaitForCompactOptions& wait_for_compact_options)`. This API invokes the internal implementation `waitForCompact(bool wait_unscheduled=false)`. The unscheduled parameter indicates the compactions that are not yet scheduled but are required to process items in the queue.
In certain cases, we are unable to wait for compactions, such as during a shutdown or when background jobs are paused. It is important to return the appropriate status in these scenarios. For all other cases, we should wait for all compaction and flush jobs, including the unscheduled ones. The primary purpose of this new API is to wait until the system has resolved its compaction debt. Currently, the usage of `wait_unscheduled` is limited to test code.
This pull request eliminates the usage of wait_unscheduled. The internal `waitForCompact()` API now waits for unscheduled compactions unless the db is undergoing a shutdown. In the event of a shutdown, the API returns `Status::ShutdownInProgress()`.
Additionally, a new parameter, `abort_on_pause`, has been introduced with a default value of `false`. This parameter addresses the possibility of waiting indefinitely for unscheduled jobs if `PauseBackgroundWork()` was called before `waitForCompact()` is invoked. By setting `abort_on_pause` to `true`, the API will immediately return `Status::Aborted`.
Furthermore, all tests that previously called `waitForCompact(true)` have been fixed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11443
Test Plan:
Existing tests that involve a shutdown in progress:
- DBCompactionTest::CompactRangeShutdownWhileDelayed
- DBTestWithParam::PreShutdownMultipleCompaction
- DBTestWithParam::PreShutdownCompactionMiddle
Reviewed By: pdillinger
Differential Revision: D45923426
Pulled By: jaykorean
fbshipit-source-id: 7dc93fe6a6841a7d9d2d72866fa647090dba8eae
2023-05-18 01:13:50 +00:00
|
|
|
return Status::Aborted();
|
|
|
|
}
|
|
|
|
if ((bg_bottom_compaction_scheduled_ || bg_compaction_scheduled_ ||
|
|
|
|
bg_flush_scheduled_ || unscheduled_compactions_ ||
|
2023-08-11 19:30:48 +00:00
|
|
|
unscheduled_flushes_ || error_handler_.IsRecoveryInProgress()) &&
|
Remove wait_unscheduled from waitForCompact internal API (#11443)
Summary:
Context:
In pull request https://github.com/facebook/rocksdb/issues/11436, we are introducing a new public API `waitForCompact(const WaitForCompactOptions& wait_for_compact_options)`. This API invokes the internal implementation `waitForCompact(bool wait_unscheduled=false)`. The unscheduled parameter indicates the compactions that are not yet scheduled but are required to process items in the queue.
In certain cases, we are unable to wait for compactions, such as during a shutdown or when background jobs are paused. It is important to return the appropriate status in these scenarios. For all other cases, we should wait for all compaction and flush jobs, including the unscheduled ones. The primary purpose of this new API is to wait until the system has resolved its compaction debt. Currently, the usage of `wait_unscheduled` is limited to test code.
This pull request eliminates the usage of wait_unscheduled. The internal `waitForCompact()` API now waits for unscheduled compactions unless the db is undergoing a shutdown. In the event of a shutdown, the API returns `Status::ShutdownInProgress()`.
Additionally, a new parameter, `abort_on_pause`, has been introduced with a default value of `false`. This parameter addresses the possibility of waiting indefinitely for unscheduled jobs if `PauseBackgroundWork()` was called before `waitForCompact()` is invoked. By setting `abort_on_pause` to `true`, the API will immediately return `Status::Aborted`.
Furthermore, all tests that previously called `waitForCompact(true)` have been fixed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11443
Test Plan:
Existing tests that involve a shutdown in progress:
- DBCompactionTest::CompactRangeShutdownWhileDelayed
- DBTestWithParam::PreShutdownMultipleCompaction
- DBTestWithParam::PreShutdownCompactionMiddle
Reviewed By: pdillinger
Differential Revision: D45923426
Pulled By: jaykorean
fbshipit-source-id: 7dc93fe6a6841a7d9d2d72866fa647090dba8eae
2023-05-18 01:13:50 +00:00
|
|
|
(error_handler_.GetBGError().ok())) {
|
2023-08-18 18:21:45 +00:00
|
|
|
if (wait_for_compact_options.timeout.count()) {
|
|
|
|
if (bg_cv_.TimedWait(deadline)) {
|
|
|
|
return Status::TimedOut();
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
bg_cv_.Wait();
|
|
|
|
}
|
2023-08-11 19:30:48 +00:00
|
|
|
} else if (wait_for_compact_options.close_db) {
|
|
|
|
reject_new_background_jobs_ = true;
|
|
|
|
mutex_.Unlock();
|
|
|
|
Status s = Close();
|
|
|
|
mutex_.Lock();
|
|
|
|
if (!s.ok()) {
|
|
|
|
reject_new_background_jobs_ = false;
|
|
|
|
}
|
|
|
|
return s;
|
Remove wait_unscheduled from waitForCompact internal API (#11443)
Summary:
Context:
In pull request https://github.com/facebook/rocksdb/issues/11436, we are introducing a new public API `waitForCompact(const WaitForCompactOptions& wait_for_compact_options)`. This API invokes the internal implementation `waitForCompact(bool wait_unscheduled=false)`. The unscheduled parameter indicates the compactions that are not yet scheduled but are required to process items in the queue.
In certain cases, we are unable to wait for compactions, such as during a shutdown or when background jobs are paused. It is important to return the appropriate status in these scenarios. For all other cases, we should wait for all compaction and flush jobs, including the unscheduled ones. The primary purpose of this new API is to wait until the system has resolved its compaction debt. Currently, the usage of `wait_unscheduled` is limited to test code.
This pull request eliminates the usage of wait_unscheduled. The internal `waitForCompact()` API now waits for unscheduled compactions unless the db is undergoing a shutdown. In the event of a shutdown, the API returns `Status::ShutdownInProgress()`.
Additionally, a new parameter, `abort_on_pause`, has been introduced with a default value of `false`. This parameter addresses the possibility of waiting indefinitely for unscheduled jobs if `PauseBackgroundWork()` was called before `waitForCompact()` is invoked. By setting `abort_on_pause` to `true`, the API will immediately return `Status::Aborted`.
Furthermore, all tests that previously called `waitForCompact(true)` have been fixed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11443
Test Plan:
Existing tests that involve a shutdown in progress:
- DBCompactionTest::CompactRangeShutdownWhileDelayed
- DBTestWithParam::PreShutdownMultipleCompaction
- DBTestWithParam::PreShutdownCompactionMiddle
Reviewed By: pdillinger
Differential Revision: D45923426
Pulled By: jaykorean
fbshipit-source-id: 7dc93fe6a6841a7d9d2d72866fa647090dba8eae
2023-05-18 01:13:50 +00:00
|
|
|
} else {
|
|
|
|
return error_handler_.GetBGError();
|
|
|
|
}
|
2022-02-01 17:00:46 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
} // namespace ROCKSDB_NAMESPACE
|