// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
Push- instead of pull-model for managing Write stalls
Summary:
Introducing WriteController, which is the source of truth about per-DB write delays. Let's define a DB epoch as a period with no flushes or compactions (i.e. a new epoch starts when a flush or compaction finishes). Each epoch can either:
* proceed with all writes without delay
* delay all writes by a fixed time
* stop all writes
The three modes are recomputed at each epoch change (flush, compaction), rather than on every write (which is currently the case).
When we have a lot of column families, the current pull behavior adds a big overhead, since we need to loop over every column family for every write. With the new push model, the overhead on the Write code path is minimal.
This is just the start. The next step is to also take care of stalls introduced by slow memtable flushes. The final goal is to eliminate MakeRoomForWrite(), which currently needs to be called for every column family on every write.
Test Plan: make check for now. I'll add some unit tests later. Also, a perf test.
Reviewers: dhruba, yhchiang, MarkCallaghan, sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22791
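
As a rough illustration of the token model described above (not part of this file; the scheduler function name and trigger values are hypothetical stand-ins), a per-DB scheduler could recompute the stall mode at each epoch change roughly like this, holding the returned token until the next epoch change:

// Hypothetical sketch: pick the stall mode for the new epoch using the token
// API defined below (assumes the headers included below).
void RecalculateWriteStallCondition(
    WriteController* write_controller, int num_l0_files,
    std::unique_ptr<WriteControllerToken>* token) {
  const int kStopTrigger = 36;      // hypothetical L0 stop threshold
  const int kSlowdownTrigger = 20;  // hypothetical L0 slowdown threshold
  if (num_l0_files >= kStopTrigger) {
    *token = write_controller->GetStopToken();
  } else if (num_l0_files >= kSlowdownTrigger) {
    *token = write_controller->GetDelayToken(16 << 20 /* 16MB/s */);
  } else {
    // Dropping the token (running its destructor) re-enables full-speed
    // writes for this epoch.
    token->reset();
  }
}
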
#include "db/write_controller.h"
|
|
|
|
|
2015-05-15 22:52:51 +00:00
|
|
|
#include <atomic>
|
Push- instead of pull-model for managing Write stalls
Summary:
Introducing WriteController, which is a source of truth about per-DB write delays. Let's define an DB epoch as a period where there are no flushes and compactions (i.e. new epoch is started when flush or compaction finishes). Each epoch can either:
* proceed with all writes without delay
* delay all writes by fixed time
* stop all writes
The three modes are recomputed at each epoch change (flush, compaction), rather than on every write (which is currently the case).
When we have a lot of column families, our current pull behavior adds a big overhead, since we need to loop over every column family for every write. With new push model, overhead on Write code-path is minimal.
This is just the start. Next step is to also take care of stalls introduced by slow memtable flushes. The final goal is to eliminate function MakeRoomForWrite(), which currently needs to be called for every column family by every write.
Test Plan: make check for now. I'll add some unit tests later. Also, perf test.
Reviewers: dhruba, yhchiang, MarkCallaghan, sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22791
2014-09-08 18:20:25 +00:00
|
|
|
#include <cassert>
|
2017-02-15 02:15:05 +00:00
|
|
|
#include <ratio>
|
2021-01-26 06:07:26 +00:00
|
|
|
|
|
|
|
#include "rocksdb/system_clock.h"
|
Push- instead of pull-model for managing Write stalls
Summary:
Introducing WriteController, which is a source of truth about per-DB write delays. Let's define an DB epoch as a period where there are no flushes and compactions (i.e. new epoch is started when flush or compaction finishes). Each epoch can either:
* proceed with all writes without delay
* delay all writes by fixed time
* stop all writes
The three modes are recomputed at each epoch change (flush, compaction), rather than on every write (which is currently the case).
When we have a lot of column families, our current pull behavior adds a big overhead, since we need to loop over every column family for every write. With new push model, overhead on Write code-path is minimal.
This is just the start. Next step is to also take care of stalls introduced by slow memtable flushes. The final goal is to eliminate function MakeRoomForWrite(), which currently needs to be called for every column family by every write.
Test Plan: make check for now. I'll add some unit tests later. Also, perf test.
Reviewers: dhruba, yhchiang, MarkCallaghan, sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22791
2014-09-08 18:20:25 +00:00
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
namespace ROCKSDB_NAMESPACE {
|
Push- instead of pull-model for managing Write stalls
Summary:
Introducing WriteController, which is a source of truth about per-DB write delays. Let's define an DB epoch as a period where there are no flushes and compactions (i.e. new epoch is started when flush or compaction finishes). Each epoch can either:
* proceed with all writes without delay
* delay all writes by fixed time
* stop all writes
The three modes are recomputed at each epoch change (flush, compaction), rather than on every write (which is currently the case).
When we have a lot of column families, our current pull behavior adds a big overhead, since we need to loop over every column family for every write. With new push model, overhead on Write code-path is minimal.
This is just the start. Next step is to also take care of stalls introduced by slow memtable flushes. The final goal is to eliminate function MakeRoomForWrite(), which currently needs to be called for every column family by every write.
Test Plan: make check for now. I'll add some unit tests later. Also, perf test.
Reviewers: dhruba, yhchiang, MarkCallaghan, sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22791
2014-09-08 18:20:25 +00:00
|
|
|
|
|
|
|
std::unique_ptr<WriteControllerToken> WriteController::GetStopToken() {
|
|
|
|
++total_stopped_;
|
|
|
|
return std::unique_ptr<WriteControllerToken>(new StopWriteToken(this));
|
|
|
|
}
|
|
|
|
|
When slowdown is triggered, reduce the write rate
Summary: It's usually hard for users to pick a good value for options.delayed_write_rate. With this diff, once the slowdown condition triggers, we greedily reduce the write rate if the estimated pending compaction bytes increase, and increase the write rate if they drop.
Test Plan:
Add a unit test.
Test with the db_bench setting:
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -num=10000000 --soft_pending_compaction_bytes_limit=1000000000 --hard_pending_compaction_bytes_limit=3000000000 --delayed_write_rate=100000000
and verify that without the commit a write stop happens, but with the commit it does not.
Reviewers: igor, anthony, rven, yhchiang, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52131
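
A minimal sketch of the greedy adjustment described above (the helper name and step factor are hypothetical; the real recalculation lives in the column-family stall logic, not in this file). The resulting rate would then be passed to GetDelayToken():

// Hypothetical sketch: nudge the delayed write rate based on the trend of
// estimated pending compaction bytes.
uint64_t AdjustDelayedWriteRate(uint64_t current_rate,
                                uint64_t prev_pending_bytes,
                                uint64_t cur_pending_bytes) {
  const double kAdjustFactor = 1.2;  // hypothetical step size
  if (cur_pending_bytes > prev_pending_bytes) {
    // Compaction is falling further behind: slow writers down.
    return static_cast<uint64_t>(current_rate / kAdjustFactor);
  }
  if (cur_pending_bytes < prev_pending_bytes) {
    // Compaction is catching up: let writers go a bit faster.
    return static_cast<uint64_t>(current_rate * kAdjustFactor);
  }
  return current_rate;
}
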
std::unique_ptr<WriteControllerToken> WriteController::GetDelayToken(
    uint64_t write_rate) {
  total_delayed_++;
  // Reset counters.
  last_refill_time_ = 0;
  bytes_left_ = 0;
  set_delayed_write_rate(write_rate);
  return std::unique_ptr<WriteControllerToken>(new DelayWriteToken(this));
}

Add options.base_background_compactions as the number of compaction threads for low compaction debt
Summary:
If options.base_background_compactions is given, we try to schedule no more than this number of compactions. Only when the number of L0 files grows past a certain level, or pending compaction bytes exceed a certain threshold, do we schedule compactions based on options.max_background_compactions.
The watermarks are calculated based on the slowdown thresholds.
Test Plan:
Add new test cases in column_family_test.
Add more unit tests.
Reviewers: IslamAbdelRahman, yhchiang, kradhakrishnan, rven, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D53409
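
For illustration only (the helper name is hypothetical, and the watermarks would be derived from the slowdown thresholds as described above), the scheduling decision can be sketched as:

// Hypothetical sketch: stay at the lower base limit until compaction debt
// approaches the slowdown watermarks, then allow the full limit.
int ChooseCompactionLimit(int base_background_compactions,
                          int max_background_compactions, int num_l0_files,
                          uint64_t pending_compaction_bytes,
                          int l0_watermark, uint64_t pending_bytes_watermark) {
  if (num_l0_files >= l0_watermark ||
      pending_compaction_bytes >= pending_bytes_watermark) {
    return max_background_compactions;
  }
  return base_background_compactions;
}
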
std::unique_ptr<WriteControllerToken>
WriteController::GetCompactionPressureToken() {
  ++total_compaction_pressure_;
  return std::unique_ptr<WriteControllerToken>(
      new CompactionPressureToken(this));
}

bool WriteController::IsStopped() const {
  return total_stopped_.load(std::memory_order_relaxed) > 0;
}
// This is called while holding the DB mutex, so we can't sleep here and need
// to minimize how often we query the time.
// If it turns out to be a performance issue, we can redesign the thread
// synchronization model here.
// The function trusts that the caller will sleep for the number of
// microseconds it returns.
uint64_t WriteController::GetDelay(SystemClock* clock, uint64_t num_bytes) {
  if (total_stopped_.load(std::memory_order_relaxed) > 0) {
    return 0;
  }
  if (total_delayed_.load(std::memory_order_relaxed) == 0) {
    return 0;
  }

  const uint64_t kMicrosPerSecond = 1000000;
  const uint64_t kRefillInterval = 1024U;  // microseconds (~1ms)

  if (bytes_left_ >= num_bytes) {
    bytes_left_ -= num_bytes;
    return 0;
  }
  // Querying the time while holding the DB mutex happens less than once per
  // refill interval.
  auto time_now = NowMicrosMonotonic(clock);

  uint64_t sleep_debt = 0;
  uint64_t time_since_last_refill = 0;
  if (last_refill_time_ != 0) {
    if (last_refill_time_ > time_now) {
      sleep_debt = last_refill_time_ - time_now;
    } else {
      time_since_last_refill = time_now - last_refill_time_;
      bytes_left_ +=
          static_cast<uint64_t>(static_cast<double>(time_since_last_refill) /
                                kMicrosPerSecond * delayed_write_rate_);
      if (time_since_last_refill >= kRefillInterval &&
          bytes_left_ > num_bytes) {
        // If the refill interval has already passed and we have enough bytes,
        // return without any extra sleeping.
        last_refill_time_ = time_now;
        bytes_left_ -= num_bytes;
        return 0;
      }
    }
  }

  uint64_t single_refill_amount =
      delayed_write_rate_ * kRefillInterval / kMicrosPerSecond;
  if (bytes_left_ + single_refill_amount >= num_bytes) {
    // Wait for one refill interval. Never set the expiration to less than one
    // refill interval, to avoid querying the time too often.
    bytes_left_ = bytes_left_ + single_refill_amount - num_bytes;
    last_refill_time_ = time_now + kRefillInterval;
    return kRefillInterval + sleep_debt;
  }

  // We need to refill for more than one interval, so the sleep must be longer.
  // Sleep just until `num_bytes` is allowed.
  uint64_t sleep_amount =
      static_cast<uint64_t>(num_bytes /
                            static_cast<long double>(delayed_write_rate_) *
                            kMicrosPerSecond) +
      sleep_debt;
  last_refill_time_ = time_now + sleep_amount;
  return sleep_amount;
}
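
For context, a hypothetical caller-side sketch (not part of this file): the write path calls GetDelay() while holding the DB mutex and then sleeps for the returned number of microseconds, typically after releasing the mutex. For example, with delayed_write_rate_ at 16MB/s and an exhausted byte budget, a 1MB write would be told to sleep roughly 1MB / 16MB * 1,000,000 = 62,500 microseconds, plus any accumulated sleep debt.

// Hypothetical sketch of honoring the returned delay; the helper name is
// illustrative, and the real write path manages the DB mutex around the sleep.
void DelayWriteIfNeeded(WriteController* write_controller, SystemClock* clock,
                        uint64_t num_bytes) {
  uint64_t delay_us = write_controller->GetDelay(clock, num_bytes);
  if (delay_us > 0) {
    clock->SleepForMicroseconds(static_cast<int>(delay_us));
  }
}
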
uint64_t WriteController::NowMicrosMonotonic(SystemClock* clock) {
  // NowNanos() returns nanoseconds; dividing by std::milli::den (1000)
  // converts to microseconds.
  return clock->NowNanos() / std::milli::den;
}

StopWriteToken::~StopWriteToken() {
  assert(controller_->total_stopped_ >= 1);
  --controller_->total_stopped_;
}

DelayWriteToken::~DelayWriteToken() {
  controller_->total_delayed_--;
  assert(controller_->total_delayed_.load() >= 0);
}

CompactionPressureToken::~CompactionPressureToken() {
  controller_->total_compaction_pressure_--;
  assert(controller_->total_compaction_pressure_ >= 0);
}

}  // namespace ROCKSDB_NAMESPACE