2017-04-06 00:14:05 +00:00
|
|
|
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
|
2017-07-15 23:03:42 +00:00
|
|
|
// This source code is licensed under both the GPLv2 (found in the
|
|
|
|
// COPYING file in the root directory) and Apache 2.0 License
|
|
|
|
// (found in the LICENSE.Apache file in the root directory).
|
2017-04-06 00:14:05 +00:00
|
|
|
//
|
|
|
|
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
|
|
|
|
// Use of this source code is governed by a BSD-style license that can be
|
|
|
|
// found in the LICENSE file. See the AUTHORS file for names of contributors.
|
2019-06-06 20:52:39 +00:00
|
|
|
#include <cinttypes>
|
2018-05-31 19:53:43 +00:00
|
|
|
#include <set>
|
2018-03-28 17:23:31 +00:00
|
|
|
#include <unordered_set>
|
2020-11-17 23:54:49 +00:00
|
|
|
|
|
|
|
#include "db/db_impl/db_impl.h"
|
2017-04-06 00:14:05 +00:00
|
|
|
#include "db/event_helpers.h"
|
Skip deleted WALs during recovery
Summary:
This patch records the min log number to keep in the manifest while flushing SST files, so that deleted WALs, and any WAL older than them, are ignored during recovery. This avoids scenarios where there is a gap between the WAL files fed to the recovery procedure. Such a gap could be caused by, for example, out-of-order WAL deletion. It could cause problems in 2PC recovery, where the prepare and commit entries are placed into two separate WALs: a gap in the WALs could result in the WAL with the commit entry not being processed, breaking the 2PC recovery logic.
Before the commit, for the 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), and record this information in the manifest entry for the newly flushed SST file. This precomputed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until the next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. That is not done yet. Even if we did it, the only thing we lose with this new approach is earlier log deletion between two flushes, which is not guaranteed to happen anyway because the obsolete file clean-up function is only executed after flush or compaction.)
This min log number to keep is stored in the manifest using the safely-ignorable customized field of the AddFile entry, in order to guarantee that a DB generated by a newer release can be opened by previous releases no older than 4.2.
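The flush-time precomputation described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the RocksDB implementation: `MemTableInfo`, `PrecomputeMinLogToKeep`, and their fields are hypothetical names, and the value 0 is assumed to mean "no such log".

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical, simplified view of a memtable for this sketch.
struct MemTableInfo {
  // Earliest WAL holding a prepare entry whose commit or abort lives in this
  // memtable; 0 means the memtable references no prepared section.
  uint64_t min_prep_log;
  // True if this memtable is part of the flush being installed.
  bool being_flushed;
};

// Returns the earliest WAL that must survive once the current flush finishes.
// `min_log_with_unflushed_data` is the earliest WAL still backing unflushed
// data after the flush; `min_log_with_outstanding_prep` is the earliest WAL
// with a prepare entry that has no commit or abort yet (0 if none).
uint64_t PrecomputeMinLogToKeep(uint64_t min_log_with_unflushed_data,
                                uint64_t min_log_with_outstanding_prep,
                                const std::vector<MemTableInfo>& memtables) {
  uint64_t min_log = min_log_with_unflushed_data;
  if (min_log_with_outstanding_prep != 0) {
    min_log = std::min(min_log, min_log_with_outstanding_prep);
  }
  for (const MemTableInfo& mt : memtables) {
    // Memtables being flushed are skipped: their contents will be in the SST
    // file, so they no longer pin any WAL.
    if (mt.being_flushed || mt.min_prep_log == 0) {
      continue;
    }
    min_log = std::min(min_log, mt.min_prep_log);
  }
  return min_log;
}
```

The result would be both written alongside the new SST file's manifest entry and kept in memory for later WAL deletion decisions.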
Closes https://github.com/facebook/rocksdb/pull/3765
Differential Revision: D7747618
Pulled By: siying
fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
2018-05-03 22:35:11 +00:00
|
|
|
#include "db/memtable_list.h"
|
2019-05-30 03:44:08 +00:00
|
|
|
#include "file/file_util.h"
|
2020-03-21 02:17:54 +00:00
|
|
|
#include "file/filename.h"
|
2019-05-30 03:44:08 +00:00
|
|
|
#include "file/sst_file_manager_impl.h"
|
2021-09-29 11:01:57 +00:00
|
|
|
#include "logging/logging.h"
|
2020-11-17 23:54:49 +00:00
|
|
|
#include "port/port.h"
|
Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactoring to make the implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN included files that failed to sync and bytes that failed to be written.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOptions so BuildTable() does not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOptions::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
Compaction, db open
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher because each Append() counts as one post-PR, while pre-PR, 3 Append()s counted as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stays the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stays the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher because each Append() counts as one post-PR, while pre-PR, 3 Append()s counted as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stays the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stays the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20 ns, regressed 1.85%. "ManualCompaction" includes flush, so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
|
|
|
#include "rocksdb/options.h"
|
2019-09-17 23:43:07 +00:00
|
|
|
#include "util/autovector.h"
|
2022-06-24 01:32:25 +00:00
|
|
|
#include "util/defer.h"
|
2017-04-06 00:14:05 +00:00
|
|
|
|
2020-02-20 20:07:53 +00:00
|
|
|
namespace ROCKSDB_NAMESPACE {
|
2018-11-06 03:28:21 +00:00
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
uint64_t DBImpl::MinLogNumberToKeep() {
|
Fix a race condition in WAL tracking causing DB open failure (#9715)
Summary:
There is a race condition if WAL tracking in the MANIFEST is enabled in a database that disables 2PC.
The race condition is between two background flush threads trying to install flush results to the MANIFEST.
Consider an example database with two column families: "default" (cfd0) and "cf1" (cfd1). Initially,
both column families have one mutable (active) memtable whose data is backed by 6.log.
1. Trigger a manual flush for "cf1", creating a 7.log
2. Insert another key to "default", and trigger flush for "default", creating 8.log
3. BgFlushThread1 finishes writing 9.sst
4. BgFlushThread2 finishes writing 10.sst
```
Time BgFlushThread1 BgFlushThread2
| mutex_.Lock()
| precompute min_wal_to_keep as 6
| mutex_.Unlock()
| mutex_.Lock()
| precompute min_wal_to_keep as 6
| join MANIFEST write queue and mutex_.Unlock()
| write to MANIFEST
| mutex_.Lock()
| cfd1->log_number = 7
| Signal bg_flush_2 and mutex_.Unlock()
| wake up and mutex_.Lock()
| cfd0->log_number = 8
| FindObsoleteFiles() with job_context->log_number == 7
| mutex_.Unlock()
| PurgeObsoleteFiles() deletes 6.log
V
```
As shown in the above, BgFlushThread2 thinks that the min wal to keep is 6.log because "cf1" has unflushed data in 6.log (cf1.log_number=6).
Similarly, BgFlushThread1 thinks that the min wal to keep is also 6.log because "default" has unflushed data (default.log_number=6).
No WAL deletion will be written to MANIFEST because 6 is equal to `versions_->wals_.min_wal_number_to_keep`,
due to https://github.com/facebook/rocksdb/blob/7.1.fb/db/memtable_list.cc#L513:L514.
The bg flush thread that finishes last will perform file purging. `job_context.log_number` will be evaluated as 7, i.e.
the min wal that contains unflushed data, causing 6.log to be deleted. However, MANIFEST thinks 6.log should still exist.
If you close the db at this point, you won't be able to re-open it if `track_and_verify_wal_in_manifest` is true.
We must handle the case of multiple bg flush threads, and it is difficult for one bg flush thread to know
the correct min wal number until the other bg flush threads have finished committing to the manifest and updated
the `cfd::log_number`.
To fix this issue, we rename an existing variable `min_log_number_to_keep_2pc` to `min_log_number_to_keep`,
and use it to track WAL file deletion in non-2pc mode as well.
This variable is updated only 1) during recovery with mutex held, or 2) in the MANIFEST write thread.
`min_log_number_to_keep` means RocksDB will delete WALs below it, although there may be WALs
above it which are also obsolete. Formally, we will have [min_wal_to_keep, max_obsolete_wal]. During recovery, we
make sure that only WALs above max_obsolete_wal are checked and added back to `alive_log_files_`.
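The role of `min_log_number_to_keep` described above can be sketched as a small tracker. This is an illustration under assumptions, not RocksDB code: `WalRetention` and its methods are hypothetical names, and in RocksDB the value lives in the version set and is updated only during recovery or in the MANIFEST write thread.

```cpp
#include <cstdint>
#include <set>
#include <vector>

class WalRetention {
 public:
  // In this sketch, called only from the single MANIFEST write path, so the
  // value never moves backwards and never runs ahead of what was committed.
  void SetMinLogNumberToKeep(uint64_t number) {
    if (number > min_log_number_to_keep_) {
      min_log_number_to_keep_ = number;
    }
  }

  uint64_t min_log_number_to_keep() const { return min_log_number_to_keep_; }

  // WALs strictly below the committed minimum may be deleted. WALs at or
  // above it are kept, even if some of them are in fact obsolete.
  std::vector<uint64_t> DeletableWals(
      const std::set<uint64_t>& wals_on_disk) const {
    std::vector<uint64_t> deletable;
    for (uint64_t number : wals_on_disk) {  // std::set iterates in ascending order
      if (number >= min_log_number_to_keep_) {
        break;
      }
      deletable.push_back(number);
    }
    return deletable;
  }

 private:
  uint64_t min_log_number_to_keep_ = 0;
};
```

In the race from the timeline above, both flush threads precompute 6, so the committed minimum stays at 6 and 6.log is not deletable, regardless of what either column family's own log number has advanced to.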
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9715
Test Plan:
```
make check
```
Also ran stress test below (with asan) to make sure it completes successfully.
```
TEST_TMPDIR=/dev/shm/rocksdb OPT=-g ASAN_OPTIONS=disable_coredump=0 \
CRASH_TEST_EXT_ARGS=--compression_type=zstd SKIP_FORMAT_BUCK_CHECKS=1 \
make J=52 -j52 blackbox_asan_crash_test
```
Reviewed By: ltamasi
Differential Revision: D34984412
Pulled By: riversand963
fbshipit-source-id: c7b21a8d84751bb55ea79c9f387103d21b231005
2022-03-24 02:41:31 +00:00
|
|
|
return versions_->min_log_number_to_keep();
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
2024-04-29 19:25:00 +00:00
|
|
|
uint64_t DBImpl::MinLogNumberToRecycle() { return min_log_number_to_recycle_; }
|
|
|
|
|
2018-11-06 03:28:21 +00:00
|
|
|
uint64_t DBImpl::MinObsoleteSstNumberToKeep() {
|
|
|
|
mutex_.AssertHeld();
|
|
|
|
if (!pending_outputs_.empty()) {
|
|
|
|
return *pending_outputs_.begin();
|
|
|
|
}
|
|
|
|
return std::numeric_limits<uint64_t>::max();
|
|
|
|
}
|
|
|
|
|
2023-06-13 22:52:45 +00:00
|
|
|
uint64_t DBImpl::GetObsoleteSstFilesSize() {
|
|
|
|
mutex_.AssertHeld();
|
|
|
|
return versions_->GetObsoleteSstFilesSize();
|
|
|
|
}
|
|
|
|
|
First step towards handling MANIFEST write error (#6949)
Summary:
This PR provides preliminary support for handling IO error during MANIFEST write.
File write/sync is not guaranteed to be atomic. If we encounter an IOError while writing/syncing to the MANIFEST file, we cannot be sure about the state of the MANIFEST file. The version edits may or may not have reached the file. During cleanup, if we delete the newly-generated SST files referenced by the pending version edit(s), but the version edit(s) have actually been persisted in the MANIFEST, then the next recovery attempt will process the version edit(s) and then fail since the SST files have already been deleted.
One approach is to truncate the MANIFEST after write/sync error, so that it is safe to delete the SST files. However, file truncation may not be supported on certain file systems. Therefore, we take the following approach.
If an IOError is detected during MANIFEST write/sync, we disable file deletions for the faulty database. Depending on whether the IOError is retryable (set by underlying file system), either RocksDB or the application can call `DB::Resume()`, or simply shut down and restart. During `Resume()`, RocksDB will try to switch to a new MANIFEST and write all existing in-memory version storage to the new file. If this succeeds, then RocksDB may proceed. Once all recovery is completed, file deletions will be re-enabled.
Note that multiple threads can call `LogAndApply()` at the same time, though only one of them will go through the process of writing the MANIFEST, possibly batching the version edits of other threads. When the leading MANIFEST writer finishes, all of the MANIFEST writing threads in this batch will have the same IOError. They will all call `ErrorHandler::SetBGError()`, in which file deletion will be disabled.
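Because several threads may each disable file deletions, a counter rather than a boolean keeps repeated disables safe. The following is a minimal sketch of that idea, assuming each disable is eventually paired with one enable; `FileDeletionGate` and its methods are hypothetical names and not part of RocksDB.

```cpp
#include <mutex>

class FileDeletionGate {
 public:
  // Returns the counter value after this call; 1 means deletions were
  // enabled before the call, anything higher means they were already off.
  int Disable() {
    std::lock_guard<std::mutex> lock(mu_);
    return ++disable_count_;
  }

  // Undoes one Disable(). Returns true if deletions are enabled afterwards.
  // Calling it while already enabled is a no-op.
  bool Enable() {
    std::lock_guard<std::mutex> lock(mu_);
    if (disable_count_ > 0) {
      --disable_count_;
    }
    return disable_count_ == 0;
  }

  bool DeletionsAllowed() const {
    std::lock_guard<std::mutex> lock(mu_);
    return disable_count_ == 0;
  }

 private:
  mutable std::mutex mu_;
  int disable_count_ = 0;
};
```

With this shape, a caller can tell from the value returned by `Disable()` whether it was the one that actually switched deletions off, which is the distinction the two log messages in `DBImpl::DisableFileDeletions()` make.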
Possible future directions:
- Add an `ErrorContext` structure so that it is easier to pass more info to `ErrorHandler`. Currently, as in this example, a new `BackgroundErrorReason` has to be added.
Test plan (dev server):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6949
Reviewed By: anand1976
Differential Revision: D22026020
Pulled By: riversand963
fbshipit-source-id: f3c68a2ef45d9b505d0d625c7c5e0c88495b91c8
2020-06-25 02:05:47 +00:00
|
|
|
Status DBImpl::DisableFileDeletions() {
|
2021-07-29 18:50:00 +00:00
|
|
|
Status s;
|
|
|
|
int my_disable_delete_obsolete_files;
|
|
|
|
{
|
|
|
|
InstrumentedMutexLock l(&mutex_);
|
|
|
|
s = DisableFileDeletionsWithLock();
|
|
|
|
my_disable_delete_obsolete_files = disable_delete_obsolete_files_;
|
|
|
|
}
|
|
|
|
if (my_disable_delete_obsolete_files == 1) {
|
2020-06-25 02:05:47 +00:00
|
|
|
ROCKS_LOG_INFO(immutable_db_options_.info_log, "File Deletions Disabled");
|
|
|
|
} else {
|
2024-01-30 19:58:31 +00:00
|
|
|
ROCKS_LOG_INFO(immutable_db_options_.info_log,
|
2020-06-25 02:05:47 +00:00
|
|
|
"File Deletions Disabled, but already disabled. Counter: %d",
|
2021-07-29 18:50:00 +00:00
|
|
|
my_disable_delete_obsolete_files);
|
2020-06-25 02:05:47 +00:00
|
|
|
}
|
2021-07-29 18:50:00 +00:00
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
2021-11-24 22:50:52 +00:00
|
|
|
// FIXME: can be inconsistent with DisableFileDeletions in cases like
|
|
|
|
// DBImplReadOnly
|
2021-07-29 18:50:00 +00:00
|
|
|
Status DBImpl::DisableFileDeletionsWithLock() {
|
|
|
|
mutex_.AssertHeld();
|
|
|
|
++disable_delete_obsolete_files_;
|
2020-06-25 02:05:47 +00:00
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
2024-02-14 02:36:25 +00:00
|
|
|
Status DBImpl::EnableFileDeletions() {
|
2020-06-25 02:05:47 +00:00
|
|
|
// Job id == 0 means that this is not our background process, but rather
|
|
|
|
// user thread
|
|
|
|
JobContext job_context(0);
|
2020-09-25 20:33:05 +00:00
|
|
|
int saved_counter; // initialize on all paths
|
2020-06-25 02:05:47 +00:00
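The disable/re-enable behaviour described in the commit message above can be pictured as a simple counter. The following is only a minimal sketch under that assumption: `DeletionGate` and its method names are hypothetical and are not part of RocksDB. It just mirrors the idea that deletions stay off until every outstanding "disable" has been undone.

```cpp
#include <cassert>

// Hypothetical stand-in for a "file deletions disabled" counter.
// Not RocksDB code; names are made up for illustration.
class DeletionGate {
 public:
  // Called when something (e.g. a MANIFEST write/sync IOError) requires
  // that obsolete files are kept around for now.
  void Disable() { ++disable_count_; }

  // Undoes one Disable(). Extra calls are harmless no-ops.
  // Returns true once deletions are allowed again.
  bool Enable() {
    if (disable_count_ > 0) {
      --disable_count_;
    }
    return disable_count_ == 0;
  }

  bool Enabled() const { return disable_count_ == 0; }

 private:
  int disable_count_ = 0;
};
```

With a counter rather than a flag, two independent callers can each disable deletions, and files only become eligible for purging after both have re-enabled them.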
|
|
|
{
|
|
|
|
InstrumentedMutexLock l(&mutex_);
|
2024-02-14 02:36:25 +00:00
|
|
|
if (disable_delete_obsolete_files_ > 0) {
|
2020-06-25 02:05:47 +00:00
|
|
|
--disable_delete_obsolete_files_;
|
|
|
|
}
|
2020-09-25 20:33:05 +00:00
|
|
|
saved_counter = disable_delete_obsolete_files_;
|
|
|
|
if (saved_counter == 0) {
|
2020-06-25 02:05:47 +00:00
|
|
|
FindObsoleteFiles(&job_context, true);
|
|
|
|
bg_cv_.SignalAll();
|
|
|
|
}
|
|
|
|
}
|
2020-09-25 20:33:05 +00:00
|
|
|
if (saved_counter == 0) {
|
2020-06-25 02:05:47 +00:00
|
|
|
ROCKS_LOG_INFO(immutable_db_options_.info_log, "File Deletions Enabled");
|
|
|
|
if (job_context.HaveSomethingToDelete()) {
|
|
|
|
PurgeObsoleteFiles(job_context);
|
|
|
|
}
|
|
|
|
} else {
|
2024-01-30 19:58:31 +00:00
|
|
|
ROCKS_LOG_INFO(immutable_db_options_.info_log,
|
2020-06-25 02:05:47 +00:00
|
|
|
"File Deletions Enable, but not really enabled. Counter: %d",
|
2020-09-25 20:33:05 +00:00
|
|
|
saved_counter);
|
2020-06-25 02:05:47 +00:00
|
|
|
}
|
|
|
|
job_context.Clean();
|
|
|
|
LogFlush(immutable_db_options_.info_log);
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
|
|
|
bool DBImpl::IsFileDeletionsEnabled() const {
|
|
|
|
return 0 == disable_delete_obsolete_files_;
|
|
|
|
}
|
|
|
|
|
2020-05-04 22:05:34 +00:00
|
|
|
// * Returns the list of live files in 'sst_live' and 'blob_live'.
|
2017-04-06 00:14:05 +00:00
|
|
|
// If it's doing full scan:
|
|
|
|
// * Returns the list of all files in the filesystem in
|
|
|
|
// 'full_scan_candidate_files'.
|
|
|
|
// Otherwise, gets obsolete files from VersionSet.
|
|
|
|
// no_full_scan = true -- never do the full scan using GetChildren()
|
|
|
|
// force = false -- don't force the full scan, except every
|
|
|
|
// mutable_db_options_.delete_obsolete_files_period_micros
|
|
|
|
// force = true -- force the full scan
|
|
|
|
void DBImpl::FindObsoleteFiles(JobContext* job_context, bool force,
|
|
|
|
bool no_full_scan) {
|
|
|
|
mutex_.AssertHeld();
|
|
|
|
|
|
|
|
// if deletion is disabled, do nothing
|
|
|
|
if (disable_delete_obsolete_files_ > 0) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
bool doing_the_full_scan = false;
|
|
|
|
|
2018-03-08 18:18:34 +00:00
|
|
|
// logic for figuring out if we're doing the full scan
|
2017-04-06 00:14:05 +00:00
|
|
|
if (no_full_scan) {
|
|
|
|
doing_the_full_scan = false;
|
|
|
|
} else if (force ||
|
|
|
|
mutable_db_options_.delete_obsolete_files_period_micros == 0) {
|
|
|
|
doing_the_full_scan = true;
|
|
|
|
} else {
|
2021-03-15 11:32:24 +00:00
|
|
|
const uint64_t now_micros = immutable_db_options_.clock->NowMicros();
|
2017-04-06 00:14:05 +00:00
|
|
|
if ((delete_obsolete_files_last_run_ +
|
|
|
|
mutable_db_options_.delete_obsolete_files_period_micros) <
|
|
|
|
now_micros) {
|
|
|
|
doing_the_full_scan = true;
|
|
|
|
delete_obsolete_files_last_run_ = now_micros;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// don't delete files that might be currently written to from compaction
|
|
|
|
// threads
|
|
|
|
// Since job_context->min_pending_output is set, until file scan finishes,
|
|
|
|
// mutex_ cannot be released. Otherwise, we might see no min_pending_output
|
2018-03-08 18:18:34 +00:00
|
|
|
// here but later find newer generated unfinalized files while scanning.
|
2020-05-07 16:29:21 +00:00
|
|
|
job_context->min_pending_output = MinObsoleteSstNumberToKeep();
|
2023-11-11 16:11:11 +00:00
|
|
|
job_context->files_to_quarantine = error_handler_.GetFilesToQuarantine();
|
2017-04-06 00:14:05 +00:00
|
|
|
|
|
|
|
// Get obsolete files. This function will also update the list of
|
|
|
|
// pending files in VersionSet().
|
2020-04-30 18:23:32 +00:00
|
|
|
versions_->GetObsoleteFiles(
|
|
|
|
&job_context->sst_delete_files, &job_context->blob_delete_files,
|
|
|
|
&job_context->manifest_delete_files, job_context->min_pending_output);
|
2017-04-06 00:14:05 +00:00
|
|
|
|
2020-05-07 16:29:21 +00:00
|
|
|
// Mark the elements in job_context->sst_delete_files and
|
|
|
|
// job_context->blob_delete_files as "grabbed for purge" so that other threads
|
|
|
|
// calling FindObsoleteFiles with full_scan=true will not add these files to
|
|
|
|
// candidate list for purge.
|
2018-04-06 02:49:06 +00:00
|
|
|
for (const auto& sst_to_del : job_context->sst_delete_files) {
|
|
|
|
MarkAsGrabbedForPurge(sst_to_del.metadata->fd.GetNumber());
|
2018-03-28 17:23:31 +00:00
|
|
|
}
|
|
|
|
|
2020-05-07 16:29:21 +00:00
|
|
|
for (const auto& blob_file : job_context->blob_delete_files) {
|
|
|
|
MarkAsGrabbedForPurge(blob_file.GetBlobFileNumber());
|
|
|
|
}
|
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
// store the current filenum, lognum, etc
|
|
|
|
job_context->manifest_file_number = versions_->manifest_file_number();
|
|
|
|
job_context->pending_manifest_file_number =
|
|
|
|
versions_->pending_manifest_file_number();
|
|
|
|
job_context->log_number = MinLogNumberToKeep();
|
|
|
|
job_context->prev_log_number = versions_->prev_log_number();
|
|
|
|
|
|
|
|
if (doing_the_full_scan) {
|
2022-05-25 19:43:48 +00:00
|
|
|
versions_->AddLiveFiles(&job_context->sst_live, &job_context->blob_live);
|
2018-03-28 17:23:31 +00:00
|
|
|
InfoLogPrefix info_log_prefix(!immutable_db_options_.db_log_dir.empty(),
|
2019-03-27 23:13:08 +00:00
|
|
|
dbname_);
|
2024-05-01 19:26:54 +00:00
|
|
|
// PurgeObsoleteFiles will dedupe duplicate files.
|
2022-10-03 17:59:45 +00:00
|
|
|
IOOptions io_opts;
|
|
|
|
io_opts.do_not_recurse = true;
|
2024-05-01 19:26:54 +00:00
|
|
|
for (auto& path : CollectAllDBPaths()) {
|
2017-04-06 00:14:05 +00:00
|
|
|
// set of all files in the directory. We'll exclude files that are still
|
|
|
|
// alive in subsequent processing.
|
|
|
|
std::vector<std::string> files;
|
2022-10-03 17:59:45 +00:00
|
|
|
Status s = immutable_db_options_.fs->GetChildren(
|
|
|
|
path, io_opts, &files, /*IODebugContext*=*/nullptr);
|
2020-12-23 07:44:44 +00:00
|
|
|
s.PermitUncheckedError(); // TODO: What should we do on error?
|
2018-03-28 17:23:31 +00:00
|
|
|
for (const std::string& file : files) {
|
|
|
|
uint64_t number;
|
|
|
|
FileType type;
|
|
|
|
// 1. If we cannot parse the file name, we skip;
|
|
|
|
// 2. If the file with file_number equals number has already been
|
|
|
|
// grabbed for purge by another compaction job, or it has already been
|
|
|
|
// scheduled for purge, we also skip it if we
|
|
|
|
// are doing full scan in order to avoid double deletion of the same
|
|
|
|
// file under race conditions. See
|
|
|
|
// https://github.com/facebook/rocksdb/issues/3573
|
|
|
|
if (!ParseFileName(file, &number, info_log_prefix.prefix, &type) ||
|
|
|
|
!ShouldPurge(number)) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2020-12-23 07:44:44 +00:00
|
|
|
// TODO(icanadi) clean up this mess to avoid having one-off "/"
|
|
|
|
// prefixes
|
2019-03-27 23:13:08 +00:00
|
|
|
job_context->full_scan_candidate_files.emplace_back("/" + file, path);
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// Add log files in wal_dir
|
2021-07-30 19:15:04 +00:00
|
|
|
if (!immutable_db_options_.IsWalDirSameAsDBPath(dbname_)) {
|
2017-04-06 00:14:05 +00:00
|
|
|
std::vector<std::string> log_files;
|
2022-10-03 17:59:45 +00:00
|
|
|
Status s = immutable_db_options_.fs->GetChildren(
|
|
|
|
immutable_db_options_.wal_dir, io_opts, &log_files,
|
|
|
|
/*IODebugContext*=*/nullptr);
|
2020-12-23 07:44:44 +00:00
|
|
|
s.PermitUncheckedError(); // TODO: What should we do on error?
|
2018-03-28 17:23:31 +00:00
|
|
|
for (const std::string& log_file : log_files) {
|
2019-03-27 23:13:08 +00:00
|
|
|
job_context->full_scan_candidate_files.emplace_back(
|
|
|
|
log_file, immutable_db_options_.wal_dir);
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
}
|
Fix a race condition in WAL tracking causing DB open failure (#9715)
Summary:
There is a race condition if WAL tracking in the MANIFEST is enabled in a database that disables 2PC.
The race condition is between two background flush threads trying to install flush results to the MANIFEST.
Consider an example database with two column families: "default" (cfd0) and "cf1" (cfd1). Initially,
both column families have one mutable (active) memtable whose data is backed by 6.log.
1. Trigger a manual flush for "cf1", creating a 7.log
2. Insert another key to "default", and trigger flush for "default", creating 8.log
3. BgFlushThread1 finishes writing 9.sst
4. BgFlushThread2 finishes writing 10.sst
```
Time BgFlushThread1 BgFlushThread2
| mutex_.Lock()
| precompute min_wal_to_keep as 6
| mutex_.Unlock()
| mutex_.Lock()
| precompute min_wal_to_keep as 6
| join MANIFEST write queue and mutex_.Unlock()
| write to MANIFEST
| mutex_.Lock()
| cfd1->log_number = 7
| Signal bg_flush_2 and mutex_.Unlock()
| wake up and mutex_.Lock()
| cfd0->log_number = 8
| FindObsoleteFiles() with job_context->log_number == 7
| mutex_.Unlock()
| PurgeObsoleteFiles() deletes 6.log
V
```
As shown above, BgFlushThread2 thinks that the min wal to keep is 6.log because "cf1" has unflushed data in 6.log (cf1.log_number=6).
Similarly, BgFlushThread1 thinks that min wal to keep is also 6.log because "default" has unflushed data (default.log_number=6).
No WAL deletion will be written to MANIFEST because 6 is equal to `versions_->wals_.min_wal_number_to_keep`,
due to https://github.com/facebook/rocksdb/blob/7.1.fb/db/memtable_list.cc#L513:L514.
The bg flush thread that finishes last will perform file purging. `job_context.log_number` will be evaluated as 7, i.e.
the min wal that contains unflushed data, causing 6.log to be deleted. However, MANIFEST thinks 6.log should still exist.
If you close the db at this point, you won't be able to re-open it if `track_and_verify_wal_in_manifest` is true.
We must handle the case of multiple bg flush threads, and it is difficult for one bg flush thread to know
the correct min wal number until the other bg flush threads have finished committing to the manifest and updated
the `cfd::log_number`.
To fix this issue, we rename an existing variable `min_log_number_to_keep_2pc` to `min_log_number_to_keep`,
and use it to track WAL file deletion in non-2pc mode as well.
This variable is updated only 1) during recovery with mutex held, or 2) in the MANIFEST write thread.
`min_log_number_to_keep` means RocksDB will delete WALs below it, although there may be WALs
above it which are also obsolete. Formally, we will have [min_wal_to_keep, max_obsolete_wal]. During recovery, we
make sure that only WALs above max_obsolete_wal are checked and added back to `alive_log_files_`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9715
Test Plan:
```
make check
```
Also ran stress test below (with asan) to make sure it completes successfully.
```
TEST_TMPDIR=/dev/shm/rocksdb OPT=-g ASAN_OPTIONS=disable_coredump=0 \
CRASH_TEST_EXT_ARGS=--compression_type=zstd SKIP_FORMAT_BUCK_CHECKS=1 \
make J=52 -j52 blackbox_asan_crash_test
```
Reviewed By: ltamasi
Differential Revision: D34984412
Pulled By: riversand963
fbshipit-source-id: c7b21a8d84751bb55ea79c9f387103d21b231005
2022-03-24 02:41:31 +00:00
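The rule stated in the commit message above, that WALs numbered below `min_log_number_to_keep` may be deleted, can be sketched as follows. This is an illustrative sketch only: `CollectObsoleteWals` is a hypothetical helper, and WALs are modelled as bare numbers rather than the entries RocksDB keeps in `alive_log_files_`.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical helper: pops every WAL number smaller than
// min_log_number_to_keep from the front of alive_wals (assumed sorted in
// ascending order) and returns the popped numbers.
std::vector<uint64_t> CollectObsoleteWals(std::deque<uint64_t>& alive_wals,
                                          uint64_t min_log_number_to_keep) {
  std::vector<uint64_t> obsolete;
  while (!alive_wals.empty() &&
         alive_wals.front() < min_log_number_to_keep) {
    obsolete.push_back(alive_wals.front());
    alive_wals.pop_front();
  }
  return obsolete;
}
```

Using the numbers from the scenario: with alive WALs 6, 7 and 8 and a threshold of 6, nothing is collected, so 6.log survives. It is only collected once the threshold, which per the commit message is advanced in the MANIFEST write thread, moves past 6.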
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
// Add info log files in db_log_dir
|
|
|
|
if (!immutable_db_options_.db_log_dir.empty() &&
|
|
|
|
immutable_db_options_.db_log_dir != dbname_) {
|
|
|
|
std::vector<std::string> info_log_files;
|
2022-10-03 17:59:45 +00:00
|
|
|
Status s = immutable_db_options_.fs->GetChildren(
|
|
|
|
immutable_db_options_.db_log_dir, io_opts, &info_log_files,
|
|
|
|
/*IODebugContext*=*/nullptr);
|
2020-12-23 07:44:44 +00:00
|
|
|
s.PermitUncheckedError(); // TODO: What should we do on error?
|
2018-04-06 02:49:06 +00:00
|
|
|
for (std::string& log_file : info_log_files) {
|
2019-03-27 23:13:08 +00:00
|
|
|
job_context->full_scan_candidate_files.emplace_back(
|
|
|
|
log_file, immutable_db_options_.db_log_dir);
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
}
|
2022-05-25 19:43:48 +00:00
|
|
|
} else {
|
|
|
|
// Instead of filling job_context->sst_live and job_context->blob_live,
|
|
|
|
// directly remove files that show up in any Version. This is because
|
|
|
|
// candidate files tend to be a small percentage of all files, so it is
|
|
|
|
// usually cheaper to check them against every version, compared to
|
|
|
|
// building a map for all files.
|
|
|
|
versions_->RemoveLiveFiles(job_context->sst_delete_files,
|
|
|
|
job_context->blob_delete_files);
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
2022-06-24 01:32:25 +00:00
|
|
|
// Before potentially releasing mutex and waiting on condvar, increment
|
|
|
|
// pending_purge_obsolete_files_ so that another thread executing
|
|
|
|
// `GetSortedWals` will wait until this thread finishes execution since the
|
|
|
|
// other thread will be waiting for `pending_purge_obsolete_files_`.
|
|
|
|
// pending_purge_obsolete_files_ MUST be decremented if there is nothing to
|
|
|
|
// delete.
|
|
|
|
++pending_purge_obsolete_files_;
|
|
|
|
|
|
|
|
Defer cleanup([job_context, this]() {
|
|
|
|
assert(job_context != nullptr);
|
|
|
|
if (!job_context->HaveSomethingToDelete()) {
|
|
|
|
mutex_.AssertHeld();
|
|
|
|
--pending_purge_obsolete_files_;
|
|
|
|
}
|
|
|
|
});
|
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
// logs_ is empty when called during recovery, in which case there can't yet
|
|
|
|
// be any tracked obsolete logs
|
2022-07-21 20:35:36 +00:00
|
|
|
log_write_mutex_.Lock();
|
|
|
|
|
|
|
|
if (alive_log_files_.empty() || logs_.empty()) {
|
|
|
|
mutex_.AssertHeld();
|
|
|
|
// We may reach here if the db is DBImplSecondary
|
|
|
|
log_write_mutex_.Unlock();
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2023-05-03 18:12:20 +00:00
|
|
|
bool mutex_unlocked = false;
|
2017-04-06 00:14:05 +00:00
|
|
|
if (!alive_log_files_.empty() && !logs_.empty()) {
|
|
|
|
uint64_t min_log_number = job_context->log_number;
|
|
|
|
size_t num_alive_log_files = alive_log_files_.size();
|
|
|
|
// find newly obsoleted log files
|
|
|
|
while (alive_log_files_.begin()->number < min_log_number) {
|
|
|
|
auto& earliest = *alive_log_files_.begin();
|
|
|
|
if (immutable_db_options_.recycle_log_file_num >
|
2024-04-29 19:25:00 +00:00
|
|
|
log_recycle_files_.size() &&
|
|
|
|
earliest.number >= MinLogNumberToRecycle()) {
|
2017-04-06 00:14:05 +00:00
|
|
|
ROCKS_LOG_INFO(immutable_db_options_.info_log,
|
|
|
|
"adding log %" PRIu64 " to recycle list\n",
|
|
|
|
earliest.number);
|
2018-01-09 20:51:10 +00:00
|
|
|
log_recycle_files_.push_back(earliest.number);
|
2017-04-06 00:14:05 +00:00
|
|
|
} else {
|
|
|
|
job_context->log_delete_files.push_back(earliest.number);
|
|
|
|
}
|
|
|
|
if (job_context->size_log_to_delete == 0) {
|
|
|
|
job_context->prev_total_log_size = total_log_size_;
|
|
|
|
job_context->num_alive_log_files = num_alive_log_files;
|
|
|
|
}
|
|
|
|
job_context->size_log_to_delete += earliest.size;
|
|
|
|
total_log_size_ -= earliest.size;
|
2017-06-30 16:30:03 +00:00
|
|
|
alive_log_files_.pop_front();
|
2022-07-21 20:35:36 +00:00
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
// Current log should always stay alive since it can't have
|
|
|
|
// number < MinLogNumber().
|
|
|
|
assert(alive_log_files_.size());
|
|
|
|
}
|
2022-07-21 20:35:36 +00:00
|
|
|
log_write_mutex_.Unlock();
|
|
|
|
mutex_.Unlock();
|
2023-05-03 18:12:20 +00:00
|
|
|
mutex_unlocked = true;
|
2022-11-29 22:14:43 +00:00
|
|
|
TEST_SYNC_POINT_CALLBACK("FindObsoleteFiles::PostMutexUnlock", nullptr);
|
2022-07-21 20:35:36 +00:00
|
|
|
log_write_mutex_.Lock();
|
2017-04-06 00:14:05 +00:00
|
|
|
while (!logs_.empty() && logs_.front().number < min_log_number) {
|
|
|
|
auto& log = logs_.front();
|
2022-06-17 23:45:28 +00:00
|
|
|
if (log.IsSyncing()) {
|
2017-04-06 00:14:05 +00:00
|
|
|
log_sync_cv_.Wait();
|
|
|
|
// logs_ could have changed while we were waiting.
|
|
|
|
continue;
|
|
|
|
}
|
Ensure Close() before LinkFile() for WALs in Checkpoint (#12734)
Summary:
POSIX semantics for LinkFile (hard links) allow linking a file
that is still being written to, with both the source and destination
showing any subsequent writes to the source. This may not be practical
semantics for some FileSystem implementations such as remote storage.
They might only link the flushed or sync-ed file contents at time of
LinkFile, or might even have undefined behavior if LinkFile is called on
a file still open for write (not yet "sealed"). This change builds on https://github.com/facebook/rocksdb/issues/12731
to bring more hygiene to our handling of WAL files in Checkpoint.
Specifically, we now Close WAL files as soon as they are either
(a) inactive and fully synced, or (b) inactive and obsolete (so maybe
never fully synced), rather than letting Close() happen in handling
obsolete files (maybe a background thread). This should not be a
performance issue as Close() should be trivial cost relative to other
IO ops, but just in case:
* We don't Close() while holding a mutex, to avoid blocking, and
* The old behavior is available with a new kill switch option
`background_close_inactive_wals`.
Stacked on https://github.com/facebook/rocksdb/issues/12731
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12734
Test Plan:
Extended existing unit test, especially adding a hygiene
check to FaultInjectionTestFS to detect LinkFile() on a file still open
for writes. FaultInjectionTestFS already has relevant tracking data, and
tests can opt out of the new check, as in a smoke test I have left for
the old, deprecated functionality `background_close_inactive_wals=true`.
Also ran lengthy blackbox_crash_test to ensure the hygiene check is OK
with the crash test. (The only place I can find we use LinkFile in
production is Checkpoint.)
Reviewed By: cbi42
Differential Revision: D58295284
Pulled By: pdillinger
fbshipit-source-id: 64d90ed8477e2366c19eaf9c4c5ad60b82cac5c6
2024-06-12 18:48:45 +00:00
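The "don't Close() while holding a mutex" point in the commit message above follows a common pattern: pin the entry, release the lock, do the slow call, then re-acquire the lock. Below is a minimal sketch of that pattern with standard library types. `FakeWal` and `CloseFrontWal` are hypothetical names, and the `pinned` flag only stands in for whatever mechanism keeps other threads from removing the entry.

```cpp
#include <cassert>
#include <deque>
#include <mutex>

// Hypothetical stand-in for a WAL entry; not a RocksDB type.
struct FakeWal {
  bool open = true;
  bool pinned = false;  // while true, other threads must leave the entry alone
  void Close() { open = false; }
};

// Closes the front WAL without holding the mutex during Close().
// Precondition: `lock` is held on entry and `wals` is non-empty.
// Postcondition: `lock` is held again on return.
void CloseFrontWal(std::deque<FakeWal>& wals,
                   std::unique_lock<std::mutex>& lock) {
  FakeWal& front = wals.front();
  front.pinned = true;
  lock.unlock();   // do not block other threads during the I/O call
  front.Close();
  lock.lock();
  // Because the entry was pinned, it is expected to still be at the front.
  assert(&front == &wals.front());
  front.pinned = false;
}
```

A `std::deque` is used here because appending at the back does not invalidate references to existing elements, so the reference to the front entry stays usable across the unlocked section.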
|
|
|
// This WAL file is not live, so it's OK if we never sync the rest of it.
|
|
|
|
// If it's already closed, then it's been fully synced. If
|
|
|
|
// !background_close_inactive_wals then we need to Close it before
|
|
|
|
// removing from logs_ but not blocking while holding log_write_mutex_.
|
|
|
|
if (!immutable_db_options_.background_close_inactive_wals &&
|
|
|
|
log.writer->file()) {
|
|
|
|
// We are taking ownership of and pinning the front entry, so we can
|
|
|
|
// expect it to be the same after releasing and re-acquiring the lock
|
|
|
|
log.PrepareForSync();
|
|
|
|
log_write_mutex_.Unlock();
|
|
|
|
// TODO: maybe check the return value of Close.
|
|
|
|
// TODO: plumb Env::IOActivity, Env::IOPriority
|
|
|
|
auto s = log.writer->file()->Close({});
|
|
|
|
s.PermitUncheckedError();
|
|
|
|
log_write_mutex_.Lock();
|
|
|
|
log.writer->PublishIfClosed();
|
|
|
|
assert(&log == &logs_.front());
|
|
|
|
log.FinishSync();
|
|
|
|
log_sync_cv_.SignalAll();
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
logs_to_free_.push_back(log.ReleaseWriter());
|
2022-07-21 20:35:36 +00:00
|
|
|
logs_.pop_front();
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
// Current log cannot be obsolete.
|
|
|
|
assert(!logs_.empty());
|
|
|
|
}
|
|
|
|
|
|
|
|
// We're just cleaning up for DB::Write().
|
|
|
|
assert(job_context->logs_to_free.empty());
|
|
|
|
job_context->logs_to_free = logs_to_free_;
|
2022-07-21 20:35:36 +00:00
|
|
|
|
|
|
|
logs_to_free_.clear();
|
|
|
|
log_write_mutex_.Unlock();
|
2023-05-03 18:12:20 +00:00
|
|
|
if (mutex_unlocked) {
|
|
|
|
mutex_.Lock();
|
|
|
|
}
|
2018-01-09 20:51:10 +00:00
|
|
|
job_context->log_recycle_files.assign(log_recycle_files_.begin(),
|
|
|
|
log_recycle_files_.end());
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
// Delete obsolete files and log status and information of file deletion
|
2017-09-07 21:11:15 +00:00
|
|
|
void DBImpl::DeleteObsoleteFileImpl(int job_id, const std::string& fname,
|
2018-04-26 20:51:39 +00:00
|
|
|
const std::string& path_to_sync,
|
2018-04-06 02:49:06 +00:00
|
|
|
FileType type, uint64_t number) {
|
2020-05-07 16:29:21 +00:00
|
|
|
TEST_SYNC_POINT_CALLBACK("DBImpl::DeleteObsoleteFileImpl::BeforeDeletion",
|
|
|
|
const_cast<std::string*>(&fname));
|
|
|
|
|
2017-09-07 21:11:15 +00:00
|
|
|
Status file_deletion_status;
|
2020-10-23 00:04:39 +00:00
|
|
|
if (type == kTableFile || type == kBlobFile || type == kWalFile) {
|
2021-09-30 20:24:13 +00:00
|
|
|
// Rate limit WAL deletion only if it's in the DB dir
|
|
|
|
file_deletion_status = DeleteDBFile(
|
|
|
|
&immutable_db_options_, fname, path_to_sync,
|
|
|
|
/*force_bg=*/false,
|
|
|
|
/*force_fg=*/(type == kWalFile) ? !wal_in_db_path_ : false);
|
2017-04-06 00:14:05 +00:00
|
|
|
} else {
|
|
|
|
file_deletion_status = env_->DeleteFile(fname);
|
|
|
|
}
|
2018-03-28 17:23:31 +00:00
|
|
|
TEST_SYNC_POINT_CALLBACK("DBImpl::DeleteObsoleteFileImpl:AfterDeletion",
|
2019-03-27 23:13:08 +00:00
|
|
|
&file_deletion_status);
|
2017-04-06 00:14:05 +00:00
|
|
|
if (file_deletion_status.ok()) {
|
|
|
|
ROCKS_LOG_DEBUG(immutable_db_options_.info_log,
|
|
|
|
"[JOB %d] Delete %s type=%d #%" PRIu64 " -- %s\n", job_id,
|
|
|
|
fname.c_str(), type, number,
|
|
|
|
file_deletion_status.ToString().c_str());
|
|
|
|
} else if (env_->FileExists(fname).IsNotFound()) {
|
|
|
|
ROCKS_LOG_INFO(
|
|
|
|
immutable_db_options_.info_log,
|
|
|
|
"[JOB %d] Tried to delete a non-existing file %s type=%d #%" PRIu64
|
|
|
|
" -- %s\n",
|
|
|
|
job_id, fname.c_str(), type, number,
|
|
|
|
file_deletion_status.ToString().c_str());
|
|
|
|
} else {
|
|
|
|
ROCKS_LOG_ERROR(immutable_db_options_.info_log,
|
|
|
|
"[JOB %d] Failed to delete %s type=%d #%" PRIu64 " -- %s\n",
|
|
|
|
job_id, fname.c_str(), type, number,
|
|
|
|
file_deletion_status.ToString().c_str());
|
|
|
|
}
|
|
|
|
if (type == kTableFile) {
|
|
|
|
EventHelpers::LogAndNotifyTableFileDeletion(
|
|
|
|
&event_logger_, job_id, number, fname, file_deletion_status, GetName(),
|
|
|
|
immutable_db_options_.listeners);
|
|
|
|
}
|
2021-09-17 00:17:40 +00:00
|
|
|
if (type == kBlobFile) {
|
|
|
|
EventHelpers::LogAndNotifyBlobFileDeletion(
|
|
|
|
&event_logger_, immutable_db_options_.listeners, job_id, number, fname,
|
|
|
|
file_deletion_status, GetName());
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
// Diffs the files listed in filenames against the live files; those that do not
|
2018-03-08 18:18:34 +00:00
|
|
|
// belong to live files are possibly removed. Also, removes all the
|
2017-04-06 00:14:05 +00:00
|
|
|
// files in sst_delete_files and log_delete_files.
|
|
|
|
// It is not necessary to hold the mutex when invoking this method.
|
2018-04-06 02:49:06 +00:00
|
|
|
void DBImpl::PurgeObsoleteFiles(JobContext& state, bool schedule_only) {
|
2018-01-18 01:37:10 +00:00
|
|
|
TEST_SYNC_POINT("DBImpl::PurgeObsoleteFiles:Begin");
|
2017-04-06 00:14:05 +00:00
|
|
|
// We had better have something to delete.
|
|
|
|
assert(state.HaveSomethingToDelete());
|
|
|
|
|
2018-01-18 01:37:10 +00:00
|
|
|
// FindObsoleteFiles() should have populated this, so it must be nonzero.
|
|
|
|
assert(state.manifest_file_number != 0);
|
2017-04-06 00:14:05 +00:00
|
|
|
|
2020-05-04 22:05:34 +00:00
|
|
|
// Now, convert lists to unordered sets, WITHOUT mutex held; set is slow.
|
|
|
|
std::unordered_set<uint64_t> sst_live_set(state.sst_live.begin(),
|
|
|
|
state.sst_live.end());
|
2020-05-07 16:29:21 +00:00
|
|
|
std::unordered_set<uint64_t> blob_live_set(state.blob_live.begin(),
|
|
|
|
state.blob_live.end());
|
2017-04-06 00:14:05 +00:00
|
|
|
std::unordered_set<uint64_t> log_recycle_files_set(
|
|
|
|
state.log_recycle_files.begin(), state.log_recycle_files.end());
|
2023-11-11 16:11:11 +00:00
|
|
|
std::unordered_set<uint64_t> quarantine_files_set(
|
|
|
|
state.files_to_quarantine.begin(), state.files_to_quarantine.end());
|
2017-04-06 00:14:05 +00:00
|
|
|
|
|
|
|
auto candidate_files = state.full_scan_candidate_files;
|
|
|
|
candidate_files.reserve(
|
|
|
|
candidate_files.size() + state.sst_delete_files.size() +
|
2020-05-07 16:29:21 +00:00
|
|
|
state.blob_delete_files.size() + state.log_delete_files.size() +
|
|
|
|
state.manifest_delete_files.size());
|
2017-04-06 00:14:05 +00:00
|
|
|
// We may ignore the dbname when generating the file names.
|
2018-04-06 02:49:06 +00:00
|
|
|
for (auto& file : state.sst_delete_files) {
|
Support pro-actively erasing obsolete block cache entries (#12694)
Summary:
Currently, when files become obsolete, the block cache entries associated with them just age out naturally. With pure LRU, this is not too bad, as once you "use" enough cache entries to (re-)fill the cache, you are guaranteed to have purged the obsolete entries. However, HyperClockCache is a counting clock cache with a somewhat longer memory, so could be more negatively impacted by previously-hot cache entries becoming obsolete, and taking longer to age out than newer single-hit entries.
Part of the reason we still have this natural aging-out is that there's almost no connection between block cache entries and the file they are associated with. Everything is hashed into the same pool(s) of entries with nothing like a secondary index based on file. Keeping track of such an index could be expensive.
This change adds a new, mutable CF option `uncache_aggressiveness` for erasing obsolete block cache entries. The process can be speculative, lossy, or unproductive because not all potential block cache entries associated with files will be resident in memory, and attempting to remove them all could be wasted CPU time. Rather than a simple on/off switch, `uncache_aggressiveness` basically tells RocksDB how much CPU you're willing to burn trying to purge obsolete block cache entries. When such efforts are not sufficiently productive for a file, we stop and move on.
The option is in ColumnFamilyOptions so that it is dynamically changeable for already-open files, and customizable by CF.
Note that this block cache removal happens as part of the process of purging obsolete files, which is often in a background thread (depending on `background_purge_on_iterator_cleanup` and `avoid_unnecessary_blocking_io` options) rather than along CPU critical paths.
Notable auxiliary code details:
* Possibly fixing some issues with trivial moves with `only_delete_metadata`: unnecessary TableCache::Evict in that case and missing from the ObsoleteFileInfo move operator. (Not able to reproduce a current failure.)
* Remove suspicious TableCache::Erase() from VersionSet::AddObsoleteBlobFile() (TODO follow-up item)
Marked EXPERIMENTAL until more thorough validation is complete.
Direct stats of this functionality are omitted because they could be misleading. Block cache hit rate is a better indicator of benefit, and CPU profiling a better indicator of cost.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12694
Test Plan:
* Unit tests added, including refactoring an existing test to make better use of parameterized tests.
* Added to crash test.
* Performance, sample command:
```
for I in `seq 1 10`; do for UA in 300; do for CT in lru_cache fixed_hyper_clock_cache auto_hyper_clock_cache; do rm -rf /dev/shm/test3; TEST_TMPDIR=/dev/shm/test3 /usr/bin/time ./db_bench -benchmarks=readwhilewriting -num=13000000 -read_random_exp_range=6 -write_buffer_size=10000000 -bloom_bits=10 -cache_type=$CT -cache_size=390000000 -cache_index_and_filter_blocks=1 -disable_wal=1 -duration=60 -statistics -uncache_aggressiveness=$UA 2>&1 | grep -E 'micros/op|rocksdb.block.cache.data.(hit|miss)|rocksdb.number.keys.(read|written)|maxresident' | awk '/rocksdb.block.cache.data.miss/ { miss = $4 } /rocksdb.block.cache.data.hit/ { hit = $4 } { print } END { print "hit rate = " ((hit * 1.0) / (miss + hit)) }' | tee -a results-$CT-$UA; done; done; done
```
Averaging 10 runs each case, block cache data block hit rates
```
lru_cache
UA=0 -> hit rate = 0.327, ops/s = 87668, user CPU sec = 139.0
UA=300 -> hit rate = 0.336, ops/s = 87960, user CPU sec = 139.0
fixed_hyper_clock_cache
UA=0 -> hit rate = 0.336, ops/s = 100069, user CPU sec = 139.9
UA=300 -> hit rate = 0.343, ops/s = 100104, user CPU sec = 140.2
auto_hyper_clock_cache
UA=0 -> hit rate = 0.336, ops/s = 97580, user CPU sec = 140.5
UA=300 -> hit rate = 0.345, ops/s = 97972, user CPU sec = 139.8
```
Conclusion: up to roughly 1 percentage point of improved block cache hit rate, likely leading to overall improved efficiency (because the foreground CPU cost of cache misses likely outweighs the background CPU cost of erasure, let alone I/O savings).
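The "roughly 1 percentage point" figure can be checked against the table above with a small sketch. It uses the same hit-rate formula as the awk script in the sample command, hit / (miss + hit); the function names `HitRate` and `DeltaPoints` are hypothetical helpers for illustration and are not part of RocksDB or db_bench.

```cpp
#include <cassert>
#include <cmath>

// Data block hit rate, as computed by the awk script in the sample command.
// (Hypothetical helper, not a RocksDB API.)
double HitRate(double hit, double miss) { return hit / (miss + hit); }

// Difference between two hit rates, expressed in percentage points.
// (Hypothetical helper, not a RocksDB API.)
double DeltaPoints(double before, double after) {
  return (after - before) * 100.0;
}
```

Applied to the averaged results, `DeltaPoints(0.327, 0.336)` for lru_cache and `DeltaPoints(0.336, 0.345)` for auto_hyper_clock_cache both come to about 0.9 points, and fixed_hyper_clock_cache to about 0.7.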
Reviewed By: ajkr
Differential Revision: D57932442
Pulled By: pdillinger
fbshipit-source-id: 84a243ca5f965f731f346a4853009780a904af6c
2024-06-07 15:57:11 +00:00
|
|
|
auto* handle = file.metadata->table_reader_handle;
|
|
|
|
if (file.only_delete_metadata) {
|
|
|
|
if (handle) {
|
|
|
|
// Simply release handle of file that is not being deleted
|
|
|
|
table_cache_->Release(handle);
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
// File is being deleted (actually obsolete)
|
|
|
|
auto number = file.metadata->fd.GetNumber();
|
|
|
|
candidate_files.emplace_back(MakeTableFileName(number), file.path);
|
|
|
|
if (handle == nullptr) {
|
|
|
|
// For files not "pinned" in table cache
|
|
|
|
handle = TableCache::Lookup(table_cache_.get(), number);
|
|
|
|
}
|
|
|
|
if (handle) {
|
|
|
|
TableCache::ReleaseObsolete(table_cache_.get(), handle,
|
|
|
|
file.uncache_aggressiveness);
|
|
|
|
}
|
2017-07-27 19:10:49 +00:00
|
|
|
}
|
2018-04-06 02:49:06 +00:00
|
|
|
file.DeleteMetadata();
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
2020-05-07 16:29:21 +00:00
|
|
|
for (const auto& blob_file : state.blob_delete_files) {
|
|
|
|
candidate_files.emplace_back(BlobFileName(blob_file.GetBlobFileNumber()),
|
|
|
|
blob_file.GetPath());
|
|
|
|
}
|
|
|
|
|
2021-07-30 19:15:04 +00:00
|
|
|
auto wal_dir = immutable_db_options_.GetWalDir();
|
2017-04-06 00:14:05 +00:00
|
|
|
for (auto file_num : state.log_delete_files) {
|
|
|
|
if (file_num > 0) {
|
2021-07-30 19:15:04 +00:00
|
|
|
candidate_files.emplace_back(LogFileName(file_num), wal_dir);
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
for (const auto& filename : state.manifest_delete_files) {
|
2018-04-06 02:49:06 +00:00
|
|
|
candidate_files.emplace_back(filename, dbname_);
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
// Dedup candidate_files so we don't try to delete the same
|
|
|
|
// file twice
|
|
|
|
std::sort(candidate_files.begin(), candidate_files.end(),
|
2022-07-21 20:35:36 +00:00
|
|
|
[](const JobContext::CandidateFileInfo& lhs,
|
|
|
|
const JobContext::CandidateFileInfo& rhs) {
|
2023-08-22 18:22:35 +00:00
|
|
|
if (lhs.file_name < rhs.file_name) {
|
2022-07-21 20:35:36 +00:00
|
|
|
return true;
|
2023-08-22 18:22:35 +00:00
|
|
|
} else if (lhs.file_name > rhs.file_name) {
|
2022-07-21 20:35:36 +00:00
|
|
|
return false;
|
|
|
|
} else {
|
2023-08-22 18:22:35 +00:00
|
|
|
return (lhs.file_path < rhs.file_path);
|
2022-07-21 20:35:36 +00:00
|
|
|
}
|
|
|
|
});
|
2017-04-06 00:14:05 +00:00
|
|
|
candidate_files.erase(
|
|
|
|
std::unique(candidate_files.begin(), candidate_files.end()),
|
|
|
|
candidate_files.end());
|
|
|
|
|
|
|
|
if (state.prev_total_log_size > 0) {
|
|
|
|
ROCKS_LOG_INFO(immutable_db_options_.info_log,
|
|
|
|
"[JOB %d] Try to delete WAL files size %" PRIu64
|
|
|
|
", prev total WAL file size %" PRIu64
|
|
|
|
", number of live WAL files %" ROCKSDB_PRIszt ".\n",
|
|
|
|
state.job_id, state.size_log_to_delete,
|
|
|
|
state.prev_total_log_size, state.num_alive_log_files);
|
|
|
|
}
|
|
|
|
|
|
|
|
std::vector<std::string> old_info_log_files;
|
|
|
|
InfoLogPrefix info_log_prefix(!immutable_db_options_.db_log_dir.empty(),
|
|
|
|
dbname_);
|
2018-07-11 21:49:31 +00:00
|
|
|
|
|
|
|
// File numbers of the two most recent OPTIONS files in candidate_files (found in
|
|
|
|
// previous FindObsoleteFiles(full_scan=true))
|
|
|
|
// At this point, there must not be any duplicate file numbers in
|
|
|
|
// candidate_files.
|
|
|
|
uint64_t optsfile_num1 = std::numeric_limits<uint64_t>::min();
|
|
|
|
uint64_t optsfile_num2 = std::numeric_limits<uint64_t>::min();
|
|
|
|
for (const auto& candidate_file : candidate_files) {
|
|
|
|
const std::string& fname = candidate_file.file_name;
|
|
|
|
uint64_t number;
|
|
|
|
FileType type;
|
|
|
|
if (!ParseFileName(fname, &number, info_log_prefix.prefix, &type) ||
|
|
|
|
type != kOptionsFile) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
if (number > optsfile_num1) {
|
|
|
|
optsfile_num2 = optsfile_num1;
|
|
|
|
optsfile_num1 = number;
|
|
|
|
} else if (number > optsfile_num2) {
|
|
|
|
optsfile_num2 = number;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2019-09-18 00:15:13 +00:00
|
|
|
// Close WALs before trying to delete them.
|
|
|
|
for (const auto w : state.logs_to_free) {
|
|
|
|
// TODO: maybe check the return value of Close.
|
Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactoring to make the implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include files that failed to sync and bytes that failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOptions so BuildTable() does not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOptions::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
Compaction, db open
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
|
|
|
// TODO: plumb Env::IOActivity, Env::IOPriority
|
Ensure Close() before LinkFile() for WALs in Checkpoint (#12734)
Summary:
POSIX semantics for LinkFile (hard links) allow linking a file
that is still being written to, with both the source and destination
showing any subsequent writes to the source. This may not be practical
semantics for some FileSystem implementations such as remote storage.
They might only link the flushed or sync-ed file contents at time of
LinkFile, or might even have undefined behavior if LinkFile is called on
a file still open for write (not yet "sealed"). This change builds on https://github.com/facebook/rocksdb/issues/12731
to bring more hygiene to our handling of WAL files in Checkpoint.
Specifically, we now Close WAL files as soon as they are either
(a) inactive and fully synced, or (b) inactive and obsolete (so maybe
never fully synced), rather than letting Close() happen in handling
obsolete files (maybe a background thread). This should not be a
performance issue as Close() should be trivial cost relative to other
IO ops, but just in case:
* We don't Close() while holding a mutex, to avoid blocking, and
* The old behavior is available with a new kill switch option
`background_close_inactive_wals`.
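The "don't Close() while holding a mutex" point can be illustrated with a minimal sketch in standard C++. This is not the RocksDB implementation: `FakeWal`, `WalList`, and all member names are hypothetical stand-ins. It only shows the general pattern of pinning the front entry, dropping the lock for the potentially slow Close(), and re-acquiring the lock before removing the entry.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a WAL writer; not a RocksDB type.
struct FakeWal {
  bool closed = false;
  bool pinned = false;
  void Close() { closed = true; }  // imagine slow file I/O here
};

// Hypothetical container of WALs, oldest first; the back entry is "current".
class WalList {
 public:
  void Add() {
    std::lock_guard<std::mutex> guard(mu_);
    wals_.emplace_back();
  }

  // Retires every WAL except the newest one. Each Close() runs with the
  // mutex released so other threads are not blocked behind file I/O.
  std::size_t RetireInactive() {
    std::unique_lock<std::mutex> lock(mu_);
    std::size_t retired = 0;
    while (wals_.size() > 1) {
      FakeWal& front = wals_.front();
      front.pinned = true;  // by convention, a pinned entry is never removed by others
      lock.unlock();
      front.Close();        // potentially blocking; done without the lock
      lock.lock();
      // Appending to a std::deque does not invalidate references to existing
      // elements, and the pin keeps the front in place, so it is unchanged.
      assert(&front == &wals_.front());
      front.pinned = false;
      retired_.push_back(front);
      wals_.pop_front();
      ++retired;
    }
    return retired;
  }

  std::size_t ActiveCount() {
    std::lock_guard<std::mutex> guard(mu_);
    return wals_.size();
  }

  bool AllRetiredClosed() {
    std::lock_guard<std::mutex> guard(mu_);
    for (const FakeWal& w : retired_) {
      if (!w.closed) return false;
    }
    return true;
  }

 private:
  std::mutex mu_;
  std::deque<FakeWal> wals_;
  std::vector<FakeWal> retired_;
};
```

The pin flag is what makes it reasonable to assume the front entry is the same after the lock is re-acquired; without such a convention, another thread could pop the entry while the lock is released.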
Stacked on https://github.com/facebook/rocksdb/issues/12731
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12734
Test Plan:
Extended existing unit test, especially adding a hygiene
check to FaultInjectionTestFS to detect LinkFile() on a file still open
for writes. FaultInjectionTestFS already has relevant tracking data, and
tests can opt out of the new check, as in a smoke test I have left for
the old, deprecated functionality `background_close_inactive_wals=true`.
Also ran lengthy blackbox_crash_test to ensure the hygiene check is OK
with the crash test. (The only place I can find we use LinkFile in
production is Checkpoint.)
Reviewed By: cbi42
Differential Revision: D58295284
Pulled By: pdillinger
fbshipit-source-id: 64d90ed8477e2366c19eaf9c4c5ad60b82cac5c6
2024-06-12 18:48:45 +00:00
|
|
|
auto s = w->Close({});
|
2020-08-21 02:16:56 +00:00
|
|
|
s.PermitUncheckedError();
|
2019-09-18 00:15:13 +00:00
|
|
|
}
|
|
|
|
|
2019-12-03 01:43:37 +00:00
|
|
|
bool own_files = OwnTablesAndLogs();
|
2018-03-28 17:23:31 +00:00
|
|
|
std::unordered_set<uint64_t> files_to_del;
|
2017-04-06 00:14:05 +00:00
|
|
|
for (const auto& candidate_file : candidate_files) {
|
2018-07-11 21:49:31 +00:00
|
|
|
const std::string& to_delete = candidate_file.file_name;
|
2017-04-06 00:14:05 +00:00
|
|
|
uint64_t number;
|
|
|
|
FileType type;
|
|
|
|
// Ignore file if we cannot recognize it.
|
|
|
|
if (!ParseFileName(to_delete, &number, info_log_prefix.prefix, &type)) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2023-11-11 16:11:11 +00:00
|
|
|
if (quarantine_files_set.find(number) != quarantine_files_set.end()) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
bool keep = true;
|
|
|
|
switch (type) {
|
2020-10-23 00:04:39 +00:00
|
|
|
case kWalFile:
|
2017-04-06 00:14:05 +00:00
|
|
|
keep = ((number >= state.log_number) ||
|
|
|
|
(number == state.prev_log_number) ||
|
|
|
|
(log_recycle_files_set.find(number) !=
|
|
|
|
log_recycle_files_set.end()));
|
|
|
|
break;
|
|
|
|
case kDescriptorFile:
|
|
|
|
// Keep my manifest file, and any newer incarnations'
|
|
|
|
// (can happen during manifest roll)
|
|
|
|
keep = (number >= state.manifest_file_number);
|
|
|
|
break;
|
|
|
|
case kTableFile:
|
|
|
|
// If the second condition is not there, this makes
|
|
|
|
// DontDeletePendingOutputs fail
|
2020-05-04 22:05:34 +00:00
|
|
|
keep = (sst_live_set.find(number) != sst_live_set.end()) ||
|
2017-04-06 00:14:05 +00:00
|
|
|
number >= state.min_pending_output;
|
2018-03-28 17:23:31 +00:00
|
|
|
if (!keep) {
|
|
|
|
files_to_del.insert(number);
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
break;
|
2020-05-07 16:29:21 +00:00
|
|
|
case kBlobFile:
|
|
|
|
keep = number >= state.min_pending_output ||
|
|
|
|
(blob_live_set.find(number) != blob_live_set.end());
|
|
|
|
if (!keep) {
|
|
|
|
files_to_del.insert(number);
|
|
|
|
}
|
|
|
|
break;
|
2017-04-06 00:14:05 +00:00
|
|
|
case kTempFile:
|
|
|
|
// Any temp files that are currently being written to must
|
|
|
|
// be recorded in pending_outputs_, which is inserted into "live".
|
|
|
|
// Also, SetCurrentFile creates a temp file when writing out new
|
|
|
|
// manifest, which is equal to state.pending_manifest_file_number. We
|
|
|
|
// should not delete that file
|
|
|
|
//
|
|
|
|
// TODO(yhchiang): carefully modify the third condition to safely
|
|
|
|
// remove the temp options files.
|
2020-05-04 22:05:34 +00:00
|
|
|
keep = (sst_live_set.find(number) != sst_live_set.end()) ||
|
2020-05-07 16:29:21 +00:00
|
|
|
(blob_live_set.find(number) != blob_live_set.end()) ||
|
2017-04-06 00:14:05 +00:00
|
|
|
(number == state.pending_manifest_file_number) ||
|
|
|
|
(to_delete.find(kOptionsFileNamePrefix) != std::string::npos);
|
|
|
|
break;
|
|
|
|
case kInfoLogFile:
|
|
|
|
keep = true;
|
|
|
|
if (number != 0) {
|
|
|
|
old_info_log_files.push_back(to_delete);
|
|
|
|
}
|
|
|
|
break;
|
2018-07-11 21:49:31 +00:00
|
|
|
case kOptionsFile:
|
|
|
|
keep = (number >= optsfile_num2);
|
|
|
|
break;
|
2017-04-06 00:14:05 +00:00
|
|
|
case kCurrentFile:
|
|
|
|
case kDBLockFile:
|
|
|
|
case kIdentityFile:
|
|
|
|
case kMetaDatabase:
|
|
|
|
keep = true;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (keep) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
std::string fname;
|
2018-04-26 20:51:39 +00:00
|
|
|
std::string dir_to_sync;
|
2017-04-06 00:14:05 +00:00
|
|
|
if (type == kTableFile) {
|
2018-04-06 02:49:06 +00:00
|
|
|
fname = MakeTableFileName(candidate_file.file_path, number);
|
2018-04-26 20:51:39 +00:00
|
|
|
dir_to_sync = candidate_file.file_path;
|
2020-05-07 16:29:21 +00:00
|
|
|
} else if (type == kBlobFile) {
|
|
|
|
fname = BlobFileName(candidate_file.file_path, number);
|
|
|
|
dir_to_sync = candidate_file.file_path;
|
2017-04-06 00:14:05 +00:00
|
|
|
} else {
|
2021-07-30 19:15:04 +00:00
|
|
|
dir_to_sync = (type == kWalFile) ? wal_dir : dbname_;
|
2019-03-27 23:13:08 +00:00
|
|
|
fname = dir_to_sync +
|
|
|
|
((!dir_to_sync.empty() && dir_to_sync.back() == '/') ||
|
|
|
|
(!to_delete.empty() && to_delete.front() == '/')
|
|
|
|
? ""
|
|
|
|
: "/") +
|
|
|
|
to_delete;
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
2021-04-23 03:42:50 +00:00
|
|
|
if (type == kWalFile && (immutable_db_options_.WAL_ttl_seconds > 0 ||
|
|
|
|
immutable_db_options_.WAL_size_limit_MB > 0)) {
|
2017-04-06 00:14:05 +00:00
|
|
|
wal_manager_.ArchiveWALFile(fname, number);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2019-12-03 01:43:37 +00:00
|
|
|
// If I do not own these files, e.g. secondary instance with max_open_files
|
|
|
|
// = -1, then no need to delete or schedule delete these files since they
|
|
|
|
// will be removed by their owner, e.g. the primary instance.
|
|
|
|
if (!own_files) {
|
|
|
|
continue;
|
|
|
|
}
|
2017-04-06 00:14:05 +00:00
|
|
|
if (schedule_only) {
|
|
|
|
InstrumentedMutexLock guard_lock(&mutex_);
|
2018-04-26 20:51:39 +00:00
|
|
|
SchedulePendingPurge(fname, dir_to_sync, type, number, state.job_id);
|
2017-04-06 00:14:05 +00:00
|
|
|
} else {
|
2018-04-26 20:51:39 +00:00
|
|
|
DeleteObsoleteFileImpl(state.job_id, fname, dir_to_sync, type, number);
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-03-28 17:23:31 +00:00
|
|
|
{
|
|
|
|
// After purging obsolete files, remove them from files_grabbed_for_purge_.
|
|
|
|
InstrumentedMutexLock guard_lock(&mutex_);
|
2019-09-17 23:43:07 +00:00
|
|
|
autovector<uint64_t> to_be_removed;
|
2018-03-28 17:23:31 +00:00
|
|
|
for (auto fn : files_grabbed_for_purge_) {
|
2019-09-17 23:43:07 +00:00
|
|
|
if (files_to_del.count(fn) != 0) {
|
|
|
|
to_be_removed.emplace_back(fn);
|
2018-03-28 17:23:31 +00:00
|
|
|
}
|
|
|
|
}
|
2019-09-17 23:43:07 +00:00
|
|
|
for (auto fn : to_be_removed) {
|
|
|
|
files_grabbed_for_purge_.erase(fn);
|
|
|
|
}
|
2018-03-28 17:23:31 +00:00
|
|
|
}
|
|
|
|
|
2017-04-06 00:14:05 +00:00
|
|
|
// Delete old info log files.
|
|
|
|
size_t old_info_log_file_count = old_info_log_files.size();
|
|
|
|
if (old_info_log_file_count != 0 &&
|
|
|
|
old_info_log_file_count >= immutable_db_options_.keep_log_file_num) {
|
|
|
|
std::sort(old_info_log_files.begin(), old_info_log_files.end());
|
|
|
|
size_t end =
|
|
|
|
old_info_log_file_count - immutable_db_options_.keep_log_file_num;
|
|
|
|
for (unsigned int i = 0; i <= end; i++) {
|
|
|
|
std::string& to_delete = old_info_log_files.at(i);
|
|
|
|
std::string full_path_to_delete =
|
|
|
|
(immutable_db_options_.db_log_dir.empty()
|
|
|
|
? dbname_
|
|
|
|
: immutable_db_options_.db_log_dir) +
|
|
|
|
"/" + to_delete;
|
|
|
|
ROCKS_LOG_INFO(immutable_db_options_.info_log,
|
|
|
|
"[JOB %d] Delete info log file %s\n", state.job_id,
|
|
|
|
full_path_to_delete.c_str());
|
|
|
|
Status s = env_->DeleteFile(full_path_to_delete);
|
|
|
|
if (!s.ok()) {
|
|
|
|
if (env_->FileExists(full_path_to_delete).IsNotFound()) {
|
|
|
|
ROCKS_LOG_INFO(
|
|
|
|
immutable_db_options_.info_log,
|
|
|
|
"[JOB %d] Tried to delete non-existing info log file %s FAILED "
|
|
|
|
"-- %s\n",
|
|
|
|
state.job_id, to_delete.c_str(), s.ToString().c_str());
|
|
|
|
} else {
|
|
|
|
ROCKS_LOG_ERROR(immutable_db_options_.info_log,
|
|
|
|
"[JOB %d] Delete info log file %s FAILED -- %s\n",
|
|
|
|
state.job_id, to_delete.c_str(),
|
|
|
|
s.ToString().c_str());
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
wal_manager_.PurgeObsoleteWALFiles();
|
|
|
|
LogFlush(immutable_db_options_.info_log);
|
2018-01-18 01:37:10 +00:00
|
|
|
InstrumentedMutexLock l(&mutex_);
|
|
|
|
--pending_purge_obsolete_files_;
|
|
|
|
assert(pending_purge_obsolete_files_ >= 0);
|
2021-11-24 22:50:52 +00:00
|
|
|
if (schedule_only) {
|
|
|
|
// Must change from pending_purge_obsolete_files_ to bg_purge_scheduled_
|
|
|
|
// while holding mutex (for GetSortedWalFiles() etc.)
|
|
|
|
SchedulePurge();
|
|
|
|
}
|
2018-01-18 01:37:10 +00:00
|
|
|
if (pending_purge_obsolete_files_ == 0) {
|
|
|
|
bg_cv_.SignalAll();
|
|
|
|
}
|
|
|
|
TEST_SYNC_POINT("DBImpl::PurgeObsoleteFiles:End");
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void DBImpl::DeleteObsoleteFiles() {
|
|
|
|
mutex_.AssertHeld();
|
|
|
|
JobContext job_context(next_job_id_.fetch_add(1));
|
|
|
|
FindObsoleteFiles(&job_context, true);
|
|
|
|
|
|
|
|
mutex_.Unlock();
|
|
|
|
if (job_context.HaveSomethingToDelete()) {
|
2021-11-03 19:20:19 +00:00
|
|
|
bool defer_purge = immutable_db_options_.avoid_unnecessary_blocking_io;
|
|
|
|
PurgeObsoleteFiles(job_context, defer_purge);
|
2017-04-06 00:14:05 +00:00
|
|
|
}
|
|
|
|
job_context.Clean();
|
|
|
|
mutex_.Lock();
|
|
|
|
}
|
2018-03-28 17:23:31 +00:00
|
|
|
|
Skip deleted WALs during recovery
Summary:
This patch records the min log number to keep in the manifest while flushing SST files, so that recovery can ignore any WAL older than it. This avoids scenarios in which there is a gap between the WAL files fed to the recovery procedure. Such a gap can arise from, for example, out-of-order WAL deletion. It can cause problems in 2PC recovery, where the prepare and commit entries are placed in two separate WALs: a gap could result in the WAL with the commit entry not being processed, breaking the 2PC recovery logic.
Before this commit, for the 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort is in a memtable. With this commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), and record this information in the manifest entry for the newly flushed SST file. This precomputed value is also remembered in memory, and is later used to determine whether a log file can be deleted. The value is unlikely to change until the next flush because the commit entry will stay in the memtable. (In WritePrepared, we could remove the older log files as soon as all prepared entries are committed. That is not done yet. Even if we did it, the only thing we would lose with this new approach is earlier log deletion between two flushes, which is not guaranteed to happen anyway because the obsolete-file clean-up function is only executed after a flush or compaction.)
This min log number to keep is stored in the manifest using the safely-ignorable customized field of the AddFile entry, to guarantee that a DB generated by a newer release can be opened by previous releases no older than 4.2.
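As a rough illustration of the recovery-side effect described in this summary, the sketch below filters a set of WAL numbers against a recorded min log number to keep. `WalsToReplay` is a hypothetical helper written for this note, not a RocksDB function, and it assumes WALs are identified by plain integers.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a RocksDB API): returns, in ascending order, the
// WAL numbers that recovery would still replay. Every WAL older than
// `min_log_number_to_keep` is skipped, whether or not its file still exists.
std::vector<uint64_t> WalsToReplay(std::vector<uint64_t> wal_numbers,
                                   uint64_t min_log_number_to_keep) {
  std::sort(wal_numbers.begin(), wal_numbers.end());
  std::vector<uint64_t> to_replay;
  for (uint64_t wal : wal_numbers) {
    if (wal >= min_log_number_to_keep) {
      to_replay.push_back(wal);
    }
  }
  return to_replay;
}
```

With WALs 3, 5, 8 and 9 on disk and a recorded value of 5, only 5, 8 and 9 would be replayed, so a leftover 3.log cannot reintroduce a gap.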
Closes https://github.com/facebook/rocksdb/pull/3765
Differential Revision: D7747618
Pulled By: siying
fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
2018-05-03 22:35:11 +00:00
uint64_t FindMinPrepLogReferencedByMemTable(
Fix a silent data loss for write-committed txn (#9571)
Summary:
The following sequence of events can cause silent data loss for write-committed
transactions.
```
Time thread 1 bg flush
| db->Put("a")
| txn = NewTxn()
| txn->Put("b", "v")
| txn->Prepare() // writes only to 5.log
| db->SwitchMemtable() // memtable 1 has "a"
| // close 5.log,
| // creates 8.log
| trigger flush
| pick memtable 1
| unlock db mutex
| write new sst
| txn->ctwb->Put("gtid", "1") // writes 8.log
| txn->Commit() // writes to 8.log
| // writes to memtable 2
| compute min_log_number_to_keep_2pc, this
| will be 8 (incorrect).
|
| Purge obsolete wals, including 5.log
|
V
```
At this point, the writes of txn exist only in the memtable. Close the db without flushing, because the db thinks the data in the
memtable is backed by the log. After reopening, the writes are lost except for the key-value pair {"gtid"->"1"}:
only the commit marker of txn is in 8.log.
The reason lies in `PrecomputeMinLogNumberToKeep2PC()` which calls `FindMinPrepLogReferencedByMemTable()`.
In the above example, when bg flush thread tries to find obsolete wals, it uses the information
computed by `PrecomputeMinLogNumberToKeep2PC()`. The return value of `PrecomputeMinLogNumberToKeep2PC()`
depends on three components:
- `PrecomputeMinLogNumberToKeepNon2PC()`. This represents the WAL that has unflushed data. As the name of this method suggests, it does not account for 2PC. Although the keys reside in the prepare section of a previous WAL, the column family references the current WAL when they are actually inserted into the memtable during txn commit.
- `prep_tracker->FindMinLogContainingOutstandingPrep()`. This represents the WAL that has a prepare section whose txn hasn't committed.
- `FindMinPrepLogReferencedByMemTable()`. This represents the WAL on which some memtables (mutable and immutable) depend for their unflushed data.
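A minimal sketch of how the three components listed above could be folded into one WAL number, applied to the example timeline. `CombineMinLogToKeep` is a hypothetical name for illustration, and treating 0 as "no WAL is pinned" for the two prepare-related inputs is an assumption of this sketch.

```cpp
#include <cstdint>

// Hypothetical helper (not the RocksDB implementation): takes the non-2PC
// candidate as the starting point and lowers it whenever one of the
// prepare-related candidates pins an older WAL. For those two inputs,
// 0 is assumed to mean "no WAL is pinned".
uint64_t CombineMinLogToKeep(uint64_t non_2pc_min_log,
                             uint64_t min_log_with_outstanding_prep,
                             uint64_t min_prep_log_referenced_by_memtable) {
  uint64_t min_log_to_keep = non_2pc_min_log;
  if (min_log_with_outstanding_prep != 0 &&
      min_log_with_outstanding_prep < min_log_to_keep) {
    min_log_to_keep = min_log_with_outstanding_prep;
  }
  if (min_prep_log_referenced_by_memtable != 0 &&
      min_prep_log_referenced_by_memtable < min_log_to_keep) {
    min_log_to_keep = min_prep_log_referenced_by_memtable;
  }
  return min_log_to_keep;
}
```

In the timeline, the non-2PC candidate is 8 and the txn has already committed, so there is no outstanding prepare. If the memtable component reports nothing, the result is 8 and 5.log becomes eligible for purging; if it reports 5, the result is 5 and 5.log is kept.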
The bug lies in `FindMinPrepLogReferencedByMemTable()`. Originally, this function skipped the column families
that are being flushed, but the unit test added in this PR shows that they should not be skipped. In this unit test, there is
only the default column family, and one of its memtables has unflushed data backed by a prepare section in 5.log.
We should return this information via `FindMinPrepLogReferencedByMemTable()`.
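The intended behaviour can be sketched with a toy model: scan every live column family, skip only the individual memtables that are being flushed, and keep the smallest non-zero prepare-section WAL. The `Toy*` types and `ToyFindMinPrepLog` are hypothetical stand-ins for illustration, not RocksDB classes.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Toy stand-ins, not RocksDB types. `min_prep_log == 0` is assumed to mean
// the memtable does not depend on any prepare section.
struct ToyMemTable {
  uint64_t min_prep_log = 0;
};

struct ToyColumnFamily {
  bool dropped = false;
  std::vector<const ToyMemTable*> immutable_memtables;
  ToyMemTable mutable_memtable;
};

// Returns the smallest WAL number holding a prepare section that some
// memtable still depends on, or 0 if there is none. Only the memtables in
// `being_flushed` are skipped; their column family is still examined.
uint64_t ToyFindMinPrepLog(
    const std::vector<ToyColumnFamily>& cfs,
    const std::unordered_set<const ToyMemTable*>& being_flushed) {
  uint64_t min_log = 0;
  auto consider = [&min_log](uint64_t log) {
    if (log > 0 && (min_log == 0 || log < min_log)) {
      min_log = log;
    }
  };
  for (const ToyColumnFamily& cf : cfs) {
    if (cf.dropped) {
      continue;
    }
    for (const ToyMemTable* mem : cf.immutable_memtables) {
      if (being_flushed.count(mem) == 0) {
        consider(mem->min_prep_log);
      }
    }
    consider(cf.mutable_memtable.min_prep_log);
  }
  return min_log;
}
```

Modelling the unit test's situation, the default column family has memtable 1 in flight and a mutable memtable 2 whose data is backed by the prepare section in 5.log. Because the column family itself is not skipped, the scan reports 5.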
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9571
Test Plan:
```
./transaction_test --gtest_filter=*/TransactionTest.SwitchMemtableDuringPrepareAndCommit_WC/*
make check
```
Reviewed By: siying
Differential Revision: D34235236
Pulled By: riversand963
fbshipit-source-id: 120eb21a666728a38dda77b96276c6af72b008b1
2022-02-17 07:07:48 +00:00
    VersionSet* vset, const autovector<MemTable*>& memtables_to_flush) {
2018-05-03 22:35:11 +00:00
  uint64_t min_log = 0;

  // we must look through the memtables for two phase transactions
  // that have been committed but not yet flushed
2020-12-04 03:21:08 +00:00
  std::unordered_set<MemTable*> memtables_to_flush_set(
      memtables_to_flush.begin(), memtables_to_flush.end());
2018-05-03 22:35:11 +00:00
  for (auto loop_cfd : *vset->GetColumnFamilySet()) {
2022-02-17 07:07:48 +00:00
    if (loop_cfd->IsDropped()) {
2018-05-03 22:35:11 +00:00
      continue;
    }

    auto log = loop_cfd->imm()->PrecomputeMinLogContainingPrepSection(
2020-12-04 03:21:08 +00:00
        &memtables_to_flush_set);
2018-05-03 22:35:11 +00:00
    if (log > 0 && (min_log == 0 || log < min_log)) {
      min_log = log;
    }

    log = loop_cfd->mem()->GetMinLogContainingPrepSection();

    if (log > 0 && (min_log == 0 || log < min_log)) {
      min_log = log;
    }
  }

  return min_log;
}
2020-12-04 03:21:08 +00:00
uint64_t FindMinPrepLogReferencedByMemTable(
2022-02-17 07:07:48 +00:00
    VersionSet* vset,
2020-12-04 03:21:08 +00:00
    const autovector<const autovector<MemTable*>*>& memtables_to_flush) {
  uint64_t min_log = 0;

  std::unordered_set<MemTable*> memtables_to_flush_set;
  for (const autovector<MemTable*>* memtables : memtables_to_flush) {
    memtables_to_flush_set.insert(memtables->begin(), memtables->end());
  }
  for (auto loop_cfd : *vset->GetColumnFamilySet()) {
2022-02-17 07:07:48 +00:00
    if (loop_cfd->IsDropped()) {
2020-12-04 03:21:08 +00:00
      continue;
    }

    auto log = loop_cfd->imm()->PrecomputeMinLogContainingPrepSection(
        &memtables_to_flush_set);
    if (log > 0 && (min_log == 0 || log < min_log)) {
      min_log = log;
    }

    log = loop_cfd->mem()->GetMinLogContainingPrepSection();
    if (log > 0 && (min_log == 0 || log < min_log)) {
      min_log = log;
    }
  }

  return min_log;
}
2020-11-07 00:30:44 +00:00
uint64_t PrecomputeMinLogNumberToKeepNon2PC(
2018-05-03 22:35:11 +00:00
    VersionSet* vset, const ColumnFamilyData& cfd_to_flush,
2020-11-07 00:30:44 +00:00
    const autovector<VersionEdit*>& edit_list) {
2018-05-03 22:35:11 +00:00
  assert(vset != nullptr);

  // Precompute the min log number containing unflushed data for the column
  // family being flushed (`cfd_to_flush`).
  uint64_t cf_min_log_number_to_keep = 0;
  for (auto& e : edit_list) {
2020-02-07 21:25:07 +00:00
    if (e->HasLogNumber()) {
2018-05-03 22:35:11 +00:00
      cf_min_log_number_to_keep =
2020-02-07 21:25:07 +00:00
          std::max(cf_min_log_number_to_keep, e->GetLogNumber());
2018-05-03 22:35:11 +00:00
    }
  }
  if (cf_min_log_number_to_keep == 0) {
    // No version edit contains information on log number. The log number
    // for this column family should stay the same as it is.
    cf_min_log_number_to_keep = cfd_to_flush.GetLogNumber();
  }

  // Get min log number containing unflushed data for other column families.
  uint64_t min_log_number_to_keep =
      vset->PreComputeMinLogNumberWithUnflushedData(&cfd_to_flush);
  if (cf_min_log_number_to_keep != 0) {
    min_log_number_to_keep =
        std::min(cf_min_log_number_to_keep, min_log_number_to_keep);
  }
2020-11-07 00:30:44 +00:00
  return min_log_number_to_keep;
}
2020-11-17 23:54:49 +00:00
uint64_t PrecomputeMinLogNumberToKeepNon2PC(
    VersionSet* vset, const autovector<ColumnFamilyData*>& cfds_to_flush,
    const autovector<autovector<VersionEdit*>>& edit_lists) {
  assert(vset != nullptr);
  assert(!cfds_to_flush.empty());
  assert(cfds_to_flush.size() == edit_lists.size());
2022-05-05 20:08:21 +00:00
|
|
|
uint64_t min_log_number_to_keep = std::numeric_limits<uint64_t>::max();
|
2020-11-17 23:54:49 +00:00
|
|
|
for (const auto& edit_list : edit_lists) {
|
|
|
|
uint64_t log = 0;
|
|
|
|
for (const auto& e : edit_list) {
|
|
|
|
if (e->HasLogNumber()) {
|
|
|
|
log = std::max(log, e->GetLogNumber());
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (log != 0) {
|
|
|
|
min_log_number_to_keep = std::min(min_log_number_to_keep, log);
|
|
|
|
}
|
|
|
|
}
|
2022-05-05 20:08:21 +00:00
|
|
|
if (min_log_number_to_keep == std::numeric_limits<uint64_t>::max()) {
|
2020-11-17 23:54:49 +00:00
|
|
|
min_log_number_to_keep = cfds_to_flush[0]->GetLogNumber();
|
|
|
|
for (size_t i = 1; i < cfds_to_flush.size(); i++) {
|
|
|
|
min_log_number_to_keep =
|
|
|
|
std::min(min_log_number_to_keep, cfds_to_flush[i]->GetLogNumber());
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
std::unordered_set<const ColumnFamilyData*> flushed_cfds(
|
|
|
|
cfds_to_flush.begin(), cfds_to_flush.end());
|
|
|
|
min_log_number_to_keep =
|
|
|
|
std::min(min_log_number_to_keep,
|
|
|
|
vset->PreComputeMinLogNumberWithUnflushedData(flushed_cfds));
|
|
|
|
|
|
|
|
return min_log_number_to_keep;
|
|
|
|
}
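The first half of the multi-column-family function above can be read in isolation: for each flushed column family take the largest log number carried by its version edits, take the smallest of those across column families, and fall back to the column families' current log numbers when no edit carries one. The following is a minimal sketch of that step only, under stated assumptions: `FlushedCf` and `MinLogToKeepForFlushedCfs` are hypothetical stand-ins and not RocksDB types, and the final `PreComputeMinLogNumberWithUnflushedData()` step is deliberately left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-in for one column family that is being flushed.
struct FlushedCf {
  uint64_t current_log_number;             // plays the role of cfd->GetLogNumber()
  std::vector<uint64_t> edit_log_numbers;  // log numbers carried by its version edits
};

// Sketch of the "max within a column family, min across column families"
// rule, with the fallback used when no edit carries a log number.
uint64_t MinLogToKeepForFlushedCfs(const std::vector<FlushedCf>& cfs) {
  assert(!cfs.empty());
  constexpr uint64_t kNone = std::numeric_limits<uint64_t>::max();
  uint64_t result = kNone;
  for (const FlushedCf& cf : cfs) {
    uint64_t log = 0;
    for (uint64_t n : cf.edit_log_numbers) {
      log = std::max(log, n);
    }
    if (log != 0) {
      result = std::min(result, log);
    }
  }
  if (result == kNone) {
    // No edit carried a log number: keep the current log numbers as they are.
    result = cfs[0].current_log_number;
    for (std::size_t i = 1; i < cfs.size(); ++i) {
      result = std::min(result, cfs[i].current_log_number);
    }
  }
  return result;
}
```

For example, two column families whose edits carry log numbers {7, 9} and {8} give 8 under this sketch, while two column families with no edit log numbers and current logs 5 and 3 give 3.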
|
|
|
|
|
2020-11-07 00:30:44 +00:00
|
|
|
uint64_t PrecomputeMinLogNumberToKeep2PC(
|
|
|
|
VersionSet* vset, const ColumnFamilyData& cfd_to_flush,
|
|
|
|
const autovector<VersionEdit*>& edit_list,
|
|
|
|
const autovector<MemTable*>& memtables_to_flush,
|
|
|
|
LogsWithPrepTracker* prep_tracker) {
|
|
|
|
assert(vset != nullptr);
|
|
|
|
assert(prep_tracker != nullptr);
|
|
|
|
// Calculate updated min_log_number_to_keep
|
|
|
|
// Since the function should only be called in 2pc mode, log number in
|
|
|
|
// the version edit should be sufficient.
|
|
|
|
|
|
|
|
uint64_t min_log_number_to_keep =
|
|
|
|
PrecomputeMinLogNumberToKeepNon2PC(vset, cfd_to_flush, edit_list);
|
Skip deleted WALs during recovery
2018-05-03 22:35:11 +00:00
|
|
|
|
|
|
|
// If we are in 2PC mode, we must consider logs containing prepared
|
|
|
|
// sections of outstanding transactions.
|
|
|
|
//
|
|
|
|
// We must check min logs with outstanding prep before we check
|
|
|
|
// logs referenced by memtables because a log referenced by the
|
|
|
|
// first data structure could transition to the second under us.
|
|
|
|
//
|
|
|
|
// TODO: iterating over all column families under db mutex.
|
|
|
|
// should find more optimal solution
|
|
|
|
auto min_log_in_prep_heap =
|
|
|
|
prep_tracker->FindMinLogContainingOutstandingPrep();
|
|
|
|
|
|
|
|
if (min_log_in_prep_heap != 0 &&
|
|
|
|
min_log_in_prep_heap < min_log_number_to_keep) {
|
|
|
|
min_log_number_to_keep = min_log_in_prep_heap;
|
|
|
|
}
|
|
|
|
|
Fix a silent data loss for write-committed txn (#9571)
Summary:
The following sequence of events can cause silent data loss for write-committed
transactions.
```
Time thread 1 bg flush
| db->Put("a")
| txn = NewTxn()
| txn->Put("b", "v")
| txn->Prepare() // writes only to 5.log
| db->SwitchMemtable() // memtable 1 has "a"
| // close 5.log,
| // creates 8.log
| trigger flush
| pick memtable 1
| unlock db mutex
| write new sst
| txn->ctwb->Put("gtid", "1") // writes 8.log
| txn->Commit() // writes to 8.log
| // writes to memtable 2
| compute min_log_number_to_keep_2pc, this
| will be 8 (incorrect).
|
| Purge obsolete wals, including 5.log
|
V
```
At this point, the writes of txn exist only in the memtable. Close the db without flush, because the db thinks the data in
the memtable is backed by the log. Then reopen: the writes are lost except for the key-value pair {"gtid"->"1"},
because only the commit marker of txn is in 8.log.
The reason lies in `PrecomputeMinLogNumberToKeep2PC()` which calls `FindMinPrepLogReferencedByMemTable()`.
In the above example, when bg flush thread tries to find obsolete wals, it uses the information
computed by `PrecomputeMinLogNumberToKeep2PC()`. The return value of `PrecomputeMinLogNumberToKeep2PC()`
depends on three components:
- `PrecomputeMinLogNumberToKeepNon2PC()`. This represents the WAL that has unflushed data. As the name of this method suggests, it does not account for 2PC. Although the keys reside in the prepare section of a previous WAL, the column family references the current WAL when they are actually inserted into the memtable during txn commit.
- `prep_tracker->FindMinLogContainingOutstandingPrep()`. This represents the WAL with a prepare section whose txn hasn't committed.
- `FindMinPrepLogReferencedByMemTable()`. This represents the WAL on which some memtables (mutable and immutable) depend for their unflushed data.
The bug lies in `FindMinPrepLogReferencedByMemTable()`. Originally, this function skipped checking the column families
that are being flushed, but the unit test added in this PR shows that they should not be skipped. In this unit test, there is
only the default column family, and one of its memtables has unflushed data backed by a prepare section in 5.log.
We should return this information via `FindMinPrepLogReferencedByMemTable()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9571
Test Plan:
```
./transaction_test --gtest_filter=*/TransactionTest.SwitchMemtableDuringPrepareAndCommit_WC/*
make check
```
Reviewed By: siying
Differential Revision: D34235236
Pulled By: riversand963
fbshipit-source-id: 120eb21a666728a38dda77b96276c6af72b008b1
2022-02-17 07:07:48 +00:00
|
|
|
uint64_t min_log_refed_by_mem =
|
|
|
|
FindMinPrepLogReferencedByMemTable(vset, memtables_to_flush);
|
Skip deleted WALs during recovery
2018-05-03 22:35:11 +00:00
|
|
|
|
|
|
|
if (min_log_refed_by_mem != 0 &&
|
|
|
|
min_log_refed_by_mem < min_log_number_to_keep) {
|
|
|
|
min_log_number_to_keep = min_log_refed_by_mem;
|
|
|
|
}
|
|
|
|
return min_log_number_to_keep;
|
|
|
|
}
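The way the function above folds its three candidates together can be shown on the numbers from the #9571 description. The following is a sketch only: `CombineMinLogToKeep2PC` is a hypothetical helper and not a RocksDB function, and it assumes, as the code above does, that a value of 0 from the prepare tracker or the memtable scan means "no such log".

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: start from the non-2PC value and lower it with each
// non-zero 2PC candidate. A candidate of 0 is treated as "no such log".
uint64_t CombineMinLogToKeep2PC(uint64_t min_log_non_2pc,
                                uint64_t min_log_in_prep_heap,
                                uint64_t min_log_refed_by_mem) {
  uint64_t result = min_log_non_2pc;
  if (min_log_in_prep_heap != 0 && min_log_in_prep_heap < result) {
    result = min_log_in_prep_heap;
  }
  if (min_log_refed_by_mem != 0 && min_log_refed_by_mem < result) {
    result = min_log_refed_by_mem;
  }
  return result;
}
```

In the scenario from the commit description, the non-2PC value is 8. Assuming no other prepared transaction is outstanding after the commit, the prepare tracker contributes 0. If the memtable scan also contributes 0 because the flushed column family was skipped, the result stays at 8 and 5.log is purged. Once the scan reports 5, the result drops to 5 and 5.log is kept.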
|
|
|
|
|
2020-12-04 03:21:08 +00:00
|
|
|
uint64_t PrecomputeMinLogNumberToKeep2PC(
|
|
|
|
VersionSet* vset, const autovector<ColumnFamilyData*>& cfds_to_flush,
|
|
|
|
const autovector<autovector<VersionEdit*>>& edit_lists,
|
|
|
|
const autovector<const autovector<MemTable*>*>& memtables_to_flush,
|
|
|
|
LogsWithPrepTracker* prep_tracker) {
|
|
|
|
assert(vset != nullptr);
|
|
|
|
assert(prep_tracker != nullptr);
|
|
|
|
assert(cfds_to_flush.size() == edit_lists.size());
|
|
|
|
assert(cfds_to_flush.size() == memtables_to_flush.size());
|
|
|
|
|
|
|
|
uint64_t min_log_number_to_keep =
|
|
|
|
PrecomputeMinLogNumberToKeepNon2PC(vset, cfds_to_flush, edit_lists);
|
|
|
|
|
|
|
|
uint64_t min_log_in_prep_heap =
|
|
|
|
prep_tracker->FindMinLogContainingOutstandingPrep();
|
|
|
|
|
|
|
|
if (min_log_in_prep_heap != 0 &&
|
|
|
|
min_log_in_prep_heap < min_log_number_to_keep) {
|
|
|
|
min_log_number_to_keep = min_log_in_prep_heap;
|
|
|
|
}
|
|
|
|
|
Fix a silent data loss for write-committed txn (#9571)
2022-02-17 07:07:48 +00:00
|
|
|
uint64_t min_log_refed_by_mem =
|
|
|
|
FindMinPrepLogReferencedByMemTable(vset, memtables_to_flush);
|
2020-12-04 03:21:08 +00:00
|
|
|
|
|
|
|
if (min_log_refed_by_mem != 0 &&
|
|
|
|
min_log_refed_by_mem < min_log_number_to_keep) {
|
|
|
|
min_log_number_to_keep = min_log_refed_by_mem;
|
|
|
|
}
|
|
|
|
|
|
|
|
return min_log_number_to_keep;
|
|
|
|
}
|
|
|
|
|
2022-06-15 22:39:49 +00:00
|
|
|
void DBImpl::SetDBId(std::string&& id, bool read_only,
|
|
|
|
RecoveryContext* recovery_ctx) {
|
|
|
|
assert(db_id_.empty());
|
|
|
|
assert(!id.empty());
|
|
|
|
db_id_ = std::move(id);
|
|
|
|
if (!read_only && immutable_db_options_.write_dbid_to_manifest) {
|
|
|
|
assert(recovery_ctx != nullptr);
|
|
|
|
assert(versions_->GetColumnFamilySet() != nullptr);
|
|
|
|
VersionEdit edit;
|
|
|
|
edit.SetDBId(db_id_);
|
|
|
|
versions_->db_id_ = db_id_;
|
|
|
|
recovery_ctx->UpdateVersionEdits(
|
|
|
|
versions_->GetColumnFamilySet()->GetDefault(), edit);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactoring to make the implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN included files that failed to sync and bytes that failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOptions so BuildTable() does not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOptions::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
Compaction, db open
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
|
|
|
Status DBImpl::SetupDBId(const WriteOptions& write_options, bool read_only,
|
|
|
|
RecoveryContext* recovery_ctx) {
|
Fix a recovery corner case (#7621)
Summary:
Consider the following sequence of events:
1. Db flushed an SST with file number N, appended to MANIFEST, and tried to sync the MANIFEST.
2. Syncing MANIFEST failed and db crashed.
3. Db tried to recover with this MANIFEST. In the meantime, no entry about the newly-flushed SST was found in the MANIFEST. Therefore, RocksDB replayed WAL and tried to flush to an SST file reusing the same file number N. This failed because file system does not support overwrite. Then Db deleted this file.
4. Db crashed again.
5. Db tried to recover. When db read the MANIFEST, there was an entry referencing N.sst. This could happen probably because the append in step 1 finally reached the MANIFEST and became visible. Since N.sst had been deleted in step 3, recovery failed.
It is possible that N.sst created in step 1 is valid. Although step 3 would still fail since the MANIFEST was not synced properly in step 1 and 2, deleting N.sst would make it impossible for the db to recover even if the remaining part of MANIFEST was appended and visible after step 5.
After this PR, in step 3, immediately after recovering from MANIFEST, a new MANIFEST is created, then we find that N.sst is not referenced in the MANIFEST, so we delete it, and we'll not reuse N as file number. Then in step 5, since the new MANIFEST does not contain N.sst, the recovery failure situation in step 5 won't happen.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7621
Test Plan:
1. some tests are updated, because these tests assume that new MANIFEST is created after WAL recovery.
2. a new unit test is added in db_basic_test to simulate step 3.
Reviewed By: riversand963
Differential Revision: D24668144
Pulled By: cheng-chang
fbshipit-source-id: 90d7487fbad2bc3714f5ede46ea949895b15ae3b
2020-11-08 05:54:55 +00:00
|
|
|
Status s;
|
2022-06-15 22:39:49 +00:00
|
|
|
// Check for the IDENTITY file and create it if not there or
|
|
|
|
// broken or not matching manifest
|
|
|
|
std::string db_id_in_file;
|
|
|
|
s = fs_->FileExists(IdentityFileName(dbname_), IOOptions(), nullptr);
|
|
|
|
if (s.ok()) {
|
|
|
|
s = GetDbIdentityFromIdentityFile(&db_id_in_file);
|
|
|
|
if (s.ok() && !db_id_in_file.empty()) {
|
|
|
|
if (db_id_.empty()) {
|
|
|
|
// Loaded from file and wasn't already known from manifest
|
|
|
|
SetDBId(std::move(db_id_in_file), read_only, recovery_ctx);
|
|
|
|
return s;
|
|
|
|
} else if (db_id_ == db_id_in_file) {
|
|
|
|
// Loaded from file and matches manifest
|
|
|
|
return s;
|
Fix a recovery corner case (#7621)
2020-11-08 05:54:55 +00:00
|
|
|
}
|
2022-06-15 22:39:49 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
if (s.IsNotFound()) {
|
|
|
|
s = Status::OK();
|
|
|
|
}
|
|
|
|
if (!s.ok()) {
|
|
|
|
assert(s.IsIOError());
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
// Otherwise IDENTITY file is missing or no good.
|
|
|
|
// Generate new id if needed
|
|
|
|
if (db_id_.empty()) {
|
|
|
|
SetDBId(env_->GenerateUniqueId(), read_only, recovery_ctx);
|
|
|
|
}
|
|
|
|
// Persist it to IDENTITY file if allowed
|
|
|
|
if (!read_only) {
|
Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it now includes header and footer writes
- COUNT is 3X higher because each Append() counts as one post-PR, whereas pre-PR 3 Append()s counted as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stays the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stays the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20 ns, regressed 1.85%. "ManualCompaction" includes flush, so the isolated regression for compaction should be around 1.85 - 0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
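The regression percentages quoted above can be rechecked from the reported averages. The following is a minimal sketch, assuming the percentage is computed as (post - pre) / pre; the helper name `RegressionPercent` is hypothetical and not part of RocksDB or db_basic_bench.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not a RocksDB API): relative slowdown of `post_ns`
// versus `pre_ns`, expressed in percent. Positive means a regression.
double RegressionPercent(double pre_ns, double post_ns) {
  return (post_ns - pre_ns) / pre_ns * 100.0;
}
```

Feeding it the Flush, Compaction, and Put averages reported above gives values that round to 0.88%, 1.85%, and 0.68% respectively.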
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 23:29:23 +00:00
|
|
|
s = SetIdentityFile(write_options, env_, dbname_, db_id_);
|
Fix a recovery corner case (#7621)
Summary:
Consider the following sequence of events:
1. DB flushed an SST with file number N, appended to MANIFEST, and tried to sync the MANIFEST.
2. Syncing MANIFEST failed and DB crashed.
3. DB tried to recover with this MANIFEST. At that point, no entry about the newly-flushed SST was found in the MANIFEST. Therefore, RocksDB replayed WAL and tried to flush to an SST file reusing the same file number N. This failed because the file system does not support overwrite. Then DB deleted this file.
4. DB crashed again.
5. DB tried to recover. When DB read the MANIFEST, there was an entry referencing N.sst. This probably happened because the append in step 1 finally reached the MANIFEST and became visible. Since N.sst had been deleted in step 3, recovery failed.
It is possible that N.sst created in step 1 is valid. Although step 3 would still fail since the MANIFEST was not synced properly in steps 1 and 2, deleting N.sst would make it impossible for the DB to recover even if the remaining part of MANIFEST was appended and visible after step 5.
After this PR, in step 3, immediately after recovering from MANIFEST, a new MANIFEST is created. We then find that N.sst is not referenced in the MANIFEST, so we delete it, and we will not reuse N as a file number. Then in step 5, since the new MANIFEST does not contain N.sst, the recovery failure described in step 5 won't happen.
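The post-PR behavior in step 3 can be sketched as follows. This is an illustration under stated assumptions, not the actual RocksDB implementation: file numbers are modeled as plain integers, and `RecoveryPlan` and `PlanRecovery` are hypothetical names introduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical result type: which on-disk files to delete and which file
// number to hand out next.
struct RecoveryPlan {
  std::set<uint64_t> files_to_delete;
  uint64_t next_file_number;
};

// Any file found on disk but not referenced by the recovered MANIFEST is
// scheduled for deletion, and the next file number is advanced past every
// number seen on disk so that it is never reused.
RecoveryPlan PlanRecovery(const std::set<uint64_t>& on_disk,
                          const std::set<uint64_t>& in_manifest,
                          uint64_t manifest_next_file_number) {
  RecoveryPlan plan;
  plan.next_file_number = manifest_next_file_number;
  for (uint64_t number : on_disk) {
    if (in_manifest.count(number) == 0) {
      plan.files_to_delete.insert(number);
    }
    plan.next_file_number = std::max(plan.next_file_number, number + 1);
  }
  return plan;
}
```

With files 5, 7, and 9 on disk, a MANIFEST that references only 5 and 7, and a recorded next file number of 9, the plan deletes file 9 and continues numbering at 10.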
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7621
Test Plan:
1. Some tests are updated, because these tests assume that a new MANIFEST is created after WAL recovery.
2. A new unit test is added in db_basic_test to simulate step 3.
Reviewed By: riversand963
Differential Revision: D24668144
Pulled By: cheng-chang
fbshipit-source-id: 90d7487fbad2bc3714f5ede46ea949895b15ae3b
2020-11-08 05:54:55 +00:00
|
|
|
}
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
2024-05-01 19:26:54 +00:00
|
|
|
std::set<std::string> DBImpl::CollectAllDBPaths() {
|
|
|
|
std::set<std::string> all_db_paths;
|
|
|
|
all_db_paths.insert(NormalizePath(dbname_));
|
2020-03-21 02:17:54 +00:00
|
|
|
for (const auto& db_path : immutable_db_options_.db_paths) {
|
2024-05-01 19:26:54 +00:00
|
|
|
all_db_paths.insert(NormalizePath(db_path.path));
|
2020-03-21 02:17:54 +00:00
|
|
|
}
|
|
|
|
for (const auto* cfd : *versions_->GetColumnFamilySet()) {
|
|
|
|
for (const auto& cf_path : cfd->ioptions()->cf_paths) {
|
2024-05-01 19:26:54 +00:00
|
|
|
all_db_paths.insert(NormalizePath(cf_path.path));
|
2020-03-21 02:17:54 +00:00
|
|
|
}
|
|
|
|
}
|
2024-05-01 19:26:54 +00:00
|
|
|
return all_db_paths;
|
|
|
|
}
|
2020-03-21 02:17:54 +00:00
|
|
|
|
2024-05-01 19:26:54 +00:00
|
|
|
Status DBImpl::MaybeUpdateNextFileNumber(RecoveryContext* recovery_ctx) {
|
|
|
|
mutex_.AssertHeld();
|
2020-03-21 02:17:54 +00:00
|
|
|
uint64_t next_file_number = versions_->current_next_file_number();
|
|
|
|
uint64_t largest_file_number = next_file_number;
|
2020-09-29 22:25:31 +00:00
|
|
|
Status s;
|
2024-05-01 19:26:54 +00:00
|
|
|
for (const auto& path : CollectAllDBPaths()) {
|
2020-03-21 02:17:54 +00:00
|
|
|
std::vector<std::string> files;
|
2020-09-29 22:25:31 +00:00
|
|
|
s = env_->GetChildren(path, &files);
|
|
|
|
if (!s.ok()) {
|
|
|
|
break;
|
|
|
|
}
|
2020-03-21 02:17:54 +00:00
|
|
|
for (const auto& fname : files) {
|
|
|
|
uint64_t number = 0;
|
|
|
|
FileType type;
|
|
|
|
if (!ParseFileName(fname, &number, &type)) {
|
|
|
|
continue;
|
|
|
|
}
|
2024-05-01 19:26:54 +00:00
|
|
|
const std::string normalized_fpath = path + kFilePathSeparator + fname;
|
2020-03-21 02:17:54 +00:00
|
|
|
largest_file_number = std::max(largest_file_number, number);
|
2024-05-01 19:26:54 +00:00
|
|
|
if ((type == kTableFile || type == kBlobFile)) {
|
|
|
|
recovery_ctx->existing_data_files_.push_back(normalized_fpath);
|
2020-03-21 02:17:54 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2020-09-29 22:25:31 +00:00
|
|
|
if (!s.ok()) {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
Handle rename() failure in non-local FS (#8192)
Summary:
In a distributed environment, a file `rename()` operation can succeed on the server (remote)
side, but the client can still return a non-ok status to RocksDB. Possible reasons include
network partition, connection issue, etc. This happens in `rocksdb::SetCurrentFile()`, which
can be called in `LogAndApply() -> ProcessManifestWrites()` if RocksDB tries to switch to a
new MANIFEST. We currently always delete the new MANIFEST if an error occurs.
This is problematic in a distributed world. If the server side successfully updates the CURRENT
file via renaming, then a subsequent `DB::Open()` will try to look for the new MANIFEST and fail.
As a fix, we can track the execution result of IO operations on the new MANIFEST.
- If IO operations on the new MANIFEST fail, then we know the CURRENT must point to the original
MANIFEST. Therefore, it is safe to remove the new MANIFEST.
- If IO operations on the new MANIFEST all succeed, but somehow we end up in the cleanup
code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For a local
POSIX-compliant FS, it should still point to the old MANIFEST, but it does not matter if we keep the
new MANIFEST.) Therefore, we keep the new MANIFEST.
- Any future `LogAndApply()` will switch to a new MANIFEST and update CURRENT.
- If the process reopens the db immediately after the failure, then the CURRENT file can point
to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can
succeed and ignore the other.
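The cleanup rule in the list above can be sketched as a single decision. This is a simplified illustration only: `ManifestCleanup` and `DecideManifestCleanup` are hypothetical names, and the real code tracks IO status inside the MANIFEST write path rather than through a boolean.

```cpp
#include <cassert>

// Hypothetical enum: what to do with the new MANIFEST once the cleanup
// code block is reached.
enum class ManifestCleanup { kDeleteNewManifest, kKeepNewManifest };

// `new_manifest_io_ok` is assumed to be true only if every IO operation on
// the new MANIFEST succeeded. If one failed, CURRENT must still point to the
// old MANIFEST, so the new one is safe to delete. Otherwise CURRENT may
// already point to the new MANIFEST (for example, after a rename() that
// succeeded remotely but reported failure), so the new one is kept.
ManifestCleanup DecideManifestCleanup(bool new_manifest_io_ok) {
  return new_manifest_io_ok ? ManifestCleanup::kKeepNewManifest
                            : ManifestCleanup::kDeleteNewManifest;
}
```

Keeping the file in the uncertain case costs only some disk space until a later `LogAndApply()` switches to another MANIFEST, whereas deleting it could leave CURRENT pointing at a file that no longer exists.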
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8192
Test Plan: make check
Reviewed By: zhichao-cao
Differential Revision: D27804648
Pulled By: riversand963
fbshipit-source-id: 9c16f2a5ce41bc6aadf085e48449b19ede8423e4
2021-04-20 01:10:23 +00:00
|
|
|
if (largest_file_number >= next_file_number) {
|
2020-03-21 02:17:54 +00:00
|
|
|
versions_->next_file_number_.store(largest_file_number + 1);
|
|
|
|
}
|
2020-04-23 23:18:28 +00:00
|
|
|
|
|
|
|
VersionEdit edit;
|
|
|
|
edit.SetNextFile(versions_->next_file_number_.load());
|
|
|
|
assert(versions_->GetColumnFamilySet());
|
|
|
|
ColumnFamilyData* default_cfd = versions_->GetColumnFamilySet()->GetDefault();
|
|
|
|
assert(default_cfd);
|
Persist the new MANIFEST after successfully syncing the new WAL during recovery (#9922)
Summary:
In case of non-TransactionDB and avoid_flush_during_recovery = true, RocksDB won't
flush the data from WAL to L0 for all column families if possible. As a
result, not all column families can increase their log_numbers, and
min_log_number_to_keep won't change.
For a transaction DB (.allow_2pc), even with the flush, there may be old WAL files that it must not delete because they can contain data of uncommitted transactions, so min_log_number_to_keep won't change.
If we persist a new MANIFEST with
advanced log_numbers for some column families, then during a second
crash after persisting the MANIFEST, RocksDB will see some column
families' log_numbers larger than the corrupted WAL, and the "column family inconsistency" error will be hit, causing recovery to fail.
As a solution, RocksDB will persist the new MANIFEST after successfully syncing the new WAL.
If a future recovery starts from the new MANIFEST, then it means the new WAL was successfully synced. Due to the sentinel empty write batch at the beginning, kPointInTimeRecovery of WAL is guaranteed to go after this point.
If a future recovery starts from the old MANIFEST, it means writing the new MANIFEST failed. We won't have the "SST ahead of WAL" error.
Currently, RocksDB DB::Open() may create and write to two new MANIFEST files even before recovery succeeds. This PR buffers the edits in a structure and writes to a new MANIFEST after recovery is successful.
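The buffering idea can be sketched as below. This is a toy model under stated assumptions: `BufferedRecovery` is a hypothetical stand-in rather than RocksDB's `RecoveryContext`, edits are modeled as strings, and the "MANIFEST" is an in-memory vector.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the structure that buffers edits during recovery.
struct BufferedRecovery {
  std::vector<std::string> pending_edits;  // collected while recovering
  std::vector<std::string> manifest;       // stand-in for the persisted MANIFEST

  // Record an edit without persisting anything yet.
  void AddEdit(std::string edit) { pending_edits.push_back(std::move(edit)); }

  // Persist the buffered edits only once the new WAL is known to be synced.
  // Returns false and leaves the MANIFEST untouched otherwise.
  bool Finish(bool new_wal_synced) {
    if (!new_wal_synced) {
      return false;
    }
    manifest.insert(manifest.end(), pending_edits.begin(), pending_edits.end());
    pending_edits.clear();
    return true;
  }
};
```

Because nothing reaches the MANIFEST until `Finish(true)`, a crash before the WAL sync leaves the old MANIFEST as the only one a later recovery can start from.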
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9922
Test Plan:
1. Update unit tests to fail without this change
2. make crash_test -j
Branch with unit test and no fix https://github.com/facebook/rocksdb/pull/9942 to keep track of unit test (without fix)
Reviewed By: riversand963
Differential Revision: D36043701
Pulled By: akankshamahajan15
fbshipit-source-id: 5760970db0a0920fb73d3c054a4155733500acd9
2022-06-01 17:52:26 +00:00
|
|
|
recovery_ctx->UpdateVersionEdits(default_cfd, edit);
|
2020-03-21 02:17:54 +00:00
|
|
|
return s;
|
|
|
|
}
|
2020-02-20 20:07:53 +00:00
|
|
|
} // namespace ROCKSDB_NAMESPACE
|