mirror of https://github.com/facebook/rocksdb.git
636 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
Peter Dillinger | 4e60663b31 |
Remove unnecessary, confusing 'extern' (#12300)
Summary: In C++, `extern` is redundant in a number of cases: * "Global" function declarations and definitions * "Global" variable definitions when already declared `extern` For consistency and simplicity, I've removed these in code that *we own*. In a couple of cases, I removed obsolete declarations, and for MagicNumber constants, I have consolidated the declarations into a header file (format.h) as standard best practice would prescribe. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12300 Test Plan: no functional changes, CI Reviewed By: ajkr Differential Revision: D53148629 Pulled By: pdillinger fbshipit-source-id: fb8d927959892e03af09b0c0d542b0a3b38fd886 |
|
Peter Dillinger | f046a8f617 |
Deflake ColumnFamilyTest.WriteStallSingleColumnFamily (#12294)
Summary: https://github.com/facebook/rocksdb/issues/12267 apparently introduced a data race in test code where a background read of estimated_compaction_needed_bytes while holding the DB mutex could race with forground write for testing purposes. This change adds the DB mutex to those writes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12294 Test Plan: 1000 TSAN runs of test (massively fails before change, passes after) Reviewed By: ajkr Differential Revision: D53095483 Pulled By: pdillinger fbshipit-source-id: 13fcb383ebad313dabe39eb8f9085c34d370b54a |
|
Peter Dillinger | cb08a682d4 |
Fix/cleanup SeqnoToTimeMapping (#12253)
Summary: The SeqnoToTimeMapping class (RocksDB internal) used by the preserve_internal_time_seconds / preclude_last_level_data_seconds options was essentially in a prototype state with some significant flaws that would risk biting us some day. This is a big, complicated change because both the implementation and the behavioral requirements of the class needed to be upgraded together. In short, this makes SeqnoToTimeMapping more internally responsible for maintaining good invariants, so that callers don't easily encounter dangerous scenarios. * Some API functions were confusingly named and structured, so I fully refactored the APIs to use clear naming (e.g. `DecodeFrom` and `CopyFromSeqnoRange`), object states, function preconditions, etc. * Previously the object could informally be sorted / compacted or not, and there was limited checking or enforcement on these states. Now there's a well-defined "enforced" state that is consistently checked in debug mode for applicable operations. (I attempted to create a separate "builder" class for unenforced states, but IIRC found that more cumbersome for existing uses than it was worth.) * Previously operations would coalesce data in a way that was better for `GetProximalTimeBeforeSeqno` than for `GetProximalSeqnoBeforeTime` which is odd because the latter is the only one used by DB code currently (what is the seqno cut-off for data definitely older than this given time?). This is now reversed to consistently favor `GetProximalSeqnoBeforeTime`, with that logic concentrated in one place: `SeqnoToTimeMapping::SeqnoTimePair::Merge()`. Unfortunately, a lot of unit test logic was specifically testing the old, suboptimal behavior. * Previously, the natural behavior of SeqnoToTimeMapping was to THROW AWAY data needed to get reasonable answers to the important `GetProximalSeqnoBeforeTime` queries. This is because SeqnoToTimeMapping only had a FIFO policy for staying within the entry capacity (except in aggregate+sort+serialize mode). If the DB wasn't extremely careful to avoid gathering too many time mappings, it could lose track of where the seqno cutoff was for cold data (`GetProximalSeqnoBeforeTime()` returning 0) and preventing all further data migration to the cold tier--until time passes etc. for mappings to catch up with FIFO purging of them. (The problem is not so acute because SST files contain relevant snapshots of the mappings, but the problem would apply to long-lived memtables.) * Now the SeqnoToTimeMapping class has fully-integrated smarts for keeping a sufficiently complete history, within capacity limits, to give good answers to `GetProximalSeqnoBeforeTime` queries. * Fixes old `// FIXME: be smarter about how we erase to avoid data falling off the front prematurely.` * Fix an apparent bug in how entries are selected for storing into SST files. Previously, it only selected entries within the seqno range of the file, but that would easily leave a gap at the beginning of the timeline for data in the file for the purposes of answering GetProximalXXX queries with reasonable accuracy. This could probably lead to the same problem discussed above in naively throwing away entries in FIFO order in the old SeqnoToTimeMapping. The updated testing of GetProximalSeqnoBeforeTime in BasicSeqnoToTimeMapping relies on the fixed behavior. * Fix a potential compaction CPU efficiency/scaling issue in which each compaction output file would iterate over and sort all seqno-to-time mappings from all compaction input files. Now we distill the input file entries to a constant size before processing each compaction output file. Intended follow-up (me or others): * Expand some direct testing of SeqnoToTimeMapping APIs. Here I've focused on updating existing tests to make sense. * There are likely more gaps in availability of needed SeqnoToTimeMapping data when the DB shuts down and is restarted, at least with WAL. * The data tracked in the DB could be kept more accurate and limited if it used the oldest seqno of unflushed data. This might require some more API refactoring. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12253 Test Plan: unit tests updated Reviewed By: jowlyzhang Differential Revision: D52913733 Pulled By: pdillinger fbshipit-source-id: 020737fcbbe6212f6701191a6ab86565054c9593 |
|
马越 | 1a1f9f1660 |
Fix the compactRange with wrong cf handle when ClipColumnFamily (#12219)
Summary: - **Context**: In ClipColumnFamily, the DeleteRange API will be used to delete data, and then CompactRange will be called for physical deletion. But now However, the ColumnFamilyHandle is not passed , so by default only the DefaultColumnFamily will be CompactRanged. Therefore, it may cause that the data in some sst files of CompactionRange cannot be physically deleted. - **In this change** Pass the ColumnFamilyHandle when call CompactRange Pull Request resolved: https://github.com/facebook/rocksdb/pull/12219 Reviewed By: ajkr Differential Revision: D52665162 Pulled By: cbi42 fbshipit-source-id: e8e997aa25ec4ca40e347be89edc7e84a7a0edce |
|
Hui Xiao | 06e593376c |
Group SST write in flush, compaction and db open with new stats (#11910)
Summary: ## Context/Summary Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity. For that, this PR does the following: - Tag different write IOs by passing down and converting WriteOptions to IOOptions - Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS Some related code refactory to make implementation cleaner: - Blob stats - Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info. - Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write. - Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority - Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification - Build table - TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables - Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder. This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more - Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority ## Test ### db bench Flush ``` ./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100 rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 ``` compaction, db oopen ``` Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279 rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213 rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66 ``` blob stats - just to make sure they aren't broken by this PR ``` Integrated Blob DB Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600 rocksdb.blobdb.blob.file.synced COUNT : 1 rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same) ``` ``` Stacked Blob DB Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876 rocksdb.blobdb.blob.file.synced COUNT : 8 rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same) ``` ### Rehearsal CI stress test Trigger 3 full runs of all our CI stress tests ### Performance Flush ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark; enable_statistics = true Pre-pr: avg 507515519.3 ns 497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908, Post-pr: avg 511971266.5 ns, regressed 0.88% 502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408, ``` Compaction ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 495346098.30 ns 492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846 Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97% 502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007 ``` Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats) ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 3848.10 ns 3814,3838,3839,3848,3854,3854,3854,3860,3860,3860 Post-pr: avg 3874.20 ns, regressed 0.68% 3863,3867,3871,3874,3875,3877,3877,3877,3880,3881 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910 Reviewed By: ajkr Differential Revision: D49788060 Pulled By: hx235 fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff |
|
anand76 | a036525809 |
Lightweight verification of MANIFEST file after close on shutdown (#12174)
Summary: Do a size verification on the MANIFEST file during DB shutdown, after closing the file. If the verification fails, write a new MANIFEST file. In the future, we can do a more thorough verification if we want to. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12174 Test Plan: Unit test, and some manual verification Reviewed By: ajkr Differential Revision: D52451184 Pulled By: anand1976 fbshipit-source-id: fc3bc170e22f6c9a9c482ee5ff592abab889df83 |
|
Peter Dillinger | a771a47a1b |
Fix leak or crash on failure in automatic atomic flush (#12176)
Summary: Through code inspection in debugging an apparent leak of ColumnFamilyData in the crash test, I found a case where too few UnrefAndTryDelete() could be called on a cfd. This fixes that case, which would fail like this in the new unit test: ``` db_flush_test: db/column_family.cc:1648: rocksdb::ColumnFamilySet::~ColumnFamilySet(): Assertion `last_ref' failed. ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12176 Test Plan: unit test added Reviewed By: cbi42 Differential Revision: D52417071 Pulled By: pdillinger fbshipit-source-id: 4ee33c918409cf9c1968f138e273d3347a6cc8e5 |
|
Akanksha Mahajan | cd577f6059 |
Fix WRITE_STALL start_time (#12147)
Summary:
`Delayed` is set true in two cases. One is when `delay` is specified. Other one is in the `while` loop -
|
|
Kevin Mingtarja | 44fd914128 |
Fix double counting of BYTES_WRITTEN ticker (#12111)
Summary: Fixes https://github.com/facebook/rocksdb/issues/12061. We were double counting the `BYTES_WRITTEN` ticker when doing writes with transactions. During transactions, after writing, a client can call `Prepare()`, which writes the values to WAL but not to the Memtable. After that, they can call `Commit()`, which writes a commit marker to the WAL and the values to Memtable. The cause of this bug is previously during writes, we didn't take into account `writer->ShouldWriteToMemtable()` before adding to `total_byte_size`, so it is still added to during the `Prepare()` phase even though we're not writing to the Memtable, which was why we saw the value to be double of what's written to WAL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12111 Test Plan: Added a test in `db/db_statistics_test.cc` that tests writes with and without transactions, by comparing the values of `BYTES_WRITTEN` and `WAL_FILE_BYTES` after doing writes. Reviewed By: jaykorean Differential Revision: D51954327 Pulled By: jowlyzhang fbshipit-source-id: 57a0986a14e5b94eb5188715d819212529110d2c |
|
Yu Zhang | d68f45e777 |
Flush buffered logs when FlushRequest is rescheduled (#12105)
Summary: The optimization to not find and delete obsolete files when FlushRequest is re-scheduled also inadvertently skipped flushing the `LogBuffer`, resulting in missed logs. This PR fixes the issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12105 Test Plan: manually check this test has the correct info log after the fix `./column_family_test --gtest_filter=ColumnFamilyRetainUDTTest.NotAllKeysExpiredFlushRescheduled` Reviewed By: ajkr Differential Revision: D51671079 Pulled By: jowlyzhang fbshipit-source-id: da0640e07e35c69c08988772ed611ec9e67f2e92 |
|
Jay Huh | 8b8f6c63ef |
ColumnFamilyHandle Nullcheck in GetEntity and MultiGetEntity (#12057)
Summary: - Add missing null check for ColumnFamilyHandle in `GetEntity()` - `FailIfCfHasTs()` now returns `Status::InvalidArgument()` if `column_family` is null. `MultiGetEntity()` can rely on this for cfh null check. - Added `DeleteRange` API using Default Column Family to be consistent with other major APIs (This was also causing Java Test failure after the `FailIfCfHasTs()` change) Pull Request resolved: https://github.com/facebook/rocksdb/pull/12057 Test Plan: - Updated `DBWideBasicTest::GetEntityAsPinnableAttributeGroups` to include null CF case - Updated `DBWideBasicTest::MultiCFMultiGetEntityAsPinnableAttributeGroups` to include null CF case Reviewed By: jowlyzhang Differential Revision: D51167445 Pulled By: jaykorean fbshipit-source-id: 1c1e44fd7b7df4d2dc3bb2d7d251da85bad7d664 |
|
leipeng | b3ffca0e29 |
DBImpl::DelayWrite: Remove bad WRITE_STALL histogram (#12067)
Summary: When delay didn't happen, histogram WRITE_STALL is still recorded, and ticker STALL_MICROS is not recorded. This is a bug, neither WRITE_STALL or STALL_MICROS should not be recorded when delay did not happen. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12067 Reviewed By: cbi42 Differential Revision: D51263133 Pulled By: ajkr fbshipit-source-id: bd82d8328fe088d613991966e83854afdabc6a25 |
|
Yu Zhang | 509947ce2c |
Quarantine files in a limbo state after a manifest error (#12030)
Summary: Part of the procedures to handle manifest IO error is to disable file deletion in case some files in limbo state get deleted prematurely. This is not ideal because: 1) not all the VersionEdits whose commit encounter such an error contain updates for files, disabling file deletion sometimes are not necessary. 2) `EnableFileDeletion` has a force mode that could make other threads accidentally disrupt this procedure in recovery. 3) Disabling file deletion as a whole is also not as efficient as more precisely tracking impacted files from being prematurely deleted. This PR replaces this mechanism with tracking such files and quarantine them from being deleted in `ErrorHandler`. These are the types of files being actively tracked in quarantine in this PR: 1) new table files and blob files from a background job 2) old manifest file whose immediately following new manifest file's CURRENT file creation gets into unclear state. Current handling is not sufficient to make sure the old manifest file is kept in case it's needed. Note that WAL logs are not part of the quarantine because `min_log_number_to_keep` is a safe mechanism and it's only updated after successful manifest commits so it can prevent this premature deletion issue from happening. We track these files' file numbers because they share the same file number space. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12030 Test Plan: Modified existing unit tests Reviewed By: ajkr Differential Revision: D51036774 Pulled By: jowlyzhang fbshipit-source-id: 84ef26271fbbc888ef70da5c40fe843bd7038716 |
|
brodyhuang | e90e9825b4 |
Drop wal record when sequence is illegal (#11985)
Summary: - Our database is corrupted, causing some sequences of wal record to be invalid (but the `record_checksum` looks fine). - When we RecoverLogFiles in WALRecoveryMode::kPointInTimeRecovery, `assert(seq <= kMaxSequenceNumber)` will be failed. - When it is found that sequence is illegal, can we drop the file to recover as much data as possible ? Thx ! Pull Request resolved: https://github.com/facebook/rocksdb/pull/11985 Reviewed By: anand1976 Differential Revision: D50698039 Pulled By: ajkr fbshipit-source-id: 1e42113b58823088d7c0c3a92af5b3efbb5f5296 |
|
Jay Huh | 2adef5367a |
AttributeGroups - PutEntity Implementation (#11977)
Summary: Write Path for AttributeGroup Support. The new `PutEntity()` API uses `WriteBatch` and atomically writes WideColumns entities in multiple Column Families. Combined the release note from PR https://github.com/facebook/rocksdb/issues/11925 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11977 Test Plan: - `DBWideBasicTest::MultiCFMultiGetEntityAsPinnableAttributeGroups` updated - `WriteBatchTest::AttributeGroupTest` added - `WriteBatchTest::AttributeGroupSavePointTest` added Reviewed By: ltamasi Differential Revision: D50457122 Pulled By: jaykorean fbshipit-source-id: 4997b265e415588ce077933082dcd1ac3eeae2cd |
|
Jay Huh | 0ecfc4fbb4 |
AttributeGroups - GetEntity Implementation (#11943)
Summary: Implementation of `GetEntity()` API that returns wide-column entities as AttributeGroups from multiple column families for a single key. Regarding the definition of Attribute groups, please see the detailed example description in PR https://github.com/facebook/rocksdb/issues/11925 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11943 Test Plan: - `DBWideBasicTest::GetEntityAsPinnableAttributeGroups` added will enable the new API in the `db_stress` after merging Reviewed By: ltamasi Differential Revision: D50195794 Pulled By: jaykorean fbshipit-source-id: 218d54841ac7e337de62e13b1233b0a99bd91af3 |
|
Jay Huh | 2dab137182 |
Mark more files for periodic compaction during offpeak (#12031)
Summary: - The struct previously named `OffpeakTimeInfo` has been renamed to `OffpeakTimeOption` to indicate that it's a user-configurable option. Additionally, a new struct, `OffpeakTimeInfo`, has been introduced, which includes two fields: `is_now_offpeak` and `seconds_till_next_offpeak_start`. This change prevents the need to parse the `daily_offpeak_time_utc` string twice. - It's worth noting that we may consider adding more fields to the `OffpeakTimeInfo` struct, such as `elapsed_seconds` and `total_seconds`, as needed for further optimization. - Within `VersionStorageInfo::ComputeFilesMarkedForPeriodicCompaction()`, we've adjusted the `allowed_time_limit` to include files that are expected to expire by the next offpeak start. - We might explore further optimizations, such as evenly distributing files to mark during offpeak hours, if the initial approach results in marking too many files simultaneously during the first scoring in offpeak hours. The primary objective of this PR is to prevent periodic compactions during non-offpeak hours when offpeak hours are configured. We'll start with this straightforward solution and assess whether it suffices for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12031 Test Plan: Unit Tests added - `DBCompactionTest::LevelPeriodicCompactionOffpeak` for Leveled - `DBTestUniversalCompaction2::PeriodicCompaction` for Universal Reviewed By: cbi42 Differential Revision: D50900292 Pulled By: jaykorean fbshipit-source-id: 267e7d3332d45a5d9881796786c8650fa0a3b43d |
|
Yu Zhang | 60df39e530 |
Rate limiting stale sst files' deletion during recovery (#12016)
Summary: As titled. If SstFileManager is available, deleting stale sst files will be delegated to it so it can be rate limited. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12016 Reviewed By: hx235 Differential Revision: D50670482 Pulled By: jowlyzhang fbshipit-source-id: bde5b76ea1d98e67f6b4f08bfba3db48e46aab4e |
|
Jay Huh | e230e4d248 |
Make OffpeakTimeInfo available in VersionSet (#12018)
Summary: As mentioned in https://github.com/facebook/rocksdb/issues/11893, we are going to use the offpeak time information to pre-process TTL-based compactions. To do so, we need to access `daily_offpeak_time_utc` in `VersionStorageInfo::ComputeCompactionScore()` where we pick the files to compact. This PR is to make the offpeak time information available at the time of compaction-scoring. We are not changing any compaction scoring logic just yet. Will follow up in a separate PR. There were two ways to achieve what we want. 1. Make `MutableDBOptions` available in `ColumnFamilyData` and `ComputeCompactionScore()` take `MutableDBOptions` along with `ImmutableOptions` and `MutableCFOptions`. 2. Make `daily_offpeak_time_utc` and `IsNowOffpeak()` available in `VersionStorageInfo`. We chose the latter as it involves smaller changes. This change includes the following - Introduction of `OffpeakTimeInfo` and `IsNowOffpeak()` has been moved from `MutableDBOptions` - `OffpeakTimeInfo` added to `VersionSet` and it can be set during construction and by `ChangeOffpeakTimeInfo()` - During `SetDBOptions()`, if offpeak time info needs to change, it calls `MaybeScheduleFlushOrCompaction()` to re-compute compaction scores and process compactions as needed Pull Request resolved: https://github.com/facebook/rocksdb/pull/12018 Test Plan: - `DBOptionsTest::OffpeakTimes` changed to include checks for `MaybeScheduleFlushOrCompaction()` calls and `VersionSet`'s OffpeakTimeInfo value change during `SetDBOptions()`. - `VersionSetTest::OffpeakTimeInfoTest` added to test `ChangeOffpeakTimeInfo()`. `IsNowOffpeak()` tests moved from `DBOptionsTest::OffpeakTimes` Reviewed By: pdillinger Differential Revision: D50723881 Pulled By: jaykorean fbshipit-source-id: 3cff0291936f3729c0e9c7750834b9378fb435f6 |
|
Myth | 0ff7665c95 |
Fix low priority write may cause crash when it is rate limited (#11932)
Summary: Fixed https://github.com/facebook/rocksdb/issues/11902 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11932 Reviewed By: akankshamahajan15 Differential Revision: D50573356 Pulled By: hx235 fbshipit-source-id: adeb1abdc43b523b0357746055ce4a2eabde56a1 |
|
Peter Dillinger | 4155087746 |
Use manifest to persist pre-allocated seqnos (#11995)
Summary: ... and other fixes for crash test after https://github.com/facebook/rocksdb/issues/11922. * When pre-allocating sequence numbers for establishing a time history, record that last sequence number in the manifest so that it is (most likely) restored on recovery even if no user writes were made or were recovered (e.g. no WAL). * When pre-allocating sequence numbers for establishing a time history, only do this for actually new DBs. * Remove the feature that ensures non-zero sequence number on creating the first column family with preserve/preclude option after initial DB::Open. Until fixed in a way compatible with the crash test, this creates a gap where some data written with active preserve/preclude option won't have a known associated time. Together, these ensure we don't upset the crash test by manipulating sequence numbers after initial DB creation (esp when re-opening with different options). (The crash test expects that the seqno after re-open corresponds to a known point in time from previous crash test operation, matching an expected DB state.) Follow-up work: * Re-fill the gap to ensure all data written under preserve/preclude settings have a known time estimate. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11995 Test Plan: Added to unit test SeqnoTimeTablePropTest.PrePopulateInDB Verified fixes two crash test scenarios: ## 1st reproducer First apply ``` diff --git a/db_stress_tool/expected_state.cc b/db_stress_tool/expected_state.cc index b483e154c..ef63b8d6c 100644 --- a/db_stress_tool/expected_state.cc +++ b/db_stress_tool/expected_state.cc @@ -333,6 +333,7 @@ Status FileExpectedStateManager::SaveAtAndAfter(DB* db) { s = NewFileTraceWriter(Env::Default(), soptions, trace_file_path, &trace_writer); } + if (getenv("CRASH")) assert(false); if (s.ok()) { TraceOptions trace_opts; trace_opts.filter |= kTraceFilterGet; ``` Then ``` mkdir -p /dev/shm/rocksdb_test/rocksdb_crashtest_expected mkdir -p /dev/shm/rocksdb_test/rocksdb_crashtest_whitebox rm -rf /dev/shm/rocksdb_test/rocksdb_crashtest_*/* CRASH=1 ./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=1 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --preserve_internal_time_seconds=36000 ./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=0 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --preserve_internal_time_seconds=0 ``` Without the fix you get ``` ... DB path: [/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox] (Re-)verified 34 unique IDs Error restoring historical expected values: Corruption: DB is older than any restorable expected state ``` ## 2nd reproducer First apply ``` diff --git a/db_stress_tool/db_stress_test_base.cc b/db_stress_tool/db_stress_test_base.cc index 62ddead7b..f2654980f 100644 --- a/db_stress_tool/db_stress_test_base.cc +++ b/db_stress_tool/db_stress_test_base.cc @@ -1126,6 +1126,7 @@ void StressTest::OperateDb(ThreadState* thread) { // OPERATION write TestPut(thread, write_opts, read_opts, rand_column_families, rand_keys, value); + if (getenv("CRASH")) assert(false); } else if (prob_op < del_bound) { assert(write_bound <= prob_op); // OPERATION delete ``` Then ``` rm -rf /dev/shm/rocksdb_test/rocksdb_crashtest_*/* CRASH=1 ./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=1 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --disable_wal=1 --reopen=0 --preserve_internal_time_seconds=0 ./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=0 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --disable_wal=1 --reopen=0 --preserve_internal_time_seconds=3600 ``` Without the fix you get ``` DB path: [/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox] (Re-)verified 34 unique IDs db_stress: db_stress_tool/expected_state.cc:380: virtual rocksdb::{anonymous}::ExpectedStateTraceRecordHandler::~ ExpectedStateTraceRecordHandler(): Assertion `IsDone()' failed. ``` Reviewed By: jowlyzhang Differential Revision: D50533346 Pulled By: pdillinger fbshipit-source-id: 1056be45c5b9e537c8c601b28c4b27431a782477 |
|
Hui Xiao | 0836a2b26d |
New tickers on deletion compactions grouped by reasons (#11957)
Summary: Context/Summary: as titled Pull Request resolved: https://github.com/facebook/rocksdb/pull/11957 Test Plan: piggyback on existing tests; fixed a failed test due to adding new stats Reviewed By: ajkr, cbi42 Differential Revision: D50294310 Pulled By: hx235 fbshipit-source-id: d99b97ebac41efc1bdeaf9ca7a1debd2927d54cd |
|
Yu Zhang | 933ee295f4 |
Fix a race condition between recovery and backup (#11955)
Summary: A race condition between recovery and backup can happen with error messages like this: ```Failure in BackupEngine::CreateNewBackup with: IO error: No such file or directory: While opening a file for sequentially reading: /dev/shm/rocksdb_test/rocksdb_crashtest_whitebox/002653.log: No such file or directory``` PR https://github.com/facebook/rocksdb/issues/6949 introduced disabling file deletion during error handling of manifest IO errors. Aformentioned race condition is caused by this chain of event: [Backup engine] disable file deletion [Recovery] disable file deletion <= this is optional for the race condition, it may or may not get called [Backup engine] get list of file to copy/link [Recovery] force enable file deletion .... some files refered by backup engine get deleted [Backup engine] copy/link file <= error no file found This PR fixes this with: 1) Recovery thread is currently forcing enabling file deletion as long as file deletion is disabled. Regardless of whether the previous error handling is for manifest IO error and that disabled it in the first place. This means it could incorrectly enabling file deletions intended by other threads like backup threads, file snapshotting threads. This PR does this check explicitly before making the call. 2) `disable_delete_obsolete_files_` is designed as a counter to allow different threads to enable and disable file deletion separately. The recovery thread currently does a force enable file deletion, because `ErrorHandler::SetBGError()` can be called multiple times by different threads when they receive a manifest IO error(details per PR https://github.com/facebook/rocksdb/issues/6949), resulting in `DBImpl::DisableFileDeletions` to be called multiple times too. Making a force enable file deletion call that resets the counter `disable_delete_obsolete_files_` to zero is a workaround for this. However, as it shows in the race condition, it can incorrectly suppress other threads like a backup thread's intention to keep the file deletion disabled. <strike>This PR adds a `std::atomic<int> disable_file_deletion_count_` to the error handler to track the needed counter decrease more precisely</strike>. This PR tracks and caps file deletion enabling/disabling in error handler. 3) for recovery, the section to find obsolete files and purge them was moved to be done after the attempt to enable file deletion. The actual finding and purging is more likely to happen if file deletion was previously disabled and get re-enabled now. An internal function `DBImpl::EnableFileDeletionsWithLock` was added to support change 2) and 3). Some useful logging was explicitly added to keep those log messages around. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11955 Test Plan: existing unit tests Reviewed By: anand1976 Differential Revision: D50290592 Pulled By: jowlyzhang fbshipit-source-id: 73aa8331ca4d636955a5b0324b1e104a26e00c9b |
|
Peter Dillinger | 2fd850c7eb |
Remove write queue synchronization from WriteOptionsFile (#11951)
Summary: This has become obsolete with the new `options_mutex_` in https://github.com/facebook/rocksdb/pull/11929 * Remove now-unnecessary parameter from WriteOptionsFile * Rename (and negate) other parameter for better clarity (the caller shouldn't tell the callee what the callee needs, just what the caller knows, provides, and requests) * Move a ROCKS_LOG_WARN (I/O) in WriteOptionsFile to outside of holding DB mutex. * Also *avoid* (but not always eliminate) write queue synchronization in SetDBOptions. Still needed if there was a change to WAL size limit or other configuration. * Improve some comments Pull Request resolved: https://github.com/facebook/rocksdb/pull/11951 Test Plan: existing unit tests and TSAN crash test local run Reviewed By: ajkr Differential Revision: D50247904 Pulled By: pdillinger fbshipit-source-id: 7dfe445c705ec013886a2adb7c50abe50d83af69 |
|
Changyu Bi | f3aef8cad7 |
Add write operation to tracer only after successful callback (#11954)
Summary: We saw optimistic transaction stress test failures like the following: ``` Verification failed for column family 0 key 000000000001E9AF000000000000012B00000000000000B5 (12535491): value_from_db: 010000000504070609080B0A0D0C0F0E111013121514171619181B1A1D1C1F1E212023222524272629282B2A2D2C2F2E313033323534373639383B3A3D3C3F3E, value_from_expected: , msg: Iterator verification: Unexpected value found``` ``` With ajkr's repro (see test plan), I found that we record duplicated writes to tracer when an optimistic transaction conflict checking fails. This PR fixes it by checking callback status before record a write operation to tracer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11954 Test Plan: this reproduces the failure consistently ``` #!/bin/bash db=/dev/shm/rocksdb_crashtest_blackbox exp=/dev/shm/rocksdb_crashtest_expected rm -rf $db $exp && mkdir -p $exp && while ./db_stress \ --atomic_flush=1 \ --clear_column_family_one_in=0 \ --db=$db \ --db_write_buffer_size=2097152 \ --delpercent=0 \ --delrangepercent=0 \ --destroy_db_initially=0 \ --disable_wal=1 \ --expected_values_dir=$exp \ --iterpercent=0 \ --max_bytes_for_level_base=2097152 \ --max_key=250000 \ --memtable_prefix_bloom_size_ratio=0.5 \ --memtable_whole_key_filtering=1 \ --occ_lock_bucket_count=100 \ --occ_validation_policy=0 \ --ops_per_thread=10 \ --prefixpercent=0 \ --readpercent=0 \ --reopen=0 \ --target_file_size_base=524288 \ --test_batches_snapshots=0 \ --use_optimistic_txn=1 \ --use_txn=1 \ --value_size_mult=32 \ --write_buffer_size=524288 \ --writepercent=100 ; do : ; done ``` Reviewed By: akankshamahajan15 Differential Revision: D50284976 Pulled By: cbi42 fbshipit-source-id: 793e3cee186c8b4f406b29166efd8d9028695206 |
|
Jay Huh | c9d8e6a5bf |
AttributeGroups - MultiGetEntity Implementation (#11925)
Summary: Introducing the notion of AttributeGroup by adding the `MultiGetEntity()` API retrieving `PinnableAttributeGroups`. An "attribute group" refers to a logical grouping of wide-column entities within RocksDB. These attribute groups are implemented using column families. Users can store WideColumns in different CFs for various reasons (e.g. similar access patterns, same types, etc.). This new API `MultiGetEntity()` takes keys and `PinnableAttributeGroups` per key. `PinnableAttributeGroups` is just a list of `PinnableAttributeGroup`s in which we have `ColumnFamilyHandle*`, `Status`, and `PinnableWideColumns`. Let's say a user stored "hot" wide columns in column family "hot_data_cf" and "cold" wide columns in column family "cold_data_cf" and all other columns in "common_cf". Prior to this PR, if the user wants to query for two keys, "key_1" and "key_2" and but only interested in "common_cf" and "hot_data_cf" for "key_1", and "common_cf" and "cold_data_cf" for "key_2", the user would have to construct input like `keys = ["key_1", "key_1", "key_2", "key_2"]`, `column_families = ["common_cf", "hot_data_cf", "common_cf", "cold_data_cf"]` and get the flat list of `PinnableWideColumns` to find the corresponding <key,CF> combo. With the new `MultiGetEntity()` introduced in this PR, users can now query only `["common_cf", "hot_data_cf"]` for `"key_1"`, and only `["common_cf", "cold_data_cf"]` for `"key_2"`. The user will get `PinnableAttributeGroups` for each key, and `PinnableAttributeGroups` gives a list of `PinnableAttributeGroup`s where the user can find column family and corresponding `PinnableWideColumns` and the `Status`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11925 Test Plan: - `DBWideBasicTest::MultiCFMultiGetEntityAsPinnableAttributeGroups` added will enable this new API in the `db_stress` in a separate PR Reviewed By: ltamasi Differential Revision: D50017414 Pulled By: jaykorean fbshipit-source-id: 643611d1273c574bc81b94c6f5aeea24b40c4586 |
|
Changyu Bi | 6e3429b8a6 |
Fix data race in accessing `recovery_in_prog_` (#11950)
Summary: We saw the following TSAN stress test failure: ``` WARNING: ThreadSanitizer: data race (pid=17523) Write of size 1 at 0x7b8c000008b9 by thread T4243 (mutexes: write M0): #0 rocksdb::ErrorHandler::RecoverFromRetryableBGIOError() fbcode/internal_repo_rocksdb/repo/db/error_handler.cc:742 (db_stress+0x95f954) (BuildId: 35795dfb86ddc9c4f20ddf08a491f24d) https://github.com/facebook/rocksdb/issues/1 std:🧵:_State_impl<std:🧵:_Invoker<std::tuple<void (rocksdb::ErrorHandler::*)(), rocksdb::ErrorHandler*>>>::_M_run() fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:74 (db_stress+0x95fc2b) (BuildId: 35795dfb86ddc9c4f20ddf08a491f24d) https://github.com/facebook/rocksdb/issues/2 execute_native_thread_routine /home/engshare/third-party2/libgcc/11.x/src/gcc-11.x/x86_64-facebook-linux/libstdc++-v3/src/c++11/../../../.././libstdc++-v3/src/c++11/thread.cc:82:18 (libstdc++.so.6+0xdf4e4) (BuildId: 452d1cdae868baeeb2fdf1ab140f1c219bf50c6e) Previous read of size 1 at 0x7b8c000008b9 by thread T22: #0 rocksdb::DBImpl::SyncClosedLogs(rocksdb::JobContext*, rocksdb::VersionEdit*) fbcode/internal_repo_rocksdb/repo/db/error_handler.h:76 (db_stress+0x84f69c) (BuildId: 35795dfb86ddc9c4f20ddf08a491f24d) ``` This is due to a data race in accessing `recovery_in_prog_`. This PR fixes it by accessing `recovery_in_prog_` under db mutex before calling `SyncClosedLogs()`. I think the original PR https://github.com/facebook/rocksdb/pull/10489 intended to clear the error if it's a recovery flush. So ideally we can also just check flush reason. I plan to keep a safer change in this PR and make that change in the future if needed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11950 Test Plan: check future TSAN stress test results. Reviewed By: anand1976 Differential Revision: D50242255 Pulled By: cbi42 fbshipit-source-id: 0d487948ef9546b038a34460f3bb037f6e5bfc58 |
|
Changyu Bi | 648fe25bc0 |
Always clear files marked for compaction in `ComputeCompactionScore()` (#11946)
Summary: We were seeing the following stress test failures: ```LevelCompactionBuilder::PickFileToCompact(const rocksdb::autovector<std::pair<int, rocksdb::FileMetaData*> >&, bool): Assertion `!level_file.second->being_compacted' failed``` This can happen when we are picking a file to be compacted from some files marked for compaction, but that file is already being_compacted. We prevent this by always calling `ComputeCompactionScore()` after we pick a compaction and mark some files as being_compacted. However, if SetOptions() is called to disable marking certain files to be compacted, say `enable_blob_garbage_collection`, we currently just skip the relevant logic in `ComputeCompactionScore()` without clearing the existing files already marked for compaction. This PR fixes this issue by already clearing these files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11946 Test Plan: existing tests. Reviewed By: akankshamahajan15 Differential Revision: D50232608 Pulled By: cbi42 fbshipit-source-id: 11e4fb5e9d48b0f946ad33b18f7c005f0161f496 |
|
Peter Dillinger | d010b02e86 |
Fix race in options taking effect (#11929)
Summary: In follow-up to https://github.com/facebook/rocksdb/issues/11922, fix a race in functions like CreateColumnFamily and SetDBOptions where the DB reports one option setting but a different one is left in effect. To fix, we can add an extra mutex around these rare operations. We don't want to hold the DB mutex during I/O or other slow things because of the many purposes it serves, but a mutex more limited to these cases should be fine. I believe this would fix a write-write race in https://github.com/facebook/rocksdb/issues/10079 but not the read-write race. Intended follow-up to this: * Should be able to remove write thread synchronization from DBImpl::WriteOptionsFile Pull Request resolved: https://github.com/facebook/rocksdb/pull/11929 Test Plan: Added two mini-stress style regression tests that fail with >1% probability before this change: DBOptionsTest::SetStatsDumpPeriodSecRace ColumnFamilyTest::CreateAndDropPeriodicRace I haven't reproduced such an inconsistency between in-memory options and on disk latest options, but this change at least improves safety and adds a test anyway: DBOptionsTest::SetStatsDumpPeriodSecRace Reviewed By: ajkr Differential Revision: D50024506 Pulled By: pdillinger fbshipit-source-id: 1e99a9ed4d96fdcf3ac5061ec6b3cee78aecdda4 |
|
Andrew Kryczka | 8a9cfd5292 |
Make stopped writes block on recovery (#11879)
Summary: Relaxed the constraints for blocking when writes are stopped. When a recovery is already being attempted, we might as well let `!no_slowdown` writes wait on it in case it succeeds. This makes the user-visible behavior consistent across recovery flush and non-recovery flush. This enables `db_stress` to inject retryable (soft) flush read errors without having to handle user write failures. I changed `db_stress` a bit to permit injected errors in much more foreground operations as more admin operations (like `GetLiveFiles()`) can fail on a retryable error during flush. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11879 Reviewed By: anand1976 Differential Revision: D49571196 Pulled By: ajkr fbshipit-source-id: 5d516d6faf20d2c6bfe0594ab4f2706bca6d69b0 |
|
darionyaphet | ee0829ba76 |
fix typo snapshto (#11817)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11817 Reviewed By: jaykorean Differential Revision: D50103497 Pulled By: ltamasi fbshipit-source-id: 77c5cf86ff7eb5021fc91b03225882536163af7b |
|
Peter Dillinger | 1d5bddbc58 |
Bootstrap, pre-populate seqno_to_time_mapping (#11922)
Summary: This change has two primary goals (follow-up to https://github.com/facebook/rocksdb/issues/11917, https://github.com/facebook/rocksdb/issues/11920): * Ensure the DB seqno_to_time_mapping has entries that allow us to put a good time lower bound on any writes that happen after setting up preserve/preclude options (either in a new DB, new CF, SetOptions, etc.) and haven't yet aged out of that time window. This allows us to remove a bunch of work-arounds in tests. * For new DBs using preserve/preclude options, automatically reserve some sequence numbers and pre-map them to cover the time span back to the preserve/preclude cut-off time. In the future, this will allow us to import data from another DB by key, value, and write time by assigning an appropriate seqno in this DB for that write time. Note that the pre-population (historical mappings) does not happen if the original options at DB Open time do not have preserve/preclude, so it is recommended to create initial column families at that time with create_missing_column_families, to take advantage of this (future) feature. (Adding these historical mappings after DB Open would risk non-monotonic seqno_to_time_mapping, which is dubious if not dangerous.) Recommended follow-up: * Solve existing race conditions (not memory safety) where parallel operations like CreateColumnFamily or SetDBOptions could leave the wrong setting in effect. * Make SeqnoToTimeMapping more gracefully handle a possible case in which too many mappings are added for the time range of concern. It seems like there could be cases where data is massively excluded from the cold tier because of entries falling off the front of the mapping list (causing GetProximalSeqnoBeforeTime() to return 0). (More investigation needed.) No release note for the minor bug fix because this is still an experimental feature with limited usage. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11922 Test Plan: tests added / updated Reviewed By: jowlyzhang Differential Revision: D49956563 Pulled By: pdillinger fbshipit-source-id: 92beb918c3a298fae9ca8e509717b1067caa1519 |
|
Peter Dillinger | 141b872bd4 |
Improve efficiency of create_missing_column_families, light refactor (#11920)
Summary: In preparing some seqno_to_time_mapping improvements, I found that some of the wrap-up work for creating column families was unnecessarily repeated in the case of DB::Open with create_missing_column_families. This change fixes that (`CreateColumnFamily()` -> `CreateColumnFamilyImpl()` in `DBImpl::Open()`), motivated by avoiding repeated calls to `RegisterRecordSeqnoTimeWorker()` but with the side benefit of avoiding repeated calls to `WriteOptionsFile()` for each CF. Also in this change: * Add a `Status::UpdateIfOk()` function for combining statuses in a common pattern * Rename `max_time_duration` -> `min_preserve_seconds` (include units as much as possible) * Improved comments in several places Pull Request resolved: https://github.com/facebook/rocksdb/pull/11920 Test Plan: tests added / updated Reviewed By: jaykorean Differential Revision: D49919147 Pulled By: pdillinger fbshipit-source-id: 3d0318c1d070c842c5331da0a5b415caedc104f1 |
|
Jay Huh | 5fbea87859 |
Disallow start_time == end_time in offpeak time and compare at minute level to allow 24hr offpeak (#11911)
Summary: Since allowing 24hr peak by setting start_time = end_time is not so intuitive, we are not going to allow it (e.g. `00:00-00:00` doesn't looks like a value that would cover 24hr.). Instead, we are going to compare at minute level (i.e. dropping the seconds to the nearest minute) so that `00:00-23:59` will cover 24hrs. The entire minute from 23:59:00 23:59:59 will be covered with this change. Minor fixes from previous PR - release build error - fixed random seed in test Pull Request resolved: https://github.com/facebook/rocksdb/pull/11911 Test Plan: `DBOptionsTest::OffPeakTimes` `make -j64 static_lib` to test release build issue that was fixed Reviewed By: pdillinger Differential Revision: D49787795 Pulled By: jaykorean fbshipit-source-id: e8d045b95f54f61d5dd5f1bb473579f8d55c18b3 |
|
Andrew Kryczka | 10fd05e394 |
Give retry flushes their own functions (#11903)
Summary: Recovery triggers flushes for very different scenarios: (1) `FlushReason::kErrorRecoveryRetryFlush`: a flush failed (2) `FlushReason::kErrorRecovery`: a WAL may be corrupted (3) `FlushReason::kCatchUpAfterErrorRecovery`: immutable memtables may have accumulated The old code called called `FlushAllColumnFamilies()` in all cases, which uses manual flush functions: `AtomicFlushMemTables()` and `FlushMemTable()`. Forcing flushing the latest data on all CFs was useful for (2) because it ensures all CFs move past the corrupted WAL. However, those code paths were overkill for (1) and (3), where only already-immutable memtables need to be flushed. There were conditionals to exclude some of the extraneous logic but I found there was still too much happening. For example, both of the manual flush functions enter the write thread. Entering the write thread is inconvenient because then we can't allow stalled writes to wait on a retrying flush to finish. Instead of continuing down the path of adding more conditionals to the manual flush functions, this PR introduces a dedicated function for cases (1) and (3): `RetryFlushesForErrorRecovery()`. Also I cleaned up the manual flush functions to remove existing conditionals for these cases as they're no longer needed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11903 Reviewed By: cbi42 Differential Revision: D49693812 Pulled By: ajkr fbshipit-source-id: 7630ac539b9d6c92052c13a3cdce53256134d990 |
|
Jay Huh | 63ed868840 |
Offpeak in db option (#11893)
Summary: RocksDB's primary function is to facilitate read and write operations. Compactions, while essential for minimizing read amplifications and optimizing storage, can sometimes compete with these primary tasks. Especially during periods of high read/write traffic, it's vital to ensure that primary operations receive priority, avoiding any potential disruptions or slowdowns. Conversely, during off-peak times when traffic is minimal, it's an opportune moment to tackle low-priority tasks like TTL based compactions, optimizing resource usage. In this PR, we are incorporating the concept of off-peak time into RocksDB by introducing `daily_offpeak_time_utc` within the DBOptions. This setting is formatted as "HH:mm-HH:mm" where the first one before "-" is the start time and the second one is the end time, inclusive. It will be later used for resource optimization in subsequent PRs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11893 Test Plan: - New Unit Test Added - `DBOptionsTest::OffPeakTimes` - Existing Unit Test Updated - `OptionsTest`, `OptionsSettableTest` Reviewed By: pdillinger Differential Revision: D49714553 Pulled By: jaykorean fbshipit-source-id: fef51ea7c0fede6431c715bff116ddbb567c8752 |
|
Peter Dillinger | 02443dd93f |
Refactor, clean up, fixes, and more testing for SeqnoToTimeMapping (#11905)
Summary: This change is before a planned DBImpl change to ensure all sufficiently recent sequence numbers since Open are covered by SeqnoToTimeMapping (bug fix with existing test work-arounds). **Intended follow-up** However, I found enough issues with SeqnoToTimeMapping to warrant this PR first, including very small fixes in DB implementation related to API contract of SeqnoToTimeMapping. Functional fixes / changes: * This fixes some mishandling of boundary cases. For example, if the user decides to stop writing to DB, the last written sequence number would perpetually have its write time updated to "now" and would always be ineligible for migration to cold tier. Part of the problem is that the SeqnoToTimeMapping would return a seqno known to have been written before (immediately or otherwise) the requested time, but compaction_job.cc would include that seqno in the preserve/exclude set. That is fixed (in part) by adding one in compaction_job.cc * That problem was worse because a whole range of seqnos could be updated perpetually with new times in SeqnoToTimeMapping::Append (if no writes to DB). That logic was apparently optimized for GetOldestApproximateTime (now GetProximalTimeBeforeSeqno), which is not used in production, to the detriment of GetOldestSequenceNum (now GetProximalSeqnoBeforeTime), which is used in production. (Perhaps plans changed during development?) This is fixed in Append to optimize for accuracy of GetProximalSeqnoBeforeTime. (Unit tests added and updated.) * Related: SeqnoToTimeMapping did not have a clear contract about the relationships between seqnos and times, just the idea of a rough correspondence. Now the class description makes it clear that the write time of each recorded seqno comes before or at the associated time, to support getting best results for GetProximalSeqnoBeforeTime. And this makes it easier to make clear the contract of each API function. * Update `DBImpl::RecordSeqnoToTimeMapping()` to follow this ordering in gathering samples. Some part of these changes has required an expanded test work-around for the problem (see intended follow-up above) that the DB does not immediately ensure recent seqnos are covered by its mapping. These work-arounds will be removed with that planned work. An apparent compaction bug is revealed in PrecludeLastLevelTest::RangeDelsCauseFileEndpointsToOverlap, so that test is disabled. Filed GitHub issue #11909 Cosmetic / code safety things (not exhaustive): * Fix some confusing names. * `seqno_time_mapping` was used inconsistently in places. Now just `seqno_to_time_mapping` to correspond to class name. * Rename confusing `GetOldestSequenceNum` -> `GetProximalSeqnoBeforeTime` and `GetOldestApproximateTime` -> `GetProximalTimeBeforeSeqno`. Part of the motivation is that our times and seqnos here have the same underlying type, so we want to be clear about which is expected where to avoid mixing. * Rename `kUnknownSeqnoTime` to `kUnknownTimeBeforeAll` because the value is a bad choice for unknown if we ever add ProximalAfterBlah functions. * Arithmetic on SeqnoTimePair doesn't make sense except for delta encoding, so use better names / APIs with that in mind. * (OMG) Don't allow direct comparison between SeqnoTimePair and SequenceNumber. (There is no checking that it isn't compared against time by accident.) * A field name essentially matching the containing class name is a confusing pattern (`seqno_time_mapping_`). * Wrap calls to confusing (but useful) upper_bound and lower_bound functions to have clearer names and more code reuse. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11905 Test Plan: GetOldestSequenceNum (now GetProximalSeqnoBeforeTime) and TruncateOldEntries were lacking unit tests, despite both being used in production (experimental feature). Added those and expanded others. Reviewed By: jowlyzhang Differential Revision: D49755592 Pulled By: pdillinger fbshipit-source-id: f72a3baac74d24b963c77e538bba89a7fc8dce51 |
|
Changyu Bi | 1c871a4d86 |
Only flush after recovery for retryable IOError (#11880)
Summary: https://github.com/facebook/rocksdb/issues/11872 causes a unit test to start failing with the error message below. The cause is that the additional call to `FlushAllColumnFamilies()` in `DBImpl::ResumeImpl()` can run while DB is closing. More detailed explanation: there are two places where we call `ResumeImpl()`: 1. in `ErrorHandler::RecoverFromBGError`, for manual resume or recovery from errors like OutOfSpace through sst file manager, and 2. in `Errorhandler::RecoverFromRetryableBGIOError`, for error recovery from errors like flush failure due to retryable IOError. This is tracked by `ErrorHandler::recovery_thread_`. Here is how DB close waits for error recovery: |
|
Changyu Bi | 0086809601 |
Fix a bug with atomic_flush that causes DB to stuck after a flush failure (#11872)
Summary:
With atomic_flush=true, a flush job with younger memtables wait for older memtables to be installed before install its memtables. If the flush for older memtables failed, auto-recovery starts a resume thread which can becomes stuck waiting for all background work to finish (including the flush for younger memtables). If a non-recovery flush starts now and tries to flush, it can make the situation worse since it will fail due to background error but never rollback its memtable:
|
|
Levi Tamasi | 12d9386a4f |
Return a special OK status when the number of merge operands exceeds a threshold (#11870)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11870 Having a large number of merge operands applied at query time can have a significant effect on performance; therefore, applications might want limit the number of deltas for any given key. However, there is currently no way to establish the number of operands for certain types of queries. The ticker `READ_NUM_MERGE_OPERANDS` only provides aggregate (not per-read) information. The `PerfContext` counters `internal_merge_count` and `internal_merge_point_lookup_count` can be used to get this information on a per-query basis for iterators and single point lookups; however, there is no per-key breakdown for `MultiGet` type APIs. The patch addresses this issue by introducing a special kind of OK status which signals that an application-defined threshold on the number of merge operands has been exceeded for a given key. The threshold can be specified on a per-query basis using a new field in `ReadOptions`. Reviewed By: jaykorean Differential Revision: D49522786 fbshipit-source-id: 4265b3848d1be5ff313a3e8fb604ddf56411dd2c |
|
Changyu Bi | b927ba5936 |
Rollback other pending memtable flushes when a flush fails (#11865)
Summary:
when atomic_flush=false, there are certain cases where we try to install memtable results with already deleted SST files. This can happen when the following sequence events happen:
```
Start Flush0 for memtable M0 to SST0
Start Flush1 for memtable M1 to SST1
Flush 1 returns OK, but don't install to MANIFEST and let whoever flushes M0 to take care of it
Flush0 finishes with a retryable IOError, it rollbacks M0, (incorrectly) does not rollback M1, and deletes SST0 and SST1
Starts Flush2 for M0, it does not pick up M1 since it thought M1 is flushed
Flush2 writes SST2 and finishes OK, tries to install SST2 and SST1
Error opening SST1 since it's already deleted with an error message like the following:
IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_3577_4230653031040984171/000011.sst: No such file or directory
```
This happens since:
1. We currently only rollback the memtables that we are flushing in a flush job when atomic_flush=false.
2. Pending output SSTs from previous flushes are deleted since a pending file number is released whenever a flush job is finished no matter of flush status:
|
|
Peter (Stig) Edwards | bf488c3052 |
Use *next_sequence -1 here (#11861)
Summary:
To fix off-by-one error: Transaction could not check for conflicts for operation at SequenceNumber 500000 as the MemTable only contains changes newer than SequenceNumber 500001.
Fixes https://github.com/facebook/rocksdb/issues/11822
I think introduced in
|
|
Hui Xiao | 089070cb36 |
Expose more info about input files in `CompactionFilter::Context` (#11857)
Summary: **Context:** As requested, lowest level as well as a map from input file to its table properties among all input files used in table creation (if any) are exposed in `CompactionFilter::Context`. **Summary:** This PR contains two commits: (1) [Refactory]( |
|
Changyu Bi | cc254efea6 |
Release compaction files in manifest write callback (#11764)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/10257 (also see [here](https://github.com/facebook/rocksdb/pull/10355#issuecomment-1684308556)) by releasing compaction files earlier when writing to manifest in LogAndApply(). This is done by passing in a [callback](
|
|
dengyan | 0dac75d542 |
Fix a bug in MultiGet when skip_memtable is true (#11700)
Summary: When skip_memtable is true in MultiGetImpl, The lookup_current is always false, Causes data to be unable to be queried in super_version->current。 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11700 Reviewed By: anand1976 Differential Revision: D49342877 Pulled By: jowlyzhang fbshipit-source-id: 270a36d049b4cb7fd151a1fa3080300310111271 |
|
Jay Huh | cff6490bc4 |
Add IOActivity.kMultiGetEntity (#11842)
Summary: - As a follow up from https://github.com/facebook/rocksdb/issues/11799, adding `Env::IOActivity::kMultiGetEntity` support to `DBImpl::MultiGetEntity()`. ## Minor Refactor - Because both `DBImpl::MultiGet()` and `DBImpl::MultiGetEntity()` call `DBImpl::MultiGetCommon()` which later calls `DBImpl::MultiGetWithCallback()` where we check `Env::IOActivity::kMultiGet`, minor refactor was needed so that we don't check `Env::IOActivity::kMultiGet` for `DBImpl::MultiGetEntity()`. - I still see more areas for refactoring to avoid duplicate code of checking IOActivity and setting it when Unknown, but this will be addressed separately. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11842 Test Plan: - Added the `ThreadStatus::OperationType::OP_MULTIGETENTITY` in `db_stress` to verify the pass-down IOActivity in a thread aligns with the actual activity the thread is doing. ``` python3 tools/db_crashtest.py blackbox --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --interval=10 python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --interval=10 python3 tools/db_crashtest.py blackbox --cf_consistency --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --interval=10 ``` Reviewed By: ltamasi Differential Revision: D49329575 Pulled By: jaykorean fbshipit-source-id: 05198f1d3f92e6be3d42a3d184bacb3ab2ce6923 |
|
leipeng | 68ce5d84f6 |
Add new Iterator API Refresh(const snapshot*) (#10594)
Summary: This PR resolves https://github.com/facebook/rocksdb/issues/10487 & https://github.com/facebook/rocksdb/issues/10536, user code needs to call Refresh() periodically. The main code change is to support range deletions. A range tombstone iterator uses a sequence number as upper bound to decide which range tombstones are effective. During Iterator refresh, this sequence number upper bound needs to be updated for all range tombstone iterators under DBIter and LevelIterator. LevelIterator may create new table iterators and range tombstone iterator during scanning, so it needs to be aware of iterator refresh. The code path that propagates this change is `db_iter_->set_sequence(read_seq) -> MergingIterator::SetRangeDelReadSeqno() -> TruncatedRangeDelIterator::SetRangeDelReadSeqno() and LevelIterator::SetRangeDelReadSeqno()`. This change also fixes an issue where range tombstone iterators created by LevelIterator may access ReadOptions::snapshot, even though we do not explicitly require users to keep a snapshot alive after creating an Iterator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10594 Test Plan: * New unit tests. * Add Iterator::Refresh(snapshot) to stress test. Note that this change only adds tests for refreshing to the same snapshot since this is the main target use case. TODO in a following PR: * Stress test Iterator::Refresh() to different snapshots or no snapshot. Reviewed By: ajkr Differential Revision: D48456896 Pulled By: cbi42 fbshipit-source-id: 2e642c04e91235cc9542ef4cd37b3c20823bd779 |
|
Hui Xiao | ed913513bd |
Fix a bug of rocksdb.file.read.verify.file.checksums.micros not being populated (#11836)
Summary: **Context/Summary:** `rocksdb.file.read.verify.file.checksums.micros ` was added in https://github.com/facebook/rocksdb/pull/11444 but the related path was not populated with statistics and clock object correctly so the actual statistics collection didn't happen. This PR fixed it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11836 Test Plan: Setup: ``` ./db_bench --benchmarks="fillrandom" --file_checksum=1 --num=100 --db=/dev/shm/rocksdb ``` Run: ``` ./db_bench --use_existing_db=1 --benchmarks="verifyfilechecksums" --file_checksum=1 --num=100 --db=/dev/shm/rocksdb --statistics=1 --stats_level=4 ``` Post-PR ``` rocksdb.file.read.verify.file.checksums.micros P50 : 9.000000 P95 : 9.000000 P99 : 9.000000 P100 : 9.000000 COUNT : 1 SUM : 9 ``` Pre-PR ``` rocksdb.file.read.verify.file.checksums.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 ``` Reviewed By: ajkr Differential Revision: D49293378 Pulled By: hx235 fbshipit-source-id: 1acd8b828c28e088d0c5d63897f53cd180b82f42 |
|
Yu Zhang | e1fd348b92 |
Fix a bug in multiget for cleaning up SuperVersion (#11830)
Summary: When `MultiGet` acquires `SuperVersion` via locking the db mutex and get the current `ColumnFamilyData::super_version_`, its corresponding cleanup logic is not correctly done. It's currently doing this: `MultiGetColumnFamilyData::cfd->GetSuperVersion().Unref()` This operates on the most recent `SuperVersion` without locking db mutex , which is not thread safe by itself. And this unref operation is intended for the originally acquired `SuperVersion` instead of the current one. Because a race condition could happen where a new `SuperVersion` is installed in between this `MultiGet`'s ref and unref. When this race condition does happen, it's not sufficient to just unref the `SuperVersion`, `DBImpl::CleanupSuperVersion` should be called instead to properly clean up the `SuperVersion` had this `MultiGet` call be its last reference holder. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11830 Test Plan: `make all check` Added a unit test that would originally fail Reviewed By: ltamasi Differential Revision: D49287715 Pulled By: jowlyzhang fbshipit-source-id: 8353636ee11b2e90d85c677a96a92360072644b0 |
|
Jay Huh | f2b623bcc1 |
GetEntity Support for ReadOnlyDB and SecondaryDB (#11799)
Summary: `GetEntity` API support for ReadOnly DB and Secondary DB. - Introduced `GetImpl()` with `GetImplOptions` in `db_impl_readonly` and refactored current `Get()` logic into `GetImpl()` so that look up logic can be reused for `GetEntity()` (Following the same pattern as `DBImpl::Get()` and `DBImpl::GetEntity()`) - Introduced `GetImpl()` with `GetImplOptions` in `db_impl_secondary` and refactored current `GetImpl()` logic. This is to make `DBImplSecondary::Get/GetEntity` consistent with `DBImpl::Get/GetEntity` and `DBImplReadOnly::Get/GetEntity` - `GetImpl()` in `db_impl` is now virtual. both `db_impl_readonly` and `db_impl_secondary`'s `Get()` override are no longer needed since all three dbs now have the same `Get()` which calls `GetImpl()` internally. - `GetImpl()` in `DBImplReadOnly` and `DBImplSecondary` now pass in `columns` instead of `nullptr` in lookup functions like `memtable->get()` - Introduced `GetEntity()` API in `DBImplReadOnly` and `DBImplSecondary` which simply calls `GetImpl()` with `columns` set in `GetImplOptions`. - Introduced `Env::IOActivity::kGetEntity` and set read_options.io_activity to `Env::IOActivity::kGetEntity` for `GetEntity()` operations (in db_impl) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11799 Test Plan: **Unit Tests** - Added verification in `DBWideBasicTest::PutEntity` by Reopening DB as ReadOnly with the same setup. - Added verification in `DBSecondaryTest::ReopenAsSecondary` by calling `PutEntity()` and `GetEntity()` on top of existing `Put()` and `Get()` - `make -j64 check` **Crash Tests** - `python3 tools/db_crashtest.py blackbox --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --inter val=10` - `python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 ` - `python3 tools/db_crashtest.py blackbox --cf_consistency --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --inter val=10` Reviewed By: ltamasi Differential Revision: D49037040 Pulled By: jaykorean fbshipit-source-id: a0648253ded6e91af7953de364ed3c6bf163626b |