mirror of https://github.com/facebook/rocksdb.git
119 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
Changyu Bi | f97e33454f |
Fix a bug with auto recovery on WAL write error (#12995)
Summary:
A recent crash test failure shows that auto recovery from WAL write failure can cause CFs to be inconsistent. A unit test repro in P1569398553. The following is an example sequence of events:
```
0. manual_wal_flush is true. There are multiple CFs in a DB.
1. Submit a write batch with updates to multiple CF
2. A FlushWAL or a memtable swtich that will try to write the buffered WAL data. Fail this write so that buffered WAL data is dropped:
|
|
Peter Dillinger | d33d25f903 |
Disable WAL recycling in crash test; reproducer for recovery data loss (#12918)
Summary: I was investigating a crash test failure with "Corruption: SST file is ahead of WALs" which I haven't reproduced, but I did reproduce a data loss issue on recovery which I suspect could be the same root problem. The problem is already somewhat known (see https://github.com/facebook/rocksdb/issues/12403 and https://github.com/facebook/rocksdb/issues/12639) where it's only safe to recovery multiple recycled WAL files with trailing old data if the sequence numbers between them are adjacent (to ensure we didn't lose anything in the corrupt/obsolete WAL tail). However, aside from disableWAL=true, there are features like external file ingestion that can increment the sequence numbers without writing to the WAL. It is simply unsustainable to worry about this kind of feature interaction limiting where we can consume sequence numbers. It is very hard to test and audit as well. For reliable crash recovery of recycled WALs, we need a better way of detecting that we didn't drop data from one WAL to the next. Until then, let's disable WAL recycling in the crash test, to help stabilize it. Ideas for follow-up to fix the underlying problem: (a) With recycling, we could always sync the WAL before opening the next one. HOWEVER, this potentially very large sync could cause a big hiccup in writes (vs. O(1) sized manifest sync). (a1) The WAL sync could ensure it is truncated to size, or (a2) By requiring track_and_verify_wals_in_manifest, we could assume that the last synced size in the manifest is the final usable size of the WAL. (It might also be worth avoiding truncating recycled WALs.) (b) Add a new mechanism to record and verify the final size of a WAL without requiring a sync. (b1) By requiring track_and_verify_wals_in_manifest, this could be new WAL metadata recorded in the manifest (at the time of switching WALs). Note that new fields of WalMetadata are not forward-compatible, but a new kind of manifest record (next to WalAddition, WalDeletion; e.g. WalCompletion) is IIRC forward-compatible. (b2) A new kind of WAL header entry (not forward compatible, unfortunately) could record the final size of the previous WAL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12918 Test Plan: Added disabled reproducer for non-linear data loss on recovery Reviewed By: hx235 Differential Revision: D60917527 Pulled By: pdillinger fbshipit-source-id: 3663d79aec81851f5cf41669f84a712bb4563fd7 |
|
Yu Zhang | 05c9c9aeed |
Fix race between test and recovery flush switch memtable (#12884)
Summary: As titled, to fix this type of data race: https://github.com/facebook/rocksdb/actions/runs/10066814221/job/27829003372?pr=12882 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12884 Test Plan: COMPILE_WITH_TSAN=1 make -j10 db_wal_test ./db_wal_test --gtest_filter=DBWALTest.RecoveryFlushSwitchWALOnEmptyMemtable --gtest_repeat=100 Reviewed By: anand1976 Differential Revision: D60197834 Pulled By: jowlyzhang fbshipit-source-id: 89524cdb4d17a1b647295bcccf5eb2d7d425bc6a |
|
Hui Xiao | 349b1ec08f |
Fix duplicate WAL entries caused by write after error recovery (#12873)
Summary: **Context/Summary:** We recently discovered a case where write of the same key right after error recovery of a previous failed write of the same key finishes causes two same WAL entries, violating our assertion. This is because we don't advance seqno on failed write and reuse the same WAL containing the failed write for the new write if the memtable at the time is empty. This PR reuses the flush path for an empty memtable to switch WAL and update min WAL to keep in error recovery flush as well as updates the INFO log message for clarity. ``` 2024/07/17-15:01:32.271789 327757 (Original Log Time 2024/07/17-15:01:25.942234) [/flush_job.cc:1017] [default] [JOB 2] Level-0 flush table https://github.com/facebook/rocksdb/issues/9: 0 bytes OK It's an empty SST file from a successful flush so won't be kept in the DB 2024/07/17-15:01:32.271798 327757 (Original Log Time 2024/07/17-15:01:32.269954) [/memtable_list.cc:560] [default] Level-0 commit flush result of table https://github.com/facebook/rocksdb/issues/9 started 2024/07/17-15:01:32.271802 327757 (Original Log Time 2024/07/17-15:01:32.271217) [/memtable_list.cc:760] [default] Level-0 commit flush result of table https://github.com/facebook/rocksdb/issues/9: memtable https://github.com/facebook/rocksdb/issues/1 done ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12873 Test Plan: New UT that failed before this PR with following assertion failure (i.e, duplicate WAL entries) and passes after ``` db_wal_test: db/write_batch.cc:2254: rocksdb::Status rocksdb::{anonymous}::MemTableInserter::PutCFImpl(uint32_t, const rocksdb::Slice&, const rocksdb::Slice&, rocksdb::ValueType, RebuildTxnOp, const ProtectionInfoKVOS64*) [with RebuildTxnOp = rocksdb::{anonymous}::MemTableInserter::PutCF(uint32_t, const rocksdb::Slice&, const rocksdb::Slice&)::<lambda(rocksdb::WriteBatch*, uint32_t, const rocksdb::Slice&, const rocksdb::Slice&)>; uint32_t = unsigned int; rocksdb::ProtectionInfoKVOS64 = rocksdb::ProtectionInfoKVOS<long unsigned int>]: Assertion `seq_per_batch_' failed. ``` Reviewed By: anand1976 Differential Revision: D59884468 Pulled By: hx235 fbshipit-source-id: 5d854b719092552c69727a979f269fb7f6c39756 |
|
Peter Dillinger | 3ee4d5a11a |
Fix possible crash in failure to sync some WALs (#12789)
Summary: I believe this was possible with recyclable logs before recent work like https://github.com/facebook/rocksdb/issues/12734, but this cleans up a couple of possible crashes revealed by the crash test. A WAL with a nullptr file writer (already closed) can persist in `logs_` if a later WAL fails to sync. In case of any WAL sync failures, we don't record WAL syncs to the manifest. Thus, even if a WAL is fully synced and closed, we might need to keep it on the `logs_` list so that we know to record its sync to the manifest if there should be a successful sync next time. (However, I believe that's future-looking because currently any failure in WAL sync is considered non-recoverable.) I don't believe this was likely enough before recent changes to warrant a release note (if it was possible). Pull Request resolved: https://github.com/facebook/rocksdb/pull/12789 Test Plan: A unit test that would reveal the crashes, now fixed Reviewed By: cbi42 Differential Revision: D58874154 Pulled By: pdillinger fbshipit-source-id: bc69407cd9cbcd080af9585d502d4e33dafc3d29 |
|
Peter Dillinger | d2ef70872f |
Rename, deprecate `LogFile` and `VectorLogPtr` (#12695)
Summary: These names are confusing with `Logger` etc. so moving to `WalFile` etc. Other small, related name refactorings. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12695 Test Plan: Left most unit tests using old names as an API compatibility test. Non-test code compiles with deprecated names removed. No functional changes. Reviewed By: ajkr Differential Revision: D57747458 Pulled By: pdillinger fbshipit-source-id: 7b77596b9c20d865d43b9dc66c30c8bd2b3b424f |
|
Andrew Kryczka | 8897bf2d04 |
Drop unsynced data in `TestFSWritableFile::Close()` (#12528)
Summary: Our `FileSystem` for simulating unsynced data loss should not sync during `Close()` because it masks bugs where we forgot to sync as long as we closed the file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12528 Test Plan: Peeled back https://github.com/facebook/rocksdb/issues/10560 fix and verified it is caught much faster now (few seconds vs. ???) with command like ``` $ TEST_TMPDIR=./ python3 tools/db_crashtest.py blackbox --disable_wal=0 --max_key=1000 --write_buffer_size=131072 --max_bytes_for_level_base=524288 --target_file_size_base=131072 --interval=3 --sync_fault_injection=1 --enable_blob_files=0 --manual_wal_flush_one_in=10 --sync_wal_one_in=0 --get_live_files_one_in=0 --get_sorted_wal_files_one_in=0 --backup_one_in=0 --checkpoint_one_in=0 --write_fault_one_in=0 --read_fault_one_in=0 --open_write_fault_one_in=0 --compact_range_one_in=0 --compact_files_one_in=0 --open_read_fault_one_in=0 --get_property_one_in=0 --writepercent=100 -readpercent=0 -prefixpercent=0 -delpercent=0 -delrangepercent=0 -iterpercent=0 ``` Reviewed By: anand1976 Differential Revision: D56033250 Pulled By: ajkr fbshipit-source-id: 6bbf480d79a06c46f08f6214010937f6654af5ca |
|
anand76 | 63a105a481 |
Enable recycle_log_file_num option for point in time recovery (#12403)
Summary: This option was previously disabled due to a bug in the recovery logic. The recovery code in `DBImpl::RecoverLogFiles` couldn't tell if an EoF reported by the log reader was really an EoF or a possible corruption that made a record look like an old log record. To fix this, the log reader now explicitly reports when it encounters what looks like an old record. The recovery code treats it as a possible corruption, and uses the next sequence number in the WAL to determine if it should continue replaying the WAL. This PR also fixes a couple of bugs that log file recycling exposed in the backup and checkpoint path. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12403 Test Plan: 1. Add new unit tests to verify behavior upon corruption 2. Re-enable disabled tests for verifying recycling behavior Reviewed By: ajkr Differential Revision: D54544824 Pulled By: anand1976 fbshipit-source-id: 12f5ce39bd6bc0d63b0bc6432dc4db510e0e802a |
|
Yu Zhang | 4bea83aa44 |
Remove the force mode for EnableFileDeletions API (#12337)
Summary: There is no strong reason for user to need this mode while on the other hand, its behavior is destructive. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12337 Reviewed By: hx235 Differential Revision: D53630393 Pulled By: jowlyzhang fbshipit-source-id: ce94b537258102cd98f89aa4090025663664dd78 |
|
Peter Dillinger | 54cb9c77d9 |
Prefer static_cast in place of most reinterpret_cast (#12308)
Summary: The following are risks associated with pointer-to-pointer reinterpret_cast: * Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do. * Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally. I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement: * Going through `void*` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have `struct Derived: public Base1, public Base2`. If we have `Derived*` -> `void*` -> `Base2*` -> `Derived*` through reinterpret casts, this could plausibly work (though technical UB) assuming the `Base2*` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic. * Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived*` -> `Base2*` -> `Derived*` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance. With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain. A couple of related interventions included here: * Previously Cache::Handle was not actually derived from in the implementations and just used as a `void*` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle. * Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse). Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work. I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308 Test Plan: existing tests, CI Reviewed By: ltamasi Differential Revision: D53204947 Pulled By: pdillinger fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2 |
|
Andrew Kryczka | 5a9ecf6614 |
Automated modernization (#12210)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12210 Reviewed By: hx235 Differential Revision: D52559771 Pulled By: ajkr fbshipit-source-id: 1ccdd3a0180cc02bc0441f20b0e4a1db50841b03 |
|
Hui Xiao | 06e593376c |
Group SST write in flush, compaction and db open with new stats (#11910)
Summary: ## Context/Summary Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity. For that, this PR does the following: - Tag different write IOs by passing down and converting WriteOptions to IOOptions - Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS Some related code refactory to make implementation cleaner: - Blob stats - Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info. - Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write. - Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority - Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification - Build table - TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables - Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder. This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more - Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority ## Test ### db bench Flush ``` ./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100 rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 ``` compaction, db oopen ``` Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279 rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213 rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66 ``` blob stats - just to make sure they aren't broken by this PR ``` Integrated Blob DB Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600 rocksdb.blobdb.blob.file.synced COUNT : 1 rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same) ``` ``` Stacked Blob DB Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876 rocksdb.blobdb.blob.file.synced COUNT : 8 rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same) ``` ### Rehearsal CI stress test Trigger 3 full runs of all our CI stress tests ### Performance Flush ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark; enable_statistics = true Pre-pr: avg 507515519.3 ns 497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908, Post-pr: avg 511971266.5 ns, regressed 0.88% 502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408, ``` Compaction ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 495346098.30 ns 492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846 Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97% 502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007 ``` Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats) ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 3848.10 ns 3814,3838,3839,3848,3854,3854,3854,3860,3860,3860 Post-pr: avg 3874.20 ns, regressed 0.68% 3863,3867,3871,3874,3875,3877,3877,3877,3880,3881 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910 Reviewed By: ajkr Differential Revision: D49788060 Pulled By: hx235 fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff |
|
anand76 | a036525809 |
Lightweight verification of MANIFEST file after close on shutdown (#12174)
Summary: Do a size verification on the MANIFEST file during DB shutdown, after closing the file. If the verification fails, write a new MANIFEST file. In the future, we can do a more thorough verification if we want to. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12174 Test Plan: Unit test, and some manual verification Reviewed By: ajkr Differential Revision: D52451184 Pulled By: anand1976 fbshipit-source-id: fc3bc170e22f6c9a9c482ee5ff592abab889df83 |
|
Yu Zhang | 509947ce2c |
Quarantine files in a limbo state after a manifest error (#12030)
Summary: Part of the procedures to handle manifest IO error is to disable file deletion in case some files in limbo state get deleted prematurely. This is not ideal because: 1) not all the VersionEdits whose commit encounter such an error contain updates for files, disabling file deletion sometimes are not necessary. 2) `EnableFileDeletion` has a force mode that could make other threads accidentally disrupt this procedure in recovery. 3) Disabling file deletion as a whole is also not as efficient as more precisely tracking impacted files from being prematurely deleted. This PR replaces this mechanism with tracking such files and quarantine them from being deleted in `ErrorHandler`. These are the types of files being actively tracked in quarantine in this PR: 1) new table files and blob files from a background job 2) old manifest file whose immediately following new manifest file's CURRENT file creation gets into unclear state. Current handling is not sufficient to make sure the old manifest file is kept in case it's needed. Note that WAL logs are not part of the quarantine because `min_log_number_to_keep` is a safe mechanism and it's only updated after successful manifest commits so it can prevent this premature deletion issue from happening. We track these files' file numbers because they share the same file number space. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12030 Test Plan: Modified existing unit tests Reviewed By: ajkr Differential Revision: D51036774 Pulled By: jowlyzhang fbshipit-source-id: 84ef26271fbbc888ef70da5c40fe843bd7038716 |
|
Yu Zhang | c6c683a0ca |
Remove the default force behavior for `EnableFileDeletion` API (#12001)
Summary: Disabling file deletion can be critical for operations like making a backup, recovery from manifest IO error (for now). Ideally as long as there is one caller requesting file deletion disabled, it should be kept disabled until all callers agree to re-enable it. So this PR removes the default forcing behavior for the `EnableFileDeletion` API, and users need to explicitly pass the argument if they insisted on doing so knowing the consequence of what can be potentially disrupted. This PR removes the API's default argument value so it will cause breakage for all users that are relying on the default value, regardless of whether the forcing behavior is critical for them. When fixing this breakage, it's good to check if the forcing behavior is indeed needed and potential disruption is OK. This PR also makes unit test that do not need force behavior to do a regular enable file deletion. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12001 Reviewed By: ajkr Differential Revision: D51214683 Pulled By: jowlyzhang fbshipit-source-id: ca7b1ebf15c09eed00f954da2f75c00d2c6a97e4 |
|
Jay Huh | e230e4d248 |
Make OffpeakTimeInfo available in VersionSet (#12018)
Summary: As mentioned in https://github.com/facebook/rocksdb/issues/11893, we are going to use the offpeak time information to pre-process TTL-based compactions. To do so, we need to access `daily_offpeak_time_utc` in `VersionStorageInfo::ComputeCompactionScore()` where we pick the files to compact. This PR is to make the offpeak time information available at the time of compaction-scoring. We are not changing any compaction scoring logic just yet. Will follow up in a separate PR. There were two ways to achieve what we want. 1. Make `MutableDBOptions` available in `ColumnFamilyData` and `ComputeCompactionScore()` take `MutableDBOptions` along with `ImmutableOptions` and `MutableCFOptions`. 2. Make `daily_offpeak_time_utc` and `IsNowOffpeak()` available in `VersionStorageInfo`. We chose the latter as it involves smaller changes. This change includes the following - Introduction of `OffpeakTimeInfo` and `IsNowOffpeak()` has been moved from `MutableDBOptions` - `OffpeakTimeInfo` added to `VersionSet` and it can be set during construction and by `ChangeOffpeakTimeInfo()` - During `SetDBOptions()`, if offpeak time info needs to change, it calls `MaybeScheduleFlushOrCompaction()` to re-compute compaction scores and process compactions as needed Pull Request resolved: https://github.com/facebook/rocksdb/pull/12018 Test Plan: - `DBOptionsTest::OffpeakTimes` changed to include checks for `MaybeScheduleFlushOrCompaction()` calls and `VersionSet`'s OffpeakTimeInfo value change during `SetDBOptions()`. - `VersionSetTest::OffpeakTimeInfoTest` added to test `ChangeOffpeakTimeInfo()`. `IsNowOffpeak()` tests moved from `DBOptionsTest::OffpeakTimes` Reviewed By: pdillinger Differential Revision: D50723881 Pulled By: jaykorean fbshipit-source-id: 3cff0291936f3729c0e9c7750834b9378fb435f6 |
|
Changyu Bi | d5bc30befa |
Enforce status checking after Valid() returns false for IteratorWrapper (#11975)
Summary: ... when compiled with ASSERT_STATUS_CHECKED = 1. The main change is in iterator_wrapper.h. The remaining changes are just fixing existing unit tests. Adding this check to IteratorWrapper gives a good coverage as the class is used in many places, including child iterators under merging iterator, merging iterator under DB iter, file_iter under level iterator, etc. This change can catch the bug fixed in https://github.com/facebook/rocksdb/issues/11782. Future follow up: enable `ASSERT_STATUS_CHECKED=1` for stress test and for DEBUG_LEVEL=0. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11975 Test Plan: * `ASSERT_STATUS_CHECKED=1 DEBUG_LEVEL=2 make -j32 J=32 check` * I tried to run stress test with `ASSERT_STATUS_CHECKED=1`, but there are a lot of existing stress code that ignore status checking, and fail without the change in this PR. So defer that to a follow up task. Reviewed By: ajkr Differential Revision: D50383790 Pulled By: cbi42 fbshipit-source-id: 1a28ce0f5fdf1890f93400b26b3b1b3a287624ce |
|
Yu Zhang | 4234a6a301 |
Increase full_history_ts_low when flush happens during recovery (#11774)
Summary: This PR adds a missing piece for the UDT in memtable only feature, which is to automatically increase `full_history_ts_low` when flush happens during recovery. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11774 Test Plan: Added unit test make all check Reviewed By: ltamasi Differential Revision: D48799109 Pulled By: jowlyzhang fbshipit-source-id: fd681ed66d9d40904ca2c919b2618eb692686035 |
|
Yu Zhang | c24ef26ca7 |
Support switching on / off UDT together with in-Memtable-only feature (#11623)
Summary: Add support to allow enabling / disabling user-defined timestamps feature for an existing column family in combination with the in-Memtable only feature. To do this, this PR includes: 1) Log the `persist_user_defined_timestamps` option per column family in Manifest to facilitate detecting an attempt to enable / disable UDT. This entry is enforced to be logged in the same VersionEdit as the user comparator name entry. 2) User-defined timestamps related options are validated when re-opening a column family, including user comparator name and the `persist_user_defined_timestamps` flag. These type of settings and settings change are considered valid: a) no user comparator change and no effective `persist_user_defined_timestamp` flag change. b) switch user comparator to enable UDT provided the immediately effective `persist_user_defined_timestamps` flag is false. c) switch user comparator to disable UDT provided that the before-change `persist_user_defined_timestamps` is already false. 3) when an attempt to enable UDT is detected, we mark all its existing SST files as "having no UDT" by marking its `FileMetaData.user_defined_timestamps_persisted` flag to false and handle their file boundaries `FileMetaData.smallest`, `FileMetaData.largest` by padding a min timestamp. 4) while enabling / disabling UDT feature, timestamp size inconsistency in existing WAL logs are handled to make it compatible with the running user comparator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11623 Test Plan: ``` make all check ./db_with_timestamp_basic_test --gtest-filter="*EnableDisableUDT*" ./db_wal_test --gtest_filter="*EnableDisableUDT*" ``` Reviewed By: ltamasi Differential Revision: D47636862 Pulled By: jowlyzhang fbshipit-source-id: dcd19f67292da3c3cc9584c09ad00331c9ab9322 |
|
Yu Zhang | 4ea7b796b7 |
Respect cutoff timestamp during flush (#11599)
Summary: Make flush respect the cutoff timestamp `full_history_ts_low` as much as possible for the user-defined timestamps in Memtables only feature. We achieve this by not proceeding with the actual flushing but instead reschedule the same `FlushRequest` so a follow up flush job can continue with the check after some interval. This approach doesn't work well for atomic flush, so this feature currently is not supported in combination with atomic flush. Furthermore, this approach also requires a customized method to get the next immediately bigger user-defined timestamp. So currently it's limited to comparator that use uint64_t as the user-defined timestamp format. This support can be extended when we add such a customized method to `AdvancedColumnFamilyOptions`. For non atomic flush request, at any single time, a column family can only have as many as one FlushRequest for it in the `flush_queue_`. There is deduplication done at `FlushRequest` enqueueing(`SchedulePendingFlush`) and dequeueing time (`PopFirstFromFlushQueue`). We hold the db mutex between when a `FlushRequest` is popped from the queue and the same FlushRequest get rescheduled, so no other `FlushRequest` with a higher `max_memtable_id` can be added to the `flush_queue_` blocking us from re-enqueueing the same `FlushRequest`. Flush is continued nevertheless if there is risk of entering write stall mode had the flush being postponed, e.g. due to accumulation of write buffers, exceeding the `max_write_buffer_number` setting. When this happens, the newest user-defined timestamp in the involved Memtables need to be tracked and we use it to increase the `full_history_ts_low`, which is an inclusive cutoff timestamp for which RocksDB promises to keep all user-defined timestamps equal to and newer than it. Tet plan: ``` ./column_family_test --gtest_filter="*RetainUDT*" ./memtable_list_test --gtest_filter="*WithTimestamp*" ./flush_job_test --gtest_filter="*WithTimestamp*" ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11599 Reviewed By: ajkr Differential Revision: D47561586 Pulled By: jowlyzhang fbshipit-source-id: 9400445f983dd6eac489e9dd0fb5d9b99637fe89 |
|
Yu Zhang | baf37a0e81 |
Fix a unit test hole for recovering UDTs with WAL files (#11577)
Summary: Thanks pdillinger for pointing out this test hole. The test `DBWALTestWithTimestamp.Recover` that is intended to test recovery from WAL including user-defined timestamps doesn't achieve its promised coverage. Specifically, after https://github.com/facebook/rocksdb/issues/11557, timestamps will be removed during flush, and RocksDB by default flush memtables during recovery with `avoid_flush_during_recovery` defaults to false. This test didn't fail even if all the timestamps are quickly lost due to the default flush behavior. This PR renamed test `Recover` to `RecoverAndNoFlush`, and updated it to verify timestamps are successfully recovered from WAL with some time-travel reads. `avoid_flush_during_recovery` is set to true to help do this verification. On the other hand, for test `DBWALTestWithTimestamp.RecoverAndFlush`, since flush on reopen is DB's default behavior. Setting the flags `max_write_buffer` and `arena_block_size` are not really the factors that enforces the flush, so these flags are removed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11577 Test Plan: ./db_wal_test Reviewed By: pdillinger Differential Revision: D47142892 Pulled By: jowlyzhang fbshipit-source-id: 9465e278806faa5885b541b4e32d99e698edef7d |
|
Yu Zhang | 15053f3ab4 |
Logically strip timestamp during flush (#11557)
Summary: Logically strip the user-defined timestamp when L0 files are created during flush when `AdvancedColumnFamilyOptions.persist_user_defined_timestamps` is false. Logically stripping timestamp here means replacing the original user-defined timestamp with a mininum timestamp, which for now is hard coded to be all zeros bytes. While working on this, I caught a missing piece on the `BlockBuilder` level for this feature. The current quick path `std::min(buffer_size, last_key_size)` needs a bit tweaking to work for this feature. When user-defined timestamp is stripped during block building, on writing first entry or right after resetting, `buffer` is empty and `buffer_size` is zero as usual. However, in follow-up writes, depending on the size of the stripped user-defined timestamp, and the size of the value, what's in `buffer` can sometimes be smaller than `last_key_size`, leading `std::min(buffer_size, last_key_size)` to truncate the `last_key`. Previous test doesn't caught the bug because in those tests, the size of the stripped user-defined timestamps bytes is smaller than the length of the value. In order to avoid the conditional operation, this PR changed the original trivial `std::min` operation into an arithmetic operation. Since this is a change in a hot and performance critical path, I did the following benchmark to check no observable regression is introduced. ```TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=50000000``` Compiled with DEBUG_LEVEL=0 Test vs. control runs simulaneous for better accuracy, units = ops/sec PR vs base: Round 1: 350652 vs 349055 Round 2: 365733 vs 364308 Round 3: 355681 vs 354475 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11557 Test Plan: New timestamp specific test added or existing tests augmented, both are parameterized with `UserDefinedTimestampTestMode`: `UserDefinedTimestampTestMode::kNormal` -> UDT feature enabled, write / read with min timestamp `UserDefinedTimestampTestMode::kStripUserDefinedTimestamps` -> UDT feature enabled, write / read with min timestamp, set Options.persist_user_defined_timestamps to false. ``` make all check ./db_wal_test --gtest_filter="*WithTimestamp*" ./flush_job_test --gtest_filter="*WithTimestamp*" ./repair_test --gtest_filter="*WithTimestamp*" ./block_based_table_reader_test ``` Reviewed By: pdillinger Differential Revision: D47027664 Pulled By: jowlyzhang fbshipit-source-id: e729193b6334dfc63aaa736d684d907a022571f5 |
|
Levi Tamasi | 022d89549d |
Attempt to deflake DBWALTestWithEnrichedEnv.SkipDeletedWALs (#11537)
Summary: Calling `Flush` (even with `wait==true`) does not guarantee that obsolete WAL files are physically deleted before the call returns. The patch attempts to fix the resulting flakiness by using `SyncPoint`s to make sure `PurgeObsoleteFiles` finishes before checking for WAL deletions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11537 Test Plan: ``` gtest-parallel --repeat=1000 ./db_wal_test --gtest_filter="*SkipDeletedWALs*" ``` Reviewed By: pdillinger Differential Revision: D46736050 Pulled By: ltamasi fbshipit-source-id: 47a931b7a3a03ef681fbf4adb5a0b223d452703e |
|
Yu Zhang | 56ca9e3106 |
Logging timestamp size record in WAL and use it during recovery (#11471)
Summary: Start logging the timestamp size record in WAL and use the record during recovery. Currently, user comparator cannot be different from what was used to create a column family, so the timestamp size record is just used to confirm it's consistent with the timestamp size the running user comparator indicates. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11471 Test Plan: ``` make all check ./db_secondary_test ./db_wal_test --gtest_filter="*WithTimestamp*" ./repair_test --gtest_filter="*WithTimestamp*" ``` Reviewed By: ltamasi Differential Revision: D46236769 Pulled By: jowlyzhang fbshipit-source-id: f6c60b5c8defdb05021c63df302ccc0be1275ad0 |
|
anand76 | 03a892a9fb |
Delete empty WAL files on reopen (#11409)
Summary: When a DB is opened, RocksDB creates an empty WAL file. When the DB is reopened and the WAL is empty, the min log number to keep is not advanced until a memtable flush happens. If a process crashes soon after reopening the DB, its likely that no memtable flush would have happened, which means the empty WAL file is not deleted. In a crash loop scenario, this leads to empty WAL files accumulating. Fix this by ensuring the min log number is advanced if the WAL is empty. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11409 Test Plan: Add a unit test Reviewed By: ajkr Differential Revision: D45281685 Pulled By: anand1976 fbshipit-source-id: 0225877c613e65ffb30972a0051db2830105423e |
|
Hui Xiao | 9b66331388 |
Simplify TEST_F(DBWALTest, FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL) (#11186)
Summary: **Context/Summary**: Simplify `TEST_F(DBWALTest, FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL)` based on https://github.com/facebook/rocksdb/pull/11016#pullrequestreview-1205020134 and delete unused sync points. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11186 Test Plan: - UT failed before fix in https://github.com/facebook/rocksdb/pull/10892 and passes after - Check UT not flaky when running with https://app.circleci.com/pipelines/github/facebook/rocksdb/21985/workflows/5f6cc355-78c1-46d8-89ee-0fd679725a8a/jobs/540878 Reviewed By: ajkr Differential Revision: D43034723 Pulled By: hx235 fbshipit-source-id: f503774987b8f3718505f99e95080a7fad28ac66 |
|
Peter Dillinger | 94e3beec77 |
Cleanup, improve, stress test LockWAL() (#11143)
Summary: The previous API comments for LockWAL didn't provide much about why you might want to use it, and didn't really meet what one would infer its contract was. Also, LockWAL was not in db_stress / crash test. In this change: * Implement a counting semantics for LockWAL()+UnlockWAL(), so that they can safely be used concurrently across threads or recursively within a thread. This should make the API much less bug-prone and easier to use. * Make sure no UnlockWAL() is needed after non-OK LockWAL() (to match RocksDB conventions) * Make UnlockWAL() reliably return non-OK when there's no matching LockWAL() (for debug-ability) * Clarify API comments on LockWAL(), UnlockWAL(), FlushWAL(), and SyncWAL(). Their exact meanings are not obvious, and I don't think it's appropriate to talk about implementation mutexes in the API comments, but about what operations might block each other. * Add LockWAL()/UnlockWAL() to db_stress and crash test, mostly to check for assertion failures, but also checks that latest seqno doesn't change while WAL is locked. This is simpler to add when LockWAL() is allowed in multiple threads. * Remove unnecessary use of sync points in test DBWALTest::LockWal. There was a bug during development of above changes that caused this test to fail sporadically, with and without this sync point change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11143 Test Plan: unit tests added / updated, added to stress/crash test Reviewed By: ajkr Differential Revision: D42848627 Pulled By: pdillinger fbshipit-source-id: 6d976c51791941a31fd8fbf28b0f82e888d9f4b4 |
|
sdong | 4720ba4391 |
Remove RocksDB LITE (#11147)
Summary: We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support. Most of changes were done through following comments: unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'` by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147 Test Plan: See CI Reviewed By: pdillinger Differential Revision: D42796341 fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2 |
|
Yanqin Jin | c93ba7db5d |
Revise LockWAL/UnlockWAL implementation (#11020)
Summary: RocksDB has two public APIs: `DB::LockWAL()`/`DB::UnlockWAL()`. The current implementation acquires and releases the internal `DBImpl::log_write_mutex_`. According to the comment on `DBImpl::log_write_mutex_`: https://github.com/facebook/rocksdb/blob/7.8.fb/db/db_impl/db_impl.h#L2287:L2288 > Note: to avoid dealock, if needed to acquire both log_write_mutex_ and mutex_, the order should be first mutex_ and then log_write_mutex_. This puts limitations on how applications can use the `LockWAL()` API. After `LockWAL()` returns ok, then application should not perform any operation that acquires `mutex_`. Currently, the use case of `LockWAL()` is MyRocks implementing the MySQL storage engine handlerton `lock_hton_log` interface. The operation that MyRocks performs after `LockWAL()` is `GetSortedWalFiless()` which not only acquires mutex_, but also `log_write_mutex_`. There are two issues: 1. Applications using these two APIs may hang if one thread calls `GetSortedWalFiles()` after calling `LockWAL()` because log_write_mutex is not recursive. 2. Two threads may dead lock due to lock order inversion. To fix these issues, we can modify the implementation of LockWAL so that it does not keep `log_write_mutex_` held until UnlockWAL. To achieve the goal of locking the WAL, we can instead manually inject a write stall so that all future writes will be stopped. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11020 Test Plan: make check Reviewed By: ajkr Differential Revision: D41785203 Pulled By: riversand963 fbshipit-source-id: 5ccb7a9c6eb9a2c3fa80fd2c399cc2568b8f89ce |
|
Hui Xiao | 15bb4ea084 |
Deflake DBWALTest.FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL (#11016)
Summary: **Context/Summary:** Credit to ajkr's https://github.com/facebook/rocksdb/pull/11016#pullrequestreview-1205020134, flaky test https://app.circleci.com/pipelines/github/facebook/rocksdb/21985/workflows/5f6cc355-78c1-46d8-89ee-0fd679725a8a/jobs/540878 is due to `Flush()` called in the test returned earlier than obsoleted WAL being found in background flush and SyncWAL() was called (i.e, "sync_point_called" sets to true). Fix this by making checking `sync_point_called == true` after obsoleted WAL is found and `SyncWAL()` is called. Also rename the "sync_point_called" to be something more specific. Also, fix a potential flakiness due to manually setting a log threshold to force new manifest creation. This is unreliable so I decided to use sync point to force new manifest creation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11016 Test Plan: make check Reviewed By: pdillinger Differential Revision: D41717786 Pulled By: hx235 fbshipit-source-id: ad1e4701a987285bbe6c8e7d9b05c4db06b4edf4 |
|
Hui Xiao | 2f76ab150d |
Fix missing WAL in new manifest by rolling over the WAL deletion record from prev manifest (#10892)
Summary: **Context** `Options::track_and_verify_wals_in_manifest = true` verifies each of the WALs tracked in manifest indeed presents in the WAL folder. If not, a corruption "Missing WAL with log number" will be thrown. `DB::SyncWAL()` called at a specific timing (i.e, at the `TEST_SYNC_POINT("FindObsoleteFiles::PostMutexUnlock")`) can record in a new manifest the WAL addition of a WAL file that already had a WAL deletion recorded in the previous manifest. And the WAL deletion record is not rollover-ed to the new manifest. So the new manifest creates the illusion of such WAL never gets deleted and should presents at db re/open. - Such WAL deletion record can be caused by flushing the memtable associated with that WAL and such WAL deletion can actually happen in` PurgeObsoleteFiles()`. As a consequence, upon `DB::Reopen()`, this WAL file can be deleted while manifest still has its WAL addition record , which causes a false alarm of corruption "Missing WAL with log number" to be thrown. **Summary** This PR fixes this false alarm by rolling over the WAL deletion record from prev manifest to the new manifest by adding the WAL deletion record to the new manifest. **Test** - Make check - Added new unit test `TEST_F(DBWALTest, FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL)` that failed before the fix and passed after - [Ongoing]CI stress test + aggressive value as in https://github.com/facebook/rocksdb/pull/10761 , which is how this false alarm was first surfaced, to confirm such false alarm disappears - [Ongoing]Regular CI stress test to confirm such fix didn't harm anything Pull Request resolved: https://github.com/facebook/rocksdb/pull/10892 Reviewed By: ajkr Differential Revision: D40778965 Pulled By: hx235 fbshipit-source-id: a512364bfdeb0b1a55c171890e60d856c528f37f |
|
Yanqin Jin | 538df26fcc |
Deflake DBWALTest.RaceInstallFlushResultsWithWalObsoletion (#10456)
Summary: Existing DBWALTest.RaceInstallFlushResultsWithWalObsoletion test relies on a specific interleaving of two background flush threads. We call them bg1 and bg2, and assume bg1 starts to install flush results ahead of bg2. After bg1 enters `ProcessManifestWrites`, bg1 waits for bg2 to also enter `MemTableList::TryInstallMemtableFlushResults()` before bg1 can proceed with MANIFEST write. However, if bg2 called `SyncClosedLogs()` and needed to commit to the MANIFEST but falls behind bg1, then bg2 needs to wait for bg1 to finish writing to MANIFEST. This is a circular dependency. Fix this by allowing bg2 to start only after bg1 grabs the chance to sync the WAL and commit to MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10456 Test Plan: 1. make check 2. export TEST_TMPDIR=/dev/shm && gtest-parallel -r 1000 -w 32 ./db_wal_test --gtest_filter=DBWALTest.RaceInstallFlushResultsWithWalObsoletion Reviewed By: ltamasi Differential Revision: D38391856 Pulled By: riversand963 fbshipit-source-id: 55f647d5b94e534c008a4dd2fb082675ddf58c96 |
|
Changyu Bi | 5f9fe7f21e |
Added WAL compression checksum (#10319)
Summary: Enabled zstd checksum flag in StreamingCompress so that WAL (de)compreression is protected by a checksum per compression frame. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10319 Test Plan: - `make check` - WAL perf: average ops/sec over 10 runs is 161226 pre PR and 159635 post PR (1% drop). ``` sudo TEST_TMPDIR=/dev/shm/memtable_write ./db_bench_checksum -benchmarks=fillseq -max_write_buffer_number=100 -num=1000000 -min_write_buffer_number_to_merge=10 -wal_compression=zstd ``` Reviewed By: ajkr Differential Revision: D37673311 Pulled By: cbi42 fbshipit-source-id: 9f34a3bfc2a82e5c80b1ec63bb339a7465108ec9 |
|
Yanqin Jin | b012d23557 |
Destroy iniital db dir for a test in DBWALTest (#10221)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10221 Reviewed By: hx235 Differential Revision: D37316280 Pulled By: riversand963 fbshipit-source-id: 062781acec2f36beebc62003bcc8ec280488d572 |
|
Gang Liao | deff48bcef |
Add blob source to retrieve blobs in RocksDB (#10198)
Summary: There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache. In this task, we formally introduced the blob source to RocksDB. BlobSource is a new abstraction layer that provides universal access to blobs, regardless of whether they are in the blob cache, secondary cache, or (remote) storage. Depending on user settings, it always fetch blobs from multi-tier cache and storage with minimal cost. Note: The new `MultiGetBlob()` implementation is not included in the current PR. To go faster, we aim to create a separate PR for it in parallel! This PR is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10198 Reviewed By: ltamasi Differential Revision: D37294735 Pulled By: gangliao fbshipit-source-id: 9cb50422d9dd1bc03798501c2778b6c7520c7a1e |
|
Akanksha Mahajan | 8273435c22 |
Bypass tests instead of skipping to resolve internal failure (#10148)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10148 Reviewed By: hx235 Differential Revision: D37092202 Pulled By: akankshamahajan15 fbshipit-source-id: 12fae5641a1c4ab584e586db95f4044273aba23a |
|
Akanksha Mahajan | f85b31a2e9 |
Fix bug for WalManager with compressed WAL (#10130)
Summary: RocksDB uses WalManager to manage WAL files. In WalManager::ReadFirstLine(), the assumption is that reading the first record of a valid WAL file will return OK status and set the output sequence to non-zero value. This assumption has been broken by WAL compression which writes a `kSetCompressionType` record which is not associated with any sequence number. Consequently, WalManager::GetSortedWalsOfType() will skip these WALs and not return them to caller, e.g. Checkpoint, Backup, causing the operations to fail. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10130 Test Plan: - Newly Added test Reviewed By: riversand963 Differential Revision: D36985744 Pulled By: akankshamahajan15 fbshipit-source-id: dfde7b3be68b6a30b75b49479779748eedf29f7f |
|
Peter Dillinger | a00cffaf69 |
Reduce risk of backup or checkpoint missing a WAL file (#10083)
Summary: We recently saw a case in crash test in which a WAL file in the middle of the list of live WALs was not included in the backup, so the DB was not openable due to missing WAL. We are not sure why, but this change should at least turn that into a backup-time failure by ensuring all the WAL files expected by the manifest (according to VersionSet) are included in `GetSortedWalFiles()` (used by `GetLiveFilesStorageInfo()`, `BackupEngine`, and `Checkpoint`) Related: to maximize the effectiveness of track_and_verify_wals_in_manifest with GetSortedWalFiles() during checkpoint/backup, we will now sync WAL in GetLiveFilesStorageInfo() when track_and_verify_wals_in_manifest=true. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10083 Test Plan: added new unit test for the check in GetSortedWalFiles() Reviewed By: ajkr Differential Revision: D36791608 Pulled By: pdillinger fbshipit-source-id: a27bcf0213fc7ab177760fede50d4375d579afa6 |
|
sdong | 736a7b5433 |
Remove own ToString() (#9955)
Summary: ToString() is created as some platform doesn't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just remove ToString(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955 Test Plan: Watch CI tests Reviewed By: riversand963 Differential Revision: D36176799 fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471 |
|
sdong | 49628c9a83 |
Use std::numeric_limits<> (#9954)
Summary: Right now we still don't fully use std::numeric_limits but use a macro, mainly for supporting VS 2013. Right now we only support VS 2017 and up so it is not a problem. The code comment claims that MinGW still needs it. We don't have a CI running MinGW so it's hard to validate. since we now require C++17, it's hard to imagine MinGW would still build RocksDB but doesn't support std::numeric_limits<>. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9954 Test Plan: See CI Runs. Reviewed By: riversand963 Differential Revision: D36173954 fbshipit-source-id: a35a73af17cdcae20e258cdef57fcf29a50b49e0 |
|
Akanksha Mahajan | ae82d91492 |
Remove corrupted WAL files in kPointRecoveryMode with avoid_flush_duing_recovery set true (#9634)
Summary: 1) In case of non-TransactionDB and avoid_flush_during_recovery = true, RocksDB won't flush the data from WAL to L0 for all column families if possible. As a result, not all column families can increase their log_numbers, and min_log_number_to_keep won't change. 2) For transaction DB (.allow_2pc), even with the flush, there may be old WAL files that it must not delete because they can contain data of uncommitted transactions and min_log_number_to_keep won't change. If we persist a new MANIFEST with advanced log_numbers for some column families, then during a second crash after persisting the MANIFEST, RocksDB will see some column families' log_numbers larger than the corrupted wal, and the "column family inconsistency" error will be hit, causing recovery to fail. As a solution, 1. the corrupted WALs whose numbers are larger than the corrupted wal and smaller than the new WAL will be moved to archive folder. 2. Currently, RocksDB DB::Open() may creates and writes to two new MANIFEST files even before recovery succeeds. This PR buffers the edits in a structure and writes to a new MANIFEST after recovery is successful Pull Request resolved: https://github.com/facebook/rocksdb/pull/9634 Test Plan: 1. Added new unit tests 2. make crast_test -j Reviewed By: riversand963 Differential Revision: D34463666 Pulled By: akankshamahajan15 fbshipit-source-id: e233d3af0ed4e2028ca0cf051e5a334a0fdc9d19 |
|
Yanqin Jin | e0c84aa0dc |
Fix a race condition in WAL tracking causing DB open failure (#9715)
Summary: There is a race condition if WAL tracking in the MANIFEST is enabled in a database that disables 2PC. The race condition is between two background flush threads trying to install flush results to the MANIFEST. Consider an example database with two column families: "default" (cfd0) and "cf1" (cfd1). Initially, both column families have one mutable (active) memtable whose data backed by 6.log. 1. Trigger a manual flush for "cf1", creating a 7.log 2. Insert another key to "default", and trigger flush for "default", creating 8.log 3. BgFlushThread1 finishes writing 9.sst 4. BgFlushThread2 finishes writing 10.sst ``` Time BgFlushThread1 BgFlushThread2 | mutex_.Lock() | precompute min_wal_to_keep as 6 | mutex_.Unlock() | mutex_.Lock() | precompute min_wal_to_keep as 6 | join MANIFEST write queue and mutex_.Unlock() | write to MANIFEST | mutex_.Lock() | cfd1->log_number = 7 | Signal bg_flush_2 and mutex_.Unlock() | wake up and mutex_.Lock() | cfd0->log_number = 8 | FindObsoleteFiles() with job_context->log_number == 7 | mutex_.Unlock() | PurgeObsoleteFiles() deletes 6.log V ``` As shown in the above, BgFlushThread2 thinks that the min wal to keep is 6.log because "cf1" has unflushed data in 6.log (cf1.log_number=6). Similarly, BgThread1 thinks that min wal to keep is also 6.log because "default" has unflushed data (default.log_number=6). No WAL deletion will be written to MANIFEST because 6 is equal to `versions_->wals_.min_wal_number_to_keep`, due to https://github.com/facebook/rocksdb/blob/7.1.fb/db/memtable_list.cc#L513:L514. The bg flush thread that finishes last will perform file purging. `job_context.log_number` will be evaluated as 7, i.e. the min wal that contains unflushed data, causing 6.log to be deleted. However, MANIFEST thinks 6.log should still exist. If you close the db at this point, you won't be able to re-open it if `track_and_verify_wal_in_manifest` is true. We must handle the case of multiple bg flush threads, and it is difficult for one bg flush thread to know the correct min wal number until the other bg flush threads have finished committing to the manifest and updated the `cfd::log_number`. To fix this issue, we rename an existing variable `min_log_number_to_keep_2pc` to `min_log_number_to_keep`, and use it to track WAL file deletion in non-2pc mode as well. This variable is updated only 1) during recovery with mutex held, or 2) in the MANIFEST write thread. `min_log_number_to_keep` means RocksDB will delete WALs below it, although there may be WALs above it which are also obsolete. Formally, we will have [min_wal_to_keep, max_obsolete_wal]. During recovery, we make sure that only WALs above max_obsolete_wal are checked and added back to `alive_log_files_`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9715 Test Plan: ``` make check ``` Also ran stress test below (with asan) to make sure it completes successfully. ``` TEST_TMPDIR=/dev/shm/rocksdb OPT=-g ASAN_OPTIONS=disable_coredump=0 \ CRASH_TEST_EXT_ARGS=--compression_type=zstd SKIP_FORMAT_BUCK_CHECKS=1 \ make J=52 -j52 blackbox_asan_crash_test ``` Reviewed By: ltamasi Differential Revision: D34984412 Pulled By: riversand963 fbshipit-source-id: c7b21a8d84751bb55ea79c9f387103d21b231005 |
|
Siddhartha Roychowdhury | 39b0d92153 |
Add record to set WAL compression type if enabled (#9556)
Summary: When WAL compression is enabled, add a record (new record type) to store the compression type to indicate that all subsequent records are compressed. The log reader will store the compression type when this record is encountered and use the type to uncompress the subsequent records. Compress and uncompress to be implemented in subsequent diffs. Enabled WAL compression in some WAL tests to check for regressions. Some tests that rely on offsets have been disabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9556 Reviewed By: anand1976 Differential Revision: D34308216 Pulled By: sidroyc fbshipit-source-id: 7f10595e46f3277f1ea2d309fbf95e2e935a8705 |
|
Levi Tamasi | 320d9a8e8a |
Use a sorted vector instead of a map to store blob file metadata (#9526)
Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc |
|
Yanqin Jin | bd513fd075 |
Add commit marker with timestamp (#9266)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9266 This diff adds a new tag `CommitWithTimestamp`. Currently, there is no API to trigger writing this tag to WAL, thus it is unavailable to users. This is an ongoing effort to add user-defined timestamp support to write-committed transactions. This diff also indicates all column families that may potentially participate in the same transaction must either disable timestamp or have the same timestamp format, since `CommitWithTimestamp` tag is followed by a single byte-array denoting the commit timestamp of the transaction. We will enforce this checking in a future diff. We keep this diff small. Reviewed By: ltamasi Differential Revision: D31721350 fbshipit-source-id: e1450811443647feb6ca01adec4c8aaae270ffc6 |
|
Drewryz | 3b27725245 |
Fix a minor issue with initializing the test path (#8555)
Summary: The PerThreadDBPath has already specified a slash. It does not need to be specified when initializing the test path. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8555 Reviewed By: ajkr Differential Revision: D29758399 Pulled By: jay-zhuang fbshipit-source-id: 6d2b878523e3e8580536e2829cb25489844d9011 |
|
Zhichao Cao | f44e69c64a |
Use DbSessionId as cache key prefix when secondary cache is enabled (#8360)
Summary: Currently, we either use the file system inode or a monotonically incrementing runtime ID as the block cache key prefix. However, if we use a monotonically incrementing runtime ID (in the case that the file system does not support inode id generation), in some cases, it cannot ensure uniqueness (e.g., we have secondary cache migrated from host to host). We use DbSessionID (20 bytes) + current file number (at most 10 bytes) as the new cache block key prefix when the secondary cache is enabled. So can accommodate scenarios such as transfer of cache state across hosts. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8360 Test Plan: add the test to lru_cache_test Reviewed By: pdillinger Differential Revision: D29006215 Pulled By: zhichao-cao fbshipit-source-id: 6cff686b38d83904667a2bd39923cd030df16814 |
|
Peter (Stig) Edwards | c75ef03e58 |
Do not truncate WAL if in read_only mode (#8313)
Summary:
I noticed ```openat``` system call with ```O_WRONLY``` flag and ```sync_file_range``` and ```truncate``` on WAL file when using ```rocksdb::DB::OpenForReadOnly``` by way of ```db_bench --readonly=true --benchmarks=readseq --use_existing_db=1 --num=1 ...```
Noticed in ```strace``` after seeing the last modification time of the WAL file change after each run (with ```--readonly=true```).
I think introduced by
|
|
Giuseppe Ottaviano | 48cd7a3aae |
Fix flush reason attribution (#8150)
Summary: Current flush reason attribution is misleading or incorrect (depending on what the original intention was): - Flush due to WAL reaching its maximum size is attributed to `kWriteBufferManager` - Flushes due to full write buffer and write buffer manager are not distinguishable, both are attributed to `kWriteBufferFull` This changes the first to a new flush reason `kWALFull`, and splits the second between `kWriteBufferManager` and `kWriteBufferFull`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8150 Reviewed By: zhichao-cao Differential Revision: D27569645 Pulled By: ot fbshipit-source-id: 7e3c8ca186a6e71976e6b8e937297eebd4b769cc |
|
anand76 | 7d7f14480e |
Always truncate the latest WAL file on DB Open (#8122)
Summary: Currently, we only truncate the latest alive WAL files when the DB is opened. If the latest WAL file is empty or was flushed during Open, its not truncated since the file will be deleted later on in the Open path. However, before deletion, a new WAL file is created, and if the process crash loops between the new WAL file creation and deletion of the old WAL file, the preallocated space will keep accumulating and eventually use up all disk space. To prevent this, always truncate the latest WAL file, even if its empty or the data was flushed. Tests: Add unit tests to db_wal_test Pull Request resolved: https://github.com/facebook/rocksdb/pull/8122 Reviewed By: riversand963 Differential Revision: D27366132 Pulled By: anand1976 fbshipit-source-id: f923cc03ef033ccb32b140d36c6a63a8152f0e8e |