Summary:
**Context/Summary:**
We recently discovered a case where write of the same key right after error recovery of a previous failed write of the same key finishes causes two same WAL entries, violating our assertion. This is because we don't advance seqno on failed write and reuse the same WAL containing the failed write for the new write if the memtable at the time is empty.
This PR reuses the flush path for an empty memtable to switch WAL and update min WAL to keep in error recovery flush
as well as updates the INFO log message for clarity.
```
2024/07/17-15:01:32.271789 327757 (Original Log Time 2024/07/17-15:01:25.942234) [/flush_job.cc:1017] [default] [JOB 2] Level-0 flush table https://github.com/facebook/rocksdb/issues/9: 0 bytes OK It's an empty SST file from a successful flush so won't be kept in the DB
2024/07/17-15:01:32.271798 327757 (Original Log Time 2024/07/17-15:01:32.269954) [/memtable_list.cc:560] [default] Level-0 commit flush result of table https://github.com/facebook/rocksdb/issues/9 started
2024/07/17-15:01:32.271802 327757 (Original Log Time 2024/07/17-15:01:32.271217) [/memtable_list.cc:760] [default] Level-0 commit flush result of table https://github.com/facebook/rocksdb/issues/9: memtable https://github.com/facebook/rocksdb/issues/1 done
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12873
Test Plan:
New UT that failed before this PR with following assertion failure (i.e, duplicate WAL entries) and passes after
```
db_wal_test: db/write_batch.cc:2254: rocksdb::Status rocksdb::{anonymous}::MemTableInserter::PutCFImpl(uint32_t, const rocksdb::Slice&, const rocksdb::Slice&, rocksdb::ValueType, RebuildTxnOp, const ProtectionInfoKVOS64*) [with RebuildTxnOp = rocksdb::{anonymous}::MemTableInserter::PutCF(uint32_t, const rocksdb::Slice&, const rocksdb::Slice&)::<lambda(rocksdb::WriteBatch*, uint32_t, const rocksdb::Slice&, const rocksdb::Slice&)>; uint32_t = unsigned int; rocksdb::ProtectionInfoKVOS64 = rocksdb::ProtectionInfoKVOS<long unsigned int>]: Assertion `seq_per_batch_' failed.
```
Reviewed By: anand1976
Differential Revision: D59884468
Pulled By: hx235
fbshipit-source-id: 5d854b719092552c69727a979f269fb7f6c39756
Summary:
InitInputTableProperties() can open and do IOs and is called under mutex_. This PR removes it from FinalizeInputInfo(). It is now called in CompactionJob::Run() and BuildCompactionJobInfo() (called in NotifyOnCompactionBegin()) without holding mutex_.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12879
Test Plan: existing unit tests. Added assert in GetInputTableProperties() to ensure that input_table_properties_ is initialized whenever it's called.
Reviewed By: hx235
Differential Revision: D59933195
Pulled By: cbi42
fbshipit-source-id: c8089e13af8567fa3ab4b94d9ec384ae98ab2ec8
Summary:
... to enable use cases like using RocksDB to merge sort data for ingestion. A new file ingestion option `IngestExternalFileOptions::allow_db_generated_files` is introduced to allows users to ingest SST files generated by live DBs instead of SstFileWriter. For now this only works if the SST files being ingested have zero as their largest sequence number AND do not overlap with any data in the DB (so we can assign seqno 0 which matches the seqno of all ingested keys).
The feature is marked the option as experimental for now.
Main changes needed to enable this:
- ignore CF id mismatch during ingestion
- ignore the missing external file version table property
Rest of the change is mostly in new unit tests.
A previous attempt is in https://github.com/facebook/rocksdb/issues/5602.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12750
Test Plan: - new unit tests
Reviewed By: ajkr, jowlyzhang
Differential Revision: D58396673
Pulled By: cbi42
fbshipit-source-id: aae513afad7b1ff5d4faa48104df5f384926bf03
Summary:
This PR fixes an assertion failure in `DBImpl::ResumeImpl` - `assert(!versions_->descriptor_log_)`. In `VersionSet`, `descriptor_log_` has a pointer to the current MANIFEST writer. When there's an error updating the manifest, `descriptor_log_` is reset, and the error recovery thread checks `io_status()` in `VersionSet` and attempts to write a new MANIFEST. If another DB manipulation happens at the same time (like external file ingestion, column family manipulation etc), it calls `LogAndApply`, which also attempts to write a new MANIFEST. The assertion in `ResumeImpl` might fail in this case since the other MANIFEST writer may have updated `descriptor_log_`. To prevent the assertion, this fix updates both `io_status_` and `descriptor_log_` while holding the DB mutex.
The other option would have been to simply remove the assert. But I think its important to have it to ensure the invariant that `io_status_` is cleared if the MANIFEST is written successfully, and this fix makes things easier to reason about.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12871
Test Plan: Existing tests and crash test
Reviewed By: hx235
Differential Revision: D59926947
Pulled By: anand1976
fbshipit-source-id: af9ad18da3e29fc62c7ec2e30e0738aa33d4e5f1
Summary:
Basically, the fix in https://github.com/facebook/rocksdb/issues/8137 was incomplete (and I missed it in the review), because if `whole_key_filtering` is false, then `last_prefix_str_` will never be set to non-empty and the fix doesn't work. Also related to https://github.com/facebook/rocksdb/issues/5835.
This is intended as a safe, simple fix that will regress CPU efficiency slightly (for `whole_key_filtering=false` cases, because of extra prefix string copies during flush & compaction). An efficient fix is not possible without some substantial refactoring.
Also in this PR: new test DBBloomFilterTest.FilterNumEntriesCoalesce tests an adjacent code path that was previously untested for its effect of ensuring the number of unique prefixes and keys is tracked properly when both prefixes and whole keys are going into a filter. (Test fails when either of the two code segments checking for duplicates is disabled.) In addition, the same test would fail before the main bug fix here because the code would inappropriately add the empty string to the filter (because of unmodified `last_prefix_str_`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12872
Test Plan: In addition to DBBloomFilterTest.FilterNumEntriesCoalesce, extended DBBloomFilterTest.SeekForPrevWithPartitionedFilters to cover the broken case. (Mostly whitespace change.)
Reviewed By: jowlyzhang
Differential Revision: D59873793
Pulled By: pdillinger
fbshipit-source-id: 2a7b7f09ca73dc188fb4dab833826ad6da7ebb11
Summary:
The failure of `WriteCurrentStateToManifest()` in `VersionSet::ProcessManifestWrites()` was not handled properly. If it failed, `manifest_io_status` was not updated, leading to `manifest_file_number_` being updated to the newly created manifest even though its bad. This would lead to the bad manifest immediately getting deleted, and also the good manifest (referenced by `CURRENT`) getting deleted by obsolete file deletion because of `manifest_file_number_` not referencing its number.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12865
Reviewed By: hx235
Differential Revision: D59782940
Pulled By: anand1976
fbshipit-source-id: f752fb9a1c23fd3d734616e273613cbac204301b
Summary:
**Context/Summary:**
`*auto_recovery` needs to be set true in order for `OnErrorRecoveryBegin()` to be called before auto-recovery
3db030d7ee/db/event_helpers.cc (L64-L66)
Currently it's set false for auto-recovery. This PR fixes it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12860
Test Plan:
- Manual observation that it is called
- Existing UT
Reviewed By: jowlyzhang
Differential Revision: D59693315
Pulled By: hx235
fbshipit-source-id: 3f428c5b1e9818bb7697fdcd7f245d11378eb14a
Summary:
Create C API function for iterating over WriteBatch for custom Column Families
Adding function to C API that exposes column family specific methods to iterate over WriteBatch: put_cf, delete_cf and merge_cf. This is required when the one needs to read changes for any non-default column family. Without that functionality it is impossible to iterate over changes in WAL that are relevant to custom column families.
Fixes https://github.com/facebook/rocksdb/issues/12790
Testing:
Added WriteBatch iteration test to "columnfamilies" section of C API unit tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12718
Reviewed By: cbi42
Differential Revision: D59483601
Pulled By: ajkr
fbshipit-source-id: b68b900636304528a38620a8c3ad82fdce4b60cb
Summary:
We are seeing a number of crash test failures coming from checkpoint and backup code, likely from WalManager::GetSortedWalFiles -> ... -> WalManager::ReadFirstLine and this code path is not needed, because we don't need to know the sequence numbers of WAL files going into a checkpoint or backup. We can minimize the impact of whatever inconsistency is causing that problem by not relying on it where it's not needed.
Similarly, when we only need a roughly accurate set of current WAL files, we don't need to query all the archived WAL files (and redundantly the live ones again).
So this reduces filesystem queries and DB mutex acquires in creating backups and checkpoints.
Needed follow-up:
Figure out what is causing various failures with an apparent inconsistency where GetSortedWalFiles fails on reading a WAL file. If it's an injected failure, perhaps it's not propagating that injected failure appropriately. It might also be an inconsistency between what the DB knows is flushed and what WalManager reads from the filesystem (which we know is dubious and should be phased out, which this is arguably another step toward). Or completing that phase-out might solve the problem without a full diagnosis.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12831
Test Plan:
existing tests (easily caught when I went too far in initally developing this change)
Update to BackupUsingDirectIO test so that there's a WAL file in what is backed up. (Was relying on some oddity.)
Reviewed By: cbi42
Differential Revision: D59252649
Pulled By: pdillinger
fbshipit-source-id: 7ad4187a1c70caa59a6d6c1c643ef95232b929f5
Summary:
**Context/Summary:**
It seems unreasonable to take the archived log size into account when calculating log size **for flush** in method CreateCheckpoint. If the user sets WAL_ttl_seconds or WAL_size_limit_MB, the argument _log_size_for_flush_ can easily be reached due to the size of the archived dir. As a result, the flush may always be triggered.
**Test**
corverd by ./checkpoint_test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12680
Reviewed By: jaykorean
Differential Revision: D59097904
Pulled By: ajkr
fbshipit-source-id: 0ed29c1b078d8f40b85288541b008e00dbc517d3
Summary:
This PR adds user property collector factory `CompactForTieringCollectorFactory` to support observe SST file and mark it as need compaction for fast tracking data to the proper tier.
A triggering ratio `compaction_trigger_ratio_` can be configured to achieve the following:
1) Setting the ratio to be equal to or smaller than 0 disables this collector
2) Setting the ratio to be within (0, 1] will write the number of observed eligible entries into a user property and marks a file as need-compaction when aforementioned condition is met.
3) Setting the ratio to be higher than 1 can be used to just writes the user table property, and not mark any file as need compaction.
For a column family that does not enable tiering feature, even if an effective configuration is provided, this collector is still disabled. For a file that is already on the last level, this collector is also disabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12760
Test Plan: Added unit tests
Reviewed By: pdillinger
Differential Revision: D58734976
Pulled By: jowlyzhang
fbshipit-source-id: 6daab2c4f62b5c6689c3c03e3b3907bbbe6b7a81
Summary:
Unknown why these would ignore options like deadline and read_tier. Setting total_order_seek=true is unnecessary because of the disable_prefix_seek (= true) parameter to NewIndexIterator. This is only used by the hash index, which uses total order seek if either the ReadOption or the parameter is true. The parameter is arguably redundant with the total_order_seek option, meaning it could be eliminated, but I think this case is exceptional (compared to e.g. no_io):
* Prefix seek is particular to user iterators, though might be usable, carefully, for other read operations.
* The historical default of total_order_seek=false in a sense is "wrong result by default" so cannot be interpreted as an intent to force prefix seek in an operation for which it might be usual or give bad results.
Also added a generic release note to cover this and related PRs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12764
Test Plan: existing tests
Reviewed By: hx235
Differential Revision: D58474240
Pulled By: pdillinger
fbshipit-source-id: 79014d9822ba8f09d57ce4524363aa0973017b68
Summary:
POSIX semantics for LinkFile (hard links) allow linking a file
that is still being written two, with both the source and destination
showing any subsequent writes to the source. This may not be practical
semantics for some FileSystem implementations such as remote storage.
They might only link the flushed or sync-ed file contents at time of
LinkFile, or might even have undefined behavior if LinkFile is called on
a file still open for write (not yet "sealed"). This change builds on https://github.com/facebook/rocksdb/issues/12731
to bring more hygiene to our handling of WAL files in Checkpoint.
Specifically, we now Close WAL files as soon as they are either
(a) inactive and fully synced, or (b) inactive and obsolete (so maybe
never fully synced), rather than letting Close() happen in handling
obsolete files (maybe a background thread). This should not be a
performance issue as Close() should be trivial cost relative to other
IO ops, but just in case:
* We don't Close() while holding a mutex, to avoid blocking, and
* The old behavior is available with a new kill switch option
`background_close_inactive_wals`.
Stacked on https://github.com/facebook/rocksdb/issues/12731
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12734
Test Plan:
Extended existing unit test, especially adding a hygiene
check to FaultInjectionTestFS to detect LinkFile() on a file still open
for writes. FaultInjectionTestFS already has relevant tracking data, and
tests can opt out of the new check, as in a smoke test I have left for
the old, deprecated functionality `background_close_inactive_wals=true`.
Also ran lengthy blackbox_crash_test to ensure the hygiene check is OK
with the crash test. (The only place I can find we use LinkFile in
production is Checkpoint.)
Reviewed By: cbi42
Differential Revision: D58295284
Pulled By: pdillinger
fbshipit-source-id: 64d90ed8477e2366c19eaf9c4c5ad60b82cac5c6
Summary:
**Context/Summary:** a better API design is decided lately so we decided to revert these two changes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12738
Test Plan: - CI
Reviewed By: ajkr
Differential Revision: D58162165
Pulled By: hx235
fbshipit-source-id: 9bbe4d2fe9fbe39213f4cf137a2d419e6ffb8e16
Summary:
Background: there is one active WAL file but there can be
several more WAL files in various states. Those other WALs are always
in a "flushed" state but could be on the `logs_` list not yet fully
synced. We currently allow any WAL that is not the active WAL to be
hard-linked when creating a Checkpoint, as although it might still be
open for write, we are not appending any more data to it.
The problem is that a created Checkpoint is supposed to be fully synced
on return of that function, and a hard-linked WAL in the state described
above might not be fully synced. (Through some prudence in https://github.com/facebook/rocksdb/issues/10083,
it would synced if using track_and_verify_wals_in_manifest=true.)
The fix is a step toward a long term goal of removing the need to query
the filesystem to determine WAL files and their state. (I consider it
dubious any time we independently read from or query metadata from a
file we have open for writing, as this makes us more susceptible to
FileSystem deficiencies or races.) More specifically:
* Detect which WALs might not be fully synced, according to our DBImpl
metadata, and prevent hard linking those (with `trim_to_size=true`
from `GetLiveFilesStorageInfo()`. And while we're at it, use our known
flushed sizes for those WALs.
* To avoid a race between that and GetSortedWalFiles(), track a maximum
needed WAL number for the Checkpoint/GetLiveFilesStorageInfo.
* Because of the level of consistency provided by those two, we no
longer need to consider syncing as part of the FlushWAL in
GetLiveFilesStorageInfo. (We determine the max WAL number consistent
with the manifest file size, while holding DB mutex. Should make
track_and_verify_wals_in_manifest happy.) This makes the premise of
test PutRaceWithCheckpointTrackedWalSync obsolete (sync point callback
no longer hit) so the test is removed, with crash test as backstop for
related issues. See https://github.com/facebook/rocksdb/issues/10185
Stacked on https://github.com/facebook/rocksdb/issues/12729
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12731
Test Plan:
Expanded an existing test, which now fails before fix.
Also long runs of blackbox_crash_test with amplified checkpoint frequency.
Reviewed By: cbi42
Differential Revision: D58199629
Pulled By: pdillinger
fbshipit-source-id: 376e55f4a2b082cd2adb6408a41209de14422382
Summary:
A `BlockBasedTable` with `TieredSecondaryCache` containing a NVM cache inserts blocks into the compressed cache and the corresponding compressed block into the NVM cache. The `BlockFetcher` is used to get the uncompressed and compressed blocks by calling `ReadBlockContents()` and `GetUncompressedBlock()` respectively. If the file system supports FSBuffer (i.e returning a FS allocated buffer rather than caller provided), that buffer gets freed between the two calls. This PR fixes it by making the FSBuffer unique pointer a member rather than local variable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12712
Test Plan:
1. Add a unit test
2. Release validation stress test
Reviewed By: jaykorean
Differential Revision: D57974026
Pulled By: anand1976
fbshipit-source-id: cfa895914e74b4f628413b40e6e39d8d8e5286bd
Summary:
We tested on icelake server (vcpu=160). The default configuration is allow_concurrent_memtable_write=1, thread number =activate core number. With our optimizations, the improvement can reach up to 184% in fillseq case. op/s is as the performance indicator in db_bench, and the following are performance improvements in some cases in db_bench.
| case name | optimized/original |
|-------------------:|--------------------:|
| fillrandom | 182% |
| fillseq | 184% |
| fillsync | 136% |
| overwrite | 179% |
| randomreplacekeys | 180% |
| randomtransaction | 161% |
| updaterandom | 163% |
| xorupdaterandom | 165% |
With analysis, we find that although the process of writing memtable is processed in parallel, the process of waking up the writers is not processed in parallel, which means that only one writers is responsible for the sequential waking up other writers. The following is our method to optimize this process.
Assume that there are currently n threads in total, we parallelize SetState in LaunchParallelMemTableWriters. To wake up each writer to write its own memtable, the leader writer first wakes up the (n^0.5-1) caller writers, and then those callers and the leader will wake up n/x separately to write to the memtable. This reduces the number for the leader's to SetState n-1 writers to 2*(n^0.5) writers in turn.
A reproduction script:
./db_bench --benchmarks="fillrandom" --threads ${number of all activate vcpu} --seed 1708494134896523 --duration 60
![image](https://github.com/facebook/rocksdb/assets/22110918/c5eca02f-93b3-4434-bba2-5155fc892a97)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12545
Reviewed By: ajkr
Differential Revision: D57422827
Pulled By: cbi42
fbshipit-source-id: 94127937c0c61e4241720bd902c82c607b7b2431
Summary:
Add the `--leader_path` option to specify the directory path of the leader for a follower RocksDB instance. This PR also adds a `count` command to the repl shell. While not specific to followers, it is useful for testing purposes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12682
Reviewed By: jowlyzhang
Differential Revision: D57642296
Pulled By: anand1976
fbshipit-source-id: 53767d496ecadc363ff92cd958b8e15a7bf3b151
Summary:
This PR adds UpdateTimestamp API of WriteBatch and WBWI, create WB, WBWI with all options and Iterator Refresh in C API
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10529
Reviewed By: cbi42
Differential Revision: D57826913
Pulled By: ajkr
fbshipit-source-id: d2ec840129f61a1d3a5a12e859728be98ebbad2f
Summary:
These names are confusing with `Logger` etc. so moving to `WalFile` etc.
Other small, related name refactorings.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12695
Test Plan: Left most unit tests using old names as an API compatibility test. Non-test code compiles with deprecated names removed. No functional changes.
Reviewed By: ajkr
Differential Revision: D57747458
Pulled By: pdillinger
fbshipit-source-id: 7b77596b9c20d865d43b9dc66c30c8bd2b3b424f
Summary:
I think the point of the `if (end_of_buffer_offset_ - buffer_.size() == 0)` was to only set `recycled_` when the first record was read. However, the condition was false when reading the first record when the WAL began with a `kSetCompressionType` record because we had already dropped the `kSetCompressionType` record from `buffer_`. To fix this, I used `first_record_read_` instead.
Also, it was pretty confusing to treat the WAL as non-recycled when a recyclable record first appeared in a non-first record. I changed it to return an error if that happens.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12643
Reviewed By: hx235
Differential Revision: D57238099
Pulled By: ajkr
fbshipit-source-id: e20a2a0c9cf0c9510a7b6af463650a05d559239e
Summary:
**Context/Summary:** https://github.com/facebook/rocksdb/pull/12556 `avoid_sync_during_shutdown=false` missed an edge case where `manual_wal_flush == true` so WAL sync will still miss unflushed WAL. This PR fixes it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12684
Test Plan: modified UT to include this case `manual_wal_flush==true`
Reviewed By: cbi42
Differential Revision: D57655861
Pulled By: hx235
fbshipit-source-id: c9f49fe260e8b38b3ea387558432dcd9a3dbec19
Summary:
We recently noticed that some memtable flushed and file
ingestions could proceed during LockWAL, in violation of its stated
contract. (Note: we aren't 100% sure its actually needed by MySQL, but
we want it to be in a clean state nonetheless.)
Despite earlier skepticism that this could be done safely (https://github.com/facebook/rocksdb/issues/12666), I
found a place to wait to wait for LockWAL to be cleared before allowing
these operations to proceed: WaitForPendingWrites()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12652
Test Plan:
Added to unit tests. Extended how db_stress validates LockWAL
and re-enabled combination of ingestion and LockWAL in crash test, in
follow-up to https://github.com/facebook/rocksdb/issues/12642
Ran blackbox_crash_test for a long while with relevant features
amplified.
Suggested follow-up: fix FaultInjectionTestFS to report file sizes
consistent with what the user has requested to be flushed.
Reviewed By: jowlyzhang
Differential Revision: D57622142
Pulled By: pdillinger
fbshipit-source-id: aef265fce69465618974b4ec47f4636257c676ce
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12668
The patch adds a new `GetEntityForUpdate` API to optimistic and WriteCommitted pessimistic transactions, which provides transactional wide-column point lookup functionality with concurrency control. For WriteCommitted transactions, user-defined timestamps are also supported similarly to the `GetForUpdate` API.
Reviewed By: jaykorean
Differential Revision: D57458304
fbshipit-source-id: 7eadbac531ca5446353e494abbd0635d63f62d24
Summary:
`ReadOptions::pin_data` already has the effect of pinning the `Slice` returned by `Iterator::value()` when the value is stored inline (e.g., `kTypeValue`). This PR adds a bit of visibility into that via a new `Iterator` property, "rocksdb.iterator.is-value-pinned", as well as some documentation and tests.
See also: https://github.com/facebook/rocksdb/issues/12658
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12659
Reviewed By: cbi42
Differential Revision: D57391200
Pulled By: ajkr
fbshipit-source-id: 0caa8db27ca1aba86ee2addc3dfd6f0e003d32e2
Summary:
**Context/Summary:**
Previously `CompactFiles()` used `RangeOverlapWithCompaction()` to check for conflict when sanitizing input files while later used `FilesRangeOverlapWithCompaction()` to assert for no conflict. The latter function checks for more conflict scenarios than the former does, particularly the ones arising from `preclude_last_level_data_seconds > 0` (i.e, compaction can output to second-to-the-last level). So we ran into assertion violation in `CompactFiles()` like below
```
Assertion `output_level == 0 || !FilesRangeOverlapWithCompaction( input_files, output_level, Compaction::EvaluatePenultimateLevel(vstorage, ioptions_, start_level, output_level))' failed.
```
This PR make `CompactFiles()` used `FilesRangeOverlapWithCompaction()` and return Aborted status upon range conflict instead of crashing (during debug build) or proceed incorrectly (during non-debug build). To do so cleanly, I included a refactoring to make `FilesRangeOverlapWithCompaction()` part of `SanitizeAndConvertCompactionInputFiles()`, replacing `RangeOverlapWithCompaction()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12628
Test Plan: New UT crashed before the fix and return correct status after the fix.
Reviewed By: cbi42
Differential Revision: D57123536
Pulled By: hx235
fbshipit-source-id: f963a2c9e7ba1a9927a67fcc87f0dce126d3a430
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12634
The patch implements support for the `MultiGetEntity` API in optimistic transactions and pessimistic transactions with the WriteCommitted policy. Similarly to the other wide-column transaction APIs, the implementation leverages the `WriteBatchWithIndex` layer.
Reviewed By: jaykorean
Differential Revision: D57177638
fbshipit-source-id: 2d9f9f287fc97e7c126830b48d21457c7c35db3f
Summary:
In `SaveValue()`, the read lock needs to be obtained before `VerifyEntryChecksum()` because the KV checksum verification reads the entire value metadata+data, which is all mutable when `ColumnFamilyOptions::inplace_update_support == true`.
In `MemTable::Update()`, the write lock needs to be obtained before mutating the value metadata (changing the value size) because it can be read concurrently.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12624
Test Plan:
```
$ make COMPILE_WITH_TSAN=1 -j56 db_stress
...
$ python3 tools/db_crashtest.py blackbox --simple --max_key=10 --inplace_update_support=1 --interval=10 --allow_concurrent_memtable_write=0
```
Reviewed By: cbi42
Differential Revision: D57034571
Pulled By: ajkr
fbshipit-source-id: 3dddf881ad87923143acdf6bfec12ce47bb13a48
Summary:
For manual compaction, FIFO compaction will always skip key range overlapping checking with SST files. If CompactRange() is called with CompactionRangeOptions::change_level=true, a CF with FIFO compaction will now return Status::NotSupported.
For file ingestion, we will always ingest into L0. Previously, it's possible to ingest files into non-L0 levels with FIFO compaction.
These changes also help to fix [this](a178d15baf/db/db_impl/db_impl_compaction_flush.cc (L1269)) assertion failure in crash tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12618
Test Plan: added unit tests to verify the new behavior.
Reviewed By: hx235
Differential Revision: D56962401
Pulled By: cbi42
fbshipit-source-id: 19812a1509650b4162b379ca5bee02f2e9d9569d
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12606
The patch extends optimistic transactions and WriteCommitted pessimistic transactions with support for the `PutEntity` API. Similarly to the other APIs, `PutEntity` is available via both the `Transaction` and `TransactionDB` interfaces, where using the latter executes the write in a single-operation transaction as usual. Support for read APIs and other write policies (WritePrepared, WriteUnprepared) will be added in separate PRs.
Reviewed By: jaykorean
Differential Revision: D56911242
fbshipit-source-id: 57cf8bb6c6b1b40ba4a8a652831c13a617644289
Summary:
Previously we skipped syncing the non-latest WALs during memtable flush when the DB had only one column family. Normally that is fine because those non-latest WALs would not be read by recovery. However, in case of `DBOptions::allow_2pc == true`, there could be unmatched prepare records in those WALs making them needed by recovery. As a result, the missing sync could have resulted in the recovered WAL state falling behind the recovered SST state. When we detect that case, we return a `Status::Corruption` saying "SST file is ahead of WALs".
This PR proposes syncing the WAL in case of `DBOptions::allow_2pc`. This introduces the sync in some scenarios where it isn't needed (e.g., non-recent WALs contain no prepares) but I suspect the simplicity is worth it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12622
Reviewed By: cbi42
Differential Revision: D56987303
Pulled By: ajkr
fbshipit-source-id: 7fe9395458018a18d77e907a3b5429065c0e2e48
Summary:
when importing files from multiple CFs into a new CF, we were reusing the epoch numbers assigned by the original CFs. This means L0 files in the new CF can have the same epoch number (assigned originally by different CFs). While CreateColumnFamilyWithImport() requires each original CF to have disjoint key range, after an intra-l0 compaction, we still can end up with L0 files with the same epoch number but overlapping key range. This PR attempt to fix this by reassigning epoch numbers when importing multiple CFs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12602
Test Plan:
a new repro unit test. Before this PR, it fails with
```
[ RUN ] ImportColumnFamilyTest.AssignEpochNumberToMultipleCF
db/import_column_family_test.cc:1048: Failure
db_->WaitForCompact(o)
Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 files of same epoch number but overlapping range https://github.com/facebook/rocksdb/issues/44 , smallest key: '6B6579303030303030' seq:511, type:1 , largest key: '6B6579303031303239' seq:510, type:1 , epoch number: 3 vs. file https://github.com/facebook/rocksdb/issues/36 , smallest key: '6B6579303030313030' seq:401, type:1 , largest key: '6B6579303030313939' seq:500, type:1 , epoch number: 3
```
Reviewed By: hx235
Differential Revision: D56851808
Pulled By: cbi42
fbshipit-source-id: 01b8c790c9f1f2a168047ead670e73633f705b84
Summary:
This PR fix the issue that deletion of obsolete files during DB::Open are not rate limited.
The root cause is slow deletion is disabled if trash/db size ratio exceeds the configured `max_trash_db_ratio` d610e14f93/include/rocksdb/sst_file_manager.h (L126) however, the current handling in DB::Open starts with tracking nothing but the obsolete files. This will make the ratio always look like it's 1.
In order for the deletion rate limiting logic to work properly, we should only start deleting files after `SstFileManager` has finished tracking the whole DB, so the main fix is to move these two places that attempts to delete file after the tracking are done: 1) the `DeleteScheduler::CleanupDirectory` call in `SanitizeOptions`, 2) the `DB::DeleteObsoleteFiles` call.
There are some other aesthetic changes like refactoring collecting all the DB paths into a function, rename `DBImp::DeleteUnreferencedSstFiles` to `DBImpl:: MaybeUpdateNextFileNumber` as it doesn't actually delete the files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12590
Test Plan: Added unit test and verified with manual testing
Reviewed By: anand1976
Differential Revision: D56830519
Pulled By: jowlyzhang
fbshipit-source-id: 8a38a21b1ea11c5371924f2b88663648f7a17885
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/12542 introduced a bug where wrong padded bytes used to generate file checksum if flush happens during padding. This PR fixed it along with an existing same bug for `perform_data_verification_=true`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12598
Test Plan:
- New UT that failed before this fix (`db->VerifyFileChecksums: ...Corruption: ...file checksum mismatch`) and passes after
- Benchmark
```
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillseq[-X300] --num=100000 --block_align=1 --compression_type=none
```
Pre-PR:
fillseq [AVG 300 runs] : 421334 (± 4126) ops/sec; 46.6 (± 0.5) MB/sec
Post-PR: (no regression observed but a slight improvement)
fillseq [AVG 300 runs] : 425768 (± 4309) ops/sec; 47.1 (± 0.5) MB/sec
Reviewed By: ajkr, anand1976
Differential Revision: D56725688
Pulled By: hx235
fbshipit-source-id: c1a700a95def8c65c0a21e44f8c1966164925ad5
Summary:
This feature has been around for a couple of years and users haven't reported any problems with it.
Not quite related: fixed a technical ODR violation in public header for info_log_level in case DEBUG build status changes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12377
Test Plan: unit tests updated, already in crash test. Some unit tests are expecting specific behaviors of optimize_filters_for_memory=false and we now need to bake that in.
Reviewed By: jowlyzhang
Differential Revision: D54129517
Pulled By: pdillinger
fbshipit-source-id: a64b614840eadd18b892624187b3e122bab6719c
Summary:
See comment at top of the test case and release note.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12597
Reviewed By: jaykorean
Differential Revision: D56718786
Pulled By: ajkr
fbshipit-source-id: 8dce185bb0d24a358372fc2b553d181793fc335f