Summary:
Add `IngestExternalFileOptions::fill_cache` to allow users to ingest files without loading index, filter, data, and other blocks into the block cache during file ingestion. This can be useful when users are ingesting files into a CF that is not yet available to readers.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13067
Test Plan:
* unit test: `ExternalSSTFileTest.NoBlockCache`
* ran one round of crash test with fill_cache disabled: `python3 ./tools/db_crashtest.py --simple blackbox --ops_per_thread=1000000 --interval=30 --ingest_external_file_one_in=200 --level0_stop_writes_trigger=200 --level0_slowdown_writes_trigger=100 --sync_fault_injection=0 --disable_wal=0 --manual_wal_flush_one_in=0`
Reviewed By: jowlyzhang
Differential Revision: D64356424
Pulled By: cbi42
fbshipit-source-id: b380c26f5987238e1ed7d42ceef0390cfaa0b8e2
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13069
Currently, when using range scans with BlobDB, the iterator logic eagerly loads values from blob files when landing on a new entry. This can be wasteful in use cases where the values associated with some keys in the range are not used by the application. The patch introduces a new read option `allow_unprepared_value`; when specified, this option results in the above eager loading getting bypassed. Values needed by the application can be then loaded on an on-demand basis by calling the new iterator API `PrepareValue`. Note that currently, only regular single-CF iterators are supported; multi-CF iterators and transactions will be extended in later PRs.
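The difference between eager and on-demand loading can be illustrated with a small self-contained model. This is only a sketch and not RocksDB code: `ToyBlobIterator`, `blob_loads()`, and `ScanAndCountLoads` are hypothetical names, and a "blob read" is simulated by copying a string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for an iterator over entries whose values live in blob files.
// This is NOT RocksDB code; all names here are hypothetical.
class ToyBlobIterator {
 public:
  ToyBlobIterator(std::vector<std::pair<std::string, std::string>> entries,
                  bool allow_unprepared_value)
      : entries_(std::move(entries)),
        allow_unprepared_value_(allow_unprepared_value) {
    OnNewEntry();
  }

  bool Valid() const { return pos_ < entries_.size(); }
  const std::string& key() const { return entries_[pos_].first; }
  // Empty until the value has been loaded.
  const std::string& value() const { return value_; }

  void Next() {
    ++pos_;
    OnNewEntry();
  }

  // Loads the value of the current entry on demand; a no-op if it has
  // already been loaded.
  bool PrepareValue() {
    if (!Valid()) return false;
    if (!prepared_) LoadFromBlob();
    return true;
  }

  int blob_loads() const { return blob_loads_; }

 private:
  void OnNewEntry() {
    prepared_ = false;
    value_.clear();
    // Eager mode: pay for the blob read on every entry the iterator lands on.
    if (Valid() && !allow_unprepared_value_) LoadFromBlob();
  }

  void LoadFromBlob() {
    value_ = entries_[pos_].second;  // stands in for a blob file read
    prepared_ = true;
    ++blob_loads_;
  }

  std::vector<std::pair<std::string, std::string>> entries_;
  bool allow_unprepared_value_;
  std::size_t pos_ = 0;
  bool prepared_ = false;
  std::string value_;
  int blob_loads_ = 0;
};

// Scans all entries but only needs the value stored under `wanted_key`.
// Returns the number of simulated blob reads performed.
int ScanAndCountLoads(bool allow_unprepared_value,
                      const std::string& wanted_key) {
  ToyBlobIterator it({{"a", "v1"}, {"b", "v2"}, {"c", "v3"}},
                     allow_unprepared_value);
  for (; it.Valid(); it.Next()) {
    if (it.key() == wanted_key) {
      it.PrepareValue();
      assert(!it.value().empty());
    }
  }
  return it.blob_loads();
}
```

In this model, a scan that only needs one of the three values performs three simulated reads in eager mode and one in on-demand mode.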
Reviewed By: jowlyzhang
Differential Revision: D64360723
fbshipit-source-id: ee55502fa15dcb307a984922b9afc9d9da15d6e1
Summary:
In https://github.com/facebook/rocksdb/issues/13025 , we made a change to load the latest options file in the remote worker instead of serializing the entire set of options.
That was done under the assumption that OPTIONS files do not get purged often. While testing, we learned that this happens more often than we would like, so we want to prevent the OPTIONS file from getting purged at any time between when the remote compaction is scheduled and when the options are loaded in the remote worker.
Similar to how we protect new SST files from getting purged using `min_pending_output`, we now keep track of `min_options_file_number`. Any OPTIONS file with a number greater than `min_options_file_number` will be protected from getting purged. Just like `min_pending_output`, `min_options_file_number` gets bumped when the compaction is done. This is only applicable when `options.compaction_service` is set.
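The protection rule can be sketched as a simple filter. This is a toy model rather than the actual purge code: the helper names are hypothetical, and the boundary ("greater than" is protected) is taken literally from the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical predicate: an OPTIONS file is protected from purging if its
// number is greater than the tracked minimum.
bool IsOptionsFileProtected(uint64_t file_number,
                            uint64_t min_options_file_number) {
  return file_number > min_options_file_number;
}

// Returns the OPTIONS file numbers that a purge pass would be allowed to
// delete in this toy model.
std::vector<uint64_t> PurgeableOptionsFiles(
    const std::vector<uint64_t>& options_file_numbers,
    uint64_t min_options_file_number) {
  std::vector<uint64_t> purgeable;
  for (uint64_t number : options_file_numbers) {
    if (!IsOptionsFileProtected(number, min_options_file_number)) {
      purgeable.push_back(number);
    }
  }
  return purgeable;
}
```

With OPTIONS files 5, 9, and 12 and a tracked minimum of 8, only file 5 is purgeable; once the minimum is bumped to 10 after the compaction finishes, file 9 becomes purgeable as well.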
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13074
Test Plan:
```
./compaction_service_test --gtest_filter="*PreservedOptionsLocalCompaction*"
./compaction_service_test --gtest_filter="*PreservedOptionsRemoteCompaction*"
```
Reviewed By: anand1976
Differential Revision: D64433795
Pulled By: jaykorean
fbshipit-source-id: 0d902773f0909d9481dec40abf0b4c54ce5e86b2
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13075
The patch simplifies the iteration logic in `MultiCFIteratorImpl::{Advance,Populate}Iterator` a bit and adds some assertions to uniformly enforce the invariant that any iterators currently on the heap should be valid and have an OK status.
Reviewed By: jaykorean
Differential Revision: D64429566
fbshipit-source-id: 36bc22465285b670f859692a048e10f21df7da7a
Summary:
This PR assigns levels to files in separate batches if they overlap. This approach can potentially assign external files to lower levels.
In the prepare stage, if the input files' key ranges overlap with each other, we divide the files, in the user-specified order, into multiple batches. Files in the same batch do not overlap with each other, but key ranges can overlap between batches. If the input files' key ranges don't overlap, they always form a single default batch.
During the level assignment stage, we assign levels to files one batch after another. Since files within one batch are guaranteed not to overlap, we assign a level to each file one after another. If the previous batch's uppermost level is specified, all files in this batch will be assigned to levels that are higher than that level. The uppermost level used by this batch of files is also tracked, so that it can be used by the next batch.
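The batching step of the prepare stage could look roughly like the sketch below. It is one plausible interpretation, not the PR's code: `KeyRange`, `Overlaps`, and `DivideIntoBatches` are hypothetical names, key ranges are inclusive, and keys are compared bytewise. Level assignment is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Inclusive key range of one input file (hypothetical type).
struct KeyRange {
  std::string smallest;
  std::string largest;
};

bool Overlaps(const KeyRange& a, const KeyRange& b) {
  return a.smallest <= b.largest && b.smallest <= a.largest;
}

// Walks the files in the user-specified order and starts a new batch whenever
// the next file overlaps a file already in the current batch. Returns batches
// of indexes into `files`.
std::vector<std::vector<std::size_t>> DivideIntoBatches(
    const std::vector<KeyRange>& files) {
  std::vector<std::vector<std::size_t>> batches;
  for (std::size_t i = 0; i < files.size(); ++i) {
    bool overlaps_current_batch = false;
    if (!batches.empty()) {
      for (std::size_t j : batches.back()) {
        if (Overlaps(files[i], files[j])) {
          overlaps_current_batch = true;
          break;
        }
      }
    }
    if (batches.empty() || overlaps_current_batch) {
      batches.emplace_back();
    }
    batches.back().push_back(i);
  }
  return batches;
}
```

For the inputs `[a,c]`, `[d,f]`, `[b,e]`, `[g,h]` this yields two batches, `{0, 1}` and `{2, 3}`; files within a batch never overlap, while file 2 overlaps both files of the first batch.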
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13064
Test Plan:
Updated test and added new test
Manually stress tested
Reviewed By: cbi42
Differential Revision: D64428373
Pulled By: jowlyzhang
fbshipit-source-id: 5aeff125c14094c87cc50088505010dfd2da3d6e
Summary:
Add a timeout for the blackbox crash test final verification step, and print the db_stress stack trace on a timeout. The crash test occasionally hangs in the verification step and this will help debug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13070
Reviewed By: hx235
Differential Revision: D64414461
Pulled By: anand1976
fbshipit-source-id: 4629aac01fbe6c788665beddc66280ba446aadbe
Summary:
Checkpoint creation skips flushing the memtable, even if explicitly requested, when the WAL is locked. This can happen if the user calls `LockWAL()`. In this case, db_stress checkpoint verification fails as the checkpoint will not contain keys present in the primary DB's memtable. Sanitize `checkpoint_one_in` and `lock_wal_one_in` so they're mutually exclusive.
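A minimal sketch of such a sanitization step is shown below. The struct and function are hypothetical, and which of the two parameters wins is an assumption of this sketch (here checkpoints take priority and WAL locking is disabled).

```cpp
#include <cassert>

// Hypothetical subset of stress test parameters. A value of 0 disables the
// corresponding operation; N > 0 means "roughly once every N operations".
struct ToyStressParams {
  int checkpoint_one_in = 0;
  int lock_wal_one_in = 0;
};

// Assumption of this sketch: checkpoints take priority, so WAL locking is
// turned off whenever checkpoints are enabled.
ToyStressParams SanitizeParams(ToyStressParams params) {
  if (params.checkpoint_one_in != 0) {
    params.lock_wal_one_in = 0;
  }
  return params;
}
```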
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13068
Reviewed By: hx235
Differential Revision: D64353998
Pulled By: anand1976
fbshipit-source-id: 7c93563347f033b6008a47a7d71471e59747e143
Summary:
- When `FileChecksumGenFactory` is set, include the `file_checksum` and `file_checksum_func_name` in the output file metadata
- ~~In Remote Compaction, try opening the output files in the temporary directory to do a quick sanity check before returning the result with status.~~
- After offline discussion, we decided to rely on the primary's existing compaction flow to sanity check the output files. If an output file is corrupted, we will still catch it and will not install it, even after the files have been renamed into cf_paths. The corrupted file in the cf_path won't be added to the MANIFEST and will be purged as part of the next `PurgeObsoleteFiles()` call.
- Unit Test has been added to validate above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13060
Test Plan:
Unit test added
```
./compaction_service_test --gtest_filter="*CorruptedOutput*"
./compaction_service_test --gtest_filter="*TruncatedOutput*"
./compaction_service_test --gtest_filter="*CustomFileChecksum*"
./compaction_job_test --gtest_filter="*ResultSerialization*"
```
Reviewed By: cbi42
Differential Revision: D64189645
Pulled By: jaykorean
fbshipit-source-id: 6cf28720169c960c80df257806bfee3c0d177159
Summary:
In theory, there should be no danger in mutability, as table
builders and readers work from copies of BlockBasedTableOptions.
However, there is currently an unresolved read-write race
affecting SetOptions on BBTO fields. This should be generally
acceptable for non-pointer options of 64 bits or less, but a fix
is needed to make mutability general here. See
https://github.com/facebook/rocksdb/issues/10079
This change systematically sets all of those "simple" options (and future
such options) as mutable. (Resurrecting this PR perhaps preferable to
proposed https://github.com/facebook/rocksdb/issues/13063)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10021
Test Plan: Some unit test updates. XXX comment added to stress test code
Reviewed By: cbi42
Differential Revision: D64360967
Pulled By: pdillinger
fbshipit-source-id: ff220fa778331852fe331b42b76ac4adfcd2d760
Summary:
When user-defined timestamps are not persisted, currently we replace the actual timestamp with min timestamp after an entry is output from compaction iterator. Compaction iterator won't be able to help with removing stale entries this way. This PR adds a wrapper iterator `TimestampStrippingIterator` for `MemTableIterator` that does the min timestamp replacement at the memtable iteration step. It is used by flush and can help remove stale entries from landing in L0 files.
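Why the order of the two steps matters can be shown with a toy model. The code below is not the actual iterator: `ToyEntry`, `KeepNewestPerKey`, and the two `Flush*` helpers are hypothetical, the "compaction iterator" is reduced to keeping the first entry per distinct (user key, timestamp) pair, and no snapshots are assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct ToyEntry {
  std::string user_key;
  uint64_t timestamp;
  uint64_t seqno;
  std::string value;
};

constexpr uint64_t kMinTimestamp = 0;

// Replaces every timestamp with the minimum timestamp.
std::vector<ToyEntry> StripTimestamps(std::vector<ToyEntry> entries) {
  for (ToyEntry& e : entries) e.timestamp = kMinTimestamp;
  return entries;
}

// Toy compaction iterator. Input must be sorted by user key with the newest
// entry first. Keeps only the first entry for each distinct
// (user_key, timestamp) pair. Assumes no snapshots are held.
std::vector<ToyEntry> KeepNewestPerKey(const std::vector<ToyEntry>& entries) {
  std::vector<ToyEntry> out;
  for (const ToyEntry& e : entries) {
    if (out.empty() || out.back().user_key != e.user_key ||
        out.back().timestamp != e.timestamp) {
      out.push_back(e);
    }
  }
  return out;
}

// Old order: strip after the compaction iterator has produced its output.
std::vector<ToyEntry> FlushStripAfter(const std::vector<ToyEntry>& memtable) {
  return StripTimestamps(KeepNewestPerKey(memtable));
}

// New order: strip at the memtable iteration step, before deduplication.
std::vector<ToyEntry> FlushStripBefore(const std::vector<ToyEntry>& memtable) {
  return KeepNewestPerKey(StripTimestamps(memtable));
}
```

With two versions of `k1` in the memtable, stripping afterwards still writes both versions (now with identical keys apart from the sequence number), while stripping first lets the deduplication step drop the stale one.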
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13035
Test Plan: Added unit test
Reviewed By: pdillinger, cbi42
Differential Revision: D63423682
Pulled By: jowlyzhang
fbshipit-source-id: 087dcc9cee97b9ea51b8d2b88dc91c2984d54e55
Summary:
When the input files are not overlapping, i.e. `files_overlap_=false`, it's best to assign them to non-L0 levels so that they are not one sorted run each. This can be done regardless of whether the compaction style is leveled or universal, without any side effects.
My guess is that this special handling exists because universal compaction used to have an invariant that sequence numbers on higher levels should not be smaller than sequence numbers on lower levels. File ingestion used to try to keep that promise by doing "sequence number stealing" from the to-be-assigned level. However, that invariant no longer holds since deletion-triggered compaction was added for universal compaction, and we also removed the sequence stealing logic from file ingestion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13059
Test Plan: Updated existing tests
Reviewed By: cbi42
Differential Revision: D64220100
Pulled By: jowlyzhang
fbshipit-source-id: 70a83afba7f4c52d502c393844e6b3273d5cf628
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13061
As groundwork for further changes, the patch refactors the BlobDB-related parts of `DBIter` by 1) introducing a new internal helper class `DBIter::BlobReader` that encapsulates all members needed to retrieve a blob value (namely, `Version` and the `ReadOptions` fields) and 2) factoring out and cleaning up some duplicate logic related to resolving blob references in the non-Merge (see `SetValueAndColumnsFromBlob`) and Merge (see `MergeWithBlobBaseValue`) cases.
Reviewed By: jowlyzhang
Differential Revision: D64078099
fbshipit-source-id: 22d5bd93e6e5be5cc9ecf6c4ee6954f2eb016aff
Summary:
**Context/Summary:**
A part of this test is to verify compression conditionally happens depending on the shape of the LSM when `options.level_compaction_dynamic_level_bytes = true;`. It uses the total file size to determine whether compression has happened or not. This involves some hard-coded math that is hard to understand. This PR replaces that math with statistics that directly show whether compression has happened or not.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13044
Test Plan: Existing test
Reviewed By: jaykorean
Differential Revision: D63666361
Pulled By: hx235
fbshipit-source-id: 8c9b1bea9b06ff1e3ed95c576aec6705159af137
Summary:
The write unix time from non-L0 files is not surfaced properly because the level's wrapper iterator doesn't have a `write_unix_time` implementation that delegates to the corresponding file. The unit test didn't catch this because it incorrectly destroys the old DB and reopens it to check the write time, instead of just reopening and checking. This fix also includes a change to support ldb's scan command to get write time for easier debugging.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13057
Test Plan: Updated unit tests
Reviewed By: pdillinger
Differential Revision: D64015107
Pulled By: jowlyzhang
fbshipit-source-id: 244474f78a034f80c9235eea2aa8a0f4e54dff59
Summary:
263fa15b44/CMakeLists.txt (L44)
`HOMEPAGE_URL` was introduced in CMake 3.12. Compiling RocksDB with a CMake version older than 3.12 triggers the error `CMake Error: Could not find cmake module file: CMakeDetermineHOMEPAGE_URLCompiler.cmake`.
2 options to fix it:
* Remove `HOMEPAGE_URL`, since it appears to have no practical effect.
* Update RocksDB's minimum required CMake version to 3.12.
This PR chose the second option.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13056
Reviewed By: jaykorean
Differential Revision: D63993577
Pulled By: cbi42
fbshipit-source-id: a6278af6916fcdace19a6c9baaf7986037bff720
Summary:
The stress test detected that this variable could potentially overflow, so this PR adds some runtime handling to avoid it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13046
Test Plan: Existing tests
Reviewed By: hx235
Differential Revision: D63911396
Pulled By: jowlyzhang
fbshipit-source-id: 7c9abcd74ac9937b211c0ea4bb683677390837c5
Summary:
A small CF can trigger parallel compaction that applies to the entire DB. This is because the bottommost file size of a small CF can be too small compared to L0 files when an L0->Lbase compaction happens. We prevent this by requiring some minimum on the compaction debt.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13054
Test Plan: updated unit test.
Reviewed By: hx235
Differential Revision: D63861042
Pulled By: cbi42
fbshipit-source-id: 43bbf327988ef0ef912cd2fc700e3d096a8d2c18
Summary:
This PR adds some optimizations to the per-key handling of SST files for the user-defined timestamps in Memtable only feature. CPU profiling shows this part is a big culprit for the regression. The optimization saves some string construction/destruction/appending/copying, as well as vector operations like reserve/emplace_back.
When iterating keys in a block, we need to copy some shared bytes from the previous key, put them together with the non-shared bytes, and find the right location to pad the min timestamp. Previously, we created a temporary local string buffer to first construct the key from its pieces, and then copied this local string's content into `IterKey`'s buffer. To avoid this local string and the extra copy, we now track all the pieces that make up the key in a reused Slice array and then copy the pieces in order into `IterKey`'s buffer. Since the previous key must be kept intact while we are copying shared bytes from it, we added a secondary buffer in `IterKey` and alternate between the primary and secondary buffers.
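The two-buffer idea can be sketched as follows. This is a simplified model and not `IterKey` itself: `TwoBufferKey` is a hypothetical class, `std::string_view` stands in for `Slice`, and the min timestamp is padded at the very end of the key (the real code has to find the right location inside the internal key). It requires C++17.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Toy key buffer that alternates between two backing strings, so the previous
// key stays intact while the next key is assembled from its pieces.
class TwoBufferKey {
 public:
  explicit TwoBufferKey(std::size_t ts_sz) : min_ts_(ts_sz, '\0') {}

  std::string_view current() const { return bufs_[cur_]; }

  // Builds the next key from the first `shared` bytes of the previous key,
  // the `non_shared` bytes, and a padded min timestamp.
  void SetFromPieces(std::size_t shared, std::string_view non_shared) {
    const std::string& prev = bufs_[cur_];
    std::string& next = bufs_[1 - cur_];
    // Shared bytes may not reach into the previous key's padded timestamp.
    assert(shared == 0 || shared + min_ts_.size() <= prev.size());

    // Track the pieces instead of concatenating them into a temporary string.
    const std::array<std::string_view, 3> pieces = {
        std::string_view(prev).substr(0, shared), non_shared,
        std::string_view(min_ts_)};

    next.clear();  // keeps its capacity, so steady-state calls rarely allocate
    for (std::string_view piece : pieces) next.append(piece);
    cur_ = 1 - cur_;  // the freshly written buffer becomes the current key
  }

 private:
  std::array<std::string, 2> bufs_;
  std::size_t cur_ = 0;
  std::string min_ts_;
};
```

For example, after setting the key to `apple` (plus padding), a call with `shared = 3` and non-shared bytes `ly` reads `app` from the still-intact previous buffer and produces `apply` (plus padding) in the other buffer.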
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13031
Test Plan: Existing tests.
Reviewed By: ltamasi
Differential Revision: D63416531
Pulled By: jowlyzhang
fbshipit-source-id: 9819b0e02301a2dbc90621b2fe4f651bc912113c
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13052
Currently, `MultiCfIteratorImpl` uses `std::function`s for `reset_func_` and `populate_func_`, which rely on type erasure and have a performance overhead. The patch turns `MultiCfIteratorImpl` into a template that takes the two function object types as template parameters, and changes `AttributeGroupIteratorImpl` and `CoalescingIterator` so they pass in function objects of named types (as opposed to lambdas).
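The general technique can be illustrated with a small standalone example. None of the names below come from the patch; `ErasedVisitor`, `TypedVisitor`, `ResetSum`, and `AddToSum` are hypothetical and only show the contrast between a type-erased callable member and callable types passed as template parameters.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Type-erased variant: every call goes through std::function's indirection.
class ErasedVisitor {
 public:
  ErasedVisitor(std::function<void()> reset, std::function<void(int)> populate)
      : reset_(std::move(reset)), populate_(std::move(populate)) {}

  void Visit(const std::vector<int>& items) {
    reset_();
    for (int v : items) populate_(v);
  }

 private:
  std::function<void()> reset_;
  std::function<void(int)> populate_;
};

// Template variant: the callable types are part of the class type, so the
// calls are resolved statically and can be inlined by the compiler.
template <typename ResetFunc, typename PopulateFunc>
class TypedVisitor {
 public:
  TypedVisitor(ResetFunc reset, PopulateFunc populate)
      : reset_(std::move(reset)), populate_(std::move(populate)) {}

  void Visit(const std::vector<int>& items) {
    reset_();
    for (int v : items) populate_(v);
  }

 private:
  ResetFunc reset_;
  PopulateFunc populate_;
};

struct SumState {
  int total = 0;
};

// Function objects of named types (as opposed to lambdas), so the template
// instantiation can be spelled out, e.g. in a member declaration.
class ResetSum {
 public:
  explicit ResetSum(SumState* state) : state_(state) {}
  void operator()() const { state_->total = 0; }

 private:
  SumState* state_;
};

class AddToSum {
 public:
  explicit AddToSum(SumState* state) : state_(state) {}
  void operator()(int v) const { state_->total += v; }

 private:
  SumState* state_;
};

int SumWithTypedVisitor(const std::vector<int>& items) {
  SumState state;
  TypedVisitor<ResetSum, AddToSum> visitor{ResetSum(&state), AddToSum(&state)};
  visitor.Visit(items);
  return state.total;
}

int SumWithErasedVisitor(const std::vector<int>& items) {
  SumState state;
  ErasedVisitor visitor{ResetSum(&state), AddToSum(&state)};
  visitor.Visit(items);
  return state.total;
}
```

Both variants compute the same result; the difference is only in how the calls are dispatched.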
Reviewed By: jaykorean
Differential Revision: D63802598
fbshipit-source-id: e202f6d80c9054335e5b2571051a67a9e012c2d0
Summary:
There was a crash test Bus Error crash in `IndexBlockIter::SeekToFirstImpl()` <- .. <-
`BlockBasedTable::~BlockBasedTable()` with `--mmap_read=1`, which suggests some kind of incompatibility that I haven't diagnosed. Bus Error is uncommon these days as CPUs support unaligned reads, but it is associated with mmap problems.
Because mmap reads really only make sense without block cache, it's not a concerning loss to essentially disable the combination.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13051
Test Plan: watch crash test
Reviewed By: jowlyzhang
Differential Revision: D63795069
Pulled By: pdillinger
fbshipit-source-id: 6c823c619840086b5c9cff53dbc7470662b096be
Summary:
This PR makes file ingestion job's flush wait a bit further until the SuperVersion is also updated. This is necessary since follow up operations will use the current SuperVersion to do range overlapping check and level assignment.
In debug mode, file ingestion job's second `NeedsFlush` call could have been invoked when the memtables are flushed but the SuperVersion hasn't been updated yet, triggering the assertion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13045
Test Plan:
Existing tests
Manually stress tested
Reviewed By: cbi42
Differential Revision: D63671151
Pulled By: jowlyzhang
fbshipit-source-id: 95a169e58a7e59f6dd4125e7296e9060fe4c63a7
Summary:
... to note that memory may not be freed when reusing a transaction. This means reusing a large transaction can cause excessive memory usage and it may be better to destruct the transaction object in some cases.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13042
Test Plan: no code change.
Reviewed By: jowlyzhang
Differential Revision: D63570612
Pulled By: cbi42
fbshipit-source-id: f19ff556f76d54831fb94715e8808035d07e25fa
Summary:
When `avoid_flush_during_shutdown` is false, DB will flush the memtables if there is some unpersisted data:
79790cf2a8/db/db_impl/db_impl.cc (L505-L510)
`has_unpersisted_data_` is a flag that is only turned on for when WAL is disabled, for example:
79790cf2a8/db/db_impl/db_impl_write.cc (L525-L528)
In other cases, it just has its default false value.
So if disableWAL is false and avoid_flush_during_shutdown is false, close won't flush memtables. The stress test also does not flush or sync the WAL. There could be missing data, while reopen in the stress test doesn't tolerate missing data. To make the test simpler, this changes it to always flush/sync the WAL during reopen.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13039
Reviewed By: hx235
Differential Revision: D63494695
Pulled By: jowlyzhang
fbshipit-source-id: 8f0fd9ed50a482a3955abc0882257ecc2e95926d
Summary:
The following DBOptions were not being propagated through BuildDBOptions, which could at least lead to settings being lost through `GetOptionsFromString()`, possibly elsewhere as well:
* background_close_inactive_wals
* write_dbid_to_manifest
* write_identity_file
* prefix_seek_opt_in_only
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13038
Test Plan:
This problem was not being caught by
OptionsSettableTest.DBOptionsAllFieldsSettable when the option was omitted from both options_helper.cc and options_settable_test.cc. I have added to the test to catch future instances (and the updated test was how I found three of the four missing options).
The same kind of bug seems to be caught by
ColumnFamilyOptionsAllFieldsSettable, and AFAIK analogous code does not exist for BlockBasedTableOptions.
Reviewed By: ltamasi
Differential Revision: D63483779
Pulled By: pdillinger
fbshipit-source-id: a5d5f6e434174bacb8e5d251b767e81e62b7225a
Summary:
When an item is inserted into the compressed secondary cache, this PR calculates the charge using the malloc_usable_size of the allocated memory, as well as the unique pointer allocation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13032
Test Plan: New unit test
Reviewed By: pdillinger
Differential Revision: D63418493
Pulled By: anand1976
fbshipit-source-id: 1db2835af6867442bb8cf6d9bf412e120ddd3824
Summary:
If the lowest_used_cache_tier DB option is set to kVolatileTier, skip insertion of compressed blocks into the secondary cache. Previously, these were always inserted into the secondary cache via the InsertSaved() method, leading to pollution of the secondary cache with blocks that would never be read.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13030
Test Plan: Add a new unit test
Reviewed By: pdillinger
Differential Revision: D63329841
Pulled By: anand1976
fbshipit-source-id: 14d2fce2ed309401d9ad4d2e7c356218b6673f7b
Summary:
Add the following to the `CompactionServiceJobInfo`
- compaction_reason
- is_full_compaction
- is_manual_compaction
- bottommost_level
Added `is_remote_compaction` to the `CompactionJobStats` and set initial values to avoid UB for uninitialized values.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13029
Test Plan:
```
./compaction_service_test --gtest_filter="*CompactionInfo*"
```
Reviewed By: anand1976
Differential Revision: D63322878
Pulled By: jaykorean
fbshipit-source-id: f02a66ca45e660b9d354a43837d8ec6beb7621fb
Summary:
With some new use cases onboarding to prefix extractors/seek/filters, one of the risks is existing iterator code, e.g. for maintenance tasks, being unintentionally subject to prefix seek semantics. This is a longstanding known design flaw with prefix seek, and `prefix_same_as_start` and `auto_prefix_mode` were steps in the direction of making that obsolete. However, we can't just immediately set `total_order_seek` to true by default, because that would impact so much code instantly.
Here we add a new DB option, `prefix_seek_opt_in_only` that basically allows users to transition to the future behavior when they are ready. When set to true, all iterators will be treated as if `total_order_seek=true` and then the only ways to get prefix seek semantics are with `prefix_same_as_start` or `auto_prefix_mode`.
Related fixes / changes:
* Make sure that `prefix_same_as_start` and `auto_prefix_mode` are compatible with (or override) `total_order_seek` (depending on your interpretation).
* Fix a bug in which a new iterator after dynamically changing the prefix extractor might mix different prefix semantics between memtable and SSTs. Both should use the latest extractor semantics, which means iterators ignoring memtable prefix filters with an old extractor. And that means passing the latest prefix extractor to new memtable iterators that might use prefix seek. (Without the fix, the test added for this fails in many ways.)
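The resulting decision can be summarized in a small predicate. This is a sketch of the rules as described above, not the code in the patch: `ToyReadOptions` and `MayUsePrefixSeek` are hypothetical, a prefix extractor is assumed to be configured, and the "override" reading of the first related fix is assumed.

```cpp
#include <cassert>

// Hypothetical mirror of the three relevant read options.
struct ToyReadOptions {
  bool total_order_seek = false;
  bool prefix_same_as_start = false;
  bool auto_prefix_mode = false;
};

// Returns true if an iterator created with these settings may use prefix seek
// semantics. Assumes a prefix extractor is configured.
bool MayUsePrefixSeek(bool prefix_seek_opt_in_only, const ToyReadOptions& ro) {
  // Explicit opt-ins win, even when total_order_seek is also set.
  if (ro.prefix_same_as_start || ro.auto_prefix_mode) {
    return true;
  }
  // With the new DB option, every iterator is treated as total_order_seek.
  const bool effective_total_order =
      ro.total_order_seek || prefix_seek_opt_in_only;
  return !effective_total_order;
}
```

Under this reading, a default-constructed `ToyReadOptions` gets prefix seek semantics only while `prefix_seek_opt_in_only` is false.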
Suggested follow-up:
* Investigate a FIXME where a MergeIteratorBuilder is created in db_impl.cc. No unit test detects a change in value that should impact correctness.
* Make memtable prefix bloom compatible with `auto_prefix_mode`, which might require involving the memtablereps because we don't know at iterator creation time (only seek time) whether an auto_prefix_mode seek will be a prefix seek.
* Add `prefix_same_as_start` testing to db_stress
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13026
Test Plan:
tests updated, added. Add combination of `total_order_seek=true` and `auto_prefix_mode=true` to stress test. Ran `make blackbox_crash_test` for a long while.
Manually ran tests with `prefix_seek_opt_in_only=true` as default, looking for unexpected issues. I inspected most of the results and migrated many tests to be ready for such a change (but not all).
Reviewed By: ltamasi
Differential Revision: D63147378
Pulled By: pdillinger
fbshipit-source-id: 1f4477b730683d43b4be7e933338583702d3c25e
Summary:
We've been serializing and deserializing DBOptions and CFOptions (and other CF info) as part of `CompactionServiceInput`. These are all readily available in the OPTIONS file and the remote worker can read the OPTIONS file to obtain the same information. This helps reduce the size of the payload significantly.
In the very rare scenario where the OPTIONS file is purged due to an options change by the primary host while the remote host is loading the latest options, loading may fail. In this case, we just retry once.
This also solves the problem where we had to open the default CF with the CFOption from another CF if the remote compaction is for a non-default column family. (TODO comment in /db_impl_secondary.cc)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13025
Test Plan:
Unit Tests
```
./compaction_service_test
```
```
./compaction_job_test
```
Also tested with Meta's internal Offload Infra
Reviewed By: anand1976, cbi42
Differential Revision: D63100109
Pulled By: jaykorean
fbshipit-source-id: b7162695e31e2c5a920daa7f432842163a5b156d
Summary:
This PR allows a Cache object to be created using the object registry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13024
Reviewed By: pdillinger
Differential Revision: D63043233
Pulled By: anand1976
fbshipit-source-id: 5bc3f7c29b35ad62638ff8205451303e2cecea9d
Summary:
Per customer request, we should not merge multiple SST files together during temperature change compaction, since this can cause FIFO TTL compactions to be delayed. This PR changes the compaction picking logic to pick one file at a time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13018
Test Plan: * updated some existing unit tests to test this new behavior.
Reviewed By: jowlyzhang
Differential Revision: D62883292
Pulled By: cbi42
fbshipit-source-id: 6a9fc8c296b5d9b17168ef6645f25153241c8b93
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13022
Currently, `blob_garbage_collection_force_threshold` applies to the oldest batch of blob files, which is typically only a small subset of the blob files currently eligible for garbage collection. This can result in a form of head-of-line blocking: no GC-triggered compactions will be scheduled if the oldest batch does not currently exceed the threshold, even if a lot of higher-numbered blob files do. This can in turn lead to high space amplification that exceeds the soft bound implicit in the force threshold (e.g. 50% would suggest a space amp of <2 and 75% would imply a space amp of <4). The patch changes the semantics of this configuration threshold to apply to the entire set of blob files that are eligible for garbage collection based on `blob_garbage_collection_age_cutoff`. This provides more intuitive semantics for the option and can provide a better write amp/space amp trade-off. (Note that GC-triggered compactions still pick the same SST files as before, so triggered GC still targets the oldest blob files.)
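The change in semantics can be illustrated with a toy calculation. The types and functions below are hypothetical, the eligible files are assumed to be given oldest first, and the use of `>=` at the threshold boundary is an assumption of this sketch. The implied bound `1 / (1 - threshold)` matches the two examples in the text (50% gives 2, 75% gives 4).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ToyBlobFile {
  uint64_t total_bytes;
  uint64_t garbage_bytes;
};

// True if the garbage ratio over the first `count` files reaches the
// threshold.
bool GarbageReachesThreshold(const std::vector<ToyBlobFile>& files,
                             std::size_t count, double force_threshold) {
  uint64_t total = 0;
  uint64_t garbage = 0;
  for (std::size_t i = 0; i < count && i < files.size(); ++i) {
    total += files[i].total_bytes;
    garbage += files[i].garbage_bytes;
  }
  return total > 0 && static_cast<double>(garbage) >=
                          force_threshold * static_cast<double>(total);
}

// Old semantics: only the oldest batch of blob files is considered.
bool OldSemanticsTriggersGc(const std::vector<ToyBlobFile>& eligible,
                            std::size_t oldest_batch_size,
                            double force_threshold) {
  return GarbageReachesThreshold(eligible, oldest_batch_size, force_threshold);
}

// New semantics: all files eligible under the age cutoff are considered.
bool NewSemanticsTriggersGc(const std::vector<ToyBlobFile>& eligible,
                            double force_threshold) {
  return GarbageReachesThreshold(eligible, eligible.size(), force_threshold);
}

// Soft space amplification bound implied by a force threshold.
double ImpliedSpaceAmpBound(double force_threshold) {
  return 1.0 / (1.0 - force_threshold);
}
```

With three eligible files of 100 bytes each holding 10, 90, and 90 bytes of garbage and a threshold of 0.5, the oldest file alone (10% garbage) does not trigger GC under the old semantics, while the whole eligible set (about 63% garbage) does under the new one.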
Reviewed By: jowlyzhang
Differential Revision: D62977860
fbshipit-source-id: a999f31fe9cdda313de513f0e7a6fc707424d4a3
Summary:
* Set write_dbid_to_manifest=true by default
* Add new option write_identity_file (default true) that allows us to opt-in to future behavior without identity file
* Refactor related DB open code to minimize code duplication
_Recommend hiding whitespace changes for review_
Intended follow-up: add support to ldb for reading and even replacing the DB identity in the manifest. Could be a variant of `update_manifest` command or based on it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13019
Test Plan: unit tests and stress test updated for new functionality
Reviewed By: anand1976
Differential Revision: D62898229
Pulled By: pdillinger
fbshipit-source-id: c08b25cf790610b034e51a9de0dc78b921abbcf0
Summary:
Add an option `--only_print_seqno_gaps` for wal dump to help with debugging. This option will check the continuity of sequence numbers in WAL logs, assuming `seq_per_batch` is false. `--walfile` option now also takes a directory, and it will check all WAL logs in the directory in chronological order.
When a gap is found, we can further check if it's related to operations like external file ingestion.
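The continuity check itself is simple to sketch. The code below is not the ldb implementation; `WalBatch`, `SeqnoGap`, and `FindSeqnoGaps` are hypothetical names. It relies on the stated assumption that `seq_per_batch` is false, so a batch starting at sequence number `s` with `n` entries is expected to be followed by a batch starting at `s + n`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One write batch as read from the WAL (hypothetical type).
struct WalBatch {
  uint64_t first_seqno;
  uint64_t count;  // number of entries, each consuming one sequence number
};

struct SeqnoGap {
  uint64_t expected;
  uint64_t actual;
};

// Scans batches in WAL order and reports every place where the next batch
// does not start at the expected sequence number.
std::vector<SeqnoGap> FindSeqnoGaps(const std::vector<WalBatch>& batches) {
  std::vector<SeqnoGap> gaps;
  bool have_previous = false;
  uint64_t expected = 0;
  for (const WalBatch& batch : batches) {
    if (have_previous && batch.first_seqno != expected) {
      gaps.push_back(SeqnoGap{expected, batch.first_seqno});
    }
    expected = batch.first_seqno + batch.count;
    have_previous = true;
  }
  return gaps;
}
```

For batches starting at 1 (3 entries), 4 (2 entries), and 10 (1 entry), a single gap is reported: sequence number 6 was expected but 10 was found.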
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13014
Test Plan: Manually tested
Reviewed By: ltamasi
Differential Revision: D62989115
Pulled By: jowlyzhang
fbshipit-source-id: 22e3326344e7969ff9d5091d21fec2935770fbc7
Summary:
There was a subtle design/contract bug in the previous version of range filtering in experimental.h. If someone implemented a key segments extractor with "all or nothing" fixed size segments, that could result in unsafe range filtering. For example, with two segments of width 3:
```
x = 0x|12 34 56|78 9A 00|
y = 0x|12 34 56||78 9B
z = 0x|12 34 56|78 9C 00|
```
Segment 1 of y (empty) is out of order with segment 1 of x and z.
I have re-worked the contract to make it clear what does work, and implemented a standard extractor for fixed-size segments, CappedKeySegmentsExtractor. The safe approach for filtering is to consume as much as is available for a segment in the case of a short key.
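The difference between the two extraction strategies can be reproduced with the keys from the example above. The functions below are a toy model, not `CappedKeySegmentsExtractor` or the experimental.h API; all names are hypothetical and segments are returned as plain byte strings compared bytewise.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <string>
#include <vector>

// Builds a byte string from a list of byte values.
std::string Bytes(std::initializer_list<unsigned char> bytes) {
  return std::string(bytes.begin(), bytes.end());
}

// "All or nothing": a segment is empty unless the key has enough bytes left
// to fill its full width.
std::vector<std::string> ExtractAllOrNothing(
    const std::string& key, const std::vector<std::size_t>& widths) {
  std::vector<std::string> segments;
  std::size_t pos = 0;
  for (std::size_t width : widths) {
    if (pos + width <= key.size()) {
      segments.push_back(key.substr(pos, width));
      pos += width;
    } else {
      segments.emplace_back();
      pos = key.size();
    }
  }
  return segments;
}

// "Capped": a segment consumes as many bytes as are available, up to its
// width, so a short key still contributes a (shorter) segment.
std::vector<std::string> ExtractCapped(const std::string& key,
                                       const std::vector<std::size_t>& widths) {
  std::vector<std::string> segments;
  std::size_t pos = 0;
  for (std::size_t width : widths) {
    const std::size_t take = std::min(width, key.size() - pos);
    segments.push_back(key.substr(pos, take));
    pos += take;
  }
  return segments;
}
```

For the keys `x < y < z` above, the all-or-nothing extractor yields an empty segment 1 for `y`, which sorts before segment 1 of `x`. The capped extractor yields `78 9B` for `y`, which sorts between `78 9A 00` and `78 9C 00`, so segment order stays consistent with key order.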
I have also added support for min-max filtering with reverse byte-wise comparator, which is probably the 2nd most common comparator for RocksDB users (because of MySQL). It might seem that a min-max filter doesn't care about forward or reverse ordering, but it does when trying to determine whether an input range from segment values v1 to v2, where it so happens that v2 is byte-wise less than v1, is an empty forward interval or a non-empty reverse interval. At least in the current setup, we don't have that context.
A new unit test (with some refactoring) tests CappedKeySegmentsExtractor, reverse byte-wise comparator, and the corresponding min-max filter.
I have also (contractually / mathematically) generalized the framework to comparators other than the byte-wise comparator, and made other generalizations to make the extractor limitations more explicitly connected to the particular filters and filtering used--at least in description.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13005
Test Plan: added unit tests as described
Reviewed By: jowlyzhang
Differential Revision: D62769784
Pulled By: pdillinger
fbshipit-source-id: 0d41f0d0273586bdad55e4aa30381ebc861f7044
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13015
`Close()`ing a database now releases tracked files in `SstFileManager`. Previously this space would be leaked until the database was later reopened.
Reviewed By: jowlyzhang
Differential Revision: D62590773
fbshipit-source-id: 5461bd253d974ac4967ad52fee92e2650f8a9a28
Summary:
The internal codebase is updated for the coro directory's graduation from experimental. Updating our build script for a newer version with this change too. Using this hash: 03041f014b
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13017
Reviewed By: nickbrekhus
Differential Revision: D62763932
Pulled By: jowlyzhang
fbshipit-source-id: 1b211707fbc7d974d6d6ceaf577e174424bb44ed
Summary:
A recent crash test failure shows that auto recovery from WAL write failure can cause CFs to be inconsistent. A unit test repro in P1569398553. The following is an example sequence of events:
```
0. manual_wal_flush is true. There are multiple CFs in a DB.
1. Submit a write batch with updates to multiple CFs
2. A FlushWAL or a memtable switch that will try to write the buffered WAL data. Fail this write so that buffered WAL data is dropped: 4b1d595306/file/writable_file_writer.cc (L624)
The error needs to be retryable to start background auto recovery.
3. One CF successfully flushes its memtable during auto recovery.
4. Crash the process.
5. Reopen the DB, one CF will have the update as a result of successful flush. Other CFs will miss all the updates in the write batch since WAL does not have them.
```
This can happen if a user configures manual_wal_flush, uses more than one CF, and can hit a retryable error for WAL writes. This PR is a short-term fix that upgrades WAL-related errors to fatal so that they do not trigger auto recovery.
A long-term fix may be to not drop buffered WAL data, by checking how much data is actually written, or to require atomically flushing all column families during error recovery from this kind of error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12995
Test Plan:
Added a unit test to check error severity and whether recovery is triggered. A crash test repro command that fails in a few runs before this PR:
```
python3 ./tools/db_crashtest.py blackbox --interval=60 --metadata_write_fault_one_in=1000 --column_families=10 --exclude_wal_from_write_fault_injection=0 --manual_wal_flush_one_in=1000 --WAL_size_limit_MB=10240 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=1 --allow_data_in_errors=True --allow_fallocate=1 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=100 --block_align=1 --block_protection_bytes_per_key=0 --block_size=16384 --bloom_before_level=2147483647 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_readahead_size=1048576 --compaction_ttl=0 --compress_format_version=1 --compressed_secondary_cache_size=8388608 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=4 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0 --db_write_buffer_size=0 --decouple_partitioned_filters=1 --default_temperature=kCold --default_write_temperature=kWarm --delete_obsolete_files_period_micros=30000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 
--detect_filter_construct_corruption=0 --disable_file_deletions_one_in=1000000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=1 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=0 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=1 --error_recovery_with_no_fault_injection=1 --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --fill_cache=1 --flush_one_in=1000000 --format_version=6 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --index_block_restart_interval=4 --index_shortening=1 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=10000 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=0 --lowest_used_cache_tier=2 --manifest_preallocation_size=5120 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=16 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=2097152 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=2 
--memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=0 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=100 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --optimize_filters_for_hits=0 --optimize_filters_for_memory=0 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --paranoid_memory_checks=0 --partition_filters=0 --partition_pinning=2 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=524288 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=10000 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=10000 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=1048576 --sqfc_name=bar --sqfc_version=1 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=2 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=6 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --uncache_aggressiveness=8 --universal_max_read_amp=-1 --unpartitioned_pinning=2 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=0 --use_attribute_group=1 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_cf_iterator=1 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=1 --use_sqfc_for_range_queries=0 
--use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=1 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=4194304 --write_dbid_to_manifest=0 --write_fault_one_in=50 --writepercent=35 --ops_per_thread=100000 --preserve_unverified_changes=1
```
Reviewed By: hx235
Differential Revision: D62888510
Pulled By: cbi42
fbshipit-source-id: 308bdbbb8d897cc8eba950155cd0e37cf7eb76fe
Summary: I came across this code while buckifying parts of folly and fizz in open source. This is pretty hacky code and cleaning it up doesn't seem that hard, so I did it.
Reviewed By: zertosh, pdillinger
Differential Revision: D62781766
fbshipit-source-id: 43714bce992c53149d1e619063d803297362fb5d
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13010
The OnAddFile cur_compactions_reserved_size_ accounting causes wraparound when re-opening a database with an unowned SstFileManager and during recovery. It was introduced in #4164, which addresses out-of-space recovery, but its purpose there is unclear. Compaction jobs do this accounting via EnoughRoomForCompaction/OnCompactionCompletion and, to my understanding, would never reuse an SST file name.
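The wraparound itself is ordinary unsigned underflow. A minimal sketch of the failure mode, assuming a `uint64_t` counter that gets decremented for bytes that were never reserved; `ReservedSizeTracker` and its methods are hypothetical names for illustration, not the SstFileManager code:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for reserved-size bookkeeping; not RocksDB code.
struct ReservedSizeTracker {
  uint64_t reserved = 0;

  // A job reserves space up front.
  void Reserve(uint64_t bytes) {
    assert(reserved <= UINT64_MAX - bytes);  // guard against overflow
    reserved += bytes;
  }

  // Unchecked release: wraps around to a huge value if `bytes`
  // exceeds what was reserved (e.g. nothing was reserved after reopen).
  void ReleaseUnchecked(uint64_t bytes) { reserved -= bytes; }

  // Checked release: clamps at zero instead of wrapping.
  void ReleaseChecked(uint64_t bytes) {
    reserved = bytes > reserved ? 0 : reserved - bytes;
  }
};
```

Calling `ReleaseUnchecked(4096)` on a fresh tracker leaves `reserved` at `2^64 - 4096`, which any later free-space check would read as an enormous reservation. The clamped variant is shown only for contrast and is not a description of this change.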
Reviewed By: anand1976
Differential Revision: D62535775
fbshipit-source-id: a7c44d6e0a4b5ff74bc47abfe57c32ca6770243d
Summary:
For SST checksum mismatch corruptions in the read path, RocksDB retries the read if the underlying file system supports verification and reconstruction of data (`FSSupportedOps::kVerifyAndReconstructRead`). There were a couple of places where the retry was missing - reading the SST footer and the properties block. This PR fixes the retry in those cases.
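The retry pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the RocksDB read path: `ReadResult`, `ReadFn`, and `ReadWithRetry` are hypothetical names, and the boolean stands in for checking whether the file system advertises `FSSupportedOps::kVerifyAndReconstructRead`.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical result codes; RocksDB reports these through Status/IOStatus.
enum class ReadResult { kOk, kChecksumMismatch, kOtherError };

// A read callback. `verify_and_reconstruct` asks the file system to
// verify the data and rebuild it if needed before returning it.
using ReadFn =
    std::function<ReadResult(bool verify_and_reconstruct, std::string* out)>;

// Read once; on a checksum mismatch, retry exactly once with
// reconstruction requested, but only if the file system supports it.
inline ReadResult ReadWithRetry(const ReadFn& read,
                                bool fs_supports_reconstruct,
                                std::string* out) {
  assert(out != nullptr);
  ReadResult r = read(/*verify_and_reconstruct=*/false, out);
  if (r == ReadResult::kChecksumMismatch && fs_supports_reconstruct) {
    r = read(/*verify_and_reconstruct=*/true, out);
  }
  return r;
}
```

The point of the sketch is that every block read (footer and properties block included) has to go through the same wrapper; a read that bypasses it surfaces the mismatch as corruption even when the file system could have repaired it.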
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13007
Test Plan: Add new unit tests
Reviewed By: jaykorean
Differential Revision: D62519186
Pulled By: anand1976
fbshipit-source-id: 50aa38f18f2a53531a9fc8d4ccdf34fbf034ed59
Summary:
In ReFitLevel(), we were not setting being_compacted to false after ReFitLevel() is done. This is not an issue if the refit is successful, since new FileMetaData is created for files at the target level. However, if there's an error during ReFitLevel(), e.g., a MANIFEST write failure, we should clear the being_compacted field for these files. Otherwise, these files will not be picked for compaction until the DB is reopened.
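One way to make this kind of cleanup hard to miss is a scope guard that clears the flag on every exit path. The sketch below is an assumption about how such a guard could look, not the actual fix (which may simply reset the field on the error path); `FileMeta`, `BeingCompactedGuard`, and `RefitSketch` are hypothetical names.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for FileMetaData; only the flag matters here.
struct FileMeta {
  bool being_compacted = false;
};

// Sets being_compacted on construction and clears it on destruction,
// so an early return on error cannot leave the flag set.
class BeingCompactedGuard {
 public:
  explicit BeingCompactedGuard(std::vector<FileMeta*> files)
      : files_(std::move(files)) {
    for (FileMeta* f : files_) {
      assert(!f->being_compacted);
      f->being_compacted = true;
    }
  }
  ~BeingCompactedGuard() {
    for (FileMeta* f : files_) f->being_compacted = false;
  }
  BeingCompactedGuard(const BeingCompactedGuard&) = delete;
  BeingCompactedGuard& operator=(const BeingCompactedGuard&) = delete;

 private:
  std::vector<FileMeta*> files_;
};

// Simulated refit: returns false when the pretend manifest write fails.
inline bool RefitSketch(std::vector<FileMeta*> files, bool manifest_write_ok) {
  BeingCompactedGuard guard(std::move(files));
  if (!manifest_write_ok) return false;  // guard clears the flag here
  return true;
}
```

With the guard in place, a failed `RefitSketch` leaves every input file eligible for compaction again without waiting for a reopen.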
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13009
Test Plan:
- existing test.
- stress test failure in T200339331 should not happen anymore.
Reviewed By: hx235
Differential Revision: D62597169
Pulled By: cbi42
fbshipit-source-id: 0ba659806da6d6d4b42384fc95268b2d7bad720e
Summary:
Prepare this internal API to be used by atomic data replacement. The main purpose of this API is to get a `VersionEdit` to mark the entire current `MemTableListVersion` as dropped. Flush needs similar functionality when installing results, so that logic is refactored into a util function `GetDBRecoveryEditForObsoletingMemTables` to be shared by flush and this internal API.
To test this internal API, flush's result installation is redirected to use this API when it is flushing all the immutable MemTables in debug mode. It should achieve the exact same results, just with a duplicated `VersionEdit::log_number` field that doesn't upset the recovery logic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13001
Test Plan: Existing tests
Reviewed By: pdillinger
Differential Revision: D62309591
Pulled By: jowlyzhang
fbshipit-source-id: e25914d9a2e281c25ab7ee31a66eaf6adfae4b88