Summary:
This PR fixes an error in the computation of a CF's smallest and largest keys in ImportColumnFamilyJob::Prepare.
Before this fix, the smallest and largest keys for a CF were computed incorrectly, so the ImportColumnFamilyJob::Prepare function might not have detected overlaps between CFs. I added a test to detect this error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12526
Reviewed By: hx235
Differential Revision: D56046044
Pulled By: ajkr
fbshipit-source-id: d562fbfc9cc2d9624372d24d34a649198a960691
Summary:
Context/Summary:
We need an `nvm_sec_cache` when `kAdmPolicyThreeQueue` is used; otherwise a nullptr cache will be accessed, causing the segfault seen in https://github.com/facebook/rocksdb/pull/12521
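For illustration, a minimal sketch of the tiered-cache setup that now requires an `nvm_sec_cache` (the `TieredCacheOptions` fields beyond those named above are assumptions, and the required primary/compressed cache configuration is omitted):
```
#include "rocksdb/cache.h"

using namespace ROCKSDB_NAMESPACE;

std::shared_ptr<Cache> MakeThreeQueueCache(
    std::shared_ptr<SecondaryCache> nvm_sec_cache) {
  TieredCacheOptions opts;
  opts.adm_policy = TieredAdmissionPolicy::kAdmPolicyThreeQueue;
  // With kAdmPolicyThreeQueue, an NVM secondary cache must be supplied;
  // otherwise a nullptr cache is dereferenced (the segfault this PR fixes).
  opts.nvm_sec_cache = std::move(nvm_sec_cache);
  return NewTieredCache(opts);
}
```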
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12524
Test Plan: - Re-enabled `kAdmPolicyThreeQueue` and ran the rehearsal stress test that failed before this fix and passes after it
Reviewed By: jowlyzhang
Differential Revision: D55997093
Pulled By: hx235
fbshipit-source-id: e1c6f1015091b4cff0ce6a3fff981d5dece52a62
Summary:
Previously, when building with fbcode on a machine with a system install of liburing, we would link liburing from fbcode statically as well as the system library dynamically. That led to the following error:
```
./db_stress: error while loading shared libraries: liburing.so.1: cannot open shared object file: No such file or directory
```
The fix is to skip the feature test for system liburing when `FBCODE_BUILD=true`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12525
Test Plan:
- `make clean && make ROCKSDB_NO_FBCODE=1 V=1 -j56 db_stress && ./db_stress`
- `make clean && make V=1 -j56 db_stress && ./db_stress`
Reviewed By: anand1976
Differential Revision: D55997335
Pulled By: ajkr
fbshipit-source-id: 17d8561100f41c6c9ae382a80c6cddc14f050bdc
Summary:
There are a couple of reasons to modify the current implementation of the MultiCfIterator, which implements the generic `Iterator` interface.
- The default behavior of `value()`/`columns()` returning data from different Column Families for different keys can be prone to errors, even though there might be valid use cases where users do not care about the origin of the value/columns.
- The `attribute_groups()` API, which is not yet implemented, will not be useful for a single-CF iterator.
In this PR, we are implementing the following changes:
- `IteratorBase` introduced, which includes all basic iterator functions except `value()` and `columns()`.
- `Iterator`, which now inherits from `IteratorBase`, includes `value()` and `columns()`.
- New public interface `AttributeGroupIterator` inherits from `IteratorBase` and additionally includes `attribute_groups()` (to be implemented).
- Renamed the former `MultiCfIterator` to `CoalescingIterator`, which inherits from `Iterator`.
- Existing MultiCfIteratorTest has been split into two - `CoalescingIteratorTest` and `AttributeGroupIteratorTest`.
- Moved AttributeGroup related code from `wide_columns.h` to a new file, `attribute_groups.h`.
Some Implementation Details
- `MultiCfIteratorImpl` takes two functions - `populate_func` and `reset_func` - and uses them to populate `value_` and `columns_` in CoalescingIterator and `attribute_groups_` in AttributeGroupIterator. In CoalescingIterator, `populate_func` is `Coalesce()`; in AttributeGroupIterator, `populate_func` is `AddToAttributeGroups()`. `reset_func` clears the populated `value_`, `columns_`, and `attribute_groups_` accordingly.
- `Coalesce()` merge-sorts columns from multiple CFs when a key exists in more than one CF. A column that appears in a later CF overwrites the prior ones.
For example, if CF1 has `"key_1" ==> {"col_1": "foo", "col_2": "baz"}` and CF2 has `"key_1" ==> {"col_2": "quux", "col_3": "bla"}`, then when the iterator is at `key_1`, `columns()` will return `{"col_1": "foo", "col_2": "quux", "col_3": "bla"}`.
In this example, `value()` will be empty, because neither CF has a value for `kDefaultColumnName`.
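To make the coalescing behavior concrete, here is a minimal standalone sketch of the merge described above (the types and function are illustrative stand-ins, not the actual `MultiCfIteratorImpl` code):
```
#include <string>
#include <utility>
#include <vector>

// Illustrative stand-ins for RocksDB's wide-column types.
struct WideColumn {
  std::string name;
  std::string value;
};
using WideColumns = std::vector<WideColumn>;

// Merge columns from multiple CFs, ordered from earliest to latest CF.
// Columns within each CF are sorted by name; a column in a later CF
// overwrites the same-named column from an earlier CF.
WideColumns Coalesce(const std::vector<WideColumns>& per_cf_columns) {
  WideColumns result;
  for (const auto& columns : per_cf_columns) {
    WideColumns merged;
    auto lhs = result.begin();
    auto rhs = columns.begin();
    while (lhs != result.end() && rhs != columns.end()) {
      if (lhs->name < rhs->name) {
        merged.push_back(*lhs++);
      } else if (lhs->name > rhs->name) {
        merged.push_back(*rhs++);
      } else {
        merged.push_back(*rhs++);  // later CF wins
        ++lhs;
      }
    }
    merged.insert(merged.end(), lhs, result.end());
    merged.insert(merged.end(), rhs, columns.end());
    result = std::move(merged);
  }
  return result;
}
```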
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12480
Test Plan:
## Unit Test
```
./multi_cf_iterator_test
```
## Performance Test
To make sure this change does not impact existing `Iterator` performance
**Build**
```
$> make -j64 release
```
**Setup**
```
$> TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=1000000 -compression_type=none
```
**Run**
```
TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="newiterator,seekrandom" -cache_size=10485760000
```
**Before the change**
```
DB path: [/dev/shm/db_bench/dbbench]
newiterator : 0.519 micros/op 1927904 ops/sec 0.519 seconds 1000000 operations;
DB path: [/dev/shm/db_bench/dbbench]
seekrandom : 5.302 micros/op 188589 ops/sec 5.303 seconds 1000000 operations; (0 of 1000000 found)
```
**After the change**
```
DB path: [/dev/shm/db_bench/dbbench]
newiterator : 0.497 micros/op 2011012 ops/sec 0.497 seconds 1000000 operations;
DB path: [/dev/shm/db_bench/dbbench]
seekrandom : 5.252 micros/op 190405 ops/sec 5.252 seconds 1000000 operations; (0 of 1000000 found)
```
Reviewed By: ltamasi
Differential Revision: D55353909
Pulled By: jaykorean
fbshipit-source-id: 8d7786ffee09e022261ce34aa60e8633685e1946
Summary:
**Context/Summary**
This policy leads to a segfault in `CompressedCacheSetCapacityThread` with some builds/compilations. Until we figure out why, disable it for now.
**Test**
Rehearsed the stress test that failed before the fix and passes after it
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12521
Reviewed By: jowlyzhang
Differential Revision: D55942399
Pulled By: hx235
fbshipit-source-id: 85f28e50d596dcfd4a316481570b78fdce58ed0b
Summary:
Context/Summary: for an unknown reason, calling a db_stress common function in the db_stress flag file for temperature-related flags causes weird behavior in some compilations/builds.
```
assertion failed - iter != ROCKSDB_NAMESPACE::OptionsHelper::temperature_to_string.end()
```
For now, we avoid calling such functions by hard-coding the flags' default stress test values.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12519
Test Plan: - Ran a rehearsal stress test with this fix; the weird behavior is gone.
Reviewed By: jowlyzhang
Differential Revision: D55884693
Pulled By: hx235
fbshipit-source-id: ba5135f5b37a9fa686b3ccae8d3f77e62d6562c9
Summary:
**Context/Summary:**
This is to improve our crash test coverage.
Bonus change:
- Added the missing Options string mapping for `CacheTier::kVolatileCompressedTier`
- Deprecated the crash test option `enable_tiered_storage`, which was mainly for setting `last_level_temperature`; that is now covered in the crash test by itself
- Intensified `verify_checksum_one_in`/`verify_file_checksums_one_in` as I found that these, together with the new coverage, surface more issues
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12508
Test Plan: CI to look out for trivial failures
Reviewed By: jowlyzhang
Differential Revision: D55768594
Pulled By: hx235
fbshipit-source-id: 9b829da0309a7db3fcdb17992a524dd64498325c
Summary:
https://github.com/facebook/rocksdb/issues/12466 reported a bug when `RocksDB.getColumnFamilyMetaData()` is called on an existing database (with files stored on disk). As neilramaswamy mentioned, this was caused by https://github.com/facebook/rocksdb/issues/11770, where the signature of the `SstFileMetaData` constructor was changed but the JNI code wasn't updated.
This PR fixes the JNI code and also properly populates `fileChecksum` on `SstFileMetaData`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12474
Reviewed By: jowlyzhang
Differential Revision: D55811808
Pulled By: ajkr
fbshipit-source-id: 2ab156f41eaf4a4f30c49e6df421b61e8451230e
Summary:
It is an important function and should be correct on legacy BlobDB, even though using legacy BlobDB is not recommended.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12468
Reviewed By: cbi42
Differential Revision: D55231038
Pulled By: ajkr
fbshipit-source-id: 2ac18e4c149590b373eb79cd92c0ca5e7fce94f2
Summary:
Since some internal users might be interested in using this feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12506
Test Plan:
The option was disabled in stress test due to causing failures.
I've run a round of crash tests internally and there was no failure due to parallel compression. Will monitor whether more runs cause failures, so that we will at least know how it's broken and can decide whether to fix it or revert the change.
Reviewed By: jowlyzhang
Differential Revision: D55747552
Pulled By: cbi42
fbshipit-source-id: ae5cda78c338b8b58f651c557d9b70790362444d
Summary:
**Context/Summary:**
Debugging the crash test made me realize there are a few places that could use some improvement in logging more info
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12504
Test Plan:
Manual testing
Debug build
```
2024/04/04-16:12:12.289791 1636007 [/db_filesnapshot.cc:156] Number of log files 2 (0 required by manifest)
...
2024/04/04-16:12:12.289814 1636007 [/db_filesnapshot.cc:171] Log files : /000004.log /000008.log .Log files required by manifest: .
```
Non-debug build
```
2024/04/04-16:19:23.222168 1685043 [/db_filesnapshot.cc:156] Number of log files 1 (0 required by manifest)
```
CI
Reviewed By: jaykorean
Differential Revision: D55710013
Pulled By: hx235
fbshipit-source-id: 9964d46cfb0a2074620f31571cf9fd29d0a88819
Summary:
Without this override, `FaultInjectionTestFs` uses the implementation from `FileSystemWrapper`, which delegates to the base file system: 2207a66fe5/include/rocksdb/file_system.h (L1451-L1457)
That will create a regular `FSWritableFile` instead of a `TestFSWritableFile`:
2207a66fe5/env/file_system.cc (L98-L108)
We have seen verification failures with a WAL hole because the last log writer was a `FSWritableFile` created from recycling a previous log file, while the second-to-last log writer was a `TestFSWritableFile`. The former can survive a process crash, while the latter cannot, which makes the WAL look like it has a hole.
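The fix is to override the recycling entry point so that recycled logs are also wrapped. A simplified sketch of the shape of the override (the real version's fault-injection bookkeeping is omitted, and the `TestFSWritableFile` constructor arguments are my assumption):
```
IOStatus FaultInjectionTestFS::ReuseWritableFile(
    const std::string& fname, const std::string& old_fname,
    const FileOptions& file_opts, std::unique_ptr<FSWritableFile>* result,
    IODebugContext* dbg) {
  std::unique_ptr<FSWritableFile> file;
  IOStatus io_s =
      target()->ReuseWritableFile(fname, old_fname, file_opts, &file, dbg);
  if (io_s.ok()) {
    // Wrap the recycled file in TestFSWritableFile so its unsynced data
    // can be dropped on a simulated crash, instead of handing back the
    // raw FSWritableFile.
    result->reset(
        new TestFSWritableFile(fname, file_opts, std::move(file), this));
  }
  return io_s;
}
```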
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12510
Reviewed By: hx235
Differential Revision: D55769158
Pulled By: jowlyzhang
fbshipit-source-id: ebeffee8255bfa155434e17afe5082908d41a0d1
Summary:
The unit test fails occasionally and cannot be reproed locally.
```
[ RUN ] DBCompactionTest.CompactionLimiter
db/db_compaction_test.cc:6139: Failure
Expected equality of these values:
cf_count
Which is: 17
env_->GetThreadPoolQueueLen(Env::LOW)
Which is: 15
[ FAILED ] DBCompactionTest.CompactionLimiter (512 ms)
```
Add some debug printing to help triage if it fails again.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12509
Reviewed By: jowlyzhang
Differential Revision: D55770552
Pulled By: cbi42
fbshipit-source-id: 2a39b2199f80352fcf2c6cd2b9c8b81c727eee8c
Summary:
Make `autovector` construct the stack-based elements in place before moving or copying another `autovector`'s stack-based elements. This is already done in the move/copy version of `autovector::push_back` when adding an item to the stack-based memory
8e6e8957fb/util/autovector.h (L269-L285)
The ` values_ = reinterpret_cast<pointer>(buf_);` statement is not sufficient to ensure the class's member variables are properly constructed. I'm able to reproduce this consistently in a unit test in this change: https://github.com/facebook/rocksdb/compare/main...jowlyzhang:fix_sv_install with unit test:
`./tiered_compaction_test --gtest_filter="*FastTrack*"`
The stack trace P1203997597 shows the `std::string` copy destination is invalid, which indicates the object in the destination `autovector` was not constructed properly.
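For illustration, the difference boils down to constructing elements into the raw stack buffer with placement new before assigning, rather than only casting the buffer; a minimal standalone sketch:
```
#include <cstddef>
#include <new>

// Minimal sketch of a stack-backed buffer like autovector's.
template <typename T, size_t kSize>
struct StackBuf {
  alignas(T) char buf_[kSize * sizeof(T)];
  T* values_ = reinterpret_cast<T*>(buf_);
  size_t num_ = 0;

  // WRONG: assignment into unconstructed memory. For a non-trivial T
  // (e.g. std::string), this invokes operator= on garbage bytes.
  void BadCopyFrom(const T& src) { values_[num_++] = src; }

  // RIGHT: construct the element in place first.
  void GoodCopyFrom(const T& src) { new (&values_[num_++]) T(src); }
};
```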
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12499
Test Plan: Existing unit tests.
Reviewed By: anand1976
Differential Revision: D55662354
Pulled By: jowlyzhang
fbshipit-source-id: 581ceb11155d3dd711998607ec6950c0e327556a
Summary:
When we use the CreateColumnFamilyWithImport interface of PessimisticTransactionDB to create column family, the lack of related information may cause subsequent writes to be unable to find the Column Family ID.
The issue: (https://github.com/facebook/rocksdb/issues/12493)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12490
Reviewed By: jowlyzhang
Differential Revision: D55700343
Pulled By: cbi42
fbshipit-source-id: dc992a3eef433e1193d579cbf58b6ba940fa460d
Summary:
This PR adds support to programmatically iterate a raw table file with an iterator returned by `SstFileReader::NewTableIterator`, for third-party tools to observe SST files created by RocksDB.
The original feature request was from this merge request: https://github.com/facebook/rocksdb/pull/12370
Since keys returned by raw table iterators are internal keys, this PR also adds a struct `ParsedEntryInfo` and a util method `ParseEntry` to help users parse internal keys, plus `GetInternalKeyForSeek` and `GetInternalKeyForSeekForPrev` to help users create internal keys for seek operations with this raw table iterator.
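A rough usage sketch (the `types_util.h` header location and the `ParseEntry` signature are my reading of the new utils and may differ slightly):
```
#include <cassert>
#include <memory>
#include <string>

#include "rocksdb/sst_file_reader.h"
#include "rocksdb/utilities/types_util.h"

using namespace ROCKSDB_NAMESPACE;

void DumpRawEntries(const std::string& sst_path, const Options& options) {
  SstFileReader reader(options);
  Status s = reader.Open(sst_path);
  assert(s.ok());
  std::unique_ptr<Iterator> iter(reader.NewTableIterator());
  for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {
    // Keys are internal keys; decode them before use.
    ParsedEntryInfo parsed;
    s = ParseEntry(iter->key(), options.comparator, &parsed);
    assert(s.ok());
    // parsed exposes the user key, sequence number, and entry type.
  }
}
```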
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12385
Test Plan: Added unit tests
Reviewed By: cbi42
Differential Revision: D55662855
Pulled By: jowlyzhang
fbshipit-source-id: 0716a173ee95924fbd4e1f9b6cccf06525c40049
Summary:
`nullptr` is typesafe. `0` and `NULL` are not. In the future, only `nullptr` will be allowed.
This diff helps us embrace the future _now_ in service of enabling `-Wzero-as-null-pointer-constant`.
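A classic illustration of why `0` and `NULL` are not typesafe while `nullptr` is:
```
void f(int) {}
void f(char*) {}

void caller() {
  f(0);        // calls f(int), even if a null pointer was intended
  f(nullptr);  // unambiguously calls f(char*)
  // f(NULL);  // ambiguous or calls f(int), depending on how NULL is defined
}
```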
Reviewed By: dmm-fb
Differential Revision: D55559752
fbshipit-source-id: 9f1edc836ded919022c4b53722f6f86208fecf8d
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`
If the code compiles, this is safe to land.
Reviewed By: palmje
Differential Revision: D55534619
fbshipit-source-id: 26f3c35a51b38a3cbfa12a6f76a2bb783a7b4d8e
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`
If the code compiles, this is safe to land.
Reviewed By: palmje
Differential Revision: D55534622
fbshipit-source-id: dfff34924da6f2cdad34ed21f8f08a9bab9189a7
Summary:
**Context/Summary:**
`wal_bytes_per_sync > 0` can sync a newer WAL but not an older WAL by its nature. This creates a hole in the synced WAL data. Through our crash test, we recently discovered that our DB can recover past that hole. This resulted in a crash-recovery-verification error. Before we fix that recovery behavior, we will temporarily disable `wal_bytes_per_sync` in the crash test
Bonus: updated the API docs to make the nature of this option more explicit
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12489
Test Plan: More stabilized crash test
Reviewed By: ajkr
Differential Revision: D55531589
Pulled By: hx235
fbshipit-source-id: 6dea6486420dc0f50550d488c15652f93972a0ea
Summary:
**Context/Summary:**
We recently discovered that `CompactRange(change_level=true, target_level=0)` can possibly refit more than one file to L0. This refitting can cause a read performance regression, as we need to go through every file in L0, corruption in some edge cases, and false-positive corruption caught by the force consistency check. We decided to explicitly disallow such behavior.
A related change to OptionChangeMigration():
- When migrating to FIFO with `compaction_options_fifo.max_table_files_size > 0`, RocksDB will [CompactRange() all the to-be-migrated data into a couple of L0 files](https://github.com/facebook/rocksdb/blob/main/utilities/option_change_migration/option_change_migration.cc#L164-L169) to avoid dropping all the data when migration finishes, in case the migrated data is larger than max_table_files_size. This is achieved by first compacting all the data into a couple of non-L0 files and refitting those files from non-L0 to L0 if needed. In that way, only some data instead of all data will be dropped immediately after migration to FIFO with a max_table_files_size.
- Since this type of refitting behavior is disallowed from now on, we won't do this trick anymore and will explicitly state this risk in the API comment.
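For reference, the call shape that is now explicitly disallowed (using the public `CompactRangeOptions` API; the exact error status returned is not shown here):
```
#include "rocksdb/db.h"

using namespace ROCKSDB_NAMESPACE;

Status RefitToL0(DB* db) {
  CompactRangeOptions cro;
  cro.change_level = true;
  cro.target_level = 0;  // refitting to L0 is now explicitly disallowed
  // After this change, this returns a non-OK status instead of possibly
  // refitting more than one file into L0.
  return db->CompactRange(cro, /*begin=*/nullptr, /*end=*/nullptr);
}
```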
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12481
Test Plan:
- New UT
- Modified UT
Reviewed By: cbi42
Differential Revision: D55351178
Pulled By: hx235
fbshipit-source-id: 9d8854f2f81d7e8aff859c3a4e53b7d688048e80
Summary:
Errors were being swallowed in `BlockBasedTable::MultiGet` under some circumstances, such as error when parsing the internal key from the block, or IO error when reading the blob value. We need to set the status for the key to the observed error.
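For context, `MultiGet` reports one status per key, so a swallowed error surfaces as a spurious `OK`; a sketch of checking each per-key status:
```
#include <vector>

#include "rocksdb/db.h"

using namespace ROCKSDB_NAMESPACE;

void MultiGetChecked(DB* db, ColumnFamilyHandle* cf,
                     const std::vector<Slice>& keys) {
  std::vector<PinnableSlice> values(keys.size());
  std::vector<Status> statuses(keys.size());
  db->MultiGet(ReadOptions(), cf, keys.size(), keys.data(), values.data(),
               statuses.data());
  for (size_t i = 0; i < keys.size(); i++) {
    // Before this fix, some per-key errors (e.g. a blob IO error) could
    // be dropped; each status must reflect the observed error.
    if (!statuses[i].ok() && !statuses[i].IsNotFound()) {
      // handle the error for keys[i]
    }
  }
}
```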
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12486
Test Plan: Run db_stress and verify the expected error failure before, and no failures after the change.
Reviewed By: jaykorean, ajkr
Differential Revision: D55483940
Pulled By: anand1976
fbshipit-source-id: 493e44db507d5db45e8d1ef2e67808d2c4046318
Summary:
Previously it was uninitialized. Setting `checksum_handoff_file_types` will cause `kCRC32c` checksums to be passed down in the `DataVerificationInfo`, so it makes sense for `kCRC32c` to be the default.
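For reference, checksum handoff is opted into per file type; a sketch (the `FileTypeSet` construction shown is my assumption):
```
#include "rocksdb/options.h"

using namespace ROCKSDB_NAMESPACE;

Options MakeHandoffOptions() {
  Options options;
  // Opt WAL writes into checksum handoff. With this change, the checksum
  // type passed down in DataVerificationInfo defaults to kCRC32c instead
  // of being uninitialized.
  options.checksum_handoff_file_types = FileTypeSet({FileType::kWALFile});
  return options;
}
```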
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12485
Test Plan:
ran `db_stress` in a way that failed before. Building with ASAN was needed to ensure the uninitialized bytes are nonzero according to `malloc_fill_byte` (default 0xbe)
```
$ COMPILE_WITH_ASAN=1 make -j28 db_stress
...
$ ./db_stress -sync_fault_injection=1 -enable_checksum_handoff=true
```
Reviewed By: jaykorean
Differential Revision: D55450587
Pulled By: ajkr
fbshipit-source-id: 53dc829b86e49b3fa80570032e83af0bb12adaad
Summary:
As a follow up for https://github.com/facebook/rocksdb/issues/12422 , this PR includes the following two changes.
- Removal of `direction_` in the MultiCfIterator
- Use of Member Func Template instead of `std::function`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12465
Test Plan:
```
./multi_cf_iterator_test
```
Reviewed By: pdillinger, ltamasi
Differential Revision: D55208448
Pulled By: jaykorean
fbshipit-source-id: 8b3167c1d59839d076afc29097b5ad21a453460a
Summary:
ScopedArenaIterator is not an iterator. It is a pointer wrapper. And we don't need a custom-implemented pointer wrapper when std::unique_ptr can be instantiated with what we want.
So this adds ScopedArenaPtr<T> to replace those uses.
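Conceptually, the replacement is a `std::unique_ptr` whose deleter runs only the destructor, since arena-allocated memory must not be freed individually; a minimal sketch of the idea (not the exact RocksDB definition):
```
#include <memory>

// Arena-allocated objects must have their destructors run, but their
// memory is reclaimed by the arena itself, never by delete/free.
struct DestroyOnlyDeleter {
  template <typename T>
  void operator()(T* ptr) const {
    if (ptr != nullptr) {
      ptr->~T();
    }
  }
};

template <typename T>
using ScopedArenaPtrSketch = std::unique_ptr<T, DestroyOnlyDeleter>;
```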
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12470
Test Plan: CI (including ASAN/UBSAN)
Reviewed By: jowlyzhang
Differential Revision: D55254362
Pulled By: pdillinger
fbshipit-source-id: cc96a0b9840df99aa807f417725e120802c0ae18
Summary:
Fix the heap use after free bug caused by freeing the file system IO buffer in `BlockFetcher::ReadBlock()` instead of the caller.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12464
Test Plan: Update the `DBIOCorruptionTest` tests
Reviewed By: akankshamahajan15
Differential Revision: D55206920
Pulled By: anand1976
fbshipit-source-id: fd6b608a61cd229b20c1e5f348ff3cc92328de0f
Summary:
This option was previously disabled due to a bug in the recovery logic. The recovery code in `DBImpl::RecoverLogFiles` couldn't tell if an EoF reported by the log reader was really an EoF or a possible corruption that made a record look like an old log record. To fix this, the log reader now explicitly reports when it encounters what looks like an old record. The recovery code treats it as a possible corruption, and uses the next sequence number in the WAL to determine if it should continue replaying the WAL.
This PR also fixes a couple of bugs that log file recycling exposed in the backup and checkpoint path.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12403
Test Plan:
1. Add new unit tests to verify behavior upon corruption
2. Re-enable disabled tests for verifying recycling behavior
Reviewed By: ajkr
Differential Revision: D54544824
Pulled By: anand1976
fbshipit-source-id: 12f5ce39bd6bc0d63b0bc6432dc4db510e0e802a
Summary:
This PR contains a few follow ups from https://github.com/facebook/rocksdb/issues/12419 and https://github.com/facebook/rocksdb/issues/12428 including:
1) Handle a special case for `WriteBatch::TimedPut`. When the user-specified write time is `std::numeric_limits<uint64_t>::max()`, it's not treated as an error; instead, a regular `Put` entry is created and written (see the sketch after this list).
2) Update the `InternalIterator::write_unix_time` APIs to handle `kTypeValuePreferredSeqno` entries.
3) FlushJob is updated to use the seqno to time mapping copy in `SuperVersion`. FlushJob previously copied the DB's seqno to time mapping while holding the db mutex and only copied the part of interest, a.k.a. the part that only goes back to the earliest sequence number of the to-be-flushed memtables. While updating FlushJob to use the mapping copy in `SuperVersion`, it's given access to the full mapping to help cover the need to convert `kTypeValuePreferredSeqno`'s write time to a preferred seqno as much as possible.
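For item 1, the special case amounts to a dispatch like this (an illustrative sketch, not the actual `WriteBatch` internals):
```
#include <cstdint>
#include <limits>

#include "rocksdb/write_batch.h"

using namespace ROCKSDB_NAMESPACE;

// Illustrative only; the real logic lives inside WriteBatch::TimedPut.
Status TimedPutSketch(WriteBatch* batch, ColumnFamilyHandle* cf,
                      const Slice& key, const Slice& value,
                      uint64_t write_unix_time) {
  if (write_unix_time == std::numeric_limits<uint64_t>::max()) {
    // Special case from this PR: max() is not an error; write a regular
    // Put entry with no preferred seqno attached.
    return batch->Put(cf, key, value);
  }
  // Otherwise encode a kTypeValuePreferredSeqno entry carrying the
  // user-specified write time (encoding omitted in this sketch).
  return Status::NotSupported("sketch only");
}
```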
Test plans:
Added unit tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12455
Reviewed By: pdillinger
Differential Revision: D55165422
Pulled By: jowlyzhang
fbshipit-source-id: dc022653077f678c24661de5743146a74cce4b47
Summary:
fixes https://github.com/facebook/rocksdb/issues/12409
### Issue
ZSTD_TrainDictionary [[link](a53ed91691/table/block_based/block_based_table_builder.cc (L1894))] runs during SstFileWriter::Finish even when the bottommost_compression option is set to kNoCompression. This reduces throughput for SstFileWriter::Finish.
We construct rocksdb options using ZSTD compression for levels 2 and above. For levels 0 and 1, we set it to kNoCompression. We also set zstd_max_train_bytes to a non-zero positive value (which is applicable only to levels with ZSTD compression enabled). These options are used for the database and also passed to SstFileWriter for creating sst files to be later added to that database. Since BlockBasedTableBuilder::Finish [[link](a53ed91691/table/block_based/block_based_table_builder.cc (L1892))] only checks that zstd_max_train_bytes is a non-zero positive value, it runs ZSTD_TrainDictionary even when it shouldn't, since SstFileWriter is operating at the bottommost level
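The configuration described above corresponds to options roughly like this sketch (the level count and training size are illustrative):
```
#include "rocksdb/options.h"

using namespace ROCKSDB_NAMESPACE;

Options MakeOptions() {
  Options options;
  // ZSTD from level 2 up; no compression for levels 0 and 1.
  options.compression_per_level = {kNoCompression, kNoCompression, kZSTD,
                                   kZSTD, kZSTD, kZSTD, kZSTD};
  options.bottommost_compression = kNoCompression;
  // Applicable only to levels where ZSTD compression is enabled, but
  // before this fix it also triggered ZSTD_TrainDictionary in
  // SstFileWriter even with kNoCompression at the bottommost level.
  options.compression_opts.zstd_max_train_bytes = 1 << 20;
  return options;
}
```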
### Fix
If compression_type is set to kNoCompression, then skip ZSTD_TrainDictionary and dictionary building
### Testing
I see we have tests for the sst file writer with compression type set/unset. Let me know if this isn't covered and I can extend the tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12453
Reviewed By: cbi42
Differential Revision: D55030484
Pulled By: ajkr
fbshipit-source-id: 834de2174c2b087d61bf045ca1ae29f337b821a7
Summary:
Fixing the not-checked status failure as in https://github.com/facebook/rocksdb/actions/runs/8334988399/job/22809612148.
When the status is not ok() for any reason, we do not check the `wal_read_status` because it's not necessary; that unchecked status was causing the test failure when running with `ASSERT_STATUS_CHECKED=1`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12460
Test Plan: Existing tests
Reviewed By: ajkr
Differential Revision: D55104844
Pulled By: jaykorean
fbshipit-source-id: 919b1fddca835494f9087c51c4da6eabc9e8df2b
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`
If the code compiles, this is safe to land.
Reviewed By: palmje
Differential Revision: D55087322
fbshipit-source-id: ca4db7285444306d6c91545cd2c33483dfe05385
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`
If the code compiles, this is safe to land.
Reviewed By: palmje
Differential Revision: D54362227
fbshipit-source-id: ac634ba34f9351ba559c4ed96448f51d6ef33175
Summary:
On file systems that support storage-level data checksums and reconstruction, retry SST block reads for point lookups, scans, and flush and compaction if there's a checksum mismatch on the initial read. A file system can indicate its support by setting the `FSSupportedOps::kVerifyAndReconstructRead` bit in `SupportedOps`.
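A file system would opt in roughly like this sketch (the `SupportedOps` override shape is my reading of the `FSSupportedOps` API):
```
#include "rocksdb/file_system.h"

using namespace ROCKSDB_NAMESPACE;

class ChecksummedFileSystem : public FileSystemWrapper {
 public:
  explicit ChecksummedFileSystem(const std::shared_ptr<FileSystem>& t)
      : FileSystemWrapper(t) {}
  const char* Name() const override { return "ChecksummedFileSystem"; }

  // Advertise that reads can detect and reconstruct corrupted data,
  // which lets RocksDB retry block reads on a checksum mismatch.
  void SupportedOps(int64_t& supported_ops) override {
    FileSystemWrapper::SupportedOps(supported_ops);
    supported_ops |= (1 << FSSupportedOps::kVerifyAndReconstructRead);
  }
};
```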
Tests:
Add new unit tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12427
Reviewed By: ajkr
Differential Revision: D55025941
Pulled By: anand1976
fbshipit-source-id: dbd990cb75e03f756c8a66d42956f645c0b6d55e
Summary:
Update `compaction_service_test` to make sure remote compaction works with a multiple column family setup. Minor refactor to get rid of duplicate code.
Also fixing one quick bug in the existing test util: the test util's `FilesPerLevel` didn't honor `cf_id` properly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12430
Test Plan:
```
./compaction_service_test
```
Reviewed By: ajkr
Differential Revision: D54883035
Pulled By: jaykorean
fbshipit-source-id: 83b4f6f566fed5c4824bfef7de01074354a72b44
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12442
The patch deduplicates and unifies the logic of `WriteBatchWithIndex::{Get,GetEntity}FromBatch` using templates and makes some small code hygiene improvements, including consistently clearing the output value in the various non-success cases.
Reviewed By: jaykorean
Differential Revision: D54922935
fbshipit-source-id: c92e89f905a3c80cef57c2c840f49f806629238f
Summary:
This PR continues https://github.com/facebook/rocksdb/issues/12153 by implementing the missing `Iterator` APIs - `Seek()`, `SeekForPrev()`, `SeekToLast()`, and `Prev()`. A max-heap implementation has been added to handle the reverse direction.
The current implementation does not include upper/lower bounds yet. These will be added in subsequent PRs. The API is still marked as under construction and will be lifted after being added to the stress test.
Please note that changing the iterator direction in the middle of iteration is expensive, as it requires seeking the element in each iterator again in the opposite direction and rebuilding the heap along the way. The first `Next()` after `SeekForPrev()` requires changing the direction under the current implementation. We may optimize this in later PRs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12422
Test Plan: The `multi_cf_iterator_test` has been extended to cover the API implementations.
Reviewed By: pdillinger
Differential Revision: D54820754
Pulled By: jaykorean
fbshipit-source-id: 9eb741508df0f7bad598fb8e6bd5cdffc39e81d1
Summary:
With https://github.com/facebook/rocksdb/issues/12414 enabling `ReadOptions::pin_data`, this bug surfaced as a corrupted per key-value checksum during the crash test. `saved_key_.GetUserKey()` could be a pinned user key, so DBIter should not overwrite it.
In one case, the bug only surfaces when the iterator skips many keys of the same user key. To stress that code path, this PR also added `max_sequential_skip_in_iterations` to the crash test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12451
Test Plan:
- Set ReadOptions::pin_data to true, the bug can be reproed quickly with `./db_stress --persist_user_defined_timestamps=1 --user_timestamp_size=8 --writepercent=35 --delpercent=4 --delrangepercent=1 --iterpercent=20 --nooverwritepercent=1 --prefix_size=8 --prefixpercent=10 --readpercent=30 --memtable_protection_bytes_per_key=8 --block_protection_bytes_per_key=2 --clear_column_family_one_in=0`.
- Set max_sequential_skip_in_iterations to 1 for the other occurrence of the bug.
Reviewed By: jowlyzhang
Differential Revision: D55003766
Pulled By: cbi42
fbshipit-source-id: 23e1049129456684dafb028b6132b70e0afc07fb
Summary:
This PR adds support to return data's approximate unix write time in the iterator property API. The general implementation is:
1) If the entry comes from a SST file, the sequence number to time mapping recorded in that file's table properties will be used to deduce the entry's write time from its sequence number. If no such recording is available, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown except if the entry's sequence number is zero, in which case, 0 is returned. This also means that even if `preclude_last_level_data_seconds` and `preserve_internal_time_seconds` can be toggled off between DB reopens, as long as the SST file's table property has the mapping available, the entry's write time can be deduced and returned.
2) If the entry comes from memtable, we will use the DB's sequence number to write time mapping to do similar things. A copy of the DB's seqno to write time mapping is kept in SuperVersion to allow iterators to have lock free access. This also means a new `SuperVersion` is installed each time DB's seqno to time mapping updates, which is originally proposed by Peter in https://github.com/facebook/rocksdb/issues/11928 . Similarly, if the feature is not enabled, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown.
Needed follow up:
1) The write time for `kTypeValuePreferredSeqno` should be special cased, where it's already specified by the user, so we can directly return it.
2) Flush job can be updated to use DB's seqno to time mapping copy in the SuperVersion.
3) Handle the case when `TimedPut` is called with a write time that is `std::numeric_limits<uint64_t>::max()`. We can make it a regular `Put`.
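Reading the new property could look roughly like this sketch (the property name `rocksdb.iterator.write-time` and the fixed 64-bit value encoding are my understanding of this PR and may differ):
```
#include <cstdint>
#include <cstring>
#include <limits>
#include <string>

#include "rocksdb/iterator.h"

using namespace ROCKSDB_NAMESPACE;

// Sketch: fetch the approximate unix write time for the current entry.
uint64_t GetApproxWriteTime(Iterator* iter) {
  std::string prop;
  Status s = iter->GetProperty("rocksdb.iterator.write-time", &prop);
  if (!s.ok() || prop.size() != sizeof(uint64_t)) {
    // max() means the write time is unknown.
    return std::numeric_limits<uint64_t>::max();
  }
  uint64_t write_time;
  std::memcpy(&write_time, prop.data(), sizeof(write_time));
  return write_time;
}
```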
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12428
Test Plan: Added unit test
Reviewed By: pdillinger
Differential Revision: D54967067
Pulled By: jowlyzhang
fbshipit-source-id: c795b1b7ec142e09e53f2ed3461cf719833cb37a
Summary:
Thanks ltamasi for pointing out this bug.
We were incorrectly overwriting `Status::Incomplete` with `Status::OK` after a table cache miss failed to open the file due to the read being memory-only (`kBlockCacheTier`). The fix is to simply stop overwriting the status.
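For context, `kBlockCacheTier` reads are memory-only and are expected to surface `Status::Incomplete` when satisfying them would require IO:
```
#include <string>

#include "rocksdb/db.h"

using namespace ROCKSDB_NAMESPACE;

Status MemoryOnlyGet(DB* db, const Slice& key, std::string* value) {
  ReadOptions ro;
  ro.read_tier = kBlockCacheTier;  // do not touch disk
  Status s = db->Get(ro, key, value);
  // With this fix, a table cache miss correctly reports IsIncomplete()
  // instead of being overwritten with OK.
  return s;
}
```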
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12443
Reviewed By: cbi42
Differential Revision: D54930128
Pulled By: ajkr
fbshipit-source-id: 52f912a2e93b46e71d79fc5968f8ca35b299213d