Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11858
The patch builds on https://github.com/facebook/rocksdb/pull/11807 and integrates the `FullMergeV3` API into the read and compaction code paths by updating and extending the logic in `MergeHelper`.
In particular, when it comes to merge inputs, the existing `TimedFullMergeWithEntity` is folded into `TimedFullMerge`, since wide-column base values are now handled the same way as plain base values (or no base values for that matter), e.g. they are passed directly to the `MergeOperator`. On the other hand, there is some new differentiation on the output side. Namely, there are now two sets of `TimedFullMerge` variants: one set for contexts where the complete merge result and its value type are needed (used by iterators and compactions), and another set where the merge result is needed in a form determined by the client (used by the point lookup APIs, where e.g. for `Get` we have to extract the value of the default column of any wide-column results).
Implementation-wise, the two sets of overloads use different visitors to process the `std::variant` produced by `FullMergeV3`. This has the benefit of eliminating some repeated code e.g. in the point lookup paths, since `TimedFullMerge` now populates the application's result object (`PinnableSlice`/`string` or `PinnableWideColumns`) directly. Moreover, within each set of variants, there is a separate overload for the no base value/plain base value/wide-column base value cases, which eliminates some repeated branching w/r/t to the type of the base value if any.
Reviewed By: jaykorean
Differential Revision: D49352562
fbshipit-source-id: c2fb9853dba3fbbc6918665bde4195c4ea150a0c
Summary:
FaultInjectionTestFS is not directly writable by default. Should set it to direct writable if there is no write fault injection.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11862
Test Plan: internal stress test failure reduces.
Reviewed By: jaykorean
Differential Revision: D49428108
Pulled By: cbi42
fbshipit-source-id: 5dfa1fbb454272a14f8228a5c496d480d7138ef1
Summary:
This PR contains two fixes:
1. disable write fault injection since it caused several other kinds of internal stress test failures. I'll try to fix those separately before enabling it again.
2. Fix segfault like
```
https://github.com/facebook/rocksdb/issues/5 0x000000000083dc43 in rocksdb::port::Mutex::Lock (this=0x30) at internal_repo_rocksdb/repo/port/port_posix.cc:80
80 internal_repo_rocksdb/repo/port/port_posix.cc: No such file or directory.
https://github.com/facebook/rocksdb/issues/6 0x0000000000465142 in rocksdb::MutexLock::MutexLock (mu=0x30, this=<optimized out>) at internal_repo_rocksdb/repo/util/mutexlock.h:37
37 internal_repo_rocksdb/repo/util/mutexlock.h: No such file or directory.
https://github.com/facebook/rocksdb/issues/7 rocksdb::FaultInjectionTestFS::DisableWriteErrorInjection (this=0x0) at internal_repo_rocksdb/repo/utilities/fault_injection_fs.h:505
505 internal_repo_rocksdb/repo/utilities/fault_injection_fs.h: No such file or directory.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11859
Test Plan: db_stress with no fault injection: `./db_stress --write_fault_one_in=0 --read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --sync_fault_injection=0`
Reviewed By: jaykorean
Differential Revision: D49408247
Pulled By: cbi42
fbshipit-source-id: 0ca01f20e6e81bf52af77818b50d562ef7462165
Summary:
* db_crashtest.py now may set `write_fault_one_in` to 500 for blackbox and whitebox simple test.
* Error injection only applies to writing to SST files. Flush error will cause DB to pause background operations and auto-resume. Compaction error will just re-schedule later.
* File ingestion and back up tests are updated to check if the result status is due to an injected error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11829
Test Plan:
a full round of whitebox simple and blackbox simple crash test
* `python3 ./tools/db_crashtest.py whitebox/blackbox --simple --write_fault_one_in=500`
Reviewed By: ajkr
Differential Revision: D49256962
Pulled By: cbi42
fbshipit-source-id: 68e0c9648d8e03bad39c7672b25d5500fc286d97
Summary:
Fixes https://github.com/facebook/rocksdb/issues/10257 (also see [here](https://github.com/facebook/rocksdb/pull/10355#issuecomment-1684308556)) by releasing compaction files earlier when writing to manifest in LogAndApply(). This is done by passing in a [callback](ba59751430/db/version_set.h (L1199)) to LogAndApply(). The new Version is created in the same critical section where compaction files are released. When compaction picker is picking compaction based on the new version, these compaction files will already be released.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11764
Test Plan:
* Existing unit tests
* A repro unit test to validate that compaction files are released: `./db_compaction_test --gtest_filter=DBCompactionTest.ReleaseCompactionDuringManifestWrite`
* `python3 ./tools/db_crashtest.py --simple whitebox` with some assertions to check compaction files are released
Reviewed By: ajkr
Differential Revision: D48742152
Pulled By: cbi42
fbshipit-source-id: 7560fd0e723a63fe692234015d2b96850f8b5d77
Summary:
before 459969e9, we were using CMake `option()` to represent `PORTABLE`. so the CMake boolean representations like ON, OFF, 0 and 1 are supported. this is also the downstream package maintainers were using before v8.3.2.
in 459969e9, this option is expanded to specify the argument of `-march` passed to compiler in order to be more flexible and hence allows user to specify CPU type directly. but in the typical use cases, user would just want to use "ON" for the best performance on the building host, and "OFF" for a portable build when it comes to a distro package maintainer.
so, in this change, let's check for the boolean representations supported by CMake.
Fixes https://github.com/facebook/rocksdb/issues/11558
Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11724
Reviewed By: anand1976
Differential Revision: D48827447
Pulled By: ajkr
fbshipit-source-id: b2fef7076b2e90ad13a1fbec80e197841fa06d38
Summary:
When skip_memtable is true in MultiGetImpl, The lookup_current is always false, Causes data to be unable to be queried in super_version->current。
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11700
Reviewed By: anand1976
Differential Revision: D49342877
Pulled By: jowlyzhang
fbshipit-source-id: 270a36d049b4cb7fd151a1fa3080300310111271
Summary:
With the async_io option, the Seek happens in 2 phases. Phase 1 starts an asynchronous read on a block cache miss, and phase 2 waits for it to complete and finishes the seek. In both phases, BlockBasedTable::NewDataBlockIterator is called, which tries to lookup the block cache for the data block first before looking in the prefetch buffer. It's optimized by doing the block cache lookup only in the first phase and save some CPU.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11616
Test Plan: Added unit test
Reviewed By: jaykorean
Differential Revision: D47477887
Pulled By: akankshamahajan15
fbshipit-source-id: 0355e0a68fc0ea2eb92340ae42735afcdbcbfd79
Summary:
An internal user wants to be able to dynamically switch between Bloom and Ribbon filters, without a custom FilterPolicy. Making `filter_policy` mutable would actually make issue https://github.com/facebook/rocksdb/issues/10079 worse, because it would be a race on a pointer field, not just on scalars.
As a reasonable compromise until that is fixed, I am enabling dynamic control over Bloom vs. Ribbon choice by making
RibbonFilterPolicy::bloom_before_level mutable, and doing that safely by using an atomic.
I've also slightly tweaked the interpretation of that field so that setting it to INT_MAX really means "always Bloom."
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11838
Test Plan: unit tests added/extended. crash test updated for SetOptions call and tested under TSAN with amplified probability (lower set_options_one_in).
Reviewed By: ajkr
Differential Revision: D49296284
Pulled By: pdillinger
fbshipit-source-id: e4251c077510df9a9c719876f482448c0d15402a
Summary:
- As a follow up from https://github.com/facebook/rocksdb/issues/11799, adding `Env::IOActivity::kMultiGetEntity` support to `DBImpl::MultiGetEntity()`.
## Minor Refactor
- Because both `DBImpl::MultiGet()` and `DBImpl::MultiGetEntity()` call `DBImpl::MultiGetCommon()` which later calls `DBImpl::MultiGetWithCallback()` where we check `Env::IOActivity::kMultiGet`, minor refactor was needed so that we don't check `Env::IOActivity::kMultiGet` for `DBImpl::MultiGetEntity()`.
- I still see more areas for refactoring to avoid duplicate code of checking IOActivity and setting it when Unknown, but this will be addressed separately.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11842
Test Plan:
- Added the `ThreadStatus::OperationType::OP_MULTIGETENTITY` in `db_stress` to verify the pass-down IOActivity in a thread aligns with the actual activity the thread is doing.
```
python3 tools/db_crashtest.py blackbox --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --interval=10
python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --interval=10
python3 tools/db_crashtest.py blackbox --cf_consistency --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --interval=10
```
Reviewed By: ltamasi
Differential Revision: D49329575
Pulled By: jaykorean
fbshipit-source-id: 05198f1d3f92e6be3d42a3d184bacb3ab2ce6923
Summary:
This PR resolves https://github.com/facebook/rocksdb/issues/10487 & https://github.com/facebook/rocksdb/issues/10536, user code needs to call Refresh() periodically.
The main code change is to support range deletions. A range tombstone iterator uses a sequence number as upper bound to decide which range tombstones are effective. During Iterator refresh, this sequence number upper bound needs to be updated for all range tombstone iterators under DBIter and LevelIterator. LevelIterator may create new table iterators and range tombstone iterator during scanning, so it needs to be aware of iterator refresh. The code path that propagates this change is `db_iter_->set_sequence(read_seq) -> MergingIterator::SetRangeDelReadSeqno() -> TruncatedRangeDelIterator::SetRangeDelReadSeqno() and LevelIterator::SetRangeDelReadSeqno()`.
This change also fixes an issue where range tombstone iterators created by LevelIterator may access ReadOptions::snapshot, even though we do not explicitly require users to keep a snapshot alive after creating an Iterator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10594
Test Plan:
* New unit tests.
* Add Iterator::Refresh(snapshot) to stress test. Note that this change only adds tests for refreshing to the same snapshot since this is the main target use case.
TODO in a following PR:
* Stress test Iterator::Refresh() to different snapshots or no snapshot.
Reviewed By: ajkr
Differential Revision: D48456896
Pulled By: cbi42
fbshipit-source-id: 2e642c04e91235cc9542ef4cd37b3c20823bd779
Summary:
**Context/Summary:**
It allows db bench reflect the default behavior of this option. For example, we recently changed its default value.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11831
Test Plan: No code change
Reviewed By: cbi42
Differential Revision: D49253690
Pulled By: hx235
fbshipit-source-id: 445d4e54f62b4b538626e301a3014d2f00849d30
Summary:
Application using rocksdb today dont have much control over the cost of reads when merge-ops are enabled, expect for waiting for compactions to kick in or using max_successive_merges hint, which only applies to memtable. This change adds Transaction::CollapseKey api giving applications the ability to request merge chain collapse on-demand, when they detect high read costs due to merges. Currently, this only supported on PessimisticTransactions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11815
Test Plan: Add a unit test
Reviewed By: ajkr
Differential Revision: D49309387
Pulled By: sarangbh
fbshipit-source-id: a1eb5cc9e3bd4b3206a36150aacead770318e3e1
Summary:
Update the logic in FilePrefetchBuffer to update `upper_bound_offset_` during reseek. During Reseek, `iterate_upper_bound` can be changed dynamically. So added an API to update that in FilePrefetchBuffer.
Added unit test to confirm the behavior.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11775
Test Plan:
- Check stress tests in case there is any failure after this diff.
- make crash_test -j32 with auto_readahead_size=1 passed locally
Reviewed By: anand1976
Differential Revision: D48815177
Pulled By: akankshamahajan15
fbshipit-source-id: 5f44fbb3af06c86a1c38f139c5fa4543891837f4
Summary:
When `MultiGet` acquires `SuperVersion` via locking the db mutex and get the current `ColumnFamilyData::super_version_`, its corresponding cleanup logic is not correctly done.
It's currently doing this:
`MultiGetColumnFamilyData::cfd->GetSuperVersion().Unref()`
This operates on the most recent `SuperVersion` without locking db mutex , which is not thread safe by itself. And this unref operation is intended for the originally acquired `SuperVersion` instead of the current one. Because a race condition could happen where a new `SuperVersion` is installed in between this `MultiGet`'s ref and unref. When this race condition does happen, it's not sufficient to just unref the `SuperVersion`, `DBImpl::CleanupSuperVersion` should be called instead to properly clean up the `SuperVersion` had this `MultiGet` call be its last reference holder.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11830
Test Plan:
`make all check`
Added a unit test that would originally fail
Reviewed By: ltamasi
Differential Revision: D49287715
Pulled By: jowlyzhang
fbshipit-source-id: 8353636ee11b2e90d85c677a96a92360072644b0
Summary:
`GetEntity` API support for ReadOnly DB and Secondary DB.
- Introduced `GetImpl()` with `GetImplOptions` in `db_impl_readonly` and refactored current `Get()` logic into `GetImpl()` so that look up logic can be reused for `GetEntity()` (Following the same pattern as `DBImpl::Get()` and `DBImpl::GetEntity()`)
- Introduced `GetImpl()` with `GetImplOptions` in `db_impl_secondary` and refactored current `GetImpl()` logic. This is to make `DBImplSecondary::Get/GetEntity` consistent with `DBImpl::Get/GetEntity` and `DBImplReadOnly::Get/GetEntity`
- `GetImpl()` in `db_impl` is now virtual. both `db_impl_readonly` and `db_impl_secondary`'s `Get()` override are no longer needed since all three dbs now have the same `Get()` which calls `GetImpl()` internally.
- `GetImpl()` in `DBImplReadOnly` and `DBImplSecondary` now pass in `columns` instead of `nullptr` in lookup functions like `memtable->get()`
- Introduced `GetEntity()` API in `DBImplReadOnly` and `DBImplSecondary` which simply calls `GetImpl()` with `columns` set in `GetImplOptions`.
- Introduced `Env::IOActivity::kGetEntity` and set read_options.io_activity to `Env::IOActivity::kGetEntity` for `GetEntity()` operations (in db_impl)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11799
Test Plan:
**Unit Tests**
- Added verification in `DBWideBasicTest::PutEntity` by Reopening DB as ReadOnly with the same setup.
- Added verification in `DBSecondaryTest::ReopenAsSecondary` by calling `PutEntity()` and `GetEntity()` on top of existing `Put()` and `Get()`
- `make -j64 check`
**Crash Tests**
- `python3 tools/db_crashtest.py blackbox --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --inter
val=10`
- `python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 `
- `python3 tools/db_crashtest.py blackbox --cf_consistency --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --inter
val=10`
Reviewed By: ltamasi
Differential Revision: D49037040
Pulled By: jaykorean
fbshipit-source-id: a0648253ded6e91af7953de364ed3c6bf163626b
Summary:
**Context/Summary:** `rocksdb.file.read.db.open.micros` is landed in 8.3 but not 8.6. It was included in the HISTORY due to an error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11839
Test Plan: no code change; Will cherry pick this to 8.6 branch when landed.
Reviewed By: anand1976
Differential Revision: D49294250
Pulled By: hx235
fbshipit-source-id: b2ac10758a15eadd5c129d80e93e1c3d0aa569cb
Summary:
When executing ClipColumnFamily, if end_key is equal to largest_user_key in a file, this key will not be deleted. So we need to change less than to less than or equal to
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11811
Reviewed By: ajkr
Differential Revision: D49206936
Pulled By: cbi42
fbshipit-source-id: 3e8bcb7b52040a9b4d1176de727616cc298d3445
Summary:
`last_stats_dump_time_microsec_` is not used after initialization.
I guess that it was previously used to implement periodically dumping stats,
but this functionality has now been delegated to the `PeriodicTaskScheduler`.
4b79e8c003/db/db_impl/db_impl.cc (L770-L778)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11824
Reviewed By: cbi42
Differential Revision: D49278311
Pulled By: jowlyzhang
fbshipit-source-id: 5856245580afc026e6b490755a45c5436a2375c9
Summary:
As discussed in https://github.com/facebook/rocksdb/issues/11730 , this PR tracks the effective `full_history_ts_low` per SuperVersion and update existing sanity checks for `ReadOptions.timestamp >= full_history_ts_low` to use this per SuperVersion `full_history_ts_low` instead. This also means the check is moved to happen after acquiring SuperVersion.
There are two motivations for this: 1) Each time `full_history_ts_low` really come into effect to collapse history, a new SuperVersion is always installed, because it would involve either a Flush or Compaction, both of which change the LSM tree shape. We can take advantage of this to ensure that as long as this sanity check is passed, even if `full_history_ts_low` can be concurrently increased and collapse some history above the requested `ReadOptions.timestamp`, a read request won’t have visibility to that part of history through this SuperVersion that it already acquired. 2) the existing sanity check uses `ColumnFamilyData::GetFullHistoryTsLow` without locking the db mutex, which is the mutex all `IncreaseFullHistoryTsLow` operation is using when mutating this field. So there is a race condition. This also solve the race condition on the read path.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11784
Test Plan:
`make all check`
// Checks success scenario really provide the read consistency attribute as mentioned above.
`./db_with_timestamp_basic_test --gtest_filter=*FullHistoryTsLowSanityCheckPassReadIsConsistent*`
// Checks failure scenario cleans up SuperVersion properly.
`./db_with_timestamp_basic_test --gtest_filter=*FullHistoryTsLowSanityCheckFail*`
`./db_secondary_test --gtest_filter=*FullHistoryTsLowSanityCheckFail*`
`./db_readonly_with_timestamp_test --gtest_filter=*FullHistoryTsLowSanitchCheckFail*`
Reviewed By: ltamasi
Differential Revision: D48894795
Pulled By: jowlyzhang
fbshipit-source-id: 1f801fe8e1bc8e63ca76c03cbdbd0974e5ff5bf6
Summary:
**Context/Summary:**
A size amp compaction can select and prevent a large number of L0 files from being selected by other compaction. If such compaction is running long or being queued behind, these L0 files will exist for long. With a few more flushes, we can run into write stop triggered by # L0 files. We've seen this happen on a host with many DBs sharing same thread pool, each of these DBs submits a size amp compaction with (110-180)+ files to the pool upon reopen and with a few more flushes, they hit the 200 L0 write stop condition.
The idea is to exclude some L0 input files in size amp compaction that are harmless to size amp reduction but improve the situation described above.
The exclusion algorithm is in `MightExcludeNewL0sToReduceWriteStop()` with two elements:
1. #L0 to exclude + (level0_stop_writes_trigger - num_l0_input_pre_exclusion) should be in the range of [min_merge_width, max_merge_width].
- This is to ensure we are excluding enough L0 input files but not too many to be qualified to picked for another compaction along with the incoming future L0 files before write stop.
2. Based on (1), further constrain #L0 to exclude based on the post-exclusion compaction score. The goal is to ensure our exclusion will not disqualify the size amp compaction from being a size amp compaction after exclusion.
**Tets plan:** New unit test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11749
Reviewed By: ajkr
Differential Revision: D48850631
Pulled By: hx235
fbshipit-source-id: 2c321036e164087c36319dd5645cbbf6b6152092
Summary:
Existing compaction statistics are `COMPACTION_TIME` and `COMPACTION_CPU_TIME` which are histogram and are logged at the end of a compaction. The new statistics `COMPACTION_CPU_TOTAL_TIME` is for cumulative total compaction time which is updated regularly during a compaction. This allows user to more closely track compaction cpu usage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11741
Test Plan: * new unit test `DBTestWithParam.CompactionTotalTimeTest`
Reviewed By: ajkr
Differential Revision: D48608094
Pulled By: cbi42
fbshipit-source-id: b597109f3e4bf2237fb5a216b6fd036e5363b4c0
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11823
Similarly to https://github.com/facebook/rocksdb/pull/11813, the patch is a small refactoring that eliminates some copy-paste around sorting the columns of entities by column name.
Reviewed By: jaykorean
Differential Revision: D49195504
fbshipit-source-id: d48c9f290e3203f838cc5949856c469ecf730008
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11807
For now, RocksDB has limited support for using merge with wide columns: when a bunch of merge operands have to be applied to a wide-column base value, RocksDB currently passes only the value of the default column to the application's `MergeOperator`, which means there is no way to update any other columns during a merge. As a first step in making this more general, the patch adds a new API `FullMergeV3` to `MergeOperator`.
`FullMergeV3`'s interface enables applications to receive a plain, wide-column, or non-existent base value as merge input, and to produce a new plain value, a new wide-column value, or an existing operand as merge output. Note that there are no limitations on the column names and values if the merge result is a wide-column entity. Also, the interface is general in the sense that it makes it possible e.g. for a merge that takes a plain base value and some deltas to produce a wide-column entity as a result.
For backward compatibility, the default implementation of `FullMergeV3` falls back to `FullMergeV2` and implements the current logic where merge operands are applied to the default column of the base entity and any other columns are unchanged. (Note that with `FullMergeV3` in the `MergeOperator` interface, this behavior will become customizable.)
This patch just introduces the new API and the default backward compatible implementation. I plan to integrate `FullMergeV3` into the query and compaction logic in subsequent diffs.
Reviewed By: jaykorean
Differential Revision: D49117253
fbshipit-source-id: 109e016f25cd130fc504790818d927bae7fec6bd
Summary:
During Seek, the iterator seeks every file on L0. In async_io, it submit the requests to seek on every file on L0 asynchronously using RocksDB FilePrefetchBuffer.
However, FilePrefetchBuffer does alignment and reads extra bytes then needed that can increase the throughput.
In case of non direct io, the alignment can be avoided.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11793
Test Plan:
- Added a unit test that fails without this PR.
- make crash_test -j32 completed successfully
Reviewed By: anand1976
Differential Revision: D48985051
Pulled By: akankshamahajan15
fbshipit-source-id: 2d130a9e7c3df9c4fcd0408406e6277ab75a4389
Summary:
This function uses racing reads for heuristic performance improvement. My change in https://github.com/facebook/rocksdb/issues/11792 only worked for clang, not gcc, and gcc does not accurately handle TSAN suppressions. I would have to mark much more code as suppressed than I want to.
So I've taken a different approach: TSAN build does not use the racing reads but substitutes random results, as an extra test that a "correct" value is not needed for correct overall behavior.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11806
Test Plan: manual TSAN builds & tests with cache_bench
Reviewed By: ajkr
Differential Revision: D49100115
Pulled By: pdillinger
fbshipit-source-id: d6d0dfb796d710b953212dd3fc171b6e88fadea1
Summary:
This is for fb internal use. Please see the comment in internal Phabricator for details.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11803
Reviewed By: hx235
Differential Revision: D49065093
Pulled By: jaykorean
fbshipit-source-id: acd71d7c1163f3c95c59c427caf944dacfe58ef6
Summary:
Tests a scenario where range tombstone reseek used to cause MergingIterator to discard non-ok status.
Ran on main without https://github.com/facebook/rocksdb/issues/11786:
```
./db_range_del_test --gtest_filter="*RangeDelReseekAfterFileReadError*"
Note: Google Test filter = *RangeDelReseekAfterFileReadError*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBRangeDelTest
[ RUN ] DBRangeDelTest.RangeDelReseekAfterFileReadError
db/db_range_del_test.cc:3577: Failure
Value of: iter->Valid()
Actual: true
Expected: false
[ FAILED ] DBRangeDelTest.RangeDelReseekAfterFileReadError (64 ms)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11790
Reviewed By: ajkr
Differential Revision: D48972869
Pulled By: cbi42
fbshipit-source-id: b1a71867533b0fb60af86f8ce8a9e391ba84dd57
Summary:
https://github.com/facebook/rocksdb/issues/11789 added error injection during compaction to db_stress. However, error injection was not disabled after compaction completion, which resulted in some test failures due to stale errors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11798
Reviewed By: cbi42
Differential Revision: D49022821
Pulled By: anand1976
fbshipit-source-id: 3cbfe18d55bee393697e063d05e7a7a7f88b7635
Summary:
Some repro unit tests for the bug fixed in https://github.com/facebook/rocksdb/pull/11782.
Ran on main without https://github.com/facebook/rocksdb/pull/11782:
```
./db_compaction_test --gtest_filter='*ErrorWhenReadFileHead'
Note: Google Test filter = *ErrorWhenReadFileHead
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBCompactionTest
[ RUN ] DBCompactionTest.ErrorWhenReadFileHead
db/db_compaction_test.cc:10105: Failure
Value of: s.IsIOError()
Actual: false
Expected: true
[ FAILED ] DBCompactionTest.ErrorWhenReadFileHead (3960 ms)
./db_iterator_test --gtest_filter="*ErrorWhenReadFile*"
Note: Google Test filter = *ErrorWhenReadFile*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBIteratorTest
[ RUN ] DBIteratorTest.ErrorWhenReadFile
db/db_iterator_test.cc:3399: Failure
Value of: (iter->status()).ok()
Actual: true
Expected: false
[ FAILED ] DBIteratorTest.ErrorWhenReadFile (280 ms)
[----------] 1 test from DBIteratorTest (280 ms total)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11788
Reviewed By: ajkr
Differential Revision: D48940284
Pulled By: cbi42
fbshipit-source-id: 06f3c5963f576db3f85d305ffb2745ee13d209bb
Summary:
Currently, rocksdb users would use the event listener to catch the compaction/flush event and log them if any. But now the reason is an integer type instead of a human-readable string, so we would like to convert them into a human-readable string.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11778
Reviewed By: jaykorean
Differential Revision: D49012934
Pulled By: ajkr
fbshipit-source-id: a4935b95d70c1be02aec65da7bf1c98a8cf8b933
Summary:
There was a `#include "port/lang.h"` situated inside an `extern "C" {` which just started causing the header to be unusuable in some contexts. This was a regression on the CircleCI job build-linux-unity-and-headers in https://github.com/facebook/rocksdb/issues/11792
The include, and another like it, now appears obsolete so removed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11797
Test Plan: local `make check-headers` and `make`, CI
Reviewed By: jaykorean
Differential Revision: D48976826
Pulled By: pdillinger
fbshipit-source-id: 131d66969e045f2ded0f8936924ee30c9ef2655a
Summary:
- Fixed misspellings of "inject"
- Made user read errors retryable when `FLAGS_inject_error_severity == 1`
- Added compaction read errors when `FLAGS_read_fault_one_in > 0`. These are always retryable so that the DB will keep accepting writes
- Reenabled setting `compaction_readahead_size` in crash test. The reason for disabling it was to "keep the test clean", which is not a good enough reason to skip testing it
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11789
Test Plan:
With https://github.com/facebook/rocksdb/issues/11782 reverted, reproduced the bug:
- Build: `make -j56 db_stress`
- Command: `TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --interval=10 --max_key=1000000`
- Output:
```
stderr has error message:
***put or merge error: Corruption: Compaction number of input keys does not match number of keys processed.***
```
Reviewed By: cbi42
Differential Revision: D48939994
Pulled By: ajkr
fbshipit-source-id: a1efb799efecdfd5d9cfd185e4a6321db8fccfbb