Summary:
**Context:**
DB destruction will wait for ongoing error recovery through `EndAutoRecovery()` and join the recovery thread: 519f2a41fb/db/db_impl/db_impl.cc (L525) -> 519f2a41fb/db/error_handler.cc (L250) -> 519f2a41fb/db/error_handler.cc (L808-L823)
However, due to a race between flush error recovery and db destruction, recovery can actually start after this wait during db shutdown. The consequence is that the recovery thread created as part of this recovery is never joined before it is destroyed as part of the db destruction. It then crashes the program as below.
```
std::terminate()
std::default_delete<std::thread>::operator()(std::thread*) const
std::unique_ptr<std::thread, std::default_delete<std::thread>>::~unique_ptr()
rocksdb::ErrorHandler::~ErrorHandler() (rocksdb/db/error_handler.h:31)
rocksdb::DBImpl::~DBImpl() (rocksdb/db/db_impl/db_impl.cc:725)
rocksdb::DBImpl::~DBImpl() (rocksdb/db/db_impl/db_impl.cc:725)
rocksdb::DBTestBase::Close() (rocksdb/db/db_test_util.cc:678)
```
**Summary:**
This PR fixes it by checking whether `EndAutoRecovery()` has been called before creating such a thread. This fix is similar to how we currently [handle](519f2a41fb/db/error_handler.cc (L688-L694)) this case inside the created recovery thread.
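A minimal sketch of the idea, assuming a hypothetical handler class (`RecoverySketch`, its mutex, flag, and thread members are all illustrative, not the actual `ErrorHandler` code): the flag set by `EndAutoRecovery()` is checked under the same mutex before the recovery thread is created.
```
#include <memory>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the error handler: skip spawning a recovery
// thread once EndAutoRecovery() has run, so shutdown never races with a
// thread that nobody will join.
class RecoverySketch {
 public:
  void EndAutoRecovery() {
    std::unique_ptr<std::thread> to_join;
    {
      std::lock_guard<std::mutex> lock(mu_);
      end_recovery_ = true;
      to_join = std::move(recovery_thread_);
    }
    if (to_join && to_join->joinable()) {
      to_join->join();
    }
  }

  void MaybeStartRecoveryThread() {
    std::lock_guard<std::mutex> lock(mu_);
    if (end_recovery_) {
      return;  // shutdown already requested; do not create an unjoinable thread
    }
    recovery_thread_.reset(new std::thread([] { /* recovery work */ }));
  }

 private:
  std::mutex mu_;
  bool end_recovery_ = false;
  std::unique_ptr<std::thread> recovery_thread_;
};
```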
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12002
Test Plan: A new UT reproduced the crash before this fix and passes after.
Reviewed By: ajkr
Differential Revision: D50586191
Pulled By: hx235
fbshipit-source-id: b372f6d7a94eadee4b9283b826cc5fb81779a093
Summary:
With a fragmented record spanning multiple blocks, if any of the following blocks is corrupted with arbitrary data such that the interpreted log number is less than the current log number, the program will fall into an infinite loop because the buffer's leading bytes are not skipped.
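For illustration, a self-contained sketch of the general failure pattern (not the actual log reader code): a parse loop that drops an unusable fragment without consuming its bytes never makes progress.
```
#include <algorithm>
#include <cstddef>
#include <string>

// Sketch of the bug pattern: when a fragment is rejected (e.g. it carries an
// older log number), the reader must still advance past its bytes, otherwise
// the same bytes are re-examined forever.
size_t DrainBuffer(std::string buffer) {
  size_t records = 0;
  while (!buffer.empty()) {
    // Pretend the first 8 bytes (or whatever is left) form one fragment.
    size_t fragment_len = std::min<size_t>(8, buffer.size());
    bool usable = false;  // e.g. fragment interpreted as an older log number
    if (usable) {
      ++records;
    }
    // Must happen even when the fragment is dropped; skipping this for
    // unusable fragments reproduces the infinite loop described above.
    buffer.erase(0, fragment_len);
  }
  return records;
}
```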
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11979
Test Plan: existing unit tests
Reviewed By: ajkr
Differential Revision: D50604408
Pulled By: jowlyzhang
fbshipit-source-id: e50a0c7e7c3d293fb9d5afec0a3eb4a1835b7a3b
Summary:
- Right now in the blackbox test we don't exit if there are errors in stderr, as we do in whitebox crash tests. As a result those errors are swallowed.
It only errors out if the state is unexpected.
One example that was noticed in a blackbox crash test -
```
stderr has error message:
***Error restoring historical expected values: Corruption: DB is older than any restorable expected state***
Running db_stress with pid=30454: /packages/rocksdb_db_stress_internal_repo/rocks_db_stress ....
```
- This diff also adds support to export files - the db_crashtest.py file - so it can be used by a different repo.
Reviewed By: ajkr
Differential Revision: D50564889
fbshipit-source-id: 7bafbbc6179dc79467ca2b680fe83afc7850616a
Summary:
**Context/Summary:**
We ignore trace writing status, e.g., 543191f2ea/db/db_impl/db_impl_write.cc (L221-L222)
If a write to the trace file fails, subsequent trace writes will continue onto the same file.
This will trigger the assertion `assert(sync_without_flush_called_)` intended to catch writes to a file that has previously seen an error, added in https://github.com/facebook/rocksdb/pull/10489, https://github.com/facebook/rocksdb/pull/10555
An alternative (rejected) is to handle the trace writing status at a higher level, e.g., 543191f2ea/db/db_impl/db_impl_write.cc (L221-L222). However, it makes sense to ignore such a status considering tracing is not a critical component but an assistant one to db operation, and this alternative requires more code change. So it's better to handle the failure at a lower level, as this PR does.
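One way to realize the low-level handling described above (an illustrative sketch only; `FailureLatchingTraceWriterSketch` and `WriteToFile` are hypothetical, not the actual TraceWriter code): remember the first failure and turn later writes into no-ops, so a file that has seen an error is never written to again.
```
#include <string>

#include "rocksdb/status.h"

// Hypothetical wrapper showing the latch-first-error approach.
class FailureLatchingTraceWriterSketch {
 public:
  rocksdb::Status Write(const std::string& data) {
    if (!first_error_.ok()) {
      return first_error_;  // skip the underlying file after the first error
    }
    rocksdb::Status s = WriteToFile(data);  // placeholder for the real write
    if (!s.ok()) {
      first_error_ = s;
    }
    return s;
  }

 private:
  rocksdb::Status WriteToFile(const std::string& /*data*/) {
    return rocksdb::Status::OK();  // stand-in; the real write can fail
  }
  rocksdb::Status first_error_;
};
```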
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11996
Test Plan: Added a new UT that failed before this PR and passes after
Reviewed By: akankshamahajan15
Differential Revision: D50532467
Pulled By: hx235
fbshipit-source-id: f2032abafd94917adbf89a20841d15b448782a33
Summary:
Add stats for better observability of scan prefetching. It's only implemented for sync scans right now. These stats can help inform future improvements in scan prefetching.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11981
Test Plan: Add a new unit test
Reviewed By: akankshamahajan15
Differential Revision: D50516505
Pulled By: anand1976
fbshipit-source-id: cb1cc6cf02df8295930a49c62b11870020df3f97
Summary:
... until I can reproduce and resolve assertion failures (mostly in PurgeImplLocked) seen in crash test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12000
Test Plan: make blackbox_crash_test
Reviewed By: hx235
Differential Revision: D50565984
Pulled By: pdillinger
fbshipit-source-id: 5eea1638ff2683c41b4f65ee1ffc2398071911e7
Summary:
... and other fixes for crash test after https://github.com/facebook/rocksdb/issues/11922.
* When pre-allocating sequence numbers for establishing a time history, record that last sequence number in the manifest so that it is (most likely) restored on recovery even if no user writes were made or were recovered (e.g. no WAL).
* When pre-allocating sequence numbers for establishing a time history, only do this for actually new DBs.
* Remove the feature that ensures non-zero sequence number on creating the first column family with preserve/preclude option after initial DB::Open. Until fixed in a way compatible with the crash test, this creates a gap where some data written with active preserve/preclude option won't have a known associated time.
Together, these ensure we don't upset the crash test by manipulating sequence numbers after initial DB creation (esp when re-opening with different options). (The crash test expects that the seqno after re-open corresponds to a known point in time from previous crash test operation, matching an expected DB state.)
Follow-up work:
* Re-fill the gap to ensure all data written under preserve/preclude settings have a known time estimate.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11995
Test Plan:
Added to unit test SeqnoTimeTablePropTest.PrePopulateInDB
Verified fixes two crash test scenarios:
## 1st reproducer
First apply
```
diff --git a/db_stress_tool/expected_state.cc b/db_stress_tool/expected_state.cc
index b483e154c..ef63b8d6c 100644
--- a/db_stress_tool/expected_state.cc
+++ b/db_stress_tool/expected_state.cc
@@ -333,6 +333,7 @@ Status FileExpectedStateManager::SaveAtAndAfter(DB* db) {
s = NewFileTraceWriter(Env::Default(), soptions, trace_file_path,
&trace_writer);
}
+ if (getenv("CRASH")) assert(false);
if (s.ok()) {
TraceOptions trace_opts;
trace_opts.filter |= kTraceFilterGet;
```
Then
```
mkdir -p /dev/shm/rocksdb_test/rocksdb_crashtest_expected
mkdir -p /dev/shm/rocksdb_test/rocksdb_crashtest_whitebox
rm -rf /dev/shm/rocksdb_test/rocksdb_crashtest_*/*
CRASH=1 ./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=1 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --preserve_internal_time_seconds=36000
./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=0 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --preserve_internal_time_seconds=0
```
Without the fix you get
```
...
DB path: [/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox]
(Re-)verified 34 unique IDs
Error restoring historical expected values: Corruption: DB is older than any restorable expected state
```
## 2nd reproducer
First apply
```
diff --git a/db_stress_tool/db_stress_test_base.cc b/db_stress_tool/db_stress_test_base.cc
index 62ddead7b..f2654980f 100644
--- a/db_stress_tool/db_stress_test_base.cc
+++ b/db_stress_tool/db_stress_test_base.cc
@@ -1126,6 +1126,7 @@ void StressTest::OperateDb(ThreadState* thread) {
// OPERATION write
TestPut(thread, write_opts, read_opts, rand_column_families, rand_keys,
value);
+ if (getenv("CRASH")) assert(false);
} else if (prob_op < del_bound) {
assert(write_bound <= prob_op);
// OPERATION delete
```
Then
```
rm -rf /dev/shm/rocksdb_test/rocksdb_crashtest_*/*
CRASH=1 ./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=1 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --disable_wal=1 --reopen=0 --preserve_internal_time_seconds=0
./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=0 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --disable_wal=1 --reopen=0 --preserve_internal_time_seconds=3600
```
Without the fix you get
```
DB path: [/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox]
(Re-)verified 34 unique IDs
db_stress: db_stress_tool/expected_state.cc:380: virtual rocksdb::{anonymous}::ExpectedStateTraceRecordHandler::~ExpectedStateTraceRecordHandler(): Assertion `IsDone()' failed.
```
Reviewed By: jowlyzhang
Differential Revision: D50533346
Pulled By: pdillinger
fbshipit-source-id: 1056be45c5b9e537c8c601b28c4b27431a782477
Summary:
Fix https://github.com/facebook/rocksdb/issues/11607
Fix https://github.com/facebook/rocksdb/issues/11679
Fix https://github.com/facebook/rocksdb/issues/11606
Fix https://github.com/facebook/rocksdb/issues/2343
Add bounds checking to `WBWIIteratorImpl`, which will be reflected in `BaseDeltaIterator::delta_iterator_::Valid()`, just like `BaseDeltaIterator::base_iterator_::Valid()`. In this way, the two sub-iterators become more aligned from `BaseDeltaIterator`'s perspective. Like `DBIter`, the added bounds checking caps at either bound when seeking and invalidates the `WBWIIteratorImpl` iterator when it goes below the lower bound or reaches the upper bound.
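A minimal usage sketch of the behavior this enables (error handling omitted; assumes the existing `NewIteratorWithBase` overload that takes a `ReadOptions*`): with this PR, entries in the write batch that fall outside the `ReadOptions` bounds are no longer surfaced by the merged iterator.
```
#include <memory>

#include "rocksdb/db.h"
#include "rocksdb/utilities/write_batch_with_index.h"

// Sketch: the iterator from NewIteratorWithBase() now honors ReadOptions
// bounds on the write-batch side as well as the base DB side.
void ScanWithUpperBound(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf) {
  rocksdb::WriteBatchWithIndex wbwi;
  wbwi.Put(cf, "b", "in_bound_value");
  wbwi.Put(cf, "z", "out_of_bound_value");

  rocksdb::Slice upper_bound("y");
  rocksdb::ReadOptions ro;
  ro.iterate_upper_bound = &upper_bound;

  std::unique_ptr<rocksdb::Iterator> it(
      wbwi.NewIteratorWithBase(cf, db->NewIterator(ro, cf), &ro));
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // "z" from the write batch is never surfaced because it is at or past
    // the upper bound; previously only the base iterator enforced this.
  }
}
```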
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11680
Test Plan:
- A simple test added to write_batch_with_index_test.cc to exercise the bounds checking in `WBWIIteratorImpl`.
- A sophisticated test added to transaction_test.cc to assert that `Transaction`s with different write policies honor the bounds in `ReadOptions`. This holds as long as the `BaseDeltaIterator` correctly coordinates the two sub-iterators to perform iteration and bounds checking.
Reviewed By: ajkr
Differential Revision: D48125229
Pulled By: cbi42
fbshipit-source-id: c9acea52595aed1471a63d7ca6ef15d2a2af1367
Summary:
and some other small enhancements/fixes:
* The main bug fixed is that in some rare cases, the "published" table size might be smaller than the actual table size. This is a transient state that can happen with concurrent growth that is normally fixed after enough insertions, but if the cache is destroyed soon enough after growth, it could fail to fully destroy some entries and cause assertion failures. We can fix this by detecting the true table size in the destructor.
* Change the "too many iterations" debug threshold from 512 to 768. We might have hit at least one false positive failure. (Failed despite legitimate operation.)
* Added some stronger assertions in some places to aid in debugging.
* Use COERCE_CONTEXT_SWITCH to make behavior of Grow less predictable in terms of thread interleaving. (Might add in more places.) This was useful in reproducing the destructor bug.
* Fix some comments with typos or that were based on earlier revisions of the code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11988
Test Plan:
Variants of this bug-finding command:
```
USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 COERCE_CONTEXT_SWITCH=1 DEBUG_LEVEL=2 make -j32 cache_bench && while ROCKSDB_DEBUG=1 ./cache_bench -cache_type=auto_hyper_clock_cache -histograms=0 -cache_size=80000000 -threads=32 -populate_cache=0 -ops_per_thread=1000 -num_shard_bits=0; do :; done
```
Reviewed By: jowlyzhang
Differential Revision: D50470318
Pulled By: pdillinger
fbshipit-source-id: d407a8bb0b6d2ddc598a954c319a1640136f12f2
Summary:
Context/Summary: as titled
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11957
Test Plan: piggyback on existing tests; fixed a failed test due to adding new stats
Reviewed By: ajkr, cbi42
Differential Revision: D50294310
Pulled By: hx235
fbshipit-source-id: d99b97ebac41efc1bdeaf9ca7a1debd2927d54cd
Summary:
Add a new method to check if a key exists in the database. It avoids copying data between C++ and Java code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11705
Reviewed By: ajkr
Differential Revision: D50370934
Pulled By: akankshamahajan15
fbshipit-source-id: ab2d42213fbebcaff919b0ffbbef9d45e88ca365
Summary:
... when compiled with ASSERT_STATUS_CHECKED = 1.
The main change is in iterator_wrapper.h. The remaining changes are just fixing existing unit tests. Adding this check to IteratorWrapper gives good coverage as the class is used in many places, including child iterators under merging iterator, merging iterator under DB iter, file_iter under level iterator, etc. This change can catch the bug fixed in https://github.com/facebook/rocksdb/issues/11782.
Future follow up: enable `ASSERT_STATUS_CHECKED=1` for stress test and for DEBUG_LEVEL=0.
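For reference, a short sketch of the user-facing discipline this kind of check encourages; the internal check here lives in `iterator_wrapper.h` and applies to internal iterators, but the general expectation under `ASSERT_STATUS_CHECKED` builds is that an iterator's `status()` is examined before the iterator goes away.
```
#include <cassert>
#include <memory>

#include "rocksdb/db.h"

// Sketch: end every scan with an explicit status() check so status-checked
// builds never see an iterator destroyed with an unexamined status.
void ScanAll(rocksdb::DB* db) {
  std::unique_ptr<rocksdb::Iterator> it(
      db->NewIterator(rocksdb::ReadOptions()));
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // consume it->key() / it->value()
  }
  rocksdb::Status s = it->status();  // check before the iterator is destroyed
  assert(s.ok());
}
```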
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11975
Test Plan:
* `ASSERT_STATUS_CHECKED=1 DEBUG_LEVEL=2 make -j32 J=32 check`
* I tried to run stress test with `ASSERT_STATUS_CHECKED=1`, but there are a lot of existing stress code that ignore status checking, and fail without the change in this PR. So defer that to a follow up task.
Reviewed By: ajkr
Differential Revision: D50383790
Pulled By: cbi42
fbshipit-source-id: 1a28ce0f5fdf1890f93400b26b3b1b3a287624ce
Summary:
As titled. The most notable place that marks the feature as experimental is its wiki page, which has been updated. This PR also removes the experimental marker from a few places for this feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11974
Reviewed By: ltamasi
Differential Revision: D50383640
Pulled By: jowlyzhang
fbshipit-source-id: 0bfe26ceda0793515f54b602cf3cd13d0737ec25
Summary:
A race condition between recovery and backup can happen with error messages like this:
```Failure in BackupEngine::CreateNewBackup with: IO error: No such file or directory: While opening a file for sequentially reading: /dev/shm/rocksdb_test/rocksdb_crashtest_whitebox/002653.log: No such file or directory```
PR https://github.com/facebook/rocksdb/issues/6949 introduced disabling file deletion during error handling of manifest IO errors. The aforementioned race condition is caused by this chain of events:
[Backup engine] disable file deletion
[Recovery] disable file deletion <= this is optional for the race condition, it may or may not get called
[Backup engine] get list of file to copy/link
[Recovery] force enable file deletion
.... some files referred to by the backup engine get deleted
[Backup engine] copy/link file <= error no file found
This PR fixes this with:
1) The recovery thread currently force-enables file deletion whenever file deletion is disabled, regardless of whether the previous error handling was for a manifest IO error and had disabled it in the first place. This means it could incorrectly enable file deletions that other threads, such as backup threads or file snapshotting threads, intended to keep disabled. This PR makes this check explicit before making the call.
2) `disable_delete_obsolete_files_` is designed as a counter to allow different threads to enable and disable file deletion separately. The recovery thread currently does a force enable of file deletion, because `ErrorHandler::SetBGError()` can be called multiple times by different threads when they receive a manifest IO error (details per PR https://github.com/facebook/rocksdb/issues/6949), resulting in `DBImpl::DisableFileDeletions` being called multiple times too. Making a force enable file deletion call that resets the counter `disable_delete_obsolete_files_` to zero is a workaround for this. However, as the race condition shows, it can incorrectly override another thread's (e.g. a backup thread's) intention to keep file deletion disabled. <strike>This PR adds a `std::atomic<int> disable_file_deletion_count_` to the error handler to track the needed counter decrease more precisely</strike>. This PR tracks and caps file deletion enabling/disabling in the error handler.
3) For recovery, the section that finds obsolete files and purges them was moved to run after the attempt to enable file deletion. The actual finding and purging is more likely to happen if file deletion was previously disabled and gets re-enabled now. An internal function `DBImpl::EnableFileDeletionsWithLock` was added to support changes 2) and 3). Some useful logging was explicitly added to keep those log messages around.
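A minimal sketch of the counter semantics involved, using the public API as it exists around this change (assumes `EnableFileDeletions()` still takes its `force` parameter):
```
#include "rocksdb/db.h"

// Sketch: each DisableFileDeletions() bumps a counter, a non-force
// EnableFileDeletions() only re-enables deletions once the counter reaches
// zero, and a force=true call resets the counter outright, which is how a
// recovery path could stomp on a backup engine's disable.
void FileDeletionCounterSketch(rocksdb::DB* db) {
  db->DisableFileDeletions();                // e.g. backup engine
  db->DisableFileDeletions();                // e.g. error-recovery path
  db->EnableFileDeletions(/*force=*/false);  // 2 -> 1, deletions still disabled
  db->EnableFileDeletions(/*force=*/false);  // 1 -> 0, deletions re-enabled
}
```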
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11955
Test Plan: existing unit tests
Reviewed By: anand1976
Differential Revision: D50290592
Pulled By: jowlyzhang
fbshipit-source-id: 73aa8331ca4d636955a5b0324b1e104a26e00c9b
Summary:
Fix corruption error - "Corruption: first key in index doesn't match first key in block" - when auto_readahead_size is enabled. The error is caused by a bug where, when index_iter_ moves forward, the first_internal_key of that index_iter_ is not copied. So the Slice points to a different key, resulting in a wrong comparison.
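A generic illustration of the underlying pitfall (not the fixed block-based table code itself): a `Slice` remembered across an iterator advance can silently refer to different bytes, so the key has to be copied into owned storage.
```
#include <string>

#include "rocksdb/slice.h"

// Sketch: a Slice only points at memory owned elsewhere. Copying its bytes
// into a std::string pins the key so it stays valid after the iterator that
// produced it moves forward.
void PinKey(const rocksdb::Slice& key_from_iter, std::string* pinned_copy) {
  pinned_copy->assign(key_from_iter.data(), key_from_iter.size());
  // It is now safe to advance the iterator; pinned_copy keeps the old key.
}
```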
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11961
Test Plan: Ran stress test which reproduced this error.
Reviewed By: anand1976
Differential Revision: D50310589
Pulled By: akankshamahajan15
fbshipit-source-id: 95d8320b8388f1e3822c32024f84754f3a20a631
Summary:
Integrate pmd on the Java API to catch and report common Java coding problems; fix or suppress a basic set of PMD checks.
Link pmd into java build / CI
Add a pmd dependency to maven
Add a jpmd target to Makefile which runs pmd
Add a workflow to Circle CI which runs pmd
Configure an initial default pmd for CI
Repair lots of very simple PMD reports generated when we apply pmd-rules.xml
Repair or add exceptions for PMD rules in the CloseResource category, which finds unclosed AutoCloseable resources.
We special-case the configuration of CloseResource and use the // NOPMD comment in source to avoid reports where we are the API creating an AutoCloseable, and therefore returning an unclosed resource is correct behaviour.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11221
Reviewed By: akankshamahajan15
Differential Revision: D50369930
Pulled By: jowlyzhang
fbshipit-source-id: a41c36b44b3bab7644df3e9cc16afbdf33b84f6b
Summary:
When `secondary_cache_uri` is non-empty and the `cache_type` is not a tiered cache, then sanitize `compressed_secondary_cache_size` to 0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11967
Test Plan: Run crash test
Reviewed By: akankshamahajan15
Differential Revision: D50346157
Pulled By: anand1976
fbshipit-source-id: 57bcbad2ec81fa736f1539a0a41ed6854ded2077
Summary:
Remove the assertion `roundup_len2 >= alignment` from PrefetchAsync, since for non direct_io the buffer size can be less than the alignment, which trips the assertion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11965
Test Plan: Ran the db_stress command that caused the issue without this assertion, and the verification completes successfully.
Reviewed By: anand1976
Differential Revision: D50328955
Pulled By: akankshamahajan15
fbshipit-source-id: 65f55ca230d2bbc63f4e2cc34c7273b22b515879
Summary:
This has become obsolete with the new `options_mutex_` in https://github.com/facebook/rocksdb/pull/11929
* Remove now-unnecessary parameter from WriteOptionsFile
* Rename (and negate) other parameter for better clarity (the caller shouldn't tell the callee what the callee needs, just what the caller knows, provides, and requests)
* Move a ROCKS_LOG_WARN (I/O) in WriteOptionsFile to outside of holding DB mutex.
* Also *avoid* (but not always eliminate) write queue synchronization in SetDBOptions. Still needed if there was a change to WAL size limit or other configuration.
* Improve some comments
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11951
Test Plan: existing unit tests and TSAN crash test local run
Reviewed By: ajkr
Differential Revision: D50247904
Pulled By: pdillinger
fbshipit-source-id: 7dfe445c705ec013886a2adb7c50abe50d83af69
Summary:
We saw frequent stress test failures with error messages like:
```
Verification failed for column family 0 key ...: value_from_db: , value_from_expected: ..., msg: GetEntity verification: Value not found: NotFound:
```
One cause for this is that data in the WAL is lost after a crash. FaultInjectionTestFS was initialized to be not direct writable when write_fault_injection is enabled (see code change). This can cause the first WAL created during DB open to be lost if db_stress is killed before the first WAL is synced. This PR initializes FaultInjectionTestFS to be direct writable. Note that FaultInjectionTestFS will be configured properly for write fault injection after DB open in `RunStressTestImpl()`, so this change should not affect write fault injection coverage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11958
Test Plan:
a repro for the above bug:
```
Simulate crash before first WAL is sealed:
--- a/db_stress_tool/db_stress_driver.cc
+++ b/db_stress_tool/db_stress_driver.cc
@@ -256,6 +256,7 @@ bool RunStressTestImpl(SharedState* shared) {
fprintf(stderr, "Verification failed :(\n");
return false;
}
+ exit(1);
return true;
}
./db_stress --clear_column_family_one_in=0 --column_families=1 --preserve_internal_time_seconds=60 --destroy_db_initially=0 --db=/dev/shm/rocksdb_crashtest_blackbox --db_write_buffer_size=2097152 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --reopen=0 --test_batches_snapshots=0 --threads=1 --ops_per_thread=100 --write_fault_one_in=1000 --sync_fault_injection=0
./db_stress_main --clear_column_family_one_in=0 --column_families=1 --preserve_internal_time_seconds=60 --destroy_db_initially=0 --db=/dev/shm/rocksdb_crashtest_blackbox --db_write_buffer_size=2097152 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --reopen=0 --test_batches_snapshots=0 --sync_fault_injection=1
```
Reviewed By: akankshamahajan15
Differential Revision: D50300347
Pulled By: cbi42
fbshipit-source-id: 3a4881d72197f5ece82364382a0100912e16c2d6
Summary:
Currently, if file ingestion hits an injected error, the stress test is considered failed since it prints a message to stderr containing the keyword "error" and db_crashtest.py looks for it in stderr. This PR fixes it by printing injected errors to stdout.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11956
Test Plan: Check future stress test runs.
Reviewed By: akankshamahajan15
Differential Revision: D50293537
Pulled By: cbi42
fbshipit-source-id: e74915b1b3c6876a61ab6933c4529780362ec02b
Summary:
Introducing the notion of AttributeGroup by adding the `MultiGetEntity()` API retrieving `PinnableAttributeGroups`.
An "attribute group" refers to a logical grouping of wide-column entities within RocksDB. These attribute groups are implemented using column families.
Users can store WideColumns in different CFs for various reasons (e.g. similar access patterns, same types, etc.). This new API `MultiGetEntity()` takes keys and `PinnableAttributeGroups` per key. `PinnableAttributeGroups` is just a list of `PinnableAttributeGroup`s in which we have `ColumnFamilyHandle*`, `Status`, and `PinnableWideColumns`.
Let's say a user stored "hot" wide columns in column family "hot_data_cf" and "cold" wide columns in column family "cold_data_cf" and all other columns in "common_cf".
Prior to this PR, if the user wants to query two keys, "key_1" and "key_2", but is only interested in "common_cf" and "hot_data_cf" for "key_1", and "common_cf" and "cold_data_cf" for "key_2", the user would have to construct input like `keys = ["key_1", "key_1", "key_2", "key_2"]`, `column_families = ["common_cf", "hot_data_cf", "common_cf", "cold_data_cf"]` and search the flat list of `PinnableWideColumns` to find the corresponding <key,CF> combo.
With the new `MultiGetEntity()` introduced in this PR, users can now query only `["common_cf", "hot_data_cf"]` for `"key_1"`, and only `["common_cf", "cold_data_cf"]` for `"key_2"`. The user will get `PinnableAttributeGroups` for each key, and `PinnableAttributeGroups` gives a list of `PinnableAttributeGroup`s where the user can find column family and corresponding `PinnableWideColumns` and the `Status`.
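A rough usage sketch based on the description above; the header name, the `PinnableAttributeGroup` constructor, and the exact `MultiGetEntity()` overload are assumptions rather than copied from the code.
```
#include <vector>

#include "rocksdb/attribute_groups.h"  // assumed header for the new types
#include "rocksdb/db.h"

// Sketch: each key gets its own PinnableAttributeGroups listing only the
// column families the caller cares about for that key.
void QueryHotAndCold(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* common_cf,
                     rocksdb::ColumnFamilyHandle* hot_cf,
                     rocksdb::ColumnFamilyHandle* cold_cf) {
  std::vector<rocksdb::Slice> keys = {"key_1", "key_2"};
  std::vector<rocksdb::PinnableAttributeGroups> results(keys.size());
  results[0].emplace_back(common_cf);  // only common + hot wanted for key_1
  results[0].emplace_back(hot_cf);
  results[1].emplace_back(common_cf);  // only common + cold wanted for key_2
  results[1].emplace_back(cold_cf);

  db->MultiGetEntity(rocksdb::ReadOptions(), keys.size(), keys.data(),
                     results.data());

  for (const auto& groups : results) {
    for (const auto& group : groups) {
      // Each group carries its column family, a per-group Status, and the
      // PinnableWideColumns retrieved from that column family.
      (void)group;
    }
  }
}
```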
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11925
Test Plan:
- `DBWideBasicTest::MultiCFMultiGetEntityAsPinnableAttributeGroups` added
- Will enable this new API in `db_stress` in a separate PR
Reviewed By: ltamasi
Differential Revision: D50017414
Pulled By: jaykorean
fbshipit-source-id: 643611d1273c574bc81b94c6f5aeea24b40c4586
Summary:
Saw this in stress test:
```
db_stress: cache/clock_cache.cc:3152:[...] Assertion `i < 0x2000' failed.
```
The problem is related to Lookups on a chain currently involved in a Grow operation. To avoid Lookup waiting on Grow, Lookup is able to walk a chain whose first part is already migrated and tail is not yet migrated, so is mixed with entries with a different destination home (according to `home_shift`) than what we're looking for. This is fine until we save one of these entries as a safe point in the chain to backtrack to (`read_ref_on_chain`) in case of concurrent modification and end up backtracking to it. In that case, we can get stuck on the wrong destination chain and keep trying to backtrack to an entry that is supposed to be on the correct chain but is not (anymore).
For some reason I haven't quite worked out, I believe it's usually able to recover after some 1000+ loop iterations, so reproducibility depends on the threshold at which we consider a Lookup loop to be too many iterations for a plausibly valid Lookup.
Detecting and working around this case is relatively simple. We can (and must) keep going on the chain but ensure we don't save it as a safe entry to backtrack to.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11948
Test Plan:
The problem could be reproduced in a few minutes with this (debug build):
```
$ while ./cache_bench -cache_type=auto_hyper_clock_cache -histograms=0 -cache_size=80000000 -threads=32 -populate_cache=0 -ops_per_thread=10000 -degenerate_hash_bits=6 -num_shard_bits=0; do :; done
```
At least with a lower threshold on the suspiciously high number of iterations. I've lowered the thresholds quite a bit and am no longer able to reproduce a failure.
Reviewed By: jowlyzhang
Differential Revision: D50236574
Pulled By: pdillinger
fbshipit-source-id: 2cb54a4e02bb51d5933eea41fcd489ab9d34aa96
Summary:
We saw the following TSAN stress test failure:
```
WARNING: ThreadSanitizer: data race (pid=17523)
Write of size 1 at 0x7b8c000008b9 by thread T4243 (mutexes: write M0):
#0 rocksdb::ErrorHandler::RecoverFromRetryableBGIOError() fbcode/internal_repo_rocksdb/repo/db/error_handler.cc:742 (db_stress+0x95f954) (BuildId: 35795dfb86ddc9c4f20ddf08a491f24d)
#1 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (rocksdb::ErrorHandler::*)(), rocksdb::ErrorHandler*>>>::_M_run() fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:74 (db_stress+0x95fc2b) (BuildId: 35795dfb86ddc9c4f20ddf08a491f24d)
#2 execute_native_thread_routine /home/engshare/third-party2/libgcc/11.x/src/gcc-11.x/x86_64-facebook-linux/libstdc++-v3/src/c++11/../../../.././libstdc++-v3/src/c++11/thread.cc:82:18 (libstdc++.so.6+0xdf4e4) (BuildId: 452d1cdae868baeeb2fdf1ab140f1c219bf50c6e)
Previous read of size 1 at 0x7b8c000008b9 by thread T22:
#0 rocksdb::DBImpl::SyncClosedLogs(rocksdb::JobContext*, rocksdb::VersionEdit*) fbcode/internal_repo_rocksdb/repo/db/error_handler.h:76 (db_stress+0x84f69c) (BuildId: 35795dfb86ddc9c4f20ddf08a491f24d)
```
This is due to a data race in accessing `recovery_in_prog_`. This PR fixes it by accessing `recovery_in_prog_` under db mutex before calling `SyncClosedLogs()`. I think the original PR https://github.com/facebook/rocksdb/pull/10489 intended to clear the error if it's a recovery flush. So ideally we can also just check flush reason. I plan to keep a safer change in this PR and make that change in the future if needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11950
Test Plan: check future TSAN stress test results.
Reviewed By: anand1976
Differential Revision: D50242255
Pulled By: cbi42
fbshipit-source-id: 0d487948ef9546b038a34460f3bb037f6e5bfc58
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11947
The patch is a small refactoring of `BaseDeltaIterator`: instead of determining the iterator's value during the `value()` call, it is resolved up front in `UpdateCurrent()`. This has multiple benefits: the value is now computed only once even if `value()` is called multiple times for the same iterator position (note that with the previous code, merges for example would get performed multiple times in this case), it makes it possible to remove the `mutable` modifiers from the `status_` and `merge_result_` members, and it also serves as groundwork for adding wide-column support to `WriteBatchWithIndex`.
Reviewed By: jaykorean
Differential Revision: D50236117
fbshipit-source-id: ae3d05863f811e9bac4c09edc49eca5f37e072a5
Summary:
We were seeing the following stress test failures:
```LevelCompactionBuilder::PickFileToCompact(const rocksdb::autovector<std::pair<int, rocksdb::FileMetaData*> >&, bool): Assertion `!level_file.second->being_compacted' failed```
This can happen when we are picking a file to be compacted from some files marked for compaction, but that file is already being_compacted. We prevent this by always calling `ComputeCompactionScore()` after we pick a compaction and mark some files as being_compacted. However, if SetOptions() is called to disable marking certain files to be compacted, say `enable_blob_garbage_collection`, we currently just skip the relevant logic in `ComputeCompactionScore()` without clearing the existing files already marked for compaction. This PR fixes this issue by clearing the already-marked files in that case.
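For context, a sketch of the kind of dynamic option change that can trigger this path (assuming the usual string-map `SetOptions()` API):
```
#include <cassert>
#include <string>
#include <unordered_map>

#include "rocksdb/db.h"

// Sketch: turning off blob garbage collection at runtime previously left
// behind files that had already been marked for compaction.
void DisableBlobGC(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf) {
  rocksdb::Status s =
      db->SetOptions(cf, {{"enable_blob_garbage_collection", "false"}});
  assert(s.ok());
}
```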
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11946
Test Plan: existing tests.
Reviewed By: akankshamahajan15
Differential Revision: D50232608
Pulled By: cbi42
fbshipit-source-id: 11e4fb5e9d48b0f946ad33b18f7c005f0161f496
Summary:
With the introduction of the `UpdateTieredCache` API, its possible to dynamically change the compressed secondary cache ratio of the total cache capacity. In order to optimize performance, we avoid using a mutex when inserting/releasing placeholder entries, which can result in some inaccuracy in the accounting during the dynamic update. This inaccuracy was causing a runtime error due to an integer underflow in `UpdateCacheReservationRatio`, causing ubsan crash tests to fail. This PR fixes it by explicitly checking for the underflow.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11949
Test Plan:
1. Added a unit test that fails without the fix
2. Run ubsan_crash
Reviewed By: akankshamahajan15
Differential Revision: D50240217
Pulled By: anand1976
fbshipit-source-id: d2f7b79da54eec8b61aec2cc1f2943da5d5847ac
Summary:
In follow-up to https://github.com/facebook/rocksdb/issues/11922, fix a race in functions like CreateColumnFamily and SetDBOptions where the DB reports one option setting but a different one is left in effect.
To fix, we can add an extra mutex around these rare operations. We don't want to hold the DB mutex during I/O or other slow things because of the many purposes it serves, but a mutex more limited to these cases should be fine.
I believe this would fix a write-write race in https://github.com/facebook/rocksdb/issues/10079 but not the read-write race.
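A generic sketch of the shape of this approach, with hypothetical member and method names rather than the actual `DBImpl` ones:
```
#include <mutex>

// Sketch: rare option-changing operations serialize on a dedicated mutex so
// the heavily used DB mutex need not be held across slow work such as
// writing the OPTIONS file.
class OptionsMutexSketch {
 public:
  void SetDBOptionsLike() {
    std::lock_guard<std::mutex> options_lock(options_mutex_);
    // 1. Apply the new options (taking the DB mutex only briefly if needed).
    // 2. Persist the OPTIONS file (slow I/O, done without the DB mutex but
    //    still serialized against other config changes by options_mutex_).
  }

 private:
  std::mutex options_mutex_;  // only guards rare config-changing operations
};
```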
Intended follow-up to this:
* Should be able to remove write thread synchronization from DBImpl::WriteOptionsFile
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11929
Test Plan:
Added two mini-stress style regression tests that fail with >1% probability before this change:
DBOptionsTest::SetStatsDumpPeriodSecRace
ColumnFamilyTest::CreateAndDropPeriodicRace
I haven't reproduced such an inconsistency between in-memory options and on disk latest options, but this change at least improves safety and adds a test anyway:
DBOptionsTest::SetStatsDumpPeriodSecRace
Reviewed By: ajkr
Differential Revision: D50024506
Pulled By: pdillinger
fbshipit-source-id: 1e99a9ed4d96fdcf3ac5061ec6b3cee78aecdda4
Summary:
Closes https://github.com/facebook/rocksdb/issues/5297
The BlockBasedTableConfig (or more generally, the TableFormatConfig) of ColumnFamilyOptions isn't being constructed when column family options are loaded. This happens in `OptionsUtil`, which implements the loading.
In `OptionsUtil` we add the method `private native static TableFormatConfig readTableFormatConfig(final long nativeHandle_)` which defers to a JNI method which creates a `TableFormatConfig` (specifically a `BlockBasedTableConfig`) for the supplied `ColumnFamilyOptions`, by copying the table format attached to the C++ column family options. A new Java constructor for `BlockBasedTableConfig` is implemented which is called from C++ with the parameters retrieved from the table format, and then returned to the calling `readTableFormatConfig`.
At the Java side in `OptionsUtil`, the new `TableFormatConfig` is added as the `tableFormatConfig_` field of the `ColumnFamilyOptions`.
To support this, the new class `BlockBasedTableOptionsJni` and associated support methods are added to 'portal.h'.
`BloomFilter.java` has a constructor and field added so that the filter in use can be read back and inspected.
`FilterPolicyType.java` implements an enum (shadowed in C++) to support transfer of filter policy information back to Java from being read at the C++ side.
Tests were written to cover the block based table config, and were cleaned up and generalised a bit, as some of the methods on OptionsUtil weren't tested and had their own unique JNI method variants which in turn were never exercised in tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10826
Reviewed By: ajkr
Differential Revision: D50136247
Pulled By: jowlyzhang
fbshipit-source-id: 39387448147abc574e99f43979d89b0900e5f81d
Summary:
Thanks ltamasi and ajkr for initial investigations on the test failure. Per the investigations, the following scenario is likely causing the test to fail.
1. Recovery is needed (could be any reason during crash test)
2. Trying to recover from the latest manifest fails (likely due to read error injection)
3. DB opens after recovering from the next manifest, which is different from the one in step 2.
4. Expected state is based on the manifest we tried and failed to recover from in step 2.
5. Two manifests used in step 2 and 3 are confirmed to have difference in LSM trees (Thanks ltamasi again for the finding).
```
2023/10/05-11:24:18.942189 56341 [db/version_set.cc:6079] Trying to recover from manifest: /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox/MANIFEST-007184
...
2023/10/05-11:24:18.978007 56341 [db/version_set.cc:6079] Trying to recover from manifest: /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox/MANIFEST-007180
```
```
[ltamasi@devbig1024.prn1 /tmp/x]$ ldb manifest_dump --hex --path=MANIFEST-007184_renamed_ > 2
[ltamasi@devbig1024.prn1 /tmp/x]$ ldb manifest_dump --hex --path=MANIFEST-007180_renamed_ > 1
[ltamasi@devbig1024.prn1 /tmp/x]$ diff 1 2
--- 1 2023-10-09 10:29:16.966215207 -0700
+++ 2 2023-10-09 10:29:11.984241645 -0700
@@ -13,7 +13,7 @@
7174:3950254[1875617 .. 2203952]['000000000003415B000000000000012B000000000000007D' seq:1906214, type:1 .. '000000000003CA59000000000000012B000000000000005C' seq:2039838, type:1]
7175:88060[2074748 .. 2203892]['000000000003CA6300000000000000CF78787878787878' seq:2167539, type:2 .. '000000000003D08F000000000000012B0000000000000130' seq:2112478, type:0]
--- level 6 --- version# 1 ---
- 7057:3132633[0 .. 2046144]['0000000000000009000000000000000978' seq:0, type:1 .. '0000000000005F8B000000000000012B00000000000002AC' seq:0, type:1]
+ 7219:2135565[0 .. 2046144]['0000000000000009000000000000000978' seq:0, type:1 .. '0000000000005F8B000000000000012B00000000000002AC' seq:0, type:1]
7061:827724[0 .. 2046131]['0000000000005F95000000000000000778787878787878' seq:0, type:1 .. '000000000000784F000000000000012B0000000000000113' seq:0, type:1]
6763:1352[0 .. 0]['000000000000784F000000000000012B0000000000000129' seq:0, type:1 .. '000000000000784F000000000000012B0000000000000129' seq:0, type:1]
7173:4812291[0 .. 2203957]['000000000000784F000000000000012B0000000000000138' seq:0, type:1 .. '0000000000020FAE787878787878' seq:0, type:1]
@@ -77,4 +77,4 @@
--- level 61 --- version# 1 ---
--- level 62 --- version# 1 ---
--- level 63 --- version# 1 ---
-next_file_number 7182 last_sequence 2203963 prev_log_number 0 max_column_family 0 min_log_number_to_keep 7015
+next_file_number 7221 last_sequence 2203963 prev_log_number 0 max_column_family 0 min_log_number_to_keep 7015
```
We have two options to fix this. Either skip verification against expected state or disable read injection when BE recovery is enabled. I chose to skip verification against expected state per discussion. (See comments in this PR)
Please note that some linter changes were included in this PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11938
Test Plan:
```
TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_best_efforts_recovery
```
Reviewed By: ltamasi
Differential Revision: D50136341
Pulled By: jaykorean
fbshipit-source-id: ac7434d592aebc148bfc3a4fcaa34936f136b95c
Summary:
Fix the TSAN false positive caused by reading a bool flag without synchronization.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11941
Test Plan: Run tsan crash test locally
Reviewed By: akankshamahajan15
Differential Revision: D50181799
Pulled By: anand1976
fbshipit-source-id: 889e7237e9f3c9452a9df94a0d949db5fe13bb57
Summary:
This PR depends on https://github.com/facebook/rocksdb/issues/11879 . Enable write fault injection for the basic whitebox, blackbox, and cf_consistency modes. For other test modes like multiops_txn, best_efforts_recovery etc., leave it disabled for now until we can do more testing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11924
Reviewed By: ajkr
Differential Revision: D50178252
Pulled By: anand1976
fbshipit-source-id: 5794f81c14cded1eb28762b2de818dfff1c1a34c
Summary:
1. Prevent a double join on a `port::Thread` (see the sketch after this list)
2. Ensure `recovery_in_prog_` and `bg_error_` are both set under same lock hold. This is useful for writers who see a non-OK `bg_error_` and are deciding whether to stall based on whether the error will be auto-recovered.
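A generic sketch of the double-join hazard from item 1 (plain `std::thread` shown here for illustration): calling `join()` on a thread that has already been joined terminates the program, so the joining side should check `joinable()` or otherwise guarantee a single join.
```
#include <thread>

// Sketch: guard against joining the same thread twice.
void JoinOnce(std::thread& recovery_thread) {
  if (recovery_thread.joinable()) {
    recovery_thread.join();
  }
}
```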
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11939
Reviewed By: cbi42
Differential Revision: D50155484
Pulled By: ajkr
fbshipit-source-id: fbc1f85c50e7eaee27ee0e376aee688d8a06c93b
Summary:
This PR exposes the RocksDB C++ API for performance measurement in Java.
It's an initial implementation and it doesn't support `level_to_perf_context`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11805
Reviewed By: akankshamahajan15
Differential Revision: D50128356
Pulled By: ltamasi
fbshipit-source-id: afb35980a89129a30d4a6b4cce12352c9de186b6
Summary:
This addresses https://github.com/facebook/rocksdb/issues/11277. The Java native library is no longer loaded until the code is first used.
It should allow manually loading the native library from a different location with `RocksDB#loadLibrary(List<java.lang.String>)`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11919
Reviewed By: jaykorean
Differential Revision: D50103182
Pulled By: ltamasi
fbshipit-source-id: 6090b529c7299b032f4e93cd0c3025a60f58652f
Summary:
cbi42 pointed out a race condition in which `recovery_io_error_` and `recovery_error_` could be updated inconsistently due to releasing the DB mutex in `EventHelpers::NotifyOnBackgroundError()`. There doesn't seem to be a point to having two status objects, so this PR consolidates them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11937
Reviewed By: cbi42
Differential Revision: D50105793
Pulled By: ajkr
fbshipit-source-id: 3de95baccfa44351a49a5c2aa0986c9bc81baa8f
Summary:
Relaxed the constraints for blocking when writes are stopped. When a recovery is already being attempted, we might as well let `!no_slowdown` writes wait on it in case it succeeds. This makes the user-visible behavior consistent across recovery flush and non-recovery flush.
This enables `db_stress` to inject retryable (soft) flush read errors without having to handle user write failures. I changed `db_stress` a bit to permit injected errors in many more foreground operations, as more admin operations (like `GetLiveFiles()`) can fail on a retryable error during flush.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11879
Reviewed By: anand1976
Differential Revision: D49571196
Pulled By: ajkr
fbshipit-source-id: 5d516d6faf20d2c6bfe0594ab4f2706bca6d69b0