Summary:
... when compiled with ASSERT_STATUS_CHECKED = 1.
The main change is in iterator_wrapper.h. The remaining changes are just fixing existing unit tests. Adding this check to IteratorWrapper gives a good coverage as the class is used in many places, including child iterators under merging iterator, merging iterator under DB iter, file_iter under level iterator, etc. This change can catch the bug fixed in https://github.com/facebook/rocksdb/issues/11782.
Future follow up: enable `ASSERT_STATUS_CHECKED=1` for stress test and for DEBUG_LEVEL=0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11975
Test Plan:
* `ASSERT_STATUS_CHECKED=1 DEBUG_LEVEL=2 make -j32 J=32 check`
* I tried to run stress test with `ASSERT_STATUS_CHECKED=1`, but there are a lot of existing stress code that ignore status checking, and fail without the change in this PR. So defer that to a follow up task.
Reviewed By: ajkr
Differential Revision: D50383790
Pulled By: cbi42
fbshipit-source-id: 1a28ce0f5fdf1890f93400b26b3b1b3a287624ce
Summary:
As titled. The most notable place that marks the feature as experimental is its wiki page. That was updated. And this PR removes the experimental marker from a few places for this feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11974
Reviewed By: ltamasi
Differential Revision: D50383640
Pulled By: jowlyzhang
fbshipit-source-id: 0bfe26ceda0793515f54b602cf3cd13d0737ec25
Summary:
A race condition between recovery and backup can happen with error messages like this:
```Failure in BackupEngine::CreateNewBackup with: IO error: No such file or directory: While opening a file for sequentially reading: /dev/shm/rocksdb_test/rocksdb_crashtest_whitebox/002653.log: No such file or directory```
PR https://github.com/facebook/rocksdb/issues/6949 introduced disabling file deletion during error handling of manifest IO errors. Aformentioned race condition is caused by this chain of event:
[Backup engine] disable file deletion
[Recovery] disable file deletion <= this is optional for the race condition, it may or may not get called
[Backup engine] get list of file to copy/link
[Recovery] force enable file deletion
.... some files refered by backup engine get deleted
[Backup engine] copy/link file <= error no file found
This PR fixes this with:
1) Recovery thread is currently forcing enabling file deletion as long as file deletion is disabled. Regardless of whether the previous error handling is for manifest IO error and that disabled it in the first place. This means it could incorrectly enabling file deletions intended by other threads like backup threads, file snapshotting threads. This PR does this check explicitly before making the call.
2) `disable_delete_obsolete_files_` is designed as a counter to allow different threads to enable and disable file deletion separately. The recovery thread currently does a force enable file deletion, because `ErrorHandler::SetBGError()` can be called multiple times by different threads when they receive a manifest IO error(details per PR https://github.com/facebook/rocksdb/issues/6949), resulting in `DBImpl::DisableFileDeletions` to be called multiple times too. Making a force enable file deletion call that resets the counter `disable_delete_obsolete_files_` to zero is a workaround for this. However, as it shows in the race condition, it can incorrectly suppress other threads like a backup thread's intention to keep the file deletion disabled. <strike>This PR adds a `std::atomic<int> disable_file_deletion_count_` to the error handler to track the needed counter decrease more precisely</strike>. This PR tracks and caps file deletion enabling/disabling in error handler.
3) for recovery, the section to find obsolete files and purge them was moved to be done after the attempt to enable file deletion. The actual finding and purging is more likely to happen if file deletion was previously disabled and get re-enabled now. An internal function `DBImpl::EnableFileDeletionsWithLock` was added to support change 2) and 3). Some useful logging was explicitly added to keep those log messages around.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11955
Test Plan: existing unit tests
Reviewed By: anand1976
Differential Revision: D50290592
Pulled By: jowlyzhang
fbshipit-source-id: 73aa8331ca4d636955a5b0324b1e104a26e00c9b
Summary:
Fix corruption error - "Corruption: first key in index doesn't match first key in block". when auto_readahead_size is enabled. Error is because of bug when index_iter_ moves forward, first_internal_key of that index_iter_ is not copied. So the Slice points to a different key resulting in wrong comparison when doing comparison.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11961
Test Plan: Ran stress test which reproduced this error.
Reviewed By: anand1976
Differential Revision: D50310589
Pulled By: akankshamahajan15
fbshipit-source-id: 95d8320b8388f1e3822c32024f84754f3a20a631
Summary:
Integrate pmd on the Java API to catch and report common Java coding problems; fix or suppress a basic set of PMD checks.
Link pmd into java build / CI
Add a pmd dependency to maven
Add a jpmd target to Makefile which runs pmd
Add a workflow to Circle CI which runs pmd
Configure an initial default pmd for CI
Repair lots of very simple PMD reports generated when we apply pmd-rules.xml
Repair or exception for PMD rules in the CloseResource category, which finds unclosed AutoCloseable resources.
We special-case the configuration of CloseResource and use the // NOPMD comment in source the avoid reports where we are the API creating an AutoCloseable, and therefore returning an unclosed resource is correct behaviour.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11221
Reviewed By: akankshamahajan15
Differential Revision: D50369930
Pulled By: jowlyzhang
fbshipit-source-id: a41c36b44b3bab7644df3e9cc16afbdf33b84f6b
Summary:
When `secondary_cache_uri` is non-empty and the `cache_type` is not a tiered cache, then sanitize `compressed_secondary_cache_size` to 0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11967
Test Plan: Run crash test
Reviewed By: akankshamahajan15
Differential Revision: D50346157
Pulled By: anand1976
fbshipit-source-id: 57bcbad2ec81fa736f1539a0a41ed6854ded2077
Summary:
Remove assertion from PrefetchAsync (roundup_len2 >= alignment) as for non direct_io, buffer size can be less than alignment resulting in assertion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11965
Test Plan: Ran the issue causing db_stress without this assertion and the verification completes successfully.
Reviewed By: anand1976
Differential Revision: D50328955
Pulled By: akankshamahajan15
fbshipit-source-id: 65f55ca230d2bbc63f4e2cc34c7273b22b515879
Summary:
This has become obsolete with the new `options_mutex_` in https://github.com/facebook/rocksdb/pull/11929
* Remove now-unnecessary parameter from WriteOptionsFile
* Rename (and negate) other parameter for better clarity (the caller shouldn't tell the callee what the callee needs, just what the caller knows, provides, and requests)
* Move a ROCKS_LOG_WARN (I/O) in WriteOptionsFile to outside of holding DB mutex.
* Also *avoid* (but not always eliminate) write queue synchronization in SetDBOptions. Still needed if there was a change to WAL size limit or other configuration.
* Improve some comments
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11951
Test Plan: existing unit tests and TSAN crash test local run
Reviewed By: ajkr
Differential Revision: D50247904
Pulled By: pdillinger
fbshipit-source-id: 7dfe445c705ec013886a2adb7c50abe50d83af69
Summary:
We saw frequent stress test failures with error messages like:
```
Verification failed for column family 0 key ...: value_from_db: , value_from_expected: ..., msg: GetEntity verification: Value not found: NotFound:
```
One cause for this is that data in WAL is lost after a crash. We initialize FaultInjectionTestFS to be not direct writable when write_fault_injection is enabled (see code change). This can cause the first WAL created during DB open to be lost if a db_stress is killed before the first WAL is synced. This PR initializes FaultInjectionTestFS to be direct writable. Note that FaultInjectionTestFS will be configured propertly for write fault injection after DB open in `RunStressTestImpl()`. So this change should not affect write fault injection coverage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11958
Test Plan:
a repro for the above bug:
```
Simulate crash before first WAL is sealed:
--- a/db_stress_tool/db_stress_driver.cc
+++ b/db_stress_tool/db_stress_driver.cc
@@ -256,6 +256,7 @@ bool RunStressTestImpl(SharedState* shared) {
fprintf(stderr, "Verification failed :(\n");
return false;
}
+ exit(1);
return true;
}
./db_stress --clear_column_family_one_in=0 --column_families=1 --preserve_internal_time_seconds=60 --destroy_db_initially=0 --db=/dev/shm/rocksdb_crashtest_blackbox --db_write_buffer_size=2097152 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --reopen=0 --test_batches_snapshots=0 --threads=1 --ops_per_thread=100 --write_fault_one_in=1000 --sync_fault_injection=0
./db_stress_main --clear_column_family_one_in=0 --column_families=1 --preserve_internal_time_seconds=60 --destroy_db_initially=0 --db=/dev/shm/rocksdb_crashtest_blackbox --db_write_buffer_size=2097152 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --reopen=0 --test_batches_snapshots=0 --sync_fault_injection=1
```
Reviewed By: akankshamahajan15
Differential Revision: D50300347
Pulled By: cbi42
fbshipit-source-id: 3a4881d72197f5ece82364382a0100912e16c2d6
Summary:
Currently, if file ingestion hit injected error, stress test is considered failed since it prints a message to stderr containing the keyword "error" and db_crashtest.py looks for it in stderr. This PR fixes it by print injected error to stdout.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11956
Test Plan: Check future stress test runs.
Reviewed By: akankshamahajan15
Differential Revision: D50293537
Pulled By: cbi42
fbshipit-source-id: e74915b1b3c6876a61ab6933c4529780362ec02b
Summary:
Introducing the notion of AttributeGroup by adding the `MultiGetEntity()` API retrieving `PinnableAttributeGroups`.
An "attribute group" refers to a logical grouping of wide-column entities within RocksDB. These attribute groups are implemented using column families.
Users can store WideColumns in different CFs for various reasons (e.g. similar access patterns, same types, etc.). This new API `MultiGetEntity()` takes keys and `PinnableAttributeGroups` per key. `PinnableAttributeGroups` is just a list of `PinnableAttributeGroup`s in which we have `ColumnFamilyHandle*`, `Status`, and `PinnableWideColumns`.
Let's say a user stored "hot" wide columns in column family "hot_data_cf" and "cold" wide columns in column family "cold_data_cf" and all other columns in "common_cf".
Prior to this PR, if the user wants to query for two keys, "key_1" and "key_2" and but only interested in "common_cf" and "hot_data_cf" for "key_1", and "common_cf" and "cold_data_cf" for "key_2", the user would have to construct input like `keys = ["key_1", "key_1", "key_2", "key_2"]`, `column_families = ["common_cf", "hot_data_cf", "common_cf", "cold_data_cf"]` and get the flat list of `PinnableWideColumns` to find the corresponding <key,CF> combo.
With the new `MultiGetEntity()` introduced in this PR, users can now query only `["common_cf", "hot_data_cf"]` for `"key_1"`, and only `["common_cf", "cold_data_cf"]` for `"key_2"`. The user will get `PinnableAttributeGroups` for each key, and `PinnableAttributeGroups` gives a list of `PinnableAttributeGroup`s where the user can find column family and corresponding `PinnableWideColumns` and the `Status`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11925
Test Plan:
- `DBWideBasicTest::MultiCFMultiGetEntityAsPinnableAttributeGroups` added
will enable this new API in the `db_stress` in a separate PR
Reviewed By: ltamasi
Differential Revision: D50017414
Pulled By: jaykorean
fbshipit-source-id: 643611d1273c574bc81b94c6f5aeea24b40c4586
Summary:
Saw this in stress test:
```
db_stress: cache/clock_cache.cc:3152:[...] Assertion `i < 0x2000' failed.
```
The problem is related to Lookups on a chain currently involved in a Grow operation. To avoid Lookup waiting on Grow, Lookup is able to walk a chain whose first part is already migrated and tail is not yet migrated, so is mixed with entries with a different destination home (according to `home_shift`) than what we're looking for. This is fine until we save one of these entries as a safe point in the chain to backtrack to (`read_ref_on_chain`) in case of concurrent modification and end up backtracking to it. In that case, we can get stuck on the wrong destination chain and keep trying to backtrack to an entry that is supposed to be on the correct chain but is not (anymore).
For some reason I haven't quite worked out, I believe it's usually able to recover after some 1000+ looop iterations, so reproducibility depends on the threshold at which we consider a Lookup loop to be too many iterations for a plausibly valid Lookup.
Detecting and working around this case is relatively simple. We can (and must) keep going on the chain but ensure we don't save it as a safe entry to backtrack to.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11948
Test Plan:
The problem could be reproduced in a few minutes with this (debug build):
```
$ while ./cache_bench -cache_type=auto_hyper_clock_cache -histograms=0 -cache_size=80000000 -threads=32 -populate_cache=0 -ops_per_thread=10000 -degenerate_hash_bits=6 -num_shard_bits=0; do :; done
```
At least with a lower threshold on suspiciously high number of iterations. I've lowered the thresholds quite a bit and no longer able to reproduce a failure.
Reviewed By: jowlyzhang
Differential Revision: D50236574
Pulled By: pdillinger
fbshipit-source-id: 2cb54a4e02bb51d5933eea41fcd489ab9d34aa96
Summary:
We saw the following TSAN stress test failure:
```
WARNING: ThreadSanitizer: data race (pid=17523)
Write of size 1 at 0x7b8c000008b9 by thread T4243 (mutexes: write M0):
#0 rocksdb::ErrorHandler::RecoverFromRetryableBGIOError() fbcode/internal_repo_rocksdb/repo/db/error_handler.cc:742 (db_stress+0x95f954) (BuildId: 35795dfb86ddc9c4f20ddf08a491f24d)
https://github.com/facebook/rocksdb/issues/1 std:🧵:_State_impl<std:🧵:_Invoker<std::tuple<void (rocksdb::ErrorHandler::*)(), rocksdb::ErrorHandler*>>>::_M_run() fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:74 (db_stress+0x95fc2b) (BuildId: 35795dfb86ddc9c4f20ddf08a491f24d)
https://github.com/facebook/rocksdb/issues/2 execute_native_thread_routine /home/engshare/third-party2/libgcc/11.x/src/gcc-11.x/x86_64-facebook-linux/libstdc++-v3/src/c++11/../../../.././libstdc++-v3/src/c++11/thread.cc:82:18 (libstdc++.so.6+0xdf4e4) (BuildId: 452d1cdae868baeeb2fdf1ab140f1c219bf50c6e)
Previous read of size 1 at 0x7b8c000008b9 by thread T22:
#0 rocksdb::DBImpl::SyncClosedLogs(rocksdb::JobContext*, rocksdb::VersionEdit*) fbcode/internal_repo_rocksdb/repo/db/error_handler.h:76 (db_stress+0x84f69c) (BuildId: 35795dfb86ddc9c4f20ddf08a491f24d)
```
This is due to a data race in accessing `recovery_in_prog_`. This PR fixes it by accessing `recovery_in_prog_` under db mutex before calling `SyncClosedLogs()`. I think the original PR https://github.com/facebook/rocksdb/pull/10489 intended to clear the error if it's a recovery flush. So ideally we can also just check flush reason. I plan to keep a safer change in this PR and make that change in the future if needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11950
Test Plan: check future TSAN stress test results.
Reviewed By: anand1976
Differential Revision: D50242255
Pulled By: cbi42
fbshipit-source-id: 0d487948ef9546b038a34460f3bb037f6e5bfc58
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11947
The patch is a small refactoring of `BaseDeltaIterator`: instead of determining the iterator's value during the `value()` call, it is resolved up front in `UpdateCurrent()`. This has multiple benefits: the value is now computed only once even if `value()` is called multiple times for the same iterator position (note that with the previous code, merges for example would get performed multiple times in this case), it makes it possible to remove the `mutable` modifiers from the `status_` and `merge_result_` members, and it also serves as groundwork for adding wide-column support to `WriteBatchWithIndex`.
Reviewed By: jaykorean
Differential Revision: D50236117
fbshipit-source-id: ae3d05863f811e9bac4c09edc49eca5f37e072a5
Summary:
We were seeing the following stress test failures:
```LevelCompactionBuilder::PickFileToCompact(const rocksdb::autovector<std::pair<int, rocksdb::FileMetaData*> >&, bool): Assertion `!level_file.second->being_compacted' failed```
This can happen when we are picking a file to be compacted from some files marked for compaction, but that file is already being_compacted. We prevent this by always calling `ComputeCompactionScore()` after we pick a compaction and mark some files as being_compacted. However, if SetOptions() is called to disable marking certain files to be compacted, say `enable_blob_garbage_collection`, we currently just skip the relevant logic in `ComputeCompactionScore()` without clearing the existing files already marked for compaction. This PR fixes this issue by already clearing these files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11946
Test Plan: existing tests.
Reviewed By: akankshamahajan15
Differential Revision: D50232608
Pulled By: cbi42
fbshipit-source-id: 11e4fb5e9d48b0f946ad33b18f7c005f0161f496
Summary:
With the introduction of the `UpdateTieredCache` API, its possible to dynamically change the compressed secondary cache ratio of the total cache capacity. In order to optimize performance, we avoid using a mutex when inserting/releasing placeholder entries, which can result in some inaccuracy in the accounting during the dynamic update. This inaccuracy was causing a runtime error due to an integer underflow in `UpdateCacheReservationRatio`, causing ubsan crash tests to fail. This PR fixes it by explicitly checking for the underflow.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11949
Test Plan:
1. Added a unit test that fails without the fix
2. Run ubsan_crash
Reviewed By: akankshamahajan15
Differential Revision: D50240217
Pulled By: anand1976
fbshipit-source-id: d2f7b79da54eec8b61aec2cc1f2943da5d5847ac
Summary:
In follow-up to https://github.com/facebook/rocksdb/issues/11922, fix a race in functions like CreateColumnFamily and SetDBOptions where the DB reports one option setting but a different one is left in effect.
To fix, we can add an extra mutex around these rare operations. We don't want to hold the DB mutex during I/O or other slow things because of the many purposes it serves, but a mutex more limited to these cases should be fine.
I believe this would fix a write-write race in https://github.com/facebook/rocksdb/issues/10079 but not the read-write race.
Intended follow-up to this:
* Should be able to remove write thread synchronization from DBImpl::WriteOptionsFile
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11929
Test Plan:
Added two mini-stress style regression tests that fail with >1% probability before this change:
DBOptionsTest::SetStatsDumpPeriodSecRace
ColumnFamilyTest::CreateAndDropPeriodicRace
I haven't reproduced such an inconsistency between in-memory options and on disk latest options, but this change at least improves safety and adds a test anyway:
DBOptionsTest::SetStatsDumpPeriodSecRace
Reviewed By: ajkr
Differential Revision: D50024506
Pulled By: pdillinger
fbshipit-source-id: 1e99a9ed4d96fdcf3ac5061ec6b3cee78aecdda4
Summary:
Closes https://github.com/facebook/rocksdb/issues/5297
The BlockBasedTableConfig (or more generally, the TableFormatConfig) of ColumnFamilyOptions, isn't being constructed when column family options are loaded. This happens in `OptionsUtil` which implements the loading.
In `OptionsUtil` we add the method `private native static TableFormatConfig readTableFormatConfig(final long nativeHandle_)` which defers to a JNI method which creates a `TableFormatConfig` (specifically a `BlockBasedTableConfig`) for the supplied `ColumnFamilyOptions`, by copying the table format attached to the C++ column family options. A new Java constructor for `BlockBasedTableConfig` is implemented which is called from C++ with the parameters retrieved from the table format, and then returned to the calling `readTableFormatConfig`.
At the Java side in `OptionsUtil`, the new `TableFormatConfig` is added as the `tableFormatConfig_` field of the `ColumnFamilyOptions`.
To support this, the new class `BlockBasedTableOptionsJni` and associated support methods are added to 'portal.h'.
`BloomFilter.java` has a constructor and field added so that the filter in use can be read back and inspected.
`FilterPolicyType.java` implements an enum (shadowed in C++) to support transfer of filter policy information back to Java from being read at the C++ side.
Tests written to cover the block based table config, and cleaned up and generalised a bit as some of the methods on OptionsUtil weren't tested; and these had their own unique JNI method variants which in turn were never exercised in test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10826
Reviewed By: ajkr
Differential Revision: D50136247
Pulled By: jowlyzhang
fbshipit-source-id: 39387448147abc574e99f43979d89b0900e5f81d
Summary:
Thanks ltamasi and ajkr for initial investigations on the test failure. Per the investigations, the following scenario is likely causing the test to fail.
1. Recovery is needed (could be any reason during crash test)
2. Trying to recover from the latest manifest fails (likely due to read error injection)
3. DB opens with recovery from the next manifest which is different from step 2.
4. Expected state is based on the manifest we tried and failed in step 2.
5. Two manifests used in step 2 and 3 are confirmed to have difference in LSM trees (Thanks ltamasi again for the finding).
```
2023/10/05-11:24:18.942189 56341 [db/version_set.cc:6079] Trying to recover from manifest: /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox/MANIFEST-007184
...
2023/10/05-11:24:18.978007 56341 [db/version_set.cc:6079] Trying to recover from manifest: /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox/MANIFEST-007180
```
```
[ltamasi@devbig1024.prn1 /tmp/x]$ ldb manifest_dump --hex --path=MANIFEST-007184_renamed_ > 2
[ltamasi@devbig1024.prn1 /tmp/x]$ ldb manifest_dump --hex --path=MANIFEST-007180_renamed_ > 1
[ltamasi@devbig1024.prn1 /tmp/x]$ diff 1 2
--- 1 2023-10-09 10:29:16.966215207 -0700
+++ 2 2023-10-09 10:29:11.984241645 -0700
@@ -13,7 +13,7 @@
7174:3950254[1875617 .. 2203952]['000000000003415B000000000000012B000000000000007D' seq:1906214, type:1 .. '000000000003CA59000000000000012B000000000000005C' seq:2039838, type:1]
7175:88060[2074748 .. 2203892]['000000000003CA6300000000000000CF78787878787878' seq:2167539, type:2 .. '000000000003D08F000000000000012B0000000000000130' seq:2112478, type:0]
--- level 6 --- version# 1 ---
- 7057:3132633[0 .. 2046144]['0000000000000009000000000000000978' seq:0, type:1 .. '0000000000005F8B000000000000012B00000000000002AC' seq:0, type:1]
+ 7219:2135565[0 .. 2046144]['0000000000000009000000000000000978' seq:0, type:1 .. '0000000000005F8B000000000000012B00000000000002AC' seq:0, type:1]
7061:827724[0 .. 2046131]['0000000000005F95000000000000000778787878787878' seq:0, type:1 .. '000000000000784F000000000000012B0000000000000113' seq:0, type:1]
6763:1352[0 .. 0]['000000000000784F000000000000012B0000000000000129' seq:0, type:1 .. '000000000000784F000000000000012B0000000000000129' seq:0, type:1]
7173:4812291[0 .. 2203957]['000000000000784F000000000000012B0000000000000138' seq:0, type:1 .. '0000000000020FAE787878787878' seq:0, type:1]
@@ -77,4 +77,4 @@
--- level 61 --- version# 1 ---
--- level 62 --- version# 1 ---
--- level 63 --- version# 1 ---
-next_file_number 7182 last_sequence 2203963 prev_log_number 0 max_column_family 0 min_log_number_to_keep 7015
+next_file_number 7221 last_sequence 2203963 prev_log_number 0 max_column_family 0 min_log_number_to_keep 7015
```
We have two options to fix this. Either skip verification against expected state or disable read injection when BE recovery is enabled. I chose to skip verification against expected state per discussion. (See comments in this PR)
Please note that some linter changes were included in this PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11938
Test Plan:
```
TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_best_efforts_recovery
```
Reviewed By: ltamasi
Differential Revision: D50136341
Pulled By: jaykorean
fbshipit-source-id: ac7434d592aebc148bfc3a4fcaa34936f136b95c
Summary:
Fix the TSAN false positive caused by reading a bool flag without synchronization.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11941
Test Plan: Run tsan crash test locally
Reviewed By: akankshamahajan15
Differential Revision: D50181799
Pulled By: anand1976
fbshipit-source-id: 889e7237e9f3c9452a9df94a0d949db5fe13bb57
Summary:
This PR depends on https://github.com/facebook/rocksdb/issues/11879 . Enable write fault injection for the basic whitebox, blackbox, and cf_consistency modes. For other test modes like multiops_txn, best_efforts_recovery etc., leave it disabled for now until we can do more testing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11924
Reviewed By: ajkr
Differential Revision: D50178252
Pulled By: anand1976
fbshipit-source-id: 5794f81c14cded1eb28762b2de818dfff1c1a34c
Summary:
1. Prevent a double join on a `port::Thread`
2. Ensure `recovery_in_prog_` and `bg_error_` are both set under same lock hold. This is useful for writers who see a non-OK `bg_error_` and are deciding whether to stall based on whether the error will be auto-recovered.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11939
Reviewed By: cbi42
Differential Revision: D50155484
Pulled By: ajkr
fbshipit-source-id: fbc1f85c50e7eaee27ee0e376aee688d8a06c93b
Summary:
This PR expose RocksDB C++ API for performance measurement in Java.
It's initial implementation and it doesn't support ```level_to_perf_context```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11805
Reviewed By: akankshamahajan15
Differential Revision: D50128356
Pulled By: ltamasi
fbshipit-source-id: afb35980a89129a30d4a6b4cce12352c9de186b6
Summary:
This address https://github.com/facebook/rocksdb/issues/11277. Java native library is not anymore loaded until the code is first used.
It should allow to manually load native library from different location with `RocksDB#loadLibrary(List<java.lang.String>)`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11919
Reviewed By: jaykorean
Differential Revision: D50103182
Pulled By: ltamasi
fbshipit-source-id: 6090b529c7299b032f4e93cd0c3025a60f58652f
Summary:
cbi42 pointed out a race condition in which `recovery_io_error_` and `recovery_error_` could be updated inconsistently due to releasing the DB mutex in `EventHelpers::NotifyOnBackgroundError()`. There doesn't seem to be a point to having two status objects, so this PR consolidates them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11937
Reviewed By: cbi42
Differential Revision: D50105793
Pulled By: ajkr
fbshipit-source-id: 3de95baccfa44351a49a5c2aa0986c9bc81baa8f
Summary:
Relaxed the constraints for blocking when writes are stopped. When a recovery is already being attempted, we might as well let `!no_slowdown` writes wait on it in case it succeeds. This makes the user-visible behavior consistent across recovery flush and non-recovery flush.
This enables `db_stress` to inject retryable (soft) flush read errors without having to handle user write failures. I changed `db_stress` a bit to permit injected errors in much more foreground operations as more admin operations (like `GetLiveFiles()`) can fail on a retryable error during flush.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11879
Reviewed By: anand1976
Differential Revision: D49571196
Pulled By: ajkr
fbshipit-source-id: 5d516d6faf20d2c6bfe0594ab4f2706bca6d69b0
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11930
The patch cleans up and refactors the logic in/around `WriteBatchWithIndexInternal` a bit as groundwork for further changes. Specifically, the class is turned back into a stateless collection of static helpers (which is the way it was before PR 6851). Note that there were two apparent reasons for introducing this instance state in PR 6851: a) encapsulating `MergeContext` and b) resolving objects like `Logger` and `Statistics` based on a variety of handles. However, neither reason seems justified at this point. Regarding a), the `MultiGetFromBatchAndDB` logic passes in its own `MergeContext` objects via a second set of methods that do not use the member `MergeContext`. As for b), `Logger` and friends are only needed for Merge, which is only supported if a column family handle is provided; in turn, the column family handle enables us to resolve all the necessary objects without the need for any other handles like `DB` or `DBOptions`. In addition to the above, the patch changes the type of `BaseDeltaIterator::merge_result_` to `std::string` from `PinnableSlice` (since no pinning is ever done) and makes some other small code quality improvements.
Reviewed By: jaykorean
Differential Revision: D50038302
fbshipit-source-id: 5f34abe2e808bdaea0f3a8033b5764ebd446b85d
Summary:
Context/Summary: this option is experimental right now
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11926
Test Plan: no code change
Reviewed By: jaykorean
Differential Revision: D49985000
Pulled By: hx235
fbshipit-source-id: a5b439ed35e3d6bb04c125f222ac29cd3842d1a1
Summary:
This change has two primary goals (follow-up to https://github.com/facebook/rocksdb/issues/11917, https://github.com/facebook/rocksdb/issues/11920):
* Ensure the DB seqno_to_time_mapping has entries that allow us to put a good time lower bound on any writes that happen after setting up preserve/preclude options (either in a new DB, new CF, SetOptions, etc.) and haven't yet aged out of that time window. This allows us to remove a bunch of work-arounds in tests.
* For new DBs using preserve/preclude options, automatically reserve some sequence numbers and pre-map them to cover the time span back to the preserve/preclude cut-off time. In the future, this will allow us to import data from another DB by key, value, and write time by assigning an appropriate seqno in this DB for that write time.
Note that the pre-population (historical mappings) does not happen if the original options at DB Open time do not have preserve/preclude, so it is recommended to create initial column families at that time with create_missing_column_families, to take advantage of this (future) feature. (Adding these historical mappings after DB Open would risk non-monotonic seqno_to_time_mapping, which is dubious if not dangerous.)
Recommended follow-up:
* Solve existing race conditions (not memory safety) where parallel operations like CreateColumnFamily or SetDBOptions could leave the wrong setting in effect.
* Make SeqnoToTimeMapping more gracefully handle a possible case in which too many mappings are added for the time range of concern. It seems like there could be cases where data is massively excluded from the cold tier because of entries falling off the front of the mapping list (causing GetProximalSeqnoBeforeTime() to return 0). (More investigation needed.)
No release note for the minor bug fix because this is still an experimental feature with limited usage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11922
Test Plan: tests added / updated
Reviewed By: jowlyzhang
Differential Revision: D49956563
Pulled By: pdillinger
fbshipit-source-id: 92beb918c3a298fae9ca8e509717b1067caa1519
Summary:
In preparing some seqno_to_time_mapping improvements, I found that some of the wrap-up work for creating column families was unnecessarily repeated in the case of DB::Open with create_missing_column_families. This change fixes that (`CreateColumnFamily()` -> `CreateColumnFamilyImpl()` in `DBImpl::Open()`), motivated by avoiding repeated calls to `RegisterRecordSeqnoTimeWorker()` but with the side benefit of avoiding repeated calls to `WriteOptionsFile()` for each CF.
Also in this change:
* Add a `Status::UpdateIfOk()` function for combining statuses in a common pattern
* Rename `max_time_duration` -> `min_preserve_seconds` (include units as much as possible)
* Improved comments in several places
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11920
Test Plan: tests added / updated
Reviewed By: jaykorean
Differential Revision: D49919147
Pulled By: pdillinger
fbshipit-source-id: 3d0318c1d070c842c5331da0a5b415caedc104f1
Summary:
1. **Error** in TestIterateAgainstExpected API - `Assertion index < pre_read_expected_values.size() && index < post_read_expected_values.size() failed.`
**Fix** - `Prev` op is not supported with `auto_readahead_size`. So added support to Reseek in db_iter, if Prev is called. In BlockBasedTableIterator, index_iter_ already moves forward. So there is no way to do Prev from BlockBasedTableIterator.
2. **Error** - `void rocksdb::BlockBasedTableIterator::BlockCacheLookupForReadAheadSize(uint64_t, size_t, size_t&): Assertion index_iter_->value().handle.offset() == offset`
**Fix** - Remove prefetch_buffer to be used when uncompressed dict is read.
3. ** Error in TestPrefixScan API - `db_stress: db/db_iter.cc:369: bool rocksdb::DBIter::FindNextUserEntryInternal(bool, const rocksdb::Slice*): Assertion !skipping_saved_key || CompareKeyForSkip(ikey_.user_key, saved_key_.GetUserKey()) > 0 failed.
Received signal 6 (Aborted)
Invoking GDB for stack trace...
db_stress: table/merging_iterator.cc:1036: bool rocksdb::MergingIterator::SkipNextDeleted(): Assertion comparator_->Compare(range_tombstone_iters_[i]->start_key(), pik) <= 0 failed`
**Fix** - SeekPrev also calls 1) SeekPrev , 2)Seek and then 3)Prev in some cases in db_iter.cc leading to failure of Prev operation. These backward operations also call Seek. Added direction to disable lookup once direction is backwards in BlockBasedTableIterator.cc
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11884
Test Plan: Ran various flavors of crash tests locally for the whole duration
Reviewed By: anand1976
Differential Revision: D49834201
Pulled By: akankshamahajan15
fbshipit-source-id: 9a007b4d46a48002c43dc4623a400ecf47d997fe
Summary:
Since allowing 24hr peak by setting start_time = end_time is not so intuitive, we are not going to allow it (e.g. `00:00-00:00` doesn't looks like a value that would cover 24hr.). Instead, we are going to compare at minute level (i.e. dropping the seconds to the nearest minute) so that `00:00-23:59` will cover 24hrs. The entire minute from 23:59:00 23:59:59 will be covered with this change.
Minor fixes from previous PR
- release build error
- fixed random seed in test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11911
Test Plan:
`DBOptionsTest::OffPeakTimes`
`make -j64 static_lib` to test release build issue that was fixed
Reviewed By: pdillinger
Differential Revision: D49787795
Pulled By: jaykorean
fbshipit-source-id: e8d045b95f54f61d5dd5f1bb473579f8d55c18b3
Summary:
Recovery triggers flushes for very different scenarios:
(1) `FlushReason::kErrorRecoveryRetryFlush`: a flush failed
(2) `FlushReason::kErrorRecovery`: a WAL may be corrupted
(3) `FlushReason::kCatchUpAfterErrorRecovery`: immutable memtables may have accumulated
The old code called called `FlushAllColumnFamilies()` in all cases, which uses manual flush functions: `AtomicFlushMemTables()` and `FlushMemTable()`. Forcing flushing the latest data on all CFs was useful for (2) because it ensures all CFs move past the corrupted WAL.
However, those code paths were overkill for (1) and (3), where only already-immutable memtables need to be flushed. There were conditionals to exclude some of the extraneous logic but I found there was still too much happening. For example, both of the manual flush functions enter the write thread. Entering the write thread is inconvenient because then we can't allow stalled writes to wait on a retrying flush to finish.
Instead of continuing down the path of adding more conditionals to the manual flush functions, this PR introduces a dedicated function for cases (1) and (3): `RetryFlushesForErrorRecovery()`. Also I cleaned up the manual flush functions to remove existing conditionals for these cases as they're no longer needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11903
Reviewed By: cbi42
Differential Revision: D49693812
Pulled By: ajkr
fbshipit-source-id: 7630ac539b9d6c92052c13a3cdce53256134d990
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11913
The `max_successive_merges` logic currently does not handle wide-column base values correctly, since it uses the `Get` API, which only returns the value of the default column. The patch fixes this by switching to `GetEntity` and passing all columns (if applicable) to the merge operator.
Reviewed By: jaykorean
Differential Revision: D49795097
fbshipit-source-id: 75eb7cc9476226255062cdb3d43ab6bd1cc2faa3
Summary:
After https://github.com/facebook/rocksdb/issues/11905, I am preparing a DBImpl change to ensure all sufficiently recent sequence numbers since Open are covered by SeqnoToTimeMapping. **Intended follow-up**
However, there are a number of test changes I want to make prior to that to make it clear that I am not regressing the tests and production behavior at the same time.
* Start mock time in the tests well beyond epoch (time 0) so that we aren't normally reaching into pre-history for current time minus the preserve/preclude duration.
* Majorly clean up BasicSeqnoToTimeMapping to avoid confusing hard-coded bounds on GetProximalTimeBeforeSeqno() results.
* There is an unresolved/unexplained issue marked with FIXME that should be investigated when GetProximalTimeBeforeSeqno() is put into production.
* MultiCFs test was strangely generating 5 L0 files, four of which would be compacted into an L1, and then letting TTL compaction compact 1@L0+1@L1. Changing the starting time of the tests seemed to mess up the TTL compaction. But I suspect the TTL compaction was unintentional, so I've cut it down to just 4 L0 files, which compacts predictably.
* Unrelated: allow ROCKSDB_NO_STACK=1 to skip printing a stack trace on assertion failures.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11917
Test Plan: no changes to production code
Reviewed By: jowlyzhang
Differential Revision: D49841436
Pulled By: pdillinger
fbshipit-source-id: 753348ace9c548e82bcb77fcc8b2ffb7a6beeb0a
Summary:
Separate the message for value mismatch from the message for an extra value in the DB
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11912
Reviewed By: hx235
Differential Revision: D49792137
Pulled By: ajkr
fbshipit-source-id: 311bc1801843a15367f409ead88ef755acbde468
Summary:
Missed `GetFileSize()` forwarding , this PR fix this issue, and mark `WritableFile::GetFileSize()` as pure virtual to detect such issue in compile time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11726
Reviewed By: ajkr
Differential Revision: D49791240
Pulled By: jowlyzhang
fbshipit-source-id: ef219508d6b15c9a24df9b706a9fdc33cc6a286e
Summary:
Changed `DBOptions::fail_if_options_file_error` default from `false` to
`true`. It is safer to fail an operation by default when it encounters
an error.
Also changed the API doc to list items in the conventional way for listing items in a sentence. The slashes weren't working well as one got dropped, probably because it looked like a typo.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11800
Test Plan: rely on CI
Reviewed By: jowlyzhang
Differential Revision: D49030532
Pulled By: ajkr
fbshipit-source-id: e606062aa25f9063d8c6fb0d03aebca5c2bc56d3
Summary:
RocksDB's primary function is to facilitate read and write operations. Compactions, while essential for minimizing read amplifications and optimizing storage, can sometimes compete with these primary tasks. Especially during periods of high read/write traffic, it's vital to ensure that primary operations receive priority, avoiding any potential disruptions or slowdowns. Conversely, during off-peak times when traffic is minimal, it's an opportune moment to tackle low-priority tasks like TTL based compactions, optimizing resource usage.
In this PR, we are incorporating the concept of off-peak time into RocksDB by introducing `daily_offpeak_time_utc` within the DBOptions. This setting is formatted as "HH:mm-HH:mm" where the first one before "-" is the start time and the second one is the end time, inclusive. It will be later used for resource optimization in subsequent PRs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11893
Test Plan:
- New Unit Test Added - `DBOptionsTest::OffPeakTimes`
- Existing Unit Test Updated - `OptionsTest`, `OptionsSettableTest`
Reviewed By: pdillinger
Differential Revision: D49714553
Pulled By: jaykorean
fbshipit-source-id: fef51ea7c0fede6431c715bff116ddbb567c8752
Summary:
This change is before a planned DBImpl change to ensure all sufficiently recent sequence numbers since Open are covered by SeqnoToTimeMapping (bug fix with existing test work-arounds). **Intended follow-up**
However, I found enough issues with SeqnoToTimeMapping to warrant this PR first, including very small fixes in DB implementation related to API contract of SeqnoToTimeMapping.
Functional fixes / changes:
* This fixes some mishandling of boundary cases. For example, if the user decides to stop writing to DB, the last written sequence number would perpetually have its write time updated to "now" and would always be ineligible for migration to cold tier. Part of the problem is that the SeqnoToTimeMapping would return a seqno known to have been written before (immediately or otherwise) the requested time, but compaction_job.cc would include that seqno in the preserve/exclude set. That is fixed (in part) by adding one in compaction_job.cc
* That problem was worse because a whole range of seqnos could be updated perpetually with new times in SeqnoToTimeMapping::Append (if no writes to DB). That logic was apparently optimized for GetOldestApproximateTime (now GetProximalTimeBeforeSeqno), which is not used in production, to the detriment of GetOldestSequenceNum (now GetProximalSeqnoBeforeTime), which is used in production. (Perhaps plans changed during development?) This is fixed in Append to optimize for accuracy of GetProximalSeqnoBeforeTime. (Unit tests added and updated.)
* Related: SeqnoToTimeMapping did not have a clear contract about the relationships between seqnos and times, just the idea of a rough correspondence. Now the class description makes it clear that the write time of each recorded seqno comes before or at the associated time, to support getting best results for GetProximalSeqnoBeforeTime. And this makes it easier to make clear the contract of each API function.
* Update `DBImpl::RecordSeqnoToTimeMapping()` to follow this ordering in gathering samples.
Some part of these changes has required an expanded test work-around for the problem (see intended follow-up above) that the DB does not immediately ensure recent seqnos are covered by its mapping. These work-arounds will be removed with that planned work.
An apparent compaction bug is revealed in
PrecludeLastLevelTest::RangeDelsCauseFileEndpointsToOverlap, so that test is disabled. Filed GitHub issue #11909
Cosmetic / code safety things (not exhaustive):
* Fix some confusing names.
* `seqno_time_mapping` was used inconsistently in places. Now just `seqno_to_time_mapping` to correspond to class name.
* Rename confusing `GetOldestSequenceNum` -> `GetProximalSeqnoBeforeTime` and `GetOldestApproximateTime` -> `GetProximalTimeBeforeSeqno`. Part of the motivation is that our times and seqnos here have the same underlying type, so we want to be clear about which is expected where to avoid mixing.
* Rename `kUnknownSeqnoTime` to `kUnknownTimeBeforeAll` because the value is a bad choice for unknown if we ever add ProximalAfterBlah functions.
* Arithmetic on SeqnoTimePair doesn't make sense except for delta encoding, so use better names / APIs with that in mind.
* (OMG) Don't allow direct comparison between SeqnoTimePair and SequenceNumber. (There is no checking that it isn't compared against time by accident.)
* A field name essentially matching the containing class name is a confusing pattern (`seqno_time_mapping_`).
* Wrap calls to confusing (but useful) upper_bound and lower_bound functions to have clearer names and more code reuse.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11905
Test Plan: GetOldestSequenceNum (now GetProximalSeqnoBeforeTime) and TruncateOldEntries were lacking unit tests, despite both being used in production (experimental feature). Added those and expanded others.
Reviewed By: jowlyzhang
Differential Revision: D49755592
Pulled By: pdillinger
fbshipit-source-id: f72a3baac74d24b963c77e538bba89a7fc8dce51