Commit Graph

12276 Commits

Author SHA1 Message Date
Yingchun Lai 37064d631b Add encfs plugin link (#12070)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12070

Reviewed By: jaykorean

Differential Revision: D51307148

Pulled By: ajkr

fbshipit-source-id: d04335506becd5970802f87ab0573b6307479222
2023-11-14 07:33:21 -08:00
Dzmitry Ivaniuk 65d71ee371 Fix warnings when using API (#12066)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/11457.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12066

Reviewed By: cbi42

Differential Revision: D51259966

Pulled By: ajkr

fbshipit-source-id: a158b6f341b6b48233d917bfe4d00b639dbd8619
2023-11-13 20:03:44 -08:00
Changyu Bi e7896f03ad Enable unit test `PrecludeLastLevelTest.RangeDelsCauseFileEndpointsToOverlap` (#12064)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/11909. The test passes after the change in https://github.com/facebook/rocksdb/issues/11917 to start mock clock from a non-zero time.

The reason for test failing is a bit complicated:
- The Put here e4ad4a0ef1/db/compaction/tiered_compaction_test.cc (L2045) happens before mock clock advances beyond 0.
- This causes oldest_key_time_ to be 0 for memtable.
- oldest_ancester_time of the first L0 file becomes 0
- L0 -> L5/6 compaction output files sets `oldest_ancestoer_time` to the current time due to these lines: 509947ce2c/db/compaction/compaction_job.cc (L1898C34-L1904).
- This causes some small sequence number to be mapped to current time: 509947ce2c/db/compaction/compaction_job.cc (L301)
- Keys in L6 is being moved up to L5 due to the unexpected seqno_to_time mapping
- When compacting keys from last level to the penultimate level, we only check keys to be within user key range of penultimate level input files. If we compact the following file 3 with file 1 and output keys to L5, we can get the reported inconsistency bug.
```
L5: file 1 [K5@20, K10@kMaxSeqno], file 2 [K10@30, K14@34)
L6: file 3 [K6@5, K10@20]
```

https://github.com/facebook/rocksdb/issues/12063 will add fixes to check internal key range when compacting keys from last level up to the penultimate level.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12064

Test Plan: the unit test passes

Reviewed By: ajkr

Differential Revision: D51281149

Pulled By: cbi42

fbshipit-source-id: 00b7f026c453454d9f3af5b2de441383a96f0c62
2023-11-13 15:26:52 -08:00
Jay Huh 8b8f6c63ef ColumnFamilyHandle Nullcheck in GetEntity and MultiGetEntity (#12057)
Summary:
- Add missing null check for ColumnFamilyHandle in `GetEntity()`
- `FailIfCfHasTs()` now returns `Status::InvalidArgument()` if `column_family` is null. `MultiGetEntity()` can rely on this for cfh null check.
- Added `DeleteRange` API using Default Column Family to be consistent with other major APIs (This was also causing Java Test failure after the `FailIfCfHasTs()` change)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12057

Test Plan:
- Updated `DBWideBasicTest::GetEntityAsPinnableAttributeGroups` to include null CF case
- Updated `DBWideBasicTest::MultiCFMultiGetEntityAsPinnableAttributeGroups` to include null CF case

Reviewed By: jowlyzhang

Differential Revision: D51167445

Pulled By: jaykorean

fbshipit-source-id: 1c1e44fd7b7df4d2dc3bb2d7d251da85bad7d664
2023-11-13 14:30:04 -08:00
leipeng b3ffca0e29 DBImpl::DelayWrite: Remove bad WRITE_STALL histogram (#12067)
Summary:
When delay didn't happen, histogram WRITE_STALL is still recorded, and ticker STALL_MICROS is not recorded.

This is a bug, neither WRITE_STALL or STALL_MICROS should not be recorded when delay did not happen.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12067

Reviewed By: cbi42

Differential Revision: D51263133

Pulled By: ajkr

fbshipit-source-id: bd82d8328fe088d613991966e83854afdabc6a25
2023-11-13 12:48:44 -08:00
brodyhuang 9fb6851918 fix(StackableDB): Resume API (#12060)
Summary:
When I call `DBWithTTLImpl::Resume()`, it returns `Status::NotSupported`.  Did `StackableDB` miss this API ?
Thanks !

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12060

Reviewed By: jaykorean

Differential Revision: D51202742

Pulled By: ajkr

fbshipit-source-id: 5e01a54a42efd81fd57b3c992b9af8bc45c59c9c
2023-11-13 12:09:58 -08:00
Yu Zhang 509947ce2c Quarantine files in a limbo state after a manifest error (#12030)
Summary:
Part of the procedures to handle manifest IO error is to disable file deletion in case some files in limbo state get deleted prematurely. This is not ideal because: 1) not all the VersionEdits whose commit encounter such an error contain updates for files, disabling file deletion sometimes are not necessary. 2) `EnableFileDeletion` has a force mode that could make other threads accidentally disrupt this procedure in recovery.  3) Disabling file deletion as a whole is also not as efficient as more precisely tracking impacted files from being prematurely deleted.  This PR replaces this mechanism with tracking such files and quarantine them from being deleted in `ErrorHandler`.

These are the types of files being actively tracked in quarantine in this PR:
1) new table files and blob files from a background job
2) old manifest file whose immediately following new manifest file's CURRENT file creation gets into unclear state. Current handling is not sufficient to make sure the old manifest file is kept in case it's needed.

Note that WAL logs are not part of the quarantine because `min_log_number_to_keep` is a safe mechanism and it's only updated after successful manifest commits so it can prevent this premature deletion issue from happening.

We track these files' file numbers because they share the same file number space.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12030

Test Plan: Modified existing unit tests

Reviewed By: ajkr

Differential Revision: D51036774

Pulled By: jowlyzhang

fbshipit-source-id: 84ef26271fbbc888ef70da5c40fe843bd7038716
2023-11-11 08:11:11 -08:00
Andrew Kryczka 0ffc0c7db1 Allow `TtlMergeOperator` to wrap an unregistered `MergeOperator` (#12056)
Summary:
Followed mrambacher's first suggestion in https://github.com/facebook/rocksdb/pull/12044#issuecomment-1800706148.

This change allows serializing a `TtlMergeOperator` that wraps an unregistered `MergeOperator`. Such a `TtlMergeOperator` cannot be loaded (validation will fail in `TtlMergeOperator::ValidateOptions()`), but that is OK for us currently.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12056

Reviewed By: hx235

Differential Revision: D51125097

Pulled By: ajkr

fbshipit-source-id: 8ed3705e8d36ab473673b9198eea6db64397ed15
2023-11-10 16:57:17 -08:00
Yu Zhang c6c683a0ca Remove the default force behavior for `EnableFileDeletion` API (#12001)
Summary:
Disabling file deletion can be critical for operations like making a backup, recovery from manifest IO error (for now). Ideally as long as there is one caller requesting file deletion disabled, it should be kept disabled until all callers agree to re-enable it. So this PR removes the default forcing behavior for the `EnableFileDeletion` API, and users need to explicitly pass the argument if they insisted on doing so knowing the consequence of what can be potentially disrupted.

This PR removes the API's default argument value so it will cause breakage for all users that are relying on the default value, regardless of whether the forcing behavior is critical for them.  When fixing this breakage, it's good to check if the forcing behavior is indeed needed and potential disruption is OK.

This PR also makes unit test that do not need force behavior to do a regular enable file deletion.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12001

Reviewed By: ajkr

Differential Revision: D51214683

Pulled By: jowlyzhang

fbshipit-source-id: ca7b1ebf15c09eed00f954da2f75c00d2c6a97e4
2023-11-10 14:35:54 -08:00
Yueh-Hsuan Chiang 5ef92b8ea4 Add rocksdb_options_set_cf_paths (#11151)
Summary:
This PR adds a missing set function for rocksdb_options in the C-API:
rocksdb_options_set_cf_paths().  Without this function, users cannot
specify different paths for different column families as it will fall back
to db_paths.

As a bonus, this PR also includes rocksdb_sst_file_metadata_get_directory()
to the C api -- a missing public function that will also make the test easier to write.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11151

Test Plan: Augment existing c_test to verify the specified cf_path.

Reviewed By: hx235

Differential Revision: D51201888

Pulled By: ajkr

fbshipit-source-id: 62a96451f26fab60ada2005ede3eea8e9b431f30
2023-11-10 11:36:11 -08:00
Yueh-Hsuan Chiang 73d223c4e2 Add auto_tuned option to RateLimiter C API (#12058)
Summary:
#### Problem
While the RocksDB C API does have the RateLimiter API, it does not
expose the auto_tuned option.

#### Summary of Change
This PR exposes auto_tuned RateLimiter option in RocksDB C API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12058

Test Plan: Augment the C API existing test to cover the new API.

Reviewed By: cbi42

Differential Revision: D51201933

Pulled By: ajkr

fbshipit-source-id: 5bc595a9cf9f88f50fee797b729ba96f09ed8266
2023-11-10 09:53:09 -08:00
Yu Zhang dfaf4dc111 Stubs for piping write time (#12043)
Summary:
As titled. This PR contains the API and stubbed implementation for piping write time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12043

Reviewed By: pdillinger

Differential Revision: D51076575

Pulled By: jowlyzhang

fbshipit-source-id: 3b341263498351b9ccaff27cf35d5aeb5bdf0cf1
2023-11-09 15:58:07 -08:00
Yingchun Lai c4c62c2304 Support to use environment variable to test customer encryption plugins (#12025)
Summary:
The CreateEnvTest.CreateEncryptedFileSystem unit test is to verify the creation functionality of EncryptedFileSystem, but now it just support the builtin CTREncryptionProvider class.
This patch make it flexible to use environment variable `TEST_FS_URI`, it is useful to test customer encryption plugins.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12025

Reviewed By: anand1976

Differential Revision: D50799656

Pulled By: ajkr

fbshipit-source-id: dbcacfefbf07de9c7803f7707b34c5193bec17bf
2023-11-09 10:45:13 -08:00
brodyhuang e90e9825b4 Drop wal record when sequence is illegal (#11985)
Summary:
- Our database is corrupted, causing some sequences of wal record to be invalid (but the `record_checksum` looks fine).
- When we RecoverLogFiles in WALRecoveryMode::kPointInTimeRecovery, `assert(seq <= kMaxSequenceNumber)` will be failed.
- When it is found that sequence is illegal, can we drop the file  to recover as much data as possible ?  Thx !

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11985

Reviewed By: anand1976

Differential Revision: D50698039

Pulled By: ajkr

fbshipit-source-id: 1e42113b58823088d7c0c3a92af5b3efbb5f5296
2023-11-09 10:43:16 -08:00
Kasper Isager Dalsgarð f9b7877cf3 Ensure `target_include_directories()` is called with correct target name (#12055)
Summary:
`${PROJECT_NAME}` isn't guaranteed to match a target name when an artefact suffix is specified.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12055

Reviewed By: anand1976

Differential Revision: D51125532

Pulled By: ajkr

fbshipit-source-id: cd1f4a5b11eb517c379e3ee3f78592f7e606a034
2023-11-09 10:41:38 -08:00
Hui Xiao f337533b6f Ensure and clarify how RocksDB calls TablePropertiesCollector's functions (#12053)
Summary:
**Context/Summary:**
It's intuitive for users to assume `TablePropertiesCollector::Finish()` is called only once by RocksDB internal by the word "finish".

However, this is currently not true as RocksDB also calls this function in `BlockBased/PlainTableBuilder::GetTableProperties()` to populate user collected properties on demand.

This PR avoids that by moving that populating to where we first call `Finish()` (i.e, `NotifyCollectTableCollectorsOnFinish`)

Bonus: clarified in the API that `GetReadableProperties()` will be called after `Finish()` and added UT to ensure that.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12053

Test Plan:
- Modified test `DBPropertiesTest.GetUserDefinedTableProperties` to ensure `Finish()` only called once.
- Existing test particularly `db_properties_test, table_properties_collector_test` verify the functionality  `NotifyCollectTableCollectorsOnFinish` and `GetReadableProperties()` are not broken by this change.

Reviewed By: ajkr

Differential Revision: D51095434

Pulled By: hx235

fbshipit-source-id: 1c6275258f9b99dedad313ee8427119126817973
2023-11-08 14:00:36 -08:00
Peter Dillinger 65cde19f40 Safer wrapper for std::atomic, use in HCC (#12051)
Summary:
See new atomic.h file comments for motivation.

I have updated HyperClockCache to use the new atomic wrapper, fixing a few cases where an implicit conversion was accidentally used and therefore mixing std::memory_order_seq_cst where release/acquire ordering (or relaxed) was intended. There probably wasn't a real bug because I think all the cases happened to be in single-threaded contexts like constructors/destructors or statistical ops like `GetCapacity()` that don't need any particular ordering constraints.

Recommended follow-up:
* Replace other uses of std::atomic to help keep them safe from bugs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12051

Test Plan:
Did some local correctness stress testing with cache_bench. Also triggered 15 runs of fbcode_blackbox_crash_test and saw no related failures (just 3 failures in ~CacheWithSecondaryAdapter(), already known)

No performance difference seen before & after running simultaneously:
```
(while ./cache_bench -cache_type=fixed_hyper_clock_cache -populate_cache=0 -cache_size=3000000000 -ops_per_thread=500000 -threads=12 -histograms=0 2>&1 | grep parallel; do :; done) | awk '{ s += $3; c++; print "Avg time: " (s/c);}'
```

... for both fixed_hcc and auto_hcc.

Reviewed By: jowlyzhang

Differential Revision: D51090518

Pulled By: pdillinger

fbshipit-source-id: eeb324facb3185584603f9ea0c4de6f32919a2d7
2023-11-08 13:28:43 -08:00
Yingchun Lai e406c26c4e Update the API comments of NewRandomRWFile() (#11820)
Summary:
Env::NewRandomRWFile() will not create the file if it doesn't exist, as the test saying https://github.com/facebook/rocksdb/blob/main/env/env_test.cc#L2208.
This patch correct the comments of Env::NewRandomRWFile(), it may mislead the developers who use rocksdb Env() as an utility.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11820

Reviewed By: ajkr

Differential Revision: D50176707

Pulled By: jowlyzhang

fbshipit-source-id: a6ee469f549360de8d551a4fe8517b4450df7b15
2023-11-08 12:28:00 -08:00
Peter Dillinger 9af25a392b Clean up AutoHyperClockTable::PurgeImpl (#12052)
Summary:
There was some unncessary logic (e.g. a dead assignment to home_shift) left over from earlier revision of the code.

Also, rename confusing ChainRewriteLock::new_head_ / GetNewHead() to saved_head_ / GetSavedHead().

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12052

Test Plan: existing tests

Reviewed By: jowlyzhang

Differential Revision: D51091499

Pulled By: pdillinger

fbshipit-source-id: 4b191b60a2b16085681e59d49c4d97e802869db8
2023-11-07 16:35:19 -08:00
Zaidoon Abd Al Hadi 58f2a29fb4 Expose Options::periodic_compaction_seconds through C API (#12019)
Summary:
fixes [11090](https://github.com/facebook/rocksdb/issues/11090)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12019

Reviewed By: jaykorean

Differential Revision: D51076427

Pulled By: cbi42

fbshipit-source-id: de353ff66c7f73aba70ab3379e20d8c40f50d873
2023-11-07 12:46:50 -08:00
Alan Paxton c181667c4f FIX new blog post (JNI performance) Locate images correctly (#12050)
Summary:
We set up the images / references to the images wrongly in https://github.com/facebook/rocksdb/pull/11818
Images should be in the docs/static/images/… directory with an absolute reference to /static/images/…

Make it so.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12050

Reviewed By: pdillinger

Differential Revision: D51079811

Pulled By: jaykorean

fbshipit-source-id: 4c1ab80d313b70d0e60eec94086451d7b2814922
2023-11-07 11:58:58 -08:00
Guozhang Wu c06309c832 Not to print unnecessary commands in Makefile (#11978)
Summary:
When I run `make check`, there is a command that should not be printed to screen, which is shown below.

```text
... ...
  Generating parallel test scripts for util_merge_operators_test
  Generating parallel test scripts for write_batch_with_index_test
make[2]: Leaving directory '/home/z/rocksdb'
make[1]: Leaving directory '/home/z/rocksdb'
  GEN      check
make[1]: Entering directory '/home/z/rocksdb'
$DEBUG_LEVEL is 1, $LIB_MODE is shared
Makefile:185: Warning: Compiling in debug mode. Don't use the resulting binary in production
printf '%s\n' ''                                                \
  'To monitor subtest <duration,pass/fail,name>,'               \
  '  run "make watch-log" in a separate window' '';             \
{ \
        printf './%s\n' db_bloom_filter_test deletefile_test env_test c_test; \
        find t -name 'run-*' -print; \
} \
  | perl -pe 's,(^.*MySQLStyleTransactionTest.*$|^.*SnapshotConcurrentAccessTest.*$|^.*SeqAdvanceConcurrentTest.*$|^t/run-table_test-HarnessTest.Randomized$|^t/run-db_test-.*(?:FileCreationRandomFailure|EncodeDecompressedBlockSizeTest)$|^.*RecoverFromCorruptedWALWithoutFlush$),100 $1,' | sort -k1,1gr | sed 's/^[.0-9]* //'                             \
  | grep -E '.'                                 \
  | grep -E -v '"^$"'                                   \
  | build_tools/gnu_parallel -j100% --plain --joblog=LOG --eta --gnu \
    --tmpdir=/dev/shm/rocksdb.6lop '{} >& t/log-{/} || bash -c "cat t/log-{/}; exit $?"' ; \
parallel_retcode=$? ; \
awk '{ if ($7 != 0 || $8 != 0) { if ($7 == "Exitval") { h = $0; } else { if (!f) print h; print; f = 1 } } } END { if(f) exit 1; }' < LOG ; \
awk_retcode=$?; \
if [ $parallel_retcode -ne 0 ] || [ $awk_retcode -ne 0 ] ; then exit 1 ; fi

To monitor subtest <duration,pass/fail,name>,
  run "make watch-log" in a separate window

Computers / CPU cores / Max jobs to run
1:local / 16 / 16
```

The `printf` command will make the output confusing. It would be better not to print it.

**Before Change**

![image](https://github.com/facebook/rocksdb/assets/30565051/92cf681a-40b7-462e-ae5b-23eeacbb8f82)

**After Change**

![image](https://github.com/facebook/rocksdb/assets/30565051/4a70b04b-e4ef-4bed-9ce0-d942ed9d132e)

**Test Plan**

Not applicable. This is a trivial change, only to add a `@` before a Makefile command, and it will not impact any workflows.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11978

Reviewed By: jaykorean

Differential Revision: D51076606

Pulled By: cbi42

fbshipit-source-id: dc079ab8f60a5a5b9d04a83888884657b2e442ff
2023-11-07 11:44:20 -08:00
Peter Dillinger 16ae3548a2 AutoHCC: Improve/fix allocation/detection of grow homes (#12047)
Summary:
This change simplifies some code and logic by introducing a new atomic field that tracks the next slot to grow into. It should offer slightly better performance during the growth phase (not measurable; see Test Plan below) and fix a suspected (but unconfirmed) bug like this:
* Thread 1 is in non-trivial SplitForGrow() with grow_home=n.
* Thread 2 reaches Grow() with grow_home=2n, and waits at the start of SplitForGrow() for the rewrite lock on n. By this point, the head at 2n is marked with the new shift amount but no chain is locked.
* Thread 3 reaches Grow() with grow_home=4n, and waits before SplitForGrow() for the rewrite lock on n. By this point, the head at 4n is marked with the new shift amount but no chain is locked.
* Thread 4 reaches Grow() with grow_home=8n and meets no resistance to proceeding through a SplitForGrow() on an empty chain, permanently missing out on any entries from chain n that should have ended up here.

This is fixed by not updating the shift amount at the grow_home head until we have checked the preconditions that Grow()s feeding into this one have completed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12047

Test Plan:
Some manual cache_bench stress runs, and about 20 triggered runs of fbcode_blackbox_crash_test

No discernible performance difference on this benchmark, running before & after in parallel for a few minutes:
```
(while ./cache_bench -cache_type=auto_hyper_clock_cache -populate_cache=0 -cache_size=3000000000 -ops_per_thread=50000 -threads=12 -histograms=0 2>&1 | grep parallel; do :; done) | awk '{ s += $3; c++; print "Avg time: " (s/c);}'
```

Reviewed By: jowlyzhang

Differential Revision: D51017007

Pulled By: pdillinger

fbshipit-source-id: 5f6d6a6194fc966f94693f3205ed75c87cdad269
2023-11-07 10:40:39 -08:00
Jay Huh 2adef5367a AttributeGroups - PutEntity Implementation (#11977)
Summary:
Write Path for AttributeGroup Support. The new `PutEntity()` API uses `WriteBatch` and atomically writes WideColumns entities in multiple Column Families.

Combined the release note from PR https://github.com/facebook/rocksdb/issues/11925

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11977

Test Plan:
- `DBWideBasicTest::MultiCFMultiGetEntityAsPinnableAttributeGroups` updated
- `WriteBatchTest::AttributeGroupTest` added
- `WriteBatchTest::AttributeGroupSavePointTest` added

Reviewed By: ltamasi

Differential Revision: D50457122

Pulled By: jaykorean

fbshipit-source-id: 4997b265e415588ce077933082dcd1ac3eeae2cd
2023-11-06 16:52:51 -08:00
Peter Dillinger 92dc5f3e67 AutoHCC: fix a bug with "blind" Insert (#12046)
Summary:
I have finally tracked down and fixed a bug affecting AutoHCC that was causing CI crash test assertion failures in AutoHCC when using secondary cache, but I was only able to reproduce locally a couple of times, after very long runs/repetitions.

It turns out that the essential feature used by secondary cache to trigger the bug is Insert without keeping a handle, which is otherwise rarely used in RocksDB and not incorporated into cache_bench (also used for targeted correctness stress testing) until this change (new option `-blind_insert_percent`).

The problem was in copying some logic from FixedHCC that makes the entry "sharable" but unreferenced once populated, if no reference is to be saved. The problem in AutoHCC is that we can only add the entry to a chain after it is in the sharable state, and must be removed from the chain while in the "under (de)construction" state and before it is back in the "empty" state. Also, it is possible for Lookup to find entries that are not connected to any chain, by design for efficiency, and for Release to erase_if_last_ref. Therefore, we could have
* Thread 1 starts to Insert a cache entry without keeping ref, and pauses before adding to the chain.
* Thread 2 finds it with Lookup optimizations, and then does Release with `erase_if_last_ref=true` causing it to trigger erasure on the entry. It successfully locks the home chain for the entry and purges any entries pending erasure. It is OK that this entry is not found on the chain, as another thread is allowed to remove it from the chain before we are able to (but after is it marked for (de)construction). And after the purge of the chain, the entry is marked empty.
* Thread 1 resumes in adding the slot (presumed entry) to the home chain for what was being inserted, but that now violates invariants and sets up a race or double-chain-reference as another thread could insert a new entry in the slot and try to insert into a different chain.

This is easily fixed by holding on to a reference until inserted onto the chain.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12046

Test Plan:
As I don't have a reliable local reproducer, I triggered 20 runs of internal CI on fbcode_blackbox_crash_test that were previously failing in AutoHCC with about 1/3 probability, and they all passed.

Also re-enabling AutoHCC in the crash test with this change. (Revert https://github.com/facebook/rocksdb/issues/12000)

Reviewed By: jowlyzhang

Differential Revision: D51016979

Pulled By: pdillinger

fbshipit-source-id: 3840fb829d65b97c779d8aed62a4a4a433aeff2b
2023-11-06 16:06:01 -08:00
Jay Huh 0ecfc4fbb4 AttributeGroups - GetEntity Implementation (#11943)
Summary:
Implementation of `GetEntity()` API that returns wide-column entities as AttributeGroups from multiple column families for a single key. Regarding the definition of Attribute groups, please see the detailed example description in PR https://github.com/facebook/rocksdb/issues/11925

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11943

Test Plan:
- `DBWideBasicTest::GetEntityAsPinnableAttributeGroups` added

will enable the new API in the `db_stress` after merging

Reviewed By: ltamasi

Differential Revision: D50195794

Pulled By: jaykorean

fbshipit-source-id: 218d54841ac7e337de62e13b1233b0a99bd91af3
2023-11-06 15:04:41 -08:00
Jay Huh 2dab137182 Mark more files for periodic compaction during offpeak (#12031)
Summary:
- The struct previously named `OffpeakTimeInfo` has been renamed to `OffpeakTimeOption` to indicate that it's a user-configurable option. Additionally, a new struct, `OffpeakTimeInfo`, has been introduced, which includes two fields: `is_now_offpeak` and `seconds_till_next_offpeak_start`. This change prevents the need to parse the `daily_offpeak_time_utc` string twice.
- It's worth noting that we may consider adding more fields to the `OffpeakTimeInfo` struct, such as `elapsed_seconds` and `total_seconds`, as needed for further optimization.
- Within `VersionStorageInfo::ComputeFilesMarkedForPeriodicCompaction()`, we've adjusted the `allowed_time_limit` to include files that are expected to expire by the next offpeak start.
- We might explore further optimizations, such as evenly distributing files to mark during offpeak hours, if the initial approach results in marking too many files simultaneously during the first scoring in offpeak hours. The primary objective of this PR is to prevent periodic compactions during non-offpeak hours when offpeak hours are configured. We'll start with this straightforward solution and assess whether it suffices for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12031

Test Plan:
Unit Tests added
- `DBCompactionTest::LevelPeriodicCompactionOffpeak` for Leveled
- `DBTestUniversalCompaction2::PeriodicCompaction` for Universal

Reviewed By: cbi42

Differential Revision: D50900292

Pulled By: jaykorean

fbshipit-source-id: 267e7d3332d45a5d9881796786c8650fa0a3b43d
2023-11-06 11:43:59 -08:00
Peter Dillinger a399bbc037 More fixes and enhancements for cache_bench (#12041)
Summary:
Mostly things for using cache_bench for stress/correctness testing.
* Make secondary_cache_uri option work with HCC (forgot to update when secondary support was added for HCC)
* Add -pinned_ratio option to keep more than just one entry per thread pinned. This can be important for testing eviction stress.
* Add -vary_capacity_ratio for testing dynamically changing capacity.

Also added some overrides to CacheWrapper to help with diagnostic output.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12041

Test Plan: manual, make check

Reviewed By: jowlyzhang

Differential Revision: D51013430

Pulled By: pdillinger

fbshipit-source-id: 7914adc1218f0afacace05ccd77d3bfb91a878d0
2023-11-06 09:59:09 -08:00
Alan Paxton 6979e9dc6a Create blog post from report on JNI performance work (#11818)
Summary:
We did some investigation into the performance of JNI for workloads emulating how data is carried between Java and C++
for RocksDB. The repo for our performance work lives at https://github.com/evolvedbinary/jni-benchmarks

This is a report text from that work, extracted as a blog post.
Along with some supporting files (png, pdf of graphs).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11818

Reviewed By: jaykorean

Differential Revision: D50907467

Pulled By: pdillinger

fbshipit-source-id: ec6a43c83bd9ad94a3d11cfd87031e613acf7659
2023-11-06 09:15:00 -08:00
Changyu Bi 520c64fd2e Add missing status check in ExternalSstFileIngestionJob and ImportColumnFamilyJob (#12042)
Summary:
.. and update some unit tests that failed with this change. See comment in ExternalSSTFileBasicTest.IngestFileWithCorruptedDataBlock for more explanation.

The missing status check is not caught by `ASSERT_STATUS_CHECKED=1` due to this line: 8505b26db1/table/block_based/block.h (L394). Will explore if we can remove it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12042

Test Plan: existing unit tests.

Reviewed By: ajkr

Differential Revision: D50994769

Pulled By: cbi42

fbshipit-source-id: c91615bccd6094a91634c50b98401d456cbb927b
2023-11-06 07:41:36 -08:00
马越 19768a923a Add jni Support for API CreateColumnFamilyWithImport (#11646)
Summary:
- Add the following missing options to src/main/java/org/rocksdb/ImportColumnFamilyOptions.java and in java/rocksjni/import_column_family_options.cc in RocksJava.
- Add the struct to src/main/java/org/rocksdb/ExportImportFilesMetaData.java and in java/rocksjni/export_import_files_metadatajni.cc in RocksJava.
- Add New Java API `createColumnFamilyWithImport` to src/main/java/org/rocksdb/RocksDB.java
- Add New Java API `exportColumnFamily` to src/main/java/org/rocksdb/Checkpoint.java

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11646

Test Plan:
- added unit tests for exportColumnFamily in org.rocksdb.CheckpointTest
- added unit tests for createColumnFamilyWithImport to org.rocksdb.ImportColumnFamilyTest

Reviewed By: ajkr

Differential Revision: D50889700

Pulled By: cbi42

fbshipit-source-id: d623b35e445bba62a0d3c007d74352e937678f6c
2023-11-06 07:38:42 -08:00
Changyu Bi b48480cfd0 Enable `TestIterateAgainstExpected()` in more crash tests (#12040)
Summary:
db_stress flag `verify_iterator_with_expected_state_one_in` is only enabled for in crash test if --simple flag is set. This PR enables it for all supported crash tests by enabling it by default. This adds coverage for --txn and --enable_ts crash tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12040

Test Plan:
ran crash tests that disabled this flag before for a few hours
```
python3 ./tools/db_crashtest.py blackbox --verify_iterator_with_expected_state_one_in=1 --txn --txn_write_policy=[0,1,2]

python3 ./tools/db_crashtest.py blackbox --verify_iterator_with_expected_state_one_in=1 --enable_ts
```

Reviewed By: ajkr, hx235

Differential Revision: D50980001

Pulled By: cbi42

fbshipit-source-id: 3daf6b4c32bdddc5df057240068162aa1a907587
2023-11-03 16:27:11 -07:00
Changyu Bi 8505b26db1 Fix stress test error message for black/whitebox test to catch failures (#12039)
Summary:
black/whitebox crash test relies on error/fail keyword in stderr to catch stress test failure. If a db_stress run prints an error message without these keyword, and then is killed before it graceful exits and prints out "Verification failed" here (2648e0a747/db_stress_tool/db_stress_driver.cc (L256)), the error won't be caught. This is more likely to happen if db_stress is printing a stack trace. This PR fixes some error messages. Ideally in the future we should not rely on searching for keywords in stderr to determine failed stress tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12039

Test Plan:
```
Added the following change on top of this PR to simulate exit without relevant keyword:

@@ -1586,6 +1587,8 @@ class NonBatchedOpsStressTest : public StressTest {
     assert(thread);
     assert(!rand_column_families.empty());
     assert(!rand_keys.empty());
+    fprintf(stderr, "Inconsistency");
+    thread->shared->SafeTerminate();

python3 ./tools/db_crashtest.py blackbox --simple --verify_iterator_with_expected_state_one_in=1 --interval=10

will print a stack trace but continue to run db_stress.
```

Reviewed By: jaykorean

Differential Revision: D50960076

Pulled By: cbi42

fbshipit-source-id: 5c60a1be04ce4a43adbd33f040d54434f2ae24c9
2023-11-03 09:53:22 -07:00
914022466 2648e0a747 Fix a bug when ingest plaintable sst file (#11969)
Summary:
Plaintable doesn't support SeekToLast. And GetIngestedFileInfo is using SeekToLast without checking the validity.

We are using IngestExternalFile or CreateColumnFamilyWithImport with some sst file in PlainTable format . But after running for a while, compaction error often happens. Such as
![image](https://github.com/facebook/rocksdb/assets/13954644/b4fa49fc-73fc-49ce-96c6-f198a30800b8)

I simply add some std::cerr log to find why.
![image](https://github.com/facebook/rocksdb/assets/13954644/2cf1d5ff-48cc-4125-b917-87090f764fcd)
It shows that the smallest key is always equal to largest key.
![image](https://github.com/facebook/rocksdb/assets/13954644/6d43e978-0be0-4306-aae3-f9e4ae366395)
Then I found the root cause is that PlainTable do not support SeekToLast, so the smallest key is always the same with the largest

I try to write an unit test. But it's not easy to reproduce this error.
(This PR is similar to https://github.com/facebook/rocksdb/pull/11266. Sorry for open another PR)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11969

Reviewed By: ajkr

Differential Revision: D50933854

Pulled By: cbi42

fbshipit-source-id: 6c6af53c1388922cbabbe64ed3be1cdc58df5431
2023-11-02 13:45:37 -07:00
Yu Zhang a42910537d Save the correct user comparator name in OPTIONS file (#12037)
Summary:
I noticed the user comparator name in OPTIONS file can be incorrect when working on a recent stress test failure. The name of the comparator retrieved via the "Comparator::GetRootComparator" API is saved in OPTIONS file as the user comparator. The intention was to get the user comparator wrapped in the internal comparator. However `ImmutableCFOptions.user_comparator` has always been a user comparator of type `Comparator`. The corresponding `GetRootComparator` API is also defined only for user comparator type `Comparator`, not the internal key comparator type `InternalKeyComparator`.

For built in comparator `BytewiseComparator` and `ReverseBytewiseComparator`, there is no difference between `Comparator::Name` and `Comparator::GetRootComparator::Name` because these built in comparators' root comparator is themselves. However, for built in comparator `BytewiseComparatorWithU64Ts` and `ReverseBytewiseComparatorWithU64Ts`, there are differences. So this change update the logic to persist the user comparator's name, not its root comparator's name.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12037

Test Plan:
The restore flow in stress test, which relies on converting Options object to string and back to Options object is updated to help validate comparator object can be correctly serialized and deserialized with the OPTIONS file mechanism

Updated unit test to use a comparator that has a root comparator that is not itself.

Reviewed By: cbi42

Differential Revision: D50909750

Pulled By: jowlyzhang

fbshipit-source-id: 9086d7135c7a6f4b5565fb47fce194ea0a024f52
2023-11-02 13:27:59 -07:00
马越 8e1adab5ce add RocksDB#clipColumnFamily to Java API (#11868)
Summary:
### main change:

- add java clipColumnFamily api in Rocksdb.java
The method signature of the new API is
 ```
 public void clipColumnFamily(final ColumnFamilyHandle columnFamilyHandle, final byte[] beginKey,
      final byte[] endKey)
```
### Test
add unit test RocksDBTest#clipColumnFamily()

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11868

Reviewed By: jaykorean

Differential Revision: D50889783

Pulled By: cbi42

fbshipit-source-id: 7f545171ad9adb9c20bdd92efae2e6bc55d5703f
2023-11-02 08:00:08 -07:00
Yu Zhang 4b013dcbed Remove VersionEdit's friends pattern (#12024)
Summary:
Almost each of VersionEdit private member has its own getter and setter. Current code access them with a combination of directly accessing private members and via getter and setters. There is no obvious benefits to have this pattern except potential performance gains. I tried this simple benchmark for removing the friends pattern completely, and there is no obvious regression. So I think it would good to remove VersionEdit's friends completely.

```TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num_column_families=10 -num=50000000```

With change:
fillseq      :       2.994 micros/op 333980 ops/sec 149.710 seconds 50000000 operations;   36.9 MB/s
fillseq      :       3.033 micros/op 329656 ops/sec 151.673 seconds 50000000 operations;   36.5 MB/s
fillseq      :       2.991 micros/op 334369 ops/sec 149.535 seconds 50000000 operations;   37.0 MB/s
Without change:
fillseq      :       3.015 micros/op 331715 ops/sec 150.732 seconds 50000000 operations;   36.7 MB/s
fillseq      :       3.044 micros/op 328553 ops/sec 152.182 seconds 50000000 operations;   36.3 MB/s
fillseq      :       3.091 micros/op 323520 ops/sec 154.550 seconds 50000000 operations;   35.8 MB/s

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12024

Reviewed By: pdillinger

Differential Revision: D50806066

Pulled By: jowlyzhang

fbshipit-source-id: 35d287ce638a38c30f243f85992e615b4c90eb27
2023-11-01 12:04:11 -07:00
Jay Huh 04225a2cfa Fix for RecoverFromRetryableBGIOError starting with recovery_in_prog_ false (#11991)
Summary:
cbi42 helped investigation and found a potential scenario where `RecoverFromRetryableBGIOError()` may start with `recovery_in_prog_ ` set as false. (and other booleans like `bg_error_` and `soft_error_no_bg_work_`)

**Thread 1**
- `StartRecoverFromRetryableBGIOError()`): (mutex held) sets `recovery_in_prog_ = true`

**Thread 1's `recovery_thread_`**
- (waits for mutex and acquires it)
- `RecoverFromRetryableBGIOError()` -> `ResumeImpl()` -> `ClearBGError()`: sets `recovery_in_prog_ = false`
- `ClearBGError()` -> `NotifyOnErrorRecoveryEnd()`: releases `mutex`

**Thread 2**
- `StartRecoverFromRetryableBGIOError()`): (mutex held) sets `recovery_in_prog_ = true`
- Waits for Thread 1 (`recovery_thread_`) to finish

**Thread 1's `recovery_thread_`**
- re-lock mutex in `NotifyOnErrorRecoveryEnd()`
- Still inside `RecoverFromRetryableBGIOError()`: sets `recovery_in_prog_ = false`
- Done

**Thread 2's `recovery_thread_`**
- recovery thread started with `recovery_in_prog_` set as `false`

# Fix
- Remove double-clearing `bg_error_`,  `recovery_in_prog_` and other fields after `ResumeImpl()` already returned `OK()`.
- Minor typo and linter fixes in `DBErrorHandlingFSTest`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11991

Test Plan:
- `DBErrorHandlingFSTest::MultipleRecoveryThreads` added to reproduce the scenario.
- Adding `assert(recovery_in_prog_);` at the start of `ErrorHandler::RecoverFromRetryableBGIOError()` fails the test without the fix and succeeds with the fix as expected.

Reviewed By: cbi42

Differential Revision: D50506113

Pulled By: jaykorean

fbshipit-source-id: 6dabe01e9ecd3fc50bbe9019587f2f4858bed9c6
2023-10-31 16:13:36 -07:00
Yu Zhang 0b057a7acc Initialize comparator explicitly in PrepareOptionsForRestoredDB() (#12034)
Summary:
This is to fix below error seeing in stress test:
```
Failure in DB::Open in backup/restore with: Invalid argument: Cannot open a column family and disable user-defined timestamps feature if its existing persist_user_defined_timestamps flag is not false.
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12034

Reviewed By: cbi42

Differential Revision: D50860689

Pulled By: jowlyzhang

fbshipit-source-id: ebc6cf0a75caa43d3d3bd58e3d5c2ac754cc637c
2023-10-31 16:10:48 -07:00
Adam Retter e0c45c15a7 Fix the ZStd checksum (#12005)
Summary:
Somehow we had the wrong checksum when validating the ZStd 1.5.5 download for RocksJava in the previous Pull Request - https://github.com/facebook/rocksdb/pull/9304. This PR fixes that.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12005

Reviewed By: jaykorean

Differential Revision: D50840338

Pulled By: cbi42

fbshipit-source-id: 8a92779d3bef013d812eecb89aaaf33fc73991ec
2023-10-31 12:23:34 -07:00
Changyu Bi 2818a74b95 Initialize merge operator explicitly in PrepareOptionsForRestoredDB() (#12033)
Summary:
We are seeing the following stress test failure: `Failure in DB::Get in backup/restore with: Invalid argument: merge_operator is not properly initialized. Verification failed: Backup/restore failed: Invalid argument: merge_operator is not properly initialized.`. The reason is likely that `GetColumnFamilyOptionsFromString()` does not set merge operator if it's a customized merge operator. Fixing it by initializing merge operator explicitly.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12033

Test Plan:
this repro gives the error consistently before this PR
```
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_concurrent_memtable_write=1 --allow_data_in_errors=True --async_io=0 --atomic_flush=1 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=1048576000000 --backup_one_in=50 --batch_protection_bytes_per_key=8 --block_protection_bytes_per_key=2 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=31.014388066505518 --bottommost_compression_type=lz4hc --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=33554432 --cache_type=fixed_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_readahead_size=0 --compaction_ttl=10 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=1 --compression_max_dict_buffer_bytes=4095 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=0 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=0 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=2 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=2 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --long_running_snapshots=1 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=8388608 --memtable_max_range_deletions=1000 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=1 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=0 --readpercent=50 --recycle_log_file_num=0 --reopen=0 --secondary_cache_fault_one_in=0 --set_options_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=1 --sync=0 --sync_fault_injection=1 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=10 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=zstd --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=35
```

Reviewed By: hx235

Differential Revision: D50825558

Pulled By: cbi42

fbshipit-source-id: 8468dc0444c112415a515af8291ef3abec8a42de
2023-10-31 07:39:41 -07:00
Yingchun Lai 76402c034e Fix incorrect parameters order in env_basic_test.cc (#11997)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11997

Reviewed By: jaykorean

Differential Revision: D50608182

Pulled By: ajkr

fbshipit-source-id: d33cfdb5adfea91175c8fa21e8b80e22f728f6c6
2023-10-30 10:47:04 -07:00
Radek Hubner b3fd3838d4 Remove build dependencies for java tests. (#12021)
Summary:
Final fix for https://github.com/facebook/rocksdb/issues/12013

- Reverting back changes on CirleCI explicit image declaration.
- Removed CMake dependencies between java classed and java test classes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12021

Reviewed By: akankshamahajan15

Differential Revision: D50745392

Pulled By: ajkr

fbshipit-source-id: 6a7a1da1e7e4da8da72130c9272915974e10fffc
2023-10-30 10:08:19 -07:00
Yu Zhang 60df39e530 Rate limiting stale sst files' deletion during recovery (#12016)
Summary:
As titled. If SstFileManager is available, deleting stale sst files will be delegated to it so it can be rate limited.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12016

Reviewed By: hx235

Differential Revision: D50670482

Pulled By: jowlyzhang

fbshipit-source-id: bde5b76ea1d98e67f6b4f08bfba3db48e46aab4e
2023-10-28 09:50:52 -07:00
Hui Xiao 212b5bf826 Deep-copy Options in restored db for stress test to avoid race with SetOptions() (#12015)
Summary:
**Context**
DB open will persist the `Options` in memory to options file and verify the file right after the write. The verification is done by comparing the options from parsing the written options file against the `Options` object in memory. Upon inconsistency, corruption such as https://github.com/facebook/rocksdb/blob/main/options/options_parser.cc#L725 will be returned.

This verification assumes the `Options` object in memory is not changed from before the write till the verification. This assumption can break during [opening the restored db in stress test](0f141352d8/db_stress_tool/db_stress_test_base.cc (L1784-L1799)).

This [line](0f141352d8/db_stress_tool/db_stress_test_base.cc (L1770)) makes it shares some pointer options (e.g, `std::shared_ptr<const FilterPolicy> filter_policy`) with other threads (e.g, SetOptions()) in db stress.

And since https://github.com/facebook/rocksdb/pull/11838, filter_policy's field `bloom_before_level ` has now been mutable by SetOptions(). Therefore we started to see stress test failure like below:

```
Failure in DB::Open in backup/restore with: IO error: DB::Open() failed --- Unable to persist Options file: IO error: Unable to persist options.: Corruption: [RocksDBOptionsParser]:failed the verification on BlockBasedTable::: filter_policy.id

Verification failed: Backup/restore failed: IO error: DB::Open() failed --- Unable to persist Options file: IO error: Unable to persist options.: Corruption: [RocksDBOptionsParser]:failed the verification on BlockBasedTable::: filter_policy.id

db_stress: db_stress_tool/db_stress_test_base.cc:479: void rocksdb::StressTest::ProcessStatus(rocksdb::SharedState*, std::string, rocksdb::Status) const: Assertion `false' failed.
```

**Summary**
This PR uses "deep copy" of the `options_` by CreateXXXFromString() to avoid sharing pointer options.

**Test plan**
Run the below db stress command that failed before this PR and pass after
```
./db_stress --column_families=1 --threads=2 --preserve_unverified_changes=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=10 --batch_protection_bytes_per_key=0 --block_protection_bytes_per_key=1 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=0 --bottommost_compression_type=disable --bottommost_file_compaction_delay=86400 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=33554432 --cache_type=tiered_auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=2 --compaction_readahead_size=0 --compaction_ttl=0 --compressed_secondary_cache_ratio=0.3333333333333333 --compressed_secondary_cache_size=0 --compression_checksum=1 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=lz4 --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=0 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --flush_one_in=1000000 --format_version=2 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=14 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --long_running_snapshots=1 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=2500 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=0 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=500000 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=2 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=0 --readpercent=50 --recycle_log_file_num=0 --reopen=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=5 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=600 --subcompactions=3 --sync=0 --sync_fault_injection=0 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=2 --unpartitioned_pinning=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=zstd --write_buffer_size=4194304 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=35
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12015

Reviewed By: pdillinger

Differential Revision: D50666136

Pulled By: hx235

fbshipit-source-id: 804acc23aecb4eedfe5c44f732e86291f2420b2b
2023-10-27 17:07:39 -07:00
Jay Huh e230e4d248 Make OffpeakTimeInfo available in VersionSet (#12018)
Summary:
As mentioned in  https://github.com/facebook/rocksdb/issues/11893, we are going to use the offpeak time information to pre-process TTL-based compactions. To do so, we need to access `daily_offpeak_time_utc` in `VersionStorageInfo::ComputeCompactionScore()` where we pick the files to compact. This PR is to make the offpeak time information available at the time of compaction-scoring. We are not changing any compaction scoring logic just yet. Will follow up in a separate PR.

There were two ways to achieve what we want.
1.  Make `MutableDBOptions` available in `ColumnFamilyData` and `ComputeCompactionScore()` take `MutableDBOptions` along with `ImmutableOptions` and `MutableCFOptions`.
2. Make `daily_offpeak_time_utc` and `IsNowOffpeak()` available in `VersionStorageInfo`.

We chose the latter as it involves smaller changes.

This change includes the following
- Introduction of `OffpeakTimeInfo` and `IsNowOffpeak()` has been moved from `MutableDBOptions`
- `OffpeakTimeInfo` added to `VersionSet` and it can be set during construction and by `ChangeOffpeakTimeInfo()`
- During `SetDBOptions()`, if offpeak time info needs to change, it calls `MaybeScheduleFlushOrCompaction()` to re-compute compaction scores and process compactions as needed

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12018

Test Plan:
- `DBOptionsTest::OffpeakTimes` changed to include checks for `MaybeScheduleFlushOrCompaction()` calls and `VersionSet`'s OffpeakTimeInfo value change during `SetDBOptions()`.
- `VersionSetTest::OffpeakTimeInfoTest` added to test `ChangeOffpeakTimeInfo()`. `IsNowOffpeak()` tests moved from `DBOptionsTest::OffpeakTimes`

Reviewed By: pdillinger

Differential Revision: D50723881

Pulled By: jaykorean

fbshipit-source-id: 3cff0291936f3729c0e9c7750834b9378fb435f6
2023-10-27 15:56:48 -07:00
Yu Zhang 526f36b483 Remove extra semicolon (#12017)
Summary:
As titled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12017

Reviewed By: hx235

Differential Revision: D50670406

Pulled By: jowlyzhang

fbshipit-source-id: 28b3acd930ee676d78ebb47144047ce233fc11c5
2023-10-25 17:48:21 -07:00
anand76 52be8f54f2 Add APIs to query secondary cache capacity and usage for TieredCache (#12011)
Summary:
In `TieredCache`, the underlying compressed secondary cache is hidden from the user. So we need a way to query the capacity, as well as the portion of cache reservation charged to the compressed secondary cache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12011

Test Plan: Update the unit tests

Reviewed By: akankshamahajan15

Differential Revision: D50651943

Pulled By: anand1976

fbshipit-source-id: 06d1cb5edb75a790c919bce718e2ff65f5908220
2023-10-25 16:54:50 -07:00
Radek Hubner 8ee009f0d8 Downgrade windows 2019 build to older image. (#12014)
Summary:
This should fix failed java windows build  https://github.com/facebook/rocksdb/issues/12013

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12014

Reviewed By: ajkr

Differential Revision: D50664503

Pulled By: akankshamahajan15

fbshipit-source-id: 3466ce42d3cae3f8e0beba88a18744d647a32a2c
2023-10-25 15:43:05 -07:00
Hui Xiao 0f141352d8 Fix race between flush error recovery and db destruction (#12002)
Summary:
**Context:**
DB destruction will wait for ongoing error recovery through `EndAutoRecovery()` and join the recovery thread: 519f2a41fb/db/db_impl/db_impl.cc (L525) -> 519f2a41fb/db/error_handler.cc (L250) -> 519f2a41fb/db/error_handler.cc (L808-L823)

However, due to a race between flush error recovery and db destruction, recovery can actually start after such wait during the db shutdown. The consequence is that the recovery thread created as part of this recovery will not be properly joined upon its destruction as part the db destruction. It then crashes the program as below.

```
std::terminate()
std::default_delete<std::thread>::operator()(std::thread*) const
std::unique_ptr<std::thread, std::default_delete<std::thread>>::~unique_ptr()
rocksdb::ErrorHandler::~ErrorHandler() (rocksdb/db/error_handler.h:31)
rocksdb::DBImpl::~DBImpl() (rocksdb/db/db_impl/db_impl.cc:725)
rocksdb::DBImpl::~DBImpl() (rocksdb/db/db_impl/db_impl.cc:725)
rocksdb::DBTestBase::Close() (rocksdb/db/db_test_util.cc:678)
```

**Summary:**
This PR fixed it by considering whether EndAutoRecovery() has been called before creating such thread. This fix is similar to how we currently [handle](519f2a41fb/db/error_handler.cc (L688-L694)) such case inside the created recovery thread.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12002

Test Plan: A new UT repro-ed the crash before this fix and and pass after.

Reviewed By: ajkr

Differential Revision: D50586191

Pulled By: hx235

fbshipit-source-id: b372f6d7a94eadee4b9283b826cc5fb81779a093
2023-10-25 11:59:09 -07:00