rocksdb

Commit Graph

Author	SHA1	Message	Date
Gus Wynn	6d10f8d690	add WriteBufferManager to c api (#11710 ) Summary: I want to use the `WriteBufferManager` in my rust project, which requires exposing it through the c api, just like `Cache` is. Hopefully the changes are fairly straightfoward! Pull Request resolved: https://github.com/facebook/rocksdb/pull/11710 Reviewed By: cbi42 Differential Revision: D51166518 Pulled By: ajkr fbshipit-source-id: cd266ff1e4a7ab145d05385cd125a8390f51f3fc	2023-11-16 10:34:00 -08:00
Andrew Kryczka	9202db1867	Consider archived WALs for deletion more frequently (#12069 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/11000. That issue pointed out that RocksDB was slow to delete archived WALs in case time-based and size-based expiration were enabled, and the time-based threshold (`WAL_ttl_seconds`) was small. This PR prevents the delay by taking into account `WAL_ttl_seconds` when deciding the frequency to process archived WALs for deletion. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12069 Reviewed By: pdillinger Differential Revision: D51262589 Pulled By: ajkr fbshipit-source-id: e65431a06ee96f4c599ba84a27d1aedebecbb003	2023-11-15 15:42:28 -08:00
Changyu Bi	e7896f03ad	Enable unit test `PrecludeLastLevelTest.RangeDelsCauseFileEndpointsToOverlap` (#12064 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/11909. The test passes after the change in https://github.com/facebook/rocksdb/issues/11917 to start mock clock from a non-zero time. The reason for test failing is a bit complicated: - The Put here `e4ad4a0ef1/db/compaction/tiered_compaction_test.cc (L2045)` happens before mock clock advances beyond 0. - This causes oldest_key_time_ to be 0 for memtable. - oldest_ancester_time of the first L0 file becomes 0 - L0 -> L5/6 compaction output files sets `oldest_ancestoer_time` to the current time due to these lines: `509947ce2c/db/compaction/compaction_job.cc (L1898C34-L1904)`. - This causes some small sequence number to be mapped to current time: `509947ce2c/db/compaction/compaction_job.cc (L301)` - Keys in L6 is being moved up to L5 due to the unexpected seqno_to_time mapping - When compacting keys from last level to the penultimate level, we only check keys to be within user key range of penultimate level input files. If we compact the following file 3 with file 1 and output keys to L5, we can get the reported inconsistency bug. ``` L5: file 1 [K5@20, K10@kMaxSeqno], file 2 [K10@30, K14@34) L6: file 3 [K6@5, K10@20] ``` https://github.com/facebook/rocksdb/issues/12063 will add fixes to check internal key range when compacting keys from last level up to the penultimate level. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12064 Test Plan: the unit test passes Reviewed By: ajkr Differential Revision: D51281149 Pulled By: cbi42 fbshipit-source-id: 00b7f026c453454d9f3af5b2de441383a96f0c62	2023-11-13 15:26:52 -08:00
Jay Huh	8b8f6c63ef	ColumnFamilyHandle Nullcheck in GetEntity and MultiGetEntity (#12057 ) Summary: - Add missing null check for ColumnFamilyHandle in `GetEntity()` - `FailIfCfHasTs()` now returns `Status::InvalidArgument()` if `column_family` is null. `MultiGetEntity()` can rely on this for cfh null check. - Added `DeleteRange` API using Default Column Family to be consistent with other major APIs (This was also causing Java Test failure after the `FailIfCfHasTs()` change) Pull Request resolved: https://github.com/facebook/rocksdb/pull/12057 Test Plan: - Updated `DBWideBasicTest::GetEntityAsPinnableAttributeGroups` to include null CF case - Updated `DBWideBasicTest::MultiCFMultiGetEntityAsPinnableAttributeGroups` to include null CF case Reviewed By: jowlyzhang Differential Revision: D51167445 Pulled By: jaykorean fbshipit-source-id: 1c1e44fd7b7df4d2dc3bb2d7d251da85bad7d664	2023-11-13 14:30:04 -08:00
leipeng	b3ffca0e29	DBImpl::DelayWrite: Remove bad WRITE_STALL histogram (#12067 ) Summary: When delay didn't happen, histogram WRITE_STALL is still recorded, and ticker STALL_MICROS is not recorded. This is a bug, neither WRITE_STALL or STALL_MICROS should not be recorded when delay did not happen. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12067 Reviewed By: cbi42 Differential Revision: D51263133 Pulled By: ajkr fbshipit-source-id: bd82d8328fe088d613991966e83854afdabc6a25	2023-11-13 12:48:44 -08:00
Yu Zhang	509947ce2c	Quarantine files in a limbo state after a manifest error (#12030 ) Summary: Part of the procedures to handle manifest IO error is to disable file deletion in case some files in limbo state get deleted prematurely. This is not ideal because: 1) not all the VersionEdits whose commit encounter such an error contain updates for files, disabling file deletion sometimes are not necessary. 2) `EnableFileDeletion` has a force mode that could make other threads accidentally disrupt this procedure in recovery. 3) Disabling file deletion as a whole is also not as efficient as more precisely tracking impacted files from being prematurely deleted. This PR replaces this mechanism with tracking such files and quarantine them from being deleted in `ErrorHandler`. These are the types of files being actively tracked in quarantine in this PR: 1) new table files and blob files from a background job 2) old manifest file whose immediately following new manifest file's CURRENT file creation gets into unclear state. Current handling is not sufficient to make sure the old manifest file is kept in case it's needed. Note that WAL logs are not part of the quarantine because `min_log_number_to_keep` is a safe mechanism and it's only updated after successful manifest commits so it can prevent this premature deletion issue from happening. We track these files' file numbers because they share the same file number space. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12030 Test Plan: Modified existing unit tests Reviewed By: ajkr Differential Revision: D51036774 Pulled By: jowlyzhang fbshipit-source-id: 84ef26271fbbc888ef70da5c40fe843bd7038716	2023-11-11 08:11:11 -08:00
Yu Zhang	c6c683a0ca	Remove the default force behavior for `EnableFileDeletion` API (#12001 ) Summary: Disabling file deletion can be critical for operations like making a backup, recovery from manifest IO error (for now). Ideally as long as there is one caller requesting file deletion disabled, it should be kept disabled until all callers agree to re-enable it. So this PR removes the default forcing behavior for the `EnableFileDeletion` API, and users need to explicitly pass the argument if they insisted on doing so knowing the consequence of what can be potentially disrupted. This PR removes the API's default argument value so it will cause breakage for all users that are relying on the default value, regardless of whether the forcing behavior is critical for them. When fixing this breakage, it's good to check if the forcing behavior is indeed needed and potential disruption is OK. This PR also makes unit test that do not need force behavior to do a regular enable file deletion. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12001 Reviewed By: ajkr Differential Revision: D51214683 Pulled By: jowlyzhang fbshipit-source-id: ca7b1ebf15c09eed00f954da2f75c00d2c6a97e4	2023-11-10 14:35:54 -08:00
Yueh-Hsuan Chiang	5ef92b8ea4	Add rocksdb_options_set_cf_paths (#11151 ) Summary: This PR adds a missing set function for rocksdb_options in the C-API: rocksdb_options_set_cf_paths(). Without this function, users cannot specify different paths for different column families as it will fall back to db_paths. As a bonus, this PR also includes rocksdb_sst_file_metadata_get_directory() to the C api -- a missing public function that will also make the test easier to write. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11151 Test Plan: Augment existing c_test to verify the specified cf_path. Reviewed By: hx235 Differential Revision: D51201888 Pulled By: ajkr fbshipit-source-id: 62a96451f26fab60ada2005ede3eea8e9b431f30	2023-11-10 11:36:11 -08:00
Yueh-Hsuan Chiang	73d223c4e2	Add auto_tuned option to RateLimiter C API (#12058 ) Summary: #### Problem While the RocksDB C API does have the RateLimiter API, it does not expose the auto_tuned option. #### Summary of Change This PR exposes auto_tuned RateLimiter option in RocksDB C API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12058 Test Plan: Augment the C API existing test to cover the new API. Reviewed By: cbi42 Differential Revision: D51201933 Pulled By: ajkr fbshipit-source-id: 5bc595a9cf9f88f50fee797b729ba96f09ed8266	2023-11-10 09:53:09 -08:00
Yu Zhang	dfaf4dc111	Stubs for piping write time (#12043 ) Summary: As titled. This PR contains the API and stubbed implementation for piping write time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12043 Reviewed By: pdillinger Differential Revision: D51076575 Pulled By: jowlyzhang fbshipit-source-id: 3b341263498351b9ccaff27cf35d5aeb5bdf0cf1	2023-11-09 15:58:07 -08:00
brodyhuang	e90e9825b4	Drop wal record when sequence is illegal (#11985 ) Summary: - Our database is corrupted, causing some sequences of wal record to be invalid (but the `record_checksum` looks fine). - When we RecoverLogFiles in WALRecoveryMode::kPointInTimeRecovery, `assert(seq <= kMaxSequenceNumber)` will be failed. - When it is found that sequence is illegal, can we drop the file to recover as much data as possible ? Thx ! Pull Request resolved: https://github.com/facebook/rocksdb/pull/11985 Reviewed By: anand1976 Differential Revision: D50698039 Pulled By: ajkr fbshipit-source-id: 1e42113b58823088d7c0c3a92af5b3efbb5f5296	2023-11-09 10:43:16 -08:00
Hui Xiao	f337533b6f	Ensure and clarify how RocksDB calls TablePropertiesCollector's functions (#12053 ) Summary: Context/Summary: It's intuitive for users to assume `TablePropertiesCollector::Finish()` is called only once by RocksDB internal by the word "finish". However, this is currently not true as RocksDB also calls this function in `BlockBased/PlainTableBuilder::GetTableProperties()` to populate user collected properties on demand. This PR avoids that by moving that populating to where we first call `Finish()` (i.e, `NotifyCollectTableCollectorsOnFinish`) Bonus: clarified in the API that `GetReadableProperties()` will be called after `Finish()` and added UT to ensure that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12053 Test Plan: - Modified test `DBPropertiesTest.GetUserDefinedTableProperties` to ensure `Finish()` only called once. - Existing test particularly `db_properties_test, table_properties_collector_test` verify the functionality `NotifyCollectTableCollectorsOnFinish` and `GetReadableProperties()` are not broken by this change. Reviewed By: ajkr Differential Revision: D51095434 Pulled By: hx235 fbshipit-source-id: 1c6275258f9b99dedad313ee8427119126817973	2023-11-08 14:00:36 -08:00
Zaidoon Abd Al Hadi	58f2a29fb4	Expose Options::periodic_compaction_seconds through C API (#12019 ) Summary: fixes [11090](https://github.com/facebook/rocksdb/issues/11090) Pull Request resolved: https://github.com/facebook/rocksdb/pull/12019 Reviewed By: jaykorean Differential Revision: D51076427 Pulled By: cbi42 fbshipit-source-id: de353ff66c7f73aba70ab3379e20d8c40f50d873	2023-11-07 12:46:50 -08:00
Jay Huh	2adef5367a	AttributeGroups - PutEntity Implementation (#11977 ) Summary: Write Path for AttributeGroup Support. The new `PutEntity()` API uses `WriteBatch` and atomically writes WideColumns entities in multiple Column Families. Combined the release note from PR https://github.com/facebook/rocksdb/issues/11925 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11977 Test Plan: - `DBWideBasicTest::MultiCFMultiGetEntityAsPinnableAttributeGroups` updated - `WriteBatchTest::AttributeGroupTest` added - `WriteBatchTest::AttributeGroupSavePointTest` added Reviewed By: ltamasi Differential Revision: D50457122 Pulled By: jaykorean fbshipit-source-id: 4997b265e415588ce077933082dcd1ac3eeae2cd	2023-11-06 16:52:51 -08:00
Jay Huh	0ecfc4fbb4	AttributeGroups - GetEntity Implementation (#11943 ) Summary: Implementation of `GetEntity()` API that returns wide-column entities as AttributeGroups from multiple column families for a single key. Regarding the definition of Attribute groups, please see the detailed example description in PR https://github.com/facebook/rocksdb/issues/11925 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11943 Test Plan: - `DBWideBasicTest::GetEntityAsPinnableAttributeGroups` added will enable the new API in the `db_stress` after merging Reviewed By: ltamasi Differential Revision: D50195794 Pulled By: jaykorean fbshipit-source-id: 218d54841ac7e337de62e13b1233b0a99bd91af3	2023-11-06 15:04:41 -08:00
Jay Huh	2dab137182	Mark more files for periodic compaction during offpeak (#12031 ) Summary: - The struct previously named `OffpeakTimeInfo` has been renamed to `OffpeakTimeOption` to indicate that it's a user-configurable option. Additionally, a new struct, `OffpeakTimeInfo`, has been introduced, which includes two fields: `is_now_offpeak` and `seconds_till_next_offpeak_start`. This change prevents the need to parse the `daily_offpeak_time_utc` string twice. - It's worth noting that we may consider adding more fields to the `OffpeakTimeInfo` struct, such as `elapsed_seconds` and `total_seconds`, as needed for further optimization. - Within `VersionStorageInfo::ComputeFilesMarkedForPeriodicCompaction()`, we've adjusted the `allowed_time_limit` to include files that are expected to expire by the next offpeak start. - We might explore further optimizations, such as evenly distributing files to mark during offpeak hours, if the initial approach results in marking too many files simultaneously during the first scoring in offpeak hours. The primary objective of this PR is to prevent periodic compactions during non-offpeak hours when offpeak hours are configured. We'll start with this straightforward solution and assess whether it suffices for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12031 Test Plan: Unit Tests added - `DBCompactionTest::LevelPeriodicCompactionOffpeak` for Leveled - `DBTestUniversalCompaction2::PeriodicCompaction` for Universal Reviewed By: cbi42 Differential Revision: D50900292 Pulled By: jaykorean fbshipit-source-id: 267e7d3332d45a5d9881796786c8650fa0a3b43d	2023-11-06 11:43:59 -08:00
Changyu Bi	520c64fd2e	Add missing status check in ExternalSstFileIngestionJob and ImportColumnFamilyJob (#12042 ) Summary: .. and update some unit tests that failed with this change. See comment in ExternalSSTFileBasicTest.IngestFileWithCorruptedDataBlock for more explanation. The missing status check is not caught by `ASSERT_STATUS_CHECKED=1` due to this line: `8505b26db1/table/block_based/block.h (L394)`. Will explore if we can remove it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12042 Test Plan: existing unit tests. Reviewed By: ajkr Differential Revision: D50994769 Pulled By: cbi42 fbshipit-source-id: c91615bccd6094a91634c50b98401d456cbb927b	2023-11-06 07:41:36 -08:00
914022466	2648e0a747	Fix a bug when ingest plaintable sst file (#11969 ) Summary: Plaintable doesn't support SeekToLast. And GetIngestedFileInfo is using SeekToLast without checking the validity. We are using IngestExternalFile or CreateColumnFamilyWithImport with some sst file in PlainTable format . But after running for a while, compaction error often happens. Such as ![image](https://github.com/facebook/rocksdb/assets/13954644/b4fa49fc-73fc-49ce-96c6-f198a30800b8) I simply add some std::cerr log to find why. ![image](https://github.com/facebook/rocksdb/assets/13954644/2cf1d5ff-48cc-4125-b917-87090f764fcd) It shows that the smallest key is always equal to largest key. ![image](https://github.com/facebook/rocksdb/assets/13954644/6d43e978-0be0-4306-aae3-f9e4ae366395) Then I found the root cause is that PlainTable do not support SeekToLast, so the smallest key is always the same with the largest I try to write an unit test. But it's not easy to reproduce this error. (This PR is similar to https://github.com/facebook/rocksdb/pull/11266. Sorry for open another PR) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11969 Reviewed By: ajkr Differential Revision: D50933854 Pulled By: cbi42 fbshipit-source-id: 6c6af53c1388922cbabbe64ed3be1cdc58df5431	2023-11-02 13:45:37 -07:00
Yu Zhang	4b013dcbed	Remove VersionEdit's friends pattern (#12024 ) Summary: Almost each of VersionEdit private member has its own getter and setter. Current code access them with a combination of directly accessing private members and via getter and setters. There is no obvious benefits to have this pattern except potential performance gains. I tried this simple benchmark for removing the friends pattern completely, and there is no obvious regression. So I think it would good to remove VersionEdit's friends completely. ```TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num_column_families=10 -num=50000000``` With change: fillseq : 2.994 micros/op 333980 ops/sec 149.710 seconds 50000000 operations; 36.9 MB/s fillseq : 3.033 micros/op 329656 ops/sec 151.673 seconds 50000000 operations; 36.5 MB/s fillseq : 2.991 micros/op 334369 ops/sec 149.535 seconds 50000000 operations; 37.0 MB/s Without change: fillseq : 3.015 micros/op 331715 ops/sec 150.732 seconds 50000000 operations; 36.7 MB/s fillseq : 3.044 micros/op 328553 ops/sec 152.182 seconds 50000000 operations; 36.3 MB/s fillseq : 3.091 micros/op 323520 ops/sec 154.550 seconds 50000000 operations; 35.8 MB/s Pull Request resolved: https://github.com/facebook/rocksdb/pull/12024 Reviewed By: pdillinger Differential Revision: D50806066 Pulled By: jowlyzhang fbshipit-source-id: 35d287ce638a38c30f243f85992e615b4c90eb27	2023-11-01 12:04:11 -07:00
Jay Huh	04225a2cfa	Fix for RecoverFromRetryableBGIOError starting with recovery_in_prog_ false (#11991 ) Summary: cbi42 helped investigation and found a potential scenario where `RecoverFromRetryableBGIOError()` may start with `recovery_in_prog_ ` set as false. (and other booleans like `bg_error_` and `soft_error_no_bg_work_`) Thread 1 - `StartRecoverFromRetryableBGIOError()`): (mutex held) sets `recovery_in_prog_ = true` Thread 1's `recovery_thread_` - (waits for mutex and acquires it) - `RecoverFromRetryableBGIOError()` -> `ResumeImpl()` -> `ClearBGError()`: sets `recovery_in_prog_ = false` - `ClearBGError()` -> `NotifyOnErrorRecoveryEnd()`: releases `mutex` Thread 2 - `StartRecoverFromRetryableBGIOError()`): (mutex held) sets `recovery_in_prog_ = true` - Waits for Thread 1 (`recovery_thread_`) to finish Thread 1's `recovery_thread_` - re-lock mutex in `NotifyOnErrorRecoveryEnd()` - Still inside `RecoverFromRetryableBGIOError()`: sets `recovery_in_prog_ = false` - Done Thread 2's `recovery_thread_` - recovery thread started with `recovery_in_prog_` set as `false` # Fix - Remove double-clearing `bg_error_`, `recovery_in_prog_` and other fields after `ResumeImpl()` already returned `OK()`. - Minor typo and linter fixes in `DBErrorHandlingFSTest` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11991 Test Plan: - `DBErrorHandlingFSTest::MultipleRecoveryThreads` added to reproduce the scenario. - Adding `assert(recovery_in_prog_);` at the start of `ErrorHandler::RecoverFromRetryableBGIOError()` fails the test without the fix and succeeds with the fix as expected. Reviewed By: cbi42 Differential Revision: D50506113 Pulled By: jaykorean fbshipit-source-id: 6dabe01e9ecd3fc50bbe9019587f2f4858bed9c6	2023-10-31 16:13:36 -07:00
Yu Zhang	60df39e530	Rate limiting stale sst files' deletion during recovery (#12016 ) Summary: As titled. If SstFileManager is available, deleting stale sst files will be delegated to it so it can be rate limited. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12016 Reviewed By: hx235 Differential Revision: D50670482 Pulled By: jowlyzhang fbshipit-source-id: bde5b76ea1d98e67f6b4f08bfba3db48e46aab4e	2023-10-28 09:50:52 -07:00
Jay Huh	e230e4d248	Make OffpeakTimeInfo available in VersionSet (#12018 ) Summary: As mentioned in https://github.com/facebook/rocksdb/issues/11893, we are going to use the offpeak time information to pre-process TTL-based compactions. To do so, we need to access `daily_offpeak_time_utc` in `VersionStorageInfo::ComputeCompactionScore()` where we pick the files to compact. This PR is to make the offpeak time information available at the time of compaction-scoring. We are not changing any compaction scoring logic just yet. Will follow up in a separate PR. There were two ways to achieve what we want. 1. Make `MutableDBOptions` available in `ColumnFamilyData` and `ComputeCompactionScore()` take `MutableDBOptions` along with `ImmutableOptions` and `MutableCFOptions`. 2. Make `daily_offpeak_time_utc` and `IsNowOffpeak()` available in `VersionStorageInfo`. We chose the latter as it involves smaller changes. This change includes the following - Introduction of `OffpeakTimeInfo` and `IsNowOffpeak()` has been moved from `MutableDBOptions` - `OffpeakTimeInfo` added to `VersionSet` and it can be set during construction and by `ChangeOffpeakTimeInfo()` - During `SetDBOptions()`, if offpeak time info needs to change, it calls `MaybeScheduleFlushOrCompaction()` to re-compute compaction scores and process compactions as needed Pull Request resolved: https://github.com/facebook/rocksdb/pull/12018 Test Plan: - `DBOptionsTest::OffpeakTimes` changed to include checks for `MaybeScheduleFlushOrCompaction()` calls and `VersionSet`'s OffpeakTimeInfo value change during `SetDBOptions()`. - `VersionSetTest::OffpeakTimeInfoTest` added to test `ChangeOffpeakTimeInfo()`. `IsNowOffpeak()` tests moved from `DBOptionsTest::OffpeakTimes` Reviewed By: pdillinger Differential Revision: D50723881 Pulled By: jaykorean fbshipit-source-id: 3cff0291936f3729c0e9c7750834b9378fb435f6	2023-10-27 15:56:48 -07:00
Hui Xiao	0f141352d8	Fix race between flush error recovery and db destruction (#12002 ) Summary: Context: DB destruction will wait for ongoing error recovery through `EndAutoRecovery()` and join the recovery thread: `519f2a41fb/db/db_impl/db_impl.cc (L525)` -> `519f2a41fb/db/error_handler.cc (L250)` -> `519f2a41fb/db/error_handler.cc (L808-L823)` However, due to a race between flush error recovery and db destruction, recovery can actually start after such wait during the db shutdown. The consequence is that the recovery thread created as part of this recovery will not be properly joined upon its destruction as part the db destruction. It then crashes the program as below. ``` std::terminate() std::default_delete<std::thread>::operator()(std::thread) const std::unique_ptr<std::thread, std::default_delete<std::thread>>::~unique_ptr() rocksdb::ErrorHandler::~ErrorHandler() (rocksdb/db/error_handler.h:31) rocksdb::DBImpl::~DBImpl() (rocksdb/db/db_impl/db_impl.cc:725) rocksdb::DBImpl::~DBImpl() (rocksdb/db/db_impl/db_impl.cc:725) rocksdb::DBTestBase::Close() (rocksdb/db/db_test_util.cc:678) ``` Summary:* This PR fixed it by considering whether EndAutoRecovery() has been called before creating such thread. This fix is similar to how we currently [handle](`519f2a41fb/db/error_handler.cc (L688-L694)`) such case inside the created recovery thread. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12002 Test Plan: A new UT repro-ed the crash before this fix and and pass after. Reviewed By: ajkr Differential Revision: D50586191 Pulled By: hx235 fbshipit-source-id: b372f6d7a94eadee4b9283b826cc5fb81779a093	2023-10-25 11:59:09 -07:00
qiuchengxuan	f2c9075d16	Fix dead loop with kSkipAnyCorruptedRecords mode selected in some cases (#11955 ) (#11979 ) Summary: With fragmented record span across multiple blocks, if any following blocks corrupted with arbitary data, and intepreted log number less than the current log number, program will fall into infinite loop due to not skipping buffer leading bytes Pull Request resolved: https://github.com/facebook/rocksdb/pull/11979 Test Plan: existing unit tests Reviewed By: ajkr Differential Revision: D50604408 Pulled By: jowlyzhang fbshipit-source-id: e50a0c7e7c3d293fb9d5afec0a3eb4a1835b7a3b	2023-10-25 09:16:24 -07:00
Myth	0ff7665c95	Fix low priority write may cause crash when it is rate limited (#11932 ) Summary: Fixed https://github.com/facebook/rocksdb/issues/11902 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11932 Reviewed By: akankshamahajan15 Differential Revision: D50573356 Pulled By: hx235 fbshipit-source-id: adeb1abdc43b523b0357746055ce4a2eabde56a1	2023-10-24 14:41:46 -07:00
Peter Dillinger	4155087746	Use manifest to persist pre-allocated seqnos (#11995 ) Summary: ... and other fixes for crash test after https://github.com/facebook/rocksdb/issues/11922. * When pre-allocating sequence numbers for establishing a time history, record that last sequence number in the manifest so that it is (most likely) restored on recovery even if no user writes were made or were recovered (e.g. no WAL). * When pre-allocating sequence numbers for establishing a time history, only do this for actually new DBs. * Remove the feature that ensures non-zero sequence number on creating the first column family with preserve/preclude option after initial DB::Open. Until fixed in a way compatible with the crash test, this creates a gap where some data written with active preserve/preclude option won't have a known associated time. Together, these ensure we don't upset the crash test by manipulating sequence numbers after initial DB creation (esp when re-opening with different options). (The crash test expects that the seqno after re-open corresponds to a known point in time from previous crash test operation, matching an expected DB state.) Follow-up work: * Re-fill the gap to ensure all data written under preserve/preclude settings have a known time estimate. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11995 Test Plan: Added to unit test SeqnoTimeTablePropTest.PrePopulateInDB Verified fixes two crash test scenarios: ## 1st reproducer First apply ``` diff --git a/db_stress_tool/expected_state.cc b/db_stress_tool/expected_state.cc index b483e154c..ef63b8d6c 100644 --- a/db_stress_tool/expected_state.cc +++ b/db_stress_tool/expected_state.cc @@ -333,6 +333,7 @@ Status FileExpectedStateManager::SaveAtAndAfter(DB* db) { s = NewFileTraceWriter(Env::Default(), soptions, trace_file_path, &trace_writer); } + if (getenv("CRASH")) assert(false); if (s.ok()) { TraceOptions trace_opts; trace_opts.filter \|= kTraceFilterGet; ``` Then ``` mkdir -p /dev/shm/rocksdb_test/rocksdb_crashtest_expected mkdir -p /dev/shm/rocksdb_test/rocksdb_crashtest_whitebox rm -rf /dev/shm/rocksdb_test/rocksdb_crashtest_/ CRASH=1 ./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=1 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --preserve_internal_time_seconds=36000 ./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=0 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --preserve_internal_time_seconds=0 ``` Without the fix you get ``` ... DB path: [/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox] (Re-)verified 34 unique IDs Error restoring historical expected values: Corruption: DB is older than any restorable expected state ``` ## 2nd reproducer First apply ``` diff --git a/db_stress_tool/db_stress_test_base.cc b/db_stress_tool/db_stress_test_base.cc index 62ddead7b..f2654980f 100644 --- a/db_stress_tool/db_stress_test_base.cc +++ b/db_stress_tool/db_stress_test_base.cc @@ -1126,6 +1126,7 @@ void StressTest::OperateDb(ThreadState* thread) { // OPERATION write TestPut(thread, write_opts, read_opts, rand_column_families, rand_keys, value); + if (getenv("CRASH")) assert(false); } else if (prob_op < del_bound) { assert(write_bound <= prob_op); // OPERATION delete ``` Then ``` rm -rf /dev/shm/rocksdb_test/rocksdb_crashtest_/ CRASH=1 ./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=1 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --disable_wal=1 --reopen=0 --preserve_internal_time_seconds=0 ./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=0 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --disable_wal=1 --reopen=0 --preserve_internal_time_seconds=3600 ``` Without the fix you get ``` DB path: [/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox] (Re-)verified 34 unique IDs db_stress: db_stress_tool/expected_state.cc:380: virtual rocksdb::{anonymous}::ExpectedStateTraceRecordHandler::~ ExpectedStateTraceRecordHandler(): Assertion `IsDone()' failed. ``` Reviewed By: jowlyzhang Differential Revision: D50533346 Pulled By: pdillinger fbshipit-source-id: 1056be45c5b9e537c8c601b28c4b27431a782477	2023-10-23 09:20:59 -07:00
Hui Xiao	0836a2b26d	New tickers on deletion compactions grouped by reasons (#11957 ) Summary: Context/Summary: as titled Pull Request resolved: https://github.com/facebook/rocksdb/pull/11957 Test Plan: piggyback on existing tests; fixed a failed test due to adding new stats Reviewed By: ajkr, cbi42 Differential Revision: D50294310 Pulled By: hx235 fbshipit-source-id: d99b97ebac41efc1bdeaf9ca7a1debd2927d54cd	2023-10-18 18:00:07 -07:00
Changyu Bi	d5bc30befa	Enforce status checking after Valid() returns false for IteratorWrapper (#11975 ) Summary: ... when compiled with ASSERT_STATUS_CHECKED = 1. The main change is in iterator_wrapper.h. The remaining changes are just fixing existing unit tests. Adding this check to IteratorWrapper gives a good coverage as the class is used in many places, including child iterators under merging iterator, merging iterator under DB iter, file_iter under level iterator, etc. This change can catch the bug fixed in https://github.com/facebook/rocksdb/issues/11782. Future follow up: enable `ASSERT_STATUS_CHECKED=1` for stress test and for DEBUG_LEVEL=0. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11975 Test Plan: * `ASSERT_STATUS_CHECKED=1 DEBUG_LEVEL=2 make -j32 J=32 check` * I tried to run stress test with `ASSERT_STATUS_CHECKED=1`, but there are a lot of existing stress code that ignore status checking, and fail without the change in this PR. So defer that to a follow up task. Reviewed By: ajkr Differential Revision: D50383790 Pulled By: cbi42 fbshipit-source-id: 1a28ce0f5fdf1890f93400b26b3b1b3a287624ce	2023-10-18 09:38:38 -07:00
Yu Zhang	933ee295f4	Fix a race condition between recovery and backup (#11955 ) Summary: A race condition between recovery and backup can happen with error messages like this: ```Failure in BackupEngine::CreateNewBackup with: IO error: No such file or directory: While opening a file for sequentially reading: /dev/shm/rocksdb_test/rocksdb_crashtest_whitebox/002653.log: No such file or directory``` PR https://github.com/facebook/rocksdb/issues/6949 introduced disabling file deletion during error handling of manifest IO errors. Aformentioned race condition is caused by this chain of event: [Backup engine] disable file deletion [Recovery] disable file deletion <= this is optional for the race condition, it may or may not get called [Backup engine] get list of file to copy/link [Recovery] force enable file deletion .... some files refered by backup engine get deleted [Backup engine] copy/link file <= error no file found This PR fixes this with: 1) Recovery thread is currently forcing enabling file deletion as long as file deletion is disabled. Regardless of whether the previous error handling is for manifest IO error and that disabled it in the first place. This means it could incorrectly enabling file deletions intended by other threads like backup threads, file snapshotting threads. This PR does this check explicitly before making the call. 2) `disable_delete_obsolete_files_` is designed as a counter to allow different threads to enable and disable file deletion separately. The recovery thread currently does a force enable file deletion, because `ErrorHandler::SetBGError()` can be called multiple times by different threads when they receive a manifest IO error(details per PR https://github.com/facebook/rocksdb/issues/6949), resulting in `DBImpl::DisableFileDeletions` to be called multiple times too. Making a force enable file deletion call that resets the counter `disable_delete_obsolete_files_` to zero is a workaround for this. However, as it shows in the race condition, it can incorrectly suppress other threads like a backup thread's intention to keep the file deletion disabled. <strike>This PR adds a `std::atomic<int> disable_file_deletion_count_` to the error handler to track the needed counter decrease more precisely</strike>. This PR tracks and caps file deletion enabling/disabling in error handler. 3) for recovery, the section to find obsolete files and purge them was moved to be done after the attempt to enable file deletion. The actual finding and purging is more likely to happen if file deletion was previously disabled and get re-enabled now. An internal function `DBImpl::EnableFileDeletionsWithLock` was added to support change 2) and 3). Some useful logging was explicitly added to keep those log messages around. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11955 Test Plan: existing unit tests Reviewed By: anand1976 Differential Revision: D50290592 Pulled By: jowlyzhang fbshipit-source-id: 73aa8331ca4d636955a5b0324b1e104a26e00c9b	2023-10-17 13:18:04 -07:00
Peter Dillinger	2fd850c7eb	Remove write queue synchronization from WriteOptionsFile (#11951 ) Summary: This has become obsolete with the new `options_mutex_` in https://github.com/facebook/rocksdb/pull/11929 * Remove now-unnecessary parameter from WriteOptionsFile * Rename (and negate) other parameter for better clarity (the caller shouldn't tell the callee what the callee needs, just what the caller knows, provides, and requests) * Move a ROCKS_LOG_WARN (I/O) in WriteOptionsFile to outside of holding DB mutex. * Also avoid (but not always eliminate) write queue synchronization in SetDBOptions. Still needed if there was a change to WAL size limit or other configuration. * Improve some comments Pull Request resolved: https://github.com/facebook/rocksdb/pull/11951 Test Plan: existing unit tests and TSAN crash test local run Reviewed By: ajkr Differential Revision: D50247904 Pulled By: pdillinger fbshipit-source-id: 7dfe445c705ec013886a2adb7c50abe50d83af69	2023-10-16 08:58:47 -07:00
Changyu Bi	f3aef8cad7	Add write operation to tracer only after successful callback (#11954 ) Summary: We saw optimistic transaction stress test failures like the following: ``` Verification failed for column family 0 key 000000000001E9AF000000000000012B00000000000000B5 (12535491): value_from_db: 010000000504070609080B0A0D0C0F0E111013121514171619181B1A1D1C1F1E212023222524272629282B2A2D2C2F2E313033323534373639383B3A3D3C3F3E, value_from_expected: , msg: Iterator verification: Unexpected value found``` ``` With ajkr's repro (see test plan), I found that we record duplicated writes to tracer when an optimistic transaction conflict checking fails. This PR fixes it by checking callback status before record a write operation to tracer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11954 Test Plan: this reproduces the failure consistently ``` #!/bin/bash db=/dev/shm/rocksdb_crashtest_blackbox exp=/dev/shm/rocksdb_crashtest_expected rm -rf $db $exp && mkdir -p $exp && while ./db_stress \ --atomic_flush=1 \ --clear_column_family_one_in=0 \ --db=$db \ --db_write_buffer_size=2097152 \ --delpercent=0 \ --delrangepercent=0 \ --destroy_db_initially=0 \ --disable_wal=1 \ --expected_values_dir=$exp \ --iterpercent=0 \ --max_bytes_for_level_base=2097152 \ --max_key=250000 \ --memtable_prefix_bloom_size_ratio=0.5 \ --memtable_whole_key_filtering=1 \ --occ_lock_bucket_count=100 \ --occ_validation_policy=0 \ --ops_per_thread=10 \ --prefixpercent=0 \ --readpercent=0 \ --reopen=0 \ --target_file_size_base=524288 \ --test_batches_snapshots=0 \ --use_optimistic_txn=1 \ --use_txn=1 \ --value_size_mult=32 \ --write_buffer_size=524288 \ --writepercent=100 ; do : ; done ``` Reviewed By: akankshamahajan15 Differential Revision: D50284976 Pulled By: cbi42 fbshipit-source-id: 793e3cee186c8b4f406b29166efd8d9028695206	2023-10-14 12:00:31 -07:00
Jay Huh	c9d8e6a5bf	AttributeGroups - MultiGetEntity Implementation (#11925 ) Summary: Introducing the notion of AttributeGroup by adding the `MultiGetEntity()` API retrieving `PinnableAttributeGroups`. An "attribute group" refers to a logical grouping of wide-column entities within RocksDB. These attribute groups are implemented using column families. Users can store WideColumns in different CFs for various reasons (e.g. similar access patterns, same types, etc.). This new API `MultiGetEntity()` takes keys and `PinnableAttributeGroups` per key. `PinnableAttributeGroups` is just a list of `PinnableAttributeGroup`s in which we have `ColumnFamilyHandle*`, `Status`, and `PinnableWideColumns`. Let's say a user stored "hot" wide columns in column family "hot_data_cf" and "cold" wide columns in column family "cold_data_cf" and all other columns in "common_cf". Prior to this PR, if the user wants to query for two keys, "key_1" and "key_2" and but only interested in "common_cf" and "hot_data_cf" for "key_1", and "common_cf" and "cold_data_cf" for "key_2", the user would have to construct input like `keys = ["key_1", "key_1", "key_2", "key_2"]`, `column_families = ["common_cf", "hot_data_cf", "common_cf", "cold_data_cf"]` and get the flat list of `PinnableWideColumns` to find the corresponding <key,CF> combo. With the new `MultiGetEntity()` introduced in this PR, users can now query only `["common_cf", "hot_data_cf"]` for `"key_1"`, and only `["common_cf", "cold_data_cf"]` for `"key_2"`. The user will get `PinnableAttributeGroups` for each key, and `PinnableAttributeGroups` gives a list of `PinnableAttributeGroup`s where the user can find column family and corresponding `PinnableWideColumns` and the `Status`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11925 Test Plan: - `DBWideBasicTest::MultiCFMultiGetEntityAsPinnableAttributeGroups` added will enable this new API in the `db_stress` in a separate PR Reviewed By: ltamasi Differential Revision: D50017414 Pulled By: jaykorean fbshipit-source-id: 643611d1273c574bc81b94c6f5aeea24b40c4586	2023-10-13 15:58:03 -07:00
Changyu Bi	6e3429b8a6	Fix data race in accessing `recovery_in_prog_` (#11950 ) Summary: We saw the following TSAN stress test failure: ``` WARNING: ThreadSanitizer: data race (pid=17523) Write of size 1 at 0x7b8c000008b9 by thread T4243 (mutexes: write M0): #0 rocksdb::ErrorHandler::RecoverFromRetryableBGIOError() fbcode/internal_repo_rocksdb/repo/db/error_handler.cc:742 (db_stress+0x95f954) (BuildId: 35795dfb86ddc9c4f20ddf08a491f24d) https://github.com/facebook/rocksdb/issues/1 std:🧵:_State_impl<std:🧵:_Invoker<std::tuple<void (rocksdb::ErrorHandler::)(), rocksdb::ErrorHandler>>>::_M_run() fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:74 (db_stress+0x95fc2b) (BuildId: 35795dfb86ddc9c4f20ddf08a491f24d) https://github.com/facebook/rocksdb/issues/2 execute_native_thread_routine /home/engshare/third-party2/libgcc/11.x/src/gcc-11.x/x86_64-facebook-linux/libstdc++-v3/src/c++11/../../../.././libstdc++-v3/src/c++11/thread.cc:82:18 (libstdc++.so.6+0xdf4e4) (BuildId: 452d1cdae868baeeb2fdf1ab140f1c219bf50c6e) Previous read of size 1 at 0x7b8c000008b9 by thread T22: #0 rocksdb::DBImpl::SyncClosedLogs(rocksdb::JobContext, rocksdb::VersionEdit) fbcode/internal_repo_rocksdb/repo/db/error_handler.h:76 (db_stress+0x84f69c) (BuildId: 35795dfb86ddc9c4f20ddf08a491f24d) ``` This is due to a data race in accessing `recovery_in_prog_`. This PR fixes it by accessing `recovery_in_prog_` under db mutex before calling `SyncClosedLogs()`. I think the original PR https://github.com/facebook/rocksdb/pull/10489 intended to clear the error if it's a recovery flush. So ideally we can also just check flush reason. I plan to keep a safer change in this PR and make that change in the future if needed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11950 Test Plan: check future TSAN stress test results. Reviewed By: anand1976 Differential Revision: D50242255 Pulled By: cbi42 fbshipit-source-id: 0d487948ef9546b038a34460f3bb037f6e5bfc58	2023-10-12 16:55:25 -07:00
Changyu Bi	648fe25bc0	Always clear files marked for compaction in `ComputeCompactionScore()` (#11946 ) Summary: We were seeing the following stress test failures: ```LevelCompactionBuilder::PickFileToCompact(const rocksdb::autovector<std::pair<int, rocksdb::FileMetaData*> >&, bool): Assertion `!level_file.second->being_compacted' failed``` This can happen when we are picking a file to be compacted from some files marked for compaction, but that file is already being_compacted. We prevent this by always calling `ComputeCompactionScore()` after we pick a compaction and mark some files as being_compacted. However, if SetOptions() is called to disable marking certain files to be compacted, say `enable_blob_garbage_collection`, we currently just skip the relevant logic in `ComputeCompactionScore()` without clearing the existing files already marked for compaction. This PR fixes this issue by already clearing these files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11946 Test Plan: existing tests. Reviewed By: akankshamahajan15 Differential Revision: D50232608 Pulled By: cbi42 fbshipit-source-id: 11e4fb5e9d48b0f946ad33b18f7c005f0161f496	2023-10-12 15:26:10 -07:00
Peter Dillinger	d010b02e86	Fix race in options taking effect (#11929 ) Summary: In follow-up to https://github.com/facebook/rocksdb/issues/11922, fix a race in functions like CreateColumnFamily and SetDBOptions where the DB reports one option setting but a different one is left in effect. To fix, we can add an extra mutex around these rare operations. We don't want to hold the DB mutex during I/O or other slow things because of the many purposes it serves, but a mutex more limited to these cases should be fine. I believe this would fix a write-write race in https://github.com/facebook/rocksdb/issues/10079 but not the read-write race. Intended follow-up to this: * Should be able to remove write thread synchronization from DBImpl::WriteOptionsFile Pull Request resolved: https://github.com/facebook/rocksdb/pull/11929 Test Plan: Added two mini-stress style regression tests that fail with >1% probability before this change: DBOptionsTest::SetStatsDumpPeriodSecRace ColumnFamilyTest::CreateAndDropPeriodicRace I haven't reproduced such an inconsistency between in-memory options and on disk latest options, but this change at least improves safety and adds a test anyway: DBOptionsTest::SetStatsDumpPeriodSecRace Reviewed By: ajkr Differential Revision: D50024506 Pulled By: pdillinger fbshipit-source-id: 1e99a9ed4d96fdcf3ac5061ec6b3cee78aecdda4	2023-10-12 10:05:23 -07:00
Andrew Kryczka	4bd5aa4f55	Fix two `ErrorHandler` race conditions (#11939 ) Summary: 1. Prevent a double join on a `port::Thread` 2. Ensure `recovery_in_prog_` and `bg_error_` are both set under same lock hold. This is useful for writers who see a non-OK `bg_error_` and are deciding whether to stall based on whether the error will be auto-recovered. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11939 Reviewed By: cbi42 Differential Revision: D50155484 Pulled By: ajkr fbshipit-source-id: fbc1f85c50e7eaee27ee0e376aee688d8a06c93b	2023-10-11 09:42:48 -07:00
Andrew Kryczka	77d160ef47	Consolidate `ErrorHandler`'s recovery status variables (#11937 ) Summary: cbi42 pointed out a race condition in which `recovery_io_error_` and `recovery_error_` could be updated inconsistently due to releasing the DB mutex in `EventHelpers::NotifyOnBackgroundError()`. There doesn't seem to be a point to having two status objects, so this PR consolidates them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11937 Reviewed By: cbi42 Differential Revision: D50105793 Pulled By: ajkr fbshipit-source-id: 3de95baccfa44351a49a5c2aa0986c9bc81baa8f	2023-10-10 06:31:45 -07:00
Andrew Kryczka	8a9cfd5292	Make stopped writes block on recovery (#11879 ) Summary: Relaxed the constraints for blocking when writes are stopped. When a recovery is already being attempted, we might as well let `!no_slowdown` writes wait on it in case it succeeds. This makes the user-visible behavior consistent across recovery flush and non-recovery flush. This enables `db_stress` to inject retryable (soft) flush read errors without having to handle user write failures. I changed `db_stress` a bit to permit injected errors in much more foreground operations as more admin operations (like `GetLiveFiles()`) can fail on a retryable error during flush. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11879 Reviewed By: anand1976 Differential Revision: D49571196 Pulled By: ajkr fbshipit-source-id: 5d516d6faf20d2c6bfe0594ab4f2706bca6d69b0	2023-10-10 06:29:01 -07:00
darionyaphet	ee0829ba76	fix typo snapshto (#11817 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11817 Reviewed By: jaykorean Differential Revision: D50103497 Pulled By: ltamasi fbshipit-source-id: 77c5cf86ff7eb5021fc91b03225882536163af7b	2023-10-09 19:10:06 -07:00
Levi Tamasi	51d7e6a49e	Clean up WriteBatchWithIndexInternal a bit (#11930 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11930 The patch cleans up and refactors the logic in/around `WriteBatchWithIndexInternal` a bit as groundwork for further changes. Specifically, the class is turned back into a stateless collection of static helpers (which is the way it was before PR 6851). Note that there were two apparent reasons for introducing this instance state in PR 6851: a) encapsulating `MergeContext` and b) resolving objects like `Logger` and `Statistics` based on a variety of handles. However, neither reason seems justified at this point. Regarding a), the `MultiGetFromBatchAndDB` logic passes in its own `MergeContext` objects via a second set of methods that do not use the member `MergeContext`. As for b), `Logger` and friends are only needed for Merge, which is only supported if a column family handle is provided; in turn, the column family handle enables us to resolve all the necessary objects without the need for any other handles like `DB` or `DBOptions`. In addition to the above, the patch changes the type of `BaseDeltaIterator::merge_result_` to `std::string` from `PinnableSlice` (since no pinning is ever done) and makes some other small code quality improvements. Reviewed By: jaykorean Differential Revision: D50038302 fbshipit-source-id: 5f34abe2e808bdaea0f3a8033b5764ebd446b85d	2023-10-09 15:25:35 -07:00
Peter Dillinger	1d5bddbc58	Bootstrap, pre-populate seqno_to_time_mapping (#11922 ) Summary: This change has two primary goals (follow-up to https://github.com/facebook/rocksdb/issues/11917, https://github.com/facebook/rocksdb/issues/11920): * Ensure the DB seqno_to_time_mapping has entries that allow us to put a good time lower bound on any writes that happen after setting up preserve/preclude options (either in a new DB, new CF, SetOptions, etc.) and haven't yet aged out of that time window. This allows us to remove a bunch of work-arounds in tests. * For new DBs using preserve/preclude options, automatically reserve some sequence numbers and pre-map them to cover the time span back to the preserve/preclude cut-off time. In the future, this will allow us to import data from another DB by key, value, and write time by assigning an appropriate seqno in this DB for that write time. Note that the pre-population (historical mappings) does not happen if the original options at DB Open time do not have preserve/preclude, so it is recommended to create initial column families at that time with create_missing_column_families, to take advantage of this (future) feature. (Adding these historical mappings after DB Open would risk non-monotonic seqno_to_time_mapping, which is dubious if not dangerous.) Recommended follow-up: * Solve existing race conditions (not memory safety) where parallel operations like CreateColumnFamily or SetDBOptions could leave the wrong setting in effect. * Make SeqnoToTimeMapping more gracefully handle a possible case in which too many mappings are added for the time range of concern. It seems like there could be cases where data is massively excluded from the cold tier because of entries falling off the front of the mapping list (causing GetProximalSeqnoBeforeTime() to return 0). (More investigation needed.) No release note for the minor bug fix because this is still an experimental feature with limited usage. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11922 Test Plan: tests added / updated Reviewed By: jowlyzhang Differential Revision: D49956563 Pulled By: pdillinger fbshipit-source-id: 92beb918c3a298fae9ca8e509717b1067caa1519	2023-10-06 08:21:21 -07:00
Hui Xiao	8e949116f7	Fix comments about creation_time/oldest_ancester_time/oldest_key_time (#11921 ) Summary: Code reference for the comments change: `40b618f234/table/block_based/block_based_table_builder.cc`?fbclid=IwAR0JlfnG8wysclFP5wv0fSngFbi_j32BUCKbFayeGdr10tzDhyyk5QqpclA#L2093 `40b618f234/db/flush_job.cc`?fbclid=IwAR1ri6eTX3wyD_2fAEBRzFSwZItcbmDS8LaB11k1letDMQmB2L8nF6TfXDs#L945-L949 `40b618f234/db/compaction/compaction_job.cc (L1882-L1904)` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11921 Reviewed By: cbi42 Differential Revision: D49921304 Pulled By: hx235 fbshipit-source-id: 2ae17e43c0fd52044404d7b63fea254d2d1f3595	2023-10-04 14:42:35 -07:00
Peter Dillinger	141b872bd4	Improve efficiency of create_missing_column_families, light refactor (#11920 ) Summary: In preparing some seqno_to_time_mapping improvements, I found that some of the wrap-up work for creating column families was unnecessarily repeated in the case of DB::Open with create_missing_column_families. This change fixes that (`CreateColumnFamily()` -> `CreateColumnFamilyImpl()` in `DBImpl::Open()`), motivated by avoiding repeated calls to `RegisterRecordSeqnoTimeWorker()` but with the side benefit of avoiding repeated calls to `WriteOptionsFile()` for each CF. Also in this change: * Add a `Status::UpdateIfOk()` function for combining statuses in a common pattern * Rename `max_time_duration` -> `min_preserve_seconds` (include units as much as possible) * Improved comments in several places Pull Request resolved: https://github.com/facebook/rocksdb/pull/11920 Test Plan: tests added / updated Reviewed By: jaykorean Differential Revision: D49919147 Pulled By: pdillinger fbshipit-source-id: 3d0318c1d070c842c5331da0a5b415caedc104f1	2023-10-04 14:14:22 -07:00
akankshamahajan	97f6f475bc	Fix various failures in auto_readahead_size (#11884 ) Summary: 1. Error in TestIterateAgainstExpected API - `Assertion index < pre_read_expected_values.size() && index < post_read_expected_values.size() failed.` Fix - `Prev` op is not supported with `auto_readahead_size`. So added support to Reseek in db_iter, if Prev is called. In BlockBasedTableIterator, index_iter_ already moves forward. So there is no way to do Prev from BlockBasedTableIterator. 2. Error - `void rocksdb::BlockBasedTableIterator::BlockCacheLookupForReadAheadSize(uint64_t, size_t, size_t&): Assertion index_iter_->value().handle.offset() == offset` Fix - Remove prefetch_buffer to be used when uncompressed dict is read. 3. ** Error in TestPrefixScan API - `db_stress: db/db_iter.cc:369: bool rocksdb::DBIter::FindNextUserEntryInternal(bool, const rocksdb::Slice): Assertion !skipping_saved_key \|\| CompareKeyForSkip(ikey_.user_key, saved_key_.GetUserKey()) > 0 failed. Received signal 6 (Aborted) Invoking GDB for stack trace... db_stress: table/merging_iterator.cc:1036: bool rocksdb::MergingIterator::SkipNextDeleted(): Assertion comparator_->Compare(range_tombstone_iters_[i]->start_key(), pik) <= 0 failed` Fix* - SeekPrev also calls 1) SeekPrev , 2)Seek and then 3)Prev in some cases in db_iter.cc leading to failure of Prev operation. These backward operations also call Seek. Added direction to disable lookup once direction is backwards in BlockBasedTableIterator.cc Pull Request resolved: https://github.com/facebook/rocksdb/pull/11884 Test Plan: Ran various flavors of crash tests locally for the whole duration Reviewed By: anand1976 Differential Revision: D49834201 Pulled By: akankshamahajan15 fbshipit-source-id: 9a007b4d46a48002c43dc4623a400ecf47d997fe	2023-10-02 17:47:24 -07:00
Jay Huh	5fbea87859	Disallow start_time == end_time in offpeak time and compare at minute level to allow 24hr offpeak (#11911 ) Summary: Since allowing 24hr peak by setting start_time = end_time is not so intuitive, we are not going to allow it (e.g. `00:00-00:00` doesn't looks like a value that would cover 24hr.). Instead, we are going to compare at minute level (i.e. dropping the seconds to the nearest minute) so that `00:00-23:59` will cover 24hrs. The entire minute from 23:59:00 23:59:59 will be covered with this change. Minor fixes from previous PR - release build error - fixed random seed in test Pull Request resolved: https://github.com/facebook/rocksdb/pull/11911 Test Plan: `DBOptionsTest::OffPeakTimes` `make -j64 static_lib` to test release build issue that was fixed Reviewed By: pdillinger Differential Revision: D49787795 Pulled By: jaykorean fbshipit-source-id: e8d045b95f54f61d5dd5f1bb473579f8d55c18b3	2023-10-02 16:52:39 -07:00
Andrew Kryczka	10fd05e394	Give retry flushes their own functions (#11903 ) Summary: Recovery triggers flushes for very different scenarios: (1) `FlushReason::kErrorRecoveryRetryFlush`: a flush failed (2) `FlushReason::kErrorRecovery`: a WAL may be corrupted (3) `FlushReason::kCatchUpAfterErrorRecovery`: immutable memtables may have accumulated The old code called called `FlushAllColumnFamilies()` in all cases, which uses manual flush functions: `AtomicFlushMemTables()` and `FlushMemTable()`. Forcing flushing the latest data on all CFs was useful for (2) because it ensures all CFs move past the corrupted WAL. However, those code paths were overkill for (1) and (3), where only already-immutable memtables need to be flushed. There were conditionals to exclude some of the extraneous logic but I found there was still too much happening. For example, both of the manual flush functions enter the write thread. Entering the write thread is inconvenient because then we can't allow stalled writes to wait on a retrying flush to finish. Instead of continuing down the path of adding more conditionals to the manual flush functions, this PR introduces a dedicated function for cases (1) and (3): `RetryFlushesForErrorRecovery()`. Also I cleaned up the manual flush functions to remove existing conditionals for these cases as they're no longer needed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11903 Reviewed By: cbi42 Differential Revision: D49693812 Pulled By: ajkr fbshipit-source-id: 7630ac539b9d6c92052c13a3cdce53256134d990	2023-10-02 16:26:24 -07:00
Levi Tamasi	b00fa5597e	Fix the handling of wide-column base values in the max_successive_merges logic (#11913 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11913 The `max_successive_merges` logic currently does not handle wide-column base values correctly, since it uses the `Get` API, which only returns the value of the default column. The patch fixes this by switching to `GetEntity` and passing all columns (if applicable) to the merge operator. Reviewed By: jaykorean Differential Revision: D49795097 fbshipit-source-id: 75eb7cc9476226255062cdb3d43ab6bd1cc2faa3	2023-10-02 16:25:25 -07:00
Peter Dillinger	7bebd3036d	Update tiered storage tests (ahead of next change) (#11917 ) Summary: After https://github.com/facebook/rocksdb/issues/11905, I am preparing a DBImpl change to ensure all sufficiently recent sequence numbers since Open are covered by SeqnoToTimeMapping. Intended follow-up However, there are a number of test changes I want to make prior to that to make it clear that I am not regressing the tests and production behavior at the same time. * Start mock time in the tests well beyond epoch (time 0) so that we aren't normally reaching into pre-history for current time minus the preserve/preclude duration. * Majorly clean up BasicSeqnoToTimeMapping to avoid confusing hard-coded bounds on GetProximalTimeBeforeSeqno() results. * There is an unresolved/unexplained issue marked with FIXME that should be investigated when GetProximalTimeBeforeSeqno() is put into production. * MultiCFs test was strangely generating 5 L0 files, four of which would be compacted into an L1, and then letting TTL compaction compact 1@L0+1@L1. Changing the starting time of the tests seemed to mess up the TTL compaction. But I suspect the TTL compaction was unintentional, so I've cut it down to just 4 L0 files, which compacts predictably. * Unrelated: allow ROCKSDB_NO_STACK=1 to skip printing a stack trace on assertion failures. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11917 Test Plan: no changes to production code Reviewed By: jowlyzhang Differential Revision: D49841436 Pulled By: pdillinger fbshipit-source-id: 753348ace9c548e82bcb77fcc8b2ffb7a6beeb0a	2023-10-02 16:19:05 -07:00
leipeng	d98a9cfb27	test: WritableFile derived class: add missing GetFileSize() override (#11726 ) Summary: Missed `GetFileSize()` forwarding , this PR fix this issue, and mark `WritableFile::GetFileSize()` as pure virtual to detect such issue in compile time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11726 Reviewed By: ajkr Differential Revision: D49791240 Pulled By: jowlyzhang fbshipit-source-id: ef219508d6b15c9a24df9b706a9fdc33cc6a286e	2023-09-29 15:58:08 -07:00
Jay Huh	63ed868840	Offpeak in db option (#11893 ) Summary: RocksDB's primary function is to facilitate read and write operations. Compactions, while essential for minimizing read amplifications and optimizing storage, can sometimes compete with these primary tasks. Especially during periods of high read/write traffic, it's vital to ensure that primary operations receive priority, avoiding any potential disruptions or slowdowns. Conversely, during off-peak times when traffic is minimal, it's an opportune moment to tackle low-priority tasks like TTL based compactions, optimizing resource usage. In this PR, we are incorporating the concept of off-peak time into RocksDB by introducing `daily_offpeak_time_utc` within the DBOptions. This setting is formatted as "HH:mm-HH:mm" where the first one before "-" is the start time and the second one is the end time, inclusive. It will be later used for resource optimization in subsequent PRs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11893 Test Plan: - New Unit Test Added - `DBOptionsTest::OffPeakTimes` - Existing Unit Test Updated - `OptionsTest`, `OptionsSettableTest` Reviewed By: pdillinger Differential Revision: D49714553 Pulled By: jaykorean fbshipit-source-id: fef51ea7c0fede6431c715bff116ddbb567c8752	2023-09-29 13:03:39 -07:00
Peter Dillinger	02443dd93f	Refactor, clean up, fixes, and more testing for SeqnoToTimeMapping (#11905 ) Summary: This change is before a planned DBImpl change to ensure all sufficiently recent sequence numbers since Open are covered by SeqnoToTimeMapping (bug fix with existing test work-arounds). Intended follow-up However, I found enough issues with SeqnoToTimeMapping to warrant this PR first, including very small fixes in DB implementation related to API contract of SeqnoToTimeMapping. Functional fixes / changes: * This fixes some mishandling of boundary cases. For example, if the user decides to stop writing to DB, the last written sequence number would perpetually have its write time updated to "now" and would always be ineligible for migration to cold tier. Part of the problem is that the SeqnoToTimeMapping would return a seqno known to have been written before (immediately or otherwise) the requested time, but compaction_job.cc would include that seqno in the preserve/exclude set. That is fixed (in part) by adding one in compaction_job.cc * That problem was worse because a whole range of seqnos could be updated perpetually with new times in SeqnoToTimeMapping::Append (if no writes to DB). That logic was apparently optimized for GetOldestApproximateTime (now GetProximalTimeBeforeSeqno), which is not used in production, to the detriment of GetOldestSequenceNum (now GetProximalSeqnoBeforeTime), which is used in production. (Perhaps plans changed during development?) This is fixed in Append to optimize for accuracy of GetProximalSeqnoBeforeTime. (Unit tests added and updated.) * Related: SeqnoToTimeMapping did not have a clear contract about the relationships between seqnos and times, just the idea of a rough correspondence. Now the class description makes it clear that the write time of each recorded seqno comes before or at the associated time, to support getting best results for GetProximalSeqnoBeforeTime. And this makes it easier to make clear the contract of each API function. * Update `DBImpl::RecordSeqnoToTimeMapping()` to follow this ordering in gathering samples. Some part of these changes has required an expanded test work-around for the problem (see intended follow-up above) that the DB does not immediately ensure recent seqnos are covered by its mapping. These work-arounds will be removed with that planned work. An apparent compaction bug is revealed in PrecludeLastLevelTest::RangeDelsCauseFileEndpointsToOverlap, so that test is disabled. Filed GitHub issue #11909 Cosmetic / code safety things (not exhaustive): * Fix some confusing names. * `seqno_time_mapping` was used inconsistently in places. Now just `seqno_to_time_mapping` to correspond to class name. * Rename confusing `GetOldestSequenceNum` -> `GetProximalSeqnoBeforeTime` and `GetOldestApproximateTime` -> `GetProximalTimeBeforeSeqno`. Part of the motivation is that our times and seqnos here have the same underlying type, so we want to be clear about which is expected where to avoid mixing. * Rename `kUnknownSeqnoTime` to `kUnknownTimeBeforeAll` because the value is a bad choice for unknown if we ever add ProximalAfterBlah functions. * Arithmetic on SeqnoTimePair doesn't make sense except for delta encoding, so use better names / APIs with that in mind. * (OMG) Don't allow direct comparison between SeqnoTimePair and SequenceNumber. (There is no checking that it isn't compared against time by accident.) * A field name essentially matching the containing class name is a confusing pattern (`seqno_time_mapping_`). * Wrap calls to confusing (but useful) upper_bound and lower_bound functions to have clearer names and more code reuse. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11905 Test Plan: GetOldestSequenceNum (now GetProximalSeqnoBeforeTime) and TruncateOldEntries were lacking unit tests, despite both being used in production (experimental feature). Added those and expanded others. Reviewed By: jowlyzhang Differential Revision: D49755592 Pulled By: pdillinger fbshipit-source-id: f72a3baac74d24b963c77e538bba89a7fc8dce51	2023-09-29 11:21:59 -07:00
Levi Tamasi	6b4315ee8b	Extend the test coverage of FullMergeV3 (#11896 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11896 The patch extends the test coverage of the wide column aware merge logic by adding two new tests that perform general transformations during merge by implementing the `FullMergeV3` interface. The first one uses a merge operator that produces a wide-column entity as result in all cases (i.e. even if the base value is a plain key-value, or if there is no base value). The second one uses a merge operator that results in a plain key-value in all cases. Reviewed By: jaykorean Differential Revision: D49665946 fbshipit-source-id: 419b9e557c064525b659685eb8c09ae446656439	2023-09-27 14:53:25 -07:00
Yu Zhang	7ea6e724fa	Mark recovery_in_prog_ to false whenever recovery thread joins (#11890 ) Summary: Make the `RecoverFromRetryableBGIOError` function always mark `recovery_in_prog_` to false when it returns. Otherwise, in below code snippet, when db closes and the `error_handler_.CancelErrorRecovery()` call successfully joined the recovery thread, the immediately following while loop will incorrectly think the error recovery is still in progress and loops in `bg_cv_.Wait()`. `1c871a4d86/db/db_impl/db_impl.cc (L542-L545)` This is the issue https://github.com/facebook/rocksdb/issues/11440 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11890 Reviewed By: anand1976 Differential Revision: D49624216 Pulled By: jowlyzhang fbshipit-source-id: ee10cf6527d95b8dd4705a326eb6208d741fe002	2023-09-25 20:15:40 -07:00
Changyu Bi	1c871a4d86	Only flush after recovery for retryable IOError (#11880 ) Summary: https://github.com/facebook/rocksdb/issues/11872 causes a unit test to start failing with the error message below. The cause is that the additional call to `FlushAllColumnFamilies()` in `DBImpl::ResumeImpl()` can run while DB is closing. More detailed explanation: there are two places where we call `ResumeImpl()`: 1. in `ErrorHandler::RecoverFromBGError`, for manual resume or recovery from errors like OutOfSpace through sst file manager, and 2. in `Errorhandler::RecoverFromRetryableBGIOError`, for error recovery from errors like flush failure due to retryable IOError. This is tracked by `ErrorHandler::recovery_thread_`. Here is how DB close waits for error recovery: `49da91ec09/db/db_impl/db_impl.cc (L540-L543)` `CancelErrorRecovery()` waits until `recovery_thread_` finishes and `IsRecoveryInProgress()` checks the `recovery_in_prog_` flag. The additional call to `FlushAllColumnFamilies()` in `ResumeImpl()` happens after it clears bg error and the `recovery_in_prog_` flag: `49da91ec09/db/db_impl/db_impl.cc (L436-L463)`. So if `ResumeImpl()` is called in `RecoverFromBGError()`, we can have a thread running `FlushAllColumnFamilies()` while DB is closing and thought that recovery is done. The fix is to only do the additional call to `FlushAllColumnFamilies()` when doing error recovery through `Errorhandler::RecoverFromRetryableBGIOError` by setting flags in `DBRecoverContext`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11880 Test Plan: `gtest-parallel --repeat=100 --workers=4 ./error_handler_fs_test --gtest_filter="AutoRecoverFlushError"` reproduces the error pretty reliably. ```[==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from DBErrorHandlingFSTest [ RUN ] DBErrorHandlingFSTest.AutoRecoverFlushError error_handler_fs_test: db/column_family.cc:1618: rocksdb::ColumnFamilySet::~ColumnFamilySet(): Assertion `last_ref' failed. Received signal 6 (Aborted) ... https://github.com/facebook/rocksdb/issues/10 0x00007fac4409efd6 in __GI___assert_fail (assertion=0x7fac452c0afa "last_ref", file=0x7fac452c9fb5 "db/column_family.cc", line=1618, function=0x7fac452cb950 "rocksdb::ColumnFamilySet::~ColumnFamilySet()") at assert.c:101 101 in assert.c https://github.com/facebook/rocksdb/issues/11 0x00007fac44b5324f in rocksdb::ColumnFamilySet::~ColumnFamilySet (this=0x7b5400000000) at db/column_family.cc:1618 1618 assert(last_ref); https://github.com/facebook/rocksdb/issues/12 0x00007fac44e0f047 in std::default_delete<rocksdb::ColumnFamilySet>::operator() (this=0x7b5800000940, __ptr=0x7b5400000000) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:85 85 delete __ptr; https://github.com/facebook/rocksdb/issues/13 std::__uniq_ptr_impl<rocksdb::ColumnFamilySet, std::default_delete<rocksdb::ColumnFamilySet> >::reset (this=0x7b5800000940, __p=0x0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:182 182 _M_deleter()(__old_p); https://github.com/facebook/rocksdb/issues/14 std::unique_ptr<rocksdb::ColumnFamilySet, std::default_delete<rocksdb::ColumnFamilySet> >::reset (this=0x7b5800000940, __p=0x0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:456 456 _M_t.reset(std::move(__p)); https://github.com/facebook/rocksdb/issues/15 rocksdb::VersionSet::~VersionSet (this=this@entry=0x7b5800000900) at db/version_set.cc:5081 5081 column_family_set_.reset(); https://github.com/facebook/rocksdb/issues/16 0x00007fac44e0f97a in rocksdb::VersionSet::~VersionSet (this=0x7b5800000900) at db/version_set.cc:5078 5078 VersionSet::~VersionSet() { https://github.com/facebook/rocksdb/issues/17 0x00007fac44bf0b2f in std::default_delete<rocksdb::VersionSet>::operator() (this=0x7b8c00000068, __ptr=0x7b5800000900) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:85 85 delete __ptr; https://github.com/facebook/rocksdb/issues/18 std::__uniq_ptr_impl<rocksdb::VersionSet, std::default_delete<rocksdb::VersionSet> >::reset (this=0x7b8c00000068, __p=0x0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:182 182 _M_deleter()(__old_p); https://github.com/facebook/rocksdb/issues/19 std::unique_ptr<rocksdb::VersionSet, std::default_delete<rocksdb::VersionSet> >::reset (this=0x7b8c00000068, __p=0x0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:456 456 _M_t.reset(std::move(__p)); https://github.com/facebook/rocksdb/issues/20 rocksdb::DBImpl::CloseHelper (this=this@entry=0x7b8c00000000) at db/db_impl/db_impl.cc:676 676 versions_.reset(); https://github.com/facebook/rocksdb/issues/21 0x00007fac44bf1346 in rocksdb::DBImpl::CloseImpl (this=0x7b8c00000000) at db/db_impl/db_impl.cc:720 720 Status DBImpl::CloseImpl() { return CloseHelper(); } https://github.com/facebook/rocksdb/issues/22 rocksdb::DBImpl::~DBImpl (this=this@entry=0x7b8c00000000) at db/db_impl/db_impl.cc:738 738 closing_status_ = CloseImpl(); https://github.com/facebook/rocksdb/issues/23 0x00007fac44bf2bba in rocksdb::DBImpl::~DBImpl (this=0x7b8c00000000) at db/db_impl/db_impl.cc:722 722 DBImpl::~DBImpl() { https://github.com/facebook/rocksdb/issues/24 0x00007fac455444d4 in rocksdb::DBTestBase::Close (this=this@entry=0x7b6c00000000) at db/db_test_util.cc:678 678 delete db_; https://github.com/facebook/rocksdb/issues/25 0x00007fac455455fb in rocksdb::DBTestBase::TryReopen (this=this@entry=0x7b6c00000000, options=...) at db/db_test_util.cc:707 707 Close(); https://github.com/facebook/rocksdb/issues/26 0x00007fac45543459 in rocksdb::DBTestBase::Reopen (this=0x7ffed74b79a0, options=...) at db/db_test_util.cc:670 670 ASSERT_OK(TryReopen(options)); https://github.com/facebook/rocksdb/issues/27 0x00000000004f2522 in rocksdb::DBErrorHandlingFSTest_AutoRecoverFlushError_Test::TestBody (this=this@entry=0x7b6c00000000) at db/error_handler_fs_test.cc:1224 1224 Reopen(options); ``` Reviewed By: jowlyzhang Differential Revision: D49579701 Pulled By: cbi42 fbshipit-source-id: 3fc8325e6dde7e7faa8bcad95060cb4e26eda638	2023-09-25 09:34:39 -07:00
Changyu Bi	0086809601	Fix a bug with atomic_flush that causes DB to stuck after a flush failure (#11872 ) Summary: With atomic_flush=true, a flush job with younger memtables wait for older memtables to be installed before install its memtables. If the flush for older memtables failed, auto-recovery starts a resume thread which can becomes stuck waiting for all background work to finish (including the flush for younger memtables). If a non-recovery flush starts now and tries to flush, it can make the situation worse since it will fail due to background error but never rollback its memtable: `269478ee46/db/db_impl/db_impl_compaction_flush.cc (L725)` This prevents any future flush to pick old memtables. A more detailed repro is in unit test. This PR fixes this issue by 1. Ensure we rollback memtables if an atomic flush fails due to background error 2. When there is a background error, abort atomic flushes that are waiting for older memtables to be installed 3. Do not schedule non-recovery flushes when there is a background error that stops background work There was another issue with atomic_flush=true where DB can hang during DB close, see more in #11867. The fix in this PR, specifically fix 2 above, should be enough to resolve it too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11872 Test Plan: new unit test. Reviewed By: jowlyzhang Differential Revision: D49556867 Pulled By: cbi42 fbshipit-source-id: 4a0210ff28a8552a99ece7fbb0f574fd24b4da3f	2023-09-22 16:43:50 -07:00
Levi Tamasi	12d9386a4f	Return a special OK status when the number of merge operands exceeds a threshold (#11870 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11870 Having a large number of merge operands applied at query time can have a significant effect on performance; therefore, applications might want limit the number of deltas for any given key. However, there is currently no way to establish the number of operands for certain types of queries. The ticker `READ_NUM_MERGE_OPERANDS` only provides aggregate (not per-read) information. The `PerfContext` counters `internal_merge_count` and `internal_merge_point_lookup_count` can be used to get this information on a per-query basis for iterators and single point lookups; however, there is no per-key breakdown for `MultiGet` type APIs. The patch addresses this issue by introducing a special kind of OK status which signals that an application-defined threshold on the number of merge operands has been exceeded for a given key. The threshold can be specified on a per-query basis using a new field in `ReadOptions`. Reviewed By: jaykorean Differential Revision: D49522786 fbshipit-source-id: 4265b3848d1be5ff313a3e8fb604ddf56411dd2c	2023-09-22 13:49:19 -07:00
anand76	269478ee46	Support compressed and local flash secondary cache stacking (#11812 ) Summary: This PR implements support for a three tier cache - primary block cache, compressed secondary cache, and a nvm (local flash) secondary cache. This allows more effective utilization of the nvm cache, and minimizes the number of reads from local flash by caching compressed blocks in the compressed secondary cache. The basic design is as follows - 1. A new secondary cache implementation, ```TieredSecondaryCache```, is introduced. It keeps the compressed and nvm secondary caches and manages the movement of blocks between them and the primary block cache. To setup a three tier cache, we allocate a ```CacheWithSecondaryAdapter```, with a ```TieredSecondaryCache``` instance as the secondary cache. 2. The table reader passes both the uncompressed and compressed block to ```FullTypedCacheInterface::InsertFull```, allowing the block cache to optionally store the compressed block. 3. When there's a miss, the block object is constructed and inserted in the primary cache, and the compressed block is inserted into the nvm cache by calling ```InsertSaved```. This avoids the overhead of recompressing the block, as well as avoiding putting more memory pressure on the compressed secondary cache. 4. When there's a hit in the nvm cache, we attempt to insert the block in the compressed secondary cache and the primary cache, subject to the admission policy of those caches (i.e admit on second access). Blocks/items evicted from any tier are simply discarded. We can easily implement additional admission policies if desired. Todo (In a subsequent PR): 1. Add to db_bench and run benchmarks 2. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11812 Reviewed By: pdillinger Differential Revision: D49461842 Pulled By: anand1976 fbshipit-source-id: b40ac1330ef7cd8c12efa0a3ca75128e602e3a0b	2023-09-21 20:30:53 -07:00
Changyu Bi	b927ba5936	Rollback other pending memtable flushes when a flush fails (#11865 ) Summary: when atomic_flush=false, there are certain cases where we try to install memtable results with already deleted SST files. This can happen when the following sequence events happen: ``` Start Flush0 for memtable M0 to SST0 Start Flush1 for memtable M1 to SST1 Flush 1 returns OK, but don't install to MANIFEST and let whoever flushes M0 to take care of it Flush0 finishes with a retryable IOError, it rollbacks M0, (incorrectly) does not rollback M1, and deletes SST0 and SST1 Starts Flush2 for M0, it does not pick up M1 since it thought M1 is flushed Flush2 writes SST2 and finishes OK, tries to install SST2 and SST1 Error opening SST1 since it's already deleted with an error message like the following: IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_3577_4230653031040984171/000011.sst: No such file or directory ``` This happens since: 1. We currently only rollback the memtables that we are flushing in a flush job when atomic_flush=false. 2. Pending output SSTs from previous flushes are deleted since a pending file number is released whenever a flush job is finished no matter of flush status: `f42e70bf56/db/db_impl/db_impl_compaction_flush.cc (L3161)` This PR fixes the issue by rollback these pending flushes. There is another issue where if a new flush for new memtable starts and finishes after Flush0 finishes. Its output may also be deleted (see more in unit test). It is fixed by checking bg error status before installing a memtable result, and rollback if there is an error. There is a more efficient fix where we just don't release the pending file output number for flushes that delegate installation. It is more efficient since it does not have to rewrite the flush output file. With the fix in this PR, we can end up with a giant file if a lot of memtables are being flushed together. However, the more efficient fix is a bit more complicated to implement (requires associating such pending file numbers with flush job/memtables) and is more risky since it changes normal flush code path. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11865 Test Plan: * Added repro unit tests. Reviewed By: anand1976 Differential Revision: D49484922 Pulled By: cbi42 fbshipit-source-id: 25b536c08f4e02e7f1d0f86571663737d2b5d53d	2023-09-21 15:31:29 -07:00
Yu Zhang	32fc1e6cdc	Add unit test for the multiget fix when ReadOptions.read_tier == kPersistedTier and disableWAL == true (#11854 ) Summary: Add unit tests for the fix in https://github.com/facebook/rocksdb/pull/11700 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11854 Reviewed By: anand1976 Differential Revision: D49392462 Pulled By: jowlyzhang fbshipit-source-id: bd6978e4888074fa5417f3ccda7a78a2c7eee9c6	2023-09-21 14:59:58 -07:00
Peter (Stig) Edwards	bf488c3052	Use *next_sequence -1 here (#11861 ) Summary: To fix off-by-one error: Transaction could not check for conflicts for operation at SequenceNumber 500000 as the MemTable only contains changes newer than SequenceNumber 500001. Fixes https://github.com/facebook/rocksdb/issues/11822 I think introduced in `a657ee9a9c` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11861 Reviewed By: pdillinger Differential Revision: D49457273 Pulled By: ajkr fbshipit-source-id: b527cbae4ecc7874633a11f07027adee62940d74	2023-09-21 13:52:01 -07:00
Hui Xiao	089070cb36	Expose more info about input files in `CompactionFilter::Context` (#11857 ) Summary: Context: As requested, lowest level as well as a map from input file to its table properties among all input files used in table creation (if any) are exposed in `CompactionFilter::Context`. Summary: This PR contains two commits: (1) [Refactory](`0012777f0e`) to make resonating/using what is in `Compaction:: table_properties_` easier - Separate `Compaction:: table_properties_` into `Compaction:: input_table_properties_` and `Compaction:: output_table_properties_` - Separate the "set input table properties" logic into `Compaction:: SetInputTableProperties()`) from `Compaction:: GetInputTableProperties` - Call `Compaction:: SetInputTableProperties()` as soon as possible, which is right after `Compaction::SetInputVersion()`. Bundle these two functions into one `Compaction::FinalizeInputInfo()` to minimize missing one or the other (2) [Expose more info about input files:](`6093e7dfba`) `CompactionFilter::Context::input_start_level/input_table_properties` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11857 Test Plan: - Modify existing UT ` TEST_F(DBTestCompactionFilter, CompactionFilterContextManual)` to cover new logics Reviewed By: ajkr Differential Revision: D49402540 Pulled By: hx235 fbshipit-source-id: 469fff50fa0e5964ffa5ea8db0743f61438ea392	2023-09-20 13:34:39 -07:00
chuhao zeng	8acf17002a	Fix row cache falsely return kNotFound when timestamp enabled (#11816 ) Summary: Summary: When row cache hits and a timestamp is being set in read_options, even though ROW_CACHE entry is hit, the return status is kNotFound. Cause of error: If timestamp is provided in readoptions, a callback for sequence number checking is registered [here](`8fc78a3a9e/db/db_impl/db_impl.cc (L2112)`). Hence the default value set at this [line](`694e49cbb1/table/get_context.cc (L611)`) prevents get_context from saving value found in cache. Causing the final status to be kNotFound even though the entry exist in both cache and SST file. Proposed Solution Row cache key contains a sequence number in it. If the key for row cache lookup matches the key in cache, this cache entry should be good to be exposed to user and hence we reuse the sequence number in cache key rather than passing kMaxSequenceNumber. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11816 Reviewed By: ajkr Differential Revision: D49419029 Pulled By: jowlyzhang fbshipit-source-id: 6c77e9e751628d7d8e6c389f299e29a11ea824c6	2023-09-20 11:34:38 -07:00
Levi Tamasi	51b3b7e08c	Remove a now-unnecessary WideColumnSerialization::Serialize variant (#11864 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11864 Reviewed By: jaykorean Differential Revision: D49445163 fbshipit-source-id: a1275112b9f87a3652df5f155bd215db5819326c	2023-09-20 08:04:35 -07:00
Levi Tamasi	f42e70bf56	Integrate FullMergeV3 into the query and compaction paths (#11858 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11858 The patch builds on https://github.com/facebook/rocksdb/pull/11807 and integrates the `FullMergeV3` API into the read and compaction code paths by updating and extending the logic in `MergeHelper`. In particular, when it comes to merge inputs, the existing `TimedFullMergeWithEntity` is folded into `TimedFullMerge`, since wide-column base values are now handled the same way as plain base values (or no base values for that matter), e.g. they are passed directly to the `MergeOperator`. On the other hand, there is some new differentiation on the output side. Namely, there are now two sets of `TimedFullMerge` variants: one set for contexts where the complete merge result and its value type are needed (used by iterators and compactions), and another set where the merge result is needed in a form determined by the client (used by the point lookup APIs, where e.g. for `Get` we have to extract the value of the default column of any wide-column results). Implementation-wise, the two sets of overloads use different visitors to process the `std::variant` produced by `FullMergeV3`. This has the benefit of eliminating some repeated code e.g. in the point lookup paths, since `TimedFullMerge` now populates the application's result object (`PinnableSlice`/`string` or `PinnableWideColumns`) directly. Moreover, within each set of variants, there is a separate overload for the no base value/plain base value/wide-column base value cases, which eliminates some repeated branching w/r/t to the type of the base value if any. Reviewed By: jaykorean Differential Revision: D49352562 fbshipit-source-id: c2fb9853dba3fbbc6918665bde4195c4ea150a0c	2023-09-19 17:27:04 -07:00
Changyu Bi	cc254efea6	Release compaction files in manifest write callback (#11764 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/10257 (also see [here](https://github.com/facebook/rocksdb/pull/10355#issuecomment-1684308556)) by releasing compaction files earlier when writing to manifest in LogAndApply(). This is done by passing in a [callback](`ba59751430/db/version_set.h (L1199)`) to LogAndApply(). The new Version is created in the same critical section where compaction files are released. When compaction picker is picking compaction based on the new version, these compaction files will already be released. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11764 Test Plan: * Existing unit tests * A repro unit test to validate that compaction files are released: `./db_compaction_test --gtest_filter=DBCompactionTest.ReleaseCompactionDuringManifestWrite` * `python3 ./tools/db_crashtest.py --simple whitebox` with some assertions to check compaction files are released Reviewed By: ajkr Differential Revision: D48742152 Pulled By: cbi42 fbshipit-source-id: 7560fd0e723a63fe692234015d2b96850f8b5d77	2023-09-18 13:11:53 -07:00
dengyan	0dac75d542	Fix a bug in MultiGet when skip_memtable is true (#11700 ) Summary: When skip_memtable is true in MultiGetImpl, The lookup_current is always false, Causes data to be unable to be queried in super_version->current。 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11700 Reviewed By: anand1976 Differential Revision: D49342877 Pulled By: jowlyzhang fbshipit-source-id: 270a36d049b4cb7fd151a1fa3080300310111271	2023-09-18 12:06:58 -07:00
Changyu Bi	6997a06c63	Invalidate threadlocal SV before incrementing `super_version_number_` (#11848 ) Summary: CI has been hitting assertion error like ``` https://github.com/facebook/rocksdb/issues/8 0x00007fafd9294fd6 in __GI___assert_fail (assertion=assertion@entry=0x7fafda270300 "!memtable_range_tombstone_iter_ \|\| sv_number_ != cfd_->GetSuperVersionNumber()", file=file@entry=0x7fafda270350 "db/arena_wrapped_db_iter.cc", line=line@entry=124, function=function@entry=0x7fafda270288 "virtual rocksdb::Status rocksdb::ArenaWrappedDBIter::Refresh(const rocksdb::Snapshot)") at assert.c:101 ``` This is due to * Iterator::Refresh() passing in `cur_sv_number` instead of `sv->version_number` here: `1c6faf3587/db/arena_wrapped_db_iter.cc (L94-L96)` * `super_version_number_` can be incremented before thread local SV is installed: https://github.com/facebook/rocksdb/blob/main/db/column_family.cc#L1287-L1306 * The optimization in https://github.com/facebook/rocksdb/issues/11452 removed the check for SV number, such that `cur_sv_number > sv.version_number` is possible in the following code. ``` uint64_t cur_sv_number = cfd_->GetSuperVersionNumber(); SuperVersion* sv = cfd_->GetReferencedSuperVersion(db_impl_); ``` Not sure why assertion only started failing after https://github.com/facebook/rocksdb/issues/10594, maybe it's because Refresh() is called more often in stress test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11848 Test Plan: * This repros hits the assertion pretty consistently before this change: ``` ./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=0 --atomic_flush=1 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_one_in=0 --block_size=16384 --bloom_bits=0.7161318870366848 --cache_index_and_filter_blocks=0 --cache_size=8388608 --charge_table_reader=0 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_readahead_size=0 --compaction_ttl=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=none --flush_one_in=1000000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=14 --index_type=2 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=30 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --long_running_snapshots=1 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=2500000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=1048576 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=0 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=500000 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=30 --recycle_log_file_num=1 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --subcompactions=1 --sync=0 --sync_fault_injection=1 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=1 --top_level_index_pinning=3 --unpartitioned_pinning=3 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_file_checksums_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=1048576 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=35 --use_io_uring=0 --db=/tmp/rocksdb_crashtest_blackboxnf3pyv_0 --expected_values_dir=/tmp/rocksdb_crashtest_expected_6opy9nqg ``` Reviewed By: ajkr Differential Revision: D49344066 Pulled By: cbi42 fbshipit-source-id: d5373ddb48d933acb42a5dd8fae3f3019b0241e5	2023-09-18 09:37:40 -07:00
Niklas Fiekas	1a9b42bbdd	Add C API for ReadOptions::auto_readahead_size (#11837 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11837 Reviewed By: cbi42 Differential Revision: D49303447 Pulled By: ajkr fbshipit-source-id: 7debf722339f4fd551760ef8d6801b7a41498565	2023-09-17 19:51:28 -07:00
Peter Dillinger	1c6faf3587	Make RibbonFilterPolicy::bloom_before_level mutable (SetOptions()) (#11838 ) Summary: An internal user wants to be able to dynamically switch between Bloom and Ribbon filters, without a custom FilterPolicy. Making `filter_policy` mutable would actually make issue https://github.com/facebook/rocksdb/issues/10079 worse, because it would be a race on a pointer field, not just on scalars. As a reasonable compromise until that is fixed, I am enabling dynamic control over Bloom vs. Ribbon choice by making RibbonFilterPolicy::bloom_before_level mutable, and doing that safely by using an atomic. I've also slightly tweaked the interpretation of that field so that setting it to INT_MAX really means "always Bloom." Pull Request resolved: https://github.com/facebook/rocksdb/pull/11838 Test Plan: unit tests added/extended. crash test updated for SetOptions call and tested under TSAN with amplified probability (lower set_options_one_in). Reviewed By: ajkr Differential Revision: D49296284 Pulled By: pdillinger fbshipit-source-id: e4251c077510df9a9c719876f482448c0d15402a	2023-09-15 15:46:10 -07:00
Jay Huh	cff6490bc4	Add IOActivity.kMultiGetEntity (#11842 ) Summary: - As a follow up from https://github.com/facebook/rocksdb/issues/11799, adding `Env::IOActivity::kMultiGetEntity` support to `DBImpl::MultiGetEntity()`. ## Minor Refactor - Because both `DBImpl::MultiGet()` and `DBImpl::MultiGetEntity()` call `DBImpl::MultiGetCommon()` which later calls `DBImpl::MultiGetWithCallback()` where we check `Env::IOActivity::kMultiGet`, minor refactor was needed so that we don't check `Env::IOActivity::kMultiGet` for `DBImpl::MultiGetEntity()`. - I still see more areas for refactoring to avoid duplicate code of checking IOActivity and setting it when Unknown, but this will be addressed separately. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11842 Test Plan: - Added the `ThreadStatus::OperationType::OP_MULTIGETENTITY` in `db_stress` to verify the pass-down IOActivity in a thread aligns with the actual activity the thread is doing. ``` python3 tools/db_crashtest.py blackbox --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --interval=10 python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --interval=10 python3 tools/db_crashtest.py blackbox --cf_consistency --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --interval=10 ``` Reviewed By: ltamasi Differential Revision: D49329575 Pulled By: jaykorean fbshipit-source-id: 05198f1d3f92e6be3d42a3d184bacb3ab2ce6923	2023-09-15 15:34:04 -07:00
leipeng	68ce5d84f6	Add new Iterator API Refresh(const snapshot) (#10594 ) Summary: This PR resolves https://github.com/facebook/rocksdb/issues/10487 & https://github.com/facebook/rocksdb/issues/10536, user code needs to call Refresh() periodically. The main code change is to support range deletions. A range tombstone iterator uses a sequence number as upper bound to decide which range tombstones are effective. During Iterator refresh, this sequence number upper bound needs to be updated for all range tombstone iterators under DBIter and LevelIterator. LevelIterator may create new table iterators and range tombstone iterator during scanning, so it needs to be aware of iterator refresh. The code path that propagates this change is `db_iter_->set_sequence(read_seq) -> MergingIterator::SetRangeDelReadSeqno() -> TruncatedRangeDelIterator::SetRangeDelReadSeqno() and LevelIterator::SetRangeDelReadSeqno()`. This change also fixes an issue where range tombstone iterators created by LevelIterator may access ReadOptions::snapshot, even though we do not explicitly require users to keep a snapshot alive after creating an Iterator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10594 Test Plan: New unit tests. * Add Iterator::Refresh(snapshot) to stress test. Note that this change only adds tests for refreshing to the same snapshot since this is the main target use case. TODO in a following PR: * Stress test Iterator::Refresh() to different snapshots or no snapshot. Reviewed By: ajkr Differential Revision: D48456896 Pulled By: cbi42 fbshipit-source-id: 2e642c04e91235cc9542ef4cd37b3c20823bd779	2023-09-15 10:44:43 -07:00
Hui Xiao	ed913513bd	Fix a bug of rocksdb.file.read.verify.file.checksums.micros not being populated (#11836 ) Summary: Context/Summary: `rocksdb.file.read.verify.file.checksums.micros ` was added in https://github.com/facebook/rocksdb/pull/11444 but the related path was not populated with statistics and clock object correctly so the actual statistics collection didn't happen. This PR fixed it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11836 Test Plan: Setup: ``` ./db_bench --benchmarks="fillrandom" --file_checksum=1 --num=100 --db=/dev/shm/rocksdb ``` Run: ``` ./db_bench --use_existing_db=1 --benchmarks="verifyfilechecksums" --file_checksum=1 --num=100 --db=/dev/shm/rocksdb --statistics=1 --stats_level=4 ``` Post-PR ``` rocksdb.file.read.verify.file.checksums.micros P50 : 9.000000 P95 : 9.000000 P99 : 9.000000 P100 : 9.000000 COUNT : 1 SUM : 9 ``` Pre-PR ``` rocksdb.file.read.verify.file.checksums.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 ``` Reviewed By: ajkr Differential Revision: D49293378 Pulled By: hx235 fbshipit-source-id: 1acd8b828c28e088d0c5d63897f53cd180b82f42	2023-09-15 10:36:14 -07:00
Yu Zhang	e1fd348b92	Fix a bug in multiget for cleaning up SuperVersion (#11830 ) Summary: When `MultiGet` acquires `SuperVersion` via locking the db mutex and get the current `ColumnFamilyData::super_version_`, its corresponding cleanup logic is not correctly done. It's currently doing this: `MultiGetColumnFamilyData::cfd->GetSuperVersion().Unref()` This operates on the most recent `SuperVersion` without locking db mutex , which is not thread safe by itself. And this unref operation is intended for the originally acquired `SuperVersion` instead of the current one. Because a race condition could happen where a new `SuperVersion` is installed in between this `MultiGet`'s ref and unref. When this race condition does happen, it's not sufficient to just unref the `SuperVersion`, `DBImpl::CleanupSuperVersion` should be called instead to properly clean up the `SuperVersion` had this `MultiGet` call be its last reference holder. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11830 Test Plan: `make all check` Added a unit test that would originally fail Reviewed By: ltamasi Differential Revision: D49287715 Pulled By: jowlyzhang fbshipit-source-id: 8353636ee11b2e90d85c677a96a92360072644b0	2023-09-15 09:50:39 -07:00
Jay Huh	f2b623bcc1	GetEntity Support for ReadOnlyDB and SecondaryDB (#11799 ) Summary: `GetEntity` API support for ReadOnly DB and Secondary DB. - Introduced `GetImpl()` with `GetImplOptions` in `db_impl_readonly` and refactored current `Get()` logic into `GetImpl()` so that look up logic can be reused for `GetEntity()` (Following the same pattern as `DBImpl::Get()` and `DBImpl::GetEntity()`) - Introduced `GetImpl()` with `GetImplOptions` in `db_impl_secondary` and refactored current `GetImpl()` logic. This is to make `DBImplSecondary::Get/GetEntity` consistent with `DBImpl::Get/GetEntity` and `DBImplReadOnly::Get/GetEntity` - `GetImpl()` in `db_impl` is now virtual. both `db_impl_readonly` and `db_impl_secondary`'s `Get()` override are no longer needed since all three dbs now have the same `Get()` which calls `GetImpl()` internally. - `GetImpl()` in `DBImplReadOnly` and `DBImplSecondary` now pass in `columns` instead of `nullptr` in lookup functions like `memtable->get()` - Introduced `GetEntity()` API in `DBImplReadOnly` and `DBImplSecondary` which simply calls `GetImpl()` with `columns` set in `GetImplOptions`. - Introduced `Env::IOActivity::kGetEntity` and set read_options.io_activity to `Env::IOActivity::kGetEntity` for `GetEntity()` operations (in db_impl) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11799 Test Plan: Unit Tests - Added verification in `DBWideBasicTest::PutEntity` by Reopening DB as ReadOnly with the same setup. - Added verification in `DBSecondaryTest::ReopenAsSecondary` by calling `PutEntity()` and `GetEntity()` on top of existing `Put()` and `Get()` - `make -j64 check` Crash Tests - `python3 tools/db_crashtest.py blackbox --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --inter val=10` - `python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 ` - `python3 tools/db_crashtest.py blackbox --cf_consistency --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --inter val=10` Reviewed By: ltamasi Differential Revision: D49037040 Pulled By: jaykorean fbshipit-source-id: a0648253ded6e91af7953de364ed3c6bf163626b	2023-09-15 08:30:44 -07:00
马越	3c27f56d0b	Fix the problem that some keys of ClipColumnFamily may not be deleted (#11811 ) Summary: When executing ClipColumnFamily, if end_key is equal to largest_user_key in a file, this key will not be deleted. So we need to change less than to less than or equal to Pull Request resolved: https://github.com/facebook/rocksdb/pull/11811 Reviewed By: ajkr Differential Revision: D49206936 Pulled By: cbi42 fbshipit-source-id: 3e8bcb7b52040a9b4d1176de727616cc298d3445	2023-09-14 13:36:39 -07:00
Jonah Gao	84d335b619	Remove an unused variable: `last_stats_dump_time_microsec_` (#11824 ) Summary: `last_stats_dump_time_microsec_` is not used after initialization. I guess that it was previously used to implement periodically dumping stats, but this functionality has now been delegated to the `PeriodicTaskScheduler`. `4b79e8c003/db/db_impl/db_impl.cc (L770-L778)` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11824 Reviewed By: cbi42 Differential Revision: D49278311 Pulled By: jowlyzhang fbshipit-source-id: 5856245580afc026e6b490755a45c5436a2375c9	2023-09-14 11:25:33 -07:00
Yu Zhang	39a4ff2cab	Track full_history_ts_low per SuperVersion (#11784 ) Summary: As discussed in https://github.com/facebook/rocksdb/issues/11730 , this PR tracks the effective `full_history_ts_low` per SuperVersion and update existing sanity checks for `ReadOptions.timestamp >= full_history_ts_low` to use this per SuperVersion `full_history_ts_low` instead. This also means the check is moved to happen after acquiring SuperVersion. There are two motivations for this: 1) Each time `full_history_ts_low` really come into effect to collapse history, a new SuperVersion is always installed, because it would involve either a Flush or Compaction, both of which change the LSM tree shape. We can take advantage of this to ensure that as long as this sanity check is passed, even if `full_history_ts_low` can be concurrently increased and collapse some history above the requested `ReadOptions.timestamp`, a read request won’t have visibility to that part of history through this SuperVersion that it already acquired. 2) the existing sanity check uses `ColumnFamilyData::GetFullHistoryTsLow` without locking the db mutex, which is the mutex all `IncreaseFullHistoryTsLow` operation is using when mutating this field. So there is a race condition. This also solve the race condition on the read path. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11784 Test Plan: `make all check` // Checks success scenario really provide the read consistency attribute as mentioned above. `./db_with_timestamp_basic_test --gtest_filter=FullHistoryTsLowSanityCheckPassReadIsConsistent` // Checks failure scenario cleans up SuperVersion properly. `./db_with_timestamp_basic_test --gtest_filter=FullHistoryTsLowSanityCheckFail` `./db_secondary_test --gtest_filter=FullHistoryTsLowSanityCheckFail` `./db_readonly_with_timestamp_test --gtest_filter=FullHistoryTsLowSanitchCheckFail` Reviewed By: ltamasi Differential Revision: D48894795 Pulled By: jowlyzhang fbshipit-source-id: 1f801fe8e1bc8e63ca76c03cbdbd0974e5ff5bf6	2023-09-13 16:34:18 -07:00
Changyu Bi	3285ba7a29	Fix unit test tsan failure (#11828 ) Summary: The test DBCompactionWaitForCompactTest.WaitForCompactWithOptionToFlushAndCloseDB failed tsan in https://app.circleci.com/pipelines/github/facebook/rocksdb/32009/workflows/577e4e1f-a909-4e80-8ef4-af98b5ff7446/jobs/660989. I cannot repro locally, but this should be the fix. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11828 Test Plan: `./db_compaction_test --gtest_filter="DBCompactionWaitForCompactTest/DBCompactionWaitForCompactTest."` Reviewed By: jaykorean Differential Revision: D49241904 Pulled By: cbi42 fbshipit-source-id: 68714c836d982dcb3946da104533d5c0594980de	2023-09-13 15:53:05 -07:00
Hui Xiao	ef3e289b2d	Conditionally exclude some L0 input files in size amp compaction (#11749 ) Summary: Context/Summary: A size amp compaction can select and prevent a large number of L0 files from being selected by other compaction. If such compaction is running long or being queued behind, these L0 files will exist for long. With a few more flushes, we can run into write stop triggered by # L0 files. We've seen this happen on a host with many DBs sharing same thread pool, each of these DBs submits a size amp compaction with (110-180)+ files to the pool upon reopen and with a few more flushes, they hit the 200 L0 write stop condition. The idea is to exclude some L0 input files in size amp compaction that are harmless to size amp reduction but improve the situation described above. The exclusion algorithm is in `MightExcludeNewL0sToReduceWriteStop()` with two elements: 1. #L0 to exclude + (level0_stop_writes_trigger - num_l0_input_pre_exclusion) should be in the range of [min_merge_width, max_merge_width]. - This is to ensure we are excluding enough L0 input files but not too many to be qualified to picked for another compaction along with the incoming future L0 files before write stop. 2. Based on (1), further constrain #L0 to exclude based on the post-exclusion compaction score. The goal is to ensure our exclusion will not disqualify the size amp compaction from being a size amp compaction after exclusion. Tets plan: New unit test Pull Request resolved: https://github.com/facebook/rocksdb/pull/11749 Reviewed By: ajkr Differential Revision: D48850631 Pulled By: hx235 fbshipit-source-id: 2c321036e164087c36319dd5645cbbf6b6152092	2023-09-12 15:53:15 -07:00
Changyu Bi	9d71682d1b	Add statistics `COMPACTION_CPU_TOTAL_TIME` for total compaction time (#11741 ) Summary: Existing compaction statistics are `COMPACTION_TIME` and `COMPACTION_CPU_TIME` which are histogram and are logged at the end of a compaction. The new statistics `COMPACTION_CPU_TOTAL_TIME` is for cumulative total compaction time which is updated regularly during a compaction. This allows user to more closely track compaction cpu usage. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11741 Test Plan: * new unit test `DBTestWithParam.CompactionTotalTimeTest` Reviewed By: ajkr Differential Revision: D48608094 Pulled By: cbi42 fbshipit-source-id: b597109f3e4bf2237fb5a216b6fd036e5363b4c0	2023-09-12 15:48:36 -07:00
Levi Tamasi	1e63fc9925	Add a helper method WideColumnsHelper::SortColumns (#11823 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11823 Similarly to https://github.com/facebook/rocksdb/pull/11813, the patch is a small refactoring that eliminates some copy-paste around sorting the columns of entities by column name. Reviewed By: jaykorean Differential Revision: D49195504 fbshipit-source-id: d48c9f290e3203f838cc5949856c469ecf730008	2023-09-12 12:36:07 -07:00
Hui Xiao	4b123f3a54	Change file size related variables type to uint64_t in PickCompactionToReduceSizeAmp() (#11814 ) Summary: Context/Summary: size_t is not most likely not needed as SortedRun::size/compensated_file_size is uint64_t. This is a pre-requisite to addressing https://github.com/facebook/rocksdb/pull/11749/files#r1321828933. Other places already uses uint64_t e.g, https://github.com/facebook/rocksdb/blob/8.6.fb/db/compaction/compaction_picker_universal.cc#L349-L353 Test CI Pull Request resolved: https://github.com/facebook/rocksdb/pull/11814 Reviewed By: ajkr Differential Revision: D49169155 Pulled By: hx235 fbshipit-source-id: 2b3ad70e6f18aa360e94ed8907c8534ad2797e62	2023-09-12 10:00:57 -07:00
Levi Tamasi	8fc78a3a9e	Add helper methods WideColumnsHelper::{Has,Get}DefaultColumn (#11813 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11813 The patch adds a couple of helper methods `WideColumnsHelper::{Has,Get}DefaultColumn` to eliminate some code duplication. Reviewed By: jaykorean Differential Revision: D49166682 fbshipit-source-id: f229ca5b94599f7445a0112b10f8317292505c82	2023-09-11 16:32:32 -07:00
Levi Tamasi	760ea373a8	Introduce a wide column aware MergeOperator API (#11807 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11807 For now, RocksDB has limited support for using merge with wide columns: when a bunch of merge operands have to be applied to a wide-column base value, RocksDB currently passes only the value of the default column to the application's `MergeOperator`, which means there is no way to update any other columns during a merge. As a first step in making this more general, the patch adds a new API `FullMergeV3` to `MergeOperator`. `FullMergeV3`'s interface enables applications to receive a plain, wide-column, or non-existent base value as merge input, and to produce a new plain value, a new wide-column value, or an existing operand as merge output. Note that there are no limitations on the column names and values if the merge result is a wide-column entity. Also, the interface is general in the sense that it makes it possible e.g. for a merge that takes a plain base value and some deltas to produce a wide-column entity as a result. For backward compatibility, the default implementation of `FullMergeV3` falls back to `FullMergeV2` and implements the current logic where merge operands are applied to the default column of the base entity and any other columns are unchanged. (Note that with `FullMergeV3` in the `MergeOperator` interface, this behavior will become customizable.) This patch just introduces the new API and the default backward compatible implementation. I plan to integrate `FullMergeV3` into the query and compaction logic in subsequent diffs. Reviewed By: jaykorean Differential Revision: D49117253 fbshipit-source-id: 109e016f25cd130fc504790818d927bae7fec6bd	2023-09-11 12:13:58 -07:00
Changyu Bi	195f35c08b	Add a unit test for the fix in #11786 (#11790 ) Summary: Tests a scenario where range tombstone reseek used to cause MergingIterator to discard non-ok status. Ran on main without https://github.com/facebook/rocksdb/issues/11786: ``` ./db_range_del_test --gtest_filter="RangeDelReseekAfterFileReadError" Note: Google Test filter = RangeDelReseekAfterFileReadError [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from DBRangeDelTest [ RUN ] DBRangeDelTest.RangeDelReseekAfterFileReadError db/db_range_del_test.cc:3577: Failure Value of: iter->Valid() Actual: true Expected: false [ FAILED ] DBRangeDelTest.RangeDelReseekAfterFileReadError (64 ms) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11790 Reviewed By: ajkr Differential Revision: D48972869 Pulled By: cbi42 fbshipit-source-id: b1a71867533b0fb60af86f8ce8a9e391ba84dd57	2023-09-06 15:22:39 -07:00
Changyu Bi	458acf8169	Add some unit tests when file read returns error during compaction/scanning (#11788 ) Summary: Some repro unit tests for the bug fixed in https://github.com/facebook/rocksdb/pull/11782. Ran on main without https://github.com/facebook/rocksdb/pull/11782: ``` ./db_compaction_test --gtest_filter='ErrorWhenReadFileHead' Note: Google Test filter = ErrorWhenReadFileHead [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from DBCompactionTest [ RUN ] DBCompactionTest.ErrorWhenReadFileHead db/db_compaction_test.cc:10105: Failure Value of: s.IsIOError() Actual: false Expected: true [ FAILED ] DBCompactionTest.ErrorWhenReadFileHead (3960 ms) ./db_iterator_test --gtest_filter="ErrorWhenReadFile" Note: Google Test filter = ErrorWhenReadFile [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from DBIteratorTest [ RUN ] DBIteratorTest.ErrorWhenReadFile db/db_iterator_test.cc:3399: Failure Value of: (iter->status()).ok() Actual: true Expected: false [ FAILED ] DBIteratorTest.ErrorWhenReadFile (280 ms) [----------] 1 test from DBIteratorTest (280 ms total) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11788 Reviewed By: ajkr Differential Revision: D48940284 Pulled By: cbi42 fbshipit-source-id: 06f3c5963f576db3f85d305ffb2745ee13d209bb	2023-09-06 10:23:41 -07:00
Jay Huh	47be3ffffb	Minor refactor on LDB command for wide column support and release note (#11777 ) Summary: As mentioned in https://github.com/facebook/rocksdb/issues/11754 , refactor to clean up some nearly identical logic. This PR changes the existing debugging string format of Scan command as the following. ``` ❯ ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/ scan --hex ``` Before ``` 0x6669727374 : :0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172 0x7365636F6E64 : 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572 0x7468697264 : 0x62617A ``` After ``` 0x6669727374 ==> :0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172 0x7365636F6E64 ==> 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572 0x7468697264 ==> 0x62617A ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11777 Test Plan: ``` ❯ ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/ dump first ==> :hello attr_name1:foo attr_name2:bar second ==> attr_one:two attr_three:four third ==> baz Keys in range: 3 ❯ ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/ scan first ==> :hello attr_name1:foo attr_name2:bar second ==> attr_one:two attr_three:four third ==> baz ❯ ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/ dump --hex 0x6669727374 ==> :0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172 0x7365636F6E64 ==> 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572 0x7468697264 ==> 0x62617A Keys in range: 3 ❯ ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/ scan --hex 0x6669727374 ==> :0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172 0x7365636F6E64 ==> 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572 0x7468697264 ==> 0x62617A ``` Reviewed By: jowlyzhang Differential Revision: D48876755 Pulled By: jaykorean fbshipit-source-id: b1c608a810fe038999ac528b690d398abf5f21d7	2023-08-31 16:17:03 -07:00
Peter Dillinger	83eb7b8c2c	Log host name (#11776 ) Summary: ... in info_log. Becoming more important with disaggregated storage. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11776 Test Plan: manual Reviewed By: jaykorean Differential Revision: D48849471 Pulled By: pdillinger fbshipit-source-id: 9a8fd8b2564a4f133526ecd7c1414cb667e4ba54	2023-08-31 08:39:09 -07:00
Hui Xiao	05daa12332	Change compaction_readahead_size default value to 2MB (#11762 ) Summary: Context/Summary: After https://github.com/facebook/rocksdb/pull/11631, we rely on `compaction_readahead_size` for how much to read ahead for compaction read under non-direct IO case. https://github.com/facebook/rocksdb/pull/11658 therefore also sanitized 0 `compaction_readahead_size` to 2MB under non-direct IO, which is consistent with the existing sanitization with direct IO. However, this makes disabling compaction readahead impossible as well as add one more scenario to the inconsistent effects between `Options.compaction_readahead_size=0` during DB open and `SetDBOptions("compaction_readahead_size", "0")` . - `SetDBOptions("compaction_readahead_size", "0")` will disable compaction readahead as its logic never goes through sanitization above while `Options.compaction_readahead_size=0` will go through sanitization. Therefore we decided to do this PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11762 Test Plan: Modified existing UTs to cover this PR Reviewed By: ajkr Differential Revision: D48759560 Pulled By: hx235 fbshipit-source-id: b3f85e58bda362a6fa1dc26bd8a87aa0e171af79	2023-08-30 14:57:08 -07:00
Yu Zhang	fc58c7c62a	Add UDT support in SstFileDumper (#11757 ) Summary: For a SST file that uses user-defined timestamp aware comparators, if a lower or upper bound is set, sst_dump tool doesn't handle it well. This PR adds support for that. While working on this `MaybeAddTimestampsToRange` is moved to the udt_util.h file to be shared. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11757 Test Plan: make all check for changes in db_impl.cc and db_impl_compaction_flush.cc for changes in sst_file_dumper.cc, I manually tested this change handles specifying bounds for UDT use cases. It probably should have a unit test file eventually. Reviewed By: ltamasi Differential Revision: D48668048 Pulled By: jowlyzhang fbshipit-source-id: 1560465f40e44668d6d82a7439fe9012be0e74a8	2023-08-30 13:42:04 -07:00
Jay Huh	ea9a5b2914	Wide Column support in ldb (#11754 ) Summary: wide_columns can now be pretty-printed in the following commands - `./ldb dump_wal` - `./ldb dump` - `./ldb idump` - `./ldb dump_live_files` - `./ldb scan` - `./sst_dump --command=scan` There are opportunities to refactor to reduce some nearly identical code. This PR is initial change to add wide column support in `ldb` and `sst_dump` tool. More PRs to come for the refactor. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11754 Test Plan: New Tests added - `WideColumnsHelperTest::DumpWideColumns` - `WideColumnsHelperTest::DumpSliceAsWideColumns` Changes added to existing tests - `ExternalSSTFileTest::BasicMixed` added to cover mixed case (This test should have been added in https://github.com/facebook/rocksdb/issues/11688). This test does not verify the ldb or sst_dump output. This test was used to create test SST files having some rows with wide columns and some without and the generated SST files were used to manually test sst_dump_tool. - `createSST()` in `sst_dump_test` now takes `wide_column_one_in` to add wide column value in SST dump_wal ``` ./ldb dump_wal --walfile=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/000004.log --print_value --header ``` ``` Sequence,Count,ByteSize,Physical Offset,Key(s) : value 1,1,59,0,PUT_ENTITY(0) : 0x:0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172 2,1,34,42,PUT_ENTITY(0) : 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572 3,1,17,7d,PUT(0) : 0x7468697264 : 0x62617A ``` idump ``` ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ idump ``` ``` 'first' seq:1, type:22 => :hello attr_name1:foo attr_name2:bar 'second' seq:2, type:22 => attr_one:two attr_three:four 'third' seq:3, type:1 => baz Internal keys in range: 3 ``` SST Dump from dump_live_files ``` ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ compact ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ dump_live_files ``` ``` ... ============================== SST Files ============================== /tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst level:1 ------------------------------ Process /tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst Sst file format: block-based 'first' seq:0, type:22 => :hello attr_name1:foo attr_name2:bar 'second' seq:0, type:22 => attr_one:two attr_three:four 'third' seq:0, type:1 => baz ... ``` dump ``` ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ dump ``` ``` first ==> :hello attr_name1:foo attr_name2:bar second ==> attr_one:two attr_three:four third ==> baz Keys in range: 3 ``` scan ``` ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ scan ``` ``` first : :hello attr_name1:foo attr_name2:bar second : attr_one:two attr_three:four third : baz ``` sst_dump ``` ./sst_dump --file=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst --command=scan ``` ``` options.env is 0x7ff54b296000 Process /tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst Sst file format: block-based from [] to [] 'first' seq:0, type:22 => :hello attr_name1:foo attr_name2:bar 'second' seq:0, type:22 => attr_one:two attr_three:four 'third' seq:0, type:1 => baz ``` Reviewed By: ltamasi Differential Revision: D48837999 Pulled By: jaykorean fbshipit-source-id: b0280f0589d2b9716bb9b50530ffcabb397d140f	2023-08-30 12:45:52 -07:00
Yu Zhang	4234a6a301	Increase full_history_ts_low when flush happens during recovery (#11774 ) Summary: This PR adds a missing piece for the UDT in memtable only feature, which is to automatically increase `full_history_ts_low` when flush happens during recovery. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11774 Test Plan: Added unit test make all check Reviewed By: ltamasi Differential Revision: D48799109 Pulled By: jowlyzhang fbshipit-source-id: fd681ed66d9d40904ca2c919b2618eb692686035	2023-08-30 09:34:31 -07:00
jsteemann	c1e6ffc40a	remove a sub-condition that is always true (#11746 ) Summary: the value of `done` is always false here, so the sub-condition `!done` will always be true and the check can be removed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11746 Reviewed By: anand1976 Differential Revision: D48656845 Pulled By: ajkr fbshipit-source-id: 523ba3d07b3af7880c8c8ccb20442fd7c0f49417	2023-08-29 18:40:13 -07:00
Yu Zhang	ecbeb305a0	Removing some checks for UDT in memtable only feature (#11732 ) Summary: The user-defined timestamps feature only enforces that for the same key, user-defined timestamps should be non-decreasing. For the user-defined timestamps in memtable only feature, during flush, we check the user-defined timestamps in each memtable to examine if the data is considered expired with regard to `full_history_ts_low`. In this process, it's assuming that a newer memtable should not have smaller user-defined timestamps than an older memtable. This check however is enforcing ordering of user-defined timestamps across keys, as apposed to the vanilla UDT feature, that only enforce ordering of user-defined timestamps for the same key. This more strict user-defined timestamp ordering requirement could be an issue for secondary instances where commits can be out of order. And after thinking more about it, this requirement is really an overkill to keep the invariants of `full_history_ts_low` which are: 1) users cannot read below `full_history_ts_low` 2) users cannot write at or below `full_history_ts_low` 3) `full_history_ts_low` can only be increasing As long as RocksDB enforces these 3 checks, we can prohibit inconsistent read that returns a different value. And these three checks are covered in existing APIs. So this PR removes the extra checks in the UDT in memtable only feature that requires user-defined timestamps to be non decreasing across keys. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11732 Reviewed By: ltamasi Differential Revision: D48541466 Pulled By: jowlyzhang fbshipit-source-id: 95453c6e391cbd511c0feab05f0b11c312d17186	2023-08-29 16:51:48 -07:00
Jan	ba59751430	remove an unused typedef (#11286 ) Summary: `VersionBuilderMap` type alias definition seem unused. If this PR can be compiled fine then the alias is probably not needed anymore. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11286 Reviewed By: jaykorean Differential Revision: D48656747 Pulled By: ajkr fbshipit-source-id: ac8554922aead7dc3d24fe7e6544a4622578c514	2023-08-25 18:01:14 -07:00
Hui Xiao	f833ca3878	Pick files from the last sorted run in size amp compaction picker (#11740 ) Summary: Context/Summary: Same intention as https://github.com/facebook/rocksdb/pull/2693 - basically we now pick from the last sorted run and expand forward till we can't Pull Request resolved: https://github.com/facebook/rocksdb/pull/11740 Test Plan: Existing UT Stress test Reviewed By: ajkr Differential Revision: D48586475 Pulled By: hx235 fbshipit-source-id: 3eb3c3ee1d5f7e0b0d6d649baaeb8c6990fee398	2023-08-23 11:27:48 -07:00
Alexander Bulimov	2b6bcfe590	Add C API for WaitForCompact (#11737 ) Summary: Add a bunch of C API functions to expose new `WaitForCompact` function and related options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11737 Test Plan: unit tests Reviewed By: jaykorean Differential Revision: D48568239 Pulled By: abulimov fbshipit-source-id: 1ff35972d7abacd7e1e17fe2ada1e20cdc88d8de	2023-08-22 14:32:35 -07:00
chuhao zeng	1303573589	Reverse sort order in dedup to enable iter checking in callback (#11725 ) Summary: Fix https://github.com/facebook/rocksdb/issues/6470 Ensure TransactionLogIter being initialized correctly with SYNC_POINT API when calling `GetSortedWALFiles`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11725 Reviewed By: akankshamahajan15 Differential Revision: D48529411 Pulled By: ajkr fbshipit-source-id: 970ca1a6259ed996c6d87f7fcd40f95acf441517	2023-08-22 11:22:35 -07:00
Yu Zhang	03a74411c0	Add unit test for default temperature (#11722 ) Summary: This piggy back the existing last level file temperature statistics test to test the default temperature becoming effective. While adding this unit test, I found that the approach to swap out and use default temperature in `VersionBuilder::LoadTableHandlers` will miss the L0 files created from flush, and only work for existing SST files, SST files created by compaction. So this PR moves that logic to `TableCache::GetTableReader`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11722 Test Plan: ``` ./db_test2 --gtest_filter="LastLevelStatistics" make all check ``` Reviewed By: pdillinger Differential Revision: D48489171 Pulled By: jowlyzhang fbshipit-source-id: ac29f7d484916f3218729594c5bb35c4f2979ac2	2023-08-21 12:14:03 -07:00
Changyu Bi	c2aad555c3	Add `CompressionOptions::checksum` for enabling ZSTD checksum (#11666 ) Summary: Optionally enable zstd checksum flag (`d857369028/lib/zstd.h (L428)`) to detect corruption during decompression. Main changes are in compression.h: * User can set CompressionOptions::checksum to true to enable this feature. * We enable this feature in ZSTD by setting the checksum flag in ZSTD compression context: `ZSTD_CCtx`. * Uses `ZSTD_compress2()` to do compression since it supports frame parameter like the checksum flag. Compression level is also set in compression context as a flag. * Error handling during decompression to propagate error message from ZSTD. * Updated microbench to test read performance impact. About compatibility, the current compression decoders should continue to work with the data created by the new compression API `ZSTD_compress2()`: https://github.com/facebook/zstd/issues/3711. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11666 Test Plan: * Existing unit tests for zstd compression * Add unit test `DBTest2.ZSTDChecksum` to test the corruption case * Manually tested that compression levels, parallel compression, dictionary compression, index compression all work with the new ZSTD_compress2() API. * Manually tested with `sst_dump --command=recompress` that different compression levels and dictionary compression settings all work. * Manually tested compiling with older versions of ZSTD: v1.3.8, v1.1.0, v0.6.2. * Perf impact: from public benchmark data: http://fastcompression.blogspot.com/2019/03/presenting-xxh3.html for checksum and https://github.com/facebook/zstd#benchmarks, if decompression is 1700MB/s and checksum computation is 70000MB/s, checksum computation is an additional ~2.4% time for decompression. Compression is slower and checksumming should be less noticeable. * Microbench: ``` TEST_TMPDIR=/dev/shm ./branch_db_basic_bench --benchmark_filter=DBGet/comp_style:0/max_data:1048576/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:0/compression_type:7/compression_checksum:1/no_blockcache:1/iterations:10000/threads:1 --benchmark_repetitions=100 Min out of 100 runs: Main: 10390 10436 10456 10484 10499 10535 10544 10545 10565 10568 After this PR, checksum=false 10285 10397 10503 10508 10515 10557 10562 10635 10640 10660 After this PR, checksum=true 10827 10876 10925 10949 10971 11052 11061 11063 11100 11109 ``` * db_bench: ``` Write perf TEST_TMPDIR=/dev/shm/ ./db_bench_ichecksum --benchmarks=fillseq[-X10] --compression_type=zstd --num=10000000 --compression_checksum=.. [FillSeq checksum=0] fillseq [AVG 10 runs] : 281635 (± 31711) ops/sec; 31.2 (± 3.5) MB/sec fillseq [MEDIAN 10 runs] : 294027 ops/sec; 32.5 MB/sec [FillSeq checksum=1] fillseq [AVG 10 runs] : 286961 (± 34700) ops/sec; 31.7 (± 3.8) MB/sec fillseq [MEDIAN 10 runs] : 283278 ops/sec; 31.3 MB/sec Read perf TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=readrandom[-X20] --num=100000000 --reads=1000000 --use_existing_db=true --readonly=1 [Readrandom checksum=1] readrandom [AVG 20 runs] : 360928 (± 3579) ops/sec; 4.0 (± 0.0) MB/sec readrandom [MEDIAN 20 runs] : 362468 ops/sec; 4.0 MB/sec [Readrandom checksum=0] readrandom [AVG 20 runs] : 380365 (± 2384) ops/sec; 4.2 (± 0.0) MB/sec readrandom [MEDIAN 20 runs] : 379800 ops/sec; 4.2 MB/sec Compression TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=compress[-X20] --compression_type=zstd --num=100000000 --compression_checksum=1 checksum=1 compress [AVG 20 runs] : 54074 (± 634) ops/sec; 211.2 (± 2.5) MB/sec compress [MEDIAN 20 runs] : 54396 ops/sec; 212.5 MB/sec checksum=0 compress [AVG 20 runs] : 54598 (± 393) ops/sec; 213.3 (± 1.5) MB/sec compress [MEDIAN 20 runs] : 54592 ops/sec; 213.3 MB/sec Decompression: TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=uncompress[-X20] --compression_type=zstd --compression_checksum=1 checksum = 0 uncompress [AVG 20 runs] : 167499 (± 962) ops/sec; 654.3 (± 3.8) MB/sec uncompress [MEDIAN 20 runs] : 167210 ops/sec; 653.2 MB/sec checksum = 1 uncompress [AVG 20 runs] : 167980 (± 924) ops/sec; 656.2 (± 3.6) MB/sec uncompress [MEDIAN 20 runs] : 168465 ops/sec; 658.1 MB/sec ``` Reviewed By: ajkr Differential Revision: D48019378 Pulled By: cbi42 fbshipit-source-id: 674120c6e1853c2ced1436ac8138559d0204feba	2023-08-18 15:01:59 -07:00
Jay Huh	0fa0c97d3e	Timeout in microsecond option in WaitForCompactOptions (#11711 ) Summary: While it's rare, we may run into a scenario where `WaitForCompact()` waits for background jobs indefinitely. For example, not enough space error will add the job back to the queue while WaitForCompact() waits for _all jobs_ including the jobs that are in the queue to be completed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11711 Test Plan: `DBCompactionWaitForCompactTest::WaitForCompactToTimeout` added `timeout` option added to the variables for all of the existing DBCompactionWaitForCompactTests Reviewed By: pdillinger, jowlyzhang Differential Revision: D48416390 Pulled By: jaykorean fbshipit-source-id: 7b6a12f705ab6c6dfaf8ad736a484ca654a86106	2023-08-18 11:21:45 -07:00
Yu Zhang	1e77e35d26	Add a per column family default temperature option for accounting (#11708 ) Summary: Add a column family option `default_temperature` that will be used for file reading accounting purpose, such as io statistics, for files that don't have an explicitly set temperature. This options is not a mutable one, changing its value would require a DB restart. This is to avoid the confusion that had the option being a mutable one, the users may expect it to take effect on all files immediately, while in reality, it would only become effective for SST files opened in the future. This `default_temperature` also just affect accounting during one DB session. It won't be recorded in manifest as the file's temperature and can be different across different DB sessions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11708 Test Plan: ``` make all check ``` Reviewed By: pdillinger Differential Revision: D48375763 Pulled By: jowlyzhang fbshipit-source-id: eb756696c14a694c6e2a93d2bb6f040563194981	2023-08-17 17:06:57 -07:00
Changyu Bi	d1ff401472	Delay bottommost level single file compactions (#11701 ) Summary: For leveled compaction, RocksDB has a special kind of compaction with reason "kBottommmostFiles" that compacts bottommost level files to clear data held by snapshots (more detail in https://github.com/facebook/rocksdb/issues/3009). Such compactions can happen soon after a relevant snapshot is released. For some use cases, a bottommost file may contain only a small amount of keys that can be cleared, so compacting such a file has a high write amp. In addition, these bottommost files may be compacted in compactions with reason other than "kBottommmostFiles" if we wait for some time (so that enough data is ingested to trigger such a compaction). This PR introduces an option `bottommost_file_compaction_delay` to specify the delay of these bottommost level single file compactions. * The main change is in `VersionStorageInfo::ComputeBottommostFilesMarkedForCompaction()` where we only add a file to `bottommost_files_marked_for_compaction_` if it oldest_snapshot is larger than its non-zero largest_seqno and the file is old enough. Note that if a file is not old enough but its largest_seqno is less than oldest_snapshot, we exclude it from the calculation of `bottommost_files_mark_threshold_`. This makes the change simpler, but such a file's eligibility for compaction will only be checked the next time `ComputeBottommostFilesMarkedForCompaction()` is called. This happens when a new Version is created (compaction, flush, SetOptions()...), a new enough snapshot is released (`VersionStorageInfo::UpdateOldestSnapshot()`) or when a compaction is picked and compaction score has to be re-calculated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11701 Test Plan: * Add two unit tests to test when bottommost_file_compaction_delay > 0. * Ran crash test with the new option. Reviewed By: jaykorean, ajkr Differential Revision: D48331564 Pulled By: cbi42 fbshipit-source-id: c584f3dc5f6354fce3ed65f4c6366dc450b15ba8	2023-08-16 17:45:44 -07:00
Yu Zhang	6a3da5635e	Add documentation to some formatting util functions (#11674 ) Summary: As titled, mostly adding documentation. While updating one usage of these util functions in the external file ingestion job based on code inspection. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11674 Test Plan: ``` make check ``` Note that no unit test was added or updated to check the change in the external file ingestion flow works. This is because user-defined timestamp doesn't support bulk loading yet. There could be other missing pieces that are needed to make this flow functional and testable. That work is separately tracked and unit tests will be added then. Reviewed By: cbi42 Differential Revision: D48271338 Pulled By: jowlyzhang fbshipit-source-id: c05c3440f1c08632dd0de51b563a30b44b4eb8b5	2023-08-14 22:04:18 -07:00
Jay Huh	793a786fa3	Fix for unchecked status in CancelAllBackgroundWork (#11699 ) Summary: ## Summary PR https://github.com/facebook/rocksdb/issues/11497 introduced this. Status from `CancelPeriodicTaskScheduler()` is unchecked and causing test failure like https://app.circleci.com/pipelines/github/facebook/rocksdb/30743/workflows/24443a9b-6fc3-41e6-86c1-992d766eb1ec/jobs/642419 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11699 Test Plan: Existing tests Reviewed By: cbi42 Differential Revision: D48287188 Pulled By: jaykorean fbshipit-source-id: b6bcf6e3c3c47f126c34c24a3dfed2649635cc8c	2023-08-11 19:59:56 -07:00
Peter Dillinger	ef6f025563	Placeholder for AutoHyperClockCache, more (#11692 ) Summary: * The plan is for AutoHyperClockCache to be selected when HyperClockCacheOptions::estimated_entry_charge == 0, and in that case to use a new configuration option min_avg_entry_charge for determining an extreme case maximum size for the hash table. For the placeholder, a hack is in place in HyperClockCacheOptions::MakeSharedCache() to make the unit tests happy despite the new options not really making sense with the current implementation. * Mostly updating and refactoring tests to test both the current HCC (internal name FixedHyperClockCache) and a placeholder for the new version (internal name AutoHyperClockCache). * Simplify some existing tests not to depend directly on cache type. * Type-parameterize the shard-level unit tests, which unfortunately requires more syntax like `this->` in places for disambiguation. * Added means of choosing auto_hyper_clock_cache to cache_bench, db_bench, and db_stress, including add to crash test. * Add another templated class BaseHyperClockCache to reduce future copy-paste * Added ReportProblems support to cache_bench * Added a DEBUG-level diagnostic to ReportProblems for the variance in load factor throughout the table, which will become more of a concern with linear hashing to be used in the Auto implementation. Example with current Fixed HCC: ``` 2023/08/10-13:41:41.602450 6ac36 [DEBUG] [che/clock_cache.cc:1507] Slot occupancy stats: Overall 49% (129008/262144), Min/Max/Window = 39%/60%/500, MaxRun{Pos/Neg} = 18/17 ``` In other words, with overall occupancy of 49%, the lowest across any 500 contiguous cells is 39% and highest 60%. Longest run of occupied is 18 and longest run of unoccupied is 17. This seems consistent with random samples from a uniform distribution. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11692 Test Plan: Shouldn't be any meaningful changes yet to production code or to what is tested, but there is temporary redundancy in testing until the new implementation is plugged in. Reviewed By: jowlyzhang Differential Revision: D48247413 Pulled By: pdillinger fbshipit-source-id: 11541f996d97af403c2e43c92fb67ff22dd0b5da	2023-08-11 16:27:38 -07:00
Jay Huh	52816ff64d	Close DB option in WaitForCompact() (#11497 ) Summary: Context: As mentioned in https://github.com/facebook/rocksdb/issues/11436, introducing `close_db` option in `WaitForCompactOptions` to close DB after waiting for compactions to finish. Must be set to true to close the DB upon compactions finishing. 1. `bool close_db = false` added to `WaitForCompactOptions` 2. Introduced `CancelPeriodicTaskSchedulers()` and moved unregistering PeriodicTaskSchedulers to it.`CancelAllBackgroundWork()` calls it now. 3. When close_db option is on, unpersisted data (data in memtable when WAL is disabled) will be flushed in `WaitForCompact()` if flush option is not on (and `mutable_db_options_.avoid_flush_during_shutdown` is not true). The unpersisted data flush in `CancelAllBackgroundWork()` will be skipped because `shutting_down_` flag will be set true before calling `Close()`. 4. Atomic boolean `reject_new_background_jobs_` is introduced to prevent new background jobs from being added during the short period of time after waiting is done and before `shutting_down_` is set by `Close()`. 5. `WaitForCompact()` now waits for recovery in progress to complete as well. (flush operations from WAL -> L0 files) 6. Added `close_db_` cases to all existing `WaitForCompactTests` 7. Added a scenario to `DBBasicTest::DBClose` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11497 Test Plan: - Existing DBCompactionTests - `WaitForCompactWithOptionToFlushAndCloseDB` added - Added a scenario to `DBBasicTest::DBClose` Reviewed By: pdillinger, jowlyzhang Differential Revision: D46337560 Pulled By: jaykorean fbshipit-source-id: 0f8c7ee09394847f2af5ea4bdd331b47bcdef0b0	2023-08-11 12:30:48 -07:00
Yu Zhang	7cdbce4564	Add UDT support in API DB::GetApproximateMemTableStats (#11689 ) Summary: This API should consider the case when user-defined timestamp is enabled. Also added some documentation to some related API to clarify the usage in the case when user-defined timestamp is enabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11689 Test Plan: Unit test added ``` make check ./db_with_timestamp_basic_test --gtest_filter=GetApproximateSizes ``` Reviewed By: ltamasi Differential Revision: D48208568 Pulled By: jowlyzhang fbshipit-source-id: c5baa4a2923441f8ea3a3672c98223a43a3428dc	2023-08-11 11:26:38 -07:00
Jay Huh	66643b8106	PutEntity Support in SST File Writer (#11688 ) Summary: RocksDB provides APIs that enable creating SST files offline and then bulk loading them into the LSM tree quickly using metadata operations. Namely, clients can use the `SstFileWriter` class for the offline data preparation and then the IngestExternalFile family of APIs to perform the bulk loading. However, `SstFileWriter` currently does not support creating files with wide-column data in them. This PR adds `PutEntity` API implementation to `SstFileWriter` to support creating files with wide-column data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11688 Test Plan: - `BasicWideColumn` test added in external_sst_file_test Reviewed By: ltamasi Differential Revision: D48243779 Pulled By: jaykorean fbshipit-source-id: 1697e5bd67121a648c03946f867416a94be0cadf	2023-08-10 18:16:10 -07:00
Changyu Bi	76ed9a3990	Add missing status check when compiling with `ASSERT_STATUS_CHECKED=1` (#11686 ) Summary: It seems the flag `-fno-elide-constructors` is incorrectly overwritten in Makefile by `9c2ebcc2c3/Makefile (L243)` Applying the change in PR https://github.com/facebook/rocksdb/issues/11675 shows a lot of missing status checks. This PR adds the missing status checks. Most of changes are just adding asserts in unit tests. I'll add pr comment around more interesting changes that need review. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11686 Test Plan: change Makefile as in https://github.com/facebook/rocksdb/issues/11675, and run `ASSERT_STATUS_CHECKED=1 TEST_UINT128_COMPAT=1 ROCKSDB_MODIFY_NPHASH=1 LIB_MODE=static OPT="-DROCKSDB_NAMESPACE=alternative_rocksdb_ns" make V=1 -j24 J=24 check` Reviewed By: hx235 Differential Revision: D48176132 Pulled By: cbi42 fbshipit-source-id: 6758946cfb1c6ff84c4c1e0ca540d05e6fc390bd	2023-08-09 15:46:44 -07:00
Yu Zhang	c751583c03	Set default cf ts sz for a reused transaction (#11685 ) Summary: Set up the default column family timestamp size for a reused write committed transaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11685 Test Plan: Added unit test. Reviewed By: ltamasi Differential Revision: D48195129 Pulled By: jowlyzhang fbshipit-source-id: 54faa900c123fc6daa412c01490e36c10a24a678	2023-08-09 13:49:42 -07:00
Hui Xiao	9a034801ce	Group rocksdb.sst.read.micros stat by different user read IOActivity + misc (#11444 ) Summary: Context/Summary: - Similar to https://github.com/facebook/rocksdb/pull/11288 but for user read such as `Get(), MultiGet(), DBIterator::XXX(), Verify(File)Checksum()`. - For this, I refactored some user-facing `MultiGet` calls in `TransactionBase` and various types of `DB` so that it does not call a user-facing `Get()` but `GetImpl()` for passing the `ReadOptions::io_activity` check (see PR conversation) - New user read stats breakdown are guarded by `kExceptDetailedTimers` since measurement shows they have 4-5% regression to the upstream/main. - Misc - More refactoring: with https://github.com/facebook/rocksdb/pull/11288, we complete passing `ReadOptions/IOOptions` to FS level. So we can now replace the previously [added](https://github.com/facebook/rocksdb/pull/9424) `rate_limiter_priority` parameter in `RandomAccessFileReader`'s `Read/MultiRead/Prefetch()` with `IOOptions::rate_limiter_priority` - Also, `ReadAsync()` call time is measured in `SST_READ_MICRO` now Pull Request resolved: https://github.com/facebook/rocksdb/pull/11444 Test Plan: - CI fake db crash/stress test - Microbenchmarking Build `make clean && ROCKSDB_NO_FBCODE=1 DEBUG_LEVEL=0 make -jN db_basic_bench` - google benchmark version: `604f6fd3f4` - db_basic_bench_base: upstream - db_basic_bench_pr: db_basic_bench_base + this PR - asyncread_db_basic_bench_base: upstream + [db basic bench patch for IteratorNext](https://github.com/facebook/rocksdb/compare/main...hx235:rocksdb:micro_bench_async_read) - asyncread_db_basic_bench_pr: asyncread_db_basic_bench_base + this PR Test Get ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_{null_stat\|base\|pr} --benchmark_filter=DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/negative_query:0/enable_filter:0/mmap:1/threads:1 --benchmark_repetitions=1000 ``` Result ``` Coming soon ``` AsyncRead ``` TEST_TMPDIR=/dev/shm ./asyncread_db_basic_bench_{base\|pr} --benchmark_filter=IteratorNext/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/async_io:1/include_detailed_timers:0 --benchmark_repetitions=1000 > syncread_db_basic_bench_{base\|pr}.out ``` Result ``` Base: 1956,1956,1968,1977,1979,1986,1988,1988,1988,1990,1991,1991,1993,1993,1993,1993,1994,1996,1997,1997,1997,1998,1999,2001,2001,2002,2004,2007,2007,2008, PR (2.3% regression, due to measuring `SST_READ_MICRO` that wasn't measured before): 1993,2014,2016,2022,2024,2027,2027,2028,2028,2030,2031,2031,2032,2032,2038,2039,2042,2044,2044,2047,2047,2047,2048,2049,2050,2052,2052,2052,2053,2053, ``` Reviewed By: ajkr Differential Revision: D45918925 Pulled By: hx235 fbshipit-source-id: 58a54560d9ebeb3a59b6d807639692614dad058a	2023-08-08 17:26:50 -07:00
Yu Zhang	9c2ebcc2c3	Log user_defined_timestamps_persisted flag in event logger (#11683 ) Summary: As titled, and also removed an undefined and unused member function in for ColumnFamilyData Pull Request resolved: https://github.com/facebook/rocksdb/pull/11683 Reviewed By: ajkr Differential Revision: D48156290 Pulled By: jowlyzhang fbshipit-source-id: cc99aaafe69db6611af3854cb2b2ebc5044941f7	2023-08-08 12:25:21 -07:00
Peter Dillinger	e214964f40	Fix a potential memory leak on row_cache insertion failure (#11682 ) Summary: Although the built-in Cache implementations never return failure on Insert without keeping a reference (Handle), a custom implementation could. The code for inserting into row_cache does not keep a reference but does not clean up appropriately on non-OK. This is a fix. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11682 Test Plan: unit test added that previously fails under ASAN Reviewed By: ajkr Differential Revision: D48153831 Pulled By: pdillinger fbshipit-source-id: 86eb7387915c5b38b6ff5dd8deb4e1e223b7d020	2023-08-08 11:34:41 -07:00
tabokie	6d1effaf01	exclude uninitialized files when estimating compression ratio (#11664 ) Summary: Exclude files with uninitialized table properties when estimating compression ratio. Cherry-picking downstream PR: https://github.com/tikv/rocksdb/pull/335 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11664 Reviewed By: cbi42 Differential Revision: D48002518 Pulled By: ajkr fbshipit-source-id: 931fac8a06b4ed7b7b605cf79903302f1b8babfd	2023-08-07 12:35:42 -07:00
Xinye Tao	d2b0652b32	compute compaction score once for a batch of range file deletes (#10744 ) Summary: Only re-calculate compaction score once for a batch of deletions. Fix performance regression brought by https://github.com/facebook/rocksdb/pull/8434. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10744 Test Plan: In one of our production cluster that recently upgraded to RocksDB 6.29, it takes more than 10 minutes to delete files in 30,000 ranges. The RocksDB instance contains approximately 80,000 files. After this patch, the duration reduces to 100+ ms, which is on par with RocksDB 6.4. Cherry-picking downstream PR: https://github.com/tikv/rocksdb/pull/316 Signed-off-by: tabokie <xy.tao@outlook.com> Reviewed By: cbi42 Differential Revision: D48002581 Pulled By: ajkr fbshipit-source-id: 7245607ee3ad79c53b648a6396c9159f166b9437	2023-08-07 12:29:31 -07:00
Andrew Kryczka	4500a0d6ec	Avoid an std::map copy in persistent stats (#11681 ) Summary: An internal user reported this copy showing up in a CPU profile. We can use move instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11681 Differential Revision: D48103170 Pulled By: ajkr fbshipit-source-id: 083d6470181a0041bb5275b657aa61bee23a3729	2023-08-06 18:01:08 -07:00
Changyu Bi	eca48bc166	Avoid shifting component too large error in FileTtlBooster (#11673 ) Summary: When `num_levels` > 65, we may be shifting more than 63 bits in FileTtlBooster. This can give errors like: `runtime error: shift exponent 98 is too large for 64-bit type 'uint64_t' (aka 'unsigned long')`. This PR makes a quick fix for this issue by taking a min in the shifting component. This issue should be rare since it requires a user using a large `num_levels`. I'll follow up with a more complex fix if needed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11673 Test Plan: * Add a unit test that produce the above error before this PR. Need to compile it with ubsan: `COMPILE_WITH_UBSAN=1 OPT="-fsanitize-blacklist=.circleci/ubsan_suppression_list.txt" ROCKSDB_DISABLE_ALIGNED_NEW=1 USE_CLANG=1 make V=1 -j32 compaction_picker_test` Reviewed By: hx235 Differential Revision: D48074386 Pulled By: cbi42 fbshipit-source-id: 25e59df7e93f20e0793cffb941de70ac815d9392	2023-08-04 14:29:50 -07:00
Hui Xiao	09882a52d6	Prepare for deprecation of Options::access_hint_on_compaction_start (#11658 ) Summary: Context/Summary: After https://github.com/facebook/rocksdb/pull/11631, file hint is not longer needed for compaction read. Therefore we can deprecate `Options::access_hint_on_compaction_start`. As this is a public API change, we should first mark the relevant APIs (including the Java's) deprecated and remove it in next major release 9.0. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11658 Test Plan: No code change Reviewed By: ajkr Differential Revision: D47997856 Pulled By: hx235 fbshipit-source-id: 16e015ae7728c224b1caef73143aa9915668f4ac	2023-08-03 17:23:02 -07:00
Vardhan	87a21d08fe	Add an option to trigger flush when the number of range deletions reach a threshold (#11358 ) Summary: Add a mutable column family option `memtable_max_range_deletions`. When non-zero, RocksDB will try to flush the current memtable after it has at least `memtable_max_range_deletions` range deletions. Java API is added and crash test is updated accordingly to randomly enable this option. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11358 Test Plan: * New unit test: `DBRangeDelTest.MemtableMaxRangeDeletions` * Ran crash test `python3 ./tools/db_crashtest.py whitebox --simple --memtable_max_range_deletions=20` and saw logs showing flushed memtables usually with 20 range deletions. Reviewed By: ajkr Differential Revision: D46582680 Pulled By: cbi42 fbshipit-source-id: f23d6fa8d8264ecf0a18d55c113ba03f5e2504da	2023-08-02 19:58:56 -07:00
amatveev-cf	946d1009bc	Expand Statistics support in the C API (#11263 ) Summary: Adds a few missing features to the C API: 1) Statistics level 2) Getting individual values instead of a serialized string Pull Request resolved: https://github.com/facebook/rocksdb/pull/11263 Test Plan: unit tests Reviewed By: ajkr Differential Revision: D47309963 Pulled By: hx235 fbshipit-source-id: 84df59db4045fc0fb3ea4aec451bc5c2afd2a248	2023-08-02 10:53:40 -07:00
Peter Dillinger	7a1b0207e6	format_version=6 and context-aware block checksums (#9058 ) Summary: ## Context checksum All RocksDB checksums currently use 32 bits of checking power, which should be 1 in 4 billion false negative (FN) probability (failing to detect corruption). This is true for random corruptions, and in some cases small corruptions are guaranteed to be detected. But some possible corruptions, such as in storage metadata rather than storage payload data, would have a much higher FN rate. For example: * Data larger than one SST block is replaced by data from elsewhere in the same or another SST file. Especially with block_align=true, the probability of exact block size match is probably around 1 in 100, making the FN probability around that same. Without `block_align=true` the probability of same block start location is probably around 1 in 10,000, for FN probability around 1 in a million. To solve this problem in new format_version=6, we add "context awareness" to block checksum checks. The stored and expected checksum value is modified based on the block's position in the file and which file it is in. The modifications are cleverly chosen so that, for example * blocks within about 4GB of each other are guaranteed to use different context * blocks that are offset by exactly some multiple of 4GiB are guaranteed to use different context * files generated by the same process are guaranteed to use different context for the same offsets, until wrap-around after 2^32 - 1 files Thus, with format_version=6, if a valid SST block and checksum is misplaced, its checksum FN probability should be essentially ideal, 1 in 4B. ## Footer checksum This change also adds checksum protection to the SST footer (with format_version=6), for the first time without relying on whole file checksum. To prevent a corruption of the format_version in the footer (e.g. 6 -> 5) to defeat the footer checksum, we change much of the footer data format including an "extended magic number" in format_version 6 that would be interpreted as empty index and metaindex block handles in older footer versions. We also change the encoding of handles to free up space for other new data in footer. ## More detail: making space in footer In order to keep footer the same size in format_version=6 (avoid change to IO patterns), we have to free up some space for new data. We do this two ways: * Metaindex block handle is encoded down to 4 bytes (from 10) by assuming it immediately precedes the footer, and by assuming it is < 4GB. * Index block handle is moved into metaindex. (I don't know why it was in footer to begin with.) ## Performance In case of small performance penalty, I've made a "pay as you go" optimization to compensate: replace `MutableCFOptions` in BlockBasedTableBuilder::Rep with the only field used in that structure after construction: `prefix_extractor`. This makes the PR an overall performance improvement (results below). Nevertheless I'm seeing essentially no difference going from fv=5 to fv=6, even including that improvement for both. That's based on extreme case table write performance testing, many files with many blocks. This is relatively checksum intensive (small blocks) and salt generation intensive (small files). ``` (for I in `seq 1 100`; do TEST_TMPDIR=/dev/shm/dbbench2 ./db_bench -benchmarks=fillseq -memtablerep=vector -disable_wal=1 -allow_concurrent_memtable_write=false -num=3000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -write_buffer_size=100000 -compression_type=none -block_size=1000; done) 2>&1 \| grep micros/op \| tee out awk '{ tot += $5; n += 1; } END { print int(1.0 * tot / n) }' < out ``` Each value below is ops/s averaged over 100 runs, run simultaneously with competing configuration for load fairness Before -> after (both fv=5): 483530 -> 483673 (negligible) Re-run 1: 480733 -> 485427 (1.0% faster) Re-run 2: 483821 -> 484541 (0.1% faster) Before (fv=5) -> after (fv=6): 482006 -> 485100 (0.6% faster) Re-run 1: 482212 -> 485075 (0.6% faster) Re-run 2: 483590 -> 484073 (0.1% faster) After fv=5 -> after fv=6: 483878 -> 485542 (0.3% faster) Re-run 1: 485331 -> 483385 (0.4% slower) Re-run 2: 485283 -> 483435 (0.4% slower) Re-run 3: 483647 -> 486109 (0.5% faster) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9058 Test Plan: unit tests included (table_test, db_properties_test, salt in env_test). General DB tests and crash test updated to test new format_version. Also temporarily updated the default format version to 6 and saw some test failures. Almost all were due to an inadvertent additional read in VerifyChecksum to verify the index block checksum, though it's arguably a bug that VerifyChecksum does not appear to (re-)verify the index block checksum, just assuming it was verified in opening the index reader (probably usually true but probably not always true). Some other concerns about VerifyChecksum are left in FIXME comments. The only remaining test failure on change of default (in block_fetcher_test) now has a comment about how to upgrade the test. The format compatibility test does not need updating because we have not updated the default format_version. Reviewed By: ajkr, mrambacher Differential Revision: D33100915 Pulled By: pdillinger fbshipit-source-id: 8679e3e572fa580181a737fd6d113ed53c5422ee	2023-07-30 16:40:01 -07:00
Changyu Bi	6a0f637633	Compare the number of input keys and processed keys for compactions (#11571 ) Summary: ... to improve data integrity validation during compaction. A new option `compaction_verify_record_count` is introduced for this verification and is enabled by default. One exception when the verification is not done is when a compaction filter returns kRemoveAndSkipUntil which can cause CompactionIterator to seek until some key and hence not able to keep track of the number of keys processed. For expected number of input keys, we sum over the number of total keys - number of range tombstones across compaction input files (`CompactionJob::UpdateCompactionStats()`). Table properties are consulted if `FileMetaData` is not initialized for some input file. Since table properties for all input files were also constructed during `DBImpl::NotifyOnCompactionBegin()`, `Compaction::GetTableProperties()` is introduced to reduce duplicated code. For actual number of keys processed, each subcompaction will record its number of keys processed to `sub_compact->compaction_job_stats.num_input_records` and aggregated when all subcompactions finish (`CompactionJob::AggregateCompactionStats()`). In the case when some subcompaction encountered kRemoveAndSkipUntil from compaction filter and does not have accurate count, it propagates this information through `sub_compact->compaction_job_stats.has_num_input_records`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11571 Test Plan: * Add a new unit test `DBCompactionTest.VerifyRecordCount` for the corruption case. * All other unit tests for non-corrupted case. * Ran crash test for a few hours: `python3 ./tools/db_crashtest.py whitebox --simple` Reviewed By: ajkr Differential Revision: D47131965 Pulled By: cbi42 fbshipit-source-id: cc8e94565dd526c4347e9d3843ecf32f6727af92	2023-07-28 09:47:31 -07:00
Yu Zhang	c24ef26ca7	Support switching on / off UDT together with in-Memtable-only feature (#11623 ) Summary: Add support to allow enabling / disabling user-defined timestamps feature for an existing column family in combination with the in-Memtable only feature. To do this, this PR includes: 1) Log the `persist_user_defined_timestamps` option per column family in Manifest to facilitate detecting an attempt to enable / disable UDT. This entry is enforced to be logged in the same VersionEdit as the user comparator name entry. 2) User-defined timestamps related options are validated when re-opening a column family, including user comparator name and the `persist_user_defined_timestamps` flag. These type of settings and settings change are considered valid: a) no user comparator change and no effective `persist_user_defined_timestamp` flag change. b) switch user comparator to enable UDT provided the immediately effective `persist_user_defined_timestamps` flag is false. c) switch user comparator to disable UDT provided that the before-change `persist_user_defined_timestamps` is already false. 3) when an attempt to enable UDT is detected, we mark all its existing SST files as "having no UDT" by marking its `FileMetaData.user_defined_timestamps_persisted` flag to false and handle their file boundaries `FileMetaData.smallest`, `FileMetaData.largest` by padding a min timestamp. 4) while enabling / disabling UDT feature, timestamp size inconsistency in existing WAL logs are handled to make it compatible with the running user comparator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11623 Test Plan: ``` make all check ./db_with_timestamp_basic_test --gtest-filter="EnableDisableUDT" ./db_wal_test --gtest_filter="EnableDisableUDT" ``` Reviewed By: ltamasi Differential Revision: D47636862 Pulled By: jowlyzhang fbshipit-source-id: dcd19f67292da3c3cc9584c09ad00331c9ab9322	2023-07-26 20:16:32 -07:00
Yu Zhang	4ea7b796b7	Respect cutoff timestamp during flush (#11599 ) Summary: Make flush respect the cutoff timestamp `full_history_ts_low` as much as possible for the user-defined timestamps in Memtables only feature. We achieve this by not proceeding with the actual flushing but instead reschedule the same `FlushRequest` so a follow up flush job can continue with the check after some interval. This approach doesn't work well for atomic flush, so this feature currently is not supported in combination with atomic flush. Furthermore, this approach also requires a customized method to get the next immediately bigger user-defined timestamp. So currently it's limited to comparator that use uint64_t as the user-defined timestamp format. This support can be extended when we add such a customized method to `AdvancedColumnFamilyOptions`. For non atomic flush request, at any single time, a column family can only have as many as one FlushRequest for it in the `flush_queue_`. There is deduplication done at `FlushRequest` enqueueing(`SchedulePendingFlush`) and dequeueing time (`PopFirstFromFlushQueue`). We hold the db mutex between when a `FlushRequest` is popped from the queue and the same FlushRequest get rescheduled, so no other `FlushRequest` with a higher `max_memtable_id` can be added to the `flush_queue_` blocking us from re-enqueueing the same `FlushRequest`. Flush is continued nevertheless if there is risk of entering write stall mode had the flush being postponed, e.g. due to accumulation of write buffers, exceeding the `max_write_buffer_number` setting. When this happens, the newest user-defined timestamp in the involved Memtables need to be tracked and we use it to increase the `full_history_ts_low`, which is an inclusive cutoff timestamp for which RocksDB promises to keep all user-defined timestamps equal to and newer than it. Tet plan: ``` ./column_family_test --gtest_filter="RetainUDT" ./memtable_list_test --gtest_filter="WithTimestamp" ./flush_job_test --gtest_filter="WithTimestamp" ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11599 Reviewed By: ajkr Differential Revision: D47561586 Pulled By: jowlyzhang fbshipit-source-id: 9400445f983dd6eac489e9dd0fb5d9b99637fe89	2023-07-26 16:25:06 -07:00
Changyu Bi	5c2a063c49	Clarify usage for options `ttl` and `periodic_compaction_seconds` for universal compaction (#11552 ) Summary: this is stacked on https://github.com/facebook/rocksdb/issues/11550 to further clarify usage of these two options for universal compaction. Similar to FIFO, the two options have the same meaning for universal compaction, which can be confusing to use. For example, for universal compaction, dynamically changing the value of `ttl` has no impact on periodic compactions. Users should dynamically change `periodic_compaction_seconds` instead. From feature matrix (https://fburl.com/daiquery/5s647hwh), there are instances where users set `ttl` to non-zero value and `periodic_compaction_seconds` to 0. For backward compatibility reason, instead of deprecating `ttl`, comments are added to mention that `periodic_compaction_seconds` are preferred. In `SanitizeOptions()`, we update the value of `periodic_compaction_seconds` to take into account value of `ttl`. The logic is documented in relevant option comment. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11552 Test Plan: * updated existing unit test `DBTestUniversalCompaction2.PeriodicCompactionDefault` Reviewed By: ajkr Differential Revision: D47381434 Pulled By: cbi42 fbshipit-source-id: bc41f29f77318bae9a96be84dd89bf5617c7fd57	2023-07-26 11:31:54 -07:00
Rémi Calixte	6628ff12d6	Extend C API to expose base db of transaction db (#11562 ) Summary: Add `rocksdb_transactiondb_get_base_db` and `rocksdb_transactiondb_close_base_db` functions to the C API modeled after `rocksdb_optimistictransactiondb_get_base_db` and `rocksdb_optimistictransactiondb_close_base_db`: `ca50ccc71a/include/rocksdb/c.h (L2711-L2716)` With this pair of functions, it is possible to get a `rocksdb_t ` from a `rocksdb_transactiondb_t `. The main goal is to be able to use the approximate memory usage API, only accessible to the `rocksdb_t *` type: `ca50ccc71a/include/rocksdb/c.h (L2821-L2833)` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11562 Reviewed By: ajkr Differential Revision: D47603343 Pulled By: jowlyzhang fbshipit-source-id: c70cf6af5834026e232fe7791634db3a396f7d5e	2023-07-20 12:03:09 -07:00
shuzz	2f712235ab	optimized code (#11614 ) Summary: improvement code by std::move and c++17 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11614 Reviewed By: ajkr Differential Revision: D47599519 Pulled By: jowlyzhang fbshipit-source-id: 6b897876f4e87e94a74c53d8db2a01303d500bff	2023-07-19 12:52:39 -07:00
huangmengbin	98d0f6ec08	fix: VersionSet::DumpManifest (#11605 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/11604 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11605 Reviewed By: jowlyzhang Differential Revision: D47459254 Pulled By: ajkr fbshipit-source-id: 4420e443fbf4bd01ddaa2b47285fc4445bf36246	2023-07-19 10:44:10 -07:00
Dan Wang	8a7b9888d4	Fix the sync point SanitizeOptions::AfterChangeMaxOpenFiles which is not executed in db_compaction_test (#11583 ) Summary: In [db_impl_open.cc](https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_open.cc), the sync point `SanitizeOptions::AfterChangeMaxOpenFiles` is used to set `max_open_files` with some specified "invalid" value even if it has been sanitized. However, in [db_compaction_test.cc](https://github.com/facebook/rocksdb/blob/main/db/db_compaction_test.cc), `SanitizeOptions::AfterChangeMaxOpenFiles` would not be executed since `SyncPoint::EnableProcessing()` is run after `DBTestBase::Reopen()`. To enable `SanitizeOptions::AfterChangeMaxOpenFiles`, `SyncPoint::EnableProcessing()` should be put ahead of `DBTestBase::Reopen()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11583 Test Plan: run unit tests locally as below: ``` make J=1 check [ RUN ] DBCompactionTest.LevelTtlCascadingCompactions [ OK ] DBCompactionTest.LevelTtlCascadingCompactions (85 ms) [ RUN ] DBCompactionTest.LevelPeriodicCompaction [ OK ] DBCompactionTest.LevelPeriodicCompaction (57 ms) ``` Reviewed By: jowlyzhang Differential Revision: D47311827 Pulled By: ajkr fbshipit-source-id: 99165e87a8129e404af06fdf9b4c96eca540fd23	2023-07-19 10:41:09 -07:00
Andrew Kryczka	05c3b8ecac	Prepare for specialized interface for row cache (#11620 ) Summary: An internal user wants to implement a key-aware row cache policy. For that, they need to know the components of the cache key, especially the user key component. With a specialized `RowCache` interface, we will be able to tell them the components so they won't have to make assumptions about our internal key schema. This PR prepares for the specialized `RowCache` interface by updating the migration plan of https://github.com/facebook/rocksdb/issues/11450. I added a release note for the removed APIs and didn't mention the added ones for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11620 Reviewed By: pdillinger Differential Revision: D47536962 Pulled By: ajkr fbshipit-source-id: bbee0fc4ad67fc699a66b8f2b4ea4544dd003691	2023-07-18 19:12:58 -07:00
Changyu Bi	662a1c99f6	Verify number of keys flushed during DB open (#11611 ) Summary: Extend the coverage for option `flush_verify_memtable_count`. The verification code is similar to the ones for regular flush: `c3c84b3397/db/flush_job.cc (L956-L965)` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11611 Test Plan: existing tests. Reviewed By: ajkr Differential Revision: D47478893 Pulled By: cbi42 fbshipit-source-id: ca580c9dbcd6e91facf2e49210661336a79a248e	2023-07-18 10:39:11 -07:00
leipeng	bc0db33483	Optimize about sstableKeyCompare (#11610 ) Summary: We observed `CompactionOutputs::UpdateGrandparentBoundaryInfo` consumes much time for `InternalKey::DecodeFrom` and `InternalKey::~InternalKey` in flame graph. This PR omit the InternalKey object in `CompactionOutputs::UpdateGrandparentBoundaryInfo` . ![image](https://github.com/facebook/rocksdb/assets/1574991/661eaeec-2f46-46c6-a6a8-9738d6c191de) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11610 Reviewed By: ajkr Differential Revision: D47426971 Pulled By: cbi42 fbshipit-source-id: f0d3a8186d778294515c0685032f5b395c4d6a62	2023-07-13 22:26:55 -07:00
Changyu Bi	854eb76a8c	Improve error message when an SST file in MANIFEST is not found (#11573 ) Summary: I got the following error message when an SST file is recorded in MANIFEST but is missing from the db folder. It's confusing in two ways: 1. The part about file "./074837.ldb" which RocksDB will attempt to open only after ./074837.sst is not found. 2. The last part about "No such file or directory in file ./MANIFEST-074507" sounds like `074837.ldb` is not found in manifest. ``` ldb --hex --db=. get some_key Failed: Corruption: Corruption: IO error: No such file or directory: While open a file for random read: ./074837.ldb: No such file or directory in file ./MANIFEST-074507 ``` Improving the error message a little bit: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11573 Test Plan: run the same command after this PR ``` Failed: Corruption: Corruption: IO error: No such file or directory: While open a file for random read: ./074837.sst: No such file or directory The file ./MANIFEST-074507 may be corrupted. ``` Reviewed By: ajkr Differential Revision: D47192056 Pulled By: cbi42 fbshipit-source-id: 06863f376cc4455803cffb2250c41399b4c39467	2023-07-10 15:52:38 -07:00
weedge	1a7c741977	fix: std::optional value() build error on older macOS SDK (#11574 ) Summary: `PORTABLE=1 USE_SSE=1 USE_PCLMUL=1 WITH_JEMALLOC_FLAG=1 JEMALLOC=1 make static_lib` on MacOS clang --version: Apple clang version 12.0.0 (clang-1200.0.32.29) Target: x86_64-apple-darwin22.4.0 Thread model: posix InstalledDir: /Library/Developer/CommandLineTools/usr/bin compile err like this: util/udt_util.cc:39:39: error: 'value' is unavailable: introduced in macOS 10.14 if (running_ts_sz != recorded_ts_sz.value()) { ^ /Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/optional:944:33: note: 'value' has been explicitly marked unavailable here constexpr value_type const& value() const& ^ util/udt_util.cc:217:62: error: 'value' is unavailable: introduced in macOS 10.14 new_key = StripTimestampFromUserKey(key, record_ts_sz.value()); ^ /Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/optional:953:27: note: 'value' has been explicitly marked unavailable here constexpr value_type& value() & ^ 2 errors generated. make: ** [util/udt_util.o] Error 1 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11574 Reviewed By: ajkr Differential Revision: D47269519 Pulled By: cbi42 fbshipit-source-id: da49d90cdf00a0af519f91c0cf7d257401eb395f	2023-07-10 14:21:34 -07:00
Yu Zhang	f74526341d	Handle file boundaries when timestamps should not be persisted (#11578 ) Summary: Handle file boundaries `FileMetaData.smallest`, `FileMetaData.largest` for when `persist_user_defined_timestamps` is false: 1) on the manifest write path, the original user-defined timestamps in file boundaries are stripped. This stripping is done during `VersionEdit::Encode` to limit the effect of the stripping to only the persisted version of the file boundaries. 2) on the manifest read path during DB open, a a min timestamp is padded to the file boundaries. Ideally, this padding should happen during `VersionEdit::Decode` so that all in memory file boundaries have a compatible user key format as the running user comparator. However, because the user-defined timestamp size information is not available at that time. This change is added to `VersionEditHandler::OnNonCfOperation`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11578 Test Plan: ``` make all check ./version_edit_test --gtest_filter="EncodeDecodeNewFile4HandleFileBoundary". ./db_with_timestamp_basic_test --gtest_filter="HandleFileBoundariesTest" ``` Reviewed By: pdillinger Differential Revision: D47309399 Pulled By: jowlyzhang fbshipit-source-id: 21b4d54d2089a62826b31d779094a39cb2bbbd51	2023-07-10 11:03:25 -07:00
Yu Zhang	baf37a0e81	Fix a unit test hole for recovering UDTs with WAL files (#11577 ) Summary: Thanks pdillinger for pointing out this test hole. The test `DBWALTestWithTimestamp.Recover` that is intended to test recovery from WAL including user-defined timestamps doesn't achieve its promised coverage. Specifically, after https://github.com/facebook/rocksdb/issues/11557, timestamps will be removed during flush, and RocksDB by default flush memtables during recovery with `avoid_flush_during_recovery` defaults to false. This test didn't fail even if all the timestamps are quickly lost due to the default flush behavior. This PR renamed test `Recover` to `RecoverAndNoFlush`, and updated it to verify timestamps are successfully recovered from WAL with some time-travel reads. `avoid_flush_during_recovery` is set to true to help do this verification. On the other hand, for test `DBWALTestWithTimestamp.RecoverAndFlush`, since flush on reopen is DB's default behavior. Setting the flags `max_write_buffer` and `arena_block_size` are not really the factors that enforces the flush, so these flags are removed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11577 Test Plan: ./db_wal_test Reviewed By: pdillinger Differential Revision: D47142892 Pulled By: jowlyzhang fbshipit-source-id: 9465e278806faa5885b541b4e32d99e698edef7d	2023-07-07 16:47:49 -07:00
Changyu Bi	1f410ff95f	Make `rocksdb_options_add_compact_on_deletion_collector_factory` backward compatible (#11593 ) Summary: https://github.com/facebook/rocksdb/issues/11542 added a parameter to the C API `rocksdb_options_add_compact_on_deletion_collector_factory` which causes some internal builds to fail. External users using this API would also require code change. Making the API backward compatible by restoring the old C API and add the parameter to a new C API `rocksdb_options_add_compact_on_deletion_collector_factory_del_ratio`. Also updated change log for 8.4 and will backport this change to 8.4 branch once landed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11593 Test Plan: `make c_test && ./c_test` Reviewed By: akankshamahajan15 Differential Revision: D47299555 Pulled By: cbi42 fbshipit-source-id: 517dc093ef4cf02cac2fe4af4f1af13754bbda63	2023-07-07 13:16:20 -07:00
Changyu Bi	df082c8d1d	Deprecate option `periodic_compaction_seconds` for FIFO compaction (#11550 ) Summary: both options `ttl` and `periodic_compaction_seconds` have the same meaning for FIFO compaction, which is redundant and can be confusing to use. For example, setting TTL to 0 does not disable TTL: user needs to also set periodic_compaction_seconds to 0. Another example is that dynamically setting `periodic_compaction_seconds` (surprisingly) has no effect on TTL compaction. This is because FIFO compaction picker internally only looks at value of `ttl`. The value of `ttl` is in `SanitizeOptions()` which take into account the value of `periodic_compaction_seconds`, but dynamically setting an option does not invoke this method. This PR clarifies the usage of both options for FIFO compaction: only `ttl` should be used, `periodic_compaction_seconds` will not have any effect on FIFO compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11550 Test Plan: - updated existing unit test `DBOptionsTest.SanitizeFIFOPeriodicCompaction` - checked existing values of both options in feature matrix: https://fburl.com/daiquery/xxd0gs9w. All current uses cases either have `periodic_compaction_seconds = 0` or have `periodic_compaction_seconds > ttl`, so should not cause change of behavior. Reviewed By: ajkr Differential Revision: D46902959 Pulled By: cbi42 fbshipit-source-id: a9ede235b276783b4906aaec443551fa62ceff4c	2023-07-05 14:40:45 -07:00
leipeng	25b08eb438	MemTable::Add: first_seqno_.compare_exchange_weak to earliest_seqno_ (#11398 ) Summary: This should be a benign bug caused by a long lived typo, this PR fix this issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11398 Reviewed By: ajkr Differential Revision: D47163379 Pulled By: cbi42 fbshipit-source-id: 531728cae496fd7ac1371bbbd64fc103c3a90dcf	2023-07-03 15:05:38 -07:00
darionyaphet	f4e304f987	Simplify conditional judgment (#11580 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11580 Reviewed By: ajkr Differential Revision: D47158687 Pulled By: cbi42 fbshipit-source-id: 4841b77eee78ddcf35da6ea33da71861c5f1e773	2023-07-03 09:41:48 -07:00
Yu Zhang	15053f3ab4	Logically strip timestamp during flush (#11557 ) Summary: Logically strip the user-defined timestamp when L0 files are created during flush when `AdvancedColumnFamilyOptions.persist_user_defined_timestamps` is false. Logically stripping timestamp here means replacing the original user-defined timestamp with a mininum timestamp, which for now is hard coded to be all zeros bytes. While working on this, I caught a missing piece on the `BlockBuilder` level for this feature. The current quick path `std::min(buffer_size, last_key_size)` needs a bit tweaking to work for this feature. When user-defined timestamp is stripped during block building, on writing first entry or right after resetting, `buffer` is empty and `buffer_size` is zero as usual. However, in follow-up writes, depending on the size of the stripped user-defined timestamp, and the size of the value, what's in `buffer` can sometimes be smaller than `last_key_size`, leading `std::min(buffer_size, last_key_size)` to truncate the `last_key`. Previous test doesn't caught the bug because in those tests, the size of the stripped user-defined timestamps bytes is smaller than the length of the value. In order to avoid the conditional operation, this PR changed the original trivial `std::min` operation into an arithmetic operation. Since this is a change in a hot and performance critical path, I did the following benchmark to check no observable regression is introduced. ```TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=50000000``` Compiled with DEBUG_LEVEL=0 Test vs. control runs simulaneous for better accuracy, units = ops/sec PR vs base: Round 1: 350652 vs 349055 Round 2: 365733 vs 364308 Round 3: 355681 vs 354475 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11557 Test Plan: New timestamp specific test added or existing tests augmented, both are parameterized with `UserDefinedTimestampTestMode`: `UserDefinedTimestampTestMode::kNormal` -> UDT feature enabled, write / read with min timestamp `UserDefinedTimestampTestMode::kStripUserDefinedTimestamps` -> UDT feature enabled, write / read with min timestamp, set Options.persist_user_defined_timestamps to false. ``` make all check ./db_wal_test --gtest_filter="WithTimestamp" ./flush_job_test --gtest_filter="WithTimestamp" ./repair_test --gtest_filter="WithTimestamp" ./block_based_table_reader_test ``` Reviewed By: pdillinger Differential Revision: D47027664 Pulled By: jowlyzhang fbshipit-source-id: e729193b6334dfc63aaa736d684d907a022571f5	2023-06-29 15:50:50 -07:00
Griffin Smith	bfdc91017c	C-API: Expose remaining PlainTableOptions (#11442 ) Summary: Expose the remaining fields of PlainTableOptions as arguments to `rocksdb_options_set_plain_table_factory` in the C API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11442 Reviewed By: ajkr Differential Revision: D46786962 Pulled By: hx235 fbshipit-source-id: 8862083dde332bfecc5ff02f9375776ad35c11f5	2023-06-27 12:30:28 -07:00
Jay Schmidek	f7aa70a72f	Add create_column_families to C api (#9527 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9527 Reviewed By: akankshamahajan15 Differential Revision: D47007647 Pulled By: ajkr fbshipit-source-id: e13544130b2731e07fa5fa4b9a2aa5f75b548c7e	2023-06-27 11:58:33 -07:00
Changyu Bi	ca50ccc71a	Add CreateColumnFamilyWithImport to `StackableDB` and `DBImplReadOnly` (#11556 ) Summary: https://github.com/facebook/rocksdb/issues/11378 added a new overloaded `CreateColumnFamilyWithImport` API and updated the virtual function in `StackableDB` and `DBImplReadOnly` to the newly overloaded one. This caused internal error when there is a derived class that tries to override the original `CreateColumnFamilyWithImport` function. This PR adds the original `CreateColumnFamilyWithImport` function back to `StackableDB` and `DBImplReadOnly`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11556 Test Plan: check if this fixes an internal build Reviewed By: akankshamahajan15 Differential Revision: D46980506 Pulled By: cbi42 fbshipit-source-id: 975a6c5748bf9481499a62ee5997ca59e542e3bc	2023-06-23 13:56:26 -07:00
akankshamahajan	fbd2f563bb	Add an interface to provide support for underlying FS to pass their own buffer during reads (#11324 ) Summary: 1. Public API change: Replace `use_async_io` API in file_system with `SupportedOps` API which is used by underlying FileSystem to indicate to upper layers whether the FileSystem supports different operations introduced in `enum FSSupportedOps `. Right now operations are `async_io` and whether FS will provide its own buffer during reads or not. The api is changed to extend it to various FileSystem operations in one API rather than creating a separate API for each operation. 2. Provide support for underlying FS to pass their own buffer during Reads (async and sync read) instead of using RocksDB provided `scratch` (buffer) in `FSReadRequest`. Currently only MultiRead supports it and later will be extended to other reads as well (point lookup, scan etc). More details in how to enable in file_system.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/11324 Test Plan: Tested locally Reviewed By: anand1976 Differential Revision: D44465322 Pulled By: akankshamahajan15 fbshipit-source-id: 9ec9e08f839b5cc815e75d5dade6cd549998d0ec	2023-06-23 11:48:49 -07:00
Yu Zhang	7521478b43	Record the `persist_user_defined_timestamps` flag in manifest (#11515 ) Summary: Start to record the value of the flag `AdvancedColumnFamilyOptions.persist_user_defined_timestamps` in the Manifest and table properties for a SST file when it is created. And use the recorded flag when creating a table reader for the SST file. This flag's default value is true, it is only explicitly recorded if it's false. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11515 Test Plan: ``` make all check ./version_edit_test ``` Reviewed By: ltamasi Differential Revision: D46920386 Pulled By: jowlyzhang fbshipit-source-id: 075c20363d3d2cc1368422ecc805617ed135cc26	2023-06-21 21:49:01 -07:00
Alexandre Lavigne	2926e0718c	Add missing parameter in C API (#11542 ) Summary: The class `NewCompactOnDeletionCollectorFactory` exposes the parameter `delete_ratio`. The C API `rocksdb_options_add_compact_on_deletion_collector_factory` does not allow a user to pass a delete ration to be passed down the the C++ class bellow. The class has default value for the delete ratio which makes it pass the compilation and the tests. closes https://github.com/facebook/rocksdb/issues/11541 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11542 Reviewed By: ajkr Differential Revision: D46770908 Pulled By: cbi42 fbshipit-source-id: 7b5162fe459896052e392e2d85a8f6c01db3b464	2023-06-20 13:18:04 -07:00
Levi Tamasi	022d89549d	Attempt to deflake DBWALTestWithEnrichedEnv.SkipDeletedWALs (#11537 ) Summary: Calling `Flush` (even with `wait==true`) does not guarantee that obsolete WAL files are physically deleted before the call returns. The patch attempts to fix the resulting flakiness by using `SyncPoint`s to make sure `PurgeObsoleteFiles` finishes before checking for WAL deletions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11537 Test Plan: ``` gtest-parallel --repeat=1000 ./db_wal_test --gtest_filter="SkipDeletedWALs" ``` Reviewed By: pdillinger Differential Revision: D46736050 Pulled By: ltamasi fbshipit-source-id: 47a931b7a3a03ef681fbf4adb5a0b223d452703e	2023-06-19 16:04:49 -07:00
Changyu Bi	bc04ec85db	Make option `level_compaction_dynamic_level_bytes` true by default (#11525 ) Summary: after https://github.com/facebook/rocksdb/issues/11321 and https://github.com/facebook/rocksdb/issues/11340 (both included in RocksDB v8.2), migration from `level_compaction_dynamic_level_bytes=false` to `level_compaction_dynamic_level_bytes=true` is automatic by RocksDB and requires no manual compaction from user. Making the option true by default as it has several advantages: 1. better space amplification guarantee (a more stable LSM shape). 2. compaction is more adaptive to write traffic. 3. automatic draining of unneeded levels. Wiki is updated with more detail: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction#option-level_compaction_dynamic_level_bytes-and-levels-target-size. The PR mostly contains fixes for unit tests as they assumed `level_compaction_dynamic_level_bytes=false`. Most notable change is commit `f742be330c` and `b1928e42b3` which override the default option in DBTestBase to still set `level_compaction_dynamic_level_bytes=false` by default. This helps to reduce the change needed for unit tests. I think this default option override in unit tests is okay since the behavior of `level_compaction_dynamic_level_bytes=true` is tested by explicitly setting this option. Also, `level_compaction_dynamic_level_bytes=false` may be more desired in unit tests as it makes it easier to create a desired LSM shape. Comment for option `level_compaction_dynamic_level_bytes` is updated to reflect this change and change made in https://github.com/facebook/rocksdb/issues/10057. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11525 Test Plan: `make -j32 J=32 check` several times to try to catch flaky tests due to this option change. Reviewed By: ajkr Differential Revision: D46654256 Pulled By: cbi42 fbshipit-source-id: 6b5827dae124f6f1fdc8cca2ac6f6fcd878830e1	2023-06-15 21:12:39 -07:00
yaphet	253bc91953	Move the status judgment into the block (#11534 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11534 Reviewed By: cbi42 Differential Revision: D46732248 Pulled By: ajkr fbshipit-source-id: c9360866c35a2c436ab83b85a14ed7bd43a88d95	2023-06-15 16:55:38 -07:00
darionyaphet	9f774baaa8	Support Error Recovery Retry Flush in GetFlushReasonString (#11536 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11536 Reviewed By: cbi42 Differential Revision: D46732297 Pulled By: ajkr fbshipit-source-id: 82bb078f189da233addc4f483eaa6eaf7fdd3910	2023-06-15 16:53:44 -07:00
mayue.fight	fa878a0107	Support to create a CF by importing multiple non-overlapping CFs (#11378 ) Summary: The original Feature Request is from [https://github.com/facebook/rocksdb/issues/11317](https://github.com/facebook/rocksdb/issues/11317). Flink uses rocksdb as the state backend, all DB options are the same, and the keys of each DB instance are adjacent and there is no key overlap between two db instances. In the Flink rescaling scenario, it is necessary to quickly split the DB according to a certain key range or quickly merge multiple DBs into one. This PR is mainly used to quickly merge multiple DBs into one. We hope to extend the function of `CreateColumnFamilyWithImports` to support creating ColumnFamily by importing multiple ColumnFamily with no overlapping keys. The import logic is almost the same as `CreateColumnFamilyWithImport`, but it will check whether there is key overlap between CF when importing. The import will fail if there are key overlaps. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11378 Reviewed By: ajkr Differential Revision: D46413709 Pulled By: cbi42 fbshipit-source-id: 846d0049fad11c59cf460fa846c345b26c658dfb	2023-06-15 12:25:04 -07:00
Changyu Bi	15e8a843d9	Do not include last level in compaction when `allow_ingest_behind=true` (#11489 ) Summary: when a DB is configured with `allow_ingest_behind = true`, the last level should be reserved for ingested files and these files should not be included in any compaction. Currently, a major compaction can compact these files to smaller levels. This can cause future files to be rejected for ingest behind (see `ExternalSstFileIngestionJob::CheckLevelForIngestedBehindFile()`). This PR fixes the issue such that files in the last level is not included in any compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11489 Test Plan: * Updated unit test `ExternalSSTFileTest.IngestBehind` to test that last level is not included in manual and auto-compaction. Reviewed By: ajkr Differential Revision: D46455711 Pulled By: cbi42 fbshipit-source-id: 5e2142c2a709ef932ad797897795021c06c4ac8c	2023-06-14 11:28:56 -07:00
Andrew Kryczka	cac3240cbf	add property "rocksdb.obsolete-sst-files-size" (#11533 ) Summary: See "unreleased_history/new_features/obsolete_sst_files_size.md" for description Pull Request resolved: https://github.com/facebook/rocksdb/pull/11533 Test Plan: updated unit test Reviewed By: jowlyzhang Differential Revision: D46703152 Pulled By: ajkr fbshipit-source-id: ea5e31cd6293eccc154130c13e66b5271f57c102	2023-06-13 15:52:45 -07:00
Ignat Loskutov	7c67aee4a0	statistics.cc: fix mistype (#11509 ) Summary: Add new tickers: `rocksdb.error.handler.bg.error.count`, `rocksdb.error.handler.bg.io.error.count`, `rocksdb.error.handler.bg.retryable.io.error.count` to replace the misspelled ones: `rocksdb.error.handler.bg.errro.count`, `rocksdb.error.handler.bg.io.errro.count`, `rocksdb.error.handler.bg.retryable.io.errro.count` ('error' instead of 'errro'). Users should switch to use the new tickers before 9.0 release as the misspelled old tickers will be completely removed then. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11509 Reviewed By: ltamasi Differential Revision: D46542809 Pulled By: jowlyzhang fbshipit-source-id: a2a6d8354af46a060de81d40ef6f5336a80bd32e	2023-06-09 13:25:57 -07:00
Ignat Loskutov	05fcacdb42	Add missing stopwatch and perf timer to DBImplReadOnly (#11521 ) Summary: The `rocksdb.db.get.micros` histogram is never updated if the DB is open in ReadOnly mode, as well as the `get_cpu_nanos` perf counter. An earlier PR (https://github.com/facebook/rocksdb/issues/4260) for some reason has only added the TODO line, not the accounting itself, so this one is intended to fix it, adding two lines to match [DBImplSecondary](`4dafa5b220/db/db_impl/db_impl_secondary.cc (L366-L367)`). Pull Request resolved: https://github.com/facebook/rocksdb/pull/11521 Reviewed By: cbi42 Differential Revision: D46577330 Pulled By: jowlyzhang fbshipit-source-id: be147923e763af32bbc18fd6bdf3aff8ebf08aee	2023-06-08 17:40:56 -07:00
Changyu Bi	2e8cc98ab2	Fix subcompaction bug to allow running two subcompactions (#11501 ) Summary: as reported in https://github.com/facebook/rocksdb/issues/11476, RocksDB currently does not execute compactions in two subcompactions even when they qualify. This PR fixes this issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11501 Test Plan: * Add a new unit test. * Run crash test with max_subcompactions=2: `python3 tools/db_crashtest.py blackbox --simple --subcompactions=2 --target_file_size_base=1048576 --compaction_style=0` * saw logs showing compactions being executed as 2 subcompactions ``` 2023/06/01-17:28:44.028470 3147486 (Original Log Time 2023/06/01-17:28:44.025972) EVENT_LOG_v1 {"time_micros": 1685665724025939, "job": 6, "event": "compaction_finished", "compaction_time_micros": 34539, "compaction_time_cpu_micros": 26404, "output_level": 6, "num_output_files": 2, "total_output_size": 1109796, "num_input_records": 13188, "num_output_records": 13021, "num_subcompactions": 2, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 0, 0, 0, 0, 0, 13]} ``` Reviewed By: ajkr Differential Revision: D46411497 Pulled By: cbi42 fbshipit-source-id: 3ebfc02e19f78f782e114a9546dc3d481d496258	2023-06-06 13:36:02 -07:00
leipeng	ddfcbea3e1	IterKey: change space_[32] to 39 to utilize padding space (#10633 ) Summary: This PR utilize padding space. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10633 Reviewed By: pdillinger Differential Revision: D46410978 Pulled By: ajkr fbshipit-source-id: 23ec757b1eea9221c1390971e39d341c6b7f2003	2023-06-06 11:19:15 -07:00
Changyu Bi	633c738a98	Fix unit test `DBRangeDelTest.NonBottommostCompactionDropRangetombstone` (#11512 ) Summary: Fix the test added in https://github.com/facebook/rocksdb/issues/11459 that is failing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11512 Test Plan: `./db_range_del_test --gtest_filter="*NonBottommostCompactionDropRangetombstone"` Reviewed By: pdillinger Differential Revision: D46451450 Pulled By: cbi42 fbshipit-source-id: bcad20b8fd21c4f71924cec6cb045ee4b2038b90	2023-06-05 15:20:57 -07:00
Yu Zhang	4dafa5b220	switch to use RocksDB UnorderedMap (#11507 ) Summary: Switch from std::unordered_map to RocksDB UnorderedMap for all the places that logging user-defined timestamp size in WAL used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11507 Test Plan: ``` make all check ``` Reviewed By: ltamasi Differential Revision: D46448975 Pulled By: jowlyzhang fbshipit-source-id: bdb4d56a723b697a33daaf0f856a61d49a367a99	2023-06-05 13:36:26 -07:00
Changyu Bi	4aa52d89cf	Drop range tombstone during non-bottommost compaction (#11459 ) Summary: Similar to point tombstones, we can drop a range tombstone during compaction when we know its range does not exist in any higher level. This PR adds this optimization. Some existing test in db_range_del_test is fixed to work under this optimization. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11459 Test Plan: * Add unit test `DBRangeDelTest, NonBottommostCompactionDropRangetombstone`. * Ran crash test that issues range deletion for a few hours: `python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=1048576 --delrangepercent=10 --writepercent=31 --readpercent=40` Reviewed By: ajkr Differential Revision: D46007904 Pulled By: cbi42 fbshipit-source-id: 3f37205b6778b7d55ed106369ca41b0632a6d0fd	2023-06-05 10:26:40 -07:00
Changyu Bi	71ca9a1dcd	Log correct compaction score for Universal Compaction (#11487 ) Summary: currently 0 is incorrectly logged as the compaction score for L0 when num_levels > 1. This PR fixes the issue to log the correct score. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11487 Test Plan: ``` ./db_bench --benchmarks=fillrandom --max_background_jobs=8 --num=1000000 --compaction_style=1 --stats_dump_period_sec=20 --num_levels=7 --write_buffer_size=1048576 grep "L0 " /tmp/rocksdbtest-543376/dbbench/LOG before: Compaction Stats [default] Priority Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- L0 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 9.9 0.42 0.33 9 0.046 0 0 0.0 0.0 L0 3/1 1.37 MB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.1 0.6 9.6 3.76 3.03 76 0.050 34K 140 0.0 0.0 L0 2/0 2.26 MB 0.0 0.0 0.0 0.0 0.1 0.1 0.0 1.6 3.2 8.2 12.59 11.17 163 0.077 619K 5499 0.0 0.0 after: compaction scores are non-zero L0 0/0 0.00 KB 0.8 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 9.6 0.43 0.34 9 0.048 0 0 0.0 0.0 L0 2/1 937.08 KB 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.1 0.6 9.3 3.85 3.07 75 0.051 34K 165 0.0 0.0 L0 2/2 1.82 MB 1.0 0.0 0.0 0.0 0.1 0.1 0.0 1.6 3.0 8.0 12.45 10.99 160 0.078 577K 5399 0.0 0.0 ``` Reviewed By: ajkr Differential Revision: D46293993 Pulled By: cbi42 fbshipit-source-id: 19753f7df68c5f54a84c4ed52794f83e510c9721	2023-06-01 15:36:19 -07:00
Changyu Bi	e95cc1217d	`CompactRange()` always compacts to bottommost level for leveled compaction (#11468 ) Summary: currently for leveled compaction, the max output level of a call to `CompactRange()` is pre-computed before compacting each level. This max output level is the max level whose key range overlaps with the manual compaction key range. However, during manual compaction, files in the max output level may be compacted down further by some background compaction. When this background compaction is a trivial move, there is a race condition and the manual compaction may not be able to compact all keys in the specified key range. This PR updates `CompactRange()` to always compact to the bottommost level to make this race condition more unlikely (it can still happen, see more in comment here: `796f58f42a/db/db_impl/db_impl_compaction_flush.cc (L1180C29-L1184)`). This PR also changes the behavior of CompactRange() when `bottommost_level_compaction=kIfHaveCompactionFilter` (the default option). The old behavior is that, if a compaction filter is provided, CompactRange() always does an intra-level compaction at the final output level for all files in the manual compaction key range. The only exception when `first_overlapped_level = 0` and `max_overlapped_level = 0`. It’s awkward to maintain the same behavior after this PR since we do not compute max_overlapped_level anymore. So the new behavior is similar to kForceOptimized: always does intra-level compaction at the bottommost level, but not including new files generated during this manual compaction. Several unit tests are updated to work with this new manual compaction behavior. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11468 Test Plan: Add new unit tests `DBCompactionTest.ManualCompactionCompactAllKeysInRange*` Reviewed By: ajkr Differential Revision: D46079619 Pulled By: cbi42 fbshipit-source-id: 19d844ba4ec8dc1a0b8af5d2f36ff15820c6e76f	2023-06-01 15:27:29 -07:00
Jay Huh	87bc929db3	Flush option in WaitForCompact() (#11483 ) Summary: Context: As mentioned in https://github.com/facebook/rocksdb/issues/11436, introducing `flush` option in `WaitForCompactOptions` to flush before waiting for compactions to finish. Must be set to true to ensure no immediate compactions (except perhaps periodic compactions) after closing and re-opening the DB. 1. `bool flush = false` added to `WaitForCompactOptions` 2. `DBImpl::FlushAllColumnFamilies()` is introduced and `DBImpl::FlushForGetLiveFiles()` is refactored to call it. 3. `DBImpl::FlushAllColumnFamilies()` gets called before waiting in `WaitForCompact()` if `flush` option is `true` 4. Some previous WaitForCompact tests were parameterized to include both cases for `abort_on_pause_` being true/false as well as `flush_` being true/false Pull Request resolved: https://github.com/facebook/rocksdb/pull/11483 Test Plan: - `DBCompactionTest::WaitForCompactWithOptionToFlush` added - Changed existing DBCompactionTest::WaitForCompact tests to `DBCompactionWaitForCompactTest` to include params Reviewed By: pdillinger Differential Revision: D46289770 Pulled By: jaykorean fbshipit-source-id: 70d3f461d96a6e06390be60170dd7c4d0d38f8b0	2023-05-31 12:53:51 -07:00
Yu Zhang	56ca9e3106	Logging timestamp size record in WAL and use it during recovery (#11471 ) Summary: Start logging the timestamp size record in WAL and use the record during recovery. Currently, user comparator cannot be different from what was used to create a column family, so the timestamp size record is just used to confirm it's consistent with the timestamp size the running user comparator indicates. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11471 Test Plan: ``` make all check ./db_secondary_test ./db_wal_test --gtest_filter="WithTimestamp" ./repair_test --gtest_filter="WithTimestamp" ``` Reviewed By: ltamasi Differential Revision: D46236769 Pulled By: jowlyzhang fbshipit-source-id: f6c60b5c8defdb05021c63df302ccc0be1275ad0	2023-05-30 19:32:00 -07:00
Changyu Bi	e1c7209beb	Fix flaky test: `DBCompactionTest.WaitForCompactShutdownWhileWaiting` (#11488 ) Summary: tsan complains with the following error message. This is likely due to DB object destroyed while WaitForCompact() is still running. ``` [ RUN ] DBCompactionTest.WaitForCompactShutdownWhileWaiting ================== WARNING: ThreadSanitizer: data race (pid=1128703) Atomic read of size 1 at 0x7b8c00000740 by thread T4: #0 pthread_cond_wait <null> (db_compaction_test+0x46970a) https://github.com/facebook/rocksdb/issues/1 rocksdb::port::CondVar::Wait() /root/project/port/port_posix.cc:119:23 (librocksdb.so.8.4+0x7c4c60) https://github.com/facebook/rocksdb/issues/2 rocksdb::InstrumentedCondVar::WaitInternal() /root/project/monitoring/instrumented_mutex.cc:69:9 (librocksdb.so.8.4+0x75f697) https://github.com/facebook/rocksdb/issues/3 rocksdb::InstrumentedCondVar::Wait() /root/project/monitoring/instrumented_mutex.cc:62:3 (librocksdb.so.8.4+0x75f697) https://github.com/facebook/rocksdb/issues/4 rocksdb::DBImpl::WaitForCompact(rocksdb::WaitForCompactOptions const&) /root/project/db/db_impl/db_impl_compaction_flush.cc:3978:14 (librocksdb.so.8.4+0x494174) https://github.com/facebook/rocksdb/issues/5 rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30::operator()() const /root/project/db/db_compaction_test.cc:3479:26 (db_compaction_test+0x5cdc90) https://github.com/facebook/rocksdb/issues/6 void std::__invoke_impl<void, rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30>(std::__invoke_other, rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61:14 (db_compaction_test+0x5cdc90) https://github.com/facebook/rocksdb/issues/7 std::__invoke_result<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30>::type std::__invoke<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30>(rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:96:14 (db_compaction_test+0x5cdc90) https://github.com/facebook/rocksdb/issues/8 void std:🧵:_Invoker<std::tuple<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:253:13 (db_compaction_test+0x5cdc90) https://github.com/facebook/rocksdb/issues/9 std:🧵:_Invoker<std::tuple<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30> >::operator()() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:260:11 (db_compaction_test+0x5cdc90) https://github.com/facebook/rocksdb/issues/10 std:🧵:_State_impl<std:🧵:_Invoker<std::tuple<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30> > >::_M_run() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:211:13 (db_compaction_test+0x5cdc90) https://github.com/facebook/rocksdb/issues/11 <null> <null> (libstdc++.so.6+0xda6b3) Previous write of size 1 at 0x7b8c00000740 by thread T5: #0 pthread_mutex_destroy <null> (db_compaction_test+0x46a4f8) https://github.com/facebook/rocksdb/issues/1 rocksdb::port::Mutex::~Mutex() /root/project/port/port_posix.cc:77:48 (librocksdb.so.8.4+0x7c480e) https://github.com/facebook/rocksdb/issues/2 rocksdb::InstrumentedMutex::~InstrumentedMutex() /root/project/./monitoring/instrumented_mutex.h:20:7 (librocksdb.so.8.4+0x41fda6) https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::~DBImpl() /root/project/db/db_impl/db_impl.cc:755:1 (librocksdb.so.8.4+0x41fda6) https://github.com/facebook/rocksdb/issues/4 rocksdb::DBImpl::~DBImpl() /root/project/db/db_impl/db_impl.cc:737:19 (librocksdb.so.8.4+0x4203d9) https://github.com/facebook/rocksdb/issues/5 rocksdb::DBTestBase::Close() /root/project/db/db_test_util.cc:670:3 (librocksdb_test_debug.so+0x57413) https://github.com/facebook/rocksdb/issues/6 rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31::operator()() const /root/project/db/db_compaction_test.cc:3485:49 (db_compaction_test+0x5cdf03) https://github.com/facebook/rocksdb/issues/7 void std::__invoke_impl<void, rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31>(std::__invoke_other, rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61:14 (db_compaction_test+0x5cdf03) https://github.com/facebook/rocksdb/issues/8 std::__invoke_result<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31>::type std::__invoke<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31>(rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:96:14 (db_compaction_test+0x5cdf03) https://github.com/facebook/rocksdb/issues/9 void std:🧵:_Invoker<std::tuple<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:253:13 (db_compaction_test+0x5cdf03) https://github.com/facebook/rocksdb/issues/10 std:🧵:_Invoker<std::tuple<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31> >::operator()() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:260:11 (db_compaction_test+0x5cdf03) https://github.com/facebook/rocksdb/issues/11 std:🧵:_State_impl<std:🧵:_Invoker<std::tuple<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31> > >::_M_run() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:211:13 (db_compaction_test+0x5cdf03) https://github.com/facebook/rocksdb/issues/12 <null> <null> (libstdc++.so.6+0xda6b3) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11488 Test Plan: ``` COMPILE_WITH_TSAN=1 CC=clang-13 CXX=clang++-13 ROCKSDB_DISABLE_ALIGNED_NEW=1 USE_CLANG=1 make V=1 -j32 db_compaction_test gtest-parallel --repeat=10000 ./db_compaction_test --gtest_filter="WaitForCompactShutdownWhileWaiting" -w200 ``` Reviewed By: jaykorean Differential Revision: D46293891 Pulled By: cbi42 fbshipit-source-id: 8ca259cb1e09a9e4f4095b2d084f2ba92b710b97	2023-05-30 15:00:05 -07:00
Andrew Kryczka	3e7fc88167	add WriteBatch::Release() (#11482 ) Summary: Together with the existing constructor, `explicit WriteBatch(std::string&& rep)`, this enables transferring `WriteBatch` via its `std::string` representation. Associated info like KV checksums are dropped but the caller can use `WriteBatch::VerifyChecksum()` before taking ownership if needed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11482 Reviewed By: cbi42 Differential Revision: D46233884 Pulled By: ajkr fbshipit-source-id: 6bc64a6e75fb7bbf61d08c09520fc3705a7b44d8	2023-05-26 18:15:14 -07:00
Soli	de1dd4ca19	Tweak on IsTrivialMove() (#11467 ) Summary: `output_level_` and `number_levels_` are not changing in iteration of `inputs_` files. Moving the check out of `for` loop could slightly improve performance. It is easier to review when ignore whitespace changes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11467 Reviewed By: cbi42 Differential Revision: D46155962 Pulled By: ajkr fbshipit-source-id: 45ec80b13152b3bed7305e6f707cb9b187d5f315	2023-05-26 16:40:50 -07:00
Jay Huh	81aeb15988	Add WaitForCompact with WaitForCompactOptions to public API (#11436 ) Summary: Context: This is the first PR for WaitForCompact() Implementation with WaitForCompactOptions. In this PR, we are introducing `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` in the public API. This currently utilizes the existing internal `WaitForCompact()` implementation (with default abort_on_pause = false). `abort_on_pause` has been moved to `WaitForCompactOptions&`. In the later PRs, we will introduce the following two options in `WaitForCompactOptions` 1. `bool flush = false` by default - If true, flush before waiting for compactions to finish. Must be set to true to ensure no immediate compactions (except perhaps periodic compactions) after closing and re-opening the DB. 2. `bool close_db = false` by default - If true, will also close the DB upon compactions finishing. 1. struct `WaitForCompactOptions` added to options.h and `abort_on_pause` in the internal API moved to the option struct. 2. `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` introduced in `db.h` 3. Changed the internal WaitForCompact() to `WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` and checks for the `abort_on_pause` inside the option. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11436 Test Plan: Following tests added - `DBCompactionTest::WaitForCompactWaitsOnCompactionToFinish` - `DBCompactionTest::WaitForCompactAbortOnPauseAborted` - `DBCompactionTest::WaitForCompactContinueAfterPauseNotAborted` - `DBCompactionTest::WaitForCompactShutdownWhileWaiting` - `TransactionTest::WaitForCompactAbortOnPause` NOTE: `TransactionTest::WaitForCompactAbortOnPause` was added to use `StackableDB` to ensure the wrapper function is in place. Reviewed By: pdillinger Differential Revision: D45799659 Pulled By: jaykorean fbshipit-source-id: b5b58f95957f2ab47d1221dee32a61d6cdc4685b	2023-05-25 17:25:51 -07:00
Yu Zhang	d1ae7f6c41	Add support to strip / pad timestamp when writing / reading a block (#11472 ) Summary: This patch adds support in `BlockBuilder` to strip user-defined timestamp from the `key` added via `Add(key, value)` and its equivalent APIs. The stripping logic is different when the key is either a user key or an internal key, so the `BlockBuilder` is created with a flag to indicate that. This patch also add support on the read path to APIs `NewIndexIterator`, `NewDataIterator` to support pad a min timestamp. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11472 Test Plan: Three test modes are added to parameterize existing tests: UserDefinedTimestampTestMode::kNone -> UDT feature is not enabled UserDefinedTimestampTestMode::kNormal -> UDT feature enabled, write / read with min timestamp UserDefinedTimestampTestMode::kStripUserDefinedTimestamps -> UDT feature enabled, write / read with min timestamp, set `persist_user_defined_timestamps` where it applies to false. The tests read/write with min timestamp so that point read and range scan can correctly read values in all three test modes. `block_test` are parameterized to run with above three test modes and some additional parameteriazation ``` make all check ./block_test --gtest_filter="P/BlockTest" ./block_test --gtest_filter="P/IndexBlockTest" ``` Reviewed By: ltamasi Differential Revision: D46200539 Pulled By: jowlyzhang fbshipit-source-id: 59f5d6b584639976b69c2943eba723bd47d9b3c0	2023-05-25 15:41:32 -07:00
Hui Xiao	dcc6fc99f9	Fix StopWatch bug; Remove setting `record_read_stats` (#11474 ) Summary: Context/Summary: - StopWatch enable stats even when `StatsLevel::kExceptTimers` is set. It's a harmless bug though since `reportTimeToHistogram()` will not report it anyway according to https://github.com/facebook/rocksdb/blob/main/include/rocksdb/statistics.h#L705 - https://github.com/facebook/rocksdb/pull/11288 should have removed logics of setting `record_read_stats = !for_compaction` as we don't differentiate `RandomAccessFileReader`'s stats behavior based on compaction or not (instead we now report stats of different IO activities including compaction to different stats). Fixing this should report more compaction related file read micros that aren't reported previously due to `for_compaction==true` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11474 Test Plan: - DB bench pre vs post fix with small max_open_files Setup command `./db_ bench -db=/dev/shm/testdb/ -statistics=true -benchmarks=fillseq -key_size=32 -value_size=512 -num=5000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=true -compression_type=none -bloom_bits=3` Run command `./db_bench --open_files=1 -use_existing_db=true -db=/dev/shm/testdb2/ -statistics=true -benchmarks=compactall -key_size=32 -value_size=512 -num=5000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=true -compression_type=none -bloom_bits=3` Pre-fix ``` rocksdb.sst.read.micros P50 : 2.056175 P95 : 4.647739 P99 : 8.948475 P100 : 25.000000 COUNT : 4451 SUM : 12827 rocksdb.file.read.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.read.compaction.micros P50 : 2.057397 P95 : 4.625253 P99 : 8.749474 P100 : 25.000000 COUNT : 4382 SUM : 12608 rocksdb.file.read.db.open.micros P50 : 1.985294 P95 : 9.100000 P99 : 13.000000 P100 : 13.000000 COUNT : 69 SUM : 219 ``` Post-fix (with a higher `rocksdb.file.read.compaction.micros` count) ``` rocksdb.sst.read.micros P50 : 1.858968 P95 : 3.653086 P99 : 5.968000 P100 : 21.000000 COUNT : 3548 SUM : 9119 rocksdb.file.read.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.read.compaction.micros P50 : 1.857027 P95 : 3.627614 P99 : 5.738621 P100 : 21.000000 COUNT : 3479 SUM : 8904 rocksdb.file.read.db.open.micros P50 : 2.000000 P95 : 6.733333 P99 : 11.000000 P100 : 11.000000 COUNT : 69 SUM : 215 ``` - CI Reviewed By: ajkr Differential Revision: D46137221 Pulled By: hx235 fbshipit-source-id: e5b4ee7001af26f2ee0377bc6334f43b2a527388	2023-05-25 10:16:58 -07:00
Peter Dillinger	17bc27741f	Improve memory efficiency of many OptimisticTransactionDBs (#11439 ) Summary: Currently it's easy to use a ton of memory with many small OptimisticTransactionDB instances, because each one by default allocates a million mutexes (40 bytes each on my compiler) for validating transactions. It even puts a lot of pressure on the allocator by allocating each one individually! In this change: * Create a new object and option that enables sharing these buckets of mutexes between instances. This is generally good for load balancing potential contention as various DBs become hotter or colder with txn writes. About the only cases where this sharing wouldn't make sense (e.g. each DB usually written by one thread) are cases that would be better off with OccValidationPolicy::kValidateSerial which doesn't use the buckets anyway. * Allocate the mutexes in a contiguous array, for efficiency * Add an option to ensure the mutexes are cache-aligned. In several other places we use cache-aligned mutexes but OptimisticTransactionDB historically does not. It should be a space-time trade-off the user can choose. * Provide some visibility into the memory used by the mutex buckets with an ApproximateMemoryUsage() function (also used in unit testing) * Share code with other users of "striped" mutexes, appropriate refactoring for customization & efficiency (e.g. using FastRange instead of modulus) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11439 Test Plan: unit tests added. Ran sized-up versions of stress test in unit test, including a before-and-after performance test showing no consistent difference. (NOTE: OptimisticTransactionDB not currently covered by db_stress!) Reviewed By: ltamasi Differential Revision: D45796393 Pulled By: pdillinger fbshipit-source-id: ae2b3a26ad91ceeec15debcdc63ff48df6736a54	2023-05-24 11:57:15 -07:00
rogertyang	28bf7ba77d	remove unnecessary code in super version getter (#11452 ) Summary: Do not bother comparing the version of the local super version handle with the global one. An inequality comparison result indicates nothing but a spurious obsoleteness. It only happens when the writer has increased the `ColumnFamilyData::super_version_number_`(`5fc57eec2b/db/column_family.cc (L1309)`) but has not yet called `ResetThreadLocalSuperVersions()`(`5fc57eec2b/db/column_family.cc (L1328)`) at the time when a reader reads the local handle(`void* ptr = local_sv_->Swap(SuperVersion::kSVInUse);`). In other words, the existence of a local handle is a sufficent evidence of its fressness. With this PR, we save one or even two atomic instructions when getting a handle of super version. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11452 Reviewed By: ajkr Differential Revision: D46059317 Pulled By: cbi42 fbshipit-source-id: 68b4b1ca8a9929a4aa470105c37a09e0625b014d	2023-05-23 12:16:24 -07:00
Yu Zhang	11ebddb1d4	Add utils to use for handling user defined timestamp size record in WAL (#11451 ) Summary: Add a util method `HandleWriteBatchTimestampSizeDifference` to handle a `WriteBatch` read from WAL log when user-defined timestamp size record is written and read. Two check modes are added: `kVerifyConsistency` that just verifies the recorded timestamp size are consistent with the running ones. This mode is to be used by `db_impl_secondary` for opening a DB as secondary instance. It will also be used by `db_impl_open` before the user comparator switch support is added to make a column switch between enabling/disable UDT feature. The other mode `kReconcileInconsistency` will be used by `db_impl_open` later when user comparator can be changed. Another change is to extract a method `CollectColumnFamilyIdsFromWriteBatch` in db_secondary_impl.h into its standalone util file so it can be shared. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11451 Test Plan: ``` make check ./udt_util_test ``` Reviewed By: ltamasi Differential Revision: D45894386 Pulled By: jowlyzhang fbshipit-source-id: b96790777f154cddab6d45d9ba2e5d20ebc6fe9d	2023-05-22 14:28:58 -07:00
Peter Dillinger	39f5846ec7	Much better stats for seeks and prefix filtering (#11460 ) Summary: We want to know more about opportunities for better range filters, and the effectiveness of our own range filters. Currently the stats are very limited, essentially logging just hits and misses against prefix filters for range scans in BLOOM_FILTER_PREFIX_* without tracking the false positive rate. Perhaps confusingly, when prefix filters are used for point queries, the stats are currently going into the non-PREFIX tickers. This change does several things: * Introduce new stat tickers for seeks and related filtering, \LEVEL_SEEK\ * Most importantly, allows us to see opportunities for range filtering. Specifically, we can count how many times a seek in an SST file accesses at least one data block, and how many times at least one value() is then accessed. If a data block was accessed but no value(), we can generally assume that the key(s) seen was(were) not of interest so could have been filtered with the right kind of filter, avoiding the data block access. * We can get the same level of detail when a filter (for now, prefix Bloom/ribbon) is used, or not. Specifically, we can infer a false positive rate for prefix filters (not available before) from the seek "false positive" rate: when a data block is accessed but no value() is called. (There can be other explanations for a seek false positive, but in typical iterator usage it would indicate a filter false positive.) * For efficiency, I wanted to avoid making additional calls to the prefix extractor (or key comparisons, etc.), which would be required if we wanted to more precisely detect filter false positives. I believe that instrumenting value() is the best balance of efficiency vs. accurately measuring what we are often interested in. * The stats are divided between last level and non-last levels, to help understand potential tiered storage use cases. * The old BLOOM_FILTER_PREFIX_* stats have a different meaning: no longer referring to iterators but to point queries using prefix filters. BLOOM_FILTER_PREFIX_TRUE_POSITIVE is added for computing the prefix false positive rate on point queries, which can be due to filter false positives as well as different keys with the same prefix. * Similarly, the non-PREFIX BLOOM_FILTER stats are now for whole key filtering only. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11460 Test Plan: unit tests updated, including updating many to pop the stat value since last read to improve test readability and maintainability. Performance test shows a consistent small improvement with these changes, both with clang and with gcc. CPU profile indicates that RecordTick is using less CPU, and this makes sense at least for a high filter miss rate. Before, we were recording two ticks per filter miss in iterators (CHECKED & USEFUL) and now recording just one (FILTERED). Create DB with ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 ``` And run simultaneous before&after with ``` TEST_TMPDIR=/dev/shm ./db_bench -readonly -benchmarks=seekrandom[-X1000] -num=10000000 -bloom_bits=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -seek_nexts=1 -duration=20 -seed=43 -threads=8 -cache_size=1000000000 -statistics ``` Before: seekrandom [AVG 275 runs] : 189680 (± 222) ops/sec; 18.4 (± 0.0) MB/sec After: seekrandom [AVG 275 runs] : 197110 (± 208) ops/sec; 19.1 (± 0.0) MB/sec Reviewed By: ajkr Differential Revision: D46029177 Pulled By: pdillinger fbshipit-source-id: cdace79a2ea548d46c5900b068c5b7c3a02e5822	2023-05-19 15:25:49 -07:00
Peter Dillinger	4067acabca	Compatibility step for separating BlockCache and GeneralCache APIs (#11450 ) Summary: Add two type aliases for Cache: BlockCache and GeneralCache, and add LRUCacheOptions::MakeSharedGeneralCache(). This will ease upgrade to an intended future change to separate the cache API between block cache and other (general) uses, including row cache. Separating the APIs will make it easier to expose more details of block caching for customization. For example, it would be nice to pass the file unique ID and offset as the logical cache key instead of using a Slice, which could facilitate some file-specific customizations in block cache. This would also make it clear that HyperClockCache is not usable as a general cache, because it can only deal with fixed-size block cache keys. block_cache, row_cache, and blob_cache are the uses of Cache in the public API. blob_cache should be able to use BlockCache while row_cache is a GeneralCache user, as its keys are of arbitrary size. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11450 Test Plan: see updated unit test. Reviewed By: anand1976 Differential Revision: D45882067 Pulled By: pdillinger fbshipit-source-id: ff5d9f0b644f87ae337a29a7728ce3ed07b2a4b2	2023-05-18 20:40:19 -07:00
mayue.fight	8d8eb0e77e	Support Clip DB to KeyRange (#11379 ) Summary: This PR is part of the request https://github.com/facebook/rocksdb/issues/11317. (Another part is https://github.com/facebook/rocksdb/pull/11378) ClipDB() will clip the entries in the CF according to the range [begin_key, end_key). All the entries outside this range will be completely deleted (including tombstones). This feature is mainly used to ensure that there is no overlapping Key when calling CreateColumnFamilyWithImports() to import multiple CFs. When Calling ClipDB [begin, end), there are the following steps 1. Quickly and directly delete files without overlap DeleteFilesInRanges(nullptr, begin) + DeleteFilesInRanges(end, nullptr) 2. Delete the Key outside the range Delete[smallest_key, begin) + Delete[end, largest_key] 3. Delete the tombstone through Manul Compact CompactRange(option, nullptr, nullptr) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11379 Reviewed By: ajkr Differential Revision: D45840358 Pulled By: cbi42 fbshipit-source-id: 54152e8a45fd8ede137f99787eb252f0b51440a4	2023-05-18 13:25:01 -07:00
Jay Huh	586d78b31e	Remove wait_unscheduled from waitForCompact internal API (#11443 ) Summary: Context: In pull request https://github.com/facebook/rocksdb/issues/11436, we are introducing a new public API `waitForCompact(const WaitForCompactOptions& wait_for_compact_options)`. This API invokes the internal implementation `waitForCompact(bool wait_unscheduled=false)`. The unscheduled parameter indicates the compactions that are not yet scheduled but are required to process items in the queue. In certain cases, we are unable to wait for compactions, such as during a shutdown or when background jobs are paused. It is important to return the appropriate status in these scenarios. For all other cases, we should wait for all compaction and flush jobs, including the unscheduled ones. The primary purpose of this new API is to wait until the system has resolved its compaction debt. Currently, the usage of `wait_unscheduled` is limited to test code. This pull request eliminates the usage of wait_unscheduled. The internal `waitForCompact()` API now waits for unscheduled compactions unless the db is undergoing a shutdown. In the event of a shutdown, the API returns `Status::ShutdownInProgress()`. Additionally, a new parameter, `abort_on_pause`, has been introduced with a default value of `false`. This parameter addresses the possibility of waiting indefinitely for unscheduled jobs if `PauseBackgroundWork()` was called before `waitForCompact()` is invoked. By setting `abort_on_pause` to `true`, the API will immediately return `Status::Aborted`. Furthermore, all tests that previously called `waitForCompact(true)` have been fixed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11443 Test Plan: Existing tests that involve a shutdown in progress: - DBCompactionTest::CompactRangeShutdownWhileDelayed - DBTestWithParam::PreShutdownMultipleCompaction - DBTestWithParam::PreShutdownCompactionMiddle Reviewed By: pdillinger Differential Revision: D45923426 Pulled By: jaykorean fbshipit-source-id: 7dc93fe6a6841a7d9d2d72866fa647090dba8eae	2023-05-17 18:13:50 -07:00
Peter Dillinger	206fdea3d9	Change internal headers with duplicate names (#11408 ) Summary: In IDE navigation I find it annoying that there are two statistics.h files (etc.) and often land on the wrong one. Here I migrate several headers to use the blah.h <- blah_impl.h <- blah.cc idiom. Although clang-format wants "blah.h" to be the top include for "blah.cc", I think overall this is an improvement. No public API changes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11408 Test Plan: existing tests Reviewed By: ltamasi Differential Revision: D45456696 Pulled By: pdillinger fbshipit-source-id: 809d931253f3272c908cf5facf7e1d32fc507373	2023-05-17 11:27:09 -07:00
Andrew Kryczka	fb636f2498	Fix write stall stats dump format (#11445 ) Summary: I noticed in https://github.com/facebook/rocksdb/issues/11426 there is a missing line break. Before this PR the output looked like ``` $ ./db_bench -stats_per_interval=1 -stats_interval=100000 ... Write Stall (count): cf-l0-file-count-limit-delays-with-ongoing-compaction: 0, cf-l0-file-count-limit-stops-with-ongoing-compaction: 0, l0-file-count-limit-delays: 0, l0-file-count-limit-stops: 0, memtable-limit-delays: 0, memtable-limit-stops: 0, pending-compaction-bytes-delays: 0, pending-compaction-bytes-stops: 0, total-delays: 0, total-stops: 0, Block cache LRUCache@0x7f8695831b50#2766536 capacity: 32.00 MB seed: 1155354975 usage: 0.09 KB table_size: 1024 occupancy: 1 collections: 1 last_copies: 0 last_secs: 9.3e-05 secs_since: 2 ... Write Stall (count): write-buffer-manager-limit-stops: 0, num-running-compactions: 0 ... ``` After this PR it looks like ``` $ ./db_bench -stats_per_interval=1 -stats_interval=100000 ... Write Stall (count): cf-l0-file-count-limit-delays-with-ongoing-compaction: 0, cf-l0-file-count-limit-stops-with-ongoing-compaction: 0, l0-file-count-limit-delays: 0, l0-file-count-limit-stops: 0, memtable-limit-delays: 0, memtable-limit-stops: 0, pending-compaction-bytes-delays: 0, pending-compaction-bytes-stops: 0, total-delays: 0, total-stops: 0 Block cache LRUCache@0x7f8e0d231b50#2736585 capacity: 32.00 MB seed: 920433955 usage: 0.09 KB table_size: 1024 occupancy: 1 collections: 1 last_copies: 1 last_secs: 6.5e-05 secs_since: 4 ... Write Stall (count): write-buffer-manager-limit-stops: 0 num-running-compactions: 0 ... ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11445 Reviewed By: hx235 Differential Revision: D45844752 Pulled By: ajkr fbshipit-source-id: 1c708cb05b6e270922ac2fa95f5d011f273347eb	2023-05-15 11:47:17 -07:00
anand76	2084cdf237	Delete temp OPTIONS file on failure to write it (#11423 ) Summary: When the DB is opened, RocksDB creates a temp OPTIONS file, writes the current options to it, and renames it. In case of a failure, the temp file is left behind, and is not deleted by PurgeObsoleteFiles(). Fix this by explicitly deleting the temp file if writing to it or renaming it fails. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11423 Test Plan: Add a unit test Reviewed By: akankshamahajan15 Differential Revision: D45540454 Pulled By: anand1976 fbshipit-source-id: 47facdc30d8cc5667036312d04b21d3fc253c92e	2023-05-12 22:39:39 -07:00
Andrew Kryczka	113f3250f1	Add block checksum mismatch ticker stat (#11438 ) Summary: Added a ticker stat, `BLOCK_CHECKSUM_MISMATCH_COUNT`, to count how many block checksum verifications detected a mismatch. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11438 Test Plan: new unit test Reviewed By: pdillinger Differential Revision: D45788179 Pulled By: ajkr fbshipit-source-id: e2b44eba7c23b3e110ebe69eaa78a710dec2590f	2023-05-12 18:16:11 -07:00
Yu Zhang	47235dda9e	Add support in log writer and reader for a user-defined timestamp size record (#11433 ) Summary: This patch adds support to write and read a user-defined timestamp size record in log writer and log reader. It will be used by WAL logs to persist the user-defined timestamp format for subsequent WriteBatch records. Reading and writing UDT sizes for WAL logs are not included in this patch. It will be in a follow up. The syntax for the record is: at write time, one such record is added when log writer encountered any non-zero UDT size it hasn't recorded so far. At read time, all such records read up to a point are accumulated and applicable to all subsequent WriteBatch records. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11433 Test Plan: ``` make clean && make -j32 all ./log_test --gtest_filter="WithTimestampSize" ``` Reviewed By: ltamasi Differential Revision: D45678708 Pulled By: jowlyzhang fbshipit-source-id: b770c8f45bb7b9383b14aac9f22af781304fb41d	2023-05-11 17:26:19 -07:00
Changyu Bi	8827cd0618	Support compacting files to different temperatures in FIFO compaction (#11428 ) Summary: - Add a new option `CompactionOptionsFIFO::file_temperature_age_thresholds` that allows user to specify age thresholds for compacting files to different temperatures. File temperature can be used to store files in different storage media. The new options allows specifying multiple temperature-age pairs. The option uses struct for a temperature-age pair to use the existing parsing functionality to make the option dynamically settable. - Deprecate the old option `age_for_warm` that was added for a similar purpose. - Compaction score calculation logic is updated to check if a file needs to be compacted to change its temperature. - Some refactoring is done in `FIFOCompactionPicker::PickTemperatureChangeCompaction`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11428 Test Plan: adapted unit tests that were for `age_for_warm` to this new option. Reviewed By: ajkr Differential Revision: D45611412 Pulled By: cbi42 fbshipit-source-id: 2dc384841f61cc04abb9681e31aa2de0f0b06106	2023-05-11 16:40:59 -07:00
Peter Dillinger	f4a02f2c52	Add hash_seed to Caches (#11391 ) Summary: See motivation and description in new ShardedCacheOptions::hash_seed option. Updated db_bench so that its seed param is used for the cache hash seed. Made its code more safe to ensure seed is set before use. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11391 Test Plan: unit tests added / updated Performance - no discernible difference seen running cache_bench repeatedly before & after. With lru_cache and hyper_clock_cache. Reviewed By: hx235 Differential Revision: D45557797 Pulled By: pdillinger fbshipit-source-id: 40bf4da6d66f9d41a8a0eb8e5cf4246a4aa07934	2023-05-09 22:24:26 -07:00
akankshamahajan	6ba4717f35	Fix build error: variable 'base_level' may be uninitialized (#11435 ) Summary: Fix build error: variable 'base_level' may be uninitialized ``` db_impl_compaction_flush.cc:1195:21: error: variable 'base_level' may be uninitialized when used here [-Werror,-Wconditional-uninitialized] level = base_level; ``` ^~~~~~~~~~ Pull Request resolved: https://github.com/facebook/rocksdb/pull/11435 Test Plan: CircleCI jobs Reviewed By: cbi42 Differential Revision: D45708176 Pulled By: akankshamahajan15 fbshipit-source-id: 851b1205b22b63d728495e5735fa91b0ad8e012b	2023-05-09 15:43:43 -07:00
Hui Xiao	8f763bdeab	Record and use the tail size to prefetch table tail (#11406 ) Summary: Context: We prefetch the tail part of a SST file (i.e, the blocks after data blocks till the end of the file) during each SST file open in hope to prefetch all the stuff at once ahead of time for later read e.g, footer, meta index, filter/index etc. The existing approach to estimate the tail size to prefetch is through `TailPrefetchStats` heuristics introduced in https://github.com/facebook/rocksdb/pull/4156, which has caused small reads in unlucky case (e.g, small read into the tail buffer during table open in thread 1 under the same BlockBasedTableFactory object can make thread 2's tail prefetching use a small size that it shouldn't) and is hard to debug. Therefore we decide to record the exact tail size and use it directly to prefetch tail of the SST instead of relying heuristics. Summary: - Obtain and record in manifest the tail size in `BlockBasedTableBuilder::Finish()` - For backward compatibility, we fall back to TailPrefetchStats and last to simple heuristics that the tail size is a linear portion of the file size - see PR conversation for more. - Make`tail_start_offset` part of the table properties and deduct tail size to record in manifest for external files (e.g, file ingestion, import CF) and db repair (with no access to manifest). Pull Request resolved: https://github.com/facebook/rocksdb/pull/11406 Test Plan: 1. New UT 2. db bench Note: db bench on /tmp/ where direct read is supported is too slow to finish and the default pinning setting in db bench is not helpful to profile # sst read of Get. Therefore I hacked the following to obtain the following comparison. ``` diff --git a/table/block_based/block_based_table_reader.cc b/table/block_based/block_based_table_reader.cc index bd5669f0f..791484c1f 100644 --- a/table/block_based/block_based_table_reader.cc +++ b/table/block_based/block_based_table_reader.cc @@ -838,7 +838,7 @@ Status BlockBasedTable::PrefetchTail( &tail_prefetch_size); // Try file system prefetch - if (!file->use_direct_io() && !force_direct_prefetch) { + if (false && !file->use_direct_io() && !force_direct_prefetch) { if (!file->Prefetch(prefetch_off, prefetch_len, ro.rate_limiter_priority) .IsNotSupported()) { prefetch_buffer->reset(new FilePrefetchBuffer( diff --git a/tools/db_bench_tool.cc b/tools/db_bench_tool.cc index ea40f5fa0..39a0ac385 100644 --- a/tools/db_bench_tool.cc +++ b/tools/db_bench_tool.cc @@ -4191,6 +4191,8 @@ class Benchmark { std::shared_ptr<TableFactory>(NewCuckooTableFactory(table_options)); } else { BlockBasedTableOptions block_based_options; + block_based_options.metadata_cache_options.partition_pinning = + PinningTier::kAll; block_based_options.checksum = static_cast<ChecksumType>(FLAGS_checksum_type); if (FLAGS_use_hash_search) { ``` Create DB ``` ./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none ``` ReadRandom ``` ./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none ``` (a) Existing (Use TailPrefetchStats for tail size + use seperate prefetch buffer in PartitionedFilter/IndexReader::CacheDependencies()) ``` rocksdb.table.open.prefetch.tail.hit COUNT : 3395 rocksdb.sst.read.micros P50 : 5.655570 P95 : 9.931396 P99 : 14.845454 P100 : 585.000000 COUNT : 999905 SUM : 6590614 ``` (b) This PR (Record tail size + use the same tail buffer in PartitionedFilter/IndexReader::CacheDependencies()) ``` rocksdb.table.open.prefetch.tail.hit COUNT : 14257 rocksdb.sst.read.micros P50 : 5.173347 P95 : 9.015017 P99 : 12.912610 P100 : 228.000000 COUNT : 998547 SUM : 5976540 ``` As we can see, we increase the prefetch tail hit count and decrease SST read count with this PR 3. Test backward compatibility by stepping through reading with post-PR code on a db generated pre-PR. Reviewed By: pdillinger Differential Revision: D45413346 Pulled By: hx235 fbshipit-source-id: 7d5e36a60a72477218f79905168d688452a4c064	2023-05-08 13:14:28 -07:00
Changyu Bi	a11f1e12ca	Fix flaky test `DBTestUniversalManualCompactionOutputPathId.ManualCompactionOutputPathId` (#11412 ) Summary: the test is flaky when compiled with `make -j56 COERCE_CONTEXT_SWITCH=1 ./db_universal_compaction_test`. The cause is that a manual compaction `CompactRange()` can finish and return before obsolete files are deleted. One reason for this is that a manual compaction waits until `manual.done` is set here `62fc15f009/db/db_impl/db_impl_compaction_flush.cc (L1978)` and the compaction thread can set `manual.done`: `62fc15f009/db/db_impl/db_impl_compaction_flush.cc (L3672)` and then temporarily release mutex_: `62fc15f009/db/db_impl/db_impl_files.cc (L317)` before purging obsolete files: `62fc15f009/db/db_impl/db_impl_compaction_flush.cc (L3144)` With `COERCE_CONTEXT_SWITCH=1`, `bg_cv_.SignalAll()` is called during `mutex_.Lock()`, so the manual compaction thread can wake up and return before obsolete files are deleted. Updated the test to only count live SST files. Also updated `FindObsoleteFiles()` to avoid locking a locked mutex. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11412 Test Plan: `make -j56 COERCE_CONTEXT_SWITCH=1 ./db_universal_compaction_test` Reviewed By: ajkr Differential Revision: D45342242 Pulled By: cbi42 fbshipit-source-id: 955c9796aa3f484e3557d300f97cffacb3ed9b0c	2023-05-03 11:12:20 -07:00
clundro	50b33ebb1b	remove redundant move (#11418 ) Summary: when I use g++-13 to exec the `make all` command, the output throws the warnings. ``` db/compaction/compaction_job_test.cc: In member function ‘void rocksdb::CompactionJobTestBase::AddMockFile(const rocksdb::mock::KVVector&, int)’: db/compaction/compaction_job_test.cc:376:57: error: redundant move in initialization [-Werror=redundant-move] 376 \| env_, GenerateFileName(file_number), std::move(contents))); \| ~~~~~~~~~^~~~~~~~~~ db/compaction/compaction_job_test.cc:375:7: note: in expansion of macro ‘EXPECT_OK’ 375 \| EXPECT_OK(mock_table_factory_->CreateMockTable( \| ^~~~~~~~~ db/compaction/compaction_job_test.cc:376:57: note: remove ‘std::move’ call 376 \| env_, GenerateFileName(file_number), std::move(contents))); \| ~~~~~~~~~^~~~~~~~~~ db/compaction/compaction_job_test.cc:375:7: note: in expansion of macro ‘EXPECT_OK’ 375 \| EXPECT_OK(mock_table_factory_->CreateMockTable( \| ^~~~~~~~~ cc1plus: all warnings being treated as errors make: *** [Makefile:2507: db/compaction/compaction_job_test.o] Error 1 ``` and I also add some `(void)unused_variable` statements because of the cmake argument `-Wunused-but-set-variable -Wunused-but-set-variable` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11418 Reviewed By: akankshamahajan15 Differential Revision: D45528223 Pulled By: ajkr fbshipit-source-id: fee1a77c30039a56b481de953f0a834cc788abbc	2023-05-03 09:37:21 -07:00
leipeng	a475e9f746	DBIter::FindValueForCurrentKey: remove unused Status s (#11394 ) Summary: This PR remove a historical useless code Pull Request resolved: https://github.com/facebook/rocksdb/pull/11394 Reviewed By: ajkr Differential Revision: D45506226 Pulled By: akankshamahajan15 fbshipit-source-id: 32c98627100c9ad131bf65c4a1fe97ab61502daf	2023-05-03 08:52:03 -07:00
anand76	03a892a9fb	Delete empty WAL files on reopen (#11409 ) Summary: When a DB is opened, RocksDB creates an empty WAL file. When the DB is reopened and the WAL is empty, the min log number to keep is not advanced until a memtable flush happens. If a process crashes soon after reopening the DB, its likely that no memtable flush would have happened, which means the empty WAL file is not deleted. In a crash loop scenario, this leads to empty WAL files accumulating. Fix this by ensuring the min log number is advanced if the WAL is empty. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11409 Test Plan: Add a unit test Reviewed By: ajkr Differential Revision: D45281685 Pulled By: anand1976 fbshipit-source-id: 0225877c613e65ffb30972a0051db2830105423e	2023-05-02 15:54:29 -07:00
Peter Dillinger	41a7fbf758	Avoid long parameter lists configuring Caches (#11386 ) Summary: For better clarity, encouraging more options explicitly specified using fields rather than positionally via constructor parameter lists. Simplifies code maintenance as new fields are added. Deprecate some cases of the confusing pattern of NewWhatever() functions returning shared_ptr. Net reduction of about 70 source code lines (including comments). Pull Request resolved: https://github.com/facebook/rocksdb/pull/11386 Test Plan: existing tests Reviewed By: ajkr Differential Revision: D45059075 Pulled By: pdillinger fbshipit-source-id: d53fa09b268024f9c55254bb973b6c69feebf41a	2023-05-01 14:52:01 -07:00
Changyu Bi	62fc15f009	Block per key-value checksum (#11287 ) Summary: add option `block_protection_bytes_per_key` and implementation for block per key-value checksum. The main changes are 1. checksum construction and verification in block.cc/h 2. pass the option `block_protection_bytes_per_key` around (mainly for methods defined in table_cache.h) 3. unit tests/crash test updates Tests: * Added unit tests * Crash test: `python3 tools/db_crashtest.py blackbox --simple --block_protection_bytes_per_key=1 --write_buffer_size=1048576` Follow up (maybe as a separate PR): make sure corruption status returned from BlockIters are correctly handled. Performance: Turning on block per KV protection has a non-trivial negative impact on read performance and costs additional memory. For memory, each block includes additional 24 bytes for checksum-related states beside checksum itself. For CPU, I set up a DB of size ~1.2GB with 5M keys (32 bytes key and 200 bytes value) which compacts to ~5 SST files (target file size 256 MB) in L6 without compression. I tested readrandom performance with various block cache size (to mimic various cache hit rates): ``` SETUP make OPTIMIZE_LEVEL="-O3" USE_LTO=1 DEBUG_LEVEL=0 -j32 db_bench ./db_bench -benchmarks=fillseq,compact0,waitforcompaction,compact,waitforcompaction -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -target_file_size_base=268435456 --num=5000000 --key_size=32 --value_size=200 --compression_type=none BENCHMARK ./db_bench --use_existing_db -benchmarks=readtocache,readrandom[-X10] --num=5000000 --key_size=32 --disable_auto_compactions --reads=1000000 --block_protection_bytes_per_key=[0\|1] --cache_size=$CACHESIZE The readrandom ops/sec looks like the following: Block cache size: 2GB 1.2GB * 0.9 1.2GB * 0.8 1.2GB * 0.5 8MB Main 240805 223604 198176 161653 139040 PR prot_bytes=0 238691 226693 200127 161082 141153 PR prot_bytes=1 214983 193199 178532 137013 108211 prot_bytes=1 vs -10% -15% -10.8% -15% -23% prot_bytes=0 ``` The benchmark has a lot of variance, but there was a 5% to 25% regression in this benchmark with different cache hit rates. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11287 Reviewed By: ajkr Differential Revision: D43970708 Pulled By: cbi42 fbshipit-source-id: ef98d898b71779846fa74212b9ec9e08b7183940	2023-04-25 12:08:23 -07:00
leipeng	40d69b59ad	DBImpl::MultiGet: delete unused var `superversions_to_delete` (#11395 ) Summary: In db_impl.cc DBImpl::MultiGet: delete unused var `superversions_to_delete` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11395 Reviewed By: ajkr Differential Revision: D45240896 Pulled By: cbi42 fbshipit-source-id: 0fff99b0d794b6f6d4aadee6036bddd6cb19eb31	2023-04-25 10:46:29 -07:00
Peter Dillinger	fb63d9b4ee	Fix compression tests when snappy not available (#11396 ) Summary: Tweak some bounds and things, and reduce risk of surprise results by running on all supported compressions (mostly). Also improves the precise compressibility of CompressibleString by using RandomBinaryString. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11396 Test Plan: updated tests Reviewed By: ltamasi Differential Revision: D45211938 Pulled By: pdillinger fbshipit-source-id: 9dc1dd8574a60a9364efe18558be66d31a35598b	2023-04-22 12:41:36 -07:00
Peter Dillinger	d79be3dca2	Changes and enhancements to compression stats, thresholds (#11388 ) Summary: ## Option API updates * Add new CompressionOptions::max_compressed_bytes_per_kb, which corresponds to 1024.0 / min allowable compression ratio. This avoids the hard-coded minimum ratio of 8/7. * Remove unnecessary constructor for CompressionOptions. * Document undocumented CompressionOptions. Use idiom for default values shown clearly in one place (not precariously repeated). ## Stat API updates * Deprecate the BYTES_COMPRESSED, BYTES_DECOMPRESSED histograms. Histograms incur substantial extra space & time costs compared to tickers, and the distribution of uncompressed data block sizes tends to be uninteresting. If we're interested in that distribution, I don't see why it should be limited to blocks stored as compressed. * Deprecate the NUMBER_BLOCK_NOT_COMPRESSED ticker, because the name is very confusing. * New or existing tickers relevant to compression: * BYTES_COMPRESSED_FROM * BYTES_COMPRESSED_TO * BYTES_COMPRESSION_BYPASSED * BYTES_COMPRESSION_REJECTED * COMPACT_WRITE_BYTES + FLUSH_WRITE_BYTES (both existing) * NUMBER_BLOCK_COMPRESSED (existing) * NUMBER_BLOCK_COMPRESSION_BYPASSED * NUMBER_BLOCK_COMPRESSION_REJECTED * BYTES_DECOMPRESSED_FROM * BYTES_DECOMPRESSED_TO We can compute a number of things with these stats: * "Successful" compression ratio: BYTES_COMPRESSED_FROM / BYTES_COMPRESSED_TO * Compression ratio of data on which compression was attempted: (BYTES_COMPRESSED_FROM + BYTES_COMPRESSION_REJECTED) / (BYTES_COMPRESSED_TO + BYTES_COMPRESSION_REJECTED) * Compression ratio of data that could be eligible for compression: (BYTES_COMPRESSED_FROM + X) / (BYTES_COMPRESSED_TO + X) where X = BYTES_COMPRESSION_REJECTED + NUMBER_BLOCK_COMPRESSION_REJECTED * Overall SST compression ratio (compression disabled vs. actual): (Y - BYTES_COMPRESSED_TO + BYTES_COMPRESSED_FROM) / Y where Y = COMPACT_WRITE_BYTES + FLUSH_WRITE_BYTES Keeping _REJECTED separate from _BYPASSED helps us to understand "wasted" CPU time in compression. ## BlockBasedTableBuilder Various small refactorings, optimizations, and name clean-ups. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11388 Test Plan: unit tests added * `options_settable_test.cc`: use non-deprecated idiom for configuring CompressionOptions from string. The old idiom is tested elsewhere and does not need to be updated to support the new field. Reviewed By: ajkr Differential Revision: D45128202 Pulled By: pdillinger fbshipit-source-id: 5a652bf5c022b7ec340cf79018cccf0686962803	2023-04-21 21:57:40 -07:00
Hui Xiao	151242ce46	Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288 ) Summary: Context: The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them. Summary - Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros` - Fixed the default histogram in `RandomAccessFileReader` - New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader` - Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288 Test Plan: - Stress test - Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's. (without blob) - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads. ``` ./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10) ``` ``` // BlockBasedTable rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805 rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116 rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689 // PlainTable Does not apply ``` - Db bench 2: performance Read SETUP: db with 900 files ``` ./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none ```run till convergence ``` ./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3 ``` Pre-change `readrandom [AVG 60 runs] : 21568 (± 248) ops/sec` Post-change (no regression, -0.3%) `readrandom [AVG 60 runs] : 21486 (± 236) ops/sec` Compaction/Flushrun till convergence ``` ./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none rocksdb.sst.read.micros COUNT : 33820 rocksdb.sst.read.flush.micros COUNT : 1800 rocksdb.sst.read.compaction.micros COUNT : 32020 ``` Pre-change `fillseq [AVG 46 runs] : 1391 (± 214) ops/sec; 0.7 (± 0.1) MB/sec` Post-change (no regression, ~-0.4%) `fillseq [AVG 46 runs] : 1385 (± 216) ops/sec; 0.7 (± 0.1) MB/sec` Reviewed By: ajkr Differential Revision: D44007011 Pulled By: hx235 fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132	2023-04-21 09:07:18 -07:00
Andrew Kryczka	0a774a102f	Clarify `SstFileWriter::DeleteRange()` ordering requirements (#11390 ) Summary: As titled Pull Request resolved: https://github.com/facebook/rocksdb/pull/11390 Reviewed By: cbi42 Differential Revision: D45148830 Pulled By: ajkr fbshipit-source-id: 9a8dfd040514bae3d8ed9e97a79cae7683f2749a	2023-04-20 13:02:16 -07:00
Changyu Bi	43e9a60bb2	Always allow L0->L1 trivial move during manual compaction (#11375 ) Summary: during manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with compacting key range (introduced in https://github.com/facebook/rocksdb/issues/7368 to enforce kForce* contract). This can cause large memory usage due to compaction readahead when number of L0 files is large. This PR allows L0->L1 trivial move in this case, and will do a L1 -> L1 intra-level compaction when needed (`bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 file, and user calls CompactRange(kForce, nullptr, nullptr), - before this PR, RocksDB does a L0 -> L1 compaction (disallow trivial move), - after this PR, RocksDB does a L0 -> L1 compaction (allow trivial move), and a L1 -> L1 compaction. Users can use kForceOptimized to avoid this extra L1->L1 compaction overhead when L0s are overlapping and cannot be trivial moved. This PR also fixed a bug (see previous discussion in https://github.com/facebook/rocksdb/issues/11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause incorrect level being moved when CompactRangeOptions::change_level is specified. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11375 Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed. Reviewed By: ajkr Differential Revision: D44943518 Pulled By: cbi42 fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708	2023-04-20 11:10:48 -07:00

... 2 3 4 5 6 ...

5601 Commits