mirror of https://github.com/facebook/rocksdb.git
583 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
Hui Xiao | 05daa12332 |
Change compaction_readahead_size default value to 2MB (#11762)
Summary: **Context/Summary:** After https://github.com/facebook/rocksdb/pull/11631, we rely on `compaction_readahead_size` for how much to read ahead for compaction read under non-direct IO case. https://github.com/facebook/rocksdb/pull/11658 therefore also sanitized 0 `compaction_readahead_size` to 2MB under non-direct IO, which is consistent with the existing sanitization with direct IO. However, this makes disabling compaction readahead impossible as well as add one more scenario to the inconsistent effects between `Options.compaction_readahead_size=0` during DB open and `SetDBOptions("compaction_readahead_size", "0")` . - `SetDBOptions("compaction_readahead_size", "0")` will disable compaction readahead as its logic never goes through sanitization above while `Options.compaction_readahead_size=0` will go through sanitization. Therefore we decided to do this PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11762 Test Plan: Modified existing UTs to cover this PR Reviewed By: ajkr Differential Revision: D48759560 Pulled By: hx235 fbshipit-source-id: b3f85e58bda362a6fa1dc26bd8a87aa0e171af79 |
|
Yu Zhang | fc58c7c62a |
Add UDT support in SstFileDumper (#11757)
Summary: For a SST file that uses user-defined timestamp aware comparators, if a lower or upper bound is set, sst_dump tool doesn't handle it well. This PR adds support for that. While working on this `MaybeAddTimestampsToRange` is moved to the udt_util.h file to be shared. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11757 Test Plan: make all check for changes in db_impl.cc and db_impl_compaction_flush.cc for changes in sst_file_dumper.cc, I manually tested this change handles specifying bounds for UDT use cases. It probably should have a unit test file eventually. Reviewed By: ltamasi Differential Revision: D48668048 Pulled By: jowlyzhang fbshipit-source-id: 1560465f40e44668d6d82a7439fe9012be0e74a8 |
|
Yu Zhang | 4234a6a301 |
Increase full_history_ts_low when flush happens during recovery (#11774)
Summary: This PR adds a missing piece for the UDT in memtable only feature, which is to automatically increase `full_history_ts_low` when flush happens during recovery. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11774 Test Plan: Added unit test make all check Reviewed By: ltamasi Differential Revision: D48799109 Pulled By: jowlyzhang fbshipit-source-id: fd681ed66d9d40904ca2c919b2618eb692686035 |
|
chuhao zeng | 1303573589 |
Reverse sort order in dedup to enable iter checking in callback (#11725)
Summary: Fix https://github.com/facebook/rocksdb/issues/6470 Ensure TransactionLogIter being initialized correctly with SYNC_POINT API when calling `GetSortedWALFiles`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11725 Reviewed By: akankshamahajan15 Differential Revision: D48529411 Pulled By: ajkr fbshipit-source-id: 970ca1a6259ed996c6d87f7fcd40f95acf441517 |
|
Jay Huh | 0fa0c97d3e |
Timeout in microsecond option in WaitForCompactOptions (#11711)
Summary: While it's rare, we may run into a scenario where `WaitForCompact()` waits for background jobs indefinitely. For example, not enough space error will add the job back to the queue while WaitForCompact() waits for _all jobs_ including the jobs that are in the queue to be completed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11711 Test Plan: `DBCompactionWaitForCompactTest::WaitForCompactToTimeout` added `timeout` option added to the variables for all of the existing DBCompactionWaitForCompactTests Reviewed By: pdillinger, jowlyzhang Differential Revision: D48416390 Pulled By: jaykorean fbshipit-source-id: 7b6a12f705ab6c6dfaf8ad736a484ca654a86106 |
|
Changyu Bi | d1ff401472 |
Delay bottommost level single file compactions (#11701)
Summary: For leveled compaction, RocksDB has a special kind of compaction with reason "kBottommmostFiles" that compacts bottommost level files to clear data held by snapshots (more detail in https://github.com/facebook/rocksdb/issues/3009). Such compactions can happen soon after a relevant snapshot is released. For some use cases, a bottommost file may contain only a small amount of keys that can be cleared, so compacting such a file has a high write amp. In addition, these bottommost files may be compacted in compactions with reason other than "kBottommmostFiles" if we wait for some time (so that enough data is ingested to trigger such a compaction). This PR introduces an option `bottommost_file_compaction_delay` to specify the delay of these bottommost level single file compactions. * The main change is in `VersionStorageInfo::ComputeBottommostFilesMarkedForCompaction()` where we only add a file to `bottommost_files_marked_for_compaction_` if it oldest_snapshot is larger than its non-zero largest_seqno **and** the file is old enough. Note that if a file is not old enough but its largest_seqno is less than oldest_snapshot, we exclude it from the calculation of `bottommost_files_mark_threshold_`. This makes the change simpler, but such a file's eligibility for compaction will only be checked the next time `ComputeBottommostFilesMarkedForCompaction()` is called. This happens when a new Version is created (compaction, flush, SetOptions()...), a new enough snapshot is released (`VersionStorageInfo::UpdateOldestSnapshot()`) or when a compaction is picked and compaction score has to be re-calculated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11701 Test Plan: * Add two unit tests to test when bottommost_file_compaction_delay > 0. * Ran crash test with the new option. Reviewed By: jaykorean, ajkr Differential Revision: D48331564 Pulled By: cbi42 fbshipit-source-id: c584f3dc5f6354fce3ed65f4c6366dc450b15ba8 |
|
Jay Huh | 793a786fa3 |
Fix for unchecked status in CancelAllBackgroundWork (#11699)
Summary: ## Summary PR https://github.com/facebook/rocksdb/issues/11497 introduced this. Status from `CancelPeriodicTaskScheduler()` is unchecked and causing test failure like https://app.circleci.com/pipelines/github/facebook/rocksdb/30743/workflows/24443a9b-6fc3-41e6-86c1-992d766eb1ec/jobs/642419 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11699 Test Plan: Existing tests Reviewed By: cbi42 Differential Revision: D48287188 Pulled By: jaykorean fbshipit-source-id: b6bcf6e3c3c47f126c34c24a3dfed2649635cc8c |
|
Jay Huh | 52816ff64d |
Close DB option in WaitForCompact() (#11497)
Summary: Context: As mentioned in https://github.com/facebook/rocksdb/issues/11436, introducing `close_db` option in `WaitForCompactOptions` to close DB after waiting for compactions to finish. Must be set to true to close the DB upon compactions finishing. 1. `bool close_db = false` added to `WaitForCompactOptions` 2. Introduced `CancelPeriodicTaskSchedulers()` and moved unregistering PeriodicTaskSchedulers to it.`CancelAllBackgroundWork()` calls it now. 3. When close_db option is on, unpersisted data (data in memtable when WAL is disabled) will be flushed in `WaitForCompact()` if flush option is not on (and `mutable_db_options_.avoid_flush_during_shutdown` is not true). The unpersisted data flush in `CancelAllBackgroundWork()` will be skipped because `shutting_down_` flag will be set true before calling `Close()`. 4. Atomic boolean `reject_new_background_jobs_` is introduced to prevent new background jobs from being added during the short period of time after waiting is done and before `shutting_down_` is set by `Close()`. 5. `WaitForCompact()` now waits for recovery in progress to complete as well. (flush operations from WAL -> L0 files) 6. Added `close_db_` cases to all existing `WaitForCompactTests` 7. Added a scenario to `DBBasicTest::DBClose` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11497 Test Plan: - Existing DBCompactionTests - `WaitForCompactWithOptionToFlushAndCloseDB` added - Added a scenario to `DBBasicTest::DBClose` Reviewed By: pdillinger, jowlyzhang Differential Revision: D46337560 Pulled By: jaykorean fbshipit-source-id: 0f8c7ee09394847f2af5ea4bdd331b47bcdef0b0 |
|
Yu Zhang | 7cdbce4564 |
Add UDT support in API DB::GetApproximateMemTableStats (#11689)
Summary: This API should consider the case when user-defined timestamp is enabled. Also added some documentation to some related API to clarify the usage in the case when user-defined timestamp is enabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11689 Test Plan: Unit test added ``` make check ./db_with_timestamp_basic_test --gtest_filter=*GetApproximateSizes* ``` Reviewed By: ltamasi Differential Revision: D48208568 Pulled By: jowlyzhang fbshipit-source-id: c5baa4a2923441f8ea3a3672c98223a43a3428dc |
|
Changyu Bi | 76ed9a3990 |
Add missing status check when compiling with `ASSERT_STATUS_CHECKED=1` (#11686)
Summary:
It seems the flag `-fno-elide-constructors` is incorrectly overwritten in Makefile by
|
|
Hui Xiao | 9a034801ce |
Group rocksdb.sst.read.micros stat by different user read IOActivity + misc (#11444)
Summary:
**Context/Summary:**
- Similar to https://github.com/facebook/rocksdb/pull/11288 but for user read such as `Get(), MultiGet(), DBIterator::XXX(), Verify(File)Checksum()`.
- For this, I refactored some user-facing `MultiGet` calls in `TransactionBase` and various types of `DB` so that it does not call a user-facing `Get()` but `GetImpl()` for passing the `ReadOptions::io_activity` check (see PR conversation)
- New user read stats breakdown are guarded by `kExceptDetailedTimers` since measurement shows they have 4-5% regression to the upstream/main.
- Misc
- More refactoring: with https://github.com/facebook/rocksdb/pull/11288, we complete passing `ReadOptions/IOOptions` to FS level. So we can now replace the previously [added](https://github.com/facebook/rocksdb/pull/9424) `rate_limiter_priority` parameter in `RandomAccessFileReader`'s `Read/MultiRead/Prefetch()` with `IOOptions::rate_limiter_priority`
- Also, `ReadAsync()` call time is measured in `SST_READ_MICRO` now
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11444
Test Plan:
- CI fake db crash/stress test
- Microbenchmarking
**Build** `make clean && ROCKSDB_NO_FBCODE=1 DEBUG_LEVEL=0 make -jN db_basic_bench`
- google benchmark version:
|
|
Xinye Tao | d2b0652b32 |
compute compaction score once for a batch of range file deletes (#10744)
Summary: Only re-calculate compaction score once for a batch of deletions. Fix performance regression brought by https://github.com/facebook/rocksdb/pull/8434. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10744 Test Plan: In one of our production cluster that recently upgraded to RocksDB 6.29, it takes more than 10 minutes to delete files in 30,000 ranges. The RocksDB instance contains approximately 80,000 files. After this patch, the duration reduces to 100+ ms, which is on par with RocksDB 6.4. Cherry-picking downstream PR: https://github.com/tikv/rocksdb/pull/316 Signed-off-by: tabokie <xy.tao@outlook.com> Reviewed By: cbi42 Differential Revision: D48002581 Pulled By: ajkr fbshipit-source-id: 7245607ee3ad79c53b648a6396c9159f166b9437 |
|
Andrew Kryczka | 4500a0d6ec |
Avoid an std::map copy in persistent stats (#11681)
Summary: An internal user reported this copy showing up in a CPU profile. We can use move instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11681 Differential Revision: D48103170 Pulled By: ajkr fbshipit-source-id: 083d6470181a0041bb5275b657aa61bee23a3729 |
|
Hui Xiao | 09882a52d6 |
Prepare for deprecation of Options::access_hint_on_compaction_start (#11658)
Summary: **Context/Summary:** After https://github.com/facebook/rocksdb/pull/11631, file hint is not longer needed for compaction read. Therefore we can deprecate `Options::access_hint_on_compaction_start`. As this is a public API change, we should first mark the relevant APIs (including the Java's) deprecated and remove it in next major release 9.0. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11658 Test Plan: No code change Reviewed By: ajkr Differential Revision: D47997856 Pulled By: hx235 fbshipit-source-id: 16e015ae7728c224b1caef73143aa9915668f4ac |
|
Peter Dillinger | 7a1b0207e6 |
format_version=6 and context-aware block checksums (#9058)
Summary: ## Context checksum All RocksDB checksums currently use 32 bits of checking power, which should be 1 in 4 billion false negative (FN) probability (failing to detect corruption). This is true for random corruptions, and in some cases small corruptions are guaranteed to be detected. But some possible corruptions, such as in storage metadata rather than storage payload data, would have a much higher FN rate. For example: * Data larger than one SST block is replaced by data from elsewhere in the same or another SST file. Especially with block_align=true, the probability of exact block size match is probably around 1 in 100, making the FN probability around that same. Without `block_align=true` the probability of same block start location is probably around 1 in 10,000, for FN probability around 1 in a million. To solve this problem in new format_version=6, we add "context awareness" to block checksum checks. The stored and expected checksum value is modified based on the block's position in the file and which file it is in. The modifications are cleverly chosen so that, for example * blocks within about 4GB of each other are guaranteed to use different context * blocks that are offset by exactly some multiple of 4GiB are guaranteed to use different context * files generated by the same process are guaranteed to use different context for the same offsets, until wrap-around after 2^32 - 1 files Thus, with format_version=6, if a valid SST block and checksum is misplaced, its checksum FN probability should be essentially ideal, 1 in 4B. ## Footer checksum This change also adds checksum protection to the SST footer (with format_version=6), for the first time without relying on whole file checksum. To prevent a corruption of the format_version in the footer (e.g. 6 -> 5) to defeat the footer checksum, we change much of the footer data format including an "extended magic number" in format_version 6 that would be interpreted as empty index and metaindex block handles in older footer versions. We also change the encoding of handles to free up space for other new data in footer. ## More detail: making space in footer In order to keep footer the same size in format_version=6 (avoid change to IO patterns), we have to free up some space for new data. We do this two ways: * Metaindex block handle is encoded down to 4 bytes (from 10) by assuming it immediately precedes the footer, and by assuming it is < 4GB. * Index block handle is moved into metaindex. (I don't know why it was in footer to begin with.) ## Performance In case of small performance penalty, I've made a "pay as you go" optimization to compensate: replace `MutableCFOptions` in BlockBasedTableBuilder::Rep with the only field used in that structure after construction: `prefix_extractor`. This makes the PR an overall performance improvement (results below). Nevertheless I'm seeing essentially no difference going from fv=5 to fv=6, even including that improvement for both. That's based on extreme case table write performance testing, many files with many blocks. This is relatively checksum intensive (small blocks) and salt generation intensive (small files). ``` (for I in `seq 1 100`; do TEST_TMPDIR=/dev/shm/dbbench2 ./db_bench -benchmarks=fillseq -memtablerep=vector -disable_wal=1 -allow_concurrent_memtable_write=false -num=3000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -write_buffer_size=100000 -compression_type=none -block_size=1000; done) 2>&1 | grep micros/op | tee out awk '{ tot += $5; n += 1; } END { print int(1.0 * tot / n) }' < out ``` Each value below is ops/s averaged over 100 runs, run simultaneously with competing configuration for load fairness Before -> after (both fv=5): 483530 -> 483673 (negligible) Re-run 1: 480733 -> 485427 (1.0% faster) Re-run 2: 483821 -> 484541 (0.1% faster) Before (fv=5) -> after (fv=6): 482006 -> 485100 (0.6% faster) Re-run 1: 482212 -> 485075 (0.6% faster) Re-run 2: 483590 -> 484073 (0.1% faster) After fv=5 -> after fv=6: 483878 -> 485542 (0.3% faster) Re-run 1: 485331 -> 483385 (0.4% slower) Re-run 2: 485283 -> 483435 (0.4% slower) Re-run 3: 483647 -> 486109 (0.5% faster) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9058 Test Plan: unit tests included (table_test, db_properties_test, salt in env_test). General DB tests and crash test updated to test new format_version. Also temporarily updated the default format version to 6 and saw some test failures. Almost all were due to an inadvertent additional read in VerifyChecksum to verify the index block checksum, though it's arguably a bug that VerifyChecksum does not appear to (re-)verify the index block checksum, just assuming it was verified in opening the index reader (probably *usually* true but probably not always true). Some other concerns about VerifyChecksum are left in FIXME comments. The only remaining test failure on change of default (in block_fetcher_test) now has a comment about how to upgrade the test. The format compatibility test does not need updating because we have not updated the default format_version. Reviewed By: ajkr, mrambacher Differential Revision: D33100915 Pulled By: pdillinger fbshipit-source-id: 8679e3e572fa580181a737fd6d113ed53c5422ee |
|
Changyu Bi | 6a0f637633 |
Compare the number of input keys and processed keys for compactions (#11571)
Summary: ... to improve data integrity validation during compaction. A new option `compaction_verify_record_count` is introduced for this verification and is enabled by default. One exception when the verification is not done is when a compaction filter returns kRemoveAndSkipUntil which can cause CompactionIterator to seek until some key and hence not able to keep track of the number of keys processed. For expected number of input keys, we sum over the number of total keys - number of range tombstones across compaction input files (`CompactionJob::UpdateCompactionStats()`). Table properties are consulted if `FileMetaData` is not initialized for some input file. Since table properties for all input files were also constructed during `DBImpl::NotifyOnCompactionBegin()`, `Compaction::GetTableProperties()` is introduced to reduce duplicated code. For actual number of keys processed, each subcompaction will record its number of keys processed to `sub_compact->compaction_job_stats.num_input_records` and aggregated when all subcompactions finish (`CompactionJob::AggregateCompactionStats()`). In the case when some subcompaction encountered kRemoveAndSkipUntil from compaction filter and does not have accurate count, it propagates this information through `sub_compact->compaction_job_stats.has_num_input_records`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11571 Test Plan: * Add a new unit test `DBCompactionTest.VerifyRecordCount` for the corruption case. * All other unit tests for non-corrupted case. * Ran crash test for a few hours: `python3 ./tools/db_crashtest.py whitebox --simple` Reviewed By: ajkr Differential Revision: D47131965 Pulled By: cbi42 fbshipit-source-id: cc8e94565dd526c4347e9d3843ecf32f6727af92 |
|
Yu Zhang | c24ef26ca7 |
Support switching on / off UDT together with in-Memtable-only feature (#11623)
Summary: Add support to allow enabling / disabling user-defined timestamps feature for an existing column family in combination with the in-Memtable only feature. To do this, this PR includes: 1) Log the `persist_user_defined_timestamps` option per column family in Manifest to facilitate detecting an attempt to enable / disable UDT. This entry is enforced to be logged in the same VersionEdit as the user comparator name entry. 2) User-defined timestamps related options are validated when re-opening a column family, including user comparator name and the `persist_user_defined_timestamps` flag. These type of settings and settings change are considered valid: a) no user comparator change and no effective `persist_user_defined_timestamp` flag change. b) switch user comparator to enable UDT provided the immediately effective `persist_user_defined_timestamps` flag is false. c) switch user comparator to disable UDT provided that the before-change `persist_user_defined_timestamps` is already false. 3) when an attempt to enable UDT is detected, we mark all its existing SST files as "having no UDT" by marking its `FileMetaData.user_defined_timestamps_persisted` flag to false and handle their file boundaries `FileMetaData.smallest`, `FileMetaData.largest` by padding a min timestamp. 4) while enabling / disabling UDT feature, timestamp size inconsistency in existing WAL logs are handled to make it compatible with the running user comparator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11623 Test Plan: ``` make all check ./db_with_timestamp_basic_test --gtest-filter="*EnableDisableUDT*" ./db_wal_test --gtest_filter="*EnableDisableUDT*" ``` Reviewed By: ltamasi Differential Revision: D47636862 Pulled By: jowlyzhang fbshipit-source-id: dcd19f67292da3c3cc9584c09ad00331c9ab9322 |
|
Yu Zhang | 4ea7b796b7 |
Respect cutoff timestamp during flush (#11599)
Summary: Make flush respect the cutoff timestamp `full_history_ts_low` as much as possible for the user-defined timestamps in Memtables only feature. We achieve this by not proceeding with the actual flushing but instead reschedule the same `FlushRequest` so a follow up flush job can continue with the check after some interval. This approach doesn't work well for atomic flush, so this feature currently is not supported in combination with atomic flush. Furthermore, this approach also requires a customized method to get the next immediately bigger user-defined timestamp. So currently it's limited to comparator that use uint64_t as the user-defined timestamp format. This support can be extended when we add such a customized method to `AdvancedColumnFamilyOptions`. For non atomic flush request, at any single time, a column family can only have as many as one FlushRequest for it in the `flush_queue_`. There is deduplication done at `FlushRequest` enqueueing(`SchedulePendingFlush`) and dequeueing time (`PopFirstFromFlushQueue`). We hold the db mutex between when a `FlushRequest` is popped from the queue and the same FlushRequest get rescheduled, so no other `FlushRequest` with a higher `max_memtable_id` can be added to the `flush_queue_` blocking us from re-enqueueing the same `FlushRequest`. Flush is continued nevertheless if there is risk of entering write stall mode had the flush being postponed, e.g. due to accumulation of write buffers, exceeding the `max_write_buffer_number` setting. When this happens, the newest user-defined timestamp in the involved Memtables need to be tracked and we use it to increase the `full_history_ts_low`, which is an inclusive cutoff timestamp for which RocksDB promises to keep all user-defined timestamps equal to and newer than it. Tet plan: ``` ./column_family_test --gtest_filter="*RetainUDT*" ./memtable_list_test --gtest_filter="*WithTimestamp*" ./flush_job_test --gtest_filter="*WithTimestamp*" ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11599 Reviewed By: ajkr Differential Revision: D47561586 Pulled By: jowlyzhang fbshipit-source-id: 9400445f983dd6eac489e9dd0fb5d9b99637fe89 |
|
shuzz | 2f712235ab |
optimized code (#11614)
Summary: improvement code by std::move and c++17 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11614 Reviewed By: ajkr Differential Revision: D47599519 Pulled By: jowlyzhang fbshipit-source-id: 6b897876f4e87e94a74c53d8db2a01303d500bff |
|
Changyu Bi | 662a1c99f6 |
Verify number of keys flushed during DB open (#11611)
Summary:
Extend the coverage for option `flush_verify_memtable_count`. The verification code is similar to the ones for regular flush:
|
|
Changyu Bi | ca50ccc71a |
Add CreateColumnFamilyWithImport to `StackableDB` and `DBImplReadOnly` (#11556)
Summary: https://github.com/facebook/rocksdb/issues/11378 added a new overloaded `CreateColumnFamilyWithImport` API and updated the virtual function in `StackableDB` and `DBImplReadOnly` to the newly overloaded one. This caused internal error when there is a derived class that tries to override the original `CreateColumnFamilyWithImport` function. This PR adds the original `CreateColumnFamilyWithImport` function back to `StackableDB` and `DBImplReadOnly`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11556 Test Plan: check if this fixes an internal build Reviewed By: akankshamahajan15 Differential Revision: D46980506 Pulled By: cbi42 fbshipit-source-id: 975a6c5748bf9481499a62ee5997ca59e542e3bc |
|
Yu Zhang | 7521478b43 |
Record the `persist_user_defined_timestamps` flag in manifest (#11515)
Summary: Start to record the value of the flag `AdvancedColumnFamilyOptions.persist_user_defined_timestamps` in the Manifest and table properties for a SST file when it is created. And use the recorded flag when creating a table reader for the SST file. This flag's default value is true, it is only explicitly recorded if it's false. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11515 Test Plan: ``` make all check ./version_edit_test ``` Reviewed By: ltamasi Differential Revision: D46920386 Pulled By: jowlyzhang fbshipit-source-id: 075c20363d3d2cc1368422ecc805617ed135cc26 |
|
mayue.fight | fa878a0107 |
Support to create a CF by importing multiple non-overlapping CFs (#11378)
Summary: The original Feature Request is from [https://github.com/facebook/rocksdb/issues/11317](https://github.com/facebook/rocksdb/issues/11317). Flink uses rocksdb as the state backend, all DB options are the same, and the keys of each DB instance are adjacent and there is no key overlap between two db instances. In the Flink rescaling scenario, it is necessary to quickly split the DB according to a certain key range or quickly merge multiple DBs into one. This PR is mainly used to quickly merge multiple DBs into one. We hope to extend the function of `CreateColumnFamilyWithImports` to support creating ColumnFamily by importing multiple ColumnFamily with no overlapping keys. The import logic is almost the same as `CreateColumnFamilyWithImport`, but it will check whether there is key overlap between CF when importing. The import will fail if there are key overlaps. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11378 Reviewed By: ajkr Differential Revision: D46413709 Pulled By: cbi42 fbshipit-source-id: 846d0049fad11c59cf460fa846c345b26c658dfb |
|
Changyu Bi | 15e8a843d9 |
Do not include last level in compaction when `allow_ingest_behind=true` (#11489)
Summary: when a DB is configured with `allow_ingest_behind = true`, the last level should be reserved for ingested files and these files should not be included in any compaction. Currently, a major compaction can compact these files to smaller levels. This can cause future files to be rejected for ingest behind (see `ExternalSstFileIngestionJob::CheckLevelForIngestedBehindFile()`). This PR fixes the issue such that files in the last level is not included in any compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11489 Test Plan: * Updated unit test `ExternalSSTFileTest.IngestBehind` to test that last level is not included in manual and auto-compaction. Reviewed By: ajkr Differential Revision: D46455711 Pulled By: cbi42 fbshipit-source-id: 5e2142c2a709ef932ad797897795021c06c4ac8c |
|
Andrew Kryczka | cac3240cbf |
add property "rocksdb.obsolete-sst-files-size" (#11533)
Summary: See "unreleased_history/new_features/obsolete_sst_files_size.md" for description Pull Request resolved: https://github.com/facebook/rocksdb/pull/11533 Test Plan: updated unit test Reviewed By: jowlyzhang Differential Revision: D46703152 Pulled By: ajkr fbshipit-source-id: ea5e31cd6293eccc154130c13e66b5271f57c102 |
|
Ignat Loskutov | 05fcacdb42 |
Add missing stopwatch and perf timer to DBImplReadOnly (#11521)
Summary:
The `rocksdb.db.get.micros` histogram is never updated if the DB is open in ReadOnly mode, as well as the `get_cpu_nanos` perf counter. An earlier PR (https://github.com/facebook/rocksdb/issues/4260) for some reason has only added the TODO line, not the accounting itself, so this one is intended to fix it, adding two lines to match [DBImplSecondary](
|
|
Yu Zhang | 4dafa5b220 |
switch to use RocksDB UnorderedMap (#11507)
Summary: Switch from std::unordered_map to RocksDB UnorderedMap for all the places that logging user-defined timestamp size in WAL used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11507 Test Plan: ``` make all check ``` Reviewed By: ltamasi Differential Revision: D46448975 Pulled By: jowlyzhang fbshipit-source-id: bdb4d56a723b697a33daaf0f856a61d49a367a99 |
|
Changyu Bi | e95cc1217d |
`CompactRange()` always compacts to bottommost level for leveled compaction (#11468)
Summary:
currently for leveled compaction, the max output level of a call to `CompactRange()` is pre-computed before compacting each level. This max output level is the max level whose key range overlaps with the manual compaction key range. However, during manual compaction, files in the max output level may be compacted down further by some background compaction. When this background compaction is a trivial move, there is a race condition and the manual compaction may not be able to compact all keys in the specified key range. This PR updates `CompactRange()` to always compact to the bottommost level to make this race condition more unlikely (it can still happen, see more in comment here:
|
|
Jay Huh | 87bc929db3 |
Flush option in WaitForCompact() (#11483)
Summary: Context: As mentioned in https://github.com/facebook/rocksdb/issues/11436, introducing `flush` option in `WaitForCompactOptions` to flush before waiting for compactions to finish. Must be set to true to ensure no immediate compactions (except perhaps periodic compactions) after closing and re-opening the DB. 1. `bool flush = false` added to `WaitForCompactOptions` 2. `DBImpl::FlushAllColumnFamilies()` is introduced and `DBImpl::FlushForGetLiveFiles()` is refactored to call it. 3. `DBImpl::FlushAllColumnFamilies()` gets called before waiting in `WaitForCompact()` if `flush` option is `true` 4. Some previous WaitForCompact tests were parameterized to include both cases for `abort_on_pause_` being true/false as well as `flush_` being true/false Pull Request resolved: https://github.com/facebook/rocksdb/pull/11483 Test Plan: - `DBCompactionTest::WaitForCompactWithOptionToFlush` added - Changed existing DBCompactionTest::WaitForCompact tests to `DBCompactionWaitForCompactTest` to include params Reviewed By: pdillinger Differential Revision: D46289770 Pulled By: jaykorean fbshipit-source-id: 70d3f461d96a6e06390be60170dd7c4d0d38f8b0 |
|
Yu Zhang | 56ca9e3106 |
Logging timestamp size record in WAL and use it during recovery (#11471)
Summary: Start logging the timestamp size record in WAL and use the record during recovery. Currently, user comparator cannot be different from what was used to create a column family, so the timestamp size record is just used to confirm it's consistent with the timestamp size the running user comparator indicates. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11471 Test Plan: ``` make all check ./db_secondary_test ./db_wal_test --gtest_filter="*WithTimestamp*" ./repair_test --gtest_filter="*WithTimestamp*" ``` Reviewed By: ltamasi Differential Revision: D46236769 Pulled By: jowlyzhang fbshipit-source-id: f6c60b5c8defdb05021c63df302ccc0be1275ad0 |
|
Jay Huh | 81aeb15988 |
Add WaitForCompact with WaitForCompactOptions to public API (#11436)
Summary: Context: This is the first PR for WaitForCompact() Implementation with WaitForCompactOptions. In this PR, we are introducing `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` in the public API. This currently utilizes the existing internal `WaitForCompact()` implementation (with default abort_on_pause = false). `abort_on_pause` has been moved to `WaitForCompactOptions&`. In the later PRs, we will introduce the following two options in `WaitForCompactOptions` 1. `bool flush = false` by default - If true, flush before waiting for compactions to finish. Must be set to true to ensure no immediate compactions (except perhaps periodic compactions) after closing and re-opening the DB. 2. `bool close_db = false` by default - If true, will also close the DB upon compactions finishing. 1. struct `WaitForCompactOptions` added to options.h and `abort_on_pause` in the internal API moved to the option struct. 2. `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` introduced in `db.h` 3. Changed the internal WaitForCompact() to `WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` and checks for the `abort_on_pause` inside the option. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11436 Test Plan: Following tests added - `DBCompactionTest::WaitForCompactWaitsOnCompactionToFinish` - `DBCompactionTest::WaitForCompactAbortOnPauseAborted` - `DBCompactionTest::WaitForCompactContinueAfterPauseNotAborted` - `DBCompactionTest::WaitForCompactShutdownWhileWaiting` - `TransactionTest::WaitForCompactAbortOnPause` NOTE: `TransactionTest::WaitForCompactAbortOnPause` was added to use `StackableDB` to ensure the wrapper function is in place. Reviewed By: pdillinger Differential Revision: D45799659 Pulled By: jaykorean fbshipit-source-id: b5b58f95957f2ab47d1221dee32a61d6cdc4685b |
|
Yu Zhang | 11ebddb1d4 |
Add utils to use for handling user defined timestamp size record in WAL (#11451)
Summary: Add a util method `HandleWriteBatchTimestampSizeDifference` to handle a `WriteBatch` read from WAL log when user-defined timestamp size record is written and read. Two check modes are added: `kVerifyConsistency` that just verifies the recorded timestamp size are consistent with the running ones. This mode is to be used by `db_impl_secondary` for opening a DB as secondary instance. It will also be used by `db_impl_open` before the user comparator switch support is added to make a column switch between enabling/disable UDT feature. The other mode `kReconcileInconsistency` will be used by `db_impl_open` later when user comparator can be changed. Another change is to extract a method `CollectColumnFamilyIdsFromWriteBatch` in db_secondary_impl.h into its standalone util file so it can be shared. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11451 Test Plan: ``` make check ./udt_util_test ``` Reviewed By: ltamasi Differential Revision: D45894386 Pulled By: jowlyzhang fbshipit-source-id: b96790777f154cddab6d45d9ba2e5d20ebc6fe9d |
|
mayue.fight | 8d8eb0e77e |
Support Clip DB to KeyRange (#11379)
Summary: This PR is part of the request https://github.com/facebook/rocksdb/issues/11317. (Another part is https://github.com/facebook/rocksdb/pull/11378) ClipDB() will clip the entries in the CF according to the range [begin_key, end_key). All the entries outside this range will be completely deleted (including tombstones). This feature is mainly used to ensure that there is no overlapping Key when calling CreateColumnFamilyWithImports() to import multiple CFs. When Calling ClipDB [begin, end), there are the following steps 1. Quickly and directly delete files without overlap DeleteFilesInRanges(nullptr, begin) + DeleteFilesInRanges(end, nullptr) 2. Delete the Key outside the range Delete[smallest_key, begin) + Delete[end, largest_key] 3. Delete the tombstone through Manul Compact CompactRange(option, nullptr, nullptr) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11379 Reviewed By: ajkr Differential Revision: D45840358 Pulled By: cbi42 fbshipit-source-id: 54152e8a45fd8ede137f99787eb252f0b51440a4 |
|
Jay Huh | 586d78b31e |
Remove wait_unscheduled from waitForCompact internal API (#11443)
Summary: Context: In pull request https://github.com/facebook/rocksdb/issues/11436, we are introducing a new public API `waitForCompact(const WaitForCompactOptions& wait_for_compact_options)`. This API invokes the internal implementation `waitForCompact(bool wait_unscheduled=false)`. The unscheduled parameter indicates the compactions that are not yet scheduled but are required to process items in the queue. In certain cases, we are unable to wait for compactions, such as during a shutdown or when background jobs are paused. It is important to return the appropriate status in these scenarios. For all other cases, we should wait for all compaction and flush jobs, including the unscheduled ones. The primary purpose of this new API is to wait until the system has resolved its compaction debt. Currently, the usage of `wait_unscheduled` is limited to test code. This pull request eliminates the usage of wait_unscheduled. The internal `waitForCompact()` API now waits for unscheduled compactions unless the db is undergoing a shutdown. In the event of a shutdown, the API returns `Status::ShutdownInProgress()`. Additionally, a new parameter, `abort_on_pause`, has been introduced with a default value of `false`. This parameter addresses the possibility of waiting indefinitely for unscheduled jobs if `PauseBackgroundWork()` was called before `waitForCompact()` is invoked. By setting `abort_on_pause` to `true`, the API will immediately return `Status::Aborted`. Furthermore, all tests that previously called `waitForCompact(true)` have been fixed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11443 Test Plan: Existing tests that involve a shutdown in progress: - DBCompactionTest::CompactRangeShutdownWhileDelayed - DBTestWithParam::PreShutdownMultipleCompaction - DBTestWithParam::PreShutdownCompactionMiddle Reviewed By: pdillinger Differential Revision: D45923426 Pulled By: jaykorean fbshipit-source-id: 7dc93fe6a6841a7d9d2d72866fa647090dba8eae |
|
Peter Dillinger | 206fdea3d9 |
Change internal headers with duplicate names (#11408)
Summary: In IDE navigation I find it annoying that there are two statistics.h files (etc.) and often land on the wrong one. Here I migrate several headers to use the blah.h <- blah_impl.h <- blah.cc idiom. Although clang-format wants "blah.h" to be the top include for "blah.cc", I think overall this is an improvement. No public API changes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11408 Test Plan: existing tests Reviewed By: ltamasi Differential Revision: D45456696 Pulled By: pdillinger fbshipit-source-id: 809d931253f3272c908cf5facf7e1d32fc507373 |
|
anand76 | 2084cdf237 |
Delete temp OPTIONS file on failure to write it (#11423)
Summary: When the DB is opened, RocksDB creates a temp OPTIONS file, writes the current options to it, and renames it. In case of a failure, the temp file is left behind, and is not deleted by PurgeObsoleteFiles(). Fix this by explicitly deleting the temp file if writing to it or renaming it fails. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11423 Test Plan: Add a unit test Reviewed By: akankshamahajan15 Differential Revision: D45540454 Pulled By: anand1976 fbshipit-source-id: 47facdc30d8cc5667036312d04b21d3fc253c92e |
|
Peter Dillinger | f4a02f2c52 |
Add hash_seed to Caches (#11391)
Summary: See motivation and description in new ShardedCacheOptions::hash_seed option. Updated db_bench so that its seed param is used for the cache hash seed. Made its code more safe to ensure seed is set before use. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11391 Test Plan: unit tests added / updated **Performance** - no discernible difference seen running cache_bench repeatedly before & after. With lru_cache and hyper_clock_cache. Reviewed By: hx235 Differential Revision: D45557797 Pulled By: pdillinger fbshipit-source-id: 40bf4da6d66f9d41a8a0eb8e5cf4246a4aa07934 |
|
akankshamahajan | 6ba4717f35 |
Fix build error: variable 'base_level' may be uninitialized (#11435)
Summary: Fix build error: variable 'base_level' may be uninitialized ``` db_impl_compaction_flush.cc:1195:21: error: variable 'base_level' may be uninitialized when used here [-Werror,-Wconditional-uninitialized] level = base_level; ``` ^~~~~~~~~~ Pull Request resolved: https://github.com/facebook/rocksdb/pull/11435 Test Plan: CircleCI jobs Reviewed By: cbi42 Differential Revision: D45708176 Pulled By: akankshamahajan15 fbshipit-source-id: 851b1205b22b63d728495e5735fa91b0ad8e012b |
|
Hui Xiao | 8f763bdeab |
Record and use the tail size to prefetch table tail (#11406)
Summary: **Context:** We prefetch the tail part of a SST file (i.e, the blocks after data blocks till the end of the file) during each SST file open in hope to prefetch all the stuff at once ahead of time for later read e.g, footer, meta index, filter/index etc. The existing approach to estimate the tail size to prefetch is through `TailPrefetchStats` heuristics introduced in https://github.com/facebook/rocksdb/pull/4156, which has caused small reads in unlucky case (e.g, small read into the tail buffer during table open in thread 1 under the same BlockBasedTableFactory object can make thread 2's tail prefetching use a small size that it shouldn't) and is hard to debug. Therefore we decide to record the exact tail size and use it directly to prefetch tail of the SST instead of relying heuristics. **Summary:** - Obtain and record in manifest the tail size in `BlockBasedTableBuilder::Finish()` - For backward compatibility, we fall back to TailPrefetchStats and last to simple heuristics that the tail size is a linear portion of the file size - see PR conversation for more. - Make`tail_start_offset` part of the table properties and deduct tail size to record in manifest for external files (e.g, file ingestion, import CF) and db repair (with no access to manifest). Pull Request resolved: https://github.com/facebook/rocksdb/pull/11406 Test Plan: 1. New UT 2. db bench Note: db bench on /tmp/ where direct read is supported is too slow to finish and the default pinning setting in db bench is not helpful to profile # sst read of Get. Therefore I hacked the following to obtain the following comparison. ``` diff --git a/table/block_based/block_based_table_reader.cc b/table/block_based/block_based_table_reader.cc index bd5669f0f..791484c1f 100644 --- a/table/block_based/block_based_table_reader.cc +++ b/table/block_based/block_based_table_reader.cc @@ -838,7 +838,7 @@ Status BlockBasedTable::PrefetchTail( &tail_prefetch_size); // Try file system prefetch - if (!file->use_direct_io() && !force_direct_prefetch) { + if (false && !file->use_direct_io() && !force_direct_prefetch) { if (!file->Prefetch(prefetch_off, prefetch_len, ro.rate_limiter_priority) .IsNotSupported()) { prefetch_buffer->reset(new FilePrefetchBuffer( diff --git a/tools/db_bench_tool.cc b/tools/db_bench_tool.cc index ea40f5fa0..39a0ac385 100644 --- a/tools/db_bench_tool.cc +++ b/tools/db_bench_tool.cc @@ -4191,6 +4191,8 @@ class Benchmark { std::shared_ptr<TableFactory>(NewCuckooTableFactory(table_options)); } else { BlockBasedTableOptions block_based_options; + block_based_options.metadata_cache_options.partition_pinning = + PinningTier::kAll; block_based_options.checksum = static_cast<ChecksumType>(FLAGS_checksum_type); if (FLAGS_use_hash_search) { ``` Create DB ``` ./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none ``` ReadRandom ``` ./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none ``` (a) Existing (Use TailPrefetchStats for tail size + use seperate prefetch buffer in PartitionedFilter/IndexReader::CacheDependencies()) ``` rocksdb.table.open.prefetch.tail.hit COUNT : 3395 rocksdb.sst.read.micros P50 : 5.655570 P95 : 9.931396 P99 : 14.845454 P100 : 585.000000 COUNT : 999905 SUM : 6590614 ``` (b) This PR (Record tail size + use the same tail buffer in PartitionedFilter/IndexReader::CacheDependencies()) ``` rocksdb.table.open.prefetch.tail.hit COUNT : 14257 rocksdb.sst.read.micros P50 : 5.173347 P95 : 9.015017 P99 : 12.912610 P100 : 228.000000 COUNT : 998547 SUM : 5976540 ``` As we can see, we increase the prefetch tail hit count and decrease SST read count with this PR 3. Test backward compatibility by stepping through reading with post-PR code on a db generated pre-PR. Reviewed By: pdillinger Differential Revision: D45413346 Pulled By: hx235 fbshipit-source-id: 7d5e36a60a72477218f79905168d688452a4c064 |
|
Changyu Bi | a11f1e12ca |
Fix flaky test `DBTestUniversalManualCompactionOutputPathId.ManualCompactionOutputPathId` (#11412)
Summary: the test is flaky when compiled with `make -j56 COERCE_CONTEXT_SWITCH=1 ./db_universal_compaction_test`. The cause is that a manual compaction `CompactRange()` can finish and return before obsolete files are deleted. One reason for this is that a manual compaction waits until `manual.done` is set here |
|
anand76 | 03a892a9fb |
Delete empty WAL files on reopen (#11409)
Summary: When a DB is opened, RocksDB creates an empty WAL file. When the DB is reopened and the WAL is empty, the min log number to keep is not advanced until a memtable flush happens. If a process crashes soon after reopening the DB, its likely that no memtable flush would have happened, which means the empty WAL file is not deleted. In a crash loop scenario, this leads to empty WAL files accumulating. Fix this by ensuring the min log number is advanced if the WAL is empty. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11409 Test Plan: Add a unit test Reviewed By: ajkr Differential Revision: D45281685 Pulled By: anand1976 fbshipit-source-id: 0225877c613e65ffb30972a0051db2830105423e |
|
leipeng | 40d69b59ad |
DBImpl::MultiGet: delete unused var `superversions_to_delete` (#11395)
Summary: In db_impl.cc DBImpl::MultiGet: delete unused var `superversions_to_delete` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11395 Reviewed By: ajkr Differential Revision: D45240896 Pulled By: cbi42 fbshipit-source-id: 0fff99b0d794b6f6d4aadee6036bddd6cb19eb31 |
|
Hui Xiao | 151242ce46 |
Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288)
Summary: **Context:** The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them. **Summary** - Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros` - Fixed the default histogram in `RandomAccessFileReader` - New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader` - Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288 Test Plan: - **Stress test** - **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.** (without blob) - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads. ``` ./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10) ``` ``` // BlockBasedTable rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805 rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116 rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689 // PlainTable Does not apply ``` - **Db bench 2: performance** **Read** SETUP: db with 900 files ``` ./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none ```run till convergence ``` ./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3 ``` Pre-change `readrandom [AVG 60 runs] : 21568 (± 248) ops/sec` Post-change (no regression, -0.3%) `readrandom [AVG 60 runs] : 21486 (± 236) ops/sec` **Compaction/Flush**run till convergence ``` ./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none rocksdb.sst.read.micros COUNT : 33820 rocksdb.sst.read.flush.micros COUNT : 1800 rocksdb.sst.read.compaction.micros COUNT : 32020 ``` Pre-change `fillseq [AVG 46 runs] : 1391 (± 214) ops/sec; 0.7 (± 0.1) MB/sec` Post-change (no regression, ~-0.4%) `fillseq [AVG 46 runs] : 1385 (± 216) ops/sec; 0.7 (± 0.1) MB/sec` Reviewed By: ajkr Differential Revision: D44007011 Pulled By: hx235 fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132 |
|
Changyu Bi | 43e9a60bb2 |
Always allow L0->L1 trivial move during manual compaction (#11375)
Summary: during manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with compacting key range (introduced in https://github.com/facebook/rocksdb/issues/7368 to enforce kForce* contract). This can cause large memory usage due to compaction readahead when number of L0 files is large. This PR allows L0->L1 trivial move in this case, and will do a L1 -> L1 intra-level compaction when needed (`bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 file, and user calls CompactRange(kForce, nullptr, nullptr), - before this PR, RocksDB does a L0 -> L1 compaction (disallow trivial move), - after this PR, RocksDB does a L0 -> L1 compaction (allow trivial move), and a L1 -> L1 compaction. Users can use kForceOptimized to avoid this extra L1->L1 compaction overhead when L0s are overlapping and cannot be trivial moved. This PR also fixed a bug (see previous discussion in https://github.com/facebook/rocksdb/issues/11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause incorrect level being moved when CompactRangeOptions::change_level is specified. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11375 Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed. Reviewed By: ajkr Differential Revision: D44943518 Pulled By: cbi42 fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708 |
|
Changyu Bi | b3c43a5b99 |
Drain unnecessary levels when `level_compaction_dynamic_level_bytes=true` (#11340)
Summary: When a user migrates to level compaction + `level_compaction_dynamic_level_bytes=true`, or when a DB shrinks, there can be unnecessary levels in the DB. Before this PR, this is no way to remove these levels except a manual compaction. These extra unnecessary levels make it harder to guarantee max_bytes_for_level_multiplier and can cause extra space amp. This PR boosts compaction score for these levels to allow RocksDB to automatically drain these levels. Together with https://github.com/facebook/rocksdb/issues/11321, this makes migration to `level_compaction_dynamic_level_bytes=true` automatic without needing user to do a one time full manual compaction. Credit: this PR is modified from https://github.com/facebook/rocksdb/issues/3921. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11340 Test Plan: - New unit tests - `python3 tools/db_crashtest.py whitebox --simple` which randomly sets level_compaction_dynamic_level_bytes in each run. Reviewed By: ajkr Differential Revision: D44563884 Pulled By: cbi42 fbshipit-source-id: e20d3620bd73dff22be18c5a91a07f340740bcc8 |
|
Changyu Bi | 601320164b |
Trivially move files down when opening db with level_compaction_dynamic_l… (#11321)
Summary: …evel_bytes During DB open, if a column family uses level compaction with level_compaction_dynamic_level_bytes=true, trivially move its files down in the LSM such that the bottommost files are in Lmax, the second from bottommost level files are in Lmax-1 and so on. This is aimed to make it easier to migrate level_compaction_dynamic_level_bytes from false to true. Before this change, a full manual compaction is suggested for such migration. After this change, user can just restart DB to turn on this option. db_crashtest.py is updated to randomly choose value for level_compaction_dynamic_level_bytes. Note that there may still be too many unnecessary levels if a user is migrating from universal compaction or level compaction with a smaller level multiplier. A full manual compaction may still be needed in that case before some PR that automatically drain unnecessary levels like https://github.com/facebook/rocksdb/issues/3921 lands. Eventually we may want to change the default value of option level_compaction_dynamic_level_bytes to true. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11321 Test Plan: 1. Added unit tests. 2. Crash test: ran a variation of db_crashtest.py (like 32516507e77521ae887e45091b69139e32e8efb7) that turns level_compaction_dynamic_level_bytes on and off and switches between LC and UC for the same DB. TODO: Update `OptionChangeMigration`, either after this PR or https://github.com/facebook/rocksdb/issues/3921. Reviewed By: ajkr Differential Revision: D44341930 Pulled By: cbi42 fbshipit-source-id: 013de19a915c6a0502be569f07c4cc8f1c3c6be2 |
|
Hui Xiao | cb58477185 |
New stat rocksdb.{cf|db}-write-stall-stats exposed in a structural way (#11300)
Summary: **Context/Summary:** Users are interested in figuring out what has caused write stall. - Refactor write stall related stats from property `kCFStats` into its own db property `rocksdb.cf-write-stall-stats` as a map or string. For now, this only contains count of different combination of (CF-scope `WriteStallCause`) + (`WriteStallCondition`) - Add new `WriteStallCause::kWriteBufferManagerLimit` to reflect write stall caused by write buffer manager - Add new `rocksdb.db-write-stall-stats`. For now, this only contains `WriteStallCause::kWriteBufferManagerLimit` + `WriteStallCondition::kStopped` - Expose functions in new class `WriteStallStatsMapKeys` for examining the above two properties returned as map - Misc: rename/comment some write stall InternalStats for clarity Pull Request resolved: https://github.com/facebook/rocksdb/pull/11300 Test Plan: - New UT - Stress test `python3 tools/db_crashtest.py blackbox --simple --get_property_one_in=1` - Perf test: Both converge very slowly at similar rates but post-change has higher average ops/sec than pre-change even though they are run at the same time. ``` ./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 ``` pre-change: ``` fillseq [AVG 15 runs] : 1176 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1052.671 micros/op 949 ops/sec 105.267 seconds 100000 operations; 0.5 MB/s fillseq [AVG 16 runs] : 1162 (± 685) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1387.330 micros/op 720 ops/sec 138.733 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1136 (± 646) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1232.011 micros/op 811 ops/sec 123.201 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1118 (± 610) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1282.567 micros/op 779 ops/sec 128.257 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1100 (± 578) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1914.336 micros/op 522 ops/sec 191.434 seconds 100000 operations; 0.3 MB/s fillseq [AVG 20 runs] : 1071 (± 551) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1227.510 micros/op 814 ops/sec 122.751 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1059 (± 525) ops/sec; 0.5 (± 0.3) MB/sec ``` post-change: ``` fillseq [AVG 15 runs] : 1226 (± 732) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1323.825 micros/op 755 ops/sec 132.383 seconds 100000 operations; 0.4 MB/s fillseq [AVG 16 runs] : 1196 (± 687) ops/sec; 0.6 (± 0.4) MB/sec fillseq : 1223.905 micros/op 817 ops/sec 122.391 seconds 100000 operations; 0.4 MB/s fillseq [AVG 17 runs] : 1174 (± 647) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1168.996 micros/op 855 ops/sec 116.900 seconds 100000 operations; 0.4 MB/s fillseq [AVG 18 runs] : 1156 (± 611) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1348.729 micros/op 741 ops/sec 134.873 seconds 100000 operations; 0.4 MB/s fillseq [AVG 19 runs] : 1134 (± 579) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1196.887 micros/op 835 ops/sec 119.689 seconds 100000 operations; 0.4 MB/s fillseq [AVG 20 runs] : 1119 (± 550) ops/sec; 0.6 (± 0.3) MB/sec fillseq : 1193.697 micros/op 837 ops/sec 119.370 seconds 100000 operations; 0.4 MB/s fillseq [AVG 21 runs] : 1106 (± 524) ops/sec; 0.6 (± 0.3) MB/sec ``` Reviewed By: ajkr Differential Revision: D44159541 Pulled By: hx235 fbshipit-source-id: 8d29efb70001fdc52d34535eeb3364fc3e71e40b |
|
Hui Xiao | 11cb6af6e5 |
Fix bug of prematurely excluded CF in atomic flush contains unflushed data that should've been included in the atomic flush (#11148)
Summary: **Context:** Atomic flush should guarantee recoverability of all data of seqno up to the max seqno of the flush. It achieves this by ensuring all such data are flushed by the time this atomic flush finishes through `SelectColumnFamiliesForAtomicFlush()`. However, our crash test exposed the following case where an excluded CF from an atomic flush contains unflushed data of seqno less than the max seqno of that atomic flush and loses its data with `WriteOptions::DisableWAL=true` in face of a crash right after the atomic flush finishes . ``` ./db_stress --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 & pid=$! sleep 0.2 sleep 10 kill $pid sleep 0.2 ./db_stress --ops_per_thread=1 --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 & pid=$! sleep 0.2 sleep 40 kill $pid sleep 0.2 Verification failed for column family 6 key 0000000000000239000000000000012B0000000000000138 (56622): value_from_db: , value_from_expected: 4A6331754E4F4C4D42434041464744455A5B58595E5F5C5D5253505156575455, msg: Value not found: NotFound: Crash-recovery verification failed :( No writes or ops? Verification failed :( ``` The bug is due to the following: - When atomic flush is used, an empty CF is legally [excluded](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L39) in `SelectColumnFamiliesForAtomicFlush` as the first step of `DBImpl::FlushForGetLiveFiles` before [passing](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L42) the included CFDs to `AtomicFlushMemTables`. - But [later](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2133) in `AtomicFlushMemTables`, `WaitUntilFlushWouldNotStallWrites` will [release the db mutex](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2403), during which data@seqno N can be inserted into the excluded CF and data@seqno M can be inserted into one of the included CFs, where M > N. - However, data@seqno N in an already-excluded CF is thus excluded from this atomic flush while we seqno N is less than seqno M. **Summary:** - Replace `SelectColumnFamiliesForAtomicFlush()`-before-`AtomicFlushMemTables()` with `SelectColumnFamiliesForAtomicFlush()`-after-wait-within-`AtomicFlushMemTables()` so we ensure no write affecting the recoverability of this atomic job (i.e, change to max seqno of this atomic flush or insertion of data with less seqno than the max seqno of the atomic flush to excluded CF) can happen after calling `SelectColumnFamiliesForAtomicFlush()`. - For above, refactored and clarified comments on `SelectColumnFamiliesForAtomicFlush()` and `AtomicFlushMemTables()` for clearer semantics of passed-in CFDs to atomic-flush Pull Request resolved: https://github.com/facebook/rocksdb/pull/11148 Test Plan: - New unit test failed before the fix and passes after - Make check - Rehearsal stress test Reviewed By: ajkr Differential Revision: D42799871 Pulled By: hx235 fbshipit-source-id: 13636b63e9c25c5895857afc36ea580d57f6d644 |
|
Igor Canadi | ddde1e6af8 |
Avoid ColumnFamilyDescriptor copy (#10978)
Summary: Hi. :) Noticed we are copying ColumnFamilyDescriptor here because my process crashed during copy constructor (cause unrelated) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10978 Reviewed By: cbi42 Differential Revision: D41473924 Pulled By: ajkr fbshipit-source-id: 58a3473f2d7b24918f79d4b2726c20081c5e95b4 |
|
ywave | 9fa9becf53 |
fix -Wrange-loop-analysis in Apple clang version 12.0.0 (clang-1200.0.32.29) (#11240)
Summary: Fix complain ``` db/db_impl/db_impl_compaction_flush.cc:417:19: error: loop variable 'bg_flush_arg' of type 'const rocksdb::DBImpl::BGFlushArg' creates a copy from type 'const rocksdb::DBImpl::BGFlushArg' [-Werror,-Wrange-loop-analysis] for (const auto bg_flush_arg : bg_flush_args) { ^ db/db_impl/db_impl_compaction_flush.cc:417:8: note: use reference type 'const rocksdb::DBImpl::BGFlushArg &' to prevent copying for (const auto bg_flush_arg : bg_flush_args) { ^~~~~~~~~~~~~~~~~~~~~~~~~ & db/db_impl/db_impl_compaction_flush.cc:2911:21: error: loop variable 'bg_flush_arg' of type 'const rocksdb::DBImpl::BGFlushArg' creates a copy from type 'const rocksdb::DBImpl::BGFlushArg' [-Werror,-Wrange-loop-analysis] for (const auto bg_flush_arg : bg_flush_args) { ^ db/db_impl/db_impl_compaction_flush.cc:2911:10: note: use reference type 'const rocksdb::DBImpl::BGFlushArg &' to prevent copying for (const auto bg_flush_arg : bg_flush_args) { ^~~~~~~~~~~~~~~~~~~~~~~~~ & ``` from ```sh xxx@MacBook-Pro / % g++ -v Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/4.2.1 Apple clang version 12.0.0 (clang-1200.0.32.29) Target: x86_64-apple-darwin21.6.0 Thread model: posix ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11240 Reviewed By: cbi42 Differential Revision: D43458729 Pulled By: ajkr fbshipit-source-id: 26e110f83451509463a1bc308f737ccb693c9f45 |