mirror of https://github.com/facebook/rocksdb.git
12937 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
Changyu Bi | bceb2dfe6a |
Introduce minimum compaction debt requirement for parallel compaction (#13054)
Summary: a small CF can trigger parallel compaction that applies to the entire DB. This is because the bottommost file size of a small CF can be too small compared to l0 files when a l0->lbase compaction happens. We prevent this by requiring some minimum on the compaction debt. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13054 Test Plan: updated unit test. Reviewed By: hx235 Differential Revision: D63861042 Pulled By: cbi42 fbshipit-source-id: 43bbf327988ef0ef912cd2fc700e3d096a8d2c18 |
|
Yu Zhang | 32dd657bad |
Add some per key optimization for UDT in memtable only feature (#13031)
Summary: This PR added some optimizations for the per key handling for SST file for the user-defined timestamps in Memtable only feature. CPU profiling shows this part is a big culprit for regression. This optimization saves some string construction/destruction/appending/copying. vector operations like reserve/emplace_back. When iterating keys in a block, we need to copy some shared bytes from previous key, put it together with the non shared bytes and find a right location to pad the min timestamp. Previously, we create a tmp local string buffer to first construct the key from its pieces, and then copying this local string's content into `IterKey`'s buffer. To avoid having this local string and to avoid this extra copy. Instead of piecing together the key in a local string first, we just track all the pieces that make this key in a reused Slice array. And then copy the pieces in order into `IterKey`'s buffer. Since the previous key should be kept intact while we are copying some shared bytes from it, we added a secondary buffer in `IterKey` and alternate between primary buffer and secondary buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13031 Test Plan: Existing tests. Reviewed By: ltamasi Differential Revision: D63416531 Pulled By: jowlyzhang fbshipit-source-id: 9819b0e02301a2dbc90621b2fe4f651bc912113c |
|
Levi Tamasi | 917e98ff9e |
Templatize MultiCfIteratorImpl to avoid std::function's overhead (#13052)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13052 Currently, `MultiCfIteratorImpl` uses `std::function`s for `reset_func_` and `populate_func_`, which uses type erasure and has a performance overhead. The patch turns `MultiCfIteratorImpl` into a template that takes the two function object types as template parameters, and changes `AttributeGroupIteratorImpl` and `CoalescingIterator` so they pass in function objects of named types (as opposed to lambdas). Reviewed By: jaykorean Differential Revision: D63802598 fbshipit-source-id: e202f6d80c9054335e5b2571051a67a9e012c2d0 |
|
Peter Dillinger | 12960f0b57 |
Disable uncache_aggressiveness with allow_mmap_reads (#13051)
Summary: There was a crash test Bus Error crash in `IndexBlockIter::SeekToFirstImpl()` <- .. <- `BlockBasedTable::~BlockBasedTable()` with `--mmap_read=1`, which suggests some kind of incompatibility that I haven't diagnosed. Bus Error is uncommon these days as CPUs support unaligned reads, but are associated with mmap problems. Because mmap reads really only make sense without block cache, it's not a concerning loss to essentially disable the combination. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13051 Test Plan: watch crash test Reviewed By: jowlyzhang Differential Revision: D63795069 Pulled By: pdillinger fbshipit-source-id: 6c823c619840086b5c9cff53dbc7470662b096be |
|
Yu Zhang | 9375c3b635 |
Fix `needs_flush` assertion in file ingestion (#13045)
Summary: This PR makes file ingestion job's flush wait a bit further until the SuperVersion is also updated. This is necessary since follow up operations will use the current SuperVersion to do range overlapping check and level assignment. In debug mode, file ingestion job's second `NeedsFlush` call could have been invoked when the memtables are flushed but the SuperVersion hasn't been updated yet, triggering the assertion. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13045 Test Plan: Existing tests Manually stress tested Reviewed By: cbi42 Differential Revision: D63671151 Pulled By: jowlyzhang fbshipit-source-id: 95a169e58a7e59f6dd4125e7296e9060fe4c63a7 |
|
Peter Dillinger | dd23e84cad |
Re-implement GetApproximateMemTableStats for skip lists (#13047)
Summary: GetApproximateMemTableStats() could return some bad results with the standard skip list memtable. See this new db_bench test showing the dismal distribution of results when the actual number of entries in range is 1000: ``` $ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=1000 ... filluniquerandom : 1.391 micros/op 718915 ops/sec 1.391 seconds 1000000 operations; 11.7 MB/s approximatememtablestats : 3.711 micros/op 269492 ops/sec 3.711 seconds 1000000 operations; Reported entry count stats (expected 1000): Count: 1000000 Average: 2344.1611 StdDev: 26587.27 Min: 0 Median: 965.8555 Max: 835273 Percentiles: P50: 965.86 P75: 1610.77 P99: 12618.01 P99.9: 74991.58 P99.99: 830970.97 ------------------------------------------------------ [ 0, 1 ] 131344 13.134% 13.134% ### ( 1, 2 ] 115 0.011% 13.146% ( 2, 3 ] 106 0.011% 13.157% ( 3, 4 ] 190 0.019% 13.176% ( 4, 6 ] 214 0.021% 13.197% ( 6, 10 ] 522 0.052% 13.249% ( 10, 15 ] 748 0.075% 13.324% ( 15, 22 ] 1002 0.100% 13.424% ( 22, 34 ] 1948 0.195% 13.619% ( 34, 51 ] 3067 0.307% 13.926% ( 51, 76 ] 4213 0.421% 14.347% ( 76, 110 ] 5721 0.572% 14.919% ( 110, 170 ] 11375 1.137% 16.056% ( 170, 250 ] 17928 1.793% 17.849% ( 250, 380 ] 36597 3.660% 21.509% # ( 380, 580 ] 77882 7.788% 29.297% ## ( 580, 870 ] 160193 16.019% 45.317% ### ( 870, 1300 ] 210098 21.010% 66.326% #### ( 1300, 1900 ] 167461 16.746% 83.072% ### ( 1900, 2900 ] 78678 7.868% 90.940% ## ( 2900, 4400 ] 47743 4.774% 95.715% # ( 4400, 6600 ] 17650 1.765% 97.480% ( 6600, 9900 ] 11895 1.190% 98.669% ( 9900, 14000 ] 4993 0.499% 99.168% ( 14000, 22000 ] 2384 0.238% 99.407% ( 22000, 33000 ] 1966 0.197% 99.603% ( 50000, 75000 ] 2968 0.297% 99.900% ( 570000, 860000 ] 999 0.100% 100.000% readrandom : 1.967 micros/op 508487 ops/sec 1.967 seconds 1000000 operations; 8.2 MB/s (1000000 of 1000000 found) ``` Perhaps the only good thing to say about the old implementation was that it was fast, though apparently not that fast. I've implemented a much more robust and reasonably fast new version of the function. It's still logarithmic but with some larger constant factors. The standard deviation from true count is around 20% or less, and roughly the CPU cost of two memtable point look-ups. See code comments for detail. ``` $ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=1000 ... filluniquerandom : 1.478 micros/op 676434 ops/sec 1.478 seconds 1000000 operations; 11.0 MB/s approximatememtablestats : 2.694 micros/op 371157 ops/sec 2.694 seconds 1000000 operations; Reported entry count stats (expected 1000): Count: 1000000 Average: 1073.5158 StdDev: 197.80 Min: 608 Median: 1079.9506 Max: 2176 Percentiles: P50: 1079.95 P75: 1223.69 P99: 1852.36 P99.9: 1898.70 P99.99: 2176.00 ------------------------------------------------------ ( 580, 870 ] 134848 13.485% 13.485% ### ( 870, 1300 ] 747868 74.787% 88.272% ############### ( 1300, 1900 ] 116536 11.654% 99.925% ## ( 1900, 2900 ] 748 0.075% 100.000% readrandom : 1.997 micros/op 500654 ops/sec 1.997 seconds 1000000 operations; 8.1 MB/s (1000000 of 1000000 found) ``` We can already see that the distribution of results is dramatically better and wonderfully normal-looking, with relative standard deviation around 20%. The function is also FASTER, at least with these parameters. Let's look how this behavior generalizes, first *much* larger range: ``` $ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=30000 filluniquerandom : 1.390 micros/op 719654 ops/sec 1.376 seconds 990000 operations; 11.7 MB/s approximatememtablestats : 1.129 micros/op 885649 ops/sec 1.129 seconds 1000000 operations; Reported entry count stats (expected 30000): Count: 1000000 Average: 31098.8795 StdDev: 3601.47 Min: 21504 Median: 29333.9303 Max: 43008 Percentiles: P50: 29333.93 P75: 33018.00 P99: 43008.00 P99.9: 43008.00 P99.99: 43008.00 ------------------------------------------------------ ( 14000, 22000 ] 408 0.041% 0.041% ( 22000, 33000 ] 749327 74.933% 74.974% ############### ( 33000, 50000 ] 250265 25.027% 100.000% ##### readrandom : 1.894 micros/op 528083 ops/sec 1.894 seconds 1000000 operations; 8.5 MB/s (989989 of 1000000 found) ``` This is *even faster* and relatively *more accurate*, with relative standard deviation closer to 10%. Code comments explain why. Now let's look at smaller ranges. Implementation quirks or conveniences: * When actual number in range is >= 40, the minimum return value is 40. * When the actual is <= 10, it is guaranteed to return that actual number. ``` $ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=75 ... filluniquerandom : 1.417 micros/op 705668 ops/sec 1.417 seconds 999975 operations; 11.4 MB/s approximatememtablestats : 3.342 micros/op 299197 ops/sec 3.342 seconds 1000000 operations; Reported entry count stats (expected 75): Count: 1000000 Average: 75.1210 StdDev: 15.02 Min: 40 Median: 71.9395 Max: 256 Percentiles: P50: 71.94 P75: 89.69 P99: 119.12 P99.9: 166.68 P99.99: 229.78 ------------------------------------------------------ ( 34, 51 ] 38867 3.887% 3.887% # ( 51, 76 ] 550554 55.055% 58.942% ########### ( 76, 110 ] 398854 39.885% 98.828% ######## ( 110, 170 ] 11353 1.135% 99.963% ( 170, 250 ] 364 0.036% 99.999% ( 250, 380 ] 8 0.001% 100.000% readrandom : 1.861 micros/op 537224 ops/sec 1.861 seconds 1000000 operations; 8.7 MB/s (999974 of 1000000 found) $ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=25 ... filluniquerandom : 1.501 micros/op 666283 ops/sec 1.501 seconds 1000000 operations; 10.8 MB/s approximatememtablestats : 5.118 micros/op 195401 ops/sec 5.118 seconds 1000000 operations; Reported entry count stats (expected 25): Count: 1000000 Average: 26.2392 StdDev: 4.58 Min: 25 Median: 28.4590 Max: 72 Percentiles: P50: 28.46 P75: 31.69 P99: 49.27 P99.9: 67.95 P99.99: 72.00 ------------------------------------------------------ ( 22, 34 ] 928936 92.894% 92.894% ################### ( 34, 51 ] 67960 6.796% 99.690% # ( 51, 76 ] 3104 0.310% 100.000% readrandom : 1.892 micros/op 528595 ops/sec 1.892 seconds 1000000 operations; 8.6 MB/s (1000000 of 1000000 found) $ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=10 ... filluniquerandom : 1.642 micros/op 608916 ops/sec 1.642 seconds 1000000 operations; 9.9 MB/s approximatememtablestats : 3.042 micros/op 328721 ops/sec 3.042 seconds 1000000 operations; Reported entry count stats (expected 10): Count: 1000000 Average: 10.0000 StdDev: 0.00 Min: 10 Median: 10.0000 Max: 10 Percentiles: P50: 10.00 P75: 10.00 P99: 10.00 P99.9: 10.00 P99.99: 10.00 ------------------------------------------------------ ( 6, 10 ] 1000000 100.000% 100.000% #################### readrandom : 1.805 micros/op 554126 ops/sec 1.805 seconds 1000000 operations; 9.0 MB/s (1000000 of 1000000 found) ``` Remarkably consistent. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13047 Test Plan: new db_bench test for both performance and accuracy (see above); added to crash test; unit test updated. Reviewed By: cbi42 Differential Revision: D63722003 Pulled By: pdillinger fbshipit-source-id: cfc8613c085e87c17ecec22d82601aac2a5a1b26 |
|
Changyu Bi | 389e66bef5 |
Add comment for memory usage in BeginTransaction() and WriteBatch::Clear() (#13042)
Summary: ... to note that memory may not be freed when reusing a transaction. This means reusing a large transaction can cause excessive memory usage and it may be better to destruct the transaction object in some cases. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13042 Test Plan: no code change. Reviewed By: jowlyzhang Differential Revision: D63570612 Pulled By: cbi42 fbshipit-source-id: f19ff556f76d54831fb94715e8808035d07e25fa |
|
Yu Zhang | 2c2776f1f3 |
Fix some missing values in stress test (#13039)
Summary: When `avoid_flush_during_shutdown` is false, DB will flush the memtables if there is some unpersisted data: |
|
Peter Dillinger | 79790cf2a8 |
Bug fix and test BuildDBOptions (#13038)
Summary: The following DBOptions were not being propagated through BuildDBOptions, which could at least lead to settings being lost through `GetOptionsFromString()`, possibly elsewhere as well: * background_close_inactive_wals * write_dbid_to_manifest * write_identity_file * prefix_seek_opt_in_only Pull Request resolved: https://github.com/facebook/rocksdb/pull/13038 Test Plan: This problem was not being caught by OptionsSettableTest.DBOptionsAllFieldsSettable when the option was omitted from both options_helper.cc and options_settable_test.cc. I have added to the test to catch future instances (and the updated test was how I found three of the four missing options). The same kind of bug seems to be caught by ColumnFamilyOptionsAllFieldsSettable, and AFAIK analogous code does not exist for BlockBasedTableOptions. Reviewed By: ltamasi Differential Revision: D63483779 Pulled By: pdillinger fbshipit-source-id: a5d5f6e434174bacb8e5d251b767e81e62b7225a |
|
anand76 | 22105366d9 |
More accurate accounting of compressed cache memory (#13032)
Summary: When an item is inserted into the compressed secondary cache, this PR calculates the charge using the malloc_usable_size of the allocated memory, as well as the unique pointer allocation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13032 Test Plan: New unit test Reviewed By: pdillinger Differential Revision: D63418493 Pulled By: anand1976 fbshipit-source-id: 1db2835af6867442bb8cf6d9bf412e120ddd3824 |
|
Jay Huh | d327d56081 |
Remove unnecessary semi-colon (#13034)
Summary: As title Pull Request resolved: https://github.com/facebook/rocksdb/pull/13034 Reviewed By: ltamasi Differential Revision: D63413712 Pulled By: jaykorean fbshipit-source-id: 0070761b0d9de1f50fe0baf235643d36aeb9f7f5 |
|
anand76 | d02f63cc54 |
Respect lowest_used_cache_tier option for compressed blocks (#13030)
Summary: If the lowest_used_cache_tier DB option is set to kVolatileTier, skip insertion of compressed blocks into the secondary cache. Previously, these were always inserted into the secondary cache via the InsertSaved() method, leading to pollution of the secondary cache with blocks that would never be read. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13030 Test Plan: Add a new unit test Reviewed By: pdillinger Differential Revision: D63329841 Pulled By: anand1976 fbshipit-source-id: 14d2fce2ed309401d9ad4d2e7c356218b6673f7b |
|
Jay Huh | 2a5ff78c12 |
More info in CompactionServiceJobInfo and CompactionJobStats (#13029)
Summary: Add the following to the `CompactionServiceJobInfo` - compaction_reason - is_full_compaction - is_manual_compaction - bottommost_level Added `is_remote_compaction` to the `CompactionJobStats` and set initial values to avoid UB for uninitialized values. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13029 Test Plan: ``` ./compaction_service_test --gtest_filter="*CompactionInfo*" ``` Reviewed By: anand1976 Differential Revision: D63322878 Pulled By: jaykorean fbshipit-source-id: f02a66ca45e660b9d354a43837d8ec6beb7621fb |
|
Levi Tamasi | fbbb08770f |
Update HISTORY.md, version.h, and the format compatibility check script for the 9.7 release (#13027)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13027 Reviewed By: jowlyzhang Differential Revision: D63158344 Pulled By: ltamasi fbshipit-source-id: e650a0024155d52c7aa2afd0f242b8363071279a |
|
Peter Dillinger | a1a102ffce |
Steps toward deprecating implicit prefix seek, related fixes (#13026)
Summary: With some new use cases onboarding to prefix extractors/seek/filters, one of the risks is existing iterator code, e.g. for maintenance tasks, being unintentionally subject to prefix seek semantics. This is a longstanding known design flaw with prefix seek, and `prefix_same_as_start` and `auto_prefix_mode` were steps in the direction of making that obsolete. However, we can't just immediately set `total_order_seek` to true by default, because that would impact so much code instantly. Here we add a new DB option, `prefix_seek_opt_in_only` that basically allows users to transition to the future behavior when they are ready. When set to true, all iterators will be treated as if `total_order_seek=true` and then the only ways to get prefix seek semantics are with `prefix_same_as_start` or `auto_prefix_mode`. Related fixes / changes: * Make sure that `prefix_same_as_start` and `auto_prefix_mode` are compatible with (or override) `total_order_seek` (depending on your interpretation). * Fix a bug in which a new iterator after dynamically changing the prefix extractor might mix different prefix semantics between memtable and SSTs. Both should use the latest extractor semantics, which means iterators ignoring memtable prefix filters with an old extractor. And that means passing the latest prefix extractor to new memtable iterators that might use prefix seek. (Without the fix, the test added for this fails in many ways.) Suggested follow-up: * Investigate a FIXME where a MergeIteratorBuilder is created in db_impl.cc. No unit test detects a change in value that should impact correctness. * Make memtable prefix bloom compatible with `auto_prefix_mode`, which might require involving the memtablereps because we don't know at iterator creation time (only seek time) whether an auto_prefix_mode seek will be a prefix seek. * Add `prefix_same_as_start` testing to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/13026 Test Plan: tests updated, added. Add combination of `total_order_seek=true` and `auto_prefix_mode=true` to stress test. Ran `make blackbox_crash_test` for a long while. Manually ran tests with `prefix_seek_opt_in_only=true` as default, looking for unexpected issues. I inspected most of the results and migrated many tests to be ready for such a change (but not all). Reviewed By: ltamasi Differential Revision: D63147378 Pulled By: pdillinger fbshipit-source-id: 1f4477b730683d43b4be7e933338583702d3c25e |
|
Jay Huh | 5f4a8c3da4 |
Load latest options from OPTIONS file in Remote host (#13025)
Summary: We've been serializing and deserializing DBOptions and CFOptions (and other CF into) as part of `CompactionServiceInput`. These are all readily available in the OPTIONS file and the remote worker can read the OPTIONS file to obtain the same information. This helps reducing the size of payload significantly. In a very rare scenario if the OPTIONS file is purged due to options change by primary host at the same time while the remote host is loading the latest options, it may fail. In this case, we just retry once. This also solves the problem where we had to open the default CF with the CFOption from another CF if the remote compaction is for a non-default column family. (TODO comment in /db_impl_secondary.cc) Pull Request resolved: https://github.com/facebook/rocksdb/pull/13025 Test Plan: Unit Tests ``` ./compaction_service_test ``` ``` ./compaction_job_test ``` Also tested with Meta's internal Offload Infra Reviewed By: anand1976, cbi42 Differential Revision: D63100109 Pulled By: jaykorean fbshipit-source-id: b7162695e31e2c5a920daa7f432842163a5b156d |
|
anand76 | 6549b11714 |
Make Cache a customizable class (#13024)
Summary: This PR allows a Cache object to be created using the object registry. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13024 Reviewed By: pdillinger Differential Revision: D63043233 Pulled By: anand1976 fbshipit-source-id: 5bc3f7c29b35ad62638ff8205451303e2cecea9d |
|
Changyu Bi | 71e38dbe25 |
Compact one file at a time for FIFO temperature change compactions (#13018)
Summary: Per customer request, we should not merge multiple SST files together during temperature change compaction, since this can cause FIFO TTL compactions to be delayed. This PR changes the compaction picking logic to pick one file at a time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13018 Test Plan: * updated some existing unit tests to test this new behavior. Reviewed By: jowlyzhang Differential Revision: D62883292 Pulled By: cbi42 fbshipit-source-id: 6a9fc8c296b5d9b17168ef6645f25153241c8b93 |
|
Levi Tamasi | 54ace7f340 |
Change the semantics of blob_garbage_collection_force_threshold to provide better control over space amp (#13022)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13022 Currently, `blob_garbage_collection_force_threshold` applies to the oldest batch of blob files, which is typically only a small subset of the blob files currently eligible for garbage collection. This can result in a form of head-of-line blocking: no GC-triggered compactions will be scheduled if the oldest batch does not currently exceed the threshold, even if a lot of higher-numbered blob files do. This can in turn lead to high space amplification that exceeds the soft bound implicit in the force threshold (e.g. 50% would suggest a space amp of <2 and 75% would imply a space amp of <4). The patch changes the semantics of this configuration threshold to apply to the entire set of blob files that are eligible for garbage collection based on `blob_garbage_collection_age_cutoff`. This provides more intuitive semantics for the option and can provide a better write amp/space amp trade-off. (Note that GC-triggered compactions still pick the same SST files as before, so triggered GC still targets the oldest the blob files.) Reviewed By: jowlyzhang Differential Revision: D62977860 fbshipit-source-id: a999f31fe9cdda313de513f0e7a6fc707424d4a3 |
|
Peter Dillinger | 98c33cb8e3 |
Steps toward making IDENTITY file obsolete (#13019)
Summary: * Set write_dbid_to_manifest=true by default * Add new option write_identity_file (default true) that allows us to opt-in to future behavior without identity file * Refactor related DB open code to minimize code duplication _Recommend hiding whitespace changes for review_ Intended follow-up: add support to ldb for reading and even replacing the DB identity in the manifest. Could be a variant of `update_manifest` command or based on it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13019 Test Plan: unit tests and stress test updated for new functionality Reviewed By: anand1976 Differential Revision: D62898229 Pulled By: pdillinger fbshipit-source-id: c08b25cf790610b034e51a9de0dc78b921abbcf0 |
|
Yu Zhang | 1238120fe6 |
Add an option to dump wal seqno gaps (#13014)
Summary: Add an option `--only_print_seqno_gaps` for wal dump to help with debugging. This option will check the continuity of sequence numbers in WAL logs, assuming `seq_per_batch` is false. `--walfile` option now also takes a directory, and it will check all WAL logs in the directory in chronological order. When a gap is found, we can further check if it's related to operations like external file ingestion. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13014 Test Plan: Manually tested Reviewed By: ltamasi Differential Revision: D62989115 Pulled By: jowlyzhang fbshipit-source-id: 22e3326344e7969ff9d5091d21fec2935770fbc7 |
|
Peter Dillinger | 10984e8c26 |
Fix and generalize framework for filtering range queries, etc. (#13005)
Summary: There was a subtle design/contract bug in the previous version of range filtering in experimental.h If someone implemented a key segments extractor with "all or nothing" fixed size segments, that could result in unsafe range filtering. For example, with two segments of width 3: ``` x = 0x|12 34 56|78 9A 00| y = 0x|12 34 56||78 9B z = 0x|12 34 56|78 9C 00| ``` Segment 1 of y (empty) is out of order with segment 1 of x and z. I have re-worked the contract to make it clear what does work, and implemented a standard extractor for fixed-size segments, CappedKeySegmentsExtractor. The safe approach for filtering is to consume as much as is available for a segment in the case of a short key. I have also added support for min-max filtering with reverse byte-wise comparator, which is probably the 2nd most common comparator for RocksDB users (because of MySQL). It might seem that a min-max filter doesn't care about forward or reverse ordering, but it does when trying to determine whether in input range from segment values v1 to v2, where it so happens that v2 is byte-wise less than v1, is an empty forward interval or a non-empty reverse interval. At least in the current setup, we don't have that context. A new unit test (with some refactoring) tests CappedKeySegmentsExtractor, reverse byte-wise comparator, and the corresponding min-max filter. I have also (contractually / mathematically) generalized the framework to comparators other than the byte-wise comparator, and made other generalizations to make the extractor limitations more explicitly connected to the particular filters and filtering used--at least in description. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13005 Test Plan: added unit tests as described Reviewed By: jowlyzhang Differential Revision: D62769784 Pulled By: pdillinger fbshipit-source-id: 0d41f0d0273586bdad55e4aa30381ebc861f7044 |
|
Nick Brekhus | 0611eb5b9d |
Fix orphaned files in SstFileManager (#13015)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13015 `Close()`ing a database now releases tracked files in `SstFileManager`. Previously this space would be leaked until the database was later reopened. Reviewed By: jowlyzhang Differential Revision: D62590773 fbshipit-source-id: 5461bd253d974ac4967ad52fee92e2650f8a9a28 |
|
Yu Zhang | f411c8bc97 |
Update folly Github hash (#13017)
Summary:
The internal codebase is updated for the coro directory's graduation from experimental. Updating our build script for a newer version with this change too. Using this hash:
|
|
Changyu Bi | f97e33454f |
Fix a bug with auto recovery on WAL write error (#12995)
Summary:
A recent crash test failure shows that auto recovery from WAL write failure can cause CFs to be inconsistent. A unit test repro in P1569398553. The following is an example sequence of events:
```
0. manual_wal_flush is true. There are multiple CFs in a DB.
1. Submit a write batch with updates to multiple CF
2. A FlushWAL or a memtable swtich that will try to write the buffered WAL data. Fail this write so that buffered WAL data is dropped:
|
|
Jon Janzen | a164576252 |
Remove last user of AutoHeaders.RECURSIVE_GLOB
Summary: I came across this code while buckifying parts of folly and fizz in open source. This is pretty hacky code and cleaning it up doesn't seem that hard, so I did it. Reviewed By: zertosh, pdillinger Differential Revision: D62781766 fbshipit-source-id: 43714bce992c53149d1e619063d803297362fb5d |
|
leipeng | 8648fbcba3 |
Add missing RemapFileSystem::ReopenWritableFile (#12941)
Summary: `RemapFileSystem::ReopenWritableFile` is missing, add it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12941 Reviewed By: pdillinger Differential Revision: D61822540 Pulled By: cbi42 fbshipit-source-id: dc228f7e8b842216f63de8b925cb663898455345 |
|
Nicholas Ormrod | 0e04ef1a96 |
Deshim coro in fbcode/internal_repo_rocksdb
Summary: The following rules were deshimmed: ``` //folly/experimental/coro:accumulate -> //folly/coro:accumulate //folly/experimental/coro:async_generator -> //folly/coro:async_generator //folly/experimental/coro:async_pipe -> //folly/coro:async_pipe //folly/experimental/coro:async_scope -> //folly/coro:async_scope //folly/experimental/coro:async_stack -> //folly/coro:async_stack //folly/experimental/coro:baton -> //folly/coro:baton //folly/experimental/coro:blocking_wait -> //folly/coro:blocking_wait //folly/experimental/coro:collect -> //folly/coro:collect //folly/experimental/coro:concat -> //folly/coro:concat //folly/experimental/coro:coroutine -> //folly/coro:coroutine //folly/experimental/coro:current_executor -> //folly/coro:current_executor //folly/experimental/coro:detach_on_cancel -> //folly/coro:detach_on_cancel //folly/experimental/coro:detail_barrier -> //folly/coro:detail_barrier //folly/experimental/coro:detail_barrier_task -> //folly/coro:detail_barrier_task //folly/experimental/coro:detail_current_async_frame -> //folly/coro:detail_current_async_frame //folly/experimental/coro:detail_helpers -> //folly/coro:detail_helpers //folly/experimental/coro:detail_malloc -> //folly/coro:detail_malloc //folly/experimental/coro:detail_manual_lifetime -> //folly/coro:detail_manual_lifetime //folly/experimental/coro:detail_traits -> //folly/coro:detail_traits //folly/experimental/coro:filter -> //folly/coro:filter //folly/experimental/coro:future_util -> //folly/coro:future_util //folly/experimental/coro:generator -> //folly/coro:generator //folly/experimental/coro:gmock_helpers -> //folly/coro:gmock_helpers //folly/experimental/coro:gtest_helpers -> //folly/coro:gtest_helpers //folly/experimental/coro:inline_task -> //folly/coro:inline_task //folly/experimental/coro:invoke -> //folly/coro:invoke //folly/experimental/coro:merge -> //folly/coro:merge //folly/experimental/coro:mutex -> //folly/coro:mutex //folly/experimental/coro:promise -> //folly/coro:promise //folly/experimental/coro:result -> //folly/coro:result //folly/experimental/coro:retry -> //folly/coro:retry //folly/experimental/coro:rust_adaptors -> //folly/coro:rust_adaptors //folly/experimental/coro:scope_exit -> //folly/coro:scope_exit //folly/experimental/coro:shared_lock -> //folly/coro:shared_lock //folly/experimental/coro:shared_mutex -> //folly/coro:shared_mutex //folly/experimental/coro:sleep -> //folly/coro:sleep //folly/experimental/coro:small_unbounded_queue -> //folly/coro:small_unbounded_queue //folly/experimental/coro:task -> //folly/coro:task //folly/experimental/coro:timed_wait -> //folly/coro:timed_wait //folly/experimental/coro:timeout -> //folly/coro:timeout //folly/experimental/coro:traits -> //folly/coro:traits //folly/experimental/coro:transform -> //folly/coro:transform //folly/experimental/coro:unbounded_queue -> //folly/coro:unbounded_queue //folly/experimental/coro:via_if_async -> //folly/coro:via_if_async //folly/experimental/coro:with_async_stack -> //folly/coro:with_async_stack //folly/experimental/coro:with_cancellation -> //folly/coro:with_cancellation //folly/experimental/coro:bounded_queue -> //folly/coro:bounded_queue //folly/experimental/coro:shared_promise -> //folly/coro:shared_promise //folly/experimental/coro:cleanup -> //folly/coro:cleanup //folly/experimental/coro:auto_cleanup_fwd -> //folly/coro:auto_cleanup_fwd //folly/experimental/coro:auto_cleanup -> //folly/coro:auto_cleanup ``` The following headers were deshimmed: ``` folly/experimental/coro/Accumulate.h -> folly/coro/Accumulate.h folly/experimental/coro/Accumulate-inl.h -> folly/coro/Accumulate-inl.h folly/experimental/coro/AsyncGenerator.h -> folly/coro/AsyncGenerator.h folly/experimental/coro/AsyncPipe.h -> folly/coro/AsyncPipe.h folly/experimental/coro/AsyncScope.h -> folly/coro/AsyncScope.h folly/experimental/coro/AsyncStack.h -> folly/coro/AsyncStack.h folly/experimental/coro/Baton.h -> folly/coro/Baton.h folly/experimental/coro/BlockingWait.h -> folly/coro/BlockingWait.h folly/experimental/coro/Collect.h -> folly/coro/Collect.h folly/experimental/coro/Collect-inl.h -> folly/coro/Collect-inl.h folly/experimental/coro/Concat.h -> folly/coro/Concat.h folly/experimental/coro/Concat-inl.h -> folly/coro/Concat-inl.h folly/experimental/coro/Coroutine.h -> folly/coro/Coroutine.h folly/experimental/coro/CurrentExecutor.h -> folly/coro/CurrentExecutor.h folly/experimental/coro/DetachOnCancel.h -> folly/coro/DetachOnCancel.h folly/experimental/coro/detail/Barrier.h -> folly/coro/detail/Barrier.h folly/experimental/coro/detail/BarrierTask.h -> folly/coro/detail/BarrierTask.h folly/experimental/coro/detail/CurrentAsyncFrame.h -> folly/coro/detail/CurrentAsyncFrame.h folly/experimental/coro/detail/Helpers.h -> folly/coro/detail/Helpers.h folly/experimental/coro/detail/Malloc.h -> folly/coro/detail/Malloc.h folly/experimental/coro/detail/ManualLifetime.h -> folly/coro/detail/ManualLifetime.h folly/experimental/coro/detail/Traits.h -> folly/coro/detail/Traits.h folly/experimental/coro/Filter.h -> folly/coro/Filter.h folly/experimental/coro/Filter-inl.h -> folly/coro/Filter-inl.h folly/experimental/coro/FutureUtil.h -> folly/coro/FutureUtil.h folly/experimental/coro/Generator.h -> folly/coro/Generator.h folly/experimental/coro/GmockHelpers.h -> folly/coro/GmockHelpers.h folly/experimental/coro/GtestHelpers.h -> folly/coro/GtestHelpers.h folly/experimental/coro/detail/InlineTask.h -> folly/coro/detail/InlineTask.h folly/experimental/coro/Invoke.h -> folly/coro/Invoke.h folly/experimental/coro/Merge.h -> folly/coro/Merge.h folly/experimental/coro/Merge-inl.h -> folly/coro/Merge-inl.h folly/experimental/coro/Mutex.h -> folly/coro/Mutex.h folly/experimental/coro/Promise.h -> folly/coro/Promise.h folly/experimental/coro/Result.h -> folly/coro/Result.h folly/experimental/coro/Retry.h -> folly/coro/Retry.h folly/experimental/coro/RustAdaptors.h -> folly/coro/RustAdaptors.h folly/experimental/coro/ScopeExit.h -> folly/coro/ScopeExit.h folly/experimental/coro/SharedLock.h -> folly/coro/SharedLock.h folly/experimental/coro/SharedMutex.h -> folly/coro/SharedMutex.h folly/experimental/coro/Sleep.h -> folly/coro/Sleep.h folly/experimental/coro/Sleep-inl.h -> folly/coro/Sleep-inl.h folly/experimental/coro/SmallUnboundedQueue.h -> folly/coro/SmallUnboundedQueue.h folly/experimental/coro/Task.h -> folly/coro/Task.h folly/experimental/coro/TimedWait.h -> folly/coro/TimedWait.h folly/experimental/coro/Timeout.h -> folly/coro/Timeout.h folly/experimental/coro/Timeout-inl.h -> folly/coro/Timeout-inl.h folly/experimental/coro/Traits.h -> folly/coro/Traits.h folly/experimental/coro/Transform.h -> folly/coro/Transform.h folly/experimental/coro/Transform-inl.h -> folly/coro/Transform-inl.h folly/experimental/coro/UnboundedQueue.h -> folly/coro/UnboundedQueue.h folly/experimental/coro/ViaIfAsync.h -> folly/coro/ViaIfAsync.h folly/experimental/coro/WithAsyncStack.h -> folly/coro/WithAsyncStack.h folly/experimental/coro/WithCancellation.h -> folly/coro/WithCancellation.h folly/experimental/coro/BoundedQueue.h -> folly/coro/BoundedQueue.h folly/experimental/coro/SharedPromise.h -> folly/coro/SharedPromise.h folly/experimental/coro/Cleanup.h -> folly/coro/Cleanup.h folly/experimental/coro/AutoCleanup-fwd.h -> folly/coro/AutoCleanup-fwd.h folly/experimental/coro/AutoCleanup.h -> folly/coro/AutoCleanup.h ``` This is a codemod. It was automatically generated and will be landed once it is approved and tests are passing in sandcastle. You have been added as a reviewer by Sentinel or Butterfly. Autodiff project: dcoro Autodiff partition: fbcode.internal_repo_rocksdb Autodiff bookmark: ad.dcoro.fbcode.internal_repo_rocksdb Reviewed By: dtolnay Differential Revision: D62684411 fbshipit-source-id: 8dbd31ab64fcdd99435d322035b9668e3200e0a3 |
|
Nick Brekhus | 40adb2bab7 |
Fix wraparound in SstFileManager (#13010)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13010 The OnAddFile cur_compactions_reserved_size_ accounting causes wraparound when re-opening a database with an unowned SstFileManager and during recovery. It was introduced in #4164 which addresses out of space recovery with an unclear purpose. Compaction jobs do this accounting via EnoughRoomForCompaction/OnCompactionCompletion and to my understanding would never reuse a sst file name. Reviewed By: anand1976 Differential Revision: D62535775 fbshipit-source-id: a7c44d6e0a4b5ff74bc47abfe57c32ca6770243d |
|
anand76 | cabd2d8718 |
Fix a couple of missing cases of retry on corruption (#13007)
Summary: For SST checksum mismatch corruptions in the read path, RocksDB retries the read if the underlying file system supports verification and reconstruction of data (`FSSupportedOps::kVerifyAndReconstructRead`). There were a couple of places where the retry was missing - reading the SST footer and the properties block. This PR fixes the retry in those cases. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13007 Test Plan: Add new unit tests Reviewed By: jaykorean Differential Revision: D62519186 Pulled By: anand1976 fbshipit-source-id: 50aa38f18f2a53531a9fc8d4ccdf34fbf034ed59 |
|
Changyu Bi | e490f2b051 |
Fix a bug in ReFitLevel() where `FileMetaData::being_compacted` is not cleared (#13009)
Summary: in ReFitLevel(), we were not setting being_compacted to false after ReFitLevel() is done. This is not a issue if refit level is successful, since new FileMetaData is created for files at the target level. However, if there's an error during RefitLevel(), e.g., Manifest write failure, we should clear the being_compacted field for these files. Otherwise, these files will not be picked for compaction until db reopen. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13009 Test Plan: existing test. - stress test failure in T200339331 should not happen anymore. Reviewed By: hx235 Differential Revision: D62597169 Pulled By: cbi42 fbshipit-source-id: 0ba659806da6d6d4b42384fc95268b2d7bad720e |
|
Yu Zhang | 43bc71fef6 |
Add an internal API MemTableList::GetEditForDroppingCurrentVersion (#13001)
Summary: Prepare this internal API to be used by atomic data replacement. The main purpose of this API is to get a `VersionEdit` to mark the entire current `MemTableListVersion` as dropped. Flush needs the similar functionality when installing results, so that logic is refactored into a util function `GetDBRecoveryEditForObsoletingMemTables` to be shared by flush and this internal API. To test this internal API, flush's result installation is redirected to use this API when it is flushing all the immutable MemTables in debug mode. It should achieve the exact same results, just with a duplicated `VersionEdit::log_number` field that doesn't upsets the recovery logic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13001 Test Plan: Existing tests Reviewed By: pdillinger Differential Revision: D62309591 Pulled By: jowlyzhang fbshipit-source-id: e25914d9a2e281c25ab7ee31a66eaf6adfae4b88 |
|
Levi Tamasi | 55ac0b729e |
Support building db_bench using buck (#13004)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13004 The patch extends the buckifier script so it generates a target for `db_bench` as well. Reviewed By: cbi42 Differential Revision: D62407071 fbshipit-source-id: 0cb98a324ce0598ad84a8675aa77b7d0f91bf40c |
|
anand76 | f48b64460e |
Provide a way to invoke a callback for a Cache handle (#12987)
Summary: Add the `ApplyToHandle` method to the `Cache` interface to allow a caller to request the invocation of a callback on the given cache handle. The goal here is to allow a cache that manages multiple cache instances to use a callback on a handle to determine which instance it belongs to. For example, the callback can hash the key and use that to pick the correct target instance. This is useful to redirect methods like `Ref` and `Release`, which don't know the cache key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12987 Reviewed By: pdillinger Differential Revision: D62151907 Pulled By: anand1976 fbshipit-source-id: e4ffbbb96eac9061d2ab0e7e1739eea5ebb1cd58 |
|
Yu Zhang | 0c6e9c036a |
Make compaction always use the input version with extra ref protection (#12992)
Summary: `Compaction` is already creating its own ref for the input Version: |
|
Yu Zhang | a24574e80a |
Add documentation for background job's state transition (#12994)
Summary: The `SchedulePending*` API is a bit confusing since it doesn't immediately schedule the work and can be confused with the actual scheduling. So I have changed these to be `EnqueuePending*` and added some documentation for the corresponding state transitions of these background work. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12994 Test Plan: existing tests Reviewed By: cbi42 Differential Revision: D62252746 Pulled By: jowlyzhang fbshipit-source-id: ee68be6ed33070cad9a5004b7b3e16f5bcb041bf |
|
Changyu Bi | 0bea5a2cfe |
Disable WAL fault injection in some case (#13000)
Summary: when manual_wal_flush is true and when there are more than 1 CF, WAL fault injection can cause CFs to be inconsistent. See more explanation and repro in T199157789. Disable the combination for now until we have a fix that allows auto recovery. This also helps to see if there's other cause of stress test failures. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13000 Test Plan: the following command could repro db consistency failure in a few runs before this PR. From stress test output we can also see that exclude_wal_from_write_fault_injection and metadata_write_fault_one_in are sanitized to 0. ``` python3 ./tools/db_crashtest.py blackbox --interval=60 --metadata_write_fault_one_in=1000 --column_families=10 --exclude_wal_from_write_fault_injection=0 --manual_wal_flush_one_in=1000 --WAL_size_limit_MB=10240 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=1 --allow_data_in_errors=True --allow_fallocate=1 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=100 --block_align=1 --block_protection_bytes_per_key=0 --block_size=16384 --bloom_before_level=2147483647 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_readahead_size=1048576 --compaction_ttl=0 --compress_format_version=1 --compressed_secondary_cache_size=8388608 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=4 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0 --db_write_buffer_size=0 --decouple_partitioned_filters=1 --default_temperature=kCold --default_write_temperature=kWarm --delete_obsolete_files_period_micros=30000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=1000000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=1 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=0 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=1 --error_recovery_with_no_fault_injection=1 --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --fill_cache=1 --flush_one_in=1000000 --format_version=6 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --index_block_restart_interval=4 --index_shortening=1 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=10000 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=0 --lowest_used_cache_tier=2 --manifest_preallocation_size=5120 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=16 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=2097152 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=0 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=100 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --optimize_filters_for_hits=0 --optimize_filters_for_memory=0 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --paranoid_memory_checks=0 --partition_filters=0 --partition_pinning=2 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=524288 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=10000 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=10000 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=1048576 --sqfc_name=bar --sqfc_version=1 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=2 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=6 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --uncache_aggressiveness=8 --universal_max_read_amp=-1 --unpartitioned_pinning=2 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=0 --use_attribute_group=1 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_cf_iterator=1 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=1 --use_sqfc_for_range_queries=0 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=1 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=4194304 --write_dbid_to_manifest=0 --write_fault_one_in=50 --writepercent=35 --ops_per_thread=100000 --preserve_unverified_changes=1 ``` Reviewed By: hx235 Differential Revision: D62303631 Pulled By: cbi42 fbshipit-source-id: d9441188ee84d53e5e7916f3305e50843fe9fde2 |
|
Peter Dillinger | 4577b672d5 |
More valgrind fixes (#12990)
Summary: * https://github.com/facebook/rocksdb/issues/12936 was insufficient to fix the std::optional false positives. Making a fix validated in CI this time (see https://github.com/facebook/rocksdb/issues/12991) * valgrind grinds to a halt on startup on my dev machine apparently because it expects internet access. Disable its attempts to access the internet when git is using a proxy. * Move PORTABLE=1 from CI job to the Makefile. Without it, valgrind complains about illegal instructions (too new) Pull Request resolved: https://github.com/facebook/rocksdb/pull/12990 Test Plan: manual, watch nightly valgrind job Reviewed By: ltamasi Differential Revision: D62203242 Pulled By: pdillinger fbshipit-source-id: a611b08da7dbd173b0709ed7feb0578729553a17 |
|
Peter Dillinger | 0d3aaf7c0f |
Ensure SSTs compressed in tiered_secondary_cache_test (#12993)
Summary: It appears the arm testsuite is failing because it is building without snappy, which is causing the SST files not to be compressed, which somehow causes these tests to fail. Manually setting LZ4 which is already required. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12993 Test Plan: reproduced and verified fix on ARM laptop Reviewed By: anand1976 Differential Revision: D62216451 Pulled By: pdillinger fbshipit-source-id: 3f21fcd9be0edaa66c7eca0cb7d56b998171e263 |
|
Peter Dillinger | 4b1d595306 |
Fix and clarify ignore_unknown_options (#12989)
Summary: `ignore_unknown_options=true` had an undocumented behavior of having no effect (disallow unknown options) if reading from the same or older major.minor version. Presumably this was intended to catch unintentional addition of new options in a patch release, but there is no automated version compatibility testing between patch releases. So this was a bad choice without such testing support, because it just means users would hit the failure in case of adding features to a patch release. In this diff we respect ignore_unknown_options when reading a file from any newer version, even patch versions, and document this behavior in the API. I don't think it's practical or necessary to test among patch releases in check_format_compatible.sh. This seems like an exceptional case of applying a *different semantics* to patch version updates than to minor/major versions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12989 Test Plan: unit test updated (and refactored) Reviewed By: jaykorean Differential Revision: D62168738 Pulled By: pdillinger fbshipit-source-id: fb3c3ef30f0bbad0d5ffcc4570fb9ef963e7daac |
|
Jay Huh | 064c0ad53d |
Fix the check format compatible change (#12988)
Summary:
`check_format_compatible` script was broken due to extra comma added in
|
|
Symious | c989c51ed7 |
Fix non-ASCII character (#12972)
Summary: Met the following error while compiling the project. ``` build_tools/check-sources.sh utilities/fault_injection_fs.cc:509: // If there<E2><80><99>s no injected error, then cb will be called asynchronously when utilities/fault_injection_fs.cc:510: // target_ actually finishes the read. But if there<E2><80><99>s an injected error, it utilities/fault_injection_fs.cc:512: // isn<E2><80><99>t invoked at all. ^^^^ Use only ASCII characters in source files make[1]: *** [Makefile:1291: check-sources] Error 1 make[1]: Leaving directory '/home/janus/Github/symious/rocksdb' make: *** [Makefile:1084: check] Error 2 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12972 Reviewed By: hx235 Differential Revision: D61923865 Pulled By: cbi42 fbshipit-source-id: 63af0a38fea15e09a860895bdd5ed0a57700e447 |
|
Changyu Bi | cd6f802ccb |
Add a new file ingestion option `link_files` (#12980)
Summary: Add option `IngestExternalFileOptions::link_files` that hard links input files and preserves original file links after ingestion, unlike `move_files` which will unlink input files after ingestion. This can be useful when being used together with `allow_db_generated_files` to ingest files from another DB. Also reverted the change to `move_files` in https://github.com/facebook/rocksdb/issues/12959 to simplify the contract so that it will always unlink input files without exception. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12980 Test Plan: updated unit test `ExternSSTFileLinkFailFallbackTest.LinkFailFallBackExternalSst` to test that input files will not be unlinked. Reviewed By: pdillinger Differential Revision: D61925111 Pulled By: cbi42 fbshipit-source-id: eadaca72e1ae5288bdd195d57158466e5656fa62 |
|
Changyu Bi | 92ad4a88f3 |
Small CPU optimization in InlineSkipList::Insert() (#12975)
Summary: reuse decode key in more places to avoid decoding length prefixed key x->Key(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/12975 Test Plan: ran benchmarks simultaneously for "before" and "after" * fillseq: ``` (for I in $(seq 1 50); do ./db_bench --benchmarks=fillseq --disable_auto_compactions=1 --min_write_buffer_number_to_merge=100 --max_write_buffer_number=1000 --write_buffer_size=268435456 --num=5000000 --seed=1723056275 --disable_wal=1 2>&1 | grep "fillseq" done;) | awk '{ t += $5; c++; print } END { printf ("%9.3f\n", 1.0 * t / c) }'; before: 1483191 after: 1490555 (+0.5%) ``` * fillrandom: ``` (for I in $(seq 1 2); do ./db_bench_imain --benchmarks=fillrandom --disable_auto_compactions=1 --min_write_buffer_number_to_merge=100 --max_write_buffer_number=1000 --write_buffer_size=268435456 --num=2500000 --seed=1723056275 --disable_wal=1 2>&1 | grep "fillrandom" before: 255463 after: 256128 (+0.26%) ``` Reviewed By: anand1976 Differential Revision: D61835340 Pulled By: cbi42 fbshipit-source-id: 70345510720e348bacd51269acb5d2dd5a62bf0a |
|
anand76 | f31b4d80ff |
Retain previous trace file in db_stress for debugging purposes (#12978)
Summary: There are several crash test failures due to DB verification failure. Retain some trace history in the expected state directory to make debugging easier. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12978 Reviewed By: cbi42 Differential Revision: D61864921 Pulled By: anand1976 fbshipit-source-id: 9f3f37b7e1e958bc89a3cf0373182354c2c1aa3b |
|
Jay Huh | 0082907bf2 |
Scope down workflow permissions (#12973)
Summary: Followed instruction per https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions#defining-access-for-the-github_token-scopes It turns out that we did not need any of these except `Metadata: read`. Before ``` GITHUB_TOKEN Permissions Actions: write Attestations: write Checks: write Contents: write Deployments: write Discussions: write Issues: write Metadata: read Packages: write Pages: write PullRequests: write RepositoryProjects: write SecurityEvents: write Statuses: write ``` After ``` GITHUB_TOKEN Permissions Metadata: read ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12973 Test Plan: GitHub Actions triggered by this PR Reviewed By: cbi42 Differential Revision: D61812651 Pulled By: jaykorean fbshipit-source-id: 4413756c93f503e8b2fb77eb8b684ef9e6a6c13d |
|
Peter Dillinger | d96e67c2bf |
Fix flaky test DBTest2.VariousFileTemperatures (#12974)
Summary: ... apparently due to potentially not purging obsolete files after CompactRange Example: https://github.com/facebook/rocksdb/actions/runs/10564621261/job/29267393711?pr=12959 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12974 Test Plan: reproduced failure with USE_CLANG=1 COERCE_CONTEXT_SWITCH=1, now fixed Reviewed By: cbi42 Differential Revision: D61812600 Pulled By: pdillinger fbshipit-source-id: d4b23e1a179bb8ec39875ed7a8ce1649fa3344bd |
|
Changyu Bi | 4eb5878ab2 |
Support ingesting db generated files using hard link (#12959)
Summary: so `IngestExternalFileOptions::move_files` and `IngestExternalFileOptions::allow_db_generated_files` are now compatible. The original file links won't be removed if `allow_db_generated_files` is true. This is to prevent deleting files from another DB. There was a [comment](https://github.com/facebook/rocksdb/pull/12750#discussion_r1684509620) in https://github.com/facebook/rocksdb/issues/12750 about how exactly-once ingestion would work with `move_files`. I've discussed with customer and decided that it can be done by reading the target DB to see if it contains any ingested key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12959 Test Plan: updated unit tests `IngestDBGeneratedFileTest*` to enable `move_files`. Reviewed By: jowlyzhang Differential Revision: D61703480 Pulled By: cbi42 fbshipit-source-id: 6b4294369767f989a2f36bbace4ca3c0257aeaf7 |
|
Peter Dillinger | 8ac14f525e |
Add Temperature to FileAttributes (#12965)
Summary: .. so that appropriate implementations can return temperature information from GetChildrenFileAttributes Pull Request resolved: https://github.com/facebook/rocksdb/pull/12965 Test Plan: just an API placeholder for now Reviewed By: anand1976 Differential Revision: D61748199 Pulled By: pdillinger fbshipit-source-id: b457e324cb451e836611a0bf630c3da0f30a8abf |
|
Peter Dillinger | 96340dbce2 |
Options for file temperature for more files (#12957)
Summary: We have a request to use the cold tier as primary source of truth for the DB, and to best support such use cases and to complement the existing options controlling SST file temperatures, we add two new DB options: * `metadata_write_temperature` for DB "small" files that don't contain much user data * `wal_write_temperature` for WALs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12957 Test Plan: Unit test included, though it's hard to be sure we've covered all the places Reviewed By: jowlyzhang Differential Revision: D61664815 Pulled By: pdillinger fbshipit-source-id: 8e19c9dd8fd2db059bb15f74938d6bc12002e82b |