rocksdb

Commit Graph

Author	SHA1	Message	Date
Yu Zhang	32dd657bad	Add some per key optimization for UDT in memtable only feature (#13031 ) Summary: This PR added some optimizations for the per key handling for SST file for the user-defined timestamps in Memtable only feature. CPU profiling shows this part is a big culprit for regression. This optimization saves some string construction/destruction/appending/copying. vector operations like reserve/emplace_back. When iterating keys in a block, we need to copy some shared bytes from previous key, put it together with the non shared bytes and find a right location to pad the min timestamp. Previously, we create a tmp local string buffer to first construct the key from its pieces, and then copying this local string's content into `IterKey`'s buffer. To avoid having this local string and to avoid this extra copy. Instead of piecing together the key in a local string first, we just track all the pieces that make this key in a reused Slice array. And then copy the pieces in order into `IterKey`'s buffer. Since the previous key should be kept intact while we are copying some shared bytes from it, we added a secondary buffer in `IterKey` and alternate between primary buffer and secondary buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13031 Test Plan: Existing tests. Reviewed By: ltamasi Differential Revision: D63416531 Pulled By: jowlyzhang fbshipit-source-id: 9819b0e02301a2dbc90621b2fe4f651bc912113c	2024-10-03 17:57:50 -07:00
Levi Tamasi	917e98ff9e	Templatize MultiCfIteratorImpl to avoid std::function's overhead (#13052 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13052 Currently, `MultiCfIteratorImpl` uses `std::function`s for `reset_func_` and `populate_func_`, which uses type erasure and has a performance overhead. The patch turns `MultiCfIteratorImpl` into a template that takes the two function object types as template parameters, and changes `AttributeGroupIteratorImpl` and `CoalescingIterator` so they pass in function objects of named types (as opposed to lambdas). Reviewed By: jaykorean Differential Revision: D63802598 fbshipit-source-id: e202f6d80c9054335e5b2571051a67a9e012c2d0	2024-10-02 19:18:15 -07:00
Yu Zhang	9375c3b635	Fix `needs_flush` assertion in file ingestion (#13045 ) Summary: This PR makes file ingestion job's flush wait a bit further until the SuperVersion is also updated. This is necessary since follow up operations will use the current SuperVersion to do range overlapping check and level assignment. In debug mode, file ingestion job's second `NeedsFlush` call could have been invoked when the memtables are flushed but the SuperVersion hasn't been updated yet, triggering the assertion. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13045 Test Plan: Existing tests Manually stress tested Reviewed By: cbi42 Differential Revision: D63671151 Pulled By: jowlyzhang fbshipit-source-id: 95a169e58a7e59f6dd4125e7296e9060fe4c63a7	2024-10-02 17:19:18 -07:00
Peter Dillinger	dd23e84cad	Re-implement GetApproximateMemTableStats for skip lists (#13047 ) Summary: GetApproximateMemTableStats() could return some bad results with the standard skip list memtable. See this new db_bench test showing the dismal distribution of results when the actual number of entries in range is 1000: ``` $ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=1000 ... filluniquerandom : 1.391 micros/op 718915 ops/sec 1.391 seconds 1000000 operations; 11.7 MB/s approximatememtablestats : 3.711 micros/op 269492 ops/sec 3.711 seconds 1000000 operations; Reported entry count stats (expected 1000): Count: 1000000 Average: 2344.1611 StdDev: 26587.27 Min: 0 Median: 965.8555 Max: 835273 Percentiles: P50: 965.86 P75: 1610.77 P99: 12618.01 P99.9: 74991.58 P99.99: 830970.97 ------------------------------------------------------ [ 0, 1 ] 131344 13.134% 13.134% ### ( 1, 2 ] 115 0.011% 13.146% ( 2, 3 ] 106 0.011% 13.157% ( 3, 4 ] 190 0.019% 13.176% ( 4, 6 ] 214 0.021% 13.197% ( 6, 10 ] 522 0.052% 13.249% ( 10, 15 ] 748 0.075% 13.324% ( 15, 22 ] 1002 0.100% 13.424% ( 22, 34 ] 1948 0.195% 13.619% ( 34, 51 ] 3067 0.307% 13.926% ( 51, 76 ] 4213 0.421% 14.347% ( 76, 110 ] 5721 0.572% 14.919% ( 110, 170 ] 11375 1.137% 16.056% ( 170, 250 ] 17928 1.793% 17.849% ( 250, 380 ] 36597 3.660% 21.509% # ( 380, 580 ] 77882 7.788% 29.297% ## ( 580, 870 ] 160193 16.019% 45.317% ### ( 870, 1300 ] 210098 21.010% 66.326% #### ( 1300, 1900 ] 167461 16.746% 83.072% ### ( 1900, 2900 ] 78678 7.868% 90.940% ## ( 2900, 4400 ] 47743 4.774% 95.715% # ( 4400, 6600 ] 17650 1.765% 97.480% ( 6600, 9900 ] 11895 1.190% 98.669% ( 9900, 14000 ] 4993 0.499% 99.168% ( 14000, 22000 ] 2384 0.238% 99.407% ( 22000, 33000 ] 1966 0.197% 99.603% ( 50000, 75000 ] 2968 0.297% 99.900% ( 570000, 860000 ] 999 0.100% 100.000% readrandom : 1.967 micros/op 508487 ops/sec 1.967 seconds 1000000 operations; 8.2 MB/s (1000000 of 1000000 found) ``` Perhaps the only good thing to say about the old implementation was that it was fast, though apparently not that fast. I've implemented a much more robust and reasonably fast new version of the function. It's still logarithmic but with some larger constant factors. The standard deviation from true count is around 20% or less, and roughly the CPU cost of two memtable point look-ups. See code comments for detail. ``` $ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=1000 ... filluniquerandom : 1.478 micros/op 676434 ops/sec 1.478 seconds 1000000 operations; 11.0 MB/s approximatememtablestats : 2.694 micros/op 371157 ops/sec 2.694 seconds 1000000 operations; Reported entry count stats (expected 1000): Count: 1000000 Average: 1073.5158 StdDev: 197.80 Min: 608 Median: 1079.9506 Max: 2176 Percentiles: P50: 1079.95 P75: 1223.69 P99: 1852.36 P99.9: 1898.70 P99.99: 2176.00 ------------------------------------------------------ ( 580, 870 ] 134848 13.485% 13.485% ### ( 870, 1300 ] 747868 74.787% 88.272% ############### ( 1300, 1900 ] 116536 11.654% 99.925% ## ( 1900, 2900 ] 748 0.075% 100.000% readrandom : 1.997 micros/op 500654 ops/sec 1.997 seconds 1000000 operations; 8.1 MB/s (1000000 of 1000000 found) ``` We can already see that the distribution of results is dramatically better and wonderfully normal-looking, with relative standard deviation around 20%. The function is also FASTER, at least with these parameters. Let's look how this behavior generalizes, first much larger range: ``` $ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=30000 filluniquerandom : 1.390 micros/op 719654 ops/sec 1.376 seconds 990000 operations; 11.7 MB/s approximatememtablestats : 1.129 micros/op 885649 ops/sec 1.129 seconds 1000000 operations; Reported entry count stats (expected 30000): Count: 1000000 Average: 31098.8795 StdDev: 3601.47 Min: 21504 Median: 29333.9303 Max: 43008 Percentiles: P50: 29333.93 P75: 33018.00 P99: 43008.00 P99.9: 43008.00 P99.99: 43008.00 ------------------------------------------------------ ( 14000, 22000 ] 408 0.041% 0.041% ( 22000, 33000 ] 749327 74.933% 74.974% ############### ( 33000, 50000 ] 250265 25.027% 100.000% ##### readrandom : 1.894 micros/op 528083 ops/sec 1.894 seconds 1000000 operations; 8.5 MB/s (989989 of 1000000 found) ``` This is even faster and relatively more accurate, with relative standard deviation closer to 10%. Code comments explain why. Now let's look at smaller ranges. Implementation quirks or conveniences: * When actual number in range is >= 40, the minimum return value is 40. * When the actual is <= 10, it is guaranteed to return that actual number. ``` $ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=75 ... filluniquerandom : 1.417 micros/op 705668 ops/sec 1.417 seconds 999975 operations; 11.4 MB/s approximatememtablestats : 3.342 micros/op 299197 ops/sec 3.342 seconds 1000000 operations; Reported entry count stats (expected 75): Count: 1000000 Average: 75.1210 StdDev: 15.02 Min: 40 Median: 71.9395 Max: 256 Percentiles: P50: 71.94 P75: 89.69 P99: 119.12 P99.9: 166.68 P99.99: 229.78 ------------------------------------------------------ ( 34, 51 ] 38867 3.887% 3.887% # ( 51, 76 ] 550554 55.055% 58.942% ########### ( 76, 110 ] 398854 39.885% 98.828% ######## ( 110, 170 ] 11353 1.135% 99.963% ( 170, 250 ] 364 0.036% 99.999% ( 250, 380 ] 8 0.001% 100.000% readrandom : 1.861 micros/op 537224 ops/sec 1.861 seconds 1000000 operations; 8.7 MB/s (999974 of 1000000 found) $ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=25 ... filluniquerandom : 1.501 micros/op 666283 ops/sec 1.501 seconds 1000000 operations; 10.8 MB/s approximatememtablestats : 5.118 micros/op 195401 ops/sec 5.118 seconds 1000000 operations; Reported entry count stats (expected 25): Count: 1000000 Average: 26.2392 StdDev: 4.58 Min: 25 Median: 28.4590 Max: 72 Percentiles: P50: 28.46 P75: 31.69 P99: 49.27 P99.9: 67.95 P99.99: 72.00 ------------------------------------------------------ ( 22, 34 ] 928936 92.894% 92.894% ################### ( 34, 51 ] 67960 6.796% 99.690% # ( 51, 76 ] 3104 0.310% 100.000% readrandom : 1.892 micros/op 528595 ops/sec 1.892 seconds 1000000 operations; 8.6 MB/s (1000000 of 1000000 found) $ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=10 ... filluniquerandom : 1.642 micros/op 608916 ops/sec 1.642 seconds 1000000 operations; 9.9 MB/s approximatememtablestats : 3.042 micros/op 328721 ops/sec 3.042 seconds 1000000 operations; Reported entry count stats (expected 10): Count: 1000000 Average: 10.0000 StdDev: 0.00 Min: 10 Median: 10.0000 Max: 10 Percentiles: P50: 10.00 P75: 10.00 P99: 10.00 P99.9: 10.00 P99.99: 10.00 ------------------------------------------------------ ( 6, 10 ] 1000000 100.000% 100.000% #################### readrandom : 1.805 micros/op 554126 ops/sec 1.805 seconds 1000000 operations; 9.0 MB/s (1000000 of 1000000 found) ``` Remarkably consistent. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13047 Test Plan: new db_bench test for both performance and accuracy (see above); added to crash test; unit test updated. Reviewed By: cbi42 Differential Revision: D63722003 Pulled By: pdillinger fbshipit-source-id: cfc8613c085e87c17ecec22d82601aac2a5a1b26	2024-10-02 14:25:50 -07:00
Jay Huh	2a5ff78c12	More info in CompactionServiceJobInfo and CompactionJobStats (#13029 ) Summary: Add the following to the `CompactionServiceJobInfo` - compaction_reason - is_full_compaction - is_manual_compaction - bottommost_level Added `is_remote_compaction` to the `CompactionJobStats` and set initial values to avoid UB for uninitialized values. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13029 Test Plan: ``` ./compaction_service_test --gtest_filter="CompactionInfo" ``` Reviewed By: anand1976 Differential Revision: D63322878 Pulled By: jaykorean fbshipit-source-id: f02a66ca45e660b9d354a43837d8ec6beb7621fb	2024-09-25 10:26:15 -07:00
Peter Dillinger	a1a102ffce	Steps toward deprecating implicit prefix seek, related fixes (#13026 ) Summary: With some new use cases onboarding to prefix extractors/seek/filters, one of the risks is existing iterator code, e.g. for maintenance tasks, being unintentionally subject to prefix seek semantics. This is a longstanding known design flaw with prefix seek, and `prefix_same_as_start` and `auto_prefix_mode` were steps in the direction of making that obsolete. However, we can't just immediately set `total_order_seek` to true by default, because that would impact so much code instantly. Here we add a new DB option, `prefix_seek_opt_in_only` that basically allows users to transition to the future behavior when they are ready. When set to true, all iterators will be treated as if `total_order_seek=true` and then the only ways to get prefix seek semantics are with `prefix_same_as_start` or `auto_prefix_mode`. Related fixes / changes: * Make sure that `prefix_same_as_start` and `auto_prefix_mode` are compatible with (or override) `total_order_seek` (depending on your interpretation). * Fix a bug in which a new iterator after dynamically changing the prefix extractor might mix different prefix semantics between memtable and SSTs. Both should use the latest extractor semantics, which means iterators ignoring memtable prefix filters with an old extractor. And that means passing the latest prefix extractor to new memtable iterators that might use prefix seek. (Without the fix, the test added for this fails in many ways.) Suggested follow-up: * Investigate a FIXME where a MergeIteratorBuilder is created in db_impl.cc. No unit test detects a change in value that should impact correctness. * Make memtable prefix bloom compatible with `auto_prefix_mode`, which might require involving the memtablereps because we don't know at iterator creation time (only seek time) whether an auto_prefix_mode seek will be a prefix seek. * Add `prefix_same_as_start` testing to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/13026 Test Plan: tests updated, added. Add combination of `total_order_seek=true` and `auto_prefix_mode=true` to stress test. Ran `make blackbox_crash_test` for a long while. Manually ran tests with `prefix_seek_opt_in_only=true` as default, looking for unexpected issues. I inspected most of the results and migrated many tests to be ready for such a change (but not all). Reviewed By: ltamasi Differential Revision: D63147378 Pulled By: pdillinger fbshipit-source-id: 1f4477b730683d43b4be7e933338583702d3c25e	2024-09-20 15:54:19 -07:00
Jay Huh	5f4a8c3da4	Load latest options from OPTIONS file in Remote host (#13025 ) Summary: We've been serializing and deserializing DBOptions and CFOptions (and other CF into) as part of `CompactionServiceInput`. These are all readily available in the OPTIONS file and the remote worker can read the OPTIONS file to obtain the same information. This helps reducing the size of payload significantly. In a very rare scenario if the OPTIONS file is purged due to options change by primary host at the same time while the remote host is loading the latest options, it may fail. In this case, we just retry once. This also solves the problem where we had to open the default CF with the CFOption from another CF if the remote compaction is for a non-default column family. (TODO comment in /db_impl_secondary.cc) Pull Request resolved: https://github.com/facebook/rocksdb/pull/13025 Test Plan: Unit Tests ``` ./compaction_service_test ``` ``` ./compaction_job_test ``` Also tested with Meta's internal Offload Infra Reviewed By: anand1976, cbi42 Differential Revision: D63100109 Pulled By: jaykorean fbshipit-source-id: b7162695e31e2c5a920daa7f432842163a5b156d	2024-09-20 13:26:02 -07:00
Changyu Bi	71e38dbe25	Compact one file at a time for FIFO temperature change compactions (#13018 ) Summary: Per customer request, we should not merge multiple SST files together during temperature change compaction, since this can cause FIFO TTL compactions to be delayed. This PR changes the compaction picking logic to pick one file at a time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13018 Test Plan: * updated some existing unit tests to test this new behavior. Reviewed By: jowlyzhang Differential Revision: D62883292 Pulled By: cbi42 fbshipit-source-id: 6a9fc8c296b5d9b17168ef6645f25153241c8b93	2024-09-19 15:50:41 -07:00
Levi Tamasi	54ace7f340	Change the semantics of blob_garbage_collection_force_threshold to provide better control over space amp (#13022 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13022 Currently, `blob_garbage_collection_force_threshold` applies to the oldest batch of blob files, which is typically only a small subset of the blob files currently eligible for garbage collection. This can result in a form of head-of-line blocking: no GC-triggered compactions will be scheduled if the oldest batch does not currently exceed the threshold, even if a lot of higher-numbered blob files do. This can in turn lead to high space amplification that exceeds the soft bound implicit in the force threshold (e.g. 50% would suggest a space amp of <2 and 75% would imply a space amp of <4). The patch changes the semantics of this configuration threshold to apply to the entire set of blob files that are eligible for garbage collection based on `blob_garbage_collection_age_cutoff`. This provides more intuitive semantics for the option and can provide a better write amp/space amp trade-off. (Note that GC-triggered compactions still pick the same SST files as before, so triggered GC still targets the oldest the blob files.) Reviewed By: jowlyzhang Differential Revision: D62977860 fbshipit-source-id: a999f31fe9cdda313de513f0e7a6fc707424d4a3	2024-09-19 15:47:13 -07:00
Peter Dillinger	98c33cb8e3	Steps toward making IDENTITY file obsolete (#13019 ) Summary: * Set write_dbid_to_manifest=true by default * Add new option write_identity_file (default true) that allows us to opt-in to future behavior without identity file * Refactor related DB open code to minimize code duplication _Recommend hiding whitespace changes for review_ Intended follow-up: add support to ldb for reading and even replacing the DB identity in the manifest. Could be a variant of `update_manifest` command or based on it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13019 Test Plan: unit tests and stress test updated for new functionality Reviewed By: anand1976 Differential Revision: D62898229 Pulled By: pdillinger fbshipit-source-id: c08b25cf790610b034e51a9de0dc78b921abbcf0	2024-09-19 14:05:21 -07:00
Peter Dillinger	10984e8c26	Fix and generalize framework for filtering range queries, etc. (#13005 ) Summary: There was a subtle design/contract bug in the previous version of range filtering in experimental.h If someone implemented a key segments extractor with "all or nothing" fixed size segments, that could result in unsafe range filtering. For example, with two segments of width 3: ``` x = 0x\|12 34 56\|78 9A 00\| y = 0x\|12 34 56\|\|78 9B z = 0x\|12 34 56\|78 9C 00\| ``` Segment 1 of y (empty) is out of order with segment 1 of x and z. I have re-worked the contract to make it clear what does work, and implemented a standard extractor for fixed-size segments, CappedKeySegmentsExtractor. The safe approach for filtering is to consume as much as is available for a segment in the case of a short key. I have also added support for min-max filtering with reverse byte-wise comparator, which is probably the 2nd most common comparator for RocksDB users (because of MySQL). It might seem that a min-max filter doesn't care about forward or reverse ordering, but it does when trying to determine whether in input range from segment values v1 to v2, where it so happens that v2 is byte-wise less than v1, is an empty forward interval or a non-empty reverse interval. At least in the current setup, we don't have that context. A new unit test (with some refactoring) tests CappedKeySegmentsExtractor, reverse byte-wise comparator, and the corresponding min-max filter. I have also (contractually / mathematically) generalized the framework to comparators other than the byte-wise comparator, and made other generalizations to make the extractor limitations more explicitly connected to the particular filters and filtering used--at least in description. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13005 Test Plan: added unit tests as described Reviewed By: jowlyzhang Differential Revision: D62769784 Pulled By: pdillinger fbshipit-source-id: 0d41f0d0273586bdad55e4aa30381ebc861f7044	2024-09-18 15:26:37 -07:00
Nick Brekhus	0611eb5b9d	Fix orphaned files in SstFileManager (#13015 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13015 `Close()`ing a database now releases tracked files in `SstFileManager`. Previously this space would be leaked until the database was later reopened. Reviewed By: jowlyzhang Differential Revision: D62590773 fbshipit-source-id: 5461bd253d974ac4967ad52fee92e2650f8a9a28	2024-09-18 13:27:44 -07:00
Changyu Bi	f97e33454f	Fix a bug with auto recovery on WAL write error (#12995 ) Summary: A recent crash test failure shows that auto recovery from WAL write failure can cause CFs to be inconsistent. A unit test repro in P1569398553. The following is an example sequence of events: ``` 0. manual_wal_flush is true. There are multiple CFs in a DB. 1. Submit a write batch with updates to multiple CF 2. A FlushWAL or a memtable swtich that will try to write the buffered WAL data. Fail this write so that buffered WAL data is dropped: `4b1d595306/file/writable_file_writer.cc (L624)` The error needs to be retryable to start background auto recovery. 3. One CF successfully flushes its memtable during auto recovery. 4. Crash the process. 5. Reopen the DB, one CF will have the update as a result of successful flush. Other CFs will miss all the updates in the write batch since WAL does not have them. ``` This can happen if a users configures manual_wal_flush, uses more than one CF, and can hit retryable error for WAL writes. This PR is a short-term fix that upgrades WAL related errors to fatal and not trigger auto recovery. A long-term fix may be not drop buffered WAL data by checking how much data is actually written, or require atomically flushing all column families during error recovery from this kind of errors. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12995 Test Plan: added unit test to check error severity and if recovery is triggered. A crash test repro command that fails in a few runs before this PR: ``` python3 ./tools/db_crashtest.py blackbox --interval=60 --metadata_write_fault_one_in=1000 --column_families=10 --exclude_wal_from_write_fault_injection=0 --manual_wal_flush_one_in=1000 --WAL_size_limit_MB=10240 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=1 --allow_data_in_errors=True --allow_fallocate=1 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=100 --block_align=1 --block_protection_bytes_per_key=0 --block_size=16384 --bloom_before_level=2147483647 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_readahead_size=1048576 --compaction_ttl=0 --compress_format_version=1 --compressed_secondary_cache_size=8388608 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=4 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0 --db_write_buffer_size=0 --decouple_partitioned_filters=1 --default_temperature=kCold --default_write_temperature=kWarm --delete_obsolete_files_period_micros=30000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=1000000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=1 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=0 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=1 --error_recovery_with_no_fault_injection=1 --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --fill_cache=1 --flush_one_in=1000000 --format_version=6 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --index_block_restart_interval=4 --index_shortening=1 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=10000 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=0 --lowest_used_cache_tier=2 --manifest_preallocation_size=5120 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=16 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=2097152 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=0 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=100 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --optimize_filters_for_hits=0 --optimize_filters_for_memory=0 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --paranoid_memory_checks=0 --partition_filters=0 --partition_pinning=2 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=524288 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=10000 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=10000 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=1048576 --sqfc_name=bar --sqfc_version=1 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=2 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=6 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --uncache_aggressiveness=8 --universal_max_read_amp=-1 --unpartitioned_pinning=2 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=0 --use_attribute_group=1 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_cf_iterator=1 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=1 --use_sqfc_for_range_queries=0 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=1 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=4194304 --write_dbid_to_manifest=0 --write_fault_one_in=50 --writepercent=35 --ops_per_thread=100000 --preserve_unverified_changes=1 ``` Reviewed By: hx235 Differential Revision: D62888510 Pulled By: cbi42 fbshipit-source-id: 308bdbbb8d897cc8eba950155cd0e37cf7eb76fe	2024-09-17 14:10:33 -07:00
Nicholas Ormrod	0e04ef1a96	Deshim coro in fbcode/internal_repo_rocksdb Summary: The following rules were deshimmed: ``` //folly/experimental/coro:accumulate -> //folly/coro:accumulate //folly/experimental/coro:async_generator -> //folly/coro:async_generator //folly/experimental/coro:async_pipe -> //folly/coro:async_pipe //folly/experimental/coro:async_scope -> //folly/coro:async_scope //folly/experimental/coro:async_stack -> //folly/coro:async_stack //folly/experimental/coro:baton -> //folly/coro:baton //folly/experimental/coro:blocking_wait -> //folly/coro:blocking_wait //folly/experimental/coro:collect -> //folly/coro:collect //folly/experimental/coro:concat -> //folly/coro:concat //folly/experimental/coro:coroutine -> //folly/coro:coroutine //folly/experimental/coro:current_executor -> //folly/coro:current_executor //folly/experimental/coro:detach_on_cancel -> //folly/coro:detach_on_cancel //folly/experimental/coro:detail_barrier -> //folly/coro:detail_barrier //folly/experimental/coro:detail_barrier_task -> //folly/coro:detail_barrier_task //folly/experimental/coro:detail_current_async_frame -> //folly/coro:detail_current_async_frame //folly/experimental/coro:detail_helpers -> //folly/coro:detail_helpers //folly/experimental/coro:detail_malloc -> //folly/coro:detail_malloc //folly/experimental/coro:detail_manual_lifetime -> //folly/coro:detail_manual_lifetime //folly/experimental/coro:detail_traits -> //folly/coro:detail_traits //folly/experimental/coro:filter -> //folly/coro:filter //folly/experimental/coro:future_util -> //folly/coro:future_util //folly/experimental/coro:generator -> //folly/coro:generator //folly/experimental/coro:gmock_helpers -> //folly/coro:gmock_helpers //folly/experimental/coro:gtest_helpers -> //folly/coro:gtest_helpers //folly/experimental/coro:inline_task -> //folly/coro:inline_task //folly/experimental/coro:invoke -> //folly/coro:invoke //folly/experimental/coro:merge -> //folly/coro:merge //folly/experimental/coro:mutex -> //folly/coro:mutex //folly/experimental/coro:promise -> //folly/coro:promise //folly/experimental/coro:result -> //folly/coro:result //folly/experimental/coro:retry -> //folly/coro:retry //folly/experimental/coro:rust_adaptors -> //folly/coro:rust_adaptors //folly/experimental/coro:scope_exit -> //folly/coro:scope_exit //folly/experimental/coro:shared_lock -> //folly/coro:shared_lock //folly/experimental/coro:shared_mutex -> //folly/coro:shared_mutex //folly/experimental/coro:sleep -> //folly/coro:sleep //folly/experimental/coro:small_unbounded_queue -> //folly/coro:small_unbounded_queue //folly/experimental/coro:task -> //folly/coro:task //folly/experimental/coro:timed_wait -> //folly/coro:timed_wait //folly/experimental/coro:timeout -> //folly/coro:timeout //folly/experimental/coro:traits -> //folly/coro:traits //folly/experimental/coro:transform -> //folly/coro:transform //folly/experimental/coro:unbounded_queue -> //folly/coro:unbounded_queue //folly/experimental/coro:via_if_async -> //folly/coro:via_if_async //folly/experimental/coro:with_async_stack -> //folly/coro:with_async_stack //folly/experimental/coro:with_cancellation -> //folly/coro:with_cancellation //folly/experimental/coro:bounded_queue -> //folly/coro:bounded_queue //folly/experimental/coro:shared_promise -> //folly/coro:shared_promise //folly/experimental/coro:cleanup -> //folly/coro:cleanup //folly/experimental/coro:auto_cleanup_fwd -> //folly/coro:auto_cleanup_fwd //folly/experimental/coro:auto_cleanup -> //folly/coro:auto_cleanup ``` The following headers were deshimmed: ``` folly/experimental/coro/Accumulate.h -> folly/coro/Accumulate.h folly/experimental/coro/Accumulate-inl.h -> folly/coro/Accumulate-inl.h folly/experimental/coro/AsyncGenerator.h -> folly/coro/AsyncGenerator.h folly/experimental/coro/AsyncPipe.h -> folly/coro/AsyncPipe.h folly/experimental/coro/AsyncScope.h -> folly/coro/AsyncScope.h folly/experimental/coro/AsyncStack.h -> folly/coro/AsyncStack.h folly/experimental/coro/Baton.h -> folly/coro/Baton.h folly/experimental/coro/BlockingWait.h -> folly/coro/BlockingWait.h folly/experimental/coro/Collect.h -> folly/coro/Collect.h folly/experimental/coro/Collect-inl.h -> folly/coro/Collect-inl.h folly/experimental/coro/Concat.h -> folly/coro/Concat.h folly/experimental/coro/Concat-inl.h -> folly/coro/Concat-inl.h folly/experimental/coro/Coroutine.h -> folly/coro/Coroutine.h folly/experimental/coro/CurrentExecutor.h -> folly/coro/CurrentExecutor.h folly/experimental/coro/DetachOnCancel.h -> folly/coro/DetachOnCancel.h folly/experimental/coro/detail/Barrier.h -> folly/coro/detail/Barrier.h folly/experimental/coro/detail/BarrierTask.h -> folly/coro/detail/BarrierTask.h folly/experimental/coro/detail/CurrentAsyncFrame.h -> folly/coro/detail/CurrentAsyncFrame.h folly/experimental/coro/detail/Helpers.h -> folly/coro/detail/Helpers.h folly/experimental/coro/detail/Malloc.h -> folly/coro/detail/Malloc.h folly/experimental/coro/detail/ManualLifetime.h -> folly/coro/detail/ManualLifetime.h folly/experimental/coro/detail/Traits.h -> folly/coro/detail/Traits.h folly/experimental/coro/Filter.h -> folly/coro/Filter.h folly/experimental/coro/Filter-inl.h -> folly/coro/Filter-inl.h folly/experimental/coro/FutureUtil.h -> folly/coro/FutureUtil.h folly/experimental/coro/Generator.h -> folly/coro/Generator.h folly/experimental/coro/GmockHelpers.h -> folly/coro/GmockHelpers.h folly/experimental/coro/GtestHelpers.h -> folly/coro/GtestHelpers.h folly/experimental/coro/detail/InlineTask.h -> folly/coro/detail/InlineTask.h folly/experimental/coro/Invoke.h -> folly/coro/Invoke.h folly/experimental/coro/Merge.h -> folly/coro/Merge.h folly/experimental/coro/Merge-inl.h -> folly/coro/Merge-inl.h folly/experimental/coro/Mutex.h -> folly/coro/Mutex.h folly/experimental/coro/Promise.h -> folly/coro/Promise.h folly/experimental/coro/Result.h -> folly/coro/Result.h folly/experimental/coro/Retry.h -> folly/coro/Retry.h folly/experimental/coro/RustAdaptors.h -> folly/coro/RustAdaptors.h folly/experimental/coro/ScopeExit.h -> folly/coro/ScopeExit.h folly/experimental/coro/SharedLock.h -> folly/coro/SharedLock.h folly/experimental/coro/SharedMutex.h -> folly/coro/SharedMutex.h folly/experimental/coro/Sleep.h -> folly/coro/Sleep.h folly/experimental/coro/Sleep-inl.h -> folly/coro/Sleep-inl.h folly/experimental/coro/SmallUnboundedQueue.h -> folly/coro/SmallUnboundedQueue.h folly/experimental/coro/Task.h -> folly/coro/Task.h folly/experimental/coro/TimedWait.h -> folly/coro/TimedWait.h folly/experimental/coro/Timeout.h -> folly/coro/Timeout.h folly/experimental/coro/Timeout-inl.h -> folly/coro/Timeout-inl.h folly/experimental/coro/Traits.h -> folly/coro/Traits.h folly/experimental/coro/Transform.h -> folly/coro/Transform.h folly/experimental/coro/Transform-inl.h -> folly/coro/Transform-inl.h folly/experimental/coro/UnboundedQueue.h -> folly/coro/UnboundedQueue.h folly/experimental/coro/ViaIfAsync.h -> folly/coro/ViaIfAsync.h folly/experimental/coro/WithAsyncStack.h -> folly/coro/WithAsyncStack.h folly/experimental/coro/WithCancellation.h -> folly/coro/WithCancellation.h folly/experimental/coro/BoundedQueue.h -> folly/coro/BoundedQueue.h folly/experimental/coro/SharedPromise.h -> folly/coro/SharedPromise.h folly/experimental/coro/Cleanup.h -> folly/coro/Cleanup.h folly/experimental/coro/AutoCleanup-fwd.h -> folly/coro/AutoCleanup-fwd.h folly/experimental/coro/AutoCleanup.h -> folly/coro/AutoCleanup.h ``` This is a codemod. It was automatically generated and will be landed once it is approved and tests are passing in sandcastle. You have been added as a reviewer by Sentinel or Butterfly. Autodiff project: dcoro Autodiff partition: fbcode.internal_repo_rocksdb Autodiff bookmark: ad.dcoro.fbcode.internal_repo_rocksdb Reviewed By: dtolnay Differential Revision: D62684411 fbshipit-source-id: 8dbd31ab64fcdd99435d322035b9668e3200e0a3	2024-09-14 09:48:21 -07:00
anand76	cabd2d8718	Fix a couple of missing cases of retry on corruption (#13007 ) Summary: For SST checksum mismatch corruptions in the read path, RocksDB retries the read if the underlying file system supports verification and reconstruction of data (`FSSupportedOps::kVerifyAndReconstructRead`). There were a couple of places where the retry was missing - reading the SST footer and the properties block. This PR fixes the retry in those cases. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13007 Test Plan: Add new unit tests Reviewed By: jaykorean Differential Revision: D62519186 Pulled By: anand1976 fbshipit-source-id: 50aa38f18f2a53531a9fc8d4ccdf34fbf034ed59	2024-09-13 13:56:49 -07:00
Changyu Bi	e490f2b051	Fix a bug in ReFitLevel() where `FileMetaData::being_compacted` is not cleared (#13009 ) Summary: in ReFitLevel(), we were not setting being_compacted to false after ReFitLevel() is done. This is not a issue if refit level is successful, since new FileMetaData is created for files at the target level. However, if there's an error during RefitLevel(), e.g., Manifest write failure, we should clear the being_compacted field for these files. Otherwise, these files will not be picked for compaction until db reopen. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13009 Test Plan: existing test. - stress test failure in T200339331 should not happen anymore. Reviewed By: hx235 Differential Revision: D62597169 Pulled By: cbi42 fbshipit-source-id: 0ba659806da6d6d4b42384fc95268b2d7bad720e	2024-09-12 15:19:14 -07:00
Yu Zhang	43bc71fef6	Add an internal API MemTableList::GetEditForDroppingCurrentVersion (#13001 ) Summary: Prepare this internal API to be used by atomic data replacement. The main purpose of this API is to get a `VersionEdit` to mark the entire current `MemTableListVersion` as dropped. Flush needs the similar functionality when installing results, so that logic is refactored into a util function `GetDBRecoveryEditForObsoletingMemTables` to be shared by flush and this internal API. To test this internal API, flush's result installation is redirected to use this API when it is flushing all the immutable MemTables in debug mode. It should achieve the exact same results, just with a duplicated `VersionEdit::log_number` field that doesn't upsets the recovery logic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13001 Test Plan: Existing tests Reviewed By: pdillinger Differential Revision: D62309591 Pulled By: jowlyzhang fbshipit-source-id: e25914d9a2e281c25ab7ee31a66eaf6adfae4b88	2024-09-10 13:23:13 -07:00
Yu Zhang	0c6e9c036a	Make compaction always use the input version with extra ref protection (#12992 ) Summary: `Compaction` is already creating its own ref for the input Version: `4b1d595306/db/compaction/compaction.cc (L73)` And properly Unref it during destruction: `4b1d595306/db/compaction/compaction.cc (L450)` This PR redirects compaction's access of `cfd->current()` to this input `Version`, to prepare for when a column family's data can be replaced all together, and `cfd->current()` is not safe to access for a compaction job. Because a new `Version` with just some other external files could be installed as `cfd->current()`. The compaction job's expectation of the current `Version` and the corresponding storage info to always have its input files will no longer be guaranteed. My next follow up is to do a similar thing for flush, also to prepare it for when a column family's data can be replaced. I will make it create its own reference of the current `MemTableListVersion` and use it as input, all flush job's access of memtables will be wired to that input `MemTableListVersion`. Similarly this reference will be unreffed during a flush job's destruction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12992 Test Plan: Existing tests Reviewed By: pdillinger Differential Revision: D62212625 Pulled By: jowlyzhang fbshipit-source-id: 9a781213469cf366857a128d50a702af683a046a	2024-09-06 14:07:33 -07:00
Yu Zhang	a24574e80a	Add documentation for background job's state transition (#12994 ) Summary: The `SchedulePending` API is a bit confusing since it doesn't immediately schedule the work and can be confused with the actual scheduling. So I have changed these to be `EnqueuePending` and added some documentation for the corresponding state transitions of these background work. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12994 Test Plan: existing tests Reviewed By: cbi42 Differential Revision: D62252746 Pulled By: jowlyzhang fbshipit-source-id: ee68be6ed33070cad9a5004b7b3e16f5bcb041bf	2024-09-06 13:08:34 -07:00
Changyu Bi	cd6f802ccb	Add a new file ingestion option `link_files` (#12980 ) Summary: Add option `IngestExternalFileOptions::link_files` that hard links input files and preserves original file links after ingestion, unlike `move_files` which will unlink input files after ingestion. This can be useful when being used together with `allow_db_generated_files` to ingest files from another DB. Also reverted the change to `move_files` in https://github.com/facebook/rocksdb/issues/12959 to simplify the contract so that it will always unlink input files without exception. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12980 Test Plan: updated unit test `ExternSSTFileLinkFailFallbackTest.LinkFailFallBackExternalSst` to test that input files will not be unlinked. Reviewed By: pdillinger Differential Revision: D61925111 Pulled By: cbi42 fbshipit-source-id: eadaca72e1ae5288bdd195d57158466e5656fa62	2024-09-03 13:06:25 -07:00
Peter Dillinger	d96e67c2bf	Fix flaky test DBTest2.VariousFileTemperatures (#12974 ) Summary: ... apparently due to potentially not purging obsolete files after CompactRange Example: https://github.com/facebook/rocksdb/actions/runs/10564621261/job/29267393711?pr=12959 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12974 Test Plan: reproduced failure with USE_CLANG=1 COERCE_CONTEXT_SWITCH=1, now fixed Reviewed By: cbi42 Differential Revision: D61812600 Pulled By: pdillinger fbshipit-source-id: d4b23e1a179bb8ec39875ed7a8ce1649fa3344bd	2024-08-26 14:08:21 -07:00
Changyu Bi	4eb5878ab2	Support ingesting db generated files using hard link (#12959 ) Summary: so `IngestExternalFileOptions::move_files` and `IngestExternalFileOptions::allow_db_generated_files` are now compatible. The original file links won't be removed if `allow_db_generated_files` is true. This is to prevent deleting files from another DB. There was a [comment](https://github.com/facebook/rocksdb/pull/12750#discussion_r1684509620) in https://github.com/facebook/rocksdb/issues/12750 about how exactly-once ingestion would work with `move_files`. I've discussed with customer and decided that it can be done by reading the target DB to see if it contains any ingested key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12959 Test Plan: updated unit tests `IngestDBGeneratedFileTest*` to enable `move_files`. Reviewed By: jowlyzhang Differential Revision: D61703480 Pulled By: cbi42 fbshipit-source-id: 6b4294369767f989a2f36bbace4ca3c0257aeaf7	2024-08-26 12:25:16 -07:00
Peter Dillinger	96340dbce2	Options for file temperature for more files (#12957 ) Summary: We have a request to use the cold tier as primary source of truth for the DB, and to best support such use cases and to complement the existing options controlling SST file temperatures, we add two new DB options: * `metadata_write_temperature` for DB "small" files that don't contain much user data * `wal_write_temperature` for WALs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12957 Test Plan: Unit test included, though it's hard to be sure we've covered all the places Reviewed By: jowlyzhang Differential Revision: D61664815 Pulled By: pdillinger fbshipit-source-id: 8e19c9dd8fd2db059bb15f74938d6bc12002e82b	2024-08-23 19:49:25 -07:00
eniac1024	f5c5f881d2	Fix MultiGet with timestamps (#12943 ) Summary: Issue: MultiGet(PinnableSlice) can't read out all timestamps. Fixed the impl, and added an UT as well. In the original impl, if MultiGet reads multiple column families, a later column family would clean up timestamps of previous column family. Fix: https://github.com/facebook/rocksdb/issues/12950#issue-2476996580 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12943 Reviewed By: anand1976 Differential Revision: D61729257 Pulled By: pdillinger fbshipit-source-id: 55267c26076c8a59acedd27e14714711729a40df	2024-08-23 13:58:04 -07:00
Changyu Bi	c62de54c7c	Record largest seqno in table properties and verify in file ingestion (#12951 ) Summary: this helps to avoid scanning input files when ingesting db generated files: `ecb844babd/db/external_sst_file_ingestion_job.cc (L917-L935)` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12951 Test Plan: * `IngestDBGeneratedFileTest.FailureCase` is updated to verify that this table property is verified during ingestion * existing unit tests for other ingestion use cases. Reviewed By: jowlyzhang Differential Revision: D61608285 Pulled By: cbi42 fbshipit-source-id: b5b7aae9741531349ab247be6ffaa3f3628b76ca	2024-08-21 16:24:18 -07:00
Yu Zhang	945f60b157	Add some documentation for version edit handlers (#12948 ) Summary: As titled. No functional change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12948 Reviewed By: hx235 Differential Revision: D61551254 Pulled By: jowlyzhang fbshipit-source-id: ccf53d78bd2f18f174d7e61972e5de467c96ce76	2024-08-21 10:09:10 -07:00
Yu Zhang	81d52bdc1a	Fix UDT in memtable only assertions (#12946 ) Summary: Empty memtables can be legitimately created and flushed, for example by error recovery flush attempts: `273b3eadf0/db/db_impl/db_impl_compaction_flush.cc (L2309-L2312)` This check is updated to be considerate of this. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12946 Reviewed By: hx235 Differential Revision: D61492477 Pulled By: jowlyzhang fbshipit-source-id: 7d16fcaea457948546072f85b3650fd1cc24f9db	2024-08-20 09:19:52 -07:00
Changyu Bi	defd97bc9d	Add an option to verify memtable key order during reads (#12889 ) Summary: add a new CF option `paranoid_memory_checks` that allows additional data integrity validations during read/scan. Currently, skiplist-based memtable will validate the order of keys visited. Further data validation can be added in different layers. The option will be opt-in due to performance overhead. The motivation for this feature is for services where data correctness is critical and want to detect in-memory corruption earlier. For a corrupted memtable key, this feature can help to detect it during during reads instead of during flush with existing protections (OutputValidator that verifies key order or per kv checksum). See internally linked task for more context. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12889 Test Plan: * new unit test added for paranoid_memory_checks=true. * existing unit test for paranoid_memory_checks=false. * enable in stress test. Performance Benchmark: we check for performance regression in read path where data is in memtable only. For each benchmark, the script was run at the same time for main and this PR: * Memtable-only randomread ops/sec: ``` (for I in $(seq 1 50);do ./db_bench --benchmarks=fillseq,readrandom --write_buffer_size=268435456 --writes=250000 --num=250000 --reads=500000 --seed=1723056275 2>&1 \| grep "readrandom"; done;) \| awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; Main: 608146 PR with paranoid_memory_checks=false: 607727 (- %0.07) PR with paranoid_memory_checks=true: 521889 (-%14.2) ``` * Memtable-only sequential scan ops/sec: ``` (for I in $(seq 1 50); do ./db_bench--benchmarks=fillseq,readseq[-X10] --write_buffer_size=268435456 --num=1000000 --seed=1723056275 2>1 \| grep "\[AVG 10 runs\]"; done;) \| awk '{ t += $6; c++; print; } END { printf "%.0f\n", 1.0 * t / c }'; Main: 9180077 PR with paranoid_memory_checks=false: 9536241 (+%3.8) PR with paranoid_memory_checks=true: 7653934 (-%16.6) ``` * Memtable-only reverse scan ops/sec: ``` (for I in $(seq 1 20); do ./db_bench --benchmarks=fillseq,readreverse[-X10] --write_buffer_size=268435456 --num=1000000 --seed=1723056275 2>1 \| grep "\[AVG 10 runs\]"; done;) \| awk '{ t += $6; c++; print; } END { printf "%.0f\n", 1.0 * t / c }'; Main: 1285719 PR with integrity_checks=false: 1431626 (+%11.3) PR with integrity_checks=true: 811031 (-%36.9) ``` The `readrandom` benchmark shows no regression. The scanning benchmarks show improvement that I can't explain. Reviewed By: pdillinger Differential Revision: D60414267 Pulled By: cbi42 fbshipit-source-id: a70b0cbeea131f1a249a5f78f9dc3a62dacfaa91	2024-08-19 13:53:25 -07:00
Jay Huh	273b3eadf0	Add Remote Compaction Installation Callback Function (#12940 ) Summary: Add an optional callback function upon remote compaction temp output installation. This will be internally used for setting the final status in the Offload Infra. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12940 Test Plan: Unit Test added ``` ./compaction_service_test ``` _Also internally tested by manually merging into internal code base_ Reviewed By: anand1976 Differential Revision: D61419157 Pulled By: jaykorean fbshipit-source-id: 66831685bc403949c26bfc65840dd1900d2a5a67	2024-08-19 11:22:43 -07:00
Yu Zhang	295326b6ee	Best efforts recovery recover seqno prefix (#12938 ) Summary: This PR make best efforts recovery more permissive by allowing it to recover incomplete Version that presents a valid point in time view from the user's perspective. Currently, a Version is only valid and saved if all files consisting that Version can be found. With this change, if only a suffix of L0 files (and their associated blob files) are missing, a valid Version is also available to be saved and recover to. Note that we don't do this if the column family was atomically flushed. Because atomic flush also need a consistent view across the column families, we cannot guarantee that if we are recovering to incomplete version. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12938 Test Plan: Existing tests and added unit tests. Reviewed By: anand1976 Differential Revision: D61414381 Pulled By: jowlyzhang fbshipit-source-id: f9b73deb34d35ad696ab42315928b656d586262a	2024-08-16 17:18:54 -07:00
Peter Dillinger	4d3518951a	Option to decouple index and filter partitions (#12939 ) Summary: Partitioned metadata blocks were introduced back in 2017 to deal more gracefully with large DBs where RAM is relatively scarce and some data might be much colder than other data. The feature allows metadata blocks to compete for memory in the block cache against data blocks while alleviating tail latencies and thrash conditions that can arise with large metadata blocks (sometimes megabytes each) that can arise with large SST files. In general, the cost to partitioned metadata is more CPU in accesses (especially for filters where more binary search is needed before hashing can be used) and a bit more memory fragmentation and related overheads. However the feature has always had a subtle limitation with a subtle effect on performance: index partitions and filter partitions must be cut at the same time, regardless of which wins the space race (hahaha) to metadata_block_size. Commonly filters will be a few times larger than indexes, so index partitions will be under-sized compared to filter (and data) blocks. While this does affect fragmentation and related overheads a bit, I suspect the bigger impact on performance is in the block cache. The coupling of the partition cuts would be defensible if the binary search done to find the filter block was used (on filter hit) to short-circuit binary search to an index partition, but that optimization has not been developed. Consider two metadata blocks, an under-sized one and a normal-sized one, covering proportional sections of the key space with the same density of read queries. The under-sized one will be more prone to eviction from block cache because it is used less often. This is unfair because of its despite its proportionally smaller cost of keeping in block cache, and most of the cost of a miss to re-load it (random IO) is not proportional to the size (similar latency etc. up to ~32KB). ## This change Adds a new table option decouple_partitioned_filters allows filter blocks and index blocks to be cut independently. To make this work, the partitioned filter block builder needs to know about the previous key, to generate an appropriate separator for the partition index. In most cases, BlockBasedTableBuilder already has easy access to the previous key to provide to the filter block builder. This change includes refactoring to pass that previous key to the filter builder when available, with the filter building caching the previous key itself when unavailable, such as during compression dictionary training and some unit tests. Access to the previous key eliminates the need to track the previous prefix, which results in a small SST construction CPU win in prefix filtering cases, regardless of coupling, and possibly a small regression for some non-prefix cases, regardless of coupling, but still overall improvement especially with https://github.com/facebook/rocksdb/issues/12931. Suggested follow-up: * Update confusing use of "last key" to refer to "previous key" * Expand unit test coverage with parallel compression and dictionary training * Consider an option or enhancement to alleviate under-sized metadata blocks "at the end" of an SST file due to no coordination or awareness of when files are cut. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12939 Test Plan: unit tests updated. Also did some unit test runs with "hard wired" usage of parallel compression and dictionary training code paths to ensure they were working. Also ran blackbox_crash_test for a while with the new feature. ## SST write performance (CPU) Using the same testing setup as in https://github.com/facebook/rocksdb/issues/12931 but with -decouple_partitioned_filters=1 in the "after" configuration, which benchmarking shows makes almost no difference in terms of SST write CPU. "After" vs. "before" this PR ``` -partition_index_and_filters=0 -prefix_size=0 -whole_key_filtering=1 923691 vs. 924851 (-0.13%) -partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=0 921398 vs. 922973 (-0.17%) -partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=1 902259 vs. 908756 (-0.71%) -partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0 917932 vs. 916901 (+0.60%) -partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0 912755 vs. 907298 (+0.60%) -partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=1 899754 vs. 892433 (+0.82%) ``` I think this is a pretty good trade, especially in attracting more movement toward partitioned configurations. ## Read performance Let's see how decoupling affects read performance across various degrees of memory constraint. To simplify LSM structure, we're using FIFO compaction. Since decoupling will overall increase metadata block size, we control for this somewhat with an extra "before" configuration with larger metadata block size setting (8k instead of 4k). Basic setup: ``` (for CS in 0300 1200; do TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillrandom,flush,readrandom,block_cache_entry_stats -num=5000000 -duration=30 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=10 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters=1 -statistics=1 -cache_size=${CS}000000 -metadata_block_size=4096 -decouple_partitioned_filters=1 2>&1 \| tee results-$CS; done) ``` And read ops/s results: ```CSV Cache size MB,After/decoupled/4k,Before/4k,Before/8k 3,15593,15158,12826 6,16295,16693,14134 10,20427,20813,18459 20,27035,26836,27384 30,33250,31810,33846 60,35518,32585,35329 100,36612,31805,35292 300,35780,31492,35481 1000,34145,31551,35411 1100,35219,31380,34302 1200,35060,31037,34322 ``` If you graph this with log scale on the X axis (internal link: https://pxl.cl/5qKRc), you see that the decoupled/4k configuration is essentially the best of both the before/4k and before/8k configurations: handles really tight memory closer to the old 4k configuration and handles generous memory closer to the old 8k configuration. Reviewed By: jowlyzhang Differential Revision: D61376772 Pulled By: pdillinger fbshipit-source-id: fc2af2aee44290e2d9620f79651a30640799e01f	2024-08-16 15:34:31 -07:00
Peter Dillinger	f63428bcc7	Optimize, simplify filter block building (fix regression) (#12931 ) Summary: This is in part a refactoring / simplification to set up for "decoupled" partitioned filters and in part to fix an intentional regression for a correctness fix in https://github.com/facebook/rocksdb/issues/12872. Basically, we are taking out some complexity of the filter block builders, and pushing part of it (simultaneous de-duplication of prefixes and whole keys) into the filter bits builders, where it is more efficient by operating on hashes (rather than copied keys). Previously, the FullFilterBlockBuilder had a somewhat fragile and confusing set of conditions under which it would keep a copy of the most recent prefix and most recent whole key, along with some other state that is essentially redundant. Now we just track (always) the previous prefix in the PartitionedFilterBlockBuilder, to deal with the boundary prefix Seek filtering problem. (Btw, the next PR will optimize this away since BlockBasedTableReader already tracks the previous key.) And to deal with the problem of de-duplicating both whole keys and prefixes going into a single filter, we add a new function to FilterBitsBuilder that has that extra de-duplication capabilty, which is relatively efficient because we only have to cache an extra 64-bit hash, not a copied key or prefix. (The API of this new function is somewhat awkward to avoid a small CPU regression in some cases.) Also previously, there was awkward logic split between FullFilterBlockBuilder and PartitionedFilterBlockBuilder to deal with some things specific to partitioning. And confusing names like Add vs. AddKey. FullFilterBlockBuilder is much cleaner and simplified now. The splitting of PartitionedFilterBlockBuilder::MaybeCutAFilterBlock into DecideCutAFilterBlock and CutAFilterBlock is to address what would have been a slight performance regression in some cases. The split allows for more intruction-level parallelism by reducing unnecessary control dependencies. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12931 Test Plan: existing tests (with some minor updates) Also manually ported over the pre-broken regression test described in https://github.com/facebook/rocksdb/issues/12870 and ran it (passed). Performance: Here we validate that an entire series of recent related PRs are a net improvement in aggregate. "Before" is with these PRs reverted: https://github.com/facebook/rocksdb/issues/12872 #12911 https://github.com/facebook/rocksdb/issues/12874 #12867 https://github.com/facebook/rocksdb/issues/12903 #12904. "After" includes this PR (and all of those, with base revision `16c21af`). Simultaneous test script designed to maximally depend on SST construction efficiency: ``` for PF in 0 1; do for PS in 0 8; do for WK in 0 1; do [ "$PS" == "$WK" ] \|\| (for I in `seq 1 20`; do TEST_TMPDIR=/dev/shm/rocksdb2 ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -memtablerep=vector -allow_concurrent_memtable_write=0 -bloom_bits=10 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK 2>&1 \| grep micros/op; done) \| awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; echo "Was -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK"; done; done; done) \| tee results ``` Showing average ops/sec of "after" vs. "before" ``` -partition_index_and_filters=0 -prefix_size=0 -whole_key_filtering=1 935586 vs. 928176 (+0.79%) -partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=0 930171 vs. 926801 (+0.36%) -partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=1 910727 vs. 894397 (+1.8%) -partition_index_and_filters=1 -prefix_size=0 -whole_key_filtering=1 929795 vs. 922007 (+0.84%) -partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0 921924 vs. 917285 (+0.51%) -partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=1 903393 vs. 887340 (+1.8%) ``` As one would predict, the most improvement is seen in cases where we have optimized away copying the whole key. Reviewed By: jowlyzhang Differential Revision: D61138271 Pulled By: pdillinger fbshipit-source-id: 427cef0b1465017b45d0a507bfa7720fa20af043	2024-08-14 15:13:16 -07:00
Yu Zhang	d458331ee9	Move file tracking in VersionEditHandlerPointInTime to VersionBuilder (#12928 ) Summary: `VersionEditHandlerPointInTime` is tracking found files, missing files, intermediate files in order to decide to build a `Version` on negative edge trigger (transition from valid to invalid) without applying the current `VersionEdit`. However, applying `VersionEdit` and check completeness of a `Version` are specialization of `VersionBuilder`. More importantly, when we augment best efforts recovery to recover not just complete point in time Version but also a prefix of seqno for a point in time Version, such checks need to be duplicated in `VersionEditHandlerPointInTime` and `VersionBuilder`. To avoid this, this refactor move all the file tracking functionality in `VersionEditHandlerPointInTime` into `VersionBuilder`. To continue to let `VersionEditHandlerPIT` do the edge trigger check and build a `Version` before applying the current `VersionEdit`, a suite of APIs to supporting creating a save point and its associated functions are added in `VersionBuilder` to achieve this. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12928 Test Plan: Existing tests Reviewed By: anand1976 Differential Revision: D61171320 Pulled By: jowlyzhang fbshipit-source-id: 604f66f8b1e3a3e13da59d8ba357c74e8a366dbc	2024-08-12 21:09:37 -07:00
anand76	c21fe1a47f	Add ticker stats for read corruption retries (#12923 ) Summary: Add a couple of ticker stats for corruption retry count and successful retries. This PR also eliminates an extra read attempt when there's a checksum mismatch in a block read from the prefetch buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12923 Test Plan: Update existing tests Reviewed By: jowlyzhang Differential Revision: D61024687 Pulled By: anand1976 fbshipit-source-id: 3a08403580ab244000e0d480b7ee0f5a03d76b06	2024-08-12 15:32:07 -07:00
SGZW	6727f0f58a	fix compaction_picker_test asan heap use after free (#12908 ) Summary: ![image](https://github.com/user-attachments/assets/3290fe18-aca2-4691-b072-fbbc96a15fb1) this testcase set syncpoint function which reference this test case heap variable "enable_per_key_placement_" and this sync point function will be triggered by another testcase, so asan will report asan heap use after free error Pull Request resolved: https://github.com/facebook/rocksdb/pull/12908 Reviewed By: hx235 Differential Revision: D60973363 Pulled By: cbi42 fbshipit-source-id: df4f488f51e7741784d5a92fc0a5fc538c5d5b1a	2024-08-09 15:06:37 -07:00
SGZW	5c456c4c08	fix compaction speedup for marked files ut (#12912 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12912 Reviewed By: hx235 Differential Revision: D60973460 Pulled By: cbi42 fbshipit-source-id: ebaa343757f09f7281884a512ebe3a7d6845c8b3	2024-08-09 15:05:02 -07:00
Changyu Bi	b32d899482	Fix MultiGet dropping memtable kv checksum corruption (#12842 ) Summary: Corruption status returned by `GetFromTable()` could be overwritten here: `b6c3495a71/db/version_set.cc (L2614)` This PR fixes this issue by setting `*(s->found_final_value) = true;` in SaveValue. Also makes the handling of the return value of `GetFromTable()` more robust and added asserts in a couple places. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12842 Test Plan: Updated an existing unit test to cover MultiGet. It fails the assertion here before this PR: `b6c3495a71/db/version_set.cc (L2601)` Reviewed By: anand1976 Differential Revision: D59498203 Pulled By: cbi42 fbshipit-source-id: 1f071c1b2c5b66fb71264b547a9e670d1cf592f0	2024-08-08 13:34:11 -07:00
Peter Dillinger	d33d25f903	Disable WAL recycling in crash test; reproducer for recovery data loss (#12918 ) Summary: I was investigating a crash test failure with "Corruption: SST file is ahead of WALs" which I haven't reproduced, but I did reproduce a data loss issue on recovery which I suspect could be the same root problem. The problem is already somewhat known (see https://github.com/facebook/rocksdb/issues/12403 and https://github.com/facebook/rocksdb/issues/12639) where it's only safe to recovery multiple recycled WAL files with trailing old data if the sequence numbers between them are adjacent (to ensure we didn't lose anything in the corrupt/obsolete WAL tail). However, aside from disableWAL=true, there are features like external file ingestion that can increment the sequence numbers without writing to the WAL. It is simply unsustainable to worry about this kind of feature interaction limiting where we can consume sequence numbers. It is very hard to test and audit as well. For reliable crash recovery of recycled WALs, we need a better way of detecting that we didn't drop data from one WAL to the next. Until then, let's disable WAL recycling in the crash test, to help stabilize it. Ideas for follow-up to fix the underlying problem: (a) With recycling, we could always sync the WAL before opening the next one. HOWEVER, this potentially very large sync could cause a big hiccup in writes (vs. O(1) sized manifest sync). (a1) The WAL sync could ensure it is truncated to size, or (a2) By requiring track_and_verify_wals_in_manifest, we could assume that the last synced size in the manifest is the final usable size of the WAL. (It might also be worth avoiding truncating recycled WALs.) (b) Add a new mechanism to record and verify the final size of a WAL without requiring a sync. (b1) By requiring track_and_verify_wals_in_manifest, this could be new WAL metadata recorded in the manifest (at the time of switching WALs). Note that new fields of WalMetadata are not forward-compatible, but a new kind of manifest record (next to WalAddition, WalDeletion; e.g. WalCompletion) is IIRC forward-compatible. (b2) A new kind of WAL header entry (not forward compatible, unfortunately) could record the final size of the previous WAL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12918 Test Plan: Added disabled reproducer for non-linear data loss on recovery Reviewed By: hx235 Differential Revision: D60917527 Pulled By: pdillinger fbshipit-source-id: 3663d79aec81851f5cf41669f84a712bb4563fd7	2024-08-07 14:20:45 -07:00
Peter Dillinger	b15f8c7f0e	Refactor db_bloom_filter_test (#12911 ) Summary: Ahead of a "decoupled" variant of partitioned filters, refactoring this unit test file to make it easier to incorporate that new variant. * bool test param to new enum class FilterPartitioning * Some cases of iterating over that bool to new parameterized test * Combine some common functionality for configuring parameterized options Pull Request resolved: https://github.com/facebook/rocksdb/pull/12911 Test Plan: no production changes, and no intentional changes to scope or conditions of tests Differential Revision: D60701287 fbshipit-source-id: 3497e3230e29a4f62c934bcb75693965a2df41d8	2024-08-07 11:28:16 -07:00
Yu Zhang	719c96125c	Add a TransactionOptions to enable tracking timestamp size info inside WriteBatch (#12864 ) Summary: In normal use cases, meta info like column family's timestamp size is tracked at the transaction layer, so it's not necessary and even detrimental to track such info inside the internal WriteBatch because it may let anti-patterns like bypassing Transaction write APIs and directly write to its internal WriteBatch like this: `9d0a754dc9/storage/rocksdb/ha_rocksdb.cc (L4949-L4950)` Setting this option to true will keep aforementioned use case continue to work before it's refactored out. This option is only for this purpose and it will be gradually deprecated after aforementioned MyRocks use case are refactored. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12864 Test Plan: Added unit tests Reviewed By: cbi42 Differential Revision: D60194094 Pulled By: jowlyzhang fbshipit-source-id: 64a98822167e99aa7e4fa2a60085d44a5deaa45c	2024-08-05 13:06:45 -07:00
Yu Zhang	36b061a6c7	Fix test breakage (#12915 ) Summary: https://github.com/facebook/rocksdb/issues/12891 updated this deletion rate in the test to be much higher, which makes the test flaky. The rate is being intentionally set to very low to maximize the retention of a ".log.trash" file after DB closes. This PR just change it back. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12915 Reviewed By: ltamasi Differential Revision: D60776312 Pulled By: jowlyzhang fbshipit-source-id: d193557a042c65816fcc337cceb09905e042e9f6	2024-08-05 12:26:18 -07:00
Yu Zhang	d12aaf23ca	Fix file deletions in DestroyDB not rate limited (#12891 ) Summary: Make `DestroyDB` slowly delete files if it's configured and enabled via `SstFileManager`. It's currently not available mainly because of DeleteScheduler's logic related to tracked total_size_ and total_trash_size_. These accounting and logic should not be applied to `DestroyDB`. This PR adds a `DeleteUnaccountedDBFile` util for this purpose which deletes files without accounting it. This util also supports assigning a file to a specified trash bucket so that user can later wait for a specific trash bucket to be empty. For `DestroyDB`, files with more than 1 hard links will be deleted immediately. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12891 Test Plan: Added unit tests, existing tests. Reviewed By: anand1976 Differential Revision: D60300220 Pulled By: jowlyzhang fbshipit-source-id: 8b18109a177a3a9532f6dc2e40e08310c08ca3c7	2024-08-02 19:31:55 -07:00
Levi Tamasi	2e8a1a14ef	Fix a data race affecting the background error status (#12910 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12910 There is currently a call to `GetBGError()` in `DBImpl::WriteImplWALOnly()` where the DB mutex is (incorrectly) not held, leading to a data race. Technically, we could acquire the mutex here but instead, the patch removes the affected check altogether, since the same check is already performed (in a thread-safe manner) in the subsequent call to `PreprocessWrite()`. Reviewed By: cbi42 Differential Revision: D60682008 fbshipit-source-id: 54b67975dcf57d67c068cac71e8ada09a1793ec5	2024-08-02 14:11:08 -07:00
Changyu Bi	8be824e316	Use compensated file size for intra-L0 compaction (#12878 ) Summary: In leveled compaction, we pick intra-L0 compaction instead of L0->Lbase whenever L0 size is small. When L0 files contain many deletions, it makes more sense to compact then down instead of accumulating tombstones in L0. This PR uses compensated_file_size when computing L0 size for determining intra-L0 compaction. Also scale down the limit on total L0 size further to be more cautious about accumulating data in L0. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12878 Test Plan: updated unit test. Reviewed By: hx235 Differential Revision: D59932421 Pulled By: cbi42 fbshipit-source-id: 9de973ac51eb7df81b38b8c68110072b1aa06321	2024-08-01 17:49:34 -07:00
Yu Zhang	319374ae67	Add some checks at property block creation side (#12898 ) Summary: Crash test encountered this failure: ```file ingestion error: Corruption: properties unsorted under specified IngestExternalFileOptions: move_files: 0, verify_checksums_before_ingest: 1, verify_checksums_readahead_size: 1048576 (Empty string or missing field indicates default option or value is used``` Further inspection showed out of order table properties in an external file created by `SstFileWriter` for ingestion, and the file is likely created like this because it passed the initial checksum check. This change added some assertions to check invariant at the properties creation and collecting side. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12898 Test Plan: Existing tests Reviewed By: hx235 Differential Revision: D60459817 Pulled By: jowlyzhang fbshipit-source-id: 91474943d2f9d7795f00b6031c08a13ab91e2470	2024-07-31 13:28:17 -07:00
Peter Dillinger	2595476541	Fix rare WAL handling crash (#12899 ) Summary: A crash test failure in log sync in DBImpl::WriteToWAL is due to a missed case in https://github.com/facebook/rocksdb/issues/12734. Just need to apply similar logic from DBImpl::SyncWalImpl to check for an already closed WAL (nullptr writer). This is extremely rare because it only comes from failed Sync on a closed WAL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12899 Test Plan: watch crash test Reviewed By: cbi42 Differential Revision: D60481652 Pulled By: pdillinger fbshipit-source-id: 4a176bb6a53dcf077f88344710a110c2f946c386	2024-07-30 17:38:30 -07:00
Peter Dillinger	9058fd037c	Small CPU optimization to experimental range filters (#12893 ) Summary: By reusing an object that owns a vector. The vector allocation/sizing was substantial in a CPU profile. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12893 Test Plan: existing tests Reviewed By: jowlyzhang Differential Revision: D60405139 Pulled By: pdillinger fbshipit-source-id: 8bfbc07cd9b4829f2ac9015e90f2b4eba61fd984	2024-07-29 14:23:35 -07:00
Hui Xiao	408e8d4c85	Handle injected write error after successful WAL write in crash test + misc (#12838 ) Summary: Context/Summary: We discovered the following false positive in our crash test lately: (1) PUT() writes k/v to WAL but fails in `ApplyWALToManifest()`. The k/v is in the WAL (2) Current stress test logic will rollback the expected state of such k/v since PUT() fails (3) If the DB crashes before recovery finishes and reopens, the WAL will be replayed and the k/v is in the DB while the expected state have been roll-backed. We decided to leave those expected state to be pending until the loop-write of the same key succeeds. Bonus: Now that I realized write to manifest can also fail the write which faces the similar problem as https://github.com/facebook/rocksdb/pull/12797, I decided to disable fault injection on user write per thread (instead of globally) when tracing is needed for prefix recovery; some refactory Pull Request resolved: https://github.com/facebook/rocksdb/pull/12838 Test Plan: Rehearsal CI Run below command (varies on sync_fault_injection=1,0 to verify ExpectedState behavior) for a while to ensure crash recovery validation works fine ``` python3 tools/db_crashtest.py --simple blackbox --interval=30 --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --block_align=1 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=4 --bloom_bits=56.810257702625165 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=1 --checkpoint_one_in=10000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000 --compact_range_one_in=1000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=10 --compress_format_version=1 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc=04:00-08:00 --data_block_index_type=1 --db_write_buffer_size=0 --default_temperature=kWarm --default_write_temperature=kCold --delete_obsolete_files_period_micros=30000000 --delpercent=20 --delrangepercent=20 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=0 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=1 --exclude_wal_from_write_fault_injection=0 --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --fill_cache=1 --flush_one_in=1000000 --format_version=3 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=4 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=10000 --log_file_time_to_roll=60 --log_readahead_size=16777216 --long_running_snapshots=1 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=32768 --max_sequential_skip_in_iterations=1 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=8388608 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=0 --metadata_write_fault_one_in=8 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=0 --open_write_fault_one_in=8 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=2 --prefix_size=7 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=10 --recycle_log_file_num=1 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=0 --secondary_cache_fault_one_in=0 --set_options_one_in=0 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=0 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=0 --strict_bytes_per_sync=1 --subcompactions=4 --sync=1 --sync_fault_injection=0 --table_cache_numshardbits=6 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=2 --uncache_aggressiveness=239 --universal_max_read_amp=-1 --unpartitioned_pinning=1 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=8 --writepercent=40 ``` Reviewed By: cbi42 Differential Revision: D59377075 Pulled By: hx235 fbshipit-source-id: 91f602fd67e2d339d378cd28b982095fd073dcb6	2024-07-29 13:51:49 -07:00
Yu Zhang	9883b5f497	Fix manifest_number_ point to invalid file (#12882 ) Summary: This PR fix `VersionSet`'s `manifest_number_` could be pointing to an invalid number intermediately. This happens when a new manifest roll is attempted but fast failed after loading table handlers and before the new manifest file creation/writing is actually attempted. In theory, a later manifest roll effort will overthrow this intermediate invalid in memory state. There is on harm when the DB crashes in this invalid state either. But efforts that takes a file snapshot of the DB like backup will incorrectly try to copy a non existing manifest file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12882 Reviewed By: cbi42 Differential Revision: D60204956 Pulled By: jowlyzhang fbshipit-source-id: effbdb124b582f879d114988af06ac63867fc549	2024-07-24 17:50:08 -07:00
Yu Zhang	05c9c9aeed	Fix race between test and recovery flush switch memtable (#12884 ) Summary: As titled, to fix this type of data race: https://github.com/facebook/rocksdb/actions/runs/10066814221/job/27829003372?pr=12882 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12884 Test Plan: COMPILE_WITH_TSAN=1 make -j10 db_wal_test ./db_wal_test --gtest_filter=DBWALTest.RecoveryFlushSwitchWALOnEmptyMemtable --gtest_repeat=100 Reviewed By: anand1976 Differential Revision: D60197834 Pulled By: jowlyzhang fbshipit-source-id: 89524cdb4d17a1b647295bcccf5eb2d7d425bc6a	2024-07-24 17:06:16 -07:00

1 2 3 4 5 ...

5734 Commits