rocksdb

mirror of https://github.com/facebook/rocksdb.git synced 2024-11-27 02:44:18 +00:00

Author	SHA1	Message	Date
Jay Huh	e9b15738d9	Honor ConfigOptions.ignore_unknown_options in ParseStruct() (#13152 ) Summary: `OptionTypeInfo::ParseStruct()` was not honoring `config_options.ignore_unknown_options` when unknown properties are found in the serialized string. This caused a compatibility issue in Remote Compaction. When the worker was updated with RocksDB 9.9, the remote worker started including a new table property, `newest_key_time` (added in PR https://github.com/facebook/rocksdb/issues/13083), in the compaction output files. However, parsing that table property in the serialized compaction result from the primary (running with `9.8`) was returning a non-ok status, even though `config_options.ignore_unknown_options` was `true`. In this fix, we will ignore unused properties if `config_options.ignore_unknown_options` is set to true. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13152 Test Plan: Unit Test Added Reviewed By: archang19 Differential Revision: D66374541 Pulled By: jaykorean fbshipit-source-id: 78fd8309909279390438c247c4d390bbee4fa914	2024-11-22 16:38:05 -08:00
Yu Zhang	bee8d5560e	Start version 9.10.0 (#13146 ) Summary: Pull in HISTORY for 9.9.0, update version.h for next version, update check_format_compatible.sh, update git hash for folly Pull Request resolved: https://github.com/facebook/rocksdb/pull/13146 Test Plan: CI Reviewed By: ltamasi Differential Revision: D66142259 Pulled By: jowlyzhang fbshipit-source-id: 90216b2d7cff2e0befb4f56567e3bd074f97c484	2024-11-18 21:49:47 -08:00
Changyu Bi	925435bbd9	Fix a bug that can retain old WAL longer than needed (#13127 ) Summary: The bug only happens for transaction db with 2pc. The main change is in `MemTableList::TryInstallMemtableFlushResults`. Before this fix, `memtables_to_flush` may not include all flushed memtables, and it causes the min_log_number for the flush to be incorrect. The code path for calculating min_log_number is `MemTableList::TryInstallMemtableFlushResults() -> GetDBRecoveryEditForObsoletingMemTables() -> PrecomputeMinLogNumberToKeep2PC() -> FindMinPrepLogReferencedByMemTable()`. Inside `FindMinPrepLogReferencedByMemTable()`, we need to exclude all memtables being flushed. The PR also includes some documentation changes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13127 Test Plan: added a new unit that fails before this change. Reviewed By: ltamasi Differential Revision: D65679270 Pulled By: cbi42 fbshipit-source-id: 611f34bd6ef4cba51f8b54cb1be416887b5a9c5e	2024-11-11 14:19:45 -08:00
anand76	ee258619be	Fix missing cases of corruption retries (#13122 ) Summary: This PR fixes a few cases where RocksDB was not retrying checksum failure/corruption of file reads with the `verify_and_reconstruct_read` IO option. After fixing these cases, we can almost always successfully open the DB and execute reads even if we see transient corruptions, provided the `FileSystem` supports the `verify_and_reconstruct_read` option. The specific cases fixed in this PR are - 1. CURRENT file 2. IDENTITY file 3. OPTIONS file 4. SST footer Pull Request resolved: https://github.com/facebook/rocksdb/pull/13122 Test Plan: Unit test in `db_io_failure_test.cc` that injects corruption at various stages of DB open and reads Reviewed By: jaykorean Differential Revision: D65617982 Pulled By: anand1976 fbshipit-source-id: 4324b88cc7eee5501ab5df20ef7a95bb12ed3ea7	2024-11-08 12:43:21 -08:00
Peter Dillinger	485ee4f45c	Fix and test for leaks of open SST files (#13117 ) Summary: Follow-up to https://github.com/facebook/rocksdb/issues/13106 which revealed that some SST file readers (in addition to blob files) were being essentially leaked in TableCache (until DB::Close() time). Patched sources of leaks: * Flush that is not committed (builder.cc) * Various obsolete SST files picked up by directory scan but not caught by SubcompactionState::Cleanup() cleaning up from some failed compactions. Dozens of unit tests fail without the "backstop" TableCache::Evict() call in PurgeObsoleteFiles(). We also needed to adjust the check for leaks as follows: * Ok if DB::Open never finished (see comment) * Ok if deletions are disabled (see comment) * Allow "quarantined" files to be in table_cache because (presumably) they might become live again. * Get live files from all live Versions. Suggested follow-up: * Potentially delete more obsolete files sooner with a FIXME in db_impl_files.cc. This could potentially be high value because it seems to gate deletion of any/all newer obsolete files on all older compactions finishing. * Try to catch obsolete files in more places using the VersionSet::obsolete_files_ pipeline rather than relying on them being picked up with directory scan, or deleting them outside of normal mechanisms. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13117 Test Plan: updated check used in most all unit tests in ASAN build Reviewed By: hx235 Differential Revision: D65502988 Pulled By: pdillinger fbshipit-source-id: aa0795a8a09d9ec578d25183fe43e2a35849209c	2024-11-08 10:54:43 -08:00
Yu Zhang	282f5a463b	Fix write committed transactions replay when UDT setting toggles (#13121 ) Summary: This PR adds some missing pieces in order to handle UDT setting toggles while replay WALs for WriteCommitted transactions DB. Specifically, all the transaction markers for no op, prepare, commit, rollback are currently not carried over from the original WriteBatch to the new WriteBatch when there is a timestamp setting difference detected. This PR fills that gap. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13121 Test Plan: Added unit tests Reviewed By: ltamasi Differential Revision: D65558801 Pulled By: jowlyzhang fbshipit-source-id: 8176882637b95f6dc0dad10d7fe21056fa5173d1	2024-11-06 17:32:03 -08:00
Peter Dillinger	a28cc4a38c	Fix a leak of open Blob files (#13106 ) Summary: An earlier change (`b34cef57b7`) removed apparently unused functionality where an obsolete blob file number is passed for removal from TableCache, which manages SST files. This was actually relying on broken/fragile abstractions wherein TableCache and BlobFileCache share the same Cache and using the TableCache interface to manipulate blob file caching. No unit test was actually checking for removal of obsolete blob files from the cache (which is somewhat tricky to check and a second order correctness requirement). Here we fix the leak and add a DEBUG+ASAN-only check in DB::Close() that no obsolete files are lingering in the table/blob file cache. Fixes https://github.com/facebook/rocksdb/issues/13066 Important follow-up (FIXME): The added check discovered some apparent cases of leaked (into table_cache) SST file readers that would stick around until DB::Close(). Need to enable that check, diagnose, and fix. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13106 Test Plan: added a check that is called during DB::Close in ASAN builds (to minimize paying the cost in all unit tests). Without the fix, the check failed in at least these tests: ``` db_blob_basic_test DBBlobBasicTest.DynamicallyWarmCacheDuringFlush db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadMerge db_blob_compaction_test DBBlobCompactionTest.MergeBlobWithBase db_blob_compaction_test DBBlobCompactionTest.CompactionDoNotFillCache db_blob_compaction_test DBBlobCompactionTest.SkipUntilFilter db_blob_compaction_test DBBlobCompactionTest.CompactionFilter db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadFilter db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadGarbageCollection ``` Reviewed By: ltamasi Differential Revision: D65296123 Pulled By: pdillinger fbshipit-source-id: 2276d76482beb2c75c9010bc1bec070bb23a24c0	2024-10-31 15:29:30 -07:00
Levi Tamasi	ef535039f3	Call PrepareValue on the base iterator in BaseDeltaIterator (#13105 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13105 The `WriteBatchWithIndex::NewIteratorWithBase` interface enables creating a `BaseDeltaIterator` with an arbitrary base iterator passed in by the client, which has potentially been created with the `allow_unprepared_value` read option set. Because of this, `BaseDeltaIterator` has to call `PrepareValue` before using the `value()` or `columns()` from the base iterator. This includes both the case when `BaseDeltaIterator` exposes the `value()` and `columns()` of the base iterator as is and the case when the final `value()` / `columns()` is a result of merging key-values across the base and delta iterators. Note that `BaseDeltaIterator` itself does not support `allow_unprepared_value` yet; this will be implemented in an upcoming patch. Reviewed By: jowlyzhang Differential Revision: D65249643 fbshipit-source-id: b0a1ccc0dfd31105b2eef167b463ed15a8bb83b7	2024-10-31 14:20:33 -07:00
Peter Dillinger	9ad772e652	Start version 9.9.0 (#13093 ) Summary: Pull in HISTORY for 9.8.0, update version.h for next version, update check_format_compatible.sh Pull Request resolved: https://github.com/facebook/rocksdb/pull/13093 Test Plan: CI Reviewed By: jowlyzhang Differential Revision: D64987257 Pulled By: pdillinger fbshipit-source-id: a7cec329e3d245e63767760aa0298c08c3281695	2024-10-25 13:47:29 -07:00
Peter Dillinger	3fd1f11d35	Fix race to make BlockBasedTableOptions effectively mutable (#13082 ) Summary: Fix a longstanding race condition in SetOptions for `block_based_table_factory` options. The fix is mostly described in new, unified `TableFactoryParseFn()` in `cf_options.cc`. Also in this PR: * Adds a virtual `Clone()` function to TableFactory * To avoid behavioral hiccups with `SetOptions`, make the "hidden state" of `BlockBasedTableFactory` shared between an original and a clone. For example, `TailPrefetchStats` * `Configurable` was allowed to be copied but was not safe to do so, because the copy would have and use pointers into object it was copied from (!!!). This has been fixed using relative instead of absolute pointers, though it's still technically relying on undefined behavior (consistent object layout for non-standard-layout types). For future follow-up: * Deny SetOptions on block cache options (dubious and not yet made safe with proper shared_ptr handling) Fixes https://github.com/facebook/rocksdb/issues/10079 Pull Request resolved: https://github.com/facebook/rocksdb/pull/13082 Test Plan: added to unit tests and crash test Ran TSAN blackbox crashtest for hours with options to amplify potential race (see https://github.com/facebook/rocksdb/issues/10079) Reviewed By: cbi42 Differential Revision: D64947243 Pulled By: pdillinger fbshipit-source-id: 8390299149f50e2a2b39a5247680f2637edb23c8	2024-10-25 10:24:54 -07:00
Peter Dillinger	dd23e84cad	Re-implement GetApproximateMemTableStats for skip lists (#13047 ) Summary: GetApproximateMemTableStats() could return some bad results with the standard skip list memtable. See this new db_bench test showing the dismal distribution of results when the actual number of entries in range is 1000: ``` $ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=1000 ... filluniquerandom : 1.391 micros/op 718915 ops/sec 1.391 seconds 1000000 operations; 11.7 MB/s approximatememtablestats : 3.711 micros/op 269492 ops/sec 3.711 seconds 1000000 operations; Reported entry count stats (expected 1000): Count: 1000000 Average: 2344.1611 StdDev: 26587.27 Min: 0 Median: 965.8555 Max: 835273 Percentiles: P50: 965.86 P75: 1610.77 P99: 12618.01 P99.9: 74991.58 P99.99: 830970.97 ------------------------------------------------------ [ 0, 1 ] 131344 13.134% 13.134% ### ( 1, 2 ] 115 0.011% 13.146% ( 2, 3 ] 106 0.011% 13.157% ( 3, 4 ] 190 0.019% 13.176% ( 4, 6 ] 214 0.021% 13.197% ( 6, 10 ] 522 0.052% 13.249% ( 10, 15 ] 748 0.075% 13.324% ( 15, 22 ] 1002 0.100% 13.424% ( 22, 34 ] 1948 0.195% 13.619% ( 34, 51 ] 3067 0.307% 13.926% ( 51, 76 ] 4213 0.421% 14.347% ( 76, 110 ] 5721 0.572% 14.919% ( 110, 170 ] 11375 1.137% 16.056% ( 170, 250 ] 17928 1.793% 17.849% ( 250, 380 ] 36597 3.660% 21.509% # ( 380, 580 ] 77882 7.788% 29.297% ## ( 580, 870 ] 160193 16.019% 45.317% ### ( 870, 1300 ] 210098 21.010% 66.326% #### ( 1300, 1900 ] 167461 16.746% 83.072% ### ( 1900, 2900 ] 78678 7.868% 90.940% ## ( 2900, 4400 ] 47743 4.774% 95.715% # ( 4400, 6600 ] 17650 1.765% 97.480% ( 6600, 9900 ] 11895 1.190% 98.669% ( 9900, 14000 ] 4993 0.499% 99.168% ( 14000, 22000 ] 2384 0.238% 99.407% ( 22000, 33000 ] 1966 0.197% 99.603% ( 50000, 75000 ] 2968 0.297% 99.900% ( 570000, 860000 ] 999 0.100% 100.000% readrandom : 1.967 micros/op 508487 ops/sec 1.967 seconds 1000000 operations; 8.2 MB/s (1000000 of 1000000 found) ``` Perhaps the only good thing to say about the old implementation was that it was fast, though apparently not that fast. I've implemented a much more robust and reasonably fast new version of the function. It's still logarithmic but with some larger constant factors. The standard deviation from true count is around 20% or less, and roughly the CPU cost of two memtable point look-ups. See code comments for detail. ``` $ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=1000 ... filluniquerandom : 1.478 micros/op 676434 ops/sec 1.478 seconds 1000000 operations; 11.0 MB/s approximatememtablestats : 2.694 micros/op 371157 ops/sec 2.694 seconds 1000000 operations; Reported entry count stats (expected 1000): Count: 1000000 Average: 1073.5158 StdDev: 197.80 Min: 608 Median: 1079.9506 Max: 2176 Percentiles: P50: 1079.95 P75: 1223.69 P99: 1852.36 P99.9: 1898.70 P99.99: 2176.00 ------------------------------------------------------ ( 580, 870 ] 134848 13.485% 13.485% ### ( 870, 1300 ] 747868 74.787% 88.272% ############### ( 1300, 1900 ] 116536 11.654% 99.925% ## ( 1900, 2900 ] 748 0.075% 100.000% readrandom : 1.997 micros/op 500654 ops/sec 1.997 seconds 1000000 operations; 8.1 MB/s (1000000 of 1000000 found) ``` We can already see that the distribution of results is dramatically better and wonderfully normal-looking, with relative standard deviation around 20%. The function is also FASTER, at least with these parameters. Let's look how this behavior generalizes, first much larger range: ``` $ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=30000 filluniquerandom : 1.390 micros/op 719654 ops/sec 1.376 seconds 990000 operations; 11.7 MB/s approximatememtablestats : 1.129 micros/op 885649 ops/sec 1.129 seconds 1000000 operations; Reported entry count stats (expected 30000): Count: 1000000 Average: 31098.8795 StdDev: 3601.47 Min: 21504 Median: 29333.9303 Max: 43008 Percentiles: P50: 29333.93 P75: 33018.00 P99: 43008.00 P99.9: 43008.00 P99.99: 43008.00 ------------------------------------------------------ ( 14000, 22000 ] 408 0.041% 0.041% ( 22000, 33000 ] 749327 74.933% 74.974% ############### ( 33000, 50000 ] 250265 25.027% 100.000% ##### readrandom : 1.894 micros/op 528083 ops/sec 1.894 seconds 1000000 operations; 8.5 MB/s (989989 of 1000000 found) ``` This is even faster and relatively more accurate, with relative standard deviation closer to 10%. Code comments explain why. Now let's look at smaller ranges. Implementation quirks or conveniences: * When actual number in range is >= 40, the minimum return value is 40. * When the actual is <= 10, it is guaranteed to return that actual number. ``` $ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=75 ... filluniquerandom : 1.417 micros/op 705668 ops/sec 1.417 seconds 999975 operations; 11.4 MB/s approximatememtablestats : 3.342 micros/op 299197 ops/sec 3.342 seconds 1000000 operations; Reported entry count stats (expected 75): Count: 1000000 Average: 75.1210 StdDev: 15.02 Min: 40 Median: 71.9395 Max: 256 Percentiles: P50: 71.94 P75: 89.69 P99: 119.12 P99.9: 166.68 P99.99: 229.78 ------------------------------------------------------ ( 34, 51 ] 38867 3.887% 3.887% # ( 51, 76 ] 550554 55.055% 58.942% ########### ( 76, 110 ] 398854 39.885% 98.828% ######## ( 110, 170 ] 11353 1.135% 99.963% ( 170, 250 ] 364 0.036% 99.999% ( 250, 380 ] 8 0.001% 100.000% readrandom : 1.861 micros/op 537224 ops/sec 1.861 seconds 1000000 operations; 8.7 MB/s (999974 of 1000000 found) $ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=25 ... filluniquerandom : 1.501 micros/op 666283 ops/sec 1.501 seconds 1000000 operations; 10.8 MB/s approximatememtablestats : 5.118 micros/op 195401 ops/sec 5.118 seconds 1000000 operations; Reported entry count stats (expected 25): Count: 1000000 Average: 26.2392 StdDev: 4.58 Min: 25 Median: 28.4590 Max: 72 Percentiles: P50: 28.46 P75: 31.69 P99: 49.27 P99.9: 67.95 P99.99: 72.00 ------------------------------------------------------ ( 22, 34 ] 928936 92.894% 92.894% ################### ( 34, 51 ] 67960 6.796% 99.690% # ( 51, 76 ] 3104 0.310% 100.000% readrandom : 1.892 micros/op 528595 ops/sec 1.892 seconds 1000000 operations; 8.6 MB/s (1000000 of 1000000 found) $ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=10 ... filluniquerandom : 1.642 micros/op 608916 ops/sec 1.642 seconds 1000000 operations; 9.9 MB/s approximatememtablestats : 3.042 micros/op 328721 ops/sec 3.042 seconds 1000000 operations; Reported entry count stats (expected 10): Count: 1000000 Average: 10.0000 StdDev: 0.00 Min: 10 Median: 10.0000 Max: 10 Percentiles: P50: 10.00 P75: 10.00 P99: 10.00 P99.9: 10.00 P99.99: 10.00 ------------------------------------------------------ ( 6, 10 ] 1000000 100.000% 100.000% #################### readrandom : 1.805 micros/op 554126 ops/sec 1.805 seconds 1000000 operations; 9.0 MB/s (1000000 of 1000000 found) ``` Remarkably consistent. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13047 Test Plan: new db_bench test for both performance and accuracy (see above); added to crash test; unit test updated. Reviewed By: cbi42 Differential Revision: D63722003 Pulled By: pdillinger fbshipit-source-id: cfc8613c085e87c17ecec22d82601aac2a5a1b26	2024-10-02 14:25:50 -07:00
Peter Dillinger	79790cf2a8	Bug fix and test BuildDBOptions (#13038 ) Summary: The following DBOptions were not being propagated through BuildDBOptions, which could at least lead to settings being lost through `GetOptionsFromString()`, possibly elsewhere as well: * background_close_inactive_wals * write_dbid_to_manifest * write_identity_file * prefix_seek_opt_in_only Pull Request resolved: https://github.com/facebook/rocksdb/pull/13038 Test Plan: This problem was not being caught by OptionsSettableTest.DBOptionsAllFieldsSettable when the option was omitted from both options_helper.cc and options_settable_test.cc. I have added to the test to catch future instances (and the updated test was how I found three of the four missing options). The same kind of bug seems to be caught by ColumnFamilyOptionsAllFieldsSettable, and AFAIK analogous code does not exist for BlockBasedTableOptions. Reviewed By: ltamasi Differential Revision: D63483779 Pulled By: pdillinger fbshipit-source-id: a5d5f6e434174bacb8e5d251b767e81e62b7225a	2024-09-26 14:36:29 -07:00
anand76	22105366d9	More accurate accounting of compressed cache memory (#13032 ) Summary: When an item is inserted into the compressed secondary cache, this PR calculates the charge using the malloc_usable_size of the allocated memory, as well as the unique pointer allocation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13032 Test Plan: New unit test Reviewed By: pdillinger Differential Revision: D63418493 Pulled By: anand1976 fbshipit-source-id: 1db2835af6867442bb8cf6d9bf412e120ddd3824	2024-09-25 17:47:40 -07:00
anand76	d02f63cc54	Respect lowest_used_cache_tier option for compressed blocks (#13030 ) Summary: If the lowest_used_cache_tier DB option is set to kVolatileTier, skip insertion of compressed blocks into the secondary cache. Previously, these were always inserted into the secondary cache via the InsertSaved() method, leading to pollution of the secondary cache with blocks that would never be read. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13030 Test Plan: Add a new unit test Reviewed By: pdillinger Differential Revision: D63329841 Pulled By: anand1976 fbshipit-source-id: 14d2fce2ed309401d9ad4d2e7c356218b6673f7b	2024-09-25 11:45:51 -07:00
Levi Tamasi	fbbb08770f	Update HISTORY.md, version.h, and the format compatibility check script for the 9.7 release (#13027 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13027 Reviewed By: jowlyzhang Differential Revision: D63158344 Pulled By: ltamasi fbshipit-source-id: e650a0024155d52c7aa2afd0f242b8363071279a	2024-09-20 19:19:06 -07:00
Peter Dillinger	a1a102ffce	Steps toward deprecating implicit prefix seek, related fixes (#13026 ) Summary: With some new use cases onboarding to prefix extractors/seek/filters, one of the risks is existing iterator code, e.g. for maintenance tasks, being unintentionally subject to prefix seek semantics. This is a longstanding known design flaw with prefix seek, and `prefix_same_as_start` and `auto_prefix_mode` were steps in the direction of making that obsolete. However, we can't just immediately set `total_order_seek` to true by default, because that would impact so much code instantly. Here we add a new DB option, `prefix_seek_opt_in_only` that basically allows users to transition to the future behavior when they are ready. When set to true, all iterators will be treated as if `total_order_seek=true` and then the only ways to get prefix seek semantics are with `prefix_same_as_start` or `auto_prefix_mode`. Related fixes / changes: * Make sure that `prefix_same_as_start` and `auto_prefix_mode` are compatible with (or override) `total_order_seek` (depending on your interpretation). * Fix a bug in which a new iterator after dynamically changing the prefix extractor might mix different prefix semantics between memtable and SSTs. Both should use the latest extractor semantics, which means iterators ignoring memtable prefix filters with an old extractor. And that means passing the latest prefix extractor to new memtable iterators that might use prefix seek. (Without the fix, the test added for this fails in many ways.) Suggested follow-up: * Investigate a FIXME where a MergeIteratorBuilder is created in db_impl.cc. No unit test detects a change in value that should impact correctness. * Make memtable prefix bloom compatible with `auto_prefix_mode`, which might require involving the memtablereps because we don't know at iterator creation time (only seek time) whether an auto_prefix_mode seek will be a prefix seek. * Add `prefix_same_as_start` testing to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/13026 Test Plan: tests updated, added. Add combination of `total_order_seek=true` and `auto_prefix_mode=true` to stress test. Ran `make blackbox_crash_test` for a long while. Manually ran tests with `prefix_seek_opt_in_only=true` as default, looking for unexpected issues. I inspected most of the results and migrated many tests to be ready for such a change (but not all). Reviewed By: ltamasi Differential Revision: D63147378 Pulled By: pdillinger fbshipit-source-id: 1f4477b730683d43b4be7e933338583702d3c25e	2024-09-20 15:54:19 -07:00
Changyu Bi	f97e33454f	Fix a bug with auto recovery on WAL write error (#12995 ) Summary: A recent crash test failure shows that auto recovery from WAL write failure can cause CFs to be inconsistent. A unit test repro in P1569398553. The following is an example sequence of events: ``` 0. manual_wal_flush is true. There are multiple CFs in a DB. 1. Submit a write batch with updates to multiple CF 2. A FlushWAL or a memtable swtich that will try to write the buffered WAL data. Fail this write so that buffered WAL data is dropped: `4b1d595306/file/writable_file_writer.cc (L624)` The error needs to be retryable to start background auto recovery. 3. One CF successfully flushes its memtable during auto recovery. 4. Crash the process. 5. Reopen the DB, one CF will have the update as a result of successful flush. Other CFs will miss all the updates in the write batch since WAL does not have them. ``` This can happen if a users configures manual_wal_flush, uses more than one CF, and can hit retryable error for WAL writes. This PR is a short-term fix that upgrades WAL related errors to fatal and not trigger auto recovery. A long-term fix may be not drop buffered WAL data by checking how much data is actually written, or require atomically flushing all column families during error recovery from this kind of errors. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12995 Test Plan: added unit test to check error severity and if recovery is triggered. A crash test repro command that fails in a few runs before this PR: ``` python3 ./tools/db_crashtest.py blackbox --interval=60 --metadata_write_fault_one_in=1000 --column_families=10 --exclude_wal_from_write_fault_injection=0 --manual_wal_flush_one_in=1000 --WAL_size_limit_MB=10240 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=1 --allow_data_in_errors=True --allow_fallocate=1 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=100 --block_align=1 --block_protection_bytes_per_key=0 --block_size=16384 --bloom_before_level=2147483647 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_readahead_size=1048576 --compaction_ttl=0 --compress_format_version=1 --compressed_secondary_cache_size=8388608 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=4 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0 --db_write_buffer_size=0 --decouple_partitioned_filters=1 --default_temperature=kCold --default_write_temperature=kWarm --delete_obsolete_files_period_micros=30000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=1000000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=1 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=0 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=1 --error_recovery_with_no_fault_injection=1 --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --fill_cache=1 --flush_one_in=1000000 --format_version=6 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --index_block_restart_interval=4 --index_shortening=1 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=10000 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=0 --lowest_used_cache_tier=2 --manifest_preallocation_size=5120 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=16 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=2097152 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=0 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=100 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --optimize_filters_for_hits=0 --optimize_filters_for_memory=0 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --paranoid_memory_checks=0 --partition_filters=0 --partition_pinning=2 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=524288 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=10000 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=10000 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=1048576 --sqfc_name=bar --sqfc_version=1 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=2 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=6 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --uncache_aggressiveness=8 --universal_max_read_amp=-1 --unpartitioned_pinning=2 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=0 --use_attribute_group=1 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_cf_iterator=1 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=1 --use_sqfc_for_range_queries=0 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=1 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=4194304 --write_dbid_to_manifest=0 --write_fault_one_in=50 --writepercent=35 --ops_per_thread=100000 --preserve_unverified_changes=1 ``` Reviewed By: hx235 Differential Revision: D62888510 Pulled By: cbi42 fbshipit-source-id: 308bdbbb8d897cc8eba950155cd0e37cf7eb76fe	2024-09-17 14:10:33 -07:00
Changyu Bi	e490f2b051	Fix a bug in ReFitLevel() where `FileMetaData::being_compacted` is not cleared (#13009 ) Summary: in ReFitLevel(), we were not setting being_compacted to false after ReFitLevel() is done. This is not a issue if refit level is successful, since new FileMetaData is created for files at the target level. However, if there's an error during RefitLevel(), e.g., Manifest write failure, we should clear the being_compacted field for these files. Otherwise, these files will not be picked for compaction until db reopen. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13009 Test Plan: existing test. - stress test failure in T200339331 should not happen anymore. Reviewed By: hx235 Differential Revision: D62597169 Pulled By: cbi42 fbshipit-source-id: 0ba659806da6d6d4b42384fc95268b2d7bad720e	2024-09-12 15:19:14 -07:00
Jay Huh	5b8f5cbcf4	Update main branch for 9.6 release (#12945 ) Summary: Main branch cut at `defd97bc9`. Updated HISTORY.md, version and format compatibility test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12945 Test Plan: CI Reviewed By: jowlyzhang Differential Revision: D61482149 Pulled By: jaykorean fbshipit-source-id: 4edf7c0a8c6e4df8fcc938bc778dfd02981d0c55	2024-08-19 17:36:23 -07:00
Jay Huh	b99baa5ae7	Do not add unprep_seqs when WriteImpl() fails in unprepared txn (#12927 ) Summary: With the following scenario, we see assertion failure in write_unprepared_txn 1. The write is the first one in the transaction (`log_number_` is not set yet - https://github.com/facebook/rocksdb/blob/main/utilities/transactions/write_unprepared_txn.cc#L376-L379) 2. `WriteToWAL()` fails inside `db_impl_->WriteImpl()` due to fault injection. `last_log_number_`is 0. https://github.com/facebook/rocksdb/blob/main/utilities/transactions/write_unprepared_txn.cc#L386 3. `prepare_batch_cnt_` is still added to `unprep_seqs_` https://github.com/facebook/rocksdb/blob/main/utilities/transactions/write_unprepared_txn.cc#L395-L398 4. When the transaction gets destructed after failed, it expects `log_number_ > 0` if `unprep_seqs_` is not empty - https://github.com/facebook/rocksdb/blob/main/utilities/transactions/write_unprepared_txn.cc#L54-L55. Step 3 is wrong. `unprep_seqs_` is for the ordered list of unprep sequence numbers that we have already written to the DB. If `db_impl_->WriteImpl()` failed, `unprep_seqs_` shouldn't be set. `prepare_seq` value would be wrong anyway. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12927 Test Plan: Following command steadily repros the issue. This no longer fails after the fix ``` ./db_stress \ --WAL_size_limit_MB=0 \ --WAL_ttl_seconds=60 \ --acquire_snapshot_one_in=10000 \ --adaptive_readahead=1 \ --adm_policy=2 \ --advise_random_on_open=0 \ --allow_data_in_errors=True \ --allow_fallocate=1 \ --async_io=1 \ --auto_readahead_size=0 \ --avoid_flush_during_recovery=0 \ --avoid_flush_during_shutdown=1 \ --avoid_unnecessary_blocking_io=0 \ --backup_max_size=104857600 \ --backup_one_in=100000 \ --batch_protection_bytes_per_key=8 \ --bgerror_resume_retry_interval=100 \ --block_align=1 \ --block_protection_bytes_per_key=8 \ --block_size=16384 \ --bloom_before_level=7 \ --bloom_bits=4.1280979878467345 \ --bottommost_compression_type=none \ --bottommost_file_compaction_delay=600 \ --bytes_per_sync=262144 \ --cache_index_and_filter_blocks=1 \ --cache_index_and_filter_blocks_with_high_priority=0 \ --cache_size=33554432 \ --cache_type=lru_cache \ --charge_compression_dictionary_building_buffer=1 \ --charge_file_metadata=1 \ --charge_filter_construction=1 \ --charge_table_reader=1 \ --check_multiget_consistency=0 \ --check_multiget_entity_consistency=1 \ --checkpoint_one_in=0 \ --checksum_type=kXXH3 \ --clear_column_family_one_in=0 \ --compact_files_one_in=1000000 \ --compact_range_one_in=1000000 \ --compaction_pri=4 \ --compaction_readahead_size=1048576 \ --compaction_ttl=0 \ --compress_format_version=1 \ --compressed_secondary_cache_ratio=0.0 \ --compressed_secondary_cache_size=0 \ --compression_checksum=1 \ --compression_max_dict_buffer_bytes=4294967295 \ --compression_max_dict_bytes=16384 \ --compression_parallel_threads=8 \ --compression_type=none \ --compression_use_zstd_dict_trainer=1 \ --compression_zstd_max_train_bytes=0 \ --continuous_verification_interval=0 \ --create_timestamped_snapshot_one_in=0 \ --daily_offpeak_time_utc=04:00-08:00 \ --data_block_index_type=0 \ --db=$db \ --db_write_buffer_size=8388608 \ --default_temperature=kHot \ --default_write_temperature=kUnknown \ --delete_obsolete_files_period_micros=21600000000 \ --delpercent=5 \ --delrangepercent=0 \ --destroy_db_initially=0 \ --detect_filter_construct_corruption=0 \ --disable_file_deletions_one_in=10000 \ --disable_manual_compaction_one_in=10000 \ --disable_wal=0 \ --dump_malloc_stats=1 \ --enable_checksum_handoff=1 \ --enable_compaction_filter=0 \ --enable_custom_split_merge=1 \ --enable_do_not_compress_roles=0 \ --enable_index_compression=1 \ --enable_memtable_insert_with_hint_prefix_extractor=0 \ --enable_pipelined_write=0 \ --enable_sst_partitioner_factory=0 \ --enable_thread_tracking=1 \ --enable_write_thread_adaptive_yield=1 \ --error_recovery_with_no_fault_injection=1 \ --exclude_wal_from_write_fault_injection=0 \ --expected_values_dir=$exp \ --fail_if_options_file_error=0 \ --fifo_allow_compaction=1 \ --file_checksum_impl=crc32c \ --fill_cache=1 \ --flush_one_in=1000 \ --format_version=2 \ --get_all_column_family_metadata_one_in=1000000 \ --get_current_wal_file_one_in=0 \ --get_live_files_apis_one_in=1000000 \ --get_properties_of_all_tables_one_in=100000 \ --get_property_one_in=1000000 \ --get_sorted_wal_files_one_in=0 \ --hard_pending_compaction_bytes_limit=274877906944 \ --high_pri_pool_ratio=0 \ --index_block_restart_interval=15 \ --index_shortening=0 \ --index_type=2 \ --ingest_external_file_one_in=0 \ --initial_auto_readahead_size=524288 \ --inplace_update_support=0 \ --iterpercent=10 \ --key_len_percent_dist=1,30,69 \ --key_may_exist_one_in=100000 \ --kill_random_test=888887 \ --last_level_temperature=kCold \ --level_compaction_dynamic_level_bytes=1 \ --lock_wal_one_in=1000000 \ --log2_keys_per_lock=10 \ --log_file_time_to_roll=0 \ --log_readahead_size=0 \ --long_running_snapshots=1 \ --low_pri_pool_ratio=0 \ --lowest_used_cache_tier=1 \ --manifest_preallocation_size=5120 \ --manual_wal_flush_one_in=0 \ --mark_for_compaction_one_file_in=10 \ --max_auto_readahead_size=0 \ --max_background_compactions=20 \ --max_bytes_for_level_base=10485760 \ --max_key=100000 \ --max_key_len=3 \ --max_log_file_size=1048576 \ --max_manifest_file_size=1073741824 \ --max_sequential_skip_in_iterations=2 \ --max_total_wal_size=0 \ --max_write_batch_group_size_bytes=16 \ --max_write_buffer_number=10 \ --max_write_buffer_size_to_maintain=0 \ --memtable_insert_hint_per_batch=0 \ --memtable_max_range_deletions=1000 \ --memtable_prefix_bloom_size_ratio=0.1 \ --memtable_protection_bytes_per_key=8 \ --memtable_whole_key_filtering=1 \ --memtablerep=skip_list \ --metadata_charge_policy=0 \ --metadata_read_fault_one_in=32 \ --metadata_write_fault_one_in=0 \ --min_write_buffer_number_to_merge=2 \ --mmap_read=1 \ --mock_direct_io=False \ --nooverwritepercent=1 \ --num_file_reads_for_auto_readahead=2 \ --open_files=-1 \ --open_metadata_read_fault_one_in=0 \ --open_metadata_write_fault_one_in=0 \ --open_read_fault_one_in=0 \ --open_write_fault_one_in=0 \ --ops_per_thread=20000000 \ --optimize_filters_for_hits=1 \ --optimize_filters_for_memory=0 \ --optimize_multiget_for_io=0 \ --paranoid_file_checks=1 \ --partition_filters=0 \ --partition_pinning=0 \ --pause_background_one_in=1000000 \ --periodic_compaction_seconds=1 \ --prefix_size=5 \ --prefixpercent=5 \ --prepopulate_block_cache=0 \ --preserve_internal_time_seconds=60 \ --progress_reports=0 \ --promote_l0_one_in=0 \ --read_amp_bytes_per_bit=32 \ --read_fault_one_in=1000 \ --readahead_size=0 \ --readpercent=45 \ --recycle_log_file_num=1 \ --reopen=20 \ --report_bg_io_stats=1 \ --reset_stats_one_in=10000 \ --sample_for_compression=0 \ --secondary_cache_fault_one_in=0 \ --sync=0 \ --sync_fault_injection=0 \ --table_cache_numshardbits=6 \ --target_file_size_base=524288 \ --target_file_size_multiplier=2 \ --test_batches_snapshots=0 \ --top_level_index_pinning=0 \ --txn_write_policy=2 \ --uncache_aggressiveness=126 \ --universal_max_read_amp=4 \ --unordered_write=0 \ --unpartitioned_pinning=1 \ --use_adaptive_mutex=0 \ --use_adaptive_mutex_lru=1 \ --use_attribute_group=1 \ --use_delta_encoding=1 \ --use_direct_io_for_flush_and_compaction=0 \ --use_direct_reads=0 \ --use_full_merge_v1=0 \ --use_get_entity=0 \ --use_merge=1 \ --use_multi_cf_iterator=0 \ --use_multi_get_entity=0 \ --use_multiget=1 \ --use_optimistic_txn=0 \ --use_put_entity_one_in=0 \ --use_timed_put_one_in=0 \ --use_txn=1 \ --use_write_buffer_manager=0 \ --user_timestamp_size=0 \ --value_size_mult=32 \ --verification_only=0 \ --verify_checksum=1 \ --verify_checksum_one_in=1000000 \ --verify_compression=0 \ --verify_db_one_in=10000 \ --verify_file_checksums_one_in=1000000 \ --verify_iterator_with_expected_state_one_in=5 \ --verify_sst_unique_id_in_manifest=1 \ --wal_bytes_per_sync=0 \ --wal_compression=none \ --write_buffer_size=4194304 \ --write_dbid_to_manifest=1 \ --write_fault_one_in=128 \ --writepercent=35 ``` Reviewed By: cbi42 Differential Revision: D61048774 Pulled By: jaykorean fbshipit-source-id: 22200d55fd0b22b68732b12516e681a6c6e2c601	2024-08-15 09:16:29 -07:00
Changyu Bi	b32d899482	Fix MultiGet dropping memtable kv checksum corruption (#12842 ) Summary: Corruption status returned by `GetFromTable()` could be overwritten here: `b6c3495a71/db/version_set.cc (L2614)` This PR fixes this issue by setting `*(s->found_final_value) = true;` in SaveValue. Also makes the handling of the return value of `GetFromTable()` more robust and added asserts in a couple places. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12842 Test Plan: Updated an existing unit test to cover MultiGet. It fails the assertion here before this PR: `b6c3495a71/db/version_set.cc (L2601)` Reviewed By: anand1976 Differential Revision: D59498203 Pulled By: cbi42 fbshipit-source-id: 1f071c1b2c5b66fb71264b547a9e670d1cf592f0	2024-08-08 13:34:11 -07:00
Yu Zhang	d12aaf23ca	Fix file deletions in DestroyDB not rate limited (#12891 ) Summary: Make `DestroyDB` slowly delete files if it's configured and enabled via `SstFileManager`. It's currently not available mainly because of DeleteScheduler's logic related to tracked total_size_ and total_trash_size_. These accounting and logic should not be applied to `DestroyDB`. This PR adds a `DeleteUnaccountedDBFile` util for this purpose which deletes files without accounting it. This util also supports assigning a file to a specified trash bucket so that user can later wait for a specific trash bucket to be empty. For `DestroyDB`, files with more than 1 hard links will be deleted immediately. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12891 Test Plan: Added unit tests, existing tests. Reviewed By: anand1976 Differential Revision: D60300220 Pulled By: jowlyzhang fbshipit-source-id: 8b18109a177a3a9532f6dc2e40e08310c08ca3c7	2024-08-02 19:31:55 -07:00
Levi Tamasi	2e8a1a14ef	Fix a data race affecting the background error status (#12910 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12910 There is currently a call to `GetBGError()` in `DBImpl::WriteImplWALOnly()` where the DB mutex is (incorrectly) not held, leading to a data race. Technically, we could acquire the mutex here but instead, the patch removes the affected check altogether, since the same check is already performed (in a thread-safe manner) in the subsequent call to `PreprocessWrite()`. Reviewed By: cbi42 Differential Revision: D60682008 fbshipit-source-id: 54b67975dcf57d67c068cac71e8ada09a1793ec5	2024-08-02 14:11:08 -07:00
anand76	55877d8893	Make transaction name conflict check more robust (#12895 ) Summary: The `PessimisticTransaction::SetName()` code checks for an existing txn of the given name before registering the new txn. However, this is not atomic, which could result in a race condition if two txns try to register with the same name. Both might succeed and lead to unpredictable behavior. This PR makes the test and set atomic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12895 Reviewed By: pdillinger Differential Revision: D60460482 Pulled By: anand1976 fbshipit-source-id: e8afeb2356e1b8f4e8df785cb73532739f82579d	2024-07-30 12:31:02 -07:00
Yu Zhang	d94c2adc28	Add entry for bug fix in #12882 (#12892 ) Summary: As titled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12892 Reviewed By: hx235 Differential Revision: D60400651 Pulled By: jowlyzhang fbshipit-source-id: 2dd60c2287143f464ecab0de859715af6ab3a825	2024-07-29 12:20:50 -07:00
Hui Xiao	15d9988ab2	Update history and version for 9.5.fb release (#12880 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12880 Reviewed By: jaykorean, jowlyzhang Differential Revision: D60057955 Pulled By: hx235 fbshipit-source-id: 1c599a5334aff1f424bb473275efe4349b17d41d	2024-07-22 13:15:09 -07:00
Hui Xiao	349b1ec08f	Fix duplicate WAL entries caused by write after error recovery (#12873 ) Summary: Context/Summary: We recently discovered a case where write of the same key right after error recovery of a previous failed write of the same key finishes causes two same WAL entries, violating our assertion. This is because we don't advance seqno on failed write and reuse the same WAL containing the failed write for the new write if the memtable at the time is empty. This PR reuses the flush path for an empty memtable to switch WAL and update min WAL to keep in error recovery flush as well as updates the INFO log message for clarity. ``` 2024/07/17-15:01:32.271789 327757 (Original Log Time 2024/07/17-15:01:25.942234) [/flush_job.cc:1017] [default] [JOB 2] Level-0 flush table https://github.com/facebook/rocksdb/issues/9: 0 bytes OK It's an empty SST file from a successful flush so won't be kept in the DB 2024/07/17-15:01:32.271798 327757 (Original Log Time 2024/07/17-15:01:32.269954) [/memtable_list.cc:560] [default] Level-0 commit flush result of table https://github.com/facebook/rocksdb/issues/9 started 2024/07/17-15:01:32.271802 327757 (Original Log Time 2024/07/17-15:01:32.271217) [/memtable_list.cc:760] [default] Level-0 commit flush result of table https://github.com/facebook/rocksdb/issues/9: memtable https://github.com/facebook/rocksdb/issues/1 done ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12873 Test Plan: New UT that failed before this PR with following assertion failure (i.e, duplicate WAL entries) and passes after ``` db_wal_test: db/write_batch.cc:2254: rocksdb::Status rocksdb::{anonymous}::MemTableInserter::PutCFImpl(uint32_t, const rocksdb::Slice&, const rocksdb::Slice&, rocksdb::ValueType, RebuildTxnOp, const ProtectionInfoKVOS64) [with RebuildTxnOp = rocksdb::{anonymous}::MemTableInserter::PutCF(uint32_t, const rocksdb::Slice&, const rocksdb::Slice&)::<lambda(rocksdb::WriteBatch, uint32_t, const rocksdb::Slice&, const rocksdb::Slice&)>; uint32_t = unsigned int; rocksdb::ProtectionInfoKVOS64 = rocksdb::ProtectionInfoKVOS<long unsigned int>]: Assertion `seq_per_batch_' failed. ``` Reviewed By: anand1976 Differential Revision: D59884468 Pulled By: hx235 fbshipit-source-id: 5d854b719092552c69727a979f269fb7f6c39756	2024-07-22 12:40:25 -07:00
anand76	0fca5e31b4	Fix race between manifest error recovery and file ingestion (#12871 ) Summary: This PR fixes an assertion failure in `DBImpl::ResumeImpl` - `assert(!versions_->descriptor_log_)`. In `VersionSet`, `descriptor_log_` has a pointer to the current MANIFEST writer. When there's an error updating the manifest, `descriptor_log_` is reset, and the error recovery thread checks `io_status()` in `VersionSet` and attempts to write a new MANIFEST. If another DB manipulation happens at the same time (like external file ingestion, column family manipulation etc), it calls `LogAndApply`, which also attempts to write a new MANIFEST. The assertion in `ResumeImpl` might fail in this case since the other MANIFEST writer may have updated `descriptor_log_`. To prevent the assertion, this fix updates both `io_status_` and `descriptor_log_` while holding the DB mutex. The other option would have been to simply remove the assert. But I think its important to have it to ensure the invariant that `io_status_` is cleared if the MANIFEST is written successfully, and this fix makes things easier to reason about. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12871 Test Plan: Existing tests and crash test Reviewed By: hx235 Differential Revision: D59926947 Pulled By: anand1976 fbshipit-source-id: af9ad18da3e29fc62c7ec2e30e0738aa33d4e5f1	2024-07-19 10:37:51 -07:00
Peter Dillinger	93b163d1a2	Fix major bug with prefixes, SeekForPrev, and partitioned filters (#12872 ) Summary: Basically, the fix in https://github.com/facebook/rocksdb/issues/8137 was incomplete (and I missed it in the review), because if `whole_key_filtering` is false, then `last_prefix_str_` will never be set to non-empty and the fix doesn't work. Also related to https://github.com/facebook/rocksdb/issues/5835. This is intended as a safe, simple fix that will regress CPU efficiency slightly (for `whole_key_filtering=false` cases, because of extra prefix string copies during flush & compaction). An efficient fix is not possible without some substantial refactoring. Also in this PR: new test DBBloomFilterTest.FilterNumEntriesCoalesce tests an adjacent code path that was previously untested for its effect of ensuring the number of unique prefixes and keys is tracked properly when both prefixes and whole keys are going into a filter. (Test fails when either of the two code segments checking for duplicates is disabled.) In addition, the same test would fail before the main bug fix here because the code would inappropriately add the empty string to the filter (because of unmodified `last_prefix_str_`). Pull Request resolved: https://github.com/facebook/rocksdb/pull/12872 Test Plan: In addition to DBBloomFilterTest.FilterNumEntriesCoalesce, extended DBBloomFilterTest.SeekForPrevWithPartitionedFilters to cover the broken case. (Mostly whitespace change.) Reviewed By: jowlyzhang Differential Revision: D59873793 Pulled By: pdillinger fbshipit-source-id: 2a7b7f09ca73dc188fb4dab833826ad6da7ebb11	2024-07-17 14:08:35 -07:00
anand76	5aa675457e	Fix unhandled MANIFEST write errors (#12865 ) Summary: The failure of `WriteCurrentStateToManifest()` in `VersionSet::ProcessManifestWrites()` was not handled properly. If it failed, `manifest_io_status` was not updated, leading to `manifest_file_number_` being updated to the newly created manifest even though its bad. This would lead to the bad manifest immediately getting deleted, and also the good manifest (referenced by `CURRENT`) getting deleted by obsolete file deletion because of `manifest_file_number_` not referencing its number. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12865 Reviewed By: hx235 Differential Revision: D59782940 Pulled By: anand1976 fbshipit-source-id: f752fb9a1c23fd3d734616e273613cbac204301b	2024-07-15 19:13:29 -07:00
Hui Xiao	4ff35afb42	Fix a bug where `OnErrorRecoveryBegin()` is not called before auto-recovery (#12860 ) Summary: Context/Summary: `*auto_recovery` needs to be set true in order for `OnErrorRecoveryBegin()` to be called before auto-recovery `3db030d7ee/db/event_helpers.cc (L64-L66)` Currently it's set false for auto-recovery. This PR fixes it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12860 Test Plan: - Manual observation that it is called - Existing UT Reviewed By: jowlyzhang Differential Revision: D59693315 Pulled By: hx235 fbshipit-source-id: 3f428c5b1e9818bb7697fdcd7f245d11378eb14a	2024-07-15 17:00:14 -07:00
Yu Zhang	ff204d8ecd	Add entry for #12803 : fix race between event listener and error handler (#12809 ) Summary: As titled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12809 Reviewed By: hx235 Differential Revision: D58974154 Pulled By: jowlyzhang fbshipit-source-id: 7e44b54d9fa3bfbd58a4154a2c7e91aec905c34b	2024-06-24 17:01:41 -07:00
Changyu Bi	748f74aca3	Update main branch for 9.4 release (#12802 ) Summary: Main branch cut at `e90e9153d5`. Updated HISTORY.md, version and format compatibility test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12802 Reviewed By: ajkr Differential Revision: D58956464 Pulled By: cbi42 fbshipit-source-id: 50d786c145cebf93d1dd554b1b0e26baac3cc88c	2024-06-24 11:53:05 -07:00
Peter Dillinger	3abcba8470	Propagate more ReadOptions to ApproximateOffsetOf/Sizes (#12764 ) Summary: Unknown why these would ignore options like deadline and read_tier. Setting total_order_seek=true is unnecessary because of the disable_prefix_seek (= true) parameter to NewIndexIterator. This is only used by the hash index, which uses total order seek if either the ReadOption or the parameter is true. The parameter is arguably redundant with the total_order_seek option, meaning it could be eliminated, but I think this case is exceptional (compared to e.g. no_io): * Prefix seek is particular to user iterators, though might be usable, carefully, for other read operations. * The historical default of total_order_seek=false in a sense is "wrong result by default" so cannot be interpreted as an intent to force prefix seek in an operation for which it might be usual or give bad results. Also added a generic release note to cover this and related PRs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12764 Test Plan: existing tests Reviewed By: hx235 Differential Revision: D58474240 Pulled By: pdillinger fbshipit-source-id: 79014d9822ba8f09d57ce4524363aa0973017b68	2024-06-12 16:25:47 -07:00
Peter Dillinger	98393f0139	Fix Checkpoint hard link of inactive but unsynced WAL (#12731 ) Summary: Background: there is one active WAL file but there can be several more WAL files in various states. Those other WALs are always in a "flushed" state but could be on the `logs_` list not yet fully synced. We currently allow any WAL that is not the active WAL to be hard-linked when creating a Checkpoint, as although it might still be open for write, we are not appending any more data to it. The problem is that a created Checkpoint is supposed to be fully synced on return of that function, and a hard-linked WAL in the state described above might not be fully synced. (Through some prudence in https://github.com/facebook/rocksdb/issues/10083, it would synced if using track_and_verify_wals_in_manifest=true.) The fix is a step toward a long term goal of removing the need to query the filesystem to determine WAL files and their state. (I consider it dubious any time we independently read from or query metadata from a file we have open for writing, as this makes us more susceptible to FileSystem deficiencies or races.) More specifically: * Detect which WALs might not be fully synced, according to our DBImpl metadata, and prevent hard linking those (with `trim_to_size=true` from `GetLiveFilesStorageInfo()`. And while we're at it, use our known flushed sizes for those WALs. * To avoid a race between that and GetSortedWalFiles(), track a maximum needed WAL number for the Checkpoint/GetLiveFilesStorageInfo. * Because of the level of consistency provided by those two, we no longer need to consider syncing as part of the FlushWAL in GetLiveFilesStorageInfo. (We determine the max WAL number consistent with the manifest file size, while holding DB mutex. Should make track_and_verify_wals_in_manifest happy.) This makes the premise of test PutRaceWithCheckpointTrackedWalSync obsolete (sync point callback no longer hit) so the test is removed, with crash test as backstop for related issues. See https://github.com/facebook/rocksdb/issues/10185 Stacked on https://github.com/facebook/rocksdb/issues/12729 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12731 Test Plan: Expanded an existing test, which now fails before fix. Also long runs of blackbox_crash_test with amplified checkpoint frequency. Reviewed By: cbi42 Differential Revision: D58199629 Pulled By: pdillinger fbshipit-source-id: 376e55f4a2b082cd2adb6408a41209de14422382	2024-06-05 17:40:09 -07:00
Levi Tamasi	b9e82f5162	Mention two fixes (#12677 and #12681 ) in the changelog (#12730 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12730 Reviewed By: jowlyzhang Differential Revision: D58092079 fbshipit-source-id: 708437e705f9d9f770c9ba1a1a9b3c369b0a4b79	2024-06-03 10:50:41 -07:00
Andrew Kryczka	c3ae569792	Update the main branch for the 9.3 release (#12726 ) Summary: Cut the 9.3.fb branch as of 5/17 11:59pm. Also, cherry-picked all bug fixes that have happened since then. Removed their files from unreleased_history/ since those fixes will appear in 9.3.0, so there seems no use repeating them in any later release. Release branch: https://github.com/facebook/rocksdb/tree/9.3.fb Tests: https://github.com/facebook/rocksdb/actions/runs/9342097111 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12726 Reviewed By: ltamasi Differential Revision: D58069263 Pulled By: ajkr fbshipit-source-id: c4f557bc8dbc20ce53021ac7e97a24f930542bf9	2024-06-02 22:10:24 -07:00
anand76	0ae3d9f98d	Fix stale memory access with FSBuffer and tiered sec cache (#12712 ) Summary: A `BlockBasedTable` with `TieredSecondaryCache` containing a NVM cache inserts blocks into the compressed cache and the corresponding compressed block into the NVM cache. The `BlockFetcher` is used to get the uncompressed and compressed blocks by calling `ReadBlockContents()` and `GetUncompressedBlock()` respectively. If the file system supports FSBuffer (i.e returning a FS allocated buffer rather than caller provided), that buffer gets freed between the two calls. This PR fixes it by making the FSBuffer unique pointer a member rather than local variable. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12712 Test Plan: 1. Add a unit test 2. Release validation stress test Reviewed By: jaykorean Differential Revision: D57974026 Pulled By: anand1976 fbshipit-source-id: cfa895914e74b4f628413b40e6e39d8d8e5286bd	2024-05-30 12:33:58 -07:00
Andrew Kryczka	c72ee4531b	Fix recycled WAL detection when wal_compression is enabled (#12643 ) Summary: I think the point of the `if (end_of_buffer_offset_ - buffer_.size() == 0)` was to only set `recycled_` when the first record was read. However, the condition was false when reading the first record when the WAL began with a `kSetCompressionType` record because we had already dropped the `kSetCompressionType` record from `buffer_`. To fix this, I used `first_record_read_` instead. Also, it was pretty confusing to treat the WAL as non-recycled when a recyclable record first appeared in a non-first record. I changed it to return an error if that happens. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12643 Reviewed By: hx235 Differential Revision: D57238099 Pulled By: ajkr fbshipit-source-id: e20a2a0c9cf0c9510a7b6af463650a05d559239e	2024-05-22 15:34:37 -07:00
Peter Dillinger	d89ab23bec	Disallow memtable flush and sst ingest while WAL is locked (#12652 ) Summary: We recently noticed that some memtable flushed and file ingestions could proceed during LockWAL, in violation of its stated contract. (Note: we aren't 100% sure its actually needed by MySQL, but we want it to be in a clean state nonetheless.) Despite earlier skepticism that this could be done safely (https://github.com/facebook/rocksdb/issues/12666), I found a place to wait to wait for LockWAL to be cleared before allowing these operations to proceed: WaitForPendingWrites() Pull Request resolved: https://github.com/facebook/rocksdb/pull/12652 Test Plan: Added to unit tests. Extended how db_stress validates LockWAL and re-enabled combination of ingestion and LockWAL in crash test, in follow-up to https://github.com/facebook/rocksdb/issues/12642 Ran blackbox_crash_test for a long while with relevant features amplified. Suggested follow-up: fix FaultInjectionTestFS to report file sizes consistent with what the user has requested to be flushed. Reviewed By: jowlyzhang Differential Revision: D57622142 Pulled By: pdillinger fbshipit-source-id: aef265fce69465618974b4ec47f4636257c676ce	2024-05-21 10:17:34 -07:00
Hui Xiao	f910a0c025	Fix unreleased bug fix .md name (#12672 ) Summary: Context/Summary: as above Pull Request resolved: https://github.com/facebook/rocksdb/pull/12672 Test Plan: no code change Reviewed By: ajkr Differential Revision: D57505136 Pulled By: hx235 fbshipit-source-id: 0e216dc5974e9be10027b444eb6b4034f679dfd8	2024-05-20 09:41:11 -07:00
Hui Xiao	20213d01a3	Fix crash in CompactFiles() of conflict range under `preclude_last_level_data_seconds > 0` (#12628 ) Summary: Context/Summary: Previously `CompactFiles()` used `RangeOverlapWithCompaction()` to check for conflict when sanitizing input files while later used `FilesRangeOverlapWithCompaction()` to assert for no conflict. The latter function checks for more conflict scenarios than the former does, particularly the ones arising from `preclude_last_level_data_seconds > 0` (i.e, compaction can output to second-to-the-last level). So we ran into assertion violation in `CompactFiles()` like below ``` Assertion `output_level == 0 \|\| !FilesRangeOverlapWithCompaction( input_files, output_level, Compaction::EvaluatePenultimateLevel(vstorage, ioptions_, start_level, output_level))' failed. ``` This PR make `CompactFiles()` used `FilesRangeOverlapWithCompaction()` and return Aborted status upon range conflict instead of crashing (during debug build) or proceed incorrectly (during non-debug build). To do so cleanly, I included a refactoring to make `FilesRangeOverlapWithCompaction()` part of `SanitizeAndConvertCompactionInputFiles()`, replacing `RangeOverlapWithCompaction()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12628 Test Plan: New UT crashed before the fix and return correct status after the fix. Reviewed By: cbi42 Differential Revision: D57123536 Pulled By: hx235 fbshipit-source-id: f963a2c9e7ba1a9927a67fcc87f0dce126d3a430	2024-05-13 13:12:06 -07:00
Andrew Kryczka	933ac0e05c	Fix locking for `ColumnFamilyOptions::inplace_update_support` (#12624 ) Summary: In `SaveValue()`, the read lock needs to be obtained before `VerifyEntryChecksum()` because the KV checksum verification reads the entire value metadata+data, which is all mutable when `ColumnFamilyOptions::inplace_update_support == true`. In `MemTable::Update()`, the write lock needs to be obtained before mutating the value metadata (changing the value size) because it can be read concurrently. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12624 Test Plan: ``` $ make COMPILE_WITH_TSAN=1 -j56 db_stress ... $ python3 tools/db_crashtest.py blackbox --simple --max_key=10 --inplace_update_support=1 --interval=10 --allow_concurrent_memtable_write=0 ``` Reviewed By: cbi42 Differential Revision: D57034571 Pulled By: ajkr fbshipit-source-id: 3dddf881ad87923143acdf6bfec12ce47bb13a48	2024-05-08 08:30:12 -07:00
Andrew Kryczka	0fef690bd5	Sync non-latest WALs during flush for 2PC, single-CF DBs (#12622 ) Summary: Previously we skipped syncing the non-latest WALs during memtable flush when the DB had only one column family. Normally that is fine because those non-latest WALs would not be read by recovery. However, in case of `DBOptions::allow_2pc == true`, there could be unmatched prepare records in those WALs making them needed by recovery. As a result, the missing sync could have resulted in the recovered WAL state falling behind the recovered SST state. When we detect that case, we return a `Status::Corruption` saying "SST file is ahead of WALs". This PR proposes syncing the WAL in case of `DBOptions::allow_2pc`. This introduces the sync in some scenarios where it isn't needed (e.g., non-recent WALs contain no prepares) but I suspect the simplicity is worth it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12622 Reviewed By: cbi42 Differential Revision: D56987303 Pulled By: ajkr fbshipit-source-id: 7fe9395458018a18d77e907a3b5429065c0e2e48	2024-05-06 11:56:16 -07:00
Changyu Bi	6fdc4c5282	Fix a corruption bug in `CreateColumnFamilyWithImport()` (#12602 ) Summary: when importing files from multiple CFs into a new CF, we were reusing the epoch numbers assigned by the original CFs. This means L0 files in the new CF can have the same epoch number (assigned originally by different CFs). While CreateColumnFamilyWithImport() requires each original CF to have disjoint key range, after an intra-l0 compaction, we still can end up with L0 files with the same epoch number but overlapping key range. This PR attempt to fix this by reassigning epoch numbers when importing multiple CFs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12602 Test Plan: a new repro unit test. Before this PR, it fails with ``` [ RUN ] ImportColumnFamilyTest.AssignEpochNumberToMultipleCF db/import_column_family_test.cc:1048: Failure db_->WaitForCompact(o) Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 files of same epoch number but overlapping range https://github.com/facebook/rocksdb/issues/44 , smallest key: '6B6579303030303030' seq:511, type:1 , largest key: '6B6579303031303239' seq:510, type:1 , epoch number: 3 vs. file https://github.com/facebook/rocksdb/issues/36 , smallest key: '6B6579303030313030' seq:401, type:1 , largest key: '6B6579303030313939' seq:500, type:1 , epoch number: 3 ``` Reviewed By: hx235 Differential Revision: D56851808 Pulled By: cbi42 fbshipit-source-id: 01b8c790c9f1f2a168047ead670e73633f705b84	2024-05-06 11:01:38 -07:00
anand76	6349da612b	Update HISTORY.md and version to 9.3.0 (#12601 ) Summary: Update HISTORY.md for 9.2 and version to 9.3. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12601 Reviewed By: jaykorean, jowlyzhang Differential Revision: D56845901 Pulled By: anand1976 fbshipit-source-id: 0d1137a6568e4712be2f8b705f4f7b438217dbed	2024-05-01 16:33:04 -07:00
Yu Zhang	241253053a	Fix delete obsolete files on recovery not rate limited (#12590 ) Summary: This PR fix the issue that deletion of obsolete files during DB::Open are not rate limited. The root cause is slow deletion is disabled if trash/db size ratio exceeds the configured `max_trash_db_ratio` `d610e14f93/include/rocksdb/sst_file_manager.h (L126)` however, the current handling in DB::Open starts with tracking nothing but the obsolete files. This will make the ratio always look like it's 1. In order for the deletion rate limiting logic to work properly, we should only start deleting files after `SstFileManager` has finished tracking the whole DB, so the main fix is to move these two places that attempts to delete file after the tracking are done: 1) the `DeleteScheduler::CleanupDirectory` call in `SanitizeOptions`, 2) the `DB::DeleteObsoleteFiles` call. There are some other aesthetic changes like refactoring collecting all the DB paths into a function, rename `DBImp::DeleteUnreferencedSstFiles` to `DBImpl:: MaybeUpdateNextFileNumber` as it doesn't actually delete the files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12590 Test Plan: Added unit test and verified with manual testing Reviewed By: anand1976 Differential Revision: D56830519 Pulled By: jowlyzhang fbshipit-source-id: 8a38a21b1ea11c5371924f2b88663648f7a17885	2024-05-01 12:26:54 -07:00
Hui Xiao	abd6751aba	Fix wrong padded bytes being used to generate file checksum (#12598 ) Summary: Context/Summary: https://github.com/facebook/rocksdb/pull/12542 introduced a bug where wrong padded bytes used to generate file checksum if flush happens during padding. This PR fixed it along with an existing same bug for `perform_data_verification_=true`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12598 Test Plan: - New UT that failed before this fix (`db->VerifyFileChecksums: ...Corruption: ...file checksum mismatch`) and passes after - Benchmark ``` TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillseq[-X300] --num=100000 --block_align=1 --compression_type=none ``` Pre-PR: fillseq [AVG 300 runs] : 421334 (± 4126) ops/sec; 46.6 (± 0.5) MB/sec Post-PR: (no regression observed but a slight improvement) fillseq [AVG 300 runs] : 425768 (± 4309) ops/sec; 47.1 (± 0.5) MB/sec Reviewed By: ajkr, anand1976 Differential Revision: D56725688 Pulled By: hx235 fbshipit-source-id: c1a700a95def8c65c0a21e44f8c1966164925ad5	2024-04-30 15:38:53 -07:00
Andrew Kryczka	b2931a5c53	Fixed `MultiGet()` error handling to not skip blob dereference (#12597 ) Summary: See comment at top of the test case and release note. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12597 Reviewed By: jaykorean Differential Revision: D56718786 Pulled By: ajkr fbshipit-source-id: 8dce185bb0d24a358372fc2b553d181793fc335f	2024-04-29 14:18:42 -07:00
anand76	e36b0a2da4	Fix corruption bug when recycle_log_file_num changed from 0 (#12591 ) Summary: When `recycle_log_file_num` is changed from 0 to non-zero and the DB is reopened, any log files from the previous session that are still alive get reused. However, the WAL records in those files are not in the recyclable format. If one of those files is reused and is empty, a subsequent re-open, in `RecoverLogFiles`, can replay those records and insert stale data into the memtable. Another manifestation of this is an assertion failure `first_seqno_ == 0 \|\| s >= first_seqno_` in `rocksdb::MemTable::Add`. We could fix this by either 1) Writing a special record when reusing a log file, or 2) Implement more rigorous checking in `RecoverLogFiles` to ensure we don't replay stale records, or 3) Not reuse files created by a previous DB session. We choose option 3 as its the simplest, and flipping `recycle_log_file_num` is expected to be a rare event. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12591 Test Plan: 1. Add a unit test to verify the bug and fix Reviewed By: jowlyzhang Differential Revision: D56655812 Pulled By: anand1976 fbshipit-source-id: aa3a26b4a5e892d39a54b5a0658233cbebebac87	2024-04-29 12:25:00 -07:00

1 2 3

117 commits