Commit graph

117 commits

Author SHA1 Message Date
Jay Huh e9b15738d9 Honor ConfigOptions.ignore_unknown_options in ParseStruct() (#13152)
Summary:
`OptionTypeInfo::ParseStruct()` was not honoring `config_options.ignore_unknown_options` when unknown properties are found in the serialized string. This caused a compatibility issue in Remote Compaction. When the worker was updated with RocksDB 9.9, the remote worker started including a new table property, `newest_key_time` (added in PR https://github.com/facebook/rocksdb/issues/13083), in the compaction output files. However, parsing that table property in the serialized compaction result from the primary (running with `9.8`) was returning a non-ok status, even though `config_options.ignore_unknown_options` was `true`.

In this fix, we will ignore unused properties if `config_options.ignore_unknown_options` is set to true.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13152

Test Plan: Unit Test Added

Reviewed By: archang19

Differential Revision: D66374541

Pulled By: jaykorean

fbshipit-source-id: 78fd8309909279390438c247c4d390bbee4fa914
2024-11-22 16:38:05 -08:00
Yu Zhang bee8d5560e Start version 9.10.0 (#13146)
Summary:
Pull in HISTORY for 9.9.0, update version.h for next version, update check_format_compatible.sh, update git hash for folly

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13146

Test Plan: CI

Reviewed By: ltamasi

Differential Revision: D66142259

Pulled By: jowlyzhang

fbshipit-source-id: 90216b2d7cff2e0befb4f56567e3bd074f97c484
2024-11-18 21:49:47 -08:00
Changyu Bi 925435bbd9 Fix a bug that can retain old WAL longer than needed (#13127)
Summary:
The bug only happens for transaction db with 2pc. The main change is in `MemTableList::TryInstallMemtableFlushResults`. Before this fix, `memtables_to_flush` may not include all flushed memtables, and it causes the min_log_number for the flush to be incorrect. The code path for calculating min_log_number is `MemTableList::TryInstallMemtableFlushResults() -> GetDBRecoveryEditForObsoletingMemTables() -> PrecomputeMinLogNumberToKeep2PC() -> FindMinPrepLogReferencedByMemTable()`. Inside `FindMinPrepLogReferencedByMemTable()`, we need to exclude all memtables being flushed.

The PR also includes some documentation changes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13127

Test Plan: added a new unit that fails before this change.

Reviewed By: ltamasi

Differential Revision: D65679270

Pulled By: cbi42

fbshipit-source-id: 611f34bd6ef4cba51f8b54cb1be416887b5a9c5e
2024-11-11 14:19:45 -08:00
anand76 ee258619be Fix missing cases of corruption retries (#13122)
Summary:
This PR fixes a few cases where RocksDB was not retrying checksum failure/corruption of file reads with the `verify_and_reconstruct_read` IO option. After fixing these cases, we can almost always successfully open the DB and execute reads even if we see transient corruptions, provided the `FileSystem` supports the `verify_and_reconstruct_read` option. The specific cases fixed in this PR are -
1. CURRENT file
2. IDENTITY file
3. OPTIONS file
4. SST footer

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13122

Test Plan: Unit test in `db_io_failure_test.cc` that injects corruption at various stages of DB open and reads

Reviewed By: jaykorean

Differential Revision: D65617982

Pulled By: anand1976

fbshipit-source-id: 4324b88cc7eee5501ab5df20ef7a95bb12ed3ea7
2024-11-08 12:43:21 -08:00
Peter Dillinger 485ee4f45c Fix and test for leaks of open SST files (#13117)
Summary:
Follow-up to https://github.com/facebook/rocksdb/issues/13106 which revealed that some SST file readers (in addition to blob files) were being essentially leaked in TableCache (until DB::Close() time). Patched sources of leaks:
* Flush that is not committed (builder.cc)
* Various obsolete SST files picked up by directory scan but not caught by SubcompactionState::Cleanup() cleaning up from some failed compactions. Dozens of unit tests fail without the "backstop" TableCache::Evict() call in PurgeObsoleteFiles().

We also needed to adjust the check for leaks as follows:
* Ok if DB::Open never finished (see comment)
* Ok if deletions are disabled (see comment)
* Allow "quarantined" files to be in table_cache because (presumably) they might become live again.
* Get live files from all live Versions.

Suggested follow-up:
* Potentially delete more obsolete files sooner with a FIXME in db_impl_files.cc. This could potentially be high value because it seems to gate deletion of any/all newer obsolete files on all older compactions finishing.
* Try to catch obsolete files in more places using the VersionSet::obsolete_files_ pipeline rather than relying on them being picked up with directory scan, or deleting them outside of normal mechanisms.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13117

Test Plan: updated check used in most all unit tests in ASAN build

Reviewed By: hx235

Differential Revision: D65502988

Pulled By: pdillinger

fbshipit-source-id: aa0795a8a09d9ec578d25183fe43e2a35849209c
2024-11-08 10:54:43 -08:00
Yu Zhang 282f5a463b Fix write committed transactions replay when UDT setting toggles (#13121)
Summary:
This PR adds some missing pieces in order to handle UDT setting toggles while replay WALs for WriteCommitted transactions DB. Specifically, all the transaction markers for no op, prepare, commit, rollback are currently not carried over from the original WriteBatch to the new WriteBatch when there is a timestamp setting difference detected. This PR fills that gap.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13121

Test Plan: Added unit tests

Reviewed By: ltamasi

Differential Revision: D65558801

Pulled By: jowlyzhang

fbshipit-source-id: 8176882637b95f6dc0dad10d7fe21056fa5173d1
2024-11-06 17:32:03 -08:00
Peter Dillinger a28cc4a38c Fix a leak of open Blob files (#13106)
Summary:
An earlier change (b34cef57b7) removed apparently unused functionality where an obsolete blob file number is passed for removal from TableCache, which manages SST files. This was actually relying on broken/fragile abstractions wherein TableCache and BlobFileCache share the same Cache and using the TableCache interface to manipulate blob file caching. No unit test was actually checking for removal of obsolete blob files from the cache (which is somewhat tricky to check and a second order correctness requirement).

Here we fix the leak and add a DEBUG+ASAN-only check in DB::Close() that no obsolete files are lingering in the table/blob file cache.

Fixes https://github.com/facebook/rocksdb/issues/13066

Important follow-up (FIXME): The added check discovered some apparent cases of leaked (into table_cache) SST file readers that would stick around until DB::Close(). Need to enable that check, diagnose, and fix.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13106

Test Plan:
added a check that is called during DB::Close in ASAN builds (to minimize paying the cost in all unit tests). Without the fix, the check failed in at least these tests:

```
db_blob_basic_test DBBlobBasicTest.DynamicallyWarmCacheDuringFlush
db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadMerge
db_blob_compaction_test DBBlobCompactionTest.MergeBlobWithBase
db_blob_compaction_test DBBlobCompactionTest.CompactionDoNotFillCache
db_blob_compaction_test DBBlobCompactionTest.SkipUntilFilter
db_blob_compaction_test DBBlobCompactionTest.CompactionFilter
db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadFilter
db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadGarbageCollection
```

Reviewed By: ltamasi

Differential Revision: D65296123

Pulled By: pdillinger

fbshipit-source-id: 2276d76482beb2c75c9010bc1bec070bb23a24c0
2024-10-31 15:29:30 -07:00
Levi Tamasi ef535039f3 Call PrepareValue on the base iterator in BaseDeltaIterator (#13105)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13105

The `WriteBatchWithIndex::NewIteratorWithBase` interface enables creating a `BaseDeltaIterator` with an arbitrary base iterator passed in by the client, which has potentially been created with the `allow_unprepared_value` read option set. Because of this, `BaseDeltaIterator` has to call `PrepareValue` before using the `value()` or `columns()` from the base iterator. This includes both the case when `BaseDeltaIterator` exposes the `value()` and `columns()` of the base iterator as is and the case when the final `value()` / `columns()` is a result of merging key-values across the base and delta iterators. Note that `BaseDeltaIterator` itself does not support `allow_unprepared_value` yet; this will be implemented in an upcoming patch.

Reviewed By: jowlyzhang

Differential Revision: D65249643

fbshipit-source-id: b0a1ccc0dfd31105b2eef167b463ed15a8bb83b7
2024-10-31 14:20:33 -07:00
Peter Dillinger 9ad772e652 Start version 9.9.0 (#13093)
Summary:
Pull in HISTORY for 9.8.0, update version.h for next version, update check_format_compatible.sh

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13093

Test Plan: CI

Reviewed By: jowlyzhang

Differential Revision: D64987257

Pulled By: pdillinger

fbshipit-source-id: a7cec329e3d245e63767760aa0298c08c3281695
2024-10-25 13:47:29 -07:00
Peter Dillinger 3fd1f11d35 Fix race to make BlockBasedTableOptions effectively mutable (#13082)
Summary:
Fix a longstanding race condition in SetOptions for `block_based_table_factory` options. The fix is mostly described in new, unified `TableFactoryParseFn()` in `cf_options.cc`. Also in this PR:
* Adds a virtual `Clone()` function to TableFactory
* To avoid behavioral hiccups with `SetOptions`, make the "hidden state" of `BlockBasedTableFactory` shared between an original and a clone. For example, `TailPrefetchStats`
* `Configurable` was allowed to be copied but was not safe to do so, because the copy would have and use pointers into object it was copied from (!!!). This has been fixed using relative instead of absolute pointers, though it's still technically relying on undefined behavior (consistent object layout for non-standard-layout types).

For future follow-up:
* Deny SetOptions on block cache options (dubious and not yet made safe with proper shared_ptr handling)

Fixes https://github.com/facebook/rocksdb/issues/10079

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13082

Test Plan:
added to unit tests and crash test

Ran TSAN blackbox crashtest for hours with options to amplify potential race (see https://github.com/facebook/rocksdb/issues/10079)

Reviewed By: cbi42

Differential Revision: D64947243

Pulled By: pdillinger

fbshipit-source-id: 8390299149f50e2a2b39a5247680f2637edb23c8
2024-10-25 10:24:54 -07:00
Peter Dillinger dd23e84cad Re-implement GetApproximateMemTableStats for skip lists (#13047)
Summary:
GetApproximateMemTableStats() could return some bad results with the standard skip list memtable. See this new db_bench test showing the dismal distribution of results when the actual number of entries in range is 1000:

```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=1000
...
filluniquerandom :       1.391 micros/op 718915 ops/sec 1.391 seconds 1000000 operations;   11.7 MB/s
approximatememtablestats :       3.711 micros/op 269492 ops/sec 3.711 seconds 1000000 operations;
Reported entry count stats (expected 1000):
Count: 1000000 Average: 2344.1611  StdDev: 26587.27
Min: 0  Median: 965.8555  Max: 835273
Percentiles: P50: 965.86 P75: 1610.77 P99: 12618.01 P99.9: 74991.58 P99.99: 830970.97
------------------------------------------------------
[       0,       1 ]   131344  13.134%  13.134% ###
(       1,       2 ]      115   0.011%  13.146%
(       2,       3 ]      106   0.011%  13.157%
(       3,       4 ]      190   0.019%  13.176%
(       4,       6 ]      214   0.021%  13.197%
(       6,      10 ]      522   0.052%  13.249%
(      10,      15 ]      748   0.075%  13.324%
(      15,      22 ]     1002   0.100%  13.424%
(      22,      34 ]     1948   0.195%  13.619%
(      34,      51 ]     3067   0.307%  13.926%
(      51,      76 ]     4213   0.421%  14.347%
(      76,     110 ]     5721   0.572%  14.919%
(     110,     170 ]    11375   1.137%  16.056%
(     170,     250 ]    17928   1.793%  17.849%
(     250,     380 ]    36597   3.660%  21.509% #
(     380,     580 ]    77882   7.788%  29.297% ##
(     580,     870 ]   160193  16.019%  45.317% ###
(     870,    1300 ]   210098  21.010%  66.326% ####
(    1300,    1900 ]   167461  16.746%  83.072% ###
(    1900,    2900 ]    78678   7.868%  90.940% ##
(    2900,    4400 ]    47743   4.774%  95.715% #
(    4400,    6600 ]    17650   1.765%  97.480%
(    6600,    9900 ]    11895   1.190%  98.669%
(    9900,   14000 ]     4993   0.499%  99.168%
(   14000,   22000 ]     2384   0.238%  99.407%
(   22000,   33000 ]     1966   0.197%  99.603%
(   50000,   75000 ]     2968   0.297%  99.900%
(  570000,  860000 ]      999   0.100% 100.000%

readrandom   :       1.967 micros/op 508487 ops/sec 1.967 seconds 1000000 operations;    8.2 MB/s (1000000 of 1000000 found)
```

Perhaps the only good thing to say about the old implementation was that it was fast, though apparently not that fast.

I've implemented a much more robust and reasonably fast new version of the function. It's still logarithmic but with some larger constant factors. The standard deviation from true count is around 20% or less, and roughly the CPU cost of two memtable point look-ups. See code comments for detail.

```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=1000
...
filluniquerandom :       1.478 micros/op 676434 ops/sec 1.478 seconds 1000000 operations;   11.0 MB/s
approximatememtablestats :       2.694 micros/op 371157 ops/sec 2.694 seconds 1000000 operations;
Reported entry count stats (expected 1000):
Count: 1000000 Average: 1073.5158  StdDev: 197.80
Min: 608  Median: 1079.9506  Max: 2176
Percentiles: P50: 1079.95 P75: 1223.69 P99: 1852.36 P99.9: 1898.70 P99.99: 2176.00
------------------------------------------------------
(     580,     870 ]   134848  13.485%  13.485% ###
(     870,    1300 ]   747868  74.787%  88.272% ###############
(    1300,    1900 ]   116536  11.654%  99.925% ##
(    1900,    2900 ]      748   0.075% 100.000%

readrandom   :       1.997 micros/op 500654 ops/sec 1.997 seconds 1000000 operations;    8.1 MB/s (1000000 of 1000000 found)
```

We can already see that the distribution of results is dramatically better and wonderfully normal-looking, with relative standard deviation around 20%. The function is also FASTER, at least with these parameters. Let's look how this behavior generalizes, first *much* larger range:

```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=30000
filluniquerandom :       1.390 micros/op 719654 ops/sec 1.376 seconds 990000 operations;   11.7 MB/s
approximatememtablestats :       1.129 micros/op 885649 ops/sec 1.129 seconds 1000000 operations;
Reported entry count stats (expected 30000):
Count: 1000000 Average: 31098.8795  StdDev: 3601.47
Min: 21504  Median: 29333.9303  Max: 43008
Percentiles: P50: 29333.93 P75: 33018.00 P99: 43008.00 P99.9: 43008.00 P99.99: 43008.00
------------------------------------------------------
(   14000,   22000 ]      408   0.041%   0.041%
(   22000,   33000 ]   749327  74.933%  74.974% ###############
(   33000,   50000 ]   250265  25.027% 100.000% #####

readrandom   :       1.894 micros/op 528083 ops/sec 1.894 seconds 1000000 operations;    8.5 MB/s (989989 of 1000000 found)
```

This is *even faster* and relatively *more accurate*, with relative standard deviation closer to 10%. Code comments explain why. Now let's look at smaller ranges. Implementation quirks or conveniences:
* When actual number in range is >= 40, the minimum return value is 40.
* When the actual is <= 10, it is guaranteed to return that actual number.
```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=75
...
filluniquerandom :       1.417 micros/op 705668 ops/sec 1.417 seconds 999975 operations;   11.4 MB/s
approximatememtablestats :       3.342 micros/op 299197 ops/sec 3.342 seconds 1000000 operations;
Reported entry count stats (expected 75):
Count: 1000000 Average: 75.1210  StdDev: 15.02
Min: 40  Median: 71.9395  Max: 256
Percentiles: P50: 71.94 P75: 89.69 P99: 119.12 P99.9: 166.68 P99.99: 229.78
------------------------------------------------------
(      34,      51 ]    38867   3.887%   3.887% #
(      51,      76 ]   550554  55.055%  58.942% ###########
(      76,     110 ]   398854  39.885%  98.828% ########
(     110,     170 ]    11353   1.135%  99.963%
(     170,     250 ]      364   0.036%  99.999%
(     250,     380 ]        8   0.001% 100.000%

readrandom   :       1.861 micros/op 537224 ops/sec 1.861 seconds 1000000 operations;    8.7 MB/s (999974 of 1000000 found)

$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=25
...
filluniquerandom :       1.501 micros/op 666283 ops/sec 1.501 seconds 1000000 operations;   10.8 MB/s
approximatememtablestats :       5.118 micros/op 195401 ops/sec 5.118 seconds 1000000 operations;
Reported entry count stats (expected 25):
Count: 1000000 Average: 26.2392  StdDev: 4.58
Min: 25  Median: 28.4590  Max: 72
Percentiles: P50: 28.46 P75: 31.69 P99: 49.27 P99.9: 67.95 P99.99: 72.00
------------------------------------------------------
(      22,      34 ]   928936  92.894%  92.894% ###################
(      34,      51 ]    67960   6.796%  99.690% #
(      51,      76 ]     3104   0.310% 100.000%

readrandom   :       1.892 micros/op 528595 ops/sec 1.892 seconds 1000000 operations;    8.6 MB/s (1000000 of 1000000 found)

$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=10
...
filluniquerandom :       1.642 micros/op 608916 ops/sec 1.642 seconds 1000000 operations;    9.9 MB/s
approximatememtablestats :       3.042 micros/op 328721 ops/sec 3.042 seconds 1000000 operations;
Reported entry count stats (expected 10):
Count: 1000000 Average: 10.0000  StdDev: 0.00
Min: 10  Median: 10.0000  Max: 10
Percentiles: P50: 10.00 P75: 10.00 P99: 10.00 P99.9: 10.00 P99.99: 10.00
------------------------------------------------------
(       6,      10 ]  1000000 100.000% 100.000% ####################

readrandom   :       1.805 micros/op 554126 ops/sec 1.805 seconds 1000000 operations;    9.0 MB/s (1000000 of 1000000 found)
```

Remarkably consistent.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13047

Test Plan: new db_bench test for both performance and accuracy (see above); added to crash test; unit test updated.

Reviewed By: cbi42

Differential Revision: D63722003

Pulled By: pdillinger

fbshipit-source-id: cfc8613c085e87c17ecec22d82601aac2a5a1b26
2024-10-02 14:25:50 -07:00
Peter Dillinger 79790cf2a8 Bug fix and test BuildDBOptions (#13038)
Summary:
The following DBOptions were not being propagated through BuildDBOptions, which could at least lead to settings being lost through `GetOptionsFromString()`, possibly elsewhere as well:
* background_close_inactive_wals
* write_dbid_to_manifest
* write_identity_file
* prefix_seek_opt_in_only

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13038

Test Plan:
This problem was not being caught by
OptionsSettableTest.DBOptionsAllFieldsSettable when the option was omitted from both options_helper.cc and options_settable_test.cc. I have added to the test to catch future instances (and the updated test was how I found three of the four missing options).

The same kind of bug seems to be caught by
ColumnFamilyOptionsAllFieldsSettable, and AFAIK analogous code does not exist for BlockBasedTableOptions.

Reviewed By: ltamasi

Differential Revision: D63483779

Pulled By: pdillinger

fbshipit-source-id: a5d5f6e434174bacb8e5d251b767e81e62b7225a
2024-09-26 14:36:29 -07:00
anand76 22105366d9 More accurate accounting of compressed cache memory (#13032)
Summary:
When an item is inserted into the compressed secondary cache, this PR calculates the charge using the malloc_usable_size of the allocated memory, as well as the unique pointer allocation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13032

Test Plan: New unit test

Reviewed By: pdillinger

Differential Revision: D63418493

Pulled By: anand1976

fbshipit-source-id: 1db2835af6867442bb8cf6d9bf412e120ddd3824
2024-09-25 17:47:40 -07:00
anand76 d02f63cc54 Respect lowest_used_cache_tier option for compressed blocks (#13030)
Summary:
If the lowest_used_cache_tier DB option is set to kVolatileTier, skip insertion of compressed blocks into the secondary cache. Previously, these were always inserted into the secondary cache via the InsertSaved() method, leading to pollution of the secondary cache with blocks that would never be read.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13030

Test Plan: Add a new unit test

Reviewed By: pdillinger

Differential Revision: D63329841

Pulled By: anand1976

fbshipit-source-id: 14d2fce2ed309401d9ad4d2e7c356218b6673f7b
2024-09-25 11:45:51 -07:00
Levi Tamasi fbbb08770f Update HISTORY.md, version.h, and the format compatibility check script for the 9.7 release (#13027)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13027

Reviewed By: jowlyzhang

Differential Revision: D63158344

Pulled By: ltamasi

fbshipit-source-id: e650a0024155d52c7aa2afd0f242b8363071279a
2024-09-20 19:19:06 -07:00
Peter Dillinger a1a102ffce Steps toward deprecating implicit prefix seek, related fixes (#13026)
Summary:
With some new use cases onboarding to prefix extractors/seek/filters, one of the risks is existing iterator code, e.g. for maintenance tasks, being unintentionally subject to prefix seek semantics. This is a longstanding known design flaw with prefix seek, and `prefix_same_as_start` and `auto_prefix_mode` were steps in the direction of making that obsolete. However, we can't just immediately set `total_order_seek` to true by default, because that would impact so much code instantly.

Here we add a new DB option, `prefix_seek_opt_in_only` that basically allows users to transition to the future behavior when they are ready. When set to true, all iterators will be treated as if `total_order_seek=true` and then the only ways to get prefix seek semantics are with `prefix_same_as_start` or `auto_prefix_mode`.

Related fixes / changes:
* Make sure that `prefix_same_as_start` and `auto_prefix_mode` are compatible with (or override) `total_order_seek` (depending on your interpretation).
* Fix a bug in which a new iterator after dynamically changing the prefix extractor might mix different prefix semantics between memtable and SSTs. Both should use the latest extractor semantics, which means iterators ignoring memtable prefix filters with an old extractor. And that means passing the latest prefix extractor to new memtable iterators that might use prefix seek. (Without the fix, the test added for this fails in many ways.)

Suggested follow-up:
* Investigate a FIXME where a MergeIteratorBuilder is created in db_impl.cc. No unit test detects a change in value that should impact correctness.
* Make memtable prefix bloom compatible with `auto_prefix_mode`, which might require involving the memtablereps because we don't know at iterator creation time (only seek time) whether an auto_prefix_mode seek will be a prefix seek.
* Add `prefix_same_as_start` testing to db_stress

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13026

Test Plan:
tests updated, added. Add combination of `total_order_seek=true` and `auto_prefix_mode=true` to stress test. Ran `make blackbox_crash_test` for a long while.

Manually ran tests with `prefix_seek_opt_in_only=true` as default, looking for unexpected issues. I inspected most of the results and migrated many tests to be ready for such a change (but not all).

Reviewed By: ltamasi

Differential Revision: D63147378

Pulled By: pdillinger

fbshipit-source-id: 1f4477b730683d43b4be7e933338583702d3c25e
2024-09-20 15:54:19 -07:00
Changyu Bi f97e33454f Fix a bug with auto recovery on WAL write error (#12995)
Summary:
A recent crash test failure shows that auto recovery from WAL write failure can cause CFs to be inconsistent. A unit test repro in P1569398553. The following is an example sequence of events:

```
0. manual_wal_flush is true. There are multiple CFs in a DB.
1. Submit a write batch with updates to multiple CF
2. A FlushWAL or a memtable swtich that will try to write the buffered WAL data. Fail this write so that buffered WAL data is dropped: 4b1d595306/file/writable_file_writer.cc (L624)
The error needs to be retryable to start background auto recovery.
3. One CF successfully flushes its memtable during auto recovery.
4. Crash the process.
5. Reopen the DB, one CF will have the update as a result of successful flush. Other CFs will miss all the updates in the write batch since WAL does not have them.
```

This can happen if a users configures manual_wal_flush, uses more than one CF, and can hit retryable error for WAL writes. This PR is a short-term fix that upgrades WAL related errors to fatal and not trigger auto recovery.

A long-term fix may be not drop buffered WAL data by checking how much data is actually written, or require atomically flushing all column families during error recovery from this kind of errors.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12995

Test Plan:
added unit test to check error severity and if recovery is triggered. A crash test repro command that fails in a few runs before this PR:
```
python3 ./tools/db_crashtest.py blackbox --interval=60 --metadata_write_fault_one_in=1000 --column_families=10 --exclude_wal_from_write_fault_injection=0 --manual_wal_flush_one_in=1000 --WAL_size_limit_MB=10240 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=1 --allow_data_in_errors=True --allow_fallocate=1 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=100 --block_align=1 --block_protection_bytes_per_key=0 --block_size=16384 --bloom_before_level=2147483647 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_readahead_size=1048576 --compaction_ttl=0 --compress_format_version=1 --compressed_secondary_cache_size=8388608 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=4 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0  --db_write_buffer_size=0 --decouple_partitioned_filters=1 --default_temperature=kCold --default_write_temperature=kWarm --delete_obsolete_files_period_micros=30000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=1000000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=1 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=0 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=1 --error_recovery_with_no_fault_injection=1 --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --fill_cache=1 --flush_one_in=1000000 --format_version=6 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944  --index_block_restart_interval=4 --index_shortening=1 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=10000 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=0 --lowest_used_cache_tier=2 --manifest_preallocation_size=5120 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=16 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=2097152 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=0 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=100 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --optimize_filters_for_hits=0 --optimize_filters_for_memory=0 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --paranoid_memory_checks=0 --partition_filters=0 --partition_pinning=2 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=524288 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=10000 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=10000 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=1048576 --sqfc_name=bar --sqfc_version=1 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=2 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=6 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --uncache_aggressiveness=8 --universal_max_read_amp=-1 --unpartitioned_pinning=2 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=0 --use_attribute_group=1 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_cf_iterator=1 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=1 --use_sqfc_for_range_queries=0 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=1 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=4194304 --write_dbid_to_manifest=0 --write_fault_one_in=50 --writepercent=35 --ops_per_thread=100000 --preserve_unverified_changes=1
```

Reviewed By: hx235

Differential Revision: D62888510

Pulled By: cbi42

fbshipit-source-id: 308bdbbb8d897cc8eba950155cd0e37cf7eb76fe
2024-09-17 14:10:33 -07:00
Changyu Bi e490f2b051 Fix a bug in ReFitLevel() where FileMetaData::being_compacted is not cleared (#13009)
Summary:
in ReFitLevel(), we were not setting being_compacted to false after ReFitLevel() is done. This is not a issue if refit level is successful, since new FileMetaData is created for files at the target level. However, if there's an error during RefitLevel(), e.g., Manifest write failure, we should clear the being_compacted field for these files. Otherwise, these files will not be picked for compaction until db reopen.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13009

Test Plan:
existing test.
- stress test failure in T200339331 should not happen anymore.

Reviewed By: hx235

Differential Revision: D62597169

Pulled By: cbi42

fbshipit-source-id: 0ba659806da6d6d4b42384fc95268b2d7bad720e
2024-09-12 15:19:14 -07:00
Jay Huh 5b8f5cbcf4 Update main branch for 9.6 release (#12945)
Summary:
Main branch cut at defd97bc9.
Updated HISTORY.md, version and format compatibility test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12945

Test Plan: CI

Reviewed By: jowlyzhang

Differential Revision: D61482149

Pulled By: jaykorean

fbshipit-source-id: 4edf7c0a8c6e4df8fcc938bc778dfd02981d0c55
2024-08-19 17:36:23 -07:00
Jay Huh b99baa5ae7 Do not add unprep_seqs when WriteImpl() fails in unprepared txn (#12927)
Summary:
With the following scenario, we see assertion failure in write_unprepared_txn

1. The write is the first one in the transaction (`log_number_` is not set yet - https://github.com/facebook/rocksdb/blob/main/utilities/transactions/write_unprepared_txn.cc#L376-L379)
2. `WriteToWAL()` fails inside `db_impl_->WriteImpl()` due to fault injection. `last_log_number_`is 0. https://github.com/facebook/rocksdb/blob/main/utilities/transactions/write_unprepared_txn.cc#L386
3. `prepare_batch_cnt_` is still added to `unprep_seqs_` https://github.com/facebook/rocksdb/blob/main/utilities/transactions/write_unprepared_txn.cc#L395-L398
4. When the transaction gets destructed after failed, it expects `log_number_ > 0` if `unprep_seqs_` is not empty - https://github.com/facebook/rocksdb/blob/main/utilities/transactions/write_unprepared_txn.cc#L54-L55.

Step 3 is wrong. `unprep_seqs_` is for the ordered list of unprep sequence numbers that we have already written to the DB. If `db_impl_->WriteImpl()` failed, `unprep_seqs_` shouldn't be set. `prepare_seq` value would be wrong anyway.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12927

Test Plan:
Following command steadily repros the issue. This no longer fails after the fix
```
./db_stress \
  --WAL_size_limit_MB=0 \
  --WAL_ttl_seconds=60 \
  --acquire_snapshot_one_in=10000 \
  --adaptive_readahead=1 \
  --adm_policy=2 \
  --advise_random_on_open=0 \
  --allow_data_in_errors=True \
  --allow_fallocate=1 \
  --async_io=1 \
  --auto_readahead_size=0 \
  --avoid_flush_during_recovery=0 \
  --avoid_flush_during_shutdown=1 \
  --avoid_unnecessary_blocking_io=0 \
  --backup_max_size=104857600 \
  --backup_one_in=100000 \
  --batch_protection_bytes_per_key=8 \
  --bgerror_resume_retry_interval=100 \
  --block_align=1 \
  --block_protection_bytes_per_key=8 \
  --block_size=16384 \
  --bloom_before_level=7 \
  --bloom_bits=4.1280979878467345 \
  --bottommost_compression_type=none \
  --bottommost_file_compaction_delay=600 \
  --bytes_per_sync=262144 \
  --cache_index_and_filter_blocks=1 \
  --cache_index_and_filter_blocks_with_high_priority=0 \
  --cache_size=33554432 \
  --cache_type=lru_cache \
  --charge_compression_dictionary_building_buffer=1 \
  --charge_file_metadata=1 \
  --charge_filter_construction=1 \
  --charge_table_reader=1 \
  --check_multiget_consistency=0 \
  --check_multiget_entity_consistency=1 \
  --checkpoint_one_in=0 \
  --checksum_type=kXXH3 \
  --clear_column_family_one_in=0 \
  --compact_files_one_in=1000000 \
  --compact_range_one_in=1000000 \
  --compaction_pri=4 \
  --compaction_readahead_size=1048576 \
  --compaction_ttl=0 \
  --compress_format_version=1 \
  --compressed_secondary_cache_ratio=0.0 \
  --compressed_secondary_cache_size=0 \
  --compression_checksum=1 \
  --compression_max_dict_buffer_bytes=4294967295 \
  --compression_max_dict_bytes=16384 \
  --compression_parallel_threads=8 \
  --compression_type=none \
  --compression_use_zstd_dict_trainer=1 \
  --compression_zstd_max_train_bytes=0 \
  --continuous_verification_interval=0 \
  --create_timestamped_snapshot_one_in=0 \
  --daily_offpeak_time_utc=04:00-08:00 \
  --data_block_index_type=0 \
  --db=$db \
  --db_write_buffer_size=8388608 \
  --default_temperature=kHot \
  --default_write_temperature=kUnknown \
  --delete_obsolete_files_period_micros=21600000000 \
  --delpercent=5 \
  --delrangepercent=0 \
  --destroy_db_initially=0 \
  --detect_filter_construct_corruption=0 \
  --disable_file_deletions_one_in=10000 \
  --disable_manual_compaction_one_in=10000 \
  --disable_wal=0 \
  --dump_malloc_stats=1 \
  --enable_checksum_handoff=1 \
  --enable_compaction_filter=0 \
  --enable_custom_split_merge=1 \
  --enable_do_not_compress_roles=0 \
  --enable_index_compression=1 \
  --enable_memtable_insert_with_hint_prefix_extractor=0 \
  --enable_pipelined_write=0 \
  --enable_sst_partitioner_factory=0 \
  --enable_thread_tracking=1 \
  --enable_write_thread_adaptive_yield=1 \
  --error_recovery_with_no_fault_injection=1 \
  --exclude_wal_from_write_fault_injection=0 \
  --expected_values_dir=$exp \
  --fail_if_options_file_error=0 \
  --fifo_allow_compaction=1 \
  --file_checksum_impl=crc32c \
  --fill_cache=1 \
  --flush_one_in=1000 \
  --format_version=2 \
  --get_all_column_family_metadata_one_in=1000000 \
  --get_current_wal_file_one_in=0 \
  --get_live_files_apis_one_in=1000000 \
  --get_properties_of_all_tables_one_in=100000 \
  --get_property_one_in=1000000 \
  --get_sorted_wal_files_one_in=0 \
  --hard_pending_compaction_bytes_limit=274877906944 \
  --high_pri_pool_ratio=0 \
  --index_block_restart_interval=15 \
  --index_shortening=0 \
  --index_type=2 \
  --ingest_external_file_one_in=0 \
  --initial_auto_readahead_size=524288 \
  --inplace_update_support=0 \
  --iterpercent=10 \
  --key_len_percent_dist=1,30,69 \
  --key_may_exist_one_in=100000 \
  --kill_random_test=888887 \
  --last_level_temperature=kCold \
  --level_compaction_dynamic_level_bytes=1 \
  --lock_wal_one_in=1000000 \
  --log2_keys_per_lock=10 \
  --log_file_time_to_roll=0 \
  --log_readahead_size=0 \
  --long_running_snapshots=1 \
  --low_pri_pool_ratio=0 \
  --lowest_used_cache_tier=1 \
  --manifest_preallocation_size=5120 \
  --manual_wal_flush_one_in=0 \
  --mark_for_compaction_one_file_in=10 \
  --max_auto_readahead_size=0 \
  --max_background_compactions=20 \
  --max_bytes_for_level_base=10485760 \
  --max_key=100000 \
  --max_key_len=3 \
  --max_log_file_size=1048576 \
  --max_manifest_file_size=1073741824 \
  --max_sequential_skip_in_iterations=2 \
  --max_total_wal_size=0 \
  --max_write_batch_group_size_bytes=16 \
  --max_write_buffer_number=10 \
  --max_write_buffer_size_to_maintain=0 \
  --memtable_insert_hint_per_batch=0 \
  --memtable_max_range_deletions=1000 \
  --memtable_prefix_bloom_size_ratio=0.1 \
  --memtable_protection_bytes_per_key=8 \
  --memtable_whole_key_filtering=1 \
  --memtablerep=skip_list \
  --metadata_charge_policy=0 \
  --metadata_read_fault_one_in=32 \
  --metadata_write_fault_one_in=0 \
  --min_write_buffer_number_to_merge=2 \
  --mmap_read=1 \
  --mock_direct_io=False \
  --nooverwritepercent=1 \
  --num_file_reads_for_auto_readahead=2 \
  --open_files=-1 \
  --open_metadata_read_fault_one_in=0 \
  --open_metadata_write_fault_one_in=0 \
  --open_read_fault_one_in=0 \
  --open_write_fault_one_in=0 \
  --ops_per_thread=20000000 \
  --optimize_filters_for_hits=1 \
  --optimize_filters_for_memory=0 \
  --optimize_multiget_for_io=0 \
  --paranoid_file_checks=1 \
  --partition_filters=0 \
  --partition_pinning=0 \
  --pause_background_one_in=1000000 \
  --periodic_compaction_seconds=1 \
  --prefix_size=5 \
  --prefixpercent=5 \
  --prepopulate_block_cache=0 \
  --preserve_internal_time_seconds=60 \
  --progress_reports=0 \
  --promote_l0_one_in=0 \
  --read_amp_bytes_per_bit=32 \
  --read_fault_one_in=1000 \
  --readahead_size=0 \
  --readpercent=45 \
  --recycle_log_file_num=1 \
  --reopen=20 \
  --report_bg_io_stats=1 \
  --reset_stats_one_in=10000 \
  --sample_for_compression=0 \
  --secondary_cache_fault_one_in=0 \
  --sync=0 \
  --sync_fault_injection=0 \
  --table_cache_numshardbits=6 \
  --target_file_size_base=524288 \
  --target_file_size_multiplier=2 \
  --test_batches_snapshots=0 \
  --top_level_index_pinning=0 \
  --txn_write_policy=2 \
  --uncache_aggressiveness=126 \
  --universal_max_read_amp=4 \
  --unordered_write=0 \
  --unpartitioned_pinning=1 \
  --use_adaptive_mutex=0 \
  --use_adaptive_mutex_lru=1 \
  --use_attribute_group=1 \
  --use_delta_encoding=1 \
  --use_direct_io_for_flush_and_compaction=0 \
  --use_direct_reads=0 \
  --use_full_merge_v1=0 \
  --use_get_entity=0 \
  --use_merge=1 \
  --use_multi_cf_iterator=0 \
  --use_multi_get_entity=0 \
  --use_multiget=1 \
  --use_optimistic_txn=0 \
  --use_put_entity_one_in=0 \
  --use_timed_put_one_in=0 \
  --use_txn=1 \
  --use_write_buffer_manager=0 \
  --user_timestamp_size=0 \
  --value_size_mult=32 \
  --verification_only=0 \
  --verify_checksum=1 \
  --verify_checksum_one_in=1000000 \
  --verify_compression=0 \
  --verify_db_one_in=10000 \
  --verify_file_checksums_one_in=1000000 \
  --verify_iterator_with_expected_state_one_in=5 \
  --verify_sst_unique_id_in_manifest=1 \
  --wal_bytes_per_sync=0 \
  --wal_compression=none \
  --write_buffer_size=4194304 \
  --write_dbid_to_manifest=1 \
  --write_fault_one_in=128 \
  --writepercent=35
```

Reviewed By: cbi42

Differential Revision: D61048774

Pulled By: jaykorean

fbshipit-source-id: 22200d55fd0b22b68732b12516e681a6c6e2c601
2024-08-15 09:16:29 -07:00
Changyu Bi b32d899482 Fix MultiGet dropping memtable kv checksum corruption (#12842)
Summary:
Corruption status returned by `GetFromTable()` could be overwritten here: b6c3495a71/db/version_set.cc (L2614)

This PR fixes this issue by setting `*(s->found_final_value) = true;` in SaveValue. Also makes the handling of the return value of `GetFromTable()` more robust and added asserts in a couple places.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12842

Test Plan: Updated an existing unit test to cover MultiGet. It fails the assertion here before this PR: b6c3495a71/db/version_set.cc (L2601)

Reviewed By: anand1976

Differential Revision: D59498203

Pulled By: cbi42

fbshipit-source-id: 1f071c1b2c5b66fb71264b547a9e670d1cf592f0
2024-08-08 13:34:11 -07:00
Yu Zhang d12aaf23ca Fix file deletions in DestroyDB not rate limited (#12891)
Summary:
Make `DestroyDB` slowly delete files if it's configured and enabled via `SstFileManager`.

It's currently not available mainly because of DeleteScheduler's logic related to tracked total_size_ and total_trash_size_. These accounting and logic should not be applied to `DestroyDB`. This PR adds a `DeleteUnaccountedDBFile` util for this purpose which deletes files without accounting it.  This util also supports assigning a file to a specified trash bucket so that user can later wait for a specific trash bucket to be empty. For `DestroyDB`, files with more than 1 hard links will be deleted immediately.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12891

Test Plan: Added unit tests, existing tests.

Reviewed By: anand1976

Differential Revision: D60300220

Pulled By: jowlyzhang

fbshipit-source-id: 8b18109a177a3a9532f6dc2e40e08310c08ca3c7
2024-08-02 19:31:55 -07:00
Levi Tamasi 2e8a1a14ef Fix a data race affecting the background error status (#12910)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12910

There is currently a call to `GetBGError()` in `DBImpl::WriteImplWALOnly()` where the DB mutex is (incorrectly) not held, leading to a data race. Technically, we could acquire the mutex here but instead, the patch removes the affected check altogether, since the same check is already performed (in a thread-safe manner) in the subsequent call to `PreprocessWrite()`.

Reviewed By: cbi42

Differential Revision: D60682008

fbshipit-source-id: 54b67975dcf57d67c068cac71e8ada09a1793ec5
2024-08-02 14:11:08 -07:00
anand76 55877d8893 Make transaction name conflict check more robust (#12895)
Summary:
The `PessimisticTransaction::SetName()` code checks for an existing txn of the given name before registering the new txn. However, this is not atomic, which could result in a race condition if two txns try to register with the same name. Both might succeed and lead to unpredictable behavior. This PR makes the test and set atomic.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12895

Reviewed By: pdillinger

Differential Revision: D60460482

Pulled By: anand1976

fbshipit-source-id: e8afeb2356e1b8f4e8df785cb73532739f82579d
2024-07-30 12:31:02 -07:00
Yu Zhang d94c2adc28 Add entry for bug fix in #12882 (#12892)
Summary:
As titled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12892

Reviewed By: hx235

Differential Revision: D60400651

Pulled By: jowlyzhang

fbshipit-source-id: 2dd60c2287143f464ecab0de859715af6ab3a825
2024-07-29 12:20:50 -07:00
Hui Xiao 15d9988ab2 Update history and version for 9.5.fb release (#12880)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12880

Reviewed By: jaykorean, jowlyzhang

Differential Revision: D60057955

Pulled By: hx235

fbshipit-source-id: 1c599a5334aff1f424bb473275efe4349b17d41d
2024-07-22 13:15:09 -07:00
Hui Xiao 349b1ec08f Fix duplicate WAL entries caused by write after error recovery (#12873)
Summary:
**Context/Summary:**
We recently discovered a case where write of the same key right after error recovery of a previous failed write of the same key finishes causes two same WAL entries, violating our assertion. This is because we don't advance seqno on failed write and reuse the same WAL containing the failed write for the new write if the memtable at the time is empty.

This PR reuses the flush path for an empty memtable to switch WAL and update min WAL to keep in error recovery flush
 as well as updates the INFO log message for clarity.

```
2024/07/17-15:01:32.271789 327757 (Original Log Time 2024/07/17-15:01:25.942234) [/flush_job.cc:1017] [default] [JOB 2] Level-0 flush table https://github.com/facebook/rocksdb/issues/9: 0 bytes OK It's an empty SST file from a successful flush so won't be kept in the DB
2024/07/17-15:01:32.271798 327757 (Original Log Time 2024/07/17-15:01:32.269954) [/memtable_list.cc:560] [default] Level-0 commit flush result of table https://github.com/facebook/rocksdb/issues/9 started
2024/07/17-15:01:32.271802 327757 (Original Log Time 2024/07/17-15:01:32.271217) [/memtable_list.cc:760] [default] Level-0 commit flush result of table https://github.com/facebook/rocksdb/issues/9: memtable https://github.com/facebook/rocksdb/issues/1 done
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12873

Test Plan:
New UT that failed before this PR with following assertion failure (i.e, duplicate WAL entries) and passes after
```
db_wal_test: db/write_batch.cc:2254: rocksdb::Status rocksdb::{anonymous}::MemTableInserter::PutCFImpl(uint32_t, const rocksdb::Slice&, const rocksdb::Slice&, rocksdb::ValueType, RebuildTxnOp, const ProtectionInfoKVOS64*) [with RebuildTxnOp = rocksdb::{anonymous}::MemTableInserter::PutCF(uint32_t, const rocksdb::Slice&, const rocksdb::Slice&)::<lambda(rocksdb::WriteBatch*, uint32_t, const rocksdb::Slice&, const rocksdb::Slice&)>; uint32_t = unsigned int; rocksdb::ProtectionInfoKVOS64 = rocksdb::ProtectionInfoKVOS<long unsigned int>]: Assertion `seq_per_batch_' failed.
```

Reviewed By: anand1976

Differential Revision: D59884468

Pulled By: hx235

fbshipit-source-id: 5d854b719092552c69727a979f269fb7f6c39756
2024-07-22 12:40:25 -07:00
anand76 0fca5e31b4 Fix race between manifest error recovery and file ingestion (#12871)
Summary:
This PR fixes an assertion failure in `DBImpl::ResumeImpl` - `assert(!versions_->descriptor_log_)`. In `VersionSet`, `descriptor_log_` has a pointer to the current MANIFEST writer. When there's an error updating the manifest, `descriptor_log_` is reset, and the error recovery thread checks `io_status()` in `VersionSet` and attempts to write a new MANIFEST. If another DB manipulation happens at the same time (like external file ingestion, column family manipulation etc), it calls `LogAndApply`, which also attempts to write a new MANIFEST. The assertion in `ResumeImpl` might fail in this case since the other MANIFEST writer may have updated `descriptor_log_`. To prevent the assertion, this fix updates both `io_status_` and `descriptor_log_` while holding the DB mutex.

The other option would have been to simply remove the assert. But I think its important to have it to ensure the invariant that `io_status_` is cleared if the MANIFEST is written successfully, and this fix makes things easier to reason about.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12871

Test Plan: Existing tests and crash test

Reviewed By: hx235

Differential Revision: D59926947

Pulled By: anand1976

fbshipit-source-id: af9ad18da3e29fc62c7ec2e30e0738aa33d4e5f1
2024-07-19 10:37:51 -07:00
Peter Dillinger 93b163d1a2 Fix major bug with prefixes, SeekForPrev, and partitioned filters (#12872)
Summary:
Basically, the fix in https://github.com/facebook/rocksdb/issues/8137 was incomplete (and I missed it in the review), because if `whole_key_filtering` is false, then `last_prefix_str_` will never be set to non-empty and the fix doesn't work. Also related to https://github.com/facebook/rocksdb/issues/5835.

This is intended as a safe, simple fix that will regress CPU efficiency slightly (for `whole_key_filtering=false` cases, because of extra prefix string copies during flush & compaction). An efficient fix is not possible without some substantial refactoring.

Also in this PR: new test DBBloomFilterTest.FilterNumEntriesCoalesce tests an adjacent code path that was previously untested for its effect of ensuring the number of unique prefixes and keys is tracked properly when both prefixes and whole keys are going into a filter. (Test fails when either of the two code segments checking for duplicates is disabled.) In addition, the same test would fail before the main bug fix here because the code would inappropriately add the empty string to the filter (because of unmodified `last_prefix_str_`).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12872

Test Plan: In addition to DBBloomFilterTest.FilterNumEntriesCoalesce, extended DBBloomFilterTest.SeekForPrevWithPartitionedFilters to cover the broken case. (Mostly whitespace change.)

Reviewed By: jowlyzhang

Differential Revision: D59873793

Pulled By: pdillinger

fbshipit-source-id: 2a7b7f09ca73dc188fb4dab833826ad6da7ebb11
2024-07-17 14:08:35 -07:00
anand76 5aa675457e Fix unhandled MANIFEST write errors (#12865)
Summary:
The failure of `WriteCurrentStateToManifest()` in `VersionSet::ProcessManifestWrites()` was not handled properly. If it failed, `manifest_io_status` was not updated, leading to `manifest_file_number_` being updated to the newly created manifest even though its bad. This would lead to the bad manifest immediately getting deleted, and also the good manifest (referenced by `CURRENT`) getting deleted by obsolete file deletion because of `manifest_file_number_` not referencing its number.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12865

Reviewed By: hx235

Differential Revision: D59782940

Pulled By: anand1976

fbshipit-source-id: f752fb9a1c23fd3d734616e273613cbac204301b
2024-07-15 19:13:29 -07:00
Hui Xiao 4ff35afb42 Fix a bug where OnErrorRecoveryBegin() is not called before auto-recovery (#12860)
Summary:
**Context/Summary:**
`*auto_recovery` needs to be set true in order for `OnErrorRecoveryBegin()` to be called before auto-recovery
3db030d7ee/db/event_helpers.cc (L64-L66)
Currently it's set false for auto-recovery. This PR fixes it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12860

Test Plan:
- Manual observation that it is called
- Existing UT

Reviewed By: jowlyzhang

Differential Revision: D59693315

Pulled By: hx235

fbshipit-source-id: 3f428c5b1e9818bb7697fdcd7f245d11378eb14a
2024-07-15 17:00:14 -07:00
Yu Zhang ff204d8ecd Add entry for #12803: fix race between event listener and error handler (#12809)
Summary:
As titled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12809

Reviewed By: hx235

Differential Revision: D58974154

Pulled By: jowlyzhang

fbshipit-source-id: 7e44b54d9fa3bfbd58a4154a2c7e91aec905c34b
2024-06-24 17:01:41 -07:00
Changyu Bi 748f74aca3 Update main branch for 9.4 release (#12802)
Summary:
Main branch cut at e90e9153d5.
Updated HISTORY.md, version and format compatibility test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12802

Reviewed By: ajkr

Differential Revision: D58956464

Pulled By: cbi42

fbshipit-source-id: 50d786c145cebf93d1dd554b1b0e26baac3cc88c
2024-06-24 11:53:05 -07:00
Peter Dillinger 3abcba8470 Propagate more ReadOptions to ApproximateOffsetOf/Sizes (#12764)
Summary:
Unknown why these would ignore options like deadline and read_tier. Setting total_order_seek=true is unnecessary because of the disable_prefix_seek (= true) parameter to NewIndexIterator. This is only used by the hash index, which uses total order seek if either the ReadOption or the parameter is true. The parameter is arguably redundant with the total_order_seek option, meaning it could be eliminated, but I think this case is exceptional (compared to e.g. no_io):
* Prefix seek is particular to user iterators, though might be usable, carefully, for other read operations.
* The historical default of total_order_seek=false in a sense is "wrong result by default" so cannot be interpreted as an intent to force prefix seek in an operation for which it might be usual or give bad results.

Also added a generic release note to cover this and related PRs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12764

Test Plan: existing tests

Reviewed By: hx235

Differential Revision: D58474240

Pulled By: pdillinger

fbshipit-source-id: 79014d9822ba8f09d57ce4524363aa0973017b68
2024-06-12 16:25:47 -07:00
Peter Dillinger 98393f0139 Fix Checkpoint hard link of inactive but unsynced WAL (#12731)
Summary:
Background: there is one active WAL file but there can be
several more WAL files in various states. Those other WALs are always
in a "flushed" state but could be on the `logs_` list not yet fully
synced. We currently allow any WAL that is not the active WAL to be
hard-linked when creating a Checkpoint, as although it might still be
open for write, we are not appending any more data to it.

The problem is that a created Checkpoint is supposed to be fully synced
on return of that function, and a hard-linked WAL in the state described
above might not be fully synced. (Through some prudence in https://github.com/facebook/rocksdb/issues/10083,
it would synced if using track_and_verify_wals_in_manifest=true.)

The fix is a step toward a long term goal of removing the need to query
the filesystem to determine WAL files and their state. (I consider it
dubious any time we independently read from or query metadata from a
file we have open for writing, as this makes us more susceptible to
FileSystem deficiencies or races.) More specifically:
* Detect which WALs might not be fully synced, according to our DBImpl
  metadata, and prevent hard linking those (with `trim_to_size=true`
  from `GetLiveFilesStorageInfo()`. And while we're at it, use our known
  flushed sizes for those WALs.
* To avoid a race between that and GetSortedWalFiles(), track a maximum
  needed WAL number for the Checkpoint/GetLiveFilesStorageInfo.
* Because of the level of consistency provided by those two, we no
  longer need to consider syncing as part of the FlushWAL in
  GetLiveFilesStorageInfo. (We determine the max WAL number consistent
  with the manifest file size, while holding DB mutex. Should make
  track_and_verify_wals_in_manifest happy.) This makes the premise of
  test PutRaceWithCheckpointTrackedWalSync obsolete (sync point callback
  no longer hit) so the test is removed, with crash test as backstop for
  related issues. See https://github.com/facebook/rocksdb/issues/10185

Stacked on https://github.com/facebook/rocksdb/issues/12729

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12731

Test Plan:
Expanded an existing test, which now fails before fix.
Also long runs of blackbox_crash_test with amplified checkpoint frequency.

Reviewed By: cbi42

Differential Revision: D58199629

Pulled By: pdillinger

fbshipit-source-id: 376e55f4a2b082cd2adb6408a41209de14422382
2024-06-05 17:40:09 -07:00
Levi Tamasi b9e82f5162 Mention two fixes (#12677 and #12681) in the changelog (#12730)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12730

Reviewed By: jowlyzhang

Differential Revision: D58092079

fbshipit-source-id: 708437e705f9d9f770c9ba1a1a9b3c369b0a4b79
2024-06-03 10:50:41 -07:00
Andrew Kryczka c3ae569792 Update the main branch for the 9.3 release (#12726)
Summary:
Cut the 9.3.fb branch as of 5/17 11:59pm. Also, cherry-picked all bug fixes that have happened since then. Removed their files from unreleased_history/ since those fixes will appear in 9.3.0, so there seems no use repeating them in any later release.

Release branch: https://github.com/facebook/rocksdb/tree/9.3.fb
Tests: https://github.com/facebook/rocksdb/actions/runs/9342097111

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12726

Reviewed By: ltamasi

Differential Revision: D58069263

Pulled By: ajkr

fbshipit-source-id: c4f557bc8dbc20ce53021ac7e97a24f930542bf9
2024-06-02 22:10:24 -07:00
anand76 0ae3d9f98d Fix stale memory access with FSBuffer and tiered sec cache (#12712)
Summary:
A `BlockBasedTable` with `TieredSecondaryCache` containing a NVM cache inserts blocks  into the compressed cache and the corresponding compressed block into the NVM cache.  The `BlockFetcher` is used to get the uncompressed and compressed blocks by calling `ReadBlockContents()` and `GetUncompressedBlock()` respectively. If the file system supports FSBuffer (i.e returning a FS allocated buffer rather than caller provided), that buffer gets freed between the two calls. This PR fixes it by making the FSBuffer unique pointer a member rather than local variable.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12712

Test Plan:
1. Add a unit test
2. Release validation stress test

Reviewed By: jaykorean

Differential Revision: D57974026

Pulled By: anand1976

fbshipit-source-id: cfa895914e74b4f628413b40e6e39d8d8e5286bd
2024-05-30 12:33:58 -07:00
Andrew Kryczka c72ee4531b Fix recycled WAL detection when wal_compression is enabled (#12643)
Summary:
I think the point of the `if (end_of_buffer_offset_ - buffer_.size() == 0)` was to only set `recycled_` when the first record was read. However, the condition was false when reading the first record when the WAL began with a  `kSetCompressionType` record because we had already dropped the `kSetCompressionType` record from `buffer_`. To fix this, I used `first_record_read_` instead.

Also, it was pretty confusing to treat the WAL as non-recycled when a recyclable record first appeared in a non-first record. I changed it to return an error if that happens.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12643

Reviewed By: hx235

Differential Revision: D57238099

Pulled By: ajkr

fbshipit-source-id: e20a2a0c9cf0c9510a7b6af463650a05d559239e
2024-05-22 15:34:37 -07:00
Peter Dillinger d89ab23bec Disallow memtable flush and sst ingest while WAL is locked (#12652)
Summary:
We recently noticed that some memtable flushed and file
ingestions could proceed during LockWAL, in violation of its stated
contract. (Note: we aren't 100% sure its actually needed by MySQL, but
we want it to be in a clean state nonetheless.)

Despite earlier skepticism that this could be done safely (https://github.com/facebook/rocksdb/issues/12666), I
found a place to wait to wait for LockWAL to be cleared before allowing
these operations to proceed: WaitForPendingWrites()

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12652

Test Plan:
Added to unit tests. Extended how db_stress validates LockWAL
and re-enabled combination of ingestion and LockWAL in crash test, in
follow-up to https://github.com/facebook/rocksdb/issues/12642

Ran blackbox_crash_test for a long while with relevant features
amplified.

Suggested follow-up: fix FaultInjectionTestFS to report file sizes
consistent with what the user has requested to be flushed.

Reviewed By: jowlyzhang

Differential Revision: D57622142

Pulled By: pdillinger

fbshipit-source-id: aef265fce69465618974b4ec47f4636257c676ce
2024-05-21 10:17:34 -07:00
Hui Xiao f910a0c025 Fix unreleased bug fix .md name (#12672)
Summary:
Context/Summary: as above

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12672

Test Plan: no code change

Reviewed By: ajkr

Differential Revision: D57505136

Pulled By: hx235

fbshipit-source-id: 0e216dc5974e9be10027b444eb6b4034f679dfd8
2024-05-20 09:41:11 -07:00
Hui Xiao 20213d01a3 Fix crash in CompactFiles() of conflict range under preclude_last_level_data_seconds > 0 (#12628)
Summary:
**Context/Summary:**

Previously `CompactFiles()` used `RangeOverlapWithCompaction()` to check for conflict when sanitizing input files while later used `FilesRangeOverlapWithCompaction()` to assert for no conflict. The latter function checks for more conflict scenarios than the former does, particularly the ones arising from `preclude_last_level_data_seconds > 0` (i.e, compaction can output to second-to-the-last level). So we ran into assertion violation in `CompactFiles()` like below
```
 Assertion `output_level == 0 || !FilesRangeOverlapWithCompaction( input_files, output_level, Compaction::EvaluatePenultimateLevel(vstorage, ioptions_, start_level, output_level))' failed.
```

This PR make `CompactFiles()` used `FilesRangeOverlapWithCompaction()` and return Aborted status upon range conflict instead of crashing (during debug build) or proceed incorrectly (during non-debug build). To do so cleanly, I included a refactoring to make `FilesRangeOverlapWithCompaction()` part of `SanitizeAndConvertCompactionInputFiles()`, replacing `RangeOverlapWithCompaction()`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12628

Test Plan: New UT crashed before the fix and return correct status after the fix.

Reviewed By: cbi42

Differential Revision: D57123536

Pulled By: hx235

fbshipit-source-id: f963a2c9e7ba1a9927a67fcc87f0dce126d3a430
2024-05-13 13:12:06 -07:00
Andrew Kryczka 933ac0e05c Fix locking for ColumnFamilyOptions::inplace_update_support (#12624)
Summary:
In `SaveValue()`, the read lock needs to be obtained before `VerifyEntryChecksum()` because the KV checksum verification reads the entire value metadata+data, which is all mutable when `ColumnFamilyOptions::inplace_update_support == true`.

In `MemTable::Update()`, the write lock needs to be obtained before mutating the value metadata (changing the value size) because it can be read concurrently.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12624

Test Plan:
```
$ make COMPILE_WITH_TSAN=1 -j56 db_stress
...
$ python3 tools/db_crashtest.py blackbox --simple --max_key=10 --inplace_update_support=1 --interval=10 --allow_concurrent_memtable_write=0
```

Reviewed By: cbi42

Differential Revision: D57034571

Pulled By: ajkr

fbshipit-source-id: 3dddf881ad87923143acdf6bfec12ce47bb13a48
2024-05-08 08:30:12 -07:00
Andrew Kryczka 0fef690bd5 Sync non-latest WALs during flush for 2PC, single-CF DBs (#12622)
Summary:
Previously we skipped syncing the non-latest WALs during memtable flush when the DB had only one column family. Normally that is fine because those non-latest WALs would not be read by recovery. However, in case of `DBOptions::allow_2pc == true`, there could be unmatched prepare records in those WALs making them needed by recovery. As a result, the missing sync could have resulted in the recovered WAL state falling behind the recovered SST state. When we detect that case, we return a `Status::Corruption` saying "SST file is ahead of WALs".

This PR proposes syncing the WAL in case of `DBOptions::allow_2pc`. This introduces the sync in some scenarios where it isn't needed (e.g., non-recent WALs contain no prepares) but I suspect the simplicity is worth it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12622

Reviewed By: cbi42

Differential Revision: D56987303

Pulled By: ajkr

fbshipit-source-id: 7fe9395458018a18d77e907a3b5429065c0e2e48
2024-05-06 11:56:16 -07:00
Changyu Bi 6fdc4c5282 Fix a corruption bug in CreateColumnFamilyWithImport() (#12602)
Summary:
when importing files from multiple CFs into a new CF, we were reusing the epoch numbers assigned by the original CFs. This means L0 files in the new CF can have the same epoch number (assigned originally by different CFs). While CreateColumnFamilyWithImport() requires each original CF to have disjoint key range, after an intra-l0 compaction, we still can end up with L0 files with the same epoch number but overlapping key range. This PR attempt to fix this by reassigning epoch numbers when importing multiple CFs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12602

Test Plan:
a new repro unit test. Before this PR, it fails with
```
[ RUN      ] ImportColumnFamilyTest.AssignEpochNumberToMultipleCF
db/import_column_family_test.cc:1048: Failure
db_->WaitForCompact(o)
Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 files of same epoch number but overlapping range https://github.com/facebook/rocksdb/issues/44 , smallest key: '6B6579303030303030' seq:511, type:1 , largest key: '6B6579303031303239' seq:510, type:1 , epoch number: 3 vs. file https://github.com/facebook/rocksdb/issues/36 , smallest key: '6B6579303030313030' seq:401, type:1 , largest key: '6B6579303030313939' seq:500, type:1 , epoch number: 3
```

Reviewed By: hx235

Differential Revision: D56851808

Pulled By: cbi42

fbshipit-source-id: 01b8c790c9f1f2a168047ead670e73633f705b84
2024-05-06 11:01:38 -07:00
anand76 6349da612b Update HISTORY.md and version to 9.3.0 (#12601)
Summary:
Update HISTORY.md for 9.2 and version to 9.3.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12601

Reviewed By: jaykorean, jowlyzhang

Differential Revision: D56845901

Pulled By: anand1976

fbshipit-source-id: 0d1137a6568e4712be2f8b705f4f7b438217dbed
2024-05-01 16:33:04 -07:00
Yu Zhang 241253053a Fix delete obsolete files on recovery not rate limited (#12590)
Summary:
This PR fix the issue that deletion of obsolete files during DB::Open are not rate limited.

The root cause is slow deletion is disabled if trash/db size ratio exceeds the configured `max_trash_db_ratio` d610e14f93/include/rocksdb/sst_file_manager.h (L126) however, the current handling in DB::Open starts with tracking nothing but the obsolete files. This will make the ratio always look like it's 1.

In order for the deletion rate limiting logic to work properly, we should only start deleting files after `SstFileManager` has finished tracking the whole DB, so the main fix is to move these two places that attempts to delete file after the tracking are done: 1) the `DeleteScheduler::CleanupDirectory` call in `SanitizeOptions`, 2) the `DB::DeleteObsoleteFiles` call.

There are some other aesthetic changes like refactoring collecting all the DB paths into a function, rename `DBImp::DeleteUnreferencedSstFiles` to `DBImpl:: MaybeUpdateNextFileNumber` as it doesn't actually delete the files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12590

Test Plan: Added unit test and verified with manual testing

Reviewed By: anand1976

Differential Revision: D56830519

Pulled By: jowlyzhang

fbshipit-source-id: 8a38a21b1ea11c5371924f2b88663648f7a17885
2024-05-01 12:26:54 -07:00
Hui Xiao abd6751aba Fix wrong padded bytes being used to generate file checksum (#12598)
Summary:
**Context/Summary:**

https://github.com/facebook/rocksdb/pull/12542 introduced a bug where wrong padded bytes used to generate file checksum if flush happens during padding. This PR fixed it along with an existing same bug for `perform_data_verification_=true`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12598

Test Plan:
- New UT that failed before this fix (`db->VerifyFileChecksums: ...Corruption:  ...file checksum mismatch`) and passes after
- Benchmark
```
TEST_TMPDIR=/dev/shm  ./db_bench --benchmarks=fillseq[-X300] --num=100000 --block_align=1 --compression_type=none
```
Pre-PR:
fillseq [AVG    300 runs] : 421334 (± 4126) ops/sec;   46.6 (± 0.5) MB/sec
Post-PR: (no regression observed but a slight improvement)
fillseq [AVG    300 runs] : 425768 (± 4309) ops/sec;   47.1 (± 0.5) MB/sec

Reviewed By: ajkr, anand1976

Differential Revision: D56725688

Pulled By: hx235

fbshipit-source-id: c1a700a95def8c65c0a21e44f8c1966164925ad5
2024-04-30 15:38:53 -07:00
Andrew Kryczka b2931a5c53 Fixed MultiGet() error handling to not skip blob dereference (#12597)
Summary:
See comment at top of the test case and release note.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12597

Reviewed By: jaykorean

Differential Revision: D56718786

Pulled By: ajkr

fbshipit-source-id: 8dce185bb0d24a358372fc2b553d181793fc335f
2024-04-29 14:18:42 -07:00
anand76 e36b0a2da4 Fix corruption bug when recycle_log_file_num changed from 0 (#12591)
Summary:
When `recycle_log_file_num` is changed from 0 to non-zero and the DB is reopened, any log files from the previous session that are still alive get reused. However, the WAL records in those files are not in the recyclable format. If one of those files is reused and is empty, a subsequent re-open, in `RecoverLogFiles`, can replay those records and insert stale data into the memtable. Another manifestation of this is an assertion failure `first_seqno_ == 0 || s >= first_seqno_` in `rocksdb::MemTable::Add`.

We could fix this by either 1) Writing a special record when reusing a log file, or 2) Implement more rigorous checking in `RecoverLogFiles` to ensure we don't replay stale records, or 3) Not reuse files created by a previous DB session. We choose option 3 as its the simplest, and flipping `recycle_log_file_num` is expected to be a rare event.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12591

Test Plan: 1. Add a unit test to verify the bug and fix

Reviewed By: jowlyzhang

Differential Revision: D56655812

Pulled By: anand1976

fbshipit-source-id: aa3a26b4a5e892d39a54b5a0658233cbebebac87
2024-04-29 12:25:00 -07:00