rocksdb/db
Yanqin Jin e0c84aa0dc Fix a race condition in WAL tracking causing DB open failure (#9715)
Summary:
There is a race condition if WAL tracking in the MANIFEST is enabled in a database that disables 2PC.

The race condition is between two background flush threads trying to install flush results to the MANIFEST.

Consider an example database with two column families: "default" (cfd0) and "cf1" (cfd1). Initially,
both column families have one mutable (active) memtable whose data backed by 6.log.

1. Trigger a manual flush for "cf1", creating a 7.log
2. Insert another key to "default", and trigger flush for "default", creating 8.log
3. BgFlushThread1 finishes writing 9.sst
4. BgFlushThread2 finishes writing 10.sst

```
Time  BgFlushThread1                                    BgFlushThread2
 |    mutex_.Lock()
 |    precompute min_wal_to_keep as 6
 |    mutex_.Unlock()
 |                                                     mutex_.Lock()
 |                                                     precompute min_wal_to_keep as 6
 |                                                     join MANIFEST write queue and mutex_.Unlock()
 |    write to MANIFEST
 |    mutex_.Lock()
 |    cfd1->log_number = 7
 |    Signal bg_flush_2 and mutex_.Unlock()
 |                                                     wake up and mutex_.Lock()
 |                                                     cfd0->log_number = 8
 |                                                     FindObsoleteFiles() with job_context->log_number == 7
 |                                                     mutex_.Unlock()
 |                                                     PurgeObsoleteFiles() deletes 6.log
 V
```

As shown in the above, BgFlushThread2 thinks that the min wal to keep is 6.log because "cf1" has unflushed data in 6.log (cf1.log_number=6).
Similarly, BgThread1 thinks that min wal to keep is also 6.log because "default" has unflushed data (default.log_number=6).
No WAL deletion will be written to MANIFEST because 6 is equal to `versions_->wals_.min_wal_number_to_keep`,
due to https://github.com/facebook/rocksdb/blob/7.1.fb/db/memtable_list.cc#L513:L514.
The bg flush thread that finishes last will perform file purging. `job_context.log_number` will be evaluated as 7, i.e.
the min wal that contains unflushed data, causing 6.log to be deleted. However, MANIFEST thinks 6.log should still exist.
If you close the db at this point, you won't be able to re-open it if `track_and_verify_wal_in_manifest` is true.

We must handle the case of multiple bg flush threads, and it is difficult for one bg flush thread to know
the correct min wal number until the other bg flush threads have finished committing to the manifest and updated
the `cfd::log_number`.
To fix this issue, we rename an existing variable `min_log_number_to_keep_2pc` to `min_log_number_to_keep`,
and use it to track WAL file deletion in non-2pc mode as well.
This variable is updated only 1) during recovery with mutex held, or 2) in the MANIFEST write thread.
`min_log_number_to_keep` means RocksDB will delete WALs below it, although there may be WALs
above it which are also obsolete. Formally, we will have [min_wal_to_keep, max_obsolete_wal]. During recovery, we
make sure that only WALs above max_obsolete_wal are checked and added back to `alive_log_files_`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9715

Test Plan:
```
make check
```
Also ran stress test below (with asan) to make sure it completes successfully.
```
TEST_TMPDIR=/dev/shm/rocksdb OPT=-g ASAN_OPTIONS=disable_coredump=0 \
CRASH_TEST_EXT_ARGS=--compression_type=zstd SKIP_FORMAT_BUCK_CHECKS=1 \
make J=52 -j52 blackbox_asan_crash_test
```

Reviewed By: ltamasi

Differential Revision: D34984412

Pulled By: riversand963

fbshipit-source-id: c7b21a8d84751bb55ea79c9f387103d21b231005
2022-03-23 19:41:31 -07:00
..
blob Add rate limiter priority to ReadOptions (#9424) 2022-02-16 23:18:14 -08:00
compaction Fix a bug in PosixClock (#9695) 2022-03-21 16:11:02 -07:00
db_impl Fix a race condition in WAL tracking causing DB open failure (#9715) 2022-03-23 19:41:31 -07:00
arena_wrapped_db_iter.cc fix: Reusing-Iterator reads stale keys after DeleteRange() performed (#9258) 2022-03-15 09:50:21 -07:00
arena_wrapped_db_iter.h
builder.cc Expand auto recovery to background read errors (#9679) 2022-03-15 14:45:34 -07:00
builder.h
c.cc Fix a major performance bug in 7.0 re: filter compatibility (#9736) 2022-03-23 10:00:54 -07:00
c_test.c fix a bug, c api, if enable inplace_update_support, and use create sn… (#9471) 2022-03-21 12:04:33 -07:00
column_family.cc DisableManualCompaction may fail to cancel an unscheduled task (#9659) 2022-03-12 20:07:04 -08:00
column_family.h Add OpenAndTrimHistory API to support trimming data with specified timestamp (#9410) 2022-03-11 16:13:23 -08:00
column_family_test.cc More refactoring ahead of footer & meta changes (#9240) 2021-12-10 08:13:26 -08:00
compact_files_test.cc Fix test race conditions with OnFlushCompleted() (#9617) 2022-02-22 12:23:00 -08:00
comparator_db_test.cc More refactoring ahead of footer & meta changes (#9240) 2021-12-10 08:13:26 -08:00
convenience.cc Add rate limiter priority to ReadOptions (#9424) 2022-02-16 23:18:14 -08:00
corruption_test.cc Make the Env class Customizable (#9293) 2022-01-04 16:45:49 -08:00
cuckoo_table_db_test.cc
db_basic_test.cc Fix some MultiGet batching stats (#9583) 2022-02-17 16:31:41 -08:00
db_block_cache_test.cc Enhance new cache key testing & comments (#9329) 2022-02-04 14:15:58 -08:00
db_bloom_filter_test.cc Fix a major performance bug in 7.0 re: filter compatibility (#9736) 2022-03-23 10:00:54 -07:00
db_compaction_filter_test.cc
db_compaction_test.cc Fix a race condition when disable and enable manual compaction (#9694) 2022-03-15 12:31:14 -07:00
db_dynamic_level_test.cc Remove deprecated API AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit (#9452) 2022-01-27 13:01:09 -08:00
db_encryption_test.cc
db_filesnapshot.cc Use a sorted vector instead of a map to store blob file metadata (#9526) 2022-02-09 12:36:43 -08:00
db_flush_test.cc Fix mempurge crash reported in #8958 (#9671) 2022-03-10 15:16:55 -08:00
db_info_dumper.cc
db_info_dumper.h
db_inplace_update_test.cc fix a bug, c api, if enable inplace_update_support, and use create sn… (#9471) 2022-03-21 12:04:33 -07:00
db_io_failure_test.cc Enable a few unit tests to use custom Env objects (#9087) 2021-11-08 11:05:59 -08:00
db_iter.cc Remove iter_start_seqnum and preserve_deletes (#9430) 2022-01-28 13:28:38 -08:00
db_iter.h
db_iter_stress_test.cc
db_iter_test.cc Remove iter_start_seqnum and preserve_deletes (#9430) 2022-01-28 13:28:38 -08:00
db_iterator_test.cc
db_kv_checksum_test.cc Revise APIs related to user-defined timestamp (#8946) 2022-02-01 22:19:01 -08:00
db_log_iter_test.cc
db_logical_block_size_cache_test.cc Attempt to deflake DBLogicalBlockSizeCacheTest.CreateColumnFamilies (#9516) 2022-03-04 11:35:28 -08:00
db_memtable_test.cc
db_merge_operand_test.cc Fix PinSelf() read-after-free in DB::GetMergeOperands() (#9507) 2022-02-15 12:25:18 -08:00
db_merge_operator_test.cc
db_options_test.cc Remove deprecated option new_table_reader_for_compaction_inputs (#9443) 2022-02-08 19:31:28 -08:00
db_properties_test.cc compression_per_level should be used for flush and changeable (#9658) 2022-03-07 18:06:19 -08:00
db_range_del_test.cc fix: Reusing-Iterator reads stale keys after DeleteRange() performed (#9258) 2022-03-15 09:50:21 -07:00
db_rate_limiter_test.cc Rate-limit automatic WAL flush after each user write (#9607) 2022-03-08 13:19:39 -08:00
db_secondary_test.cc Make the Env class Customizable (#9293) 2022-01-04 16:45:49 -08:00
db_sst_test.cc Make the Env class Customizable (#9293) 2022-01-04 16:45:49 -08:00
db_statistics_test.cc
db_table_properties_test.cc Fix backward compatibility with 2.5 through 2.7 (#9189) 2021-11-19 17:31:01 -08:00
db_tailing_iter_test.cc
db_test.cc Fix spelling in public API (#9490) 2022-02-03 15:15:23 -08:00
db_test2.cc Add manifest fix-up utility for file temperatures (#9683) 2022-03-18 16:35:51 -07:00
db_test_util.cc Use a sorted vector instead of a map to store blob file metadata (#9526) 2022-02-09 12:36:43 -08:00
db_test_util.h Update Cache::Release param from force_erase to erase_if_last_ref (#9728) 2022-03-22 10:22:18 -07:00
db_universal_compaction_test.cc Adhere to per-DB concurrency limit when bottom-pri compactions exist (#9179) 2021-11-18 17:31:50 -08:00
db_wal_test.cc Fix a race condition in WAL tracking causing DB open failure (#9715) 2022-03-23 19:41:31 -07:00
db_with_timestamp_basic_test.cc Add OpenAndTrimHistory API to support trimming data with specified timestamp (#9410) 2022-03-11 16:13:23 -08:00
db_with_timestamp_compaction_test.cc Use the comparator from the sst file table properties in sst_dump_tool (#9491) 2022-02-08 12:15:35 -08:00
db_write_buffer_manager_test.cc Enable a few unit tests to use custom Env objects (#9087) 2021-11-08 11:05:59 -08:00
db_write_test.cc Enable a few unit tests to use custom Env objects (#9087) 2021-11-08 11:05:59 -08:00
dbformat.cc Track per-SST user-defined timestamp information in MANIFEST (#9092) 2021-11-10 10:49:04 -08:00
dbformat.h Add OpenAndTrimHistory API to support trimming data with specified timestamp (#9410) 2022-03-11 16:13:23 -08:00
dbformat_test.cc Enable a few unit tests to use custom Env objects (#9087) 2021-11-08 11:05:59 -08:00
deletefile_test.cc Enable a few unit tests to use custom Env objects (#9087) 2021-11-08 11:05:59 -08:00
error_handler.cc Expand auto recovery to background read errors (#9679) 2022-03-15 14:45:34 -07:00
error_handler.h Expand auto recovery to background read errors (#9679) 2022-03-15 14:45:34 -07:00
error_handler_fs_test.cc Expand auto recovery to background read errors (#9679) 2022-03-15 14:45:34 -07:00
event_helpers.cc Fix a race condition in WAL tracking causing DB open failure (#9715) 2022-03-23 19:41:31 -07:00
event_helpers.h Add a listener callback for end of auto error recovery (#9244) 2021-12-08 14:30:57 -08:00
experimental.cc Add manifest fix-up utility for file temperatures (#9683) 2022-03-18 16:35:51 -07:00
external_sst_file_basic_test.cc Enable a few unit tests to use custom Env objects (#9087) 2021-11-08 11:05:59 -08:00
external_sst_file_ingestion_job.cc Do not rely on ADL when invoking std::max_element (#9608) 2022-03-02 17:41:02 -08:00
external_sst_file_ingestion_job.h New stable, fixed-length cache keys (#9126) 2021-12-16 17:15:13 -08:00
external_sst_file_test.cc Make the Env class Customizable (#9293) 2022-01-04 16:45:49 -08:00
fault_injection_test.cc Fix a bug causing duplicate trailing entries in WritableFile (buffered IO) (#9236) 2021-12-13 09:00:36 -08:00
file_indexer.cc
file_indexer.h
file_indexer_test.cc
filename_test.cc
flush_job.cc Fix a bug in PosixClock (#9695) 2022-03-21 16:11:02 -07:00
flush_job.h Expand auto recovery to background read errors (#9679) 2022-03-15 14:45:34 -07:00
flush_job_test.cc Use the comparator from the sst file table properties in sst_dump_tool (#9491) 2022-02-08 12:15:35 -08:00
flush_scheduler.cc
flush_scheduler.h
forward_iterator.cc Fast path for detecting unchanged prefix_extractor (#9407) 2022-01-21 11:37:46 -08:00
forward_iterator.h Fast path for detecting unchanged prefix_extractor (#9407) 2022-01-21 11:37:46 -08:00
forward_iterator_bench.cc Remove using namespace (#9369) 2022-01-12 09:31:12 -08:00
history_trimming_iterator.h Add OpenAndTrimHistory API to support trimming data with specified timestamp (#9410) 2022-03-11 16:13:23 -08:00
import_column_family_job.cc Add Temperature info in NewSequentialFile() (#9499) 2022-02-18 18:23:07 -08:00
import_column_family_job.h New stable, fixed-length cache keys (#9126) 2021-12-16 17:15:13 -08:00
import_column_family_test.cc
internal_stats.cc Add manifest fix-up utility for file temperatures (#9683) 2022-03-18 16:35:51 -07:00
internal_stats.h
job_context.h Add manifest fix-up utility for file temperatures (#9683) 2022-03-18 16:35:51 -07:00
kv_checksum.h fix compile errors in db/kv_checksum.h (#9173) 2021-11-16 10:20:50 -08:00
listener_test.cc Fix test race conditions with OnFlushCompleted() (#9617) 2022-02-22 12:23:00 -08:00
log_format.h Add record to set WAL compression type if enabled (#9556) 2022-02-17 16:19:31 -08:00
log_reader.cc Integrate WAL compression into log reader/writer. (#9642) 2022-03-09 15:49:53 -08:00
log_reader.h Integrate WAL compression into log reader/writer. (#9642) 2022-03-09 15:49:53 -08:00
log_test.cc Integrate WAL compression into log reader/writer. (#9642) 2022-03-09 15:49:53 -08:00
log_writer.cc Integrate WAL compression into log reader/writer. (#9642) 2022-03-09 15:49:53 -08:00
log_writer.h Integrate WAL compression into log reader/writer. (#9642) 2022-03-09 15:49:53 -08:00
logs_with_prep_tracker.cc
logs_with_prep_tracker.h
lookup_key.h
malloc_stats.cc
malloc_stats.h
manual_compaction_test.cc Remove using namespace (#9369) 2022-01-12 09:31:12 -08:00
memtable.cc Fix major bug with MultiGet, DeleteRange, and memtable Bloom (#9453) 2022-01-27 14:55:04 -08:00
memtable.h Fix major bug with MultiGet, DeleteRange, and memtable Bloom (#9453) 2022-01-27 14:55:04 -08:00
memtable_list.cc Fix a race condition in WAL tracking causing DB open failure (#9715) 2022-03-23 19:41:31 -07:00
memtable_list.h Expand auto recovery to background read errors (#9679) 2022-03-15 14:45:34 -07:00
memtable_list_test.cc Expand auto recovery to background read errors (#9679) 2022-03-15 14:45:34 -07:00
merge_context.h
merge_helper.cc Support readahead during compaction for blob files (#9187) 2021-11-19 17:53:47 -08:00
merge_helper.h Support readahead during compaction for blob files (#9187) 2021-11-19 17:53:47 -08:00
merge_helper_test.cc Support readahead during compaction for blob files (#9187) 2021-11-19 17:53:47 -08:00
merge_operator.cc
merge_test.cc Make the Env class Customizable (#9293) 2022-01-04 16:45:49 -08:00
obsolete_files_test.cc Add commit marker with timestamp (#9266) 2021-12-10 11:05:35 -08:00
options_file_test.cc
output_validator.cc
output_validator.h
perf_context_test.cc
periodic_work_scheduler.cc Fix a timer crash caused by invalid memory management (#9656) 2022-03-12 11:45:56 -08:00
periodic_work_scheduler.h Fix a timer crash caused by invalid memory management (#9656) 2022-03-12 11:45:56 -08:00
periodic_work_scheduler_test.cc
pinned_iterators_manager.h
plain_table_db_test.cc Fast path for detecting unchanged prefix_extractor (#9407) 2022-01-21 11:37:46 -08:00
pre_release_callback.h
prefix_test.cc
range_del_aggregator.cc
range_del_aggregator.h
range_del_aggregator_bench.cc
range_del_aggregator_test.cc
range_tombstone_fragmenter.cc
range_tombstone_fragmenter.h
range_tombstone_fragmenter_test.cc
read_callback.h
repair.cc Fast path for detecting unchanged prefix_extractor (#9407) 2022-01-21 11:37:46 -08:00
repair_test.cc
snapshot_checker.h
snapshot_impl.cc
snapshot_impl.h
table_cache.cc fix a bug of the ticker NO_FILE_OPENS (#9677) 2022-03-15 09:55:49 -07:00
table_cache.h Fast path for detecting unchanged prefix_extractor (#9407) 2022-01-21 11:37:46 -08:00
table_properties_collector.cc
table_properties_collector.h Track each SST's timestamp information as user properties (#9093) 2021-11-19 11:37:06 -08:00
table_properties_collector_test.cc Improve / clean up meta block code & integrity (#9163) 2021-11-18 11:43:44 -08:00
transaction_log_impl.cc Add commit marker with timestamp (#9266) 2021-12-10 11:05:35 -08:00
transaction_log_impl.h
trim_history_scheduler.cc
trim_history_scheduler.h
version_builder.cc Use a sorted vector instead of a map to store blob file metadata (#9526) 2022-02-09 12:36:43 -08:00
version_builder.h Fast path for detecting unchanged prefix_extractor (#9407) 2022-01-21 11:37:46 -08:00
version_builder_test.cc Use a sorted vector instead of a map to store blob file metadata (#9526) 2022-02-09 12:36:43 -08:00
version_edit.cc Print file checksum in hex (#9196) 2021-11-22 09:30:47 -08:00
version_edit.h Clean up VersionStorageInfo a bit (#9494) 2022-02-04 08:19:20 -08:00
version_edit_handler.cc Fix a race condition in WAL tracking causing DB open failure (#9715) 2022-03-23 19:41:31 -07:00
version_edit_handler.h
version_edit_test.cc File temperature information should be preserved when restart the DB (#9242) 2021-12-03 14:43:14 -08:00
version_set.cc Fix a race condition in WAL tracking causing DB open failure (#9715) 2022-03-23 19:41:31 -07:00
version_set.h Fix a race condition in WAL tracking causing DB open failure (#9715) 2022-03-23 19:41:31 -07:00
version_set_test.cc Fix a race condition in WAL tracking causing DB open failure (#9715) 2022-03-23 19:41:31 -07:00
version_util.h Add manifest fix-up utility for file temperatures (#9683) 2022-03-18 16:35:51 -07:00
wal_edit.cc
wal_edit.h
wal_edit_test.cc
wal_manager.cc
wal_manager.h
wal_manager_test.cc
write_batch.cc Improve stress test for transactions (#9568) 2022-03-16 19:00:04 -07:00
write_batch_base.cc
write_batch_internal.h Support WBWI for keys having timestamps (#9603) 2022-02-22 14:23:01 -08:00
write_batch_test.cc Support WBWI for keys having timestamps (#9603) 2022-02-22 14:23:01 -08:00
write_callback.h
write_callback_test.cc
write_controller.cc
write_controller.h
write_controller_test.cc
write_thread.cc Rate-limit automatic WAL flush after each user write (#9607) 2022-03-08 13:19:39 -08:00
write_thread.h Rate-limit automatic WAL flush after each user write (#9607) 2022-03-08 13:19:39 -08:00