mirror of https://github.com/facebook/rocksdb.git
d33d25f903
Summary: I was investigating a crash test failure with "Corruption: SST file is ahead of WALs", which I haven't reproduced, but I did reproduce a data loss issue on recovery that I suspect shares the same root cause. The problem is already somewhat known (see https://github.com/facebook/rocksdb/issues/12403 and https://github.com/facebook/rocksdb/issues/12639): it is only safe to recover multiple recycled WAL files with trailing old data if the sequence numbers between them are adjacent (to ensure we didn't lose anything in the corrupt/obsolete WAL tail). However, aside from disableWAL=true, there are features like external file ingestion that can increment the sequence numbers without writing to the WAL. It is simply unsustainable to worry about this kind of feature interaction limiting where we can consume sequence numbers, and it is very hard to test and audit as well. For reliable crash recovery of recycled WALs, we need a better way of detecting that we didn't drop data from one WAL to the next. Until then, let's disable WAL recycling in the crash test, to help stabilize it.

Ideas for follow-up to fix the underlying problem:

(a) With recycling, we could always sync the WAL before opening the next one. HOWEVER, this potentially very large sync could cause a big hiccup in writes (vs. an O(1)-sized manifest sync).
  (a1) The WAL sync could ensure it is truncated to size, or
  (a2) by requiring track_and_verify_wals_in_manifest, we could assume that the last synced size in the manifest is the final usable size of the WAL. (It might also be worth avoiding truncating recycled WALs.)

(b) Add a new mechanism to record and verify the final size of a WAL without requiring a sync.
  (b1) By requiring track_and_verify_wals_in_manifest, this could be new WAL metadata recorded in the manifest (at the time of switching WALs). Note that new fields of WalMetadata are not forward-compatible, but a new kind of manifest record (next to WalAddition, WalDeletion; e.g. WalCompletion) is, IIRC, forward-compatible.
  (b2) A new kind of WAL header entry (not forward-compatible, unfortunately) could record the final size of the previous WAL.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12918

Test Plan: Added disabled reproducer for non-linear data loss on recovery

Reviewed By: hx235

Differential Revision: D60917527

Pulled By: pdillinger

fbshipit-source-id: 3663d79aec81851f5cf41669f84a712bb4563fd7
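For context, a minimal sketch of what "WAL recycling disabled" looks like from the application side: DBOptions::recycle_log_file_num stays at its default of 0, and track_and_verify_wals_in_manifest (mentioned in the follow-up ideas above) adds WAL size tracking in the MANIFEST. This is an illustrative example only, not part of this change; the database path and the single Put are assumptions for the sake of a self-contained program.

  // Sketch: open a DB with WAL recycling disabled and WAL tracking enabled.
  // Assumes a writable path "/tmp/rocksdb_wal_example"; error handling is minimal.
  #include <cassert>
  #include "rocksdb/db.h"
  #include "rocksdb/options.h"

  int main() {
    rocksdb::Options options;
    options.create_if_missing = true;
    // 0 (the default) means WAL files are never recycled; values > 0 enable
    // the recycling behavior discussed above.
    options.recycle_log_file_num = 0;
    // Record WAL metadata in the MANIFEST so recovery can verify the WALs it
    // reads against what was tracked at the last sync.
    options.track_and_verify_wals_in_manifest = true;

    rocksdb::DB* db = nullptr;
    rocksdb::Status s =
        rocksdb::DB::Open(options, "/tmp/rocksdb_wal_example", &db);
    assert(s.ok());

    s = db->Put(rocksdb::WriteOptions(), "key", "value");
    assert(s.ok());

    delete db;
    return 0;
  }

The crash test change itself lives in db_crashtest.py (listed below); the snippet above only shows the corresponding DB-level options.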
advisor
block_cache_analyzer
dump
CMakeLists.txt
Dockerfile
analyze_txn_stress_test.sh
auto_sanity_test.sh
backup_db.sh
benchmark.sh
benchmark_ci.py
benchmark_compare.sh
benchmark_leveldb.sh
blob_dump.cc
check_all_python.py
check_format_compatible.sh
db_bench.cc
db_bench_tool.cc
db_bench_tool_test.cc
db_crashtest.py
db_repl_stress.cc
db_sanity_test.cc
dbench_monitor
generate_random_db.sh
ingest_external_sst.sh
io_tracer_parser.cc
io_tracer_parser_test.cc
io_tracer_parser_tool.cc
io_tracer_parser_tool.h
ldb.cc
ldb_cmd.cc
ldb_cmd_impl.h
ldb_cmd_test.cc
ldb_test.py
ldb_tool.cc
pflag
reduce_levels_test.cc
regression_test.sh
restore_db.sh
rocksdb_dump_test.sh
run_blob_bench.sh
run_flash_bench.sh
run_leveldb.sh
sample-dump.dmp
simulated_hybrid_file_system.cc
simulated_hybrid_file_system.h
sst_dump.cc
sst_dump_test.cc
sst_dump_tool.cc
trace_analyzer.cc
trace_analyzer_test.cc
trace_analyzer_tool.cc
trace_analyzer_tool.h
verify_random_db.sh
write_external_sst.sh
write_stress.cc
write_stress_runner.py