rocksdb/unreleased_history/bug_fixes
Peter Dillinger 92dc5f3e67 AutoHCC: fix a bug with "blind" Insert (#12046)
Summary:
I have finally tracked down and fixed a bug affecting AutoHCC that was causing CI crash test assertion failures in AutoHCC when using secondary cache, but I was only able to reproduce locally a couple of times, after very long runs/repetitions.

It turns out that the essential feature used by secondary cache to trigger the bug is Insert without keeping a handle, which is otherwise rarely used in RocksDB and not incorporated into cache_bench (also used for targeted correctness stress testing) until this change (new option `-blind_insert_percent`).

The problem was in copying some logic from FixedHCC that makes the entry "sharable" but unreferenced once populated, if no reference is to be saved. The problem in AutoHCC is that we can only add the entry to a chain after it is in the sharable state, and must be removed from the chain while in the "under (de)construction" state and before it is back in the "empty" state. Also, it is possible for Lookup to find entries that are not connected to any chain, by design for efficiency, and for Release to erase_if_last_ref. Therefore, we could have
* Thread 1 starts to Insert a cache entry without keeping ref, and pauses before adding to the chain.
* Thread 2 finds it with Lookup optimizations, and then does Release with `erase_if_last_ref=true` causing it to trigger erasure on the entry. It successfully locks the home chain for the entry and purges any entries pending erasure. It is OK that this entry is not found on the chain, as another thread is allowed to remove it from the chain before we are able to (but after is it marked for (de)construction). And after the purge of the chain, the entry is marked empty.
* Thread 1 resumes in adding the slot (presumed entry) to the home chain for what was being inserted, but that now violates invariants and sets up a race or double-chain-reference as another thread could insert a new entry in the slot and try to insert into a different chain.

This is easily fixed by holding on to a reference until inserted onto the chain.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12046

Test Plan:
As I don't have a reliable local reproducer, I triggered 20 runs of internal CI on fbcode_blackbox_crash_test that were previously failing in AutoHCC with about 1/3 probability, and they all passed.

Also re-enabling AutoHCC in the crash test with this change. (Revert https://github.com/facebook/rocksdb/issues/12000)

Reviewed By: jowlyzhang

Differential Revision: D51016979

Pulled By: pdillinger

fbshipit-source-id: 3840fb829d65b97c779d8aed62a4a4a433aeff2b
2023-11-06 16:06:01 -08:00
..
.gitkeep Better management of unreleased HISTORY (#11481) 2023-05-30 16:42:49 -07:00
exp_autohcc_fix.md AutoHCC: fix a bug with "blind" Insert (#12046) 2023-11-06 16:06:01 -08:00
flush_recovery_db_destructor_race.md Fix race between flush error recovery and db destruction (#12002) 2023-10-25 11:59:09 -07:00