rocksdb/db/db_impl
Hui Xiao f79b3d19a7 Inject spurious wakeup and sleep before acquiring db mutex to expose race condition (#10291)
Summary:
**Context/Summary:**
Previous experience with bugs and flaky tests taught us there exist features in RocksDB vulnerable to race condition caused by acquiring db mutex at a particular timing. This PR aggressively exposes those vulnerable features by injecting spurious wakeup and sleep to cause acquiring db mutex at various timing in order to expose such race condition

**Testing:**
- `COERCE_CONTEXT_SWITCH=1 make -j56 check / make -j56 db_stress` should reveal
    - flaky tests caused by db mutex related race condition
       - Reverted https://github.com/facebook/rocksdb/pull/9528
       - A/B testing on `COMPILE_WITH_TSAN=1 make -j56 listener_test` w/ and w/o `COERCE_CONTEXT_SWITCH=1` followed by `./listener_test --gtest_filter=EventListenerTest.MultiCF --gtest_repeat=10`
       - `COERCE_CONTEXT_SWITCH=1` can cause expected test failure (i.e, expose target TSAN data race error) within 10 run while the other couldn't.
       - This proves our injection can expose flaky tests caused by db mutex related race condition faster.
    -  known or new race-condition-type of internal bug by continuously running this PR
- Performance
   - High ops-threads time: COERCE_CONTEXT_SWITCH=1 regressed by 4 times slower (2:01.16 vs 0:22.10 elapsed ). This PR will be run as a separate CI job so this regression won't affect any existing job.
```
TEST_TMPDIR=$db /usr/bin/time ./db_stress \
--ops_per_thread=100000 --expected_values_dir=$exp --clear_column_family_one_in=0 \
--write_buffer_size=524288 —target_file_size_base=524288 —ingest_external_file_one_in=100 —compact_files_one_in=1000 —compact_range_one_in=1000
```
  - Start-up time:  COERCE_CONTEXT_SWITCH=1 didn't regress by 25% (0:01.51 vs 0:01.29 elapsed)
```
TEST_TMPDIR=$db ./db_stress -ops_per_thread=100000000 -expected_values_dir=$exp --clear_column_family_one_in=0 & sleep 120; pkill -9 db_stress

TEST_TMPDIR=$db /usr/bin/time ./db_stress \
--ops_per_thread=1 -reopen=0 --expected_values_dir=$exp --clear_column_family_one_in=0 --destroy_db_initially=0
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10291

Reviewed By: ajkr

Differential Revision: D39231182

Pulled By: hx235

fbshipit-source-id: 7ab6695430460e0826727fd8c66679b32b3e44b6
2022-09-12 13:55:23 -07:00
..
compacted_db_impl.cc Fix periodic_task unable to re-register the same task type (#10379) 2022-08-25 18:52:37 -07:00
compacted_db_impl.h Don't wait for indirect flush in read-only DB (#10569) 2022-08-29 13:36:23 -07:00
db_impl.cc Inject spurious wakeup and sleep before acquiring db mutex to expose race condition (#10291) 2022-09-12 13:55:23 -07:00
db_impl.h Always verify SST unique IDs on SST file open (#10532) 2022-09-07 22:52:42 -07:00
db_impl_compaction_flush.cc Sync dir containing CURRENT after RenameFile on CURRENT as much as possible (#10573) 2022-08-29 17:35:21 -07:00
db_impl_debug.cc Fix periodic_task unable to re-register the same task type (#10379) 2022-08-25 18:52:37 -07:00
db_impl_experimental.cc Remove unused fields from FileMetaData (temporarily) (#10443) 2022-08-01 17:56:13 -07:00
db_impl_files.cc Do not hold mutex when write keys if not necessary (#7516) 2022-07-21 13:35:36 -07:00
db_impl_open.cc Always verify SST unique IDs on SST file open (#10532) 2022-09-07 22:52:42 -07:00
db_impl_readonly.cc Skip swaths of range tombstone covered keys in merging iterator (2022 edition) (#10449) 2022-09-02 09:51:19 -07:00
db_impl_readonly.h Don't wait for indirect flush in read-only DB (#10569) 2022-08-29 13:36:23 -07:00
db_impl_secondary.cc Skip swaths of range tombstone covered keys in merging iterator (2022 edition) (#10449) 2022-09-02 09:51:19 -07:00
db_impl_secondary.h Don't wait for indirect flush in read-only DB (#10569) 2022-08-29 13:36:23 -07:00
db_impl_write.cc Sync dir containing CURRENT after RenameFile on CURRENT as much as possible (#10573) 2022-08-29 17:35:21 -07:00