rocksdb/tools
Igor Canadi 4b66d95344 Write stress test
Summary:
The goal of this diff is to create a simple stress test with focus on catching:
* bugs in compaction/flush processes, especially the ones that cause assertion errors
* bugs in the code that deletes obsolete files

There are two parts of the test:
* write_stress, a binary that writes to the database
* write_stress_runner.py, a script that invokes and kills write_stress

Here are some interesting parts of write_stress:
* Runs with very high concurrency of compactions and flushes (32 threads total) and tries to create a huge number of small files
* The keys written to the database are not uniformly distributed -- there is a 3-character prefix that mutates occasionally (in the prefix mutator thread), in such a way that the first character mutates more slowly than the second, which in turn mutates more slowly than the third. That way, the test stresses some interesting compaction features like trivial moves and bottommost level calculation
* There is a thread that creates an iterator, holds it for a couple of seconds and then iterates over all keys. This is supposed to test RocksDB's ability to keep files alive while there are references to them.
* Some writes trigger WAL sync. This is stress testing our WAL sync code.
* At the end of the run, we make sure that we didn't leak any of the sst files
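
The skewed key distribution described above can be sketched in a few lines of Python. This is only an illustration, not the code in write_stress.cc: the function names, the 1:10:100 position weights and the lowercase alphabet are all assumptions chosen to make the "first character mutates slowest" behaviour visible.

```python
import random
import string


def mutate_prefix(prefix, rng):
    """Return a copy of the 3-character prefix with one position re-rolled.

    The weights are hypothetical: position 2 is picked most often and
    position 0 least often, so the first character changes most slowly.
    """
    chars = list(prefix)
    pos = rng.choices([0, 1, 2], weights=[1, 10, 100])[0]
    chars[pos] = rng.choice(string.ascii_lowercase)
    return "".join(chars)


def make_key(prefix, rng, suffix_len=8):
    """Build a key from the current prefix plus a random suffix (assumed layout)."""
    suffix = "".join(rng.choice(string.ascii_lowercase) for _ in range(suffix_len))
    return prefix + suffix


def count_changes(steps=10000, seed=0):
    """Count how often each prefix position actually changes over many mutations."""
    rng = random.Random(seed)
    prefix = "aaa"
    changes = [0, 0, 0]
    for _ in range(steps):
        new_prefix = mutate_prefix(prefix, rng)
        for i in range(3):
            if new_prefix[i] != prefix[i]:
                changes[i] += 1
        prefix = new_prefix
    return changes
```

Calling `count_changes()` returns three counters that increase from the first position to the third. Keys written under such a prefix cluster in a slowly drifting range, which is the kind of workload where trivial moves become possible.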

write_stress_runner.py changes the mode in which we run write_stress and also kills and restarts it. There are some interesting characteristics:
* At the beginning we divide the full test runtime into smaller parts -- shorter runtimes (a couple of seconds) and longer runtimes (100 or 1000 seconds)
* The first time we run write_stress, we destroy the old DB. Every subsequent time during the test, we reuse the same DB.
* We can run in kill mode or clean-restart mode. Kill mode kills write_stress violently.
* We can run in a mode where delete_obsolete_files_with_fullscan is true or false
* We can run with low_open_files mode turned on or off. When it's turned on, we configure the table cache to hold only a couple of files -- that way we need to reopen files every time we access them.
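
The scheduling described in this list could look roughly like the sketch below. It is not the actual write_stress_runner.py: `split_runtime`, `build_command` and all probabilities are assumptions. Only the flag names (`--runtime_sec`, `--destroy_db`, `--delete_obsolete_files_with_fullscan`, `--low_open_files_mode`) are taken from the command lines shown in the test plan.

```python
import random


def split_runtime(total_sec, rng):
    """Split total_sec into positive part lengths that sum to total_sec."""
    parts = []
    remaining = total_sec
    while remaining > 0:
        # Assumed mix: mostly short parts, occasionally a long one.
        if rng.random() < 0.7:
            part = rng.randint(1, 10)
        else:
            part = rng.choice([100, 1000])
        part = min(part, remaining)  # the last part absorbs the remainder
        parts.append(part)
        remaining -= part
    return parts


def build_command(part_sec, first_run, kill, rng):
    """Build one write_stress command line for a single part of the test."""
    cmd = ["./write_stress"]
    # In kill mode the binary gets no deadline (-1); the runner kills it instead.
    cmd.append("--runtime_sec=%d" % (-1 if kill else part_sec))
    if not first_run:
        # Only the first invocation destroys the old DB.
        cmd.append("--destroy_db=false")
    if rng.random() < 0.5:  # hypothetical probability
        cmd.append("--delete_obsolete_files_with_fullscan=true")
    if rng.random() < 0.5:  # hypothetical probability
        cmd.append("--low_open_files_mode=true")
    return cmd
```

Seeding the random generator (for example `random.Random(42)`) makes a schedule reproducible, which helps when a failing run has to be replayed.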

Another goal was to create a stress test without a lot of parameters. So tools/write_stress_runner.py should take only one parameter, runtime_sec, and figure out everything else on its own.

In a separate diff, I'll add this new test to our nightly legocastle runs.

Test Plan:
The goal of this test was to retroactively catch the following bugs: D33045, D48201, D46899, D42399. I failed to reproduce D48201, but all others have been caught!

When I reverted https://reviews.facebook.net/D33045:

    ./write_stress --runtime_sec=200 --low_open_files_mode=true
    Iterator statuts not OK: IO error: /fast-rocksdb-tmp/rocksdb_test/write_stress/089166.sst: No such file or directory

When I reverted https://reviews.facebook.net/D42399:

    python tools/write_stress_runner.py --runtime_sec=5000
    Running write_stress, will kill after 5 seconds: ./write_stress --runtime_sec=-1
    Running write_stress, will kill after 2 seconds: ./write_stress --runtime_sec=-1 --destroy_db=false --delete_obsolete_files_with_fullscan=true
    Running write_stress, will kill after 7 seconds: ./write_stress --runtime_sec=-1 --destroy_db=false
    Running write_stress, will kill after 5 seconds: ./write_stress --runtime_sec=-1 --destroy_db=false
    Running write_stress, will kill after 8 seconds: ./write_stress --runtime_sec=-1 --destroy_db=false --low_open_files_mode=true
    Write to DB failed: IO error: /fast-rocksdb-tmp/rocksdb_test/write_stress/019250.sst: No such file or directory
    ERROR: write_stress died with exitcode=-6

When I reverted https://reviews.facebook.net/D46899:

    python tools/write_stress_runner.py --runtime_sec=1000
    runtime: 1000
    Going to execute write stress for [3, 3, 100, 3, 2, 100, 1, 788]
    Running write_stress for 3 seconds: ./write_stress --runtime_sec=3 --low_open_files_mode=true
    Running write_stress for 3 seconds: ./write_stress --runtime_sec=3 --destroy_db=false --delete_obsolete_files_with_fullscan=true
    Running write_stress, will kill after 100 seconds: ./write_stress --runtime_sec=-1 --destroy_db=false --delete_obsolete_files_with_fullscan=true
    write_stress: db/db_impl.cc:2070: void rocksdb::DBImpl::MarkLogsSynced(uint64_t, bool, const rocksdb::Status&): Assertion `log.getting_synced' failed.
    ERROR: write_stress died with exitcode=-6
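
The `exitcode=-6` in the logs above is consistent with how Python's `subprocess` module reports a child terminated by a signal on POSIX: the return code is the negated signal number, and 6 is SIGABRT on Linux, which is what a failed `assert` raises via `abort()`. The sketch below shows how a runner might kill a child after a timeout and tell its own kill apart from a crash. It assumes the runner uses `subprocess`; the helper names are hypothetical.

```python
import signal
import subprocess
import sys


def run_and_maybe_kill(cmd, kill_after_sec=None):
    """Run cmd; if kill_after_sec is set, SIGKILL it after that many seconds.

    Returns (returncode, killed_by_runner).
    """
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=kill_after_sec)
        return proc.returncode, False
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX: the "violent" kill
        proc.wait()
        return proc.returncode, True


def describe(returncode):
    """Turn a subprocess return code into a readable description (POSIX)."""
    if returncode < 0:
        return "killed by %s" % signal.Signals(-returncode).name
    return "exited with status %d" % returncode


def demo_abort():
    """Run a child that calls abort(), as a failed C assert would."""
    return run_and_maybe_kill([sys.executable, "-c", "import os; os.abort()"])
```

A runner built this way would treat a negative return code as a failure only when `killed_by_runner` is false, since a kill it issued itself is expected in kill mode.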

Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, sdong, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D49533
2015-10-28 16:15:07 -07:00
dump Update dump_tool and undump_tool to accept Options 2015-10-05 19:49:48 -07:00
rdb first rdb commit 2014-11-20 23:33:00 -05:00
auto_sanity_test.sh Make auto_sanity_test always use the db_sanity_test.cc of the newer commit. 2015-03-27 11:32:49 -07:00
benchmark.sh Fix benchmark report script 2015-08-22 12:18:00 -07:00
benchmark_leveldb.sh Add scripts to run leveldb benchmark 2015-04-27 19:32:56 -07:00
check_format_compatible.sh Script to check whether RocksDB can read DB generated by previous releases and vice versa 2015-04-08 16:04:59 -07:00
db_crashtest.py crash_test to trigger some less frequent crash point more frequently 2015-10-27 12:06:06 -07:00
db_repl_stress.cc "make format" against last 10 commits 2015-07-13 13:50:18 -07:00
db_sanity_test.cc Add ZSTD (not final format) compression type 2015-08-28 11:01:13 -07:00
db_stress.cc Allow users to disable some kill points in db_stress 2015-10-15 14:33:13 -07:00
dbench_monitor Added simple monitoring script to monitor overusage of memory in db_bench 2015-02-11 18:40:11 -08:00
Dockerfile adding docker build script and dockerfile 2015-05-22 16:03:39 -07:00
generate_random_db.sh Script to check whether RocksDB can read DB generated by previous releases and vice versa 2015-04-08 16:04:59 -07:00
ldb.cc Make db_stress built for ROCKSDB_LITE 2014-11-14 10:20:51 -08:00
ldb_cmd.cc log_reader: pass log_number and optional info_log to ctor 2015-10-18 21:24:32 -04:00
ldb_cmd.h Move ldb and sst_dump from utils to tools. 2015-10-14 17:08:28 -07:00
ldb_cmd_execute_result.h Move ldb and sst_dump from utils to tools. 2015-10-14 17:08:28 -07:00
ldb_cmd_test.cc Block tests under ROCKSDB_LITE 2015-10-15 10:51:00 -07:00
ldb_test.py Tests for ManifestDumpCommand and ListColumnFamiliesCommand 2015-09-08 14:23:42 -07:00
ldb_tool.cc Move ldb and sst_dump from utils to tools. 2015-10-14 17:08:28 -07:00
pflag Added simple monitoring script to monitor overusage of memory in db_bench 2015-02-11 18:40:11 -08:00
reduce_levels_test.cc Move ldb and sst_dump from utils to tools. 2015-10-14 17:08:28 -07:00
rocksdb_dump_test.sh Update dump_tool and undump_tool to accept Options 2015-10-05 19:49:48 -07:00
run_flash_bench.sh Improve defaults for benchmarks 2015-08-20 18:59:10 -07:00
run_leveldb.sh Add scripts to run leveldb benchmark 2015-04-27 19:32:56 -07:00
sample-dump.dmp First version of rocksdb_dump and rocksdb_undump. 2015-06-19 16:24:36 -07:00
sst_dump.cc Make db_stress built for ROCKSDB_LITE 2014-11-14 10:20:51 -08:00
sst_dump_test.cc Move ldb and sst_dump from utils to tools. 2015-10-14 17:08:28 -07:00
sst_dump_tool.cc Move ldb and sst_dump from utils to tools. 2015-10-14 17:08:28 -07:00
sst_dump_tool_imp.h Move ldb and sst_dump from utils to tools. 2015-10-14 17:08:28 -07:00
verify_random_db.sh Script to check whether RocksDB can read DB generated by previous releases and vice versa 2015-04-08 16:04:59 -07:00
write_stress.cc Write stress test 2015-10-28 16:15:07 -07:00
write_stress_runner.py Write stress test 2015-10-28 16:15:07 -07:00