rocksdb/db
Igor Canadi 42ea795209 Fix concurrency issue in CompactionPicker
Summary:
I am currently working on a project that uses RocksDB. While debugging some perf issues, I came across an interesting compaction concurrency issue. Namely, I had 15 idle threads and a good compaction to do, but CompactionPicker returned "Compaction nothing to do". Here's how the internal stats looked:

    2014/08/22-08:08:04.551982 7fc7fc3f5700 ------- DUMPING STATS -------
    2014/08/22-08:08:04.552000 7fc7fc3f5700
    ** Compaction Stats [default] **
    Level   Files   Size(MB) Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s)  Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt)  Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms)
    ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      L0     7/5        353   1.0      0.0     0.0      0.0       2.3      2.3    0.0   0.0      0.0      9.4        0         0         0         0        247        46    5.359       8.53          1 8526.25
      L1     2/2         86   1.3      2.6     1.9      0.7       2.6      1.9    2.7   1.3     24.3     24.0       39        19        71        52        109        11    9.938       0.00          0    0.00
      L2    26/0        833   1.3      5.7     1.7      4.0       5.2      1.2    6.3   3.0     15.6     14.2       47       112       147        35        373        44    8.468       0.00          0    0.00
      L3    12/0        505   0.1      0.0     0.0      0.0       0.0      0.0    0.0   0.0      0.0      0.0        0         0         0         0          0         0    0.000       0.00          0    0.00
     Sum    47/7       1778   0.0      8.3     3.6      4.6      10.0      5.4    8.1   4.4     11.6     14.1       86       131       218        87        728       101    7.212       8.53          1 8526.25
     Int     0/0          0   0.0      2.4     0.8      1.6       2.7      1.2   11.5   6.1     12.0     13.6       20        43        63        20        203        23    8.845       0.00          0    0.00
    Flush(GB): accumulative 2.266, interval 0.444
    Stalls(secs): 0.000 level0_slowdown, 0.000 level0_numfiles, 8.526 memtable_compaction, 0.000 leveln_slowdown_soft, 0.000 leveln_slowdown_hard
    Stalls(count): 0 level0_slowdown, 0 level0_numfiles, 1 memtable_compaction, 0 leveln_slowdown_soft, 0 leveln_slowdown_hard

    ** DB Stats **
    Uptime(secs): 336.8 total, 60.4 interval
    Cumulative writes: 61584000 writes, 6480589 batches, 9.5 writes per batch, 1.39 GB user ingest
    Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, 0.00 GB written
    Interval writes: 11235257 writes, 1175050 batches, 9.6 writes per batch, 259.9 MB user ingest
    Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, 0.00 MB written

To see what happened, go here: 47b452cfcf/db/compaction_picker.cc (L430)
* The for loop started with level 1, because it has the worst score.
* PickCompactionBySize on L429 returned nullptr because all files were being compacted
* ExpandWhileOverlapping(c) returned true (because that's what it does when it gets nullptr!?)
* The for loop broke out, never trying compactions for level 2 :( :(
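
The control flow in the steps above can be sketched with a small, self-contained toy model. This is not the RocksDB source: `ToyLevel`, `ToyCompaction`, `PickBySize`, `PickBuggy`, `PickFixed` and `RunDemo` are hypothetical names made up for illustration, and the null check in `PickFixed` is an assumed shape for the fix, not a copy of the actual patch. The only behavior taken from the summary is that the size-based pick returns nullptr when all files in a level are being compacted, and that `ExpandWhileOverlapping` returns true when handed nullptr.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-ins for illustration; these are not RocksDB types.
struct ToyCompaction {
  int level;
};

struct ToyLevel {
  int level;
  double score;
  bool all_files_being_compacted;
};

// Returns nullptr when every file in the level is already being compacted.
std::unique_ptr<ToyCompaction> PickBySize(const ToyLevel& l) {
  if (l.all_files_being_compacted) return nullptr;
  return std::unique_ptr<ToyCompaction>(new ToyCompaction{l.level});
}

// Models the behavior described in the summary: nullptr yields true.
// For a real candidate, this toy assumes expansion always succeeds.
bool ExpandWhileOverlapping(const ToyCompaction* /*c*/) { return true; }

// Pre-patch shape: a null candidate still ends the loop.
std::unique_ptr<ToyCompaction> PickBuggy(const std::vector<ToyLevel>& by_score) {
  std::unique_ptr<ToyCompaction> c;
  for (const ToyLevel& l : by_score) {
    if (l.score < 1.0) continue;  // level does not need compaction
    c = PickBySize(l);
    if (ExpandWhileOverlapping(c.get())) break;  // breaks even when c is null
    c.reset();
  }
  return c;
}

// Assumed shape of the fix: only stop once a real candidate survives.
std::unique_ptr<ToyCompaction> PickFixed(const std::vector<ToyLevel>& by_score) {
  std::unique_ptr<ToyCompaction> c;
  for (const ToyLevel& l : by_score) {
    if (l.score < 1.0) continue;
    c = PickBySize(l);
    if (c != nullptr && ExpandWhileOverlapping(c.get())) break;
    c.reset();  // nothing usable here; try the next level
  }
  return c;
}

// Levels ordered by score, using the pre-patch stats: L1 has score 1.3 with
// 2 of 2 files being compacted, L2 has score 1.3 with 0 of 26.
void RunDemo() {
  std::vector<ToyLevel> by_score = {{1, 1.3, true}, {2, 1.3, false}};
  assert(PickBuggy(by_score) == nullptr);  // "Compaction nothing to do"
  std::unique_ptr<ToyCompaction> c = PickFixed(by_score);
  assert(c != nullptr && c->level == 2);   // level 2 gets its turn
}
```

Calling `RunDemo()` from a `main` exercises both variants. L0 is left out of the toy input to keep the example focused on the L1 and L2 interaction.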

This bug was present at least since January. I have no idea how we didn't find this sooner.

Test Plan:
Unit testing the compaction picker is hard. I tested this by running my service and observing L0->L1 and L2->L3 compactions in parallel. For the long term, however, I opened task #4968469. @yhchiang is currently refactoring CompactionPicker; hopefully the new version will be unit-testable ;)

Here's what my compactions look like after the patch:

    2014/08/22-08:50:02.166699 7f3400ffb700 ------- DUMPING STATS -------
    2014/08/22-08:50:02.166722 7f3400ffb700
    ** Compaction Stats [default] **
    Level   Files   Size(MB) Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s)  Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt)  Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms)
    ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      L0     8/5        404   1.5      0.0     0.0      0.0       4.3      4.3    0.0   0.0      0.0      9.6        0         0         0         0        463        88    5.260       0.00          0    0.00
      L1     2/2         60   0.9      4.8     3.9      0.8       4.7      3.9    2.4   1.2     23.9     23.6       80        23       131       108        204        19   10.747       0.00          0    0.00
      L2    23/3        697   1.0     11.6     3.5      8.1      10.9      2.8    6.4   3.1     17.7     16.6       95       242       317        75        669        92    7.268       0.00          0    0.00
      L3    58/14      2207   0.3      6.2     1.6      4.6       5.9      1.3    7.4   3.6     14.6     13.9       43       121       159        38        436        36   12.106       0.00          0    0.00
     Sum    91/24      3368   0.0     22.5     9.1     13.5      25.8     12.4   11.2   6.0     13.0     14.9      218       386       607       221       1772       235    7.538       0.00          0    0.00
     Int     0/0          0   0.0      3.2     0.9      2.3       3.6      1.3   15.3   8.0     12.4     13.7       24        66        89        23        266        27    9.838       0.00          0    0.00
    Flush(GB): accumulative 4.336, interval 0.444
    Stalls(secs): 0.000 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 0.000 leveln_slowdown_soft, 0.000 leveln_slowdown_hard
    Stalls(count): 0 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 0 leveln_slowdown_soft, 0 leveln_slowdown_hard

    ** DB Stats **
    Uptime(secs): 577.7 total, 60.1 interval
    Cumulative writes: 116960736 writes, 11966220 batches, 9.8 writes per batch, 2.64 GB user ingest
    Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, 0.00 GB written
    Interval writes: 11643735 writes, 1206136 batches, 9.7 writes per batch, 269.2 MB user ingest
    Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, 0.00 MB written

Yay for concurrent L0->L1 and L2->L3 compactions!

Reviewers: sdong, yhchiang, ljin

Reviewed By: yhchiang

Subscribers: yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D22305
2014-08-22 11:32:40 -07:00
..
builder.cc Remove a check for merge operator in builder.cc 2014-07-31 14:22:21 -07:00
builder.h integrate rate limiter into rocksdb 2014-07-08 12:31:49 -07:00
c.cc Fix typo, add missing inclusion of state void* in invocation of 2014-08-06 18:42:15 -04:00
c_test.c Fix the error of c_test.c 2014-08-20 17:05:29 -07:00
column_family.cc WriteBatchWithIndex: a wrapper of WriteBatch, with a searchable index 2014-08-18 16:37:38 -07:00
column_family.h WriteBatchWithIndex: a wrapper of WriteBatch, with a searchable index 2014-08-18 16:37:38 -07:00
column_family_test.cc Flush only one column family 2014-08-11 22:10:32 -07:00
compaction.cc Changes to support unity build: 2014-08-11 13:22:47 -04:00
compaction.h Changes to support unity build: 2014-08-11 13:22:47 -04:00
compaction_picker.cc Fix concurrency issue in CompactionPicker 2014-08-22 11:32:40 -07:00
compaction_picker.h Changes to support unity build: 2014-08-11 13:22:47 -04:00
corruption_test.cc Fix corruption test 2014-04-24 14:56:41 -04:00
cuckoo_table_db_test.cc Fixed compile errors (signed / unsigned comparison) in cuckoo_table_db_test on Mac 2014-08-12 17:35:09 -07:00
db_bench.cc Adding Column Family support in db_bench. 2014-08-18 18:15:01 -07:00
db_filesnapshot.cc Support Multiple DB paths (without having an interface to expose to users) 2014-07-02 21:14:44 -07:00
db_impl.cc Improve Options sanitization and add MmapReadRequired() to TableFactory 2014-08-20 15:53:39 -07:00
db_impl.h log db path info before open 2014-08-13 13:45:13 -07:00
db_impl_debug.cc Allow user to specify DB path of output file of manual compaction 2014-07-21 19:06:00 -07:00
db_impl_readonly.cc NewIterators in read-only mode 2014-07-23 16:52:11 -04:00
db_impl_readonly.h Fix readonly db 2014-07-30 18:21:55 -07:00
db_iter.cc Add histogram for DB_SEEK 2014-08-13 15:56:37 -07:00
db_iter.h In DB::NewIterator(), try to allocate the whole iterator tree in an arena 2014-06-02 17:44:57 -07:00
db_iter_test.cc Fix clang compiler warnings 2014-07-20 22:57:20 +08:00
db_test.cc Improve Options sanitization and add MmapReadRequired() to TableFactory 2014-08-20 15:53:39 -07:00
dbformat.cc macros for perf_context 2014-04-08 10:58:07 -07:00
dbformat.h WriteBatchWithIndex: a wrapper of WriteBatch, with a searchable index 2014-08-18 16:37:38 -07:00
dbformat_test.cc Use IterKey instead of string in Block::Iter to reduce malloc 2014-07-23 12:31:11 -07:00
deletefile_test.cc Start DeleteFileTest with clean plate 2013-11-15 16:30:23 -08:00
file_indexer.cc Allow user to specify DB path of output file of manual compaction 2014-07-21 19:06:00 -07:00
file_indexer.h Allow user to specify DB path of output file of manual compaction 2014-07-21 19:06:00 -07:00
file_indexer_test.cc Allow user to specify DB path of output file of manual compaction 2014-07-21 19:06:00 -07:00
filename.cc Support purging logs from separate log directory 2014-08-14 13:22:50 -07:00
filename.h Support purging logs from separate log directory 2014-08-14 13:22:50 -07:00
filename_test.cc Support purging logs from separate log directory 2014-08-14 13:22:50 -07:00
forward_iterator.cc ForwardIterator seek bugfix 2014-07-10 16:46:13 -07:00
forward_iterator.h Fix compile errors on Mac 2014-06-03 12:28:58 -07:00
internal_stats.cc Minor: fix a format 2014-08-06 18:11:33 -07:00
internal_stats.h Add DB property "rocksdb.estimate-table-readers-mem" 2014-08-06 11:39:46 -07:00
log_and_apply_bench.cc Fix ldb dump_manifest 2014-07-30 10:17:48 -07:00
log_format.h Some minor refactoring on the code 2014-01-02 16:32:31 -08:00
log_reader.cc Make Log::Reader more robust 2014-02-28 13:19:47 -08:00
log_reader.h Fix UnmarkEOF for partial blocks 2014-01-27 14:49:10 -08:00
log_test.cc Make it compile on Debian/GCC 4.7 2014-03-14 22:44:35 +00:00
log_writer.cc Add appropriate LICENSE and Copyright message. 2013-10-16 17:48:41 -07:00
log_writer.h Add appropriate LICENSE and Copyright message. 2013-10-16 17:48:41 -07:00
memtable.cc Fixed a typo in the comment for merge operator. 2014-07-30 17:25:11 -07:00
memtable.h In DB::NewIterator(), try to allocate the whole iterator tree in an arena 2014-06-02 17:44:57 -07:00
memtable_list.cc Support Multiple DB paths (without having an interface to expose to users) 2014-07-02 21:14:44 -07:00
memtable_list.h Support Multiple DB paths (without having an interface to expose to users) 2014-07-02 21:14:44 -07:00
merge_context.h Enhance partial merge to support multiple arguments 2014-03-24 17:57:13 -07:00
merge_helper.cc Fixed the crash when merge_operator is not properly set after reopen. 2014-07-30 17:24:36 -07:00
merge_helper.h Fixed the crash when merge_operator is not properly set after reopen. 2014-07-30 17:24:36 -07:00
merge_operator.cc Some small cleaning up to make some compiling environment happy 2014-03-26 18:11:41 -07:00
merge_test.cc Temporary remove the last test in merge_test 2014-07-31 11:20:49 -07:00
perf_context_test.cc Missing includes 2014-03-14 13:02:20 -07:00
plain_table_db_test.cc Add DB property "rocksdb.estimate-table-readers-mem" 2014-08-06 11:39:46 -07:00
prefix_test.cc HashLinkList memtable switches a bucket to a skip list to reduce performance outliers 2014-07-01 17:14:15 -07:00
repair.cc Remove malloc from FormatFileNumber 2014-08-13 11:57:40 -07:00
simple_table_db_test.cc Add missing implementation of SanitizeDBOptions in simple_table_db_test.cc 2014-08-20 16:33:25 -07:00
skiplist.h Consolidate SliceTransform object ownership 2014-03-10 12:56:46 -07:00
skiplist_test.cc Clean up arena API 2014-01-30 22:10:10 -08:00
snapshot.h Add appropriate LICENSE and Copyright message. 2013-10-16 17:48:41 -07:00
table_cache.cc Add DB property "rocksdb.estimate-table-readers-mem" 2014-08-06 11:39:46 -07:00
table_cache.h Add DB property "rocksdb.estimate-table-readers-mem" 2014-08-06 11:39:46 -07:00
table_properties_collector.cc Extract metaindex block from block-based table 2013-12-05 16:34:16 -08:00
table_properties_collector.h TablePropertiesCollectorFactory 2014-05-13 12:30:55 -07:00
table_properties_collector_test.cc Add PlainTableOptions 2014-07-18 00:08:38 -07:00
transaction_log_impl.cc Fixed a file-not-found issue when a log file is moved to archive. 2014-05-12 17:50:21 -07:00
transaction_log_impl.h RocksDBLite 2014-04-15 13:39:26 -07:00
version_edit.cc Support Multiple DB paths (without having an interface to expose to users) 2014-07-02 21:14:44 -07:00
version_edit.h Avoid retrying to read property block from a table when it does not exist. 2014-08-15 12:17:44 -07:00
version_edit_test.cc Support Multiple DB paths (without having an interface to expose to users) 2014-07-02 21:14:44 -07:00
version_set.cc Avoid retrying to read property block from a table when it does not exist. 2014-08-15 12:17:44 -07:00
version_set.h Add DB property "rocksdb.estimate-table-readers-mem" 2014-08-06 11:39:46 -07:00
version_set_test.cc Fix clang compiler warnings 2014-07-20 22:57:20 +08:00
write_batch.cc WriteBatchWithIndex: a wrapper of WriteBatch, with a searchable index 2014-08-18 16:37:38 -07:00
write_batch_internal.h JSON (Document) API sketch 2014-07-10 09:31:42 -07:00
write_batch_test.cc WriteBatchWithIndex: a wrapper of WriteBatch, with a searchable index 2014-08-18 16:37:38 -07:00