Commit graph

358 commits

Author SHA1 Message Date
Mayank Agarwal bf66c10b13 Use KeyMayExist for WriteBatch-Deletes
Summary:
Introduced KeyMayExist checking during writebatch-delete and removed from Outer Delete API because it uses writebatch-delete.
Added code to skip getting Table from disk if not already present in table_cache.
Some renaming of variables.
Introduced KeyMayExistImpl which allows checking since specified sequence number in GetImpl useful to check partially written writebatch.
Changed KeyMayExist to not be pure virtual and provided a default implementation.
Expanded unit-tests in db_test to check appropriately.
Ran db_stress for 1 hour with ./db_stress --max_key=100000 --ops_per_thread=10000000 --delpercent=50 --filter_deletes=1 --statistics=1.

Test Plan: db_stress;make check

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb, xjin

Differential Revision: https://reviews.facebook.net/D11745
2013-07-23 13:36:50 -07:00
Haobo Xu d364eea1fc [RocksDB] Fix FindMinimumEmptyLevelFitting
Summary: as title

Test Plan: make check;

Reviewers: xjin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11751
2013-07-22 12:31:43 -07:00
Haobo Xu 9ee68871dc [RocksDB] Enable manual compaction to move files back to an appropriate level.
Summary: As title. This diff added an option reduce_level to CompactRange. When set to true, it will try to move the files back to the minimum level sufficient to hold the data set. Note that the default is set to true now, just to excerise it in all existing tests. Will set the default to false before check-in, for backward compatibility.

Test Plan: make check;

Reviewers: dhruba, emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11553
2013-07-19 16:20:36 -07:00
Dhruba Borthakur 4a745a5666 Merge branch 'master' into performance
Conflicts:
	db/version_set.cc
	include/leveldb/options.h
	util/options.cc
2013-07-17 15:05:57 -07:00
Dhruba Borthakur 76a4923307 Make file-sizes and grandparentoverlap to be unsigned to avoid bad comparisions.
Summary:
The maxGrandParentOverlapBytes_ was signed which was causing
an erroneous comparision between signed and unsigned longs.
This, in turn, was causing compaction-created-output-files
to be very small in size.

Test Plan: make check

Differential Revision: https://reviews.facebook.net/D11727
2013-07-17 14:26:06 -07:00
Mayank Agarwal e9b675bd94 Fix memory leak in KeyMayExist test part of db_test
Summary: NewBloomFilterPolicy call requires Delete to be called later on

Test Plan: make; valgrind ./db_test

Reviewers: haobo, dhruba, vamsi

Differential Revision: https://reviews.facebook.net/D11667
2013-07-12 16:58:57 -07:00
Mayank Agarwal 2a986919d6 Make rocksdb-deletes faster using bloom filter
Summary:
Wrote a new function in db_impl.c-CheckKeyMayExist that calls Get but with a new parameter turned on which makes Get return false only if bloom filters can guarantee that key is not in database. Delete calls this function and if the option- deletes_use_filter is turned on and CheckKeyMayExist returns false, the delete will be dropped saving:
1. Put of delete type
2. Space in the db,and
3. Compaction time

Test Plan:
make all check;
will run db_stress and db_bench and enhance unit-test once the basic design gets approved

Reviewers: dhruba, haobo, vamsi

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11607
2013-07-11 12:11:11 -07:00
Xing Jin 8a5341ec7d Newbie code question
Summary:
This diff is more about my question when reading compaction codes,
instead of a normal diff. I don't quite understand the logic here.

Test Plan: I didn't do any test. If this is a bug, I will continue doing some test.

Reviewers: haobo, dhruba, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11661
2013-07-11 09:03:40 -07:00
Dhruba Borthakur 289efe9922 Update statistics only if needed.
Summary:
Update statistics only if needed.

Test Plan:

Reviewers:

CC:

Task ID: #

Blame Rev:
2013-07-09 16:17:00 -07:00
Dhruba Borthakur 7cb8d462d5 Rename PickCompactionHybrid to PickCompactionUniversal.
Summary:
Rename PickCompactionHybrid to PickCompactionUniversal.
Changed a few LOG message from "Hybrid:" to "Universal:".

Test Plan:

Reviewers:

CC:

Task ID: #

Blame Rev:
2013-07-09 16:08:54 -07:00
Haobo Xu a8d5f8dde2 [RocksDB] Remove old readahead options
Summary: As title.

Test Plan: make check; db_bench

Reviewers: dhruba, MarkCallaghan

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11643
2013-07-09 11:22:33 -07:00
Haobo Xu 9ba82786ce [RocksDB] Provide contiguous sequence number even in case of write failure
Summary: Replication logic would be simplifeid if we can guarantee that write sequence number is always contiguous, even if write failure occurs. Dhruba and I looked at the sequence number generation part of the code. It seems fixable. Note that if WAL was successful and insert into memtable was not, we would be in an unfortunate state. The approach in this diff is : IO error is expected and error status will be returned to client, sequence number will not be advanced; In-mem error is not expected and we panic.

Test Plan: make check; db_stress

Reviewers: dhruba, sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11439
2013-07-08 15:31:09 -07:00
Dhruba Borthakur 116ec527f2 Renamed 'hybrid_compaction' tp be "Universal Compaction'.
Summary:
All the universal compaction parameters are encapsulated in
a new file universal_compaction.h

Test Plan:
make check
2013-07-03 15:47:53 -07:00
Dhruba Borthakur 47c4191fe8 Reduce write amplification by merging files in L0 back into L0
Summary:
There is a new option called hybrid_mode which, when switched on,
causes HBase style compactions.  Files from L0 are
compacted back into L0. This meat of this compaction algorithm
is in PickCompactionHybrid().

All files reside in L0. That means all files have overlapping
keys. Each file has a time-bound, i.e. each file contains a
range of keys that were inserted around the same time. The
start-seqno and the end-seqno refers to the timeframe when
these keys were inserted.  Files that have contiguous seqno
are compacted together into a larger file. All files are
ordered from most recent to the oldest.

The current compaction algorithm starts to look for
candidate files starting from the most recent file. It continues to
add more files to the same compaction run as long as the
sum of the files chosen till now is smaller than the next
candidate file size. This logic needs to be debated
and validated.

The above logic should reduce write amplification to a
large extent... will publish numbers shortly.

Test Plan: dbstress runs for 6 hours with no data corruption (tested so far).

Differential Revision: https://reviews.facebook.net/D11289
2013-06-30 20:07:04 -07:00
Haobo Xu 71e0f695c1 [RocksDB] Expose count for WriteBatch
Summary: As title. Exposed a Count function that returns the number of updates in a batch. Could be handy for replication sequence number check.

Test Plan: make check;

Reviewers: emayanke, sheki, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11523
2013-06-26 15:13:21 -07:00
Abhishek Kona 5ef6bb8c37 [rocksdb][refactor] statistic printing code to one place
Summary: $title

Test Plan: db_bench --statistics=1

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11373
2013-06-18 20:28:41 -07:00
Haobo Xu 3cc1af2062 [RocksDB] Option for incremental sync
Summary: This diff added an option to control the incremenal sync frequency. db_bench has a new flag bytes_per_sync for easy tuning exercise.

Test Plan: make check; db_bench

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11295
2013-06-18 15:00:32 -07:00
Abhishek Kona 79f4fd2b62 [Rocksdb] Simplify Printing code in db_bench
Summary:
simplify the printing code in db_bench
         use TickersMap and HistogramsNameMap introduced in previous diffs.

Test Plan: ./db_bench --statistics=1 and see if all the statistics are printed

Reviewers: haobo, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11355
2013-06-18 14:58:00 -07:00
Dhruba Borthakur 6acbe0fc45 Compact multiple memtables before flushing to storage.
Summary:
Merge multiple multiple memtables in memory before writing it
out to a file in L0.

There is a new config parameter min_write_buffer_number_to_merge
that specifies the number of write buffers that should be merged
together to a single file in storage. The system will not flush
wrte buffers to storage unless at least these many buffers have
accumulated in memory.
The default value of this new parameter is 1, which means that
a write buffer will be immediately flushed to disk as soon it is
ready.

Test Plan: make check

Differential Revision: https://reviews.facebook.net/D11241
2013-06-18 14:28:04 -07:00
Abhishek Kona 00124683de [rocksdb] do not trim range for level0 in manual compaction
Summary:
https://code.google.com/p/leveldb/issues/detail?can=1&q=178&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary&id=178

Ported the solution as is to RocksDB.

Test Plan: moved the unit test as manual_compaction_test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11331
2013-06-17 13:58:17 -07:00
Abhishek Kona bff718d81c [Rocksdb] Implement filluniquerandom
Summary:
Use a bit set to keep track of which random number is generated.
        Currently only supports single-threaded. All our perf tests are run with threads=1
        Copied over bitset implementation from common/datastructures

Test Plan: printed the generated keys, and verified all keys were present.

Reviewers: MarkCallaghan, haobo, dhruba

Reviewed By: MarkCallaghan

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11247
2013-06-14 16:17:56 -07:00
Deon Nicholas 2a52e1dcb6 Fix db_bench for release build.
Test Plan: make release

Reviewers: haobo, dhruba, jpaton

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11307
2013-06-14 16:00:47 -07:00
Haobo Xu 1afdf28701 [RocksDB] Compaction Filter Cleanup
Summary: This hopefully gives the right semantics to compaction filter. Will write a small wiki to explain the ideas.

Test Plan: make check; db_stress

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11121
2013-06-14 14:23:08 -07:00
Abhishek Kona 7a5f71d19a [Rocksdb] measure table open io in a histogram
Summary: Table is setup for compaction using Table::SetupForCompaction. So read block calls can be differentiated b/w Gets/Compaction. Use this and measure times.

Test Plan: db_bench --statistics=1

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb, MarkCallaghan

Differential Revision: https://reviews.facebook.net/D11217
2013-06-13 17:25:09 -07:00
Deon Nicholas 4985a9f73b [Rocksdb] [Multiget] Introduced multiget into db_bench
Summary:
Preliminary! Introduced the --use_multiget=1 and --keys_per_multiget=n
flags for db_bench. Also updated and tested the ReadRandom() method
to include an option to use multiget. By default,
keys_per_multiget=100.

Preliminary tests imply that multiget is at least 1.25x faster per
key than regular get.

Will continue adding Multiget for ReadMissing, ReadHot,
RandomWithVerify, ReadRandomWriteRandom; soon. Will also think
about ways to better verify benchmarks.

Test Plan:
1. make db_bench
2. ./db_bench --benchmarks=fillrandom
3. ./db_bench --benchmarks=readrandom --use_existing_db=1
	      --use_multiget=1 --threads=4 --keys_per_multiget=100
4. ./db_bench --benchmarks=readrandom --use_existing_db=1
	      --threads=4
5. Verify ops/sec (and 1000000 of 1000000 keys found)

Reviewers: haobo, MarkCallaghan, dhruba

Reviewed By: MarkCallaghan

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11127
2013-06-12 12:42:21 -07:00
Haobo Xu bdf1085944 [RocksDB] cleanup EnvOptions
Summary:
This diff simplifies EnvOptions by treating it as POD, similar to Options.
- virtual functions are removed and member fields are accessed directly.
- StorageOptions is removed.
- Options.allow_readahead and Options.allow_readahead_compactions are deprecated.
- Unused global variables are removed: useOsBuffer, useFsReadAhead, useMmapRead, useMmapWrite

Test Plan: make check; db_stress

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11175
2013-06-12 11:17:19 -07:00
Dhruba Borthakur e673d5d26d Do not submit multiple simultaneous seek-compaction requests.
Summary:
The code was such that if multi-threaded-compactions as well
as seek compaction are enabled then it submits multiple
compaction request for the same range of keys. This causes
extraneous sst-files to accumulate at various levels.

Test Plan:
I am not able to write a very good unit test for this one
but can easily reproduce this bug with 'dbstress' with the
following options.

batch=1;maxk=100000000;ops=100000000;ro=0;fm=2;bpl=10485760;of=500000; wbn=3; mbc=20; mb=2097152; wbs=4194304; dds=1; sync=0;  t=32; bs=16384; cs=1048576; of=500000; ./db_stress --disable_seek_compaction=0 --mmap_read=0 --threads=$t --block_size=$bs --cache_size=$cs --open_files=$of --verify_checksum=1 --db=/data/mysql/leveldb/dbstress.dir --sync=$sync --disable_wal=1 --disable_data_sync=$dds --write_buffer_size=$wbs --target_file_size_base=$mb --target_file_size_multiplier=$fm --max_write_buffer_number=$wbn --max_background_compactions=$mbc --max_bytes_for_level_base=$bpl --reopen=$ro --ops_per_thread=$ops --max_key=$maxk --test_batches_snapshots=$batch

Reviewers: leveldb, emayanke

Reviewed By: emayanke

Differential Revision: https://reviews.facebook.net/D11055
2013-06-10 15:49:19 -07:00
Dhruba Borthakur 1b69f1e584 Fix refering freed memory in earlier commit.
Summary: Fix refering freed memory in earlier commit by https://reviews.facebook.net/D11181

Test Plan: make check

Reviewers: haobo, sheki

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11193
2013-06-10 15:08:13 -07:00
Dhruba Borthakur c5de1b9391 Print name of user comparator in LOG.
Summary:
The current code prints the name of the InternalKeyComparator
in the log file. We would also like to print the name of the
user-specified comparator for easier debugging.

Test Plan: make check

Reviewers: sheki

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11181
2013-06-10 12:11:55 -07:00
Mayank Agarwal 184343a061 Max_mem_compaction_level can have maximum value of num_levels-1
Summary:
Without this files could be written out to a level greater than the maximum level possible and is the source of the segfaults that wormhole awas getting. The sequence of steps that was followed:
1. WriteLevel0Table was called when memtable was to be flushed for a file.
2. PickLevelForMemTableOutput was called to determine the level to which this file should be pushed.
3. PickLevelForMemTableOutput returned a wrong result because max_mem_compaction_level was equal to 2 even when num_levels was equal to 0.
The fix to re-initialize max_mem_compaction_level based on num_levels passed seems correct.

Test Plan: make all check; Also made a dummy file to mimic the wormhole-file behaviour which was causing the segfaults and found that the same segfault occurs without this change and not with this.

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11157
2013-06-09 10:38:55 -07:00
Abhishek Kona e982b5a489 [Rocksdb] measure table open io in a histogram
Summary: as title

Test Plan: db_bench --statistics=1 check for statistic.

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11109
2013-06-07 10:02:28 -07:00
Deon Nicholas db1f0cddf3 Fixed valgrind errors 2013-06-05 17:25:16 -07:00
Deon Nicholas d8c7c45ea0 Very basic Multiget and simple test cases.
Summary:
Implemented the MultiGet operator which takes in a list of keys
and returns their associated values. Currently uses std::vector as its
container data structure. Otherwise, it works identically to "Get".

Test Plan:
 1. make db_test      ; compile it
 2. ./db_test         ; test it
 3. make all check    ; regress / run all tests
 4. make release      ; (optional) compile with release settings

Reviewers: haobo, MarkCallaghan, dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10875
2013-06-05 11:22:38 -07:00
Abhishek Kona d91b42ee27 [Rocksdb] Measure all FSYNC/SYNC times
Summary: Add stop watches around all sync calls.

Test Plan: db_bench check if respective histograms are printed

Reviewers: haobo, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11073
2013-06-05 11:06:21 -07:00
Abhishek Kona ee522d0032 [Rocksdb] Log on disable/enable file deletions
Summary: as title

Test Plan: compile

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11085
2013-06-05 10:48:24 -07:00
Mark Callaghan d9f538e1a9 Improve output for GetProperty('leveldb.stats')
Summary:
Display separate values for read, write & total compaction IO.
Display compaction amplification and write amplification.
Add similar values for the period since the last call to GetProperty. Results since the server started
are reported as "cumulative" stats. Results since the last call to GetProperty are reported as
"interval" stats.

Level  Files Size(MB) Time(sec)  Read(MB) Write(MB)    Rn(MB)  Rnp1(MB)  Wnew(MB) Amplify Read(MB/s) Write(MB/s)      Rn     Rnp1     Wnp1     NewW    Count  Ln-stall
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
  0        7       13        21         0       211         0         0       211     0.0       0.0        10.1        0        0        0        0      113       0.0
  1       79      157        88       993       989       198       795       194     9.0      11.3        11.2      106      405      502       97       14       0.0
  2       19       36         5        63        63        37        27        36     2.4      12.3        12.2       19       14       32       18       12       0.0
>>>>>>>>>>>>>>>>>>>>>>>>> text below has been is new and/or reformatted
Uptime(secs): 122.2 total, 0.9 interval
Compaction IO cumulative (GB): 0.21 new, 1.03 read, 1.23 write, 2.26 read+write
Compaction IO cumulative (MB/sec): 1.7 new, 8.6 read, 10.3 write, 19.0 read+write
Amplification cumulative: 6.0 write, 11.0 compaction
Compaction IO interval (MB): 5.59 new, 0.00 read, 5.59 write, 5.59 read+write
Compaction IO interval (MB/sec): 6.5 new, 0.0 read, 6.5 write, 6.5 read+write
Amplification interval: 1.0 write, 1.0 compaction
>>>>>>>>>>>>>>>>>>>>>>>> text above is new and/or reformatted
Stalls(secs): 90.574 level0_slowdown, 0.000 level0_numfiles, 10.165 memtable_compaction, 0.000 leveln_slowdown

Task ID: #

Blame Rev:

Test Plan:
make check, run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11049
2013-06-03 20:24:49 -07:00
Haobo Xu 2b1fb5b01d [RocksDB] Add score column to leveldb.stats
Summary: Added the 'score' column to the compaction stats output, which shows the level total size devided by level target size. Could be useful when monitoring compaction decisions...

Test Plan: make check; db_bench

Reviewers: dhruba

CC: leveldb, MarkCallaghan

Differential Revision: https://reviews.facebook.net/D11025
2013-06-03 17:38:27 -07:00
Haobo Xu d897d33bf1 [RocksDB] Introduce Fast Mutex option
Summary:
This diff adds an option to specify whether PTHREAD_MUTEX_ADAPTIVE_NP will be enabled for the rocksdb single big kernel lock. db_bench also have this option now.
Quickly tested 8 thread cpu bound 100 byte random read.
No fast mutex: ~750k/s ops
With fast mutex: ~880k/s ops

Test Plan: make check; db_bench; db_stress

Reviewers: dhruba

CC: MarkCallaghan, leveldb

Differential Revision: https://reviews.facebook.net/D11031
2013-06-01 23:11:34 -07:00
Haobo Xu ab8d2f6ab2 [RocksDB] [Performance] Allow different posix advice to be applied to the same table file
Summary:
Current posix advice implementation ties up the access pattern hint with the creation of a file.
It is not possible to apply different advice for different access (random get vs compaction read),
without keeping two open files for the same table. This patch extended the RandomeAccessFile interface
to accept new access hint at anytime. Particularly, we are able to set different access hint on the same
table file based on when/how the file is used.
Two options are added to set the access hint, after the file is first opened and after the file is being
compacted.

Test Plan: make check; db_stress; db_bench

Reviewers: dhruba

Reviewed By: dhruba

CC: MarkCallaghan, leveldb

Differential Revision: https://reviews.facebook.net/D10905
2013-05-30 19:08:44 -07:00
Haobo Xu 2df65c118c [RocksDB] Dump counters and histogram data periodically with compaction stats
Summary: As title

Test Plan: make check

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10995
2013-05-29 12:00:18 -07:00
Dhruba Borthakur a8d807ee91 Record the number of open db iterators.
Summary: Enhance the statitics to report the number of open db iterators.

Test Plan: make check

Reviewers: haobo, emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10983
2013-05-29 08:47:08 -07:00
Haobo Xu fb684da082 [RocksDB] Fix CorruptionTest
Summary: Overriding block_size_deviation to zero, so that CorruptionTest can pass.

Test Plan: make check

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D10977
2013-05-28 12:36:42 -07:00
heyongqiang 4b29651206 add block deviation option to terminate a block before it exceeds block_size
Summary: a new option block_size_deviation is added.

Test Plan: run db_test and db_bench

Reviewers: dhruba, haobo

Reviewed By: haobo

Differential Revision: https://reviews.facebook.net/D10821
2013-05-24 15:52:49 -07:00
Haobo Xu ef15b9d178 [RocksDB] Fix MaybeDumpStats
Summary: MaybeDumpStats was causing lock problem

Test Plan: make check; db_stress

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D10935
2013-05-24 15:43:16 -07:00
Haobo Xu 0e879c93de [RocksDB] dump leveldb.stats periodically in LOG file.
Summary:
Added an option stats_dump_period_sec to dump leveldb.stats to LOG periodically for diagnosis.
By defauly, it's set to a very big number 3600 (1 hour).

Test Plan: make check;

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb, zshao

Differential Revision: https://reviews.facebook.net/D10761
2013-05-23 16:56:59 -07:00
Dhruba Borthakur 26541860d3 The max size of the write buffer size can be 64 GB.
Summary: There was an artifical limit on the size of the write buffer size.

Test Plan: make check

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10911
2013-05-23 15:00:27 -07:00
Dhruba Borthakur 898f79372f Fix valgrind errors introduced by https://reviews.facebook.net/D10863
Summary:
The valgrind errors were in the unit tests where we change the
number of levels of a database using internal methods.

Test Plan:
valgrind ./reduce_levels_test
valgrind ./db_test

Reviewers: emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10893
2013-05-23 12:14:41 -07:00
Haobo Xu c2e2460f8a [RocksDB] Expose DBStatistics
Summary: Make Statistics usable by client

Test Plan: make check; db_bench

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D10899
2013-05-23 11:49:38 -07:00
Haobo Xu 87d0af15d8 [RocksDB] Introduce an option to skip log error on recovery
Summary:
Currently, with paranoid_check on, DB::Open will fail on any log read error on recovery.
If client is ok with losing most recent updates, we could simply skip those errors.
However, it's important to introduce an additional flag, so that paranoid_check can
still guard against more serious problems.

Test Plan: make check; db_stress

Reviewers: dhruba, emayanke

Reviewed By: emayanke

CC: leveldb, emayanke

Differential Revision: https://reviews.facebook.net/D10869
2013-05-21 14:30:36 -07:00
Dhruba Borthakur d1aaaf718c Ability to set different size fanout multipliers for every level.
Summary:
There is an existing field Options.max_bytes_for_level_multiplier that
sets the multiplier for the size of each level in the database.

This patch introduces the ability to set different multipliers
for every level in the database. The size of a level is determined
by using both max_bytes_for_level_multiplier as well as the
per-level fanout.

size of level[i] = size of level[i-1] * max_bytes_for_level_multiplier
                   * fanout[i-1]

The default value of fanout is 1, so that it is backward compatible.

Test Plan: make check

Reviewers: haobo, emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10863
2013-05-21 13:50:20 -07:00
Haobo Xu c3c13db346 [RocksDB] [Performance Bug] MemTable::Get Slow
Summary:
The merge operator diff introduced a performance problem in MemTable::Get.
An exit condition is missed when the current key does not match the user key.
This could lead to full memtable scan if the user key is not found.

Test Plan: make check; db_bench

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10851
2013-05-21 13:40:38 -07:00
Abhishek Kona e1174306c5 [RocksDB] Simplify StopWatch implementation
Summary:
Make stop watch a simple implementation, instead of subclass of a virtual class
Allocate stop watches off the stack instead of heap.
Code is more terse now.

Test Plan: make all check, db_bench with --statistics=1

Reviewers: haobo, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10809
2013-05-17 10:55:34 -07:00
Abhishek Kona 446151cd20 [Rocksdb] Remove unused double apis to record into histograms
Summary: Statistics.h and histogram.h had double based api's to record values. Remove them as they are not used anywhere

Test Plan: make all check

Reviewers: haobo, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10815
2013-05-16 10:40:30 -07:00
Haobo Xu 4ca3c67bd3 [RocksDB] Cleanup compaction filter to use a class interface, instead of function pointer and additional context pointer.
Summary:
This diff replaces compaction_filter_args and CompactionFilter with a single compaction_filter parameter. It gives CompactionFilter better encapsulation and a similar look to Comparator and MergeOpertor, which improves consistency of the overall interface.
The change is not backward compatible. Nevertheless, the two references in fbcode are not in production yet.

Test Plan: make check

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb, zshao

Differential Revision: https://reviews.facebook.net/D10773
2013-05-13 14:06:10 -07:00
Haobo Xu 73c0a33346 [RocksDB] fix compaction filter trigger condition
Summary:
Currently, compaction filter is run on internal key older than the oldest snapshot, which is incorrect.
Compaction filter should really be run on the most recent internal key when there is no external snapshot.

Test Plan: make check; db_stress

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D10641
2013-05-13 12:33:02 -07:00
Abhishek Kona 8d58ecdc29 [RocksDB] Expose compaction stalls via db_statistics
Test Plan: make check

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10575
2013-05-10 14:41:45 -07:00
Dhruba Borthakur a8d3aa2c26 Assertion failure for L0-L1 compactions.
Summary:
For level-0 compactions, we try to find if can include more L0 files
in the same compaction run. This causes the 'smallest' and 'largest'
key to get extended to a larger range. But the suceeding call to
ParentRangeInCompaction() was still using the earlier
values of 'smallest' and 'largest',

Because of this bug, a file in L1 can be part of two concurrent
compactions: one L0-L1 compaction and the other L1-L2 compaction.

This should not cause any data loss, but will cause an assertion
failure with debug builds.

Test Plan: make check

Differential Revision: https://reviews.facebook.net/D10677
2013-05-08 17:10:11 -07:00
Abhishek Kona 988c20b9f7 [RocksDB] Clear Archive WAL files
Summary:
WAL files are moved to archive directory and clear only at DB::Open.
Can lead to a lot of space consumption in a Database. Added logic to periodically clear Archive Directory too.

Test Plan: make all check + add unit test

Reviewers: dhruba, heyongqiang

Reviewed By: heyongqiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10617
2013-05-06 11:41:01 -07:00
Haobo Xu 05e8854085 [Rocksdb] Support Merge operation in rocksdb
Summary:
This diff introduces a new Merge operation into rocksdb.
The purpose of this review is mostly getting feedback from the team (everyone please) on the design.

Please focus on the four files under include/leveldb/, as they spell the client visible interface change.
include/leveldb/db.h
include/leveldb/merge_operator.h
include/leveldb/options.h
include/leveldb/write_batch.h

Please go over local/my_test.cc carefully, as it is a concerete use case.

Please also review the impelmentation files to see if the straw man implementation makes sense.

Note that, the diff does pass all make check and truly supports forward iterator over db and a version
of Get that's based on iterator.

Future work:
- Integration with compaction
- A raw Get implementation

I am working on a wiki that explains the design and implementation choices, but coding comes
just naturally and I think it might be a good idea to share the code earlier. The code is
heavily commented.

Test Plan: run all local tests

Reviewers: dhruba, heyongqiang

Reviewed By: dhruba

CC: leveldb, zshao, sheki, emayanke, MarkCallaghan

Differential Revision: https://reviews.facebook.net/D9651
2013-05-03 16:59:02 -07:00
Abhishek Kona 41cb922b34 Allocate the LogReporter from heap. Summary:
Summary:
The current code has a bug that take address of stack allocated LogReporter.
It is causing SIGSEGV because the stack address is no longer valid when referenced.

Test Plan: Tested on prod.

Reviewers: haobo, dhruba, heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D10557
2013-04-29 13:19:24 -07:00
Abhishek Kona fb96ec1686 [RocksDB] Print all internally collected histograms in db_bench. Also print p95
Summary: $title

Test Plan: make db_bench . run db_bench and check for expected output

Reviewers: haobo, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10521
2013-04-25 13:36:47 -07:00
Haobo Xu eb6d139666 [RocksDB] Move table.h to table/
Summary:
- don't see a point exposing table.h to the public.
- fixed make clean to remove also *.d files.

Test Plan: make check; db_stress

Reviewers: dhruba, heyongqiang

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10479
2013-04-22 16:07:56 -07:00
Abhishek Kona 344e832f55 [RocksDB] Fix ReadMissing in db_bench
Summary: D8943 Broke read_missing. Fix it by adding a "." at the end of the generated key

Test Plan: generate, print and check the key has a "."

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10455
2013-04-22 15:44:19 -07:00
Haobo Xu b4243e5a3d [RocksDB] CompactionFilter cleanup
Summary:
- removed the compaction_filter_value from the callback interface. Restrict compaction filter to purging values.
- modify some comments to reflect curent status.

Test Plan: make check

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10335
2013-04-20 10:26:51 -07:00
Mark Callaghan b1ff9ac9c5 Add --writes_per_second rate limit, print p99.99 in histogram
Summary:
Adds the --writes_per_second rate limit for the readwhilewriting test.
The purpose is to optionally avoid saturating storage with writes & compaction
and test read response time when some writes are being done.

Changes the histogram code to also print the p99.99 value

Task ID: #

Blame Rev:

Test Plan:
make check, ran db_bench with it

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10305
2013-04-20 10:26:51 -07:00
Haobo Xu 1255dcd446 [RocksDB] Add stacktrace signal handler
Summary:
This diff provides the ability to print out a stacktrace when the process receives certain signals.
Currently, we enable this for the following signals (program error related):
SIGILL SIGSEGV SIGBUS SIGABRT
Application simply #include "util/stack_trace.h" and call leveldb::InstallStackTraceHandler() during initialization, if signal handler is needed. It's not done automatically when openning db, because it's the application(process)'s responsibility to install signal handler and some applications might already have their own (like fbcode).

Sample output:
Received signal 11 (Segmentation fault)
#0  0x408ff0 ./signal_test() [0x408ff0] /home/haobo/rocksdb/util/signal_test.cc:4
#1  0x40827d ./signal_test() [0x40827d] /home/haobo/rocksdb/util/signal_test.cc:24
#2  0x7f8bb183172e /usr/local/fbcode/gcc-4.7.1-glibc-2.14.1/lib/libc.so.6(__libc_start_main+0x10e) [0x7f8bb183172e] ??:0
#3  0x408ebc ./signal_test() [0x408ebc] /home/engshare/third-party/src/glibc/glibc-2.14.1/glibc-2.14.1/csu/../sysdeps/x86_64/elf/start.S:113
Segmentation fault (core dumped)

For each frame, we print the raw pointer, the symbol provided by backtrace_symbols (still not good enough), and the source file/line. Note that address translation is done by directly shell out to addr2line. ??:0 means addr2line fails to do the translation. Hacky, but I think it's good for now.

Test Plan: signal_test.cc

Reviewers: dhruba, MarkCallaghan

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10173
2013-04-20 10:26:50 -07:00
Abhishek Kona 7c6c3c0ff4 [Rockdsdb] Better Error messages. Closing db instead of deleting db
Summary: A better error message. A local change. Did not look at other places where this could be done.

Test Plan: compile

Reviewers: dhruba, MarkCallaghan

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10251
2013-04-15 15:27:15 -07:00
Dhruba Borthakur 9b81d3c406 Simplified level_ptrs by using a std:vector
Summary: Simplified level_ptrs by using a std:vector

Test Plan: make check

Reviewers: sheki, emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10245
2013-04-15 13:52:51 -07:00
Haobo Xu 013e9ebbf1 [RocksDB] [Performance] Speed up FindObsoleteFiles
Summary:
FindObsoleteFiles was slow, holding the single big lock, resulted in bad p99 behavior.
Didn't profile anything, but several things could be improved:
1. VersionSet::AddLiveFiles works with std::set, which is by itself slow (a tree).
   You also don't know how many dynamic allocations occur just for building up this tree.
   switched to std::vector, also added logic to pre-calculate total size and do just one allocation
2. Don't see why env_->GetChildren() needs to be mutex proteced, moved to PurgeObsoleteFiles where
   mutex could be unlocked.
3. switched std::set to std:unordered_set, the conversion from vector is also inside PurgeObsoleteFiles
I have a feeling this should pretty much fix it.

Test Plan: make check;  db_stress

Reviewers: dhruba, heyongqiang, MarkCallaghan

Reviewed By: dhruba

CC: leveldb, zshao

Differential Revision: https://reviews.facebook.net/D10197
2013-04-12 11:29:27 -07:00
Mayank Agarwal 94d86b25a9 Fix memory leak for probableWALfiles in db_impl.cc
Summary: using unique_ptr to have automatic delete for probableWALfiles in db_impl.cc

Test Plan: make

Reviewers: sheki, dhruba

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10083
2013-04-11 16:57:15 -07:00
Dhruba Borthakur 7730587120 Prevent segfault in OpenCompactionOutputFile
Summary:
The segfault was happening because the program was unable to open a new
sst file (as part of the compaction) because the process ran out of
file descriptors.

The fix is to check the return status of the file creation before taking
any other action.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fabf03f9700 (LWP 29904)]
leveldb::DBImpl::OpenCompactionOutputFile (this=this@entry=0x7fabf9011400, compact=compact@entry=0x7fabf741a2b0) at db/db_impl.cc:1399
1399    db/db_impl.cc: No such file or directory.
(gdb) where

Test Plan: make check

Reviewers: MarkCallaghan, sheki

Reviewed By: MarkCallaghan

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10101
2013-04-10 09:59:48 -07:00
Abhishek Kona 574b76f710 [RocksDB][Bug] Look at all the files, not just the first file in TransactionLogIter as BatchWrites can leave it in Limbo
Summary:
Transaction Log Iterator did not move to the next file in the series if there was a write batch at the end of the currentFile.
The solution is if the last seq no. of the current file is < RequestedSeqNo. Assume the first seqNo. of the next file has to satisfy the request.

Also major refactoring around the code. Moved opening the logreader to a seperate function, got rid of goto.

Test Plan: added a unit test for it.

Reviewers: dhruba, heyongqiang

Reviewed By: heyongqiang

CC: leveldb, emayanke

Differential Revision: https://reviews.facebook.net/D10029
2013-04-08 16:28:09 -07:00
Abhishek Kona 0e40185a7d [Rocksdb] Remove useless struct TableAndFile
Summary:
TableAndFile was a struct used earlier to delete the file as we did not have std::unique_ptr in the codebase.
With Chip introducing C++11 hotness like std::unique_ptr we can do away with the struct.

Test Plan: make all check

Reviewers: haobo, heyongqiang

Reviewed By: heyongqiang

CC: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D9975
2013-04-05 11:26:46 -07:00
Abhishek Kona ca789a10cc [Rocksdb] Recover last updated sequence number from manifest also.
Summary:
During recovery, last_updated_manifest number was not set if there were no records in the Write-ahead log.
Now check for the recovered manifest also and set last_updated_manifest file to the max value.

Test Plan: unit test

Reviewers: heyongqiang

Reviewed By: heyongqiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9891
2013-04-02 17:18:27 -07:00
Haobo Xu 6763110867 [RocksDB] Replace iterator based loop with range based loop for stl containers
Summary:
As title.
Code is shorter and cleaner
See https://our.dev.facebook.com/intern/tasks/?t=2233981

Test Plan: make check

Reviewers: dhruba, heyongqiang

Reviewed By: dhruba

CC: leveldb, zshao

Differential Revision: https://reviews.facebook.net/D9789
2013-04-02 11:46:45 -07:00
Haobo Xu 645ff8f231 Let's get rid of delete as much as possible, here are some examples.
Summary:
If a class owns an object:
 - If the object can be null => use a unique_ptr. no delete
 - If the object can not be null => don't even need new, let alone delete
 - for runtime sized array => use vector, no delete.

Test Plan: make check

Reviewers: dhruba, heyongqiang

Reviewed By: heyongqiang

CC: leveldb, zshao, sheki, emayanke, MarkCallaghan

Differential Revision: https://reviews.facebook.net/D9783
2013-03-28 17:31:44 -07:00
Abhishek Kona 3b51605b8d [RocksDB] Fix binary search while finding probable wal files
Summary:
RocksDB does a binary search to look at the files which might contain the requested sequence number at the call GetUpdatesSince.
There was a bug in the binary search => when the file pointed by the middle index of bsearch was empty/corrupt it needst to resize the vector and update indexes.
This now fixes that.

Test Plan: existing unit tests pass.

Reviewers: heyongqiang, dhruba

Reviewed By: heyongqiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9777
2013-03-28 13:37:15 -07:00
Abhishek Kona 8e9c781ae5 [Rocksdb] Fix Crash on finding a db with no log files. Error out instead
Summary:
If the vector returned by GetUpdatesSince is empty, it is still returned to the
user. This causes it throw an std::range error.
The probable file list is checked and it returns an IOError status instead of OK now.

Test Plan: added a unit test.

Reviewers: dhruba, heyongqiang

Reviewed By: heyongqiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9771
2013-03-28 13:19:07 -07:00
Abhishek Kona 7fdd5f5b33 Use non-mmapd files for Write-Ahead Files
Summary:
Use non mmapd files for Write-Ahead log.
Earlier use of MMaped files. made the log iterator read ahead and miss records.
Now the reader and writer will point to the same physical location.

There is no perf regression :
./db_bench --benchmarks=fillseq --db=/dev/shm/mmap_test --num=$(million 20) --use_existing_db=0 --threads=2
with This diff :
fillseq      :      10.756 micros/op 185281 ops/sec;   20.5 MB/s
without this dif :
fillseq      :      11.085 micros/op 179676 ops/sec;   19.9 MB/s

Test Plan: unit test included

Reviewers: dhruba, heyongqiang

Reviewed By: heyongqiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9741
2013-03-28 13:13:35 -07:00
Abhishek Kona 63f216ee0a memory manage statistics
Summary:
Earlier Statistics object was a raw pointer. This meant the user had to clear up
the Statistics object after creating the database. In most use cases the database is created in a function and the statistics pointer is out of scope. Hence the statistics object would never be deleted.
Now Using a shared_ptr to manage this.

Want this in before the next release.

Test Plan: make all check.

Reviewers: dhruba, emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9735
2013-03-27 11:27:39 -07:00
Haobo Xu ecd8db0200 [RocksDB] Minimize Mutex protected code section in the critical path
Summary: rocksdb uses a single global lock to protect in memory metadata. We should minimize the mutex protected code section to increase the effective parallelism of the program. See https://our.intern.facebook.com/intern/tasks/?t=2218928

Test Plan:
make check
db_bench

Reviewers: dhruba, heyongqiang

CC: zshao, leveldb

Differential Revision: https://reviews.facebook.net/D9705
2013-03-26 22:42:26 -07:00
Abhishek Kona 9b70529c86 Disable Unit Test for TransactionLogIteratorStall
Summary:
The unit test fails as our solution does not work with MMap'd files.
Disable the failing unit test. Put it back with the next diff which should fix the problem.

Test Plan: db_test

Reviewers: heyongqiang

CC: dhruba

Differential Revision: https://reviews.facebook.net/D9645
2013-03-21 15:51:18 -07:00
Abhishek Kona 27c15fb67e TransactionLogIter should stall at the last record. Currently it errors out
Summary:
* Add a method to check if the log reader is at EOF.
* If we know a record has been flushed force the log_reader to believe it is not at EOF, using a new method UnMarkEof().

This does not work with MMpaed files.

Test Plan: added a unit test.

Reviewers: dhruba, heyongqiang

Reviewed By: heyongqiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9567
2013-03-21 15:12:35 -07:00
Dhruba Borthakur d0798f67f4 Run compactions even if workload is readonly or read-mostly.
Summary:
The events that trigger compaction:
* opening the database
* Get -> only if seek compaction is not disabled and other checks are true
* MakeRoomForWrite -> when memtable is full
* BackgroundCall ->
  If the background thread is about to do a compaction run, it schedules
  a new background task to trigger a possible compaction. This will cause
  additional background threads to find and process other compactions that
  can run concurrently.

Test Plan: ran db_bench with overwrite and readonly alternatively.

Reviewers: sheki, MarkCallaghan

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9579
2013-03-20 23:43:29 -07:00
Dhruba Borthakur ad96563b79 Ability to configure bufferedio-reads, filesystem-readaheads and mmap-read-write per database.
Summary:
This patch allows an application to specify whether to use bufferedio,
reads-via-mmaps and writes-via-mmaps per database. Earlier, there
was a global static variable that was used to configure this functionality.

The default setting remains the same (and is backward compatible):
 1. use bufferedio
 2. do not use mmaps for reads
 3. use mmap for writes
 4. use readaheads for reads needed for compaction

I also added a parameter to db_bench to be able to explicitly specify
whether to do readaheads for compactions or not.

Test Plan: make check

Reviewers: sheki, heyongqiang, MarkCallaghan

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9429
2013-03-20 23:14:03 -07:00
Mayank Agarwal b1bea58457 Fix more signed-unsigned comparisons
Summary: Some comparisons left in log_test.cc and db_test.cc complained by make

Test Plan: make

Reviewers: dhruba, sheki

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D9537
2013-03-19 17:21:36 -07:00
Mayank Agarwal 487168cdcf Fixed sign-comparison in rocksdb code-base and fixed Makefile
Summary: Makefile had options to ignore sign-comparisons and unused-parameters, which should be there. Also fixed the specific errors in the code-base

Test Plan: make

Reviewers: chip, dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9531
2013-03-19 14:35:23 -07:00
Mark Callaghan 72d14eafd3 add --benchmarks=levelstats option to db_bench, prevent "nan" in stats output
Summary:
Add --benchmarks=levelstats option to report per-level stats (#files, #bytes)
Change readwhilewriting test to report response time for writes but exclude
them from the stats merged by all threads.
Prevent "NaN" in stats output by preventing division by 0.
Remove "o" file I committed by mistake.

Task ID: #

Blame Rev:

Test Plan:
make check

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D9513
2013-03-19 13:14:44 -07:00
Abhishek Kona 02c459805b Ignore a zero-sized file while looking for a seq-no in GetUpdatesSince
Summary:
Rocksdb can create 0 sized log files when it is opened and closed without any operations.
The GetUpdatesSince fails currently if there is a log file of size zero.

This diff fixes this. If there is a log file is 0, it is removed form the probable_file_list

Test Plan: unit test

Reviewers: dhruba, heyongqiang

Reviewed By: heyongqiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9507
2013-03-19 11:00:09 -07:00
Abhishek Kona 7b9db9c98e DO not report level size as zero when there are no files in L0
Summary:
Instead of checking for number of files in L0. Check for number of files in the requested level.

Bug introduced in D4929 (diff trying to do too many things).

Test Plan: db_test.

Reviewers: dhruba, MarkCallaghan

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D9483
2013-03-18 12:04:38 -07:00
Mark Callaghan 5a8c8845a9 Enhance db_bench
Summary:
Add --benchmarks=updaterandom for read-modify-write workloads. This is different
from --benchmarks=readrandomwriterandom in a few ways. First, an "operation" is the
combined time to do the read & write rather than treating them as two ops. Second,
the same key is used for the read & write.

Change RandomGenerator to support rows larger than 1M. That was using "assert"
to fail and assert is compiled-away when -DNDEBUG is used.

Add more options to db_bench
--duration - sets the number of seconds for tests to run. When not set the
operation count continues to be the limit. This is used by random operation
tests.

--use_snapshot - when set GetSnapshot() is called prior to each random read.
This is to measure the overhead from using snapshots.

--get_approx - when set GetApproximateSizes() is called prior to each random
read. This is to measure the overhead for a query optimizer.

Task ID: #

Blame Rev:

Test Plan:
run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D9267
2013-03-14 16:00:23 -07:00
Mayank Agarwal 5b278b53ae Fix valgrind errors in rocksdb tests: auto_roll_logger_test, reduce_levels_test
Summary: Fix for memory leaks in rocksdb tests. Also modified the variable NUM_FAILED_TESTS to print the actual number of failed tests.

Test Plan: make <test>; valgrind --leak-check=full ./<test>

Reviewers: sheki, dhruba

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9333
2013-03-12 16:03:16 -07:00
Dhruba Borthakur ebf16f57c9 Prevent segfault because SizeUnderCompaction was called without any locks.
Summary:
SizeBeingCompacted was called without any lock protection. This causes
crashes, especially when running db_bench with value_size=128K.
The fix is to compute SizeUnderCompaction while holding the mutex and
passing in these values into the call to Finalize.

(gdb) where
#4  leveldb::VersionSet::SizeBeingCompacted (this=this@entry=0x7f0b490931c0, level=level@entry=4) at db/version_set.cc:1827
#5  0x000000000043a3c8 in leveldb::VersionSet::Finalize (this=this@entry=0x7f0b490931c0, v=v@entry=0x7f0b3b86b480) at db/version_set.cc:1420
#6  0x00000000004418d1 in leveldb::VersionSet::LogAndApply (this=0x7f0b490931c0, edit=0x7f0b3dc8c200, mu=0x7f0b490835b0, new_descriptor_log=<optimized out>) at db/version_set.cc:1016
#7  0x00000000004222b2 in leveldb::DBImpl::InstallCompactionResults (this=this@entry=0x7f0b49083400, compact=compact@entry=0x7f0b2b8330f0) at db/db_impl.cc:1473
#8  0x0000000000426027 in leveldb::DBImpl::DoCompactionWork (this=this@entry=0x7f0b49083400, compact=compact@entry=0x7f0b2b8330f0) at db/db_impl.cc:1757
#9  0x0000000000426690 in leveldb::DBImpl::BackgroundCompaction (this=this@entry=0x7f0b49083400, madeProgress=madeProgress@entry=0x7f0b41bf2d1e, deletion_state=...) at db/db_impl.cc:1268
#10 0x0000000000428f42 in leveldb::DBImpl::BackgroundCall (this=0x7f0b49083400) at db/db_impl.cc:1170
#11 0x000000000045348e in BGThread (this=0x7f0b49023100) at util/env_posix.cc:941
#12 leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper (arg=0x7f0b49023100) at util/env_posix.cc:874
#13 0x00007f0b4a7cf10d in start_thread (arg=0x7f0b41bf3700) at pthread_create.c:301
#14 0x00007f0b49b4b11d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Test Plan:
make check

I am running db_bench with a value size of 128K to see if the segfault is fixed.

Reviewers: MarkCallaghan, sheki, emayanke

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9279
2013-03-11 14:09:01 -07:00
Dhruba Borthakur 6d812b6afb A mechanism to detect manifest file write errors and put db in readonly mode.
Summary:
If there is an error while writing an edit to the manifest file, the manifest
file is closed and reopened to check if the edit made it in. However, if the
re-opening of the manifest is unsuccessful and options.paranoid_checks is set
t true, then the db refuses to accept new puts, effectively putting the db
in readonly mode.

In a future diff, I would like to make the default value of paranoid_check
to true.

Test Plan: make check

Reviewers: sheki

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9201
2013-03-07 09:45:49 -08:00
Abhishek Kona d68880a1b9 Do not allow Transaction Log Iterator to fall ahead when writer is writing the same file
Summary:
Store the last flushed, seq no. in db_impl. Check against it in
transaction Log iterator. Do not attempt to read ahead if we do not know
if the data is flushed completely.
Does not work if flush is disabled. Any ideas on fixing that?
* Minor change, iter->Next is called the first time automatically for
* the first time.

Test Plan:
existing test pass.
More ideas on testing this?
Planning to run some stress test.

Reviewers: dhruba, heyongqiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9087
2013-03-06 14:05:53 -08:00
Dhruba Borthakur afed60938f Fox db_stress crash by copying keys before changing sequencenum to zero.
Summary:
The compaction process zeros out sequence numbers if the output is
part of the bottommost level.
The Slice is supposed to refer to an immutable data buffer. The
merger that implements the priority queue while reading kvs as
the input of a compaction run reies on this fact. The bug was that
were updating the sequence number of a record in-place and that was
causing suceeding invocations of the merger to return kvs in
arbitrary order of sequence numbers.
The fix is to copy the key to a local memory buffer before setting
its seqno to 0.

Test Plan:
Set Options.purge_redundant_kvs_while_flush = false and then run
db_stress --ops_per_thread=1000 --max_key=320

Reviewers: emayanke, sheki

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9147
2013-03-06 10:52:08 -08:00
Dhruba Borthakur f5896681b4 Removed unnecesary file object in table_cache.
Summary:
TableCache->file is not used. remove it.
I kept the TableAndFile structure and will clean it up in a future patch.

Test Plan: make clean check

Reviewers: sheki, chip

Reviewed By: chip

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9075
2013-03-04 13:56:23 -08:00
Mark Callaghan 993543d1be Add rate_delay_limit_milliseconds
Summary:
This adds the rate_delay_limit_milliseconds option to make the delay
configurable in MakeRoomForWrite when the max compaction score is too high.
This delay is called the Ln slowdown. This change also counts the Ln slowdown
per level to make it possible to see where the stalls occur.

From IO-bound performance testing, the Level N stalls occur:
* with compression -> at the largest uncompressed level. This makes sense
                      because compaction for compressed levels is much
                      slower. When Lx is uncompressed and Lx+1 is compressed
                      then files pile up at Lx because the (Lx,Lx+1)->Lx+1
                      compaction process is the first to be slowed by
                      compression.
* without compression -> at level 1

Task ID: #1832108

Blame Rev:

Test Plan:
run with real data, added test

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D9045
2013-03-04 07:41:15 -08:00
Dhruba Borthakur 806e264350 Ability for rocksdb to compact when flushing the in-memory memtable to a file in L0.
Summary:
Rocks accumulates recent writes and deletes in the in-memory memtable.
When the memtable is full, it writes the contents on the memtable to
a file in L0.

This patch removes redundant records at the time of the flush. If there
are multiple versions of the same key in the memtable, then only the
most recent one is dumped into the output file. The purging of
redundant records occur only if the most recent snapshot is earlier
than the earliest record in the memtable.

Should we switch on this feature by default or should we keep this feature
turned off in the default settings?

Test Plan: Added test case to db_test.cc

Reviewers: sheki, vamsi, emayanke, heyongqiang

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8991
2013-03-04 00:01:47 -08:00
bil 4992633751 enable the ability to set key size in db_bench in rocksdb
Summary:
1. the default value for key size is still 16
2. enable the ability to set the key size via command line --key_size=

Test Plan:
build & run db_banch and pass some value via command line.
verify it works correctly.

Reviewers: sheki

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8943
2013-03-01 14:10:09 -08:00