Commit Graph

4093 Commits

Author SHA1 Message Date
Igor Canadi 5f4166c90e ReadaheadRandomAccessFile -- userspace readahead
Summary:
ReadaheadRandomAccessFile acts as a transparent layer on top of RandomAccessFile. When a Read() request is issued, it issues a much bigger request to the OS and caches the result. When a new request comes in and we already have the data cached, it doesn't have to issue any requests to the OS.

We add the ReadaheadRandomAccessFile layer only when a file is read during compaction.

D45105 was incorrectly closed by Phabricator because I committed it to a separate branch (not master), so I'm resubmitting the diff.

Test Plan: make check

Reviewers: MarkCallaghan, sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D45123
2015-08-26 15:25:59 -07:00
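
The userspace readahead idea described in the commit above can be sketched as follows. This is a minimal illustration, not the RocksDB implementation: `FakeFile` and `ReadaheadFile` are hypothetical names, and an in-memory string stands in for the OS file so that the number of underlying reads can be counted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical stand-in for a random-access file. It counts how many
// reads reach the "OS" so the effect of caching is observable.
class FakeFile {
 public:
  explicit FakeFile(std::string data) : data_(std::move(data)) {}

  // Returns up to n bytes starting at offset (short read at end of file).
  std::string Read(uint64_t offset, size_t n) {
    ++reads_;
    if (offset >= data_.size()) return std::string();
    return data_.substr(static_cast<size_t>(offset), n);
  }

  int reads() const { return reads_; }

 private:
  std::string data_;
  int reads_ = 0;
};

// Transparent layer on top of FakeFile: every miss issues one larger
// read and keeps the result; later requests that fall entirely inside
// the cached range are served without touching the underlying file.
class ReadaheadFile {
 public:
  ReadaheadFile(FakeFile* file, size_t readahead_size)
      : file_(file), readahead_size_(readahead_size) {}

  std::string Read(uint64_t offset, size_t n) {
    if (offset >= buffer_offset_ &&
        offset + n <= buffer_offset_ + buffer_.size()) {
      // Cache hit: copy the requested window out of the buffer.
      return buffer_.substr(static_cast<size_t>(offset - buffer_offset_), n);
    }
    // Cache miss: read at least readahead_size_ bytes in one request.
    buffer_ = file_->Read(offset, std::max(n, readahead_size_));
    buffer_offset_ = offset;
    return buffer_.substr(0, n);
  }

 private:
  FakeFile* file_;
  size_t readahead_size_;
  std::string buffer_;
  uint64_t buffer_offset_ = 0;
};
```

With a 16-byte readahead, two sequential 4-byte reads from the start of the file cost a single underlying read; a read outside the cached range triggers a second one.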
Igor Canadi 16ebe3a2a9 Mmap reads should not return error if reading past file
Summary:
Currently, mmap returns IOError when user tries to read data past the end of the file. This diff changes the behavior. Now, we return just the bytes that we can, and report the size we returned via a Slice result. This is consistent with non-mmap behavior and also pread() system call.

This diff is taken out of D45123.

Test Plan: make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45645
2015-08-26 14:51:38 -07:00
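
The behavior change in the commit above (return the bytes that exist instead of an error) amounts to clamping the read length. A minimal sketch, assuming a simplified `Slice` struct and a hypothetical `ReadMapped` helper rather than the real RocksDB classes:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Simplified stand-in for rocksdb::Slice: a pointer plus a length.
struct Slice {
  const char* data;
  size_t size;
};

// base/length describe the mapped file. A read that starts at or past
// the end yields an empty Slice; a read that crosses the end is
// truncated to the bytes that exist, similar to a short pread().
Slice ReadMapped(const char* base, size_t length, size_t offset, size_t n) {
  if (offset >= length) return Slice{base + length, 0};
  return Slice{base + offset, std::min(n, length - offset)};
}
```

The caller learns how many bytes were actually returned from the size of the result, not from an error status.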
sdong d286b5df90 DBIter to filter out extra keys with higher sequence numbers when changing direction from forward to backward
Summary:
When DBIter changes iterating direction from forward to backward, it might see some much larger keys with higher sequence ID. With this commit, these rows will be actively filtered out. It should fix existing disabled tests in db_iter_test.

This may not be a perfect fix, but it has the least impact on existing code, which keeps it safe.

Test Plan:
Enable existing tests and make sure they pass. Add a new test DBIterWithMergeIterTest.InnerMergeIteratorDataRace8.
Also run all existing tests.

Reviewers: yhchiang, rven, anthony, IslamAbdelRahman, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D45567
2015-08-26 13:01:39 -07:00
Andres Noetzli 3795449c9d Fix DBTest.GetProperty
Summary:
DBTest.GetProperty was failing occasionally (see task #8131266). The reason was
that the test closed the database before the compaction was done. When the test
reopened the database, RocksDB would schedule a compaction which in turn
created table readers and led the test to fail the assertion that
rocksdb.estimate-table-readers-mem is 0. In most cases, GetIntProperty() of
rocksdb.estimate-table-readers-mem happened before the compaction created the
table readers, hiding the problem. This patch changes the
WaitForFlushMemTable() to WaitForCompact(). WaitForFlushMemTable() is not
necessary because it is already being called a couple of lines before without
any insertions in-between.

Test Plan:
Insert `usleep(10000);` just after `Reopen(options);` on line 2333 to make the issue more likely, then run:
make db_test && while ./db_test --gtest_filter=DBTest.GetProperty; do true; done

Reviewers: rven, yhchiang, anthony, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45603
2015-08-26 10:10:26 -07:00
Igor Canadi a7834a1292 Merge pull request #698 from yuslepukhin/address_noexcept_windows
Address noexcept and const integer lambda capture on win
2015-08-25 17:15:23 -07:00
Yueh-Hsuan Chiang 9ccf1bd3e2 Correct the comment for GetProperty() API.
Summary:
"rocksdb.aggregated-table-properties" and "rocksdb.aggregated-table-properties-at-level<N>"
should belong to GetProperty() instead of GetIntProperty(), but the comment mistakenly
classifies them to GetIntProperty().

This patch fixes this comment error.

Test Plan: no code change.

Reviewers: sdong, anthony, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D45561
2015-08-25 16:45:23 -07:00
Dmitri Smirnov fbe2c05f59 s/NOEXCEPT/ROCKSDB_NOEXCEPT 2015-08-25 16:34:39 -07:00
Dmitri Smirnov 6924d7582b Address noexcept and const integer lambda capture
VS 2013 does not support noexcept.
   Complains about usage of integer constant within lambda requiring explicit capture.
2015-08-25 15:17:14 -07:00
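
The two portability issues named in the commit above can be illustrated roughly as follows. The macro name `ROCKSDB_NOEXCEPT` comes from the neighboring commit; the `_MSC_VER < 1900` test, the `Handle` struct, and `SumWithLimit` are assumptions made for this sketch, not the project's actual definitions.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Assumed definition: expand to nothing on compilers older than
// Visual Studio 2015 (which lack noexcept), and to noexcept elsewhere.
#if defined(_MSC_VER) && _MSC_VER < 1900
#define ROCKSDB_NOEXCEPT
#else
#define ROCKSDB_NOEXCEPT noexcept
#endif

// Hypothetical movable type using the macro on its move constructor.
struct Handle {
  Handle() = default;
  Handle(Handle&& other) ROCKSDB_NOEXCEPT : value(other.value) {
    other.value = 0;
  }
  int value = 0;
};

// The const integer is captured explicitly, which every compiler accepts.
int SumWithLimit(const std::vector<int>& values) {
  const int kLimit = 10;
  int sum = 0;
  auto add = [&sum, kLimit](int x) { sum += std::min(x, kLimit); };
  for (int v : values) add(v);
  return sum;
}
```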
Ari Ekmekji 2f8d71ec05 Moving sequence number compaction variables from SubCompactionState to CompactionJob
Summary:
It was pointed out to me that the members of SubCompactionState
'earliest_snapshot', 'latest_snapshot' and 'visible_at_tip' are never
modified by the subcompactions, so they can stay as global variables
instead to make things simpler.

Test Plan: make all && make check

Reviewers: sdong, igor, noetzli, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D45477
2015-08-25 14:03:10 -07:00
Venkatesh Radhakrishnan bab9934d9e Fix build failure caused by bad merge.
Summary: There was a bad merge during refresh.

Test Plan: make -j all; make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D45555
2015-08-25 14:02:03 -07:00
Venkatesh Radhakrishnan 4d28a7d8ab Add a whitebox test for deleted file iterators.
Summary:
We have earlier added a feature to delete file iterators when the
current key is over the iterate upper bound. We now add a whitebox test
to check if the file iterators were actually deleted.

Test Plan: Add check for a range which has deleted iterators.

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45321
2015-08-25 13:40:58 -07:00
Venkatesh Radhakrishnan 249fb4f881 Fix use of deleted file iterators with incomplete iterators
Summary:
After deleting file iterators which are over the iterate upper
bound, we also need to check for null pointers in
ResetIncompleteIterators.

Test Plan: db_tailing_iter_test.TailingIteratorTrimSeekToNext

Reviewers: tnovak, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45525
2015-08-25 13:38:35 -07:00
Igor Canadi 53b88784df Add throttling to multi-threaded backups
Summary: See internal task t8056182

Test Plan: Added multi-threading in RateLimiter test

Reviewers: benj, AaronFeldman

Reviewed By: AaronFeldman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45459
2015-08-25 13:32:46 -07:00
Andres Notzli 09d982f9e0 Fix compact_files_example
Summary:
See task #7983654. The example was triggering an assert in compaction job
because the compaction was not marked as manual. With this patch,
CompactionPicker::FormCompaction() marks compactions as manual. This patch
also fixes a couple of typos, adds optimistic_transaction_example to
.gitignore and librocksdb as a dependency for examples. Adding librocksdb as
a dependency makes sure that the examples are built with the latest changes
in librocksdb.

Test Plan: make clean && cd examples && make all && ./compact_files_example

Reviewers: rven, sdong, anthony, igor, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45117
2015-08-25 12:29:44 -07:00
Yueh-Hsuan Chiang 6996de87af Expose per-level aggregated table properties via GetProperty()
Summary:
This patch adds "rocksdb.aggregated-table-properties"
and "rocksdb.aggregated-table-properties-at-levelN", the former
returns the aggregated table properties of a column family,
while the latter returns the aggregated table properties
of the specified level N.

Test Plan: Added tests in db_test

Reviewers: igor, sdong, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45087
2015-08-25 12:03:54 -07:00
agiardullo 86d6c3cdea Fix Windows build
Summary: wrong filename

Test Plan: none

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45531
2015-08-25 11:02:31 -07:00
agiardullo 20d1e547d1 Common base class for transactions
Summary:
As I keep adding new features to transactions, I keep creating more duplicate code.  This diff cleans this up by creating a base implementation class for Transaction and OptimisticTransaction to inherit from.

The code in TransactionBase.h/.cc is all just copied from elsewhere.  The only entertaining part of this class worth looking at is the virtual TryLock method which allows OptimisticTransactions and Transactions to share the same common code for Put/Get/etc.

The rest of this diff is mostly red and easy on the eyes.

Test Plan: No functionality change.  existing tests pass.

Reviewers: sdong, jkedgar, rven, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45135
2015-08-24 19:09:43 -07:00
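
The shared-base-class design described in the commit above can be sketched like this. The class names, the `Put` signature, and the lock table are hypothetical simplifications; the only detail taken from the commit message is that a virtual `TryLock` lets both transaction kinds share the same `Put`/`Get` code path.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Common code lives in the base class; only the locking policy differs.
class TransactionBase {
 public:
  virtual ~TransactionBase() = default;

  // Shared write path: ask the subclass for permission, then buffer.
  bool Put(const std::string& key, const std::string& value) {
    if (!TryLock(key)) return false;
    writes_[key] = value;
    return true;
  }

 protected:
  virtual bool TryLock(const std::string& key) = 0;
  std::map<std::string, std::string> writes_;
};

// Pessimistic flavor: takes the key in a shared lock table up front.
class PessimisticTxn : public TransactionBase {
 public:
  explicit PessimisticTxn(std::set<std::string>* lock_table)
      : locks_(lock_table) {}
  ~PessimisticTxn() override {
    for (const std::string& k : owned_) locks_->erase(k);
  }

 protected:
  bool TryLock(const std::string& key) override {
    if (owned_.count(key)) return true;             // already ours
    if (!locks_->insert(key).second) return false;  // held by someone else
    owned_.insert(key);
    return true;
  }

 private:
  std::set<std::string>* locks_;
  std::set<std::string> owned_;
};

// Optimistic flavor: never blocks, only remembers which keys to
// validate at commit time (validation itself is not shown).
class OptimisticTxn : public TransactionBase {
 public:
  const std::set<std::string>& tracked() const { return tracked_; }

 protected:
  bool TryLock(const std::string& key) override {
    tracked_.insert(key);
    return true;
  }

 private:
  std::set<std::string> tracked_;
};
```

Two pessimistic transactions contending for the same key see the second `Put` fail, while an optimistic transaction always succeeds at `Put` time and defers conflict detection.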
Andres Noetzli 2050832974 Fixing race condition in DBTest.DynamicMemtableOptions
Summary:
This patch fixes a race condition in DBTest.DynamicMemtableOptions. In rare cases,
it was possible that the main thread would fill up both memtables before the flush
job acquired its work. Then, the flush job was flushing both memtables together,
producing only one L0 file while the test expected two. Now, the test waits for
flushes to finish earlier, to make sure that the memtables are flushed in separate
flush jobs.

Test Plan:
Insert "usleep(10000);" after "IOSTATS_SET_THREAD_POOL_ID(Env::Priority::HIGH);" in BGWorkFlush()
to make the issue more likely. Then test with:
make db_test && time while ./db_test --gtest_filter=*DynamicMemtableOptions; do true; done

Reviewers: rven, sdong, yhchiang, anthony, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45429
2015-08-24 17:04:18 -07:00
Igor Canadi e46bcc08b9 Remove an extra 's' from cur-size-all-mem-tabless
Summary: As title

Test Plan: make check

Reviewers: yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45447
2015-08-24 16:43:18 -07:00
Igor Canadi 4ab26c5ad1 Smarter purging during flush
Summary:
Currently, we only purge duplicate keys and deletions during flush if `earliest_seqno_in_memtable <= newest_snapshot`. This means that the newest snapshot happened before we first created the memtable. This is almost never true for MyRocks and MongoRocks.

This patch makes purging during flush able to understand snapshots. The main logic is copied from compaction_job.cc, although the logic over there is much more complicated and extensive. However, we should try to merge the common functionality at some point.

I need this patch to implement no_overwrite_i_promise functionality for flush. We'll also need this to support SingleDelete() during Flush(). @yoshinorim requested the feature.

Test Plan:
make check
I had to adjust some unit tests to understand this new behavior

Reviewers: yhchiang, yoshinorim, anthony, sdong, noetzli

Reviewed By: noetzli

Subscribers: yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42087
2015-08-24 11:11:12 -07:00
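
Snapshot-aware purging, as described in the commit above, keeps for each key only the newest version visible to each snapshot (and to readers at the tip). The following is a simplified sketch under stated assumptions: `Entry`, `EarliestVisibleSnapshot`, and `PurgeDuplicates` are hypothetical names, deletions and merges are ignored, and the input is assumed to be sorted by key ascending and sequence number descending.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

struct Entry {
  std::string key;
  uint64_t seq;
  std::string value;
};

// Smallest snapshot that can see an entry written at seq, or the
// maximum value if only readers at the tip can see it.
// snapshots must be sorted in ascending order.
uint64_t EarliestVisibleSnapshot(uint64_t seq,
                                 const std::vector<uint64_t>& snapshots) {
  auto it = std::lower_bound(snapshots.begin(), snapshots.end(), seq);
  return it == snapshots.end() ? std::numeric_limits<uint64_t>::max() : *it;
}

// Drops an entry when a newer entry for the same key falls into the
// same visibility bucket, because no reader could ever observe it.
std::vector<Entry> PurgeDuplicates(const std::vector<Entry>& sorted,
                                   const std::vector<uint64_t>& snapshots) {
  std::vector<Entry> out;
  const std::string* prev_key = nullptr;
  uint64_t prev_bucket = 0;
  for (const Entry& e : sorted) {
    const uint64_t bucket = EarliestVisibleSnapshot(e.seq, snapshots);
    const bool same_key = prev_key != nullptr && *prev_key == e.key;
    if (same_key && bucket == prev_bucket) continue;  // hidden by newer entry
    out.push_back(e);
    prev_key = &e.key;
    prev_bucket = bucket;
  }
  return out;
}
```

With a single snapshot at sequence 5 and versions of key `a` at sequences 9, 7, and 3, the version at 7 is dropped (hidden by 9 for tip readers) while the version at 3 is kept because the snapshot still needs it.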
Mark Callaghan 4c81ac0c59 Fix benchmark report script
Summary:
db_bench output now displays Percentile many times with --statistics after
read IO latency histograms were added. So I only need the last one in the report output.

Test Plan:
run run_flash_bench.sh

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D45093
2015-08-22 12:18:00 -07:00
Ari Ekmekji b6def58f73 Changed 'num_subcompactions' to the more accurate 'max_subcompactions'
Summary:
Up until this point we had DbOptions.num_subcompactions, but
it is semantically more correct to call this max_subcompactions since
we will schedule *up to* DbOptions.max_subcompactions smaller compactions
at a time during a compaction job.

I also added a --subcompactions option to db_bench

Test Plan: make all   make check

Reviewers: sdong, igor, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D45069
2015-08-21 14:25:34 -07:00
sdong c852968465 db_iter_test: add more test cases for the data race bug
Summary: Add more test cases of data race causing wrong iterating results. Tag tests not passing as DISABLED_

Test Plan: Run the tests

Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, yhchiang

Reviewed By: yhchiang

Subscribers: tnovak, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D44907
2015-08-21 12:14:12 -07:00
sdong 9130873a13 Add options.new_table_reader_for_compaction_inputs
Summary: Currently, compaction inputs share the same file descriptor and table reader as other foreground threads. This makes fadvise behave less predictably. Add options.new_table_reader_for_compaction_inputs to force the creation of a new file descriptor and a new table reader for compaction inputs.

Test Plan: Add the option.

Reviewers: rven, anthony, kradhakrishnan, IslamAbdelRahman, igor, yhchiang

Reviewed By: igor

Subscribers: igor, MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43311
2015-08-21 08:46:29 -07:00
sdong 07d2d34160 Add a counter about estimated pending compaction bytes
Summary:
Add a counter of estimated bytes the DB needs to compact for all the compactions to finish. Expose it as a DB Property.
In the future, we can use a threshold on this counter to replace the soft rate limit and hard rate limit. A single threshold of estimated compaction debt in bytes will be easier for users to reason about, when deciding when writes should slow down or stop, than the more abstract soft and hard rate limits.

Test Plan: Add unit tests

Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, anthony, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D44205
2015-08-20 22:17:10 -07:00
Mark Callaghan 41a0e2811d Improve defaults for benchmarks
Summary:
Changes include:
* don't sync-on-commit for single writer thread in readwhile... tests
* make default block size 8kb rather than 4kb to avoid too small blocks after compression
* use snappy instead of zlib to avoid stalls from compression latency
* disable statistics
* use bytes_per_sync=8M to reduce throughput loss on disk
* use open_files=-1 to reduce mutex contention

Test Plan:
run benchmark

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D44961
2015-08-20 18:59:10 -07:00
Yueh-Hsuan Chiang a203b913c1 Fixed a rare deadlock in DBTest.ThreadStatusFlush
Summary:
Currently, ThreadStatusFlush uses two sync-points to ensure
there's a flush currently running when calling GetThreadList().
However, one of the sync-points is inside the db-mutex, which could
cause deadlock in case there's a DB::Get() call.

This patch fixes this issue by moving the sync-point to a better
place where the flush job does not hold the mutex.

Test Plan: db_test

Reviewers: igor, sdong, anthony, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45045
2015-08-20 17:18:47 -07:00
Siying Dong 962aa64292 Merge pull request #695 from yuslepukhin/address_windows_build
Address windows build issues caused by introducing Subcompaction
2015-08-20 17:04:48 -07:00
Dmitri Smirnov 5bf8907622 More indent adjustment. 2015-08-20 14:14:02 -07:00
Dmitri Smirnov e2a9f43d64 Adjust indent 2015-08-20 14:10:51 -07:00
Dmitri Smirnov 6e9a260b04 Merge branch 'address_windows_build' of https://github.com/yuslepukhin/rocksdb into address_windows_build 2015-08-20 14:09:11 -07:00
Dmitri Smirnov 1cac89c9b1 Address windows build issues
Intro SubCompactionState move functionality
 =delete copy functionality
 #ifdef SyncPoint in tests for Windows Release builds
2015-08-20 14:08:24 -07:00
Dmitri Smirnov f25f06ddd2 Address windows build issues
Intro SubCompactionState move functionality
  =delete copy functionality
  #ifdef SyncPoint in tests for Windows Release builds
2015-08-20 13:43:07 -07:00
Islam AbdelRahman 027ca5b2cd Total SST files size DB Property
Summary: Add a new DB property that calculates the total size of files used by all RocksDB Versions

Test Plan: Unittests for the new property

Reviewers: igor, yhchiang, anthony, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D44799
2015-08-20 11:47:19 -07:00
Andres Noetzli b604d2562f Removing unused variables to fix build
Summary: Removing two unused variables that prevented compilation.

Test Plan: make all

Reviewers: rven, sdong, yhchiang, anthony, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D44991
2015-08-19 16:57:40 -07:00
Venkatesh Radhakrishnan 1b114eed4d Free file iterators for files which are above the iterate upper bound to improve memory utilization
Summary:
This diff improves the memory utilization for tailing iterators in RocksDB
by freeing file iterators which are over the upper bound.
It is an update to Siying's original diff for improving the memory usage for
tailing iterators. The changes for the seek and next path are now complete
and a test has been added to exercise these paths while deleting file iterators
which are above the upper bound.

Test Plan: db_tailing_iter_test.TailingIteratorTrimSeekToNext

Reviewers: march, tnovak, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43833
2015-08-19 16:05:51 -07:00
Islam AbdelRahman 3fd70b05b8 Rate limit deletes issued by DestroyDB
Summary: Update DestroyDB so that all SST files in the first path id go through DeleteScheduler instead of being deleted immediately

Test Plan: added a unittest

Reviewers: igor, yhchiang, anthony, kradhakrishnan, rven, sdong

Reviewed By: sdong

Subscribers: jeanxu2012, dhruba

Differential Revision: https://reviews.facebook.net/D44955
2015-08-19 15:02:17 -07:00
Yueh-Hsuan Chiang df79eafcb3 Introduce GetIntProperty("rocksdb.size-all-mem-tables")
Summary:
Currently, GetIntProperty("rocksdb.cur-size-all-mem-tables") only returns
the memory usage by those memtables which have not yet been flushed.

This patch introduces GetIntProperty("rocksdb.size-all-mem-tables"),
which includes the memory usage of all the memtables, including those that
have been flushed but are still pinned by iterators.

Test Plan: Added a test in db_test

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D44229
2015-08-19 13:32:09 -07:00
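
The difference between the two properties in the commit above can be pictured with a small model. `MemTableInfo` and both functions are hypothetical; the model assumes that a flushed memtable stays in the list only while an iterator still pins it.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One memtable that is still resident in memory.
struct MemTableInfo {
  uint64_t bytes;
  bool flushed;  // true: already flushed, kept alive only by iterators
};

// Models "rocksdb.cur-size-all-mem-tables": unflushed memtables only.
uint64_t CurSizeAllMemTables(const std::vector<MemTableInfo>& tables) {
  uint64_t total = 0;
  for (const MemTableInfo& t : tables) {
    if (!t.flushed) total += t.bytes;
  }
  return total;
}

// Models "rocksdb.size-all-mem-tables": everything still in memory.
uint64_t SizeAllMemTables(const std::vector<MemTableInfo>& tables) {
  uint64_t total = 0;
  for (const MemTableInfo& t : tables) total += t.bytes;
  return total;
}
```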
sdong 888fbdc889 Remove the constraint that iterator upper bound needs to be within a prefix
Summary: There is a check to fail the iterator if prefix extractor is specified but upper bound is out of the prefix for the seek key. Relax this constraint to allow users to set upper bound to the next prefix of the current one.

Test Plan: make commit-prereq

Reviewers: igor, anthony, kradhakrishnan, yhchiang, rven

Reviewed By: rven

Subscribers: tnovak, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D44949
2015-08-19 11:03:51 -07:00
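
With the relaxed constraint from the commit above, a caller can use "the next prefix" as the iterator upper bound. One way to compute such a bound is sketched below; `NextPrefix` is a hypothetical helper, not a RocksDB API, and it assumes bytewise key ordering.

```cpp
#include <cassert>
#include <string>

// Computes the smallest string that is greater than every key starting
// with prefix, under bytewise ordering. Returns false when no such
// bound exists (the prefix is empty or consists only of 0xff bytes).
bool NextPrefix(const std::string& prefix, std::string* out) {
  std::string next = prefix;
  while (!next.empty()) {
    const unsigned char c = static_cast<unsigned char>(next.back());
    if (c != 0xff) {
      next.back() = static_cast<char>(c + 1);  // bump the last byte
      *out = next;
      return true;
    }
    next.pop_back();  // 0xff cannot be bumped: drop it and carry
  }
  return false;
}
```

For the prefix `abc` the bound is `abd`; a trailing 0xff byte is dropped and the carry moves to the previous byte.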
Ari Ekmekji 137c376675 Removing variables used only in assertions to prevent build error
Summary:
A couple variables were declared but only used in assertions
which causes issues when building in fbcode.

Test Plan: make dbg  and   make release

Reviewers: yhchiang, sdong, igor, anthony, MarkCallaghan

Reviewed By: MarkCallaghan

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D44937
2015-08-19 08:52:22 -07:00
Ari Ekmekji b47cc58516 Bounding Number of Subcompactions
Summary:
In D43239 (https://reviews.facebook.net/D43239) the number
of subcompactions is set based on the number of L1 files with
unique starting keys. In certain cases when this number is very large
this causes issues, particularly with the overlap between files since
very small output files can be generated. This diff bounds the number
of subcompactions to the user option DBOption.num_subcompactions.

Test Plan: ./db_test ./db_compaction_test

Reviewers: sdong, igor, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D44883
2015-08-18 14:56:31 -07:00
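
The bounding rule from the commit above can be sketched as a clamp on the number of unique L1 starting keys. `NumSubcompactions` is a hypothetical function; the parameter is named `max_subcompactions` after the later rename of the option, and the floor of one subcompaction is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Counts the distinct starting keys of the L1 files and caps the
// result at the user-provided limit. At least one subcompaction runs.
size_t NumSubcompactions(std::vector<std::string> l1_start_keys,
                         size_t max_subcompactions) {
  std::sort(l1_start_keys.begin(), l1_start_keys.end());
  l1_start_keys.erase(
      std::unique(l1_start_keys.begin(), l1_start_keys.end()),
      l1_start_keys.end());
  const size_t wanted = std::max<size_t>(1, l1_start_keys.size());
  const size_t limit = std::max<size_t>(1, max_subcompactions);
  return std::min(wanted, limit);
}
```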
Venkatesh Radhakrishnan e58e1b18e7 Make tailing iterator show new entries in memtable.
Summary:
Reseek mutable_iter if it is invalid in Next and immutable_iter
is invalid.

Test Plan: DBTestTailingIterator.TailingIteratorSeekToNext

Reviewers: tnovak, march, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D44865
2015-08-18 14:40:06 -07:00
Yueh-Hsuan Chiang 9ec9571593 DBOptions serialization and deserialization
Summary:
This patch implements DBOptions deserialization and improves
the current implementation of DBOptions serialization by
using a static structure that stores the offset of each
DBOptions member variable to perform serialization and
deserialization instead of using tons of if-then branches
to determine the mapping between strings and variables.

Test Plan: Added test in options_test.cc

Reviewers: igor, anthony, sdong, IslamAbdelRahman

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D44097
2015-08-18 13:30:18 -07:00
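
The offset-table technique described in the commit above can be sketched as follows. The `Options` struct, the two fields, and the `name=value;` output format are hypothetical and chosen only for illustration; the point is that one static table drives both directions instead of a chain of if-then branches.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical, much reduced options struct.
struct Options {
  int max_open_files = 0;
  bool create_if_missing = false;
};

enum class OptionType { kInt, kBool };

struct OptionInfo {
  size_t offset;    // byte offset of the member inside Options
  OptionType type;  // how to interpret the bytes at that offset
};

// Single static table mapping option names to member offsets and types.
const std::map<std::string, OptionInfo>& OptionTable() {
  static const std::map<std::string, OptionInfo> table = {
      {"create_if_missing",
       {offsetof(Options, create_if_missing), OptionType::kBool}},
      {"max_open_files", {offsetof(Options, max_open_files), OptionType::kInt}},
  };
  return table;
}

// Deserializes one name/value pair. Returns false for unknown names.
// Note: std::stoi throws on malformed numbers; not handled here.
bool ParseOption(Options* opts, const std::string& name,
                 const std::string& value) {
  auto it = OptionTable().find(name);
  if (it == OptionTable().end()) return false;
  char* addr = reinterpret_cast<char*>(opts) + it->second.offset;
  switch (it->second.type) {
    case OptionType::kInt:
      *reinterpret_cast<int*>(addr) = std::stoi(value);
      return true;
    case OptionType::kBool:
      *reinterpret_cast<bool*>(addr) = (value == "true");
      return true;
  }
  return false;
}

// Serializes every known option as "name=value;" in name order.
std::string Serialize(const Options& opts) {
  std::string out;
  for (const auto& entry : OptionTable()) {
    const char* addr =
        reinterpret_cast<const char*>(&opts) + entry.second.offset;
    out += entry.first + "=";
    if (entry.second.type == OptionType::kInt) {
      out += std::to_string(*reinterpret_cast<const int*>(addr));
    } else {
      out += *reinterpret_cast<const bool*>(addr) ? "true" : "false";
    }
    out += ";";
  }
  return out;
}
```

Adding a new option then means adding one table row, and both serialization and deserialization pick it up.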
Yueh-Hsuan Chiang b2df20a890 Make HashCuckooRep::ApproximateMemoryUsage() return reasonable estimation.
Summary:
HashCuckooRep::ApproximateMemoryUsage() previously returned
std::numeric_limits<size_t>::max() when it cannot accept more
entries.  This patch makes it return a more reasonable estimation.

This change is necessary in order to make GetIntProperty("rocksdb.cur-size-all-mem-tables")
handle HashCuckooRep properly in diff https://reviews.facebook.net/D44229.

Test Plan: db_test

Reviewers: sdong, anthony, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D44241
2015-08-18 13:19:55 -07:00
Ari Ekmekji 601b1aaca0 Fixing Failed Assertion in Subcompaction State Diff
Summary:
In D43239 (https://reviews.facebook.net/D43239) there is an
assertion to make sure a subcompaction's output is never empty at the
end of execution. This assertion however breaks the build because some
tests lead to exactly that scenario. So instead I have altered the logic
to handle this case instead of just failing the assertion.

The reason that it is possible for a subcompaction's output to be empty is
that during a sequential execution of subcompactions, if a user aborts the
compaction job then some of the later subcompactions to be executed may
have yet to process any keys and therefore have yet to generate output files.
This becomes very rare once the subcompactions are executed in parallel,
but for now they are still sequential so the case is possible when there is an
early termination, as in some of the tests.

Test Plan: ./db_test  ./db_compaction_test

Reviewers: sdong, igor, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D44877
2015-08-18 12:27:12 -07:00
Ari Ekmekji f0da6977a3 [Parallel L0-L1 Compaction Prep]: Giving Subcompactions Their Own State
Summary:
In preparation for running multiple threads at the same time during
a compaction job, this patch assigns each subcompaction its own state
(instead of sharing the one global CompactionState). Each subcompaction then
uses this state to update its statistics, keep track of its snapshots, etc.
during the course of execution. Then at the end of all the executions the
statistics are aggregated across the subcompactions so that the final result
is the same as if only one larger compaction had run.

Test Plan: ./db_test  ./db_compaction_test  ./compaction_job_test

Reviewers: sdong, anthony, igor, noetzli, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43239
2015-08-18 11:06:23 -07:00
Andres Notzli f32a572099 Simplify querying of merge results
Summary:
While working on supporting mixing merge operators with
single deletes ( https://reviews.facebook.net/D43179 ),
I realized that returning and dealing with merge results
can be made simpler. Submitting this as a separate diff
because it is not directly related to single deletes.

Before, callers of merge helper had to retrieve the merge
result in one of two ways depending on whether the merge
was successful or not (success = result of merge was single
kTypeValue). For successful merges, the caller could query
the resulting key/value pair and for unsuccessful merges,
the result could be retrieved in the form of two deques of
keys and values. However, with single deletes, a successful merge
does not return a single key/value pair (if merge
operands are merged with a single delete, we have to generate
a value and keep the original single delete around to make
sure that we are not accidentally producing a key overwrite).
In addition, the two existing call sites of the merge
helper were taking the same actions independently from whether
the merge was successful or not, so this patch simplifies that.

Test Plan: make clean all check

Reviewers: rven, sdong, yhchiang, anthony, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43353
2015-08-17 17:34:38 -07:00
sdong 72613657f0 Measure file read latency histogram per level
Summary: In internal stats, remember read latency histogram, if statistics is enabled. It can be retrieved from DB::GetProperty() with "rocksdb.dbstats" property, if it is enabled.

Test Plan: Manually run db_bench, print out "rocksdb.dbstats" by hand, and make sure it prints out as expected

Reviewers: igor, IslamAbdelRahman, rven, kradhakrishnan, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D44193
2015-08-14 17:32:42 -07:00
Nathan Bronson b7198c3afe reduce db mutex contention for write batch groups
Summary:
This diff allows a Writer to join the next write batch group
without acquiring any locks. Waiting is performed via a per-Writer mutex,
so all of the non-leader writers never need to acquire the db mutex.
It is now possible to join a write batch group after the leader has been
chosen but before the batch has been constructed. This diff doesn't
increase parallelism, but reduces synchronization overheads.

For some CPU-bound workloads (no WAL, RAM-sized working set) this can
substantially reduce contention on the db mutex in a multi-threaded
environment.  With T=8 N=500000 in a CPU-bound scenario (see the test
plan) this is good for a 33% perf win.  Not all scenarios see such a
win, but none show a loss.  This code is slightly faster even for the
single-threaded case (about 2% for the CPU-bound scenario below).

Test Plan:
1. unit tests
2. COMPILE_WITH_TSAN=1 make check
3. stress high-contention scenarios with db_bench -benchmarks=fillrandom -threads=$T -batch_size=1 -memtablerep=skip_list -value_size=0 --num=$N -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000

Reviewers: sdong, igor, rven, ljin, yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43887
2015-08-14 10:55:43 -07:00
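
Joining a write batch group without acquiring a lock, as described in the commit above, can be done by pushing onto an atomic linked list with compare-and-swap. The sketch below is a simplified illustration with hypothetical `Writer` and `WriteQueue` types; the per-Writer mutex used for waiting and the batch construction itself are not shown.

```cpp
#include <atomic>
#include <cassert>

// Each writer links to the writer that joined just before it.
struct Writer {
  Writer* link_older = nullptr;
};

class WriteQueue {
 public:
  // Lock-free join. Returns true if w found the queue empty, in which
  // case w would act as the leader of the next write batch group.
  bool Join(Writer* w) {
    Writer* head = newest_.load(std::memory_order_relaxed);
    do {
      // On CAS failure head is refreshed, so the link is re-pointed.
      w->link_older = head;
    } while (!newest_.compare_exchange_weak(head, w));
    return head == nullptr;
  }

 private:
  std::atomic<Writer*> newest_{nullptr};
};
```

The first writer to join becomes the leader; later writers only link themselves in and would then wait on their own synchronization primitive instead of the db mutex.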
sdong 603b6da8b8 Add options.compaction_measure_io_stats to print write I/O stats in compactions
Summary:
Add options.compaction_measure_io_stats to print out / pass to listener accumulated time spent on write calls. Example outputs in info logs:

2015/08/12-16:27:59.463944 7fd428bff700 (Original Log Time 2015/08/12-16:27:59.463922) EVENT_LOG_v1 {"time_micros": 1439422079463897, "job": 6, "event": "compaction_finished", "output_level": 1, "num_output_files": 4, "total_output_size": 6900525, "num_input_records": 111483, "num_output_records": 106877, "file_write_nanos": 15663206, "file_range_sync_nanos": 649588, "file_fsync_nanos": 349614797, "file_prepare_write_nanos": 1505812, "lsm_state": [2, 4, 0, 0, 0, 0, 0]}

Add two more counters in iostats_context.

Also add a parameter of db_bench.

Test Plan: Add a unit test. Also manually verify LOG outputs in db_bench

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D44115
2015-08-13 16:52:26 -07:00