Commit Graph

270 Commits

Author SHA1 Message Date
Islam AbdelRahman d6c838f1e1 Add SstFileManager (component tracking all SST file in DBs and control the deletion rate)
Summary:
Add a new class SstFileTracker that will be notified whenever a DB add/delete/move and sst file, it will also replace DeleteScheduler
SstFileTracker can be used later to abort writes when we exceed a specific size

Test Plan: unit tests

Reviewers: rven, anthony, yhchiang, sdong

Reviewed By: sdong

Subscribers: igor, lovro, march, dhruba

Differential Revision: https://reviews.facebook.net/D50469
2016-01-28 18:35:01 -08:00
Andrew Kryczka 167bd8856d [directory includes cleanup] Finish removing util->db dependencies 2016-01-26 10:49:24 -08:00
Andrew Kryczka 46f9cd46af [directory includes cleanup] Move cross-function test points
Summary:
I split the db-specific test points out into a separate file under db/
directory. There were also a few bugs to fix in xfunc.{h,cc} that prevented it
from compiling previously; see https://reviews.facebook.net/D36825.

Test Plan:
compilation works now, below command works, will also run "make xfunc".

  $ make check ROCKSDB_XFUNC_TEST='managed_new' tests-regexp='DBTest' -j32

Reviewers: sdong, yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D53343
2016-01-26 10:49:05 -08:00
Nathan Bronson 7d87f02799 support for concurrent adds to memtable
Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations.  Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention.  Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.

Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off).  This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex.  If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided.  This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).

Parallel writes are not currently compatible with
inplace updates, update callbacks, or delete filtering.
Enable it with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield).  Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.

Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.

This diff was motivated and inspired by Yahoo's cLSM work.  It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.

My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive.  With 1
thread I get ~440Kops/sec.  Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads.  Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.

Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba

Differential Revision: https://reviews.facebook.net/D50589
2015-12-25 11:03:40 -08:00
Igor Canadi 64fa43843b Merge pull request #862 from ceph/wip-env
implement EnvMirror
2015-12-10 18:45:07 -08:00
Sage Weil 2074ddd625 env: add EnvMirror
This is an Env implementation that mirrors all storage-related methods on
two different backend Env's and verifies that they return the same
results (return status and read results).  This is useful for implementing
a new Env and verifying its correctness.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-12-10 21:32:45 -05:00
Javier González b2863017b1 Move posix threads into a library
Summary: This patch moves all posix thread logic to a separate library.
The motivation is to allow another environments to easily reuse posix
threads. HDFS wraps already posix threads; this split would simplify
this code.

Test Plan: No new functionality is added to posix Env or the threading
library, thus the current tests should suffice.
2015-12-07 12:03:38 +01:00
Nathan Bronson 78812ec6bf InlineSkipList - part 1/3
Summary:
This diff is 1/3 in a sequence that introduces a skip list optimized for
a key that is a freshly-allocated const char*.  The diff is broken into
pieces to make it easier to review.  This piece only introduces the new
type by copying the existing SkipList, with mechanical naming changes
and reformatting.

Test Plan: new unit test

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D51279
2015-11-24 14:30:22 -08:00
sdong c4ebb66d61 Not to build forward_iterator_bench now
Summary: forward_iterator_bench is not stable enough for build. Remove it for now.

Test Plan: Build it with both of CLANG and non-CLANG and make sure it builds.

Reviewers: rven, kradhakrishnan, anthony, yhchiang, igor, IslamAbdelRahman

Reviewed By: igor, IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D50991
2015-11-18 15:42:06 -08:00
Venkatesh Radhakrishnan 7824444bfc Reuse file iterators in tailing iterator when memtable is flushed
Summary:
Under a tailing workload, there were increased block cache
misses when a memtable was flushed because we were rebuilding iterators
in that case since the version set changed. This was exacerbated in the
case of iterate_upper_bound, since file iterators which were over the
iterate_upper_bound would have been deleted and are now brought back as
part of the Rebuild, only to be deleted again. We now renew the iterators
and only build iterators for files which are added and delete file
iterators for files which are deleted.
Refer to https://reviews.facebook.net/D50463 for previous version

Test Plan: DBTestTailingIterator.TailingIteratorTrimSeekToNext

Reviewers: anthony, IslamAbdelRahman, igor, tnovak, yhchiang, sdong

Reviewed By: sdong

Subscribers: yhchiang, march, dhruba, leveldb, lovro

Differential Revision: https://reviews.facebook.net/D50679
2015-11-13 15:50:59 -08:00
Yueh-Hsuan Chiang e11f676e34 Add OptionsUtil::LoadOptionsFromFile() API
Summary:
This patch adds OptionsUtil::LoadOptionsFromFile() and
OptionsUtil::LoadLatestOptionsFromDB(), which allow developers
to construct DBOptions and ColumnFamilyOptions from a RocksDB
options file.  Note that most pointer-typed options such as
merge_operator will not be constructed.

With this API, developers no longer need to remember all the
options in order to reopen an existing rocksdb instance like
the following:

  DBOptions db_options;
  std::vector<std::string> cf_names;
  std::vector<ColumnFamilyOptions> cf_opts;

  // Load primitive-typed options from an existing DB
  OptionsUtil::LoadLatestOptionsFromDB(
      dbname, &db_options, &cf_names, &cf_opts);

  // Initialize necessary pointer-typed options
  cf_opts[0].merge_operator.reset(new MyMergeOperator());
  ...

  // Construct the vector of ColumnFamilyDescriptor
  std::vector<ColumnFamilyDescriptor> cf_descs;
  for (size_t i = 0; i < cf_opts.size(); ++i) {
    cf_descs.emplace_back(cf_names[i], cf_opts[i]);
  }

  // Open the DB
  DB* db = nullptr;
  std::vector<ColumnFamilyHandle*> cf_handles;
  auto s = DB::Open(db_options, dbname, cf_descs,
                    &handles, &db);

Test Plan:
Augment existing tests in column_family_test
options_test
db_test

Reviewers: igor, IslamAbdelRahman, sdong, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D49095
2015-11-12 06:52:43 -08:00
Yueh-Hsuan Chiang e114f0abb8 Enable RocksDB to persist Options file.
Summary:
This patch allows rocksdb to persist options into a file on
DB::Open, SetOptions, and Create / Drop ColumnFamily.
Options files are created under the same directory as the rocksdb
instance.

In addition, this patch also adds a fail_if_missing_options_file in DBOptions
that makes any function call return non-ok status when it is not able to
persist options properly.

  // If true, then DB::Open / CreateColumnFamily / DropColumnFamily
  // / SetOptions will fail if options file is not detected or properly
  // persisted.
  //
  // DEFAULT: false
  bool fail_if_missing_options_file;

Options file names are formatted as OPTIONS-<number>, and RocksDB
will always keep the latest two options files.

Test Plan:
Add options_file_test.

options_test
column_family_test

Reviewers: igor, IslamAbdelRahman, sdong, anthony

Reviewed By: anthony

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48285
2015-11-10 22:58:01 -08:00
Nathan Bronson b81b430987 Switch to thread-local random for skiplist
Summary:
Using a TLS random instance for skiplist makes it smaller
(useful for hash_skiplist_rep) and prepares skiplist for concurrent
adds.  This diff also modifies the branching factor math to avoid an
unnecessary division.

This diff has the effect of changing the sequence of skip list node
height choices made by tests, so it has the potential to cause unit
test failures for tests that implicitly rely on the exact structure
of the skip list.  Tests that try to exactly trigger a compaction are
likely suspects for this problem (these tests have always been brittle to
changes in the skiplist details).  I've minimizes this risk by reseeding
the main thread's Random at the beginning of each test, increasing the
universal compaction size_ratio limit from 101% to 105% for some tests,
and verifying that the tests pass many times.

Test Plan: for i in `seq 0 9`; do make check; done

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50439
2015-11-09 19:25:22 -08:00
Yueh-Hsuan Chiang 183cadfc87 Add OptionsSanityCheckLevel
Summary:
This patch introduces OptionsSanityCheckLevel internally to enable
sanity check rocksdb options.

Utilities API will be added in the follow-up diffs.

Test Plan: Added more tests in options_test

Reviewers: igor, IslamAbdelRahman, sdong, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D49515
2015-11-04 18:53:30 -08:00
Yueh-Hsuan Chiang 7d7ee2b654 Add Memory Insight support to utilities
Summary:
This patch introduces utilities/memory, which currently includes
GetApproximateMemoryUsageByType that reports different types of
rocksdb memory usage given a list of input DBs.

The API also take care of the case where Cache could be shared
across multiple column families / multiple db instances.

Currently, it reports memory usage of memtable, table-readers
and cache.

Test Plan: utilities/memory/memory_test.cc

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D49257
2015-11-03 17:52:17 -08:00
Javier González 6e6dd5f6f9 Split posix storage backend into Env and library
Summary: This patch splits the posix storage backend into Env and
the actual *File implementations. The motivation is to allow other Envs
to use posix as a library. This enables a storage backend different from
posix to split its secondary storage between a normal file system
partition managed by posix, and it own media.

Test Plan: No new functionality is added to posix Env or the library,
thus the current tests should suffice.
2015-10-22 17:31:31 +02:00
Alexey Maykov e1a09a7703 Implementation for GetPropertiesOfTablesInRange
Summary: In MyRocks, it is sometimes important to get propeties only for the subset of the database. This diff implements the API in RocksDB.

Test Plan: ran the GetPropertiesOfTablesInRange

Reviewers: rven, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48651
2015-10-17 13:34:43 -07:00
Venkatesh Radhakrishnan a98fbacfa0 Moving memtable related files from util to a new directory memtable
Summary:
We are cleaning up dependencies.
This diff takes a first step at moving memtable files to their own
directory called memtable. In future diffs, we will move other memtable
files from db to memtable.

Test Plan: make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48915
2015-10-16 14:10:33 -07:00
Venkatesh Radhakrishnan 63e507c59c Move ldb and sst_dump from utils to tools.
Summary: As part of cleaning up dependencies for tech debt week, we are moving ldb and sst_dump tools from util to tools, since they are tools.

Test Plan: make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48747
2015-10-14 17:08:28 -07:00
Venkatesh Radhakrishnan e587dbe03a Move manual_compaction_test.cc from util to db
Summary: manual_compaction_test.cc incorrectly in util. Moved to db.

Test Plan: make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48687
2015-10-14 11:06:27 -07:00
Venkatesh Radhakrishnan f1fdf5205b Clean up dependency: Move db_test_util.* to db directory
Summary:
As part of tech debt week, we are cleaning up dependencies.
This diff moves db_test_util.[h,cc] from util to db directory.

Test Plan: make check

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48543
2015-10-12 13:05:42 -07:00
Islam AbdelRahman 9babaeed16 Update dump_tool and undump_tool to accept Options
Summary:
Refactor dump_tool and undump_tool so that it's possible to use them with customized options
for example setting a specific comparator similar to what Dragon is doing with the LdbTool

https://phabricator.fb.com/diffusion/FBCODE/browse/master/dragon/tools/Ldb.cpp

Test Plan:
compiles
used it to dump / undump a dragon shard

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, adsharma

Differential Revision: https://reviews.facebook.net/D47853
2015-10-05 19:49:48 -07:00
Evan Shaw 7a23e4d8ca New amalgamation target
This commit adds two new targets to the Makefile: rocksdb.cc and rocksdb.h

These files, when combined with the c.h header, are a self-contained RocksDB
source distribution called an amalgamation. (The name comes from SQLite's, which
is similar in concept.)

The main benefit of an amalgamation is that it's very easy to drop into a
new project. It also compiles faster compared to compiling individual source
files and potentially gives the compiler more opportunity to make optimizations
since it can see all functions at once.

rocksdb.cc and rocksdb.h are generated by a new script, amalgamate.py.
A detailed description of how amalgamate.py works is in a comment at the top of
the file.

There are also some small changes to existing files to enable the amalgamation:
* Use quotes for includes in unity build
* Fix an old header inclusion in util/xfunc.cc
* Move some includes outside ifdef in util/env_hdfs.cc
* Separate out tool sources in Makefile so they won't be included in unity.cc
* Unity build now produces a static library

Closes #733
2015-10-01 08:29:31 +13:00
Yueh-Hsuan Chiang 74b100ac17 RocksDB Options file format and its serialization / deserialization.
Summary:
This patch defines the format of RocksDB options file, which
follows the INI file format, and implements functions for its
serialization and deserialization.  An example RocksDB options
file can be found in examples/rocksdb_option_file_example.ini.

A typical RocksDB options file has three sections, which are
Version, DBOptions, and more than one CFOptions.  The RocksDB
options file in general follows the basic INI file format
with the following extensions / modifications:
 * Escaped characters
   We escaped the following characters:
    - \n -- line feed - new line
    - \r -- carriage return
    - \\ -- backslash \
    - \: -- colon symbol :
    - \# -- hash tag #
 * Comments
   We support # style comments.  Comments can appear at the ending
   part of a line.
 * Statements
   A statement is of the form option_name = value.
   Each statement contains a '=', where extra white-spaces
   are supported. However, we don't support multi-lined statement.
   Furthermore, each line can only contain at most one statement.
 * Section
   Sections are of the form [SecitonTitle "SectionArgument"],
   where section argument is optional.
 * List
   We use colon-separated string to represent a list.
   For instance, n1:n2:n3:n4 is a list containing four values.

Below is an example of a RocksDB options file:

[Version]
  rocksdb_version=4.0.0
  options_file_version=1.0
[DBOptions]
  max_open_files=12345
  max_background_flushes=301
[CFOptions "default"]
[CFOptions "the second column family"]
[CFOptions "the third column family"]

Test Plan: Added many tests in options_test.cc

Reviewers: igor, IslamAbdelRahman, sdong, anthony

Reviewed By: anthony

Subscribers: maykov, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46059
2015-09-29 14:42:40 -07:00
Assaf Sela 4805fa0eae Remove ldb HexToString method's usage of sscanf
Summary:
Fix hex2String performance issues by removing sscanf dependency.
Also fixed some edge case handling (odd length, bad input).

Test Plan: Created a test file which called old and new implementation, and validated results are the same. I'll paste results in the phabricator diff.

Reviewers: igor, rven, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong

Reviewed By: sdong

Subscribers: thatsafunnyname, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D46785
2015-09-23 14:25:46 -07:00
Islam AbdelRahman f03b5c987b Add experimental DB::AddFile() to plug sst files into empty DB
Summary:
This is an initial version of bulk load feature

This diff allow us to create sst files, and then bulk load them later, right now the restrictions for loading an sst file are
(1) Memtables are empty
(2) Added sst files have sequence number = 0, and existing values in database have sequence number = 0
(3) Added sst files values are not overlapping

Test Plan: unit testing

Reviewers: igor, ott, sdong

Reviewed By: sdong

Subscribers: leveldb, ott, dhruba

Differential Revision: https://reviews.facebook.net/D39081
2015-09-23 12:42:43 -07:00
Andres Noetzli 8aa1f15197 Refactored common code of Builder/CompactionJob out into a CompactionIterator
Summary:
Builder and CompactionJob share a lot of fairly complex code. This patch
refactors this code into a separate class, the CompactionIterator. Because the
shared code is fairly complex, this patch hopefully improves maintainability.
While there are is a lot of potential for further improvements, the patch is
intentionally pretty close to the original structure because the change is
already complex enough.

Test Plan: make clean all check && ./db_stress

Reviewers: rven, anthony, yhchiang, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46197
2015-09-10 14:35:25 -07:00
agiardullo 5e94f68f35 TransactionDB Custom Locking API
Summary:
Prototype of API to allow MyRocks to override default Mutex/CondVar used by transactions with their own implementations.  They would simply need to pass their own implementations of Mutex/CondVar to the templated TransactionDB::Open().

Default implementation of TransactionDBMutex/TransactionDBCondVar provided (but the code is not currently changed to use this).

Let me know if this API makes sense or if it should be changed

Test Plan: n/a

Reviewers: yhchiang, rven, igor, sdong, spetrunia

Reviewed By: spetrunia

Subscribers: maykov, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43761
2015-09-08 17:03:57 -07:00
agiardullo 77a28615ec Support static Status messages
Summary: Provide a way to specify a detailed static error message for a Status without incurring a memcpy.  Let me know what people think of this approach.

Test Plan: added simple test

Reviewers: igor, yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D44259
2015-08-31 16:13:29 -07:00
agiardullo 20d1e547d1 Common base class for transactions
Summary:
As I keep adding new features to transactions, I keep creating more duplicate code.  This diff cleans this up by creating a base implementation class for Transaction and OptimisticTransaction to inherit from.

The code in TransactionBase.h/.cc is all just copied from elsewhere.  The only entertaining part of this class worth looking at is the virtual TryLock method which allows OptimisticTransactions and Transactions to share the same common code for Put/Get/etc.

The rest of this diff is mostly red and easy on the eyes.

Test Plan: No functionality change.  existing tests pass.

Reviewers: sdong, jkedgar, rven, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45135
2015-08-24 19:09:43 -07:00
agiardullo c2f2cb0214 Pessimistic Transactions
Summary:
Initial implementation of Pessimistic Transactions.  This diff contains the api changes discussed in D38913.  This diff is pretty large, so let me know if people would prefer to meet up to discuss it.

MyRocks folks:  please take a look at the API in include/rocksdb/utilities/transaction[_db].h and let me know if you have any issues.

Also, you'll notice a couple of TODOs in the implementation of RollbackToSavePoint().  After chatting with Siying, I'm going to send out a separate diff for an alternate implementation of this feature that implements the rollback inside of WriteBatch/WriteBatchWithIndex.  We can then decide which route is preferable.

Next, I'm planning on doing some perf testing and then integrating this diff into MongoRocks for further testing.

Test Plan: Unit tests, db_bench parallel testing.

Reviewers: igor, rven, sdong, yhchiang, yoshinorim

Reviewed By: sdong

Subscribers: hermanlee4, maykov, spetrunia, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D40869
2015-08-11 17:52:23 -07:00
agiardullo 16ea1c7d1c simple ManagedSnapshot wrapper
Summary: Implemented this simple wrapper for something else I was working on.  Seemed like it makes sense to expose it instead of burying it in some random code.

Test Plan: added test

Reviewers: rven, kradhakrishnan, sdong, yhchiang

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43293
2015-08-06 17:59:05 -07:00
Poornima Chozhiyath Raman 960d936e83 Add function 'GetInfoLogList()'
Summary: The list of info log files of a db can be obtained using the new function.

Test Plan: New test in db_test.cc passed.

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: IslamAbdelRahman, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41715
2015-08-05 16:16:46 -07:00
sdong 7ccd1c80a7 Add two unit tests for SyncWAL()
Summary:
Add two unit tests for SyncWAL(). One makes sure SyncWAL() doesn't block writes in the other thread. Another one makes sure SyncWAL() doesn't wait ongoing writes to finish before being executed.

Create a new test file db_wal_test and move two WAL related tests from db_test to here.

Test Plan: Run the new tests

Reviewers: IslamAbdelRahman, rven, kradhakrishnan, kolmike, tnovak, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43605
2015-08-05 14:27:02 -07:00
Islam AbdelRahman c45a57b41e Support delete rate limiting
Summary:
Introduce DeleteScheduler that allow enforcing a rate limit on file deletion
Instead of deleting files immediately, files are moved to trash directory and deleted in a background thread that apply sleep penalty between deletes if needed.

I have updated PurgeObsoleteFiles and PurgeObsoleteWALFiles to use the delete_scheduler instead of env_->DeleteFile

Test Plan:
added delete_scheduler_test
existing unit tests

Reviewers: kradhakrishnan, anthony, rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43221
2015-08-04 20:45:27 -07:00
Yueh-Hsuan Chiang f39cbcb0a5 Merge pull request #654 from adamretter/remove-emptyvalue-compactionfilter
RemoveEmptyValueCompactionFilter
2015-08-04 16:54:14 -07:00
Yueh-Hsuan Chiang ce21afd205 Expose the BackupEngine from the Java API
Summary:
Merge pull request #665 by adamretter

Exposes BackupEngine from C++ to the Java API. Previously only BackupableDB was available

Test Plan: BackupEngineTest.java

Reviewers: fyrz, igor, ankgup87, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42873
2015-08-04 16:08:54 -07:00
Yueh-Hsuan Chiang 26894303c1 Add CompactOnDeletionCollector in utilities/table_properties_collectors.
Summary:
This diff adds CompactOnDeletionCollector in utilities/table_properties_collectors,
which applies a sliding window to a sst file and mark this file as need-compaction
when it observe enough deletion entries within the consecutive keys covered by
the sliding window.

Test Plan: compact_on_deletion_collector_test

Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yoshinorim, sdong

Reviewed By: sdong

Subscribers: maykov, dhruba

Differential Revision: https://reviews.facebook.net/D41175
2015-08-03 20:42:55 -07:00
Yueh-Hsuan Chiang 7219088cda Move general compaction tests from db_test.cc to db_compaction_test.cc
Summary: Move general compaction tests from db_test.cc to db_compaction_test.cc

Test Plan:
db_test
db_compaction_test

Reviewers: igor, sdong, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42651
2015-07-21 03:05:57 -07:00
Yueh-Hsuan Chiang 7462286d33 Move in-place-update related tests from db_test.cc to db_inplace_update_test.cc
Summary: Move in-place-update related tests from db_test.cc to db_inplace_update_test.cc

Test Plan:
db_test
db_inplace_update_test

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42657
2015-07-20 16:05:28 -07:00
sdong 6e9fbeb27c Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.

Test Plan: Run all existing unit tests.

Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D42321
2015-07-17 16:58:18 -07:00
Adam Retter 91bf1b80ef Java facility to use the RemoveEmptyValueCompactionFilter 2015-07-16 11:50:11 +01:00
Adam Retter 3d00271e40 The ability to specify a compaction filter via the Java API 2015-07-16 11:50:10 +01:00
Adam Retter 62dec0e2b7 RemoveEmptyValueCompactionFilter - A compaction filter which removes entries which have an empty value 2015-07-16 11:50:10 +01:00
agiardullo 81d072623c move convenience.h out of utilities
Summary: Moved convenience.h out of utilities to remove a dependency on utilities in db.

Test Plan: unit tests.  Also compiled a link to the old location to verify the _Pragma works.

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42201
2015-07-15 14:51:51 -07:00
Yueh-Hsuan Chiang 801df912af Move UniversalCompaction related db-tests to db_universal_compaction_test.cc
Summary: Move UniversalCompaction related db-tests to db_universal_compaction_test.cc

Test Plan:
db_test
db_universal_compaction_test

Reviewers: igor, sdong, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42225
2015-07-14 18:24:45 -07:00
Yueh-Hsuan Chiang 3ca6b2541e Move TailingIterator tests from db_test.cc to db_test_tailing_iterator.cc
Summary: Move TailingIterator tests from db_test.cc to db_test_tailing_iterator.cc

Test Plan:
db_test
db_test_tailing_iterator

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42021
2015-07-14 16:41:08 -07:00
Yueh-Hsuan Chiang ce829c77e3 Make TransactionLogIterator related tests from db_test.cc to db_log_iter_test.cc
Summary: Make TransactionLogIterator related tests from db_test.cc to db_log_iter_test.cc

Test Plan:
db_test
db_log_iter_test

Reviewers: sdong, IslamAbdelRahman, igor, anthony

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42045
2015-07-14 16:08:21 -07:00
Yueh-Hsuan Chiang c3f98bb89b Move CompactionFilter tests in db_test.cc to db_compaction_filter_test.cc
Summary: Move CompactionFilter tests in db_test.cc to db_compaction_filter_test.cc

Test Plan:
db_test
db_compaction_filter_test

Reviewers: igor, sdong, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42207
2015-07-14 16:03:47 -07:00
agiardullo 18d5e1bf88 Remove db_impl_readonly dependency on utilities
Summary: Seems like the cleanest way to resolve this is to move CompactedDBImpl into db/.  CompactedDBImpl should probably live in the same place as DBImplReadonly since Opening the latter could end up instantiating the former.  Both DBImplReadonly and CompactedDBImpl inherit from DBImpl access protected members of DBImpl( and the latter access friendly private methods).

Test Plan: unit tests

Reviewers: sdong, yhchiang, kradhakrishnan, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42027
2015-07-14 11:32:54 -07:00
Yueh-Hsuan Chiang b10cf4e2e0 Move DynamicLevel related db-tests to db_dynamic_level_test.cc
Summary: Move DynamicLevel related db-tests to db_dynamic_level_test.cc

Test Plan:
db_dynamic_level_test
db_test

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42039
2015-07-13 19:00:30 -07:00
Yueh-Hsuan Chiang 625467a08a Move reusable part of db_test.cc to util/db_test_util.h
Summary:
Move reusable part of db_test.cc to util/db_test_util.h.
This makes it more possible to partition db_test.cc into
multiple smaller test files.

Also, fixed many old lint errors in db_test.

Test Plan: db_test

Reviewers: igor, anthony, IslamAbdelRahman, sdong, kradhakrishnan

Reviewed By: sdong, kradhakrishnan

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41973
2015-07-13 16:53:38 -07:00
Venkatesh Radhakrishnan 04251e1e3a Add wal files to Checkpoint for multiple column families.
Summary:
When there are multiple column families, the flush in
GetLiveFiles is not atomic, so that there are entries in the wal files
which are needed to get a consisten RocksDB. We now add the log files to
the checkpoint.

Test Plan:
CheckpointCF - This test forces more data to be written to
the other column families after the flush of the first column family but
before the second.

Reviewers: igor, yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40323
2015-06-19 16:08:31 -07:00
Yueh-Hsuan Chiang fe5c6321cb Allow EventListener::OnCompactionCompleted to return CompactionJobStats.
Summary:
Allow EventListener::OnCompactionCompleted to return CompactionJobStats,
which contains useful information about a compaction.

Example CompactionJobStats returned by OnCompactionCompleted():
    smallest_output_key_prefix 05000000
    largest_output_key_prefix 06990000
    elapsed_time 42419
    num_input_records 300
    num_input_files 3
    num_input_files_at_output_level 2
    num_output_records 200
    num_output_files 1
    actual_bytes_input 167200
    actual_bytes_output 110688
    total_input_raw_key_bytes 5400
    total_input_raw_value_bytes 300000
    num_records_replaced 100
    is_manual_compaction 1

Test Plan: Developed a mega test in db_test which covers 20 variables in CompactionJobStats.

Reviewers: rven, igor, anthony, sdong

Reviewed By: sdong

Subscribers: tnovak, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38463
2015-06-02 17:07:16 -07:00
Mike Kolupaev ec7a944360 more times in perf_context and iostats_context
Summary:
We occasionally get write stalls (>1s Write() calls) on HDD under read load. The following timers explain almost all of the stalls:
 - perf_context.db_mutex_lock_nanos
 - perf_context.db_condition_wait_nanos
 - iostats_context.open_time
 - iostats_context.allocate_time
 - iostats_context.write_time
 - iostats_context.range_sync_time
 - iostats_context.logger_time

In my experiments each of these occasionally takes >1s on write path under some workload. There are rare cases when Write() takes long but none of these takes long.

Test Plan: Added code to our application to write the listed timings to log for slow writes. They usually add up to almost exactly the time Write() call took.

Reviewers: rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: march, dhruba, tnovak

Differential Revision: https://reviews.facebook.net/D39177
2015-06-02 02:07:58 -07:00
agiardullo dc9d70de65 Optimistic Transactions
Summary: Optimistic transactions supporting begin/commit/rollback semantics.  Currently relies on checking the memtable to determine if there are any collisions at commit time.  Not yet implemented would be a way of enuring the memtable has some minimum amount of history so that we won't fail to commit when the memtable is empty.  You should probably start with transaction.h to get an overview of what is currently supported.

Test Plan: Added a new test, but still need to look into stress testing.

Reviewers: yhchiang, igor, rven, sdong

Reviewed By: sdong

Subscribers: adamretter, MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D33435
2015-05-29 14:36:35 -07:00
Yueh-Hsuan Chiang ec4ff4e99c Rename EventLoggerHelpers EventHelpers
Summary:
Rename EventLoggerHelpers EventHelpers, as it's going to include
all event-related helper functions instead of EventLogger only stuffs.

Test Plan: make

Reviewers: sdong, rven, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39093
2015-05-28 13:37:47 -07:00
Igor Canadi dbd95b7532 Add more table properties to EventLogger
Summary:
Example output:

    {"time_micros": 1431463794310521, "job": 353, "event": "table_file_creation", "file_number": 387, "file_size": 86937, "table_info": {"data_size": "81801", "index_size": "9751", "filter_size": "0", "raw_key_size": "23448", "raw_average_key_size": "24.000000", "raw_value_size": "990571", "raw_average_value_size": "1013.890481", "num_data_blocks": "245", "num_entries": "977", "filter_policy_name": "", "kDeletedKeys": "0"}}

Also fixed a bug where BuildTable() in recovery was passing Env::IOHigh argument into paranoid_checks_file parameter.

Test Plan: make check + check out the output in the log

Reviewers: sdong, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38343
2015-05-12 15:53:55 -07:00
agiardullo 711465ccec API to fetch from both a WriteBatchWithIndex and the db
Summary:
Added a couple functions to WriteBatchWithIndex to make it easier to query the value of a key including reading pending writes from a batch.  (This is needed for transactions).

I created write_batch_with_index_internal.h to use to store an internal-only helper function since there wasn't a good place in the existing class hierarchy to store this function (and it didn't seem right to stick this function inside WriteBatchInternal::Rep).

Since I needed to access the WriteBatchEntryComparator, I moved some helper classes from write_batch_with_index.cc into write_batch_with_index_internal.h/.cc.  WriteBatchIndexEntry, ReadableWriteBatch, and WriteBatchEntryComparator are all unchanged (just moved to a different file(s)).

Test Plan: Added new unit tests.

Reviewers: rven, yhchiang, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38037
2015-05-11 14:51:51 -07:00
Igor Canadi 6059bdf86a Add experimental API MarkForCompaction()
Summary:
Some Mongo+Rocks datasets in Parse's environment are not doing compactions very frequently. During the quiet period (with no IO), we'd like to schedule compactions so that our reads become faster. Also, aggressively compacting during quiet periods helps when write bursts happen. In addition, we also want to compact files that are containing deleted key ranges (like old oplog keys).

All of this is currently not possible with CompactRange() because it's single-threaded and blocks all other compactions from happening. Running CompactRange() risks an issue of blocking writes because we generate too much Level 0 files before the compaction is over. Stopping writes is very dangerous because they hold transaction locks. We tried running manual compaction once on Mongo+Rocks and everything fell apart.

MarkForCompaction() solves all of those problems. This is very light-weight manual compaction. It is lower priority than automatic compactions, which means it shouldn't interfere with background process keeping the LSM tree clean. However, if no automatic compactions need to be run (or we have extra background threads available), we will start compacting files that are marked for compaction.

Test Plan: added a new unit test

Reviewers: yhchiang, rven, MarkCallaghan, sdong

Reviewed By: sdong

Subscribers: yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D37083
2015-04-17 16:44:45 -07:00
Venkatesh Radhakrishnan 0a0501c8d5 Add Xfunc to makefile
Summary: Make target for running all xfunc tests

Test Plan: make xfunc

Reviewers: igor, sdong, meyering

Reviewed By: meyering

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D36873
2015-04-13 14:21:32 -07:00
Jim Meyering cba5920011 build: don't use a glob for java/rocksjni/*
Summary:
* src.mk (JNI_NATIVE_SOURCES): New variable, so we don't have to use
a glob in Makefile
* Makefile (JNI_NATIVE_SOURCES): Remove glob-using definition, now
that the explicit list of sources is in src.mk.

Test Plan:
  Run this:
    JAVA_HOME=/usr/local/jdk-7u67-64 PATH=$JAVA_HOME/bin:$PATH \
      make rocksdbjava

Reviewers: yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D36633
2015-04-07 15:19:25 -07:00
Igor Canadi 2511b7d947 Makefile minor cleanup
Summary:
Just couple of small changes:
1. removed signal_test, since it doesn't seem useful and we don't even run it as part of `make check`
2. moved perf_context_test to TESTS instead of PROGRAMS
3. `make release` probably shouldn't compile benchmarks. We currently rely on `make release` building db_bench (via Jenkins), so I left db_bench there.

This is just a minor cleanup. We need to rethink our targets since they are a bit messy right now. We can do this during our tech debt week.

Test Plan: make release

Reviewers: anthony, rven, yhchiang, sdong, meyering

Reviewed By: meyering

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D36171
2015-03-30 16:05:35 -04:00
Igor Canadi d61cb0b9de db_bench can now disable flashcache for background threads
Summary: Most of the approach is copied from WebSQL's MySQL branch. It's nice that we can do this without touching core RocksDB code.

Test Plan: Compiles and runs. Didn't test flashback code, as I don't have flashback device and most if it is c/p

Reviewers: MarkCallaghan, sdong

Reviewed By: sdong

Subscribers: rven, lgalanis, kradhakrishnan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D35391
2015-03-30 09:51:11 -07:00
agiardullo 81345b90f9 Create an abstract interface for write batches
Summary: WriteBatch and WriteBatchWithIndex now both inherit from a common abstract base class.  This makes it easier to write code that is agnostic toward the implementation of the particular write batch.  In particular, I plan on utilizing this abstraction to allow transactions to support using either implementation of a write batch.

Test Plan: modified existing WriteBatchWithIndex tests to test new functions.  Running all tests.

Reviewers: igor, rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D34017
2015-03-17 19:23:08 -07:00
Igor Sugak a7aba2ef6b rocksdb: Add gtest
Summary:
Adds gtest fused source code into `third-party` directory. No manual changes.

gtest latest released 1.7 has clang dev compilation errors. Trunk version requires only one disabled warning (-Wno-missing-field-initializers)

Fused code is made as described here https://fburl.com/90806322
Details about why we need gtest source code instead of precompiled library https://fburl.com/90805763
Source used from http://googletest.googlecode.com/svn/trunk

Test Plan:
Build and notice no errors. Also check in logs that gtest-all.o being compiled gtest-all.o.
```lang=bash
% USE_CLANG=1 make all
```

Reviewers: lgalanis, yufei.zhu, rven, sdong, igor, meyering

Reviewed By: meyering

Subscribers: meyering, yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D33345
2015-03-16 18:27:30 -07:00
Igor Canadi 52d8347a91 EventLogger
Summary:
Here's my proposal for making our LOGs easier to read by machines.

The idea is to dump all events as JSON objects. JSON is easy to read by humans, but more importantly, it's easy to read by machines. That way, we can parse this, load into SQLite/mongo and then query or visualize.

I started with table_create and table_delete events, but if everybody agrees, I'll continue by adding more events (flush/compaction/etc etc)

Test Plan:
Ran db_bench. Observed:
2015/01/15-14:13:25.788019 1105ef000 EVENT_LOG_v1 {"time_micros": 1421360005788015, "event": "table_file_creation", "file_number": 12, "file_size": 1909699}
2015/01/15-14:13:25.956500 110740000 EVENT_LOG_v1 {"time_micros": 1421360005956498, "event": "table_file_deletion", "file_number": 12}

Reviewers: yhchiang, rven, dhruba, MarkCallaghan, lgalanis, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D31647
2015-03-13 10:15:54 -07:00
Venkatesh Radhakrishnan 05d92efa75 Add convenience.cc to src.mk
Summary:
The build process now requires new source files to be added to
src.mk. Adding convenience.cc to src.mk

Test Plan: Build rocksdb

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D34815
2015-03-11 12:39:53 -07:00
Jim Meyering 34c75e984b fix-up patch: avoid new link error
Summary:
* src.mk (LIB_SOURCES): Add this file:
utilities/document/json_document_builder.cc
to avoid link errors.

Blame Rev: D33849

Test Plan: run "make"

Reviewers: ljin, igor.sugak, rven, sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D34593
2015-03-06 11:15:18 -08:00
Jim Meyering ebc647de87 build: fix missing dependency problems
Summary:
Any time one would modify a dependent of any *test*.cc file,
"make" would fail to rebuild the affected test binaries,
e.g., db_test.  That was due to the fact that we deliberately
excluded those test-related files from the definition of SOURCES
and only $(SOURCES) was used to create the automatically-generated
.d dependency files.  The fix is to generate a .d file for every
source file.
* src.mk: New file.  Defines LIB_SOURCES, MOCK_SOURCES
and TEST_BENCH_SOURCES.
* Makefile: Include src.mk.
Reflect s/SOURCES/LIB_SOURCES/ renaming.
* build_tools/build_detect_platform: Remove the code
that was used to generate SOURCES= and MOCK_SOURCES=
definitions in make_config.mk. Those lists of files
are now hard-coded in src.mk. Hard-coding this list of
sources is desirable, because without that, one risks
including stray .cc files in a build.  Not reproducible.

Test Plan:
Touch a file used by db_test's dependent .o files and ensure that
they are all recompiled.  Before, none would be:

  $ touch db/db_impl.h && make db_test
    CC       db/db_test.o
    CC       db/column_family.o
    CC       db/db_filesnapshot.o
    CC       db/db_impl.o
    CC       db/db_impl_debug.o
    CC       db/db_impl_readonly.o
    CC       db/forward_iterator.o
    CC       db/internal_stats.o
    CC       db/managed_iterator.o
    CC       db/repair.o
    CC       db/write_batch.o
    CC       utilities/compacted_db/compacted_db_impl.o
    CC       utilities/ttl/db_ttl_impl.o
    CC       util/ldb_cmd.o
    CC       util/ldb_tool.o
    CC       util/sst_dump_tool.o
    CC       util/xfunc.o
    CCLD     db_test

Reviewers: ljin, igor.sugak, igor, rven, sdong

Reviewed By: sdong

Subscribers: yhchiang, adamretter, fyrz, dhruba

Differential Revision: https://reviews.facebook.net/D33849
2015-03-06 10:55:11 -08:00