Commit Graph

234 Commits

Author SHA1 Message Date
Yueh-Hsuan Chiang d781da8164 Add CheckOptionsCompatibility() API to options_util
Summary:
Add CheckOptionsCompatibility() API to options_util that returns
Status::OK if the input DBOptions and ColumnFamilyDescriptors
are compatible with the latest options stored in the specified DB path.

Test Plan: Added tests in options_util_test

Reviewers: igor, anthony, IslamAbdelRahman, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50649
2015-11-12 16:52:51 -08:00
Yueh-Hsuan Chiang e11f676e34 Add OptionsUtil::LoadOptionsFromFile() API
Summary:
This patch adds OptionsUtil::LoadOptionsFromFile() and
OptionsUtil::LoadLatestOptionsFromDB(), which allow developers
to construct DBOptions and ColumnFamilyOptions from a RocksDB
options file.  Note that most pointer-typed options such as
merge_operator will not be constructed.

With this API, developers no longer need to remember all the
options in order to reopen an existing rocksdb instance like
the following:

  DBOptions db_options;
  std::vector<std::string> cf_names;
  std::vector<ColumnFamilyOptions> cf_opts;

  // Load primitive-typed options from an existing DB
  OptionsUtil::LoadLatestOptionsFromDB(
      dbname, &db_options, &cf_names, &cf_opts);

  // Initialize necessary pointer-typed options
  cf_opts[0].merge_operator.reset(new MyMergeOperator());
  ...

  // Construct the vector of ColumnFamilyDescriptor
  std::vector<ColumnFamilyDescriptor> cf_descs;
  for (size_t i = 0; i < cf_opts.size(); ++i) {
    cf_descs.emplace_back(cf_names[i], cf_opts[i]);
  }

  // Open the DB
  DB* db = nullptr;
  std::vector<ColumnFamilyHandle*> cf_handles;
  auto s = DB::Open(db_options, dbname, cf_descs,
                    &handles, &db);

Test Plan:
Augment existing tests in column_family_test
options_test
db_test

Reviewers: igor, IslamAbdelRahman, sdong, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D49095
2015-11-12 06:52:43 -08:00
Yueh-Hsuan Chiang e114f0abb8 Enable RocksDB to persist Options file.
Summary:
This patch allows rocksdb to persist options into a file on
DB::Open, SetOptions, and Create / Drop ColumnFamily.
Options files are created under the same directory as the rocksdb
instance.

In addition, this patch also adds a fail_if_missing_options_file in DBOptions
that makes any function call return non-ok status when it is not able to
persist options properly.

  // If true, then DB::Open / CreateColumnFamily / DropColumnFamily
  // / SetOptions will fail if options file is not detected or properly
  // persisted.
  //
  // DEFAULT: false
  bool fail_if_missing_options_file;

Options file names are formatted as OPTIONS-<number>, and RocksDB
will always keep the latest two options files.

Test Plan:
Add options_file_test.

options_test
column_family_test

Reviewers: igor, IslamAbdelRahman, sdong, anthony

Reviewed By: anthony

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48285
2015-11-10 22:58:01 -08:00
Yueh-Hsuan Chiang db3f5e4948 Update HISTORY.md
Summary:
Move some of the items in public API changes section to to new
features section.

Test Plan: no code change.

Reviewers: igor, sdong, anthony, IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50229
2015-11-04 23:52:09 -08:00
Yueh-Hsuan Chiang 7d7ee2b654 Add Memory Insight support to utilities
Summary:
This patch introduces utilities/memory, which currently includes
GetApproximateMemoryUsageByType that reports different types of
rocksdb memory usage given a list of input DBs.

The API also take care of the case where Cache could be shared
across multiple column families / multiple db instances.

Currently, it reports memory usage of memtable, table-readers
and cache.

Test Plan: utilities/memory/memory_test.cc

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D49257
2015-11-03 17:52:17 -08:00
Yueh-Hsuan Chiang 3ecbab0040 Add GetAggregatedIntProperty(): returns the aggregated value from all CFs
Summary:
This patch adds GetAggregatedIntProperty() that returns the aggregated
value from all CFs

Test Plan: Added a test in db_test

Reviewers: igor, sdong, anthony, IslamAbdelRahman, rven

Reviewed By: rven

Subscribers: rven, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D49497
2015-11-03 15:54:18 -08:00
Islam AbdelRahman 2872e0c8c2 Clean and expose CreateLoggerFromOptions
Summary:
CreateLoggerFromOptions have some parameters like  db_log_dir and env, these parameters are redundant since they already exist in DBOptions

this patch remove the redundant parameters and expose CreateLoggerFromOptions to users

Test Plan: make check

Reviewers: igor, anthony, yhchiang, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, hermanlee4

Differential Revision: https://reviews.facebook.net/D49713
2015-10-29 18:07:37 -07:00
sdong 666fb5df43 Remove DefaultCompactionFilterFactory.
Summary: DefaultCompactionFilterFactory is not used anymore after recent changes. Remove it.

Test Plan: Just run existing tests.

Reviewers: rven, igor, yhchiang, kradhakrishnan, anthony, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48699
2015-10-14 11:02:30 -07:00
sdong 776bd8d5eb Pass column family ID to table property collector
Summary: Pass column family ID through TablePropertiesCollectorFactory::CreateTablePropertiesCollector() so that users can identify which column family this file is for and handle it differently.

Test Plan: Add unit test scenarios in tests related to table properties collectors to verify the information passed in is correct.

Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48411
2015-10-09 14:36:51 -07:00
sdong b77eb16aba New Manifest format to allow customized fields in NewFile.
Summary: With this commit, we add a new format in manifest when adding a new file. Now path ID and need-compaction hint are first two customized fields.

Test Plan: Add a test case in version_edit_test to verify the encoding and decoding logic. Add a unit test in db_test to verify need compaction is persistent after DB restarting.

Reviewers: kradhakrishnan, anthony, IslamAbdelRahman, yhchiang, rven, igor

Reviewed By: igor

Subscribers: javigon, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48123
2015-10-08 15:51:45 -07:00
sdong 000836a880 CompactionFilter::Context to contain column family ID
Summary: Add the column family ID to compaction filter context, so it is easier for compaction filter to apply different logic for different column families.

Test Plan: Add a unit test to verify the column family ID passed is correct.

Reviewers: rven, yhchiang, igor

Reviewed By: igor

Subscribers: yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48357
2015-10-08 11:27:38 -07:00
Yueh-Hsuan Chiang 3a0bf873b5 Change RocksDB version to 4.1
Summary: Change RocksDB version to 4.1

Test Plan: no code change.

Reviewers: sdong, anthony, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48387
2015-10-08 11:15:18 -07:00
Igor Canadi 9803e0d813 compaction_filter.h cleanup
Summary:
Two changes:
1. remove *V2 filter stuff. we deprecated that a while ago
2. clarify what happens when user sets max_subcompactions to bigger than 1

Test Plan: none

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47871
2015-10-08 09:32:50 -07:00
Igor Canadi d80ce7f99a Compaction filter on merge operands
Summary:
Since Andres' internship is over, I took over https://reviews.facebook.net/D42555 and rebased and simplified it a bit.

The behavior in this diff is a bit simpler than in D42555:
* only merge operators are passed through FilterMergeValue(). If fitler function returns true, the merge operator is ignored
* compaction filter is *not* called on: 1) results of merge operations and 2) base values that are getting merged with merge operands (the second case was also true in previous diff)

Do we also need a compaction filter to get called on merge results?

Test Plan: make && make check

Reviewers: lovro, tnovak, rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: noetzli, kolmike, leveldb, dhruba, sdong

Differential Revision: https://reviews.facebook.net/D47847
2015-10-07 09:30:03 -07:00
dyniusz 0267502655 Support for LevelDB SST with .ldb suffix
Summary:
	Handle SST files with both ".sst" and ".ldb" suffix.
	This enables user to migrate from leveldb to rocksdb.

Test Plan:
        Added unit test with DB operating on SSTs with names schema.
        See db/dc_test.cc:SSTsWithLdbSuffixHandling for details

Reviewers: yhchiang, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48003
2015-10-06 17:46:22 -07:00
Islam AbdelRahman f03b5c987b Add experimental DB::AddFile() to plug sst files into empty DB
Summary:
This is an initial version of bulk load feature

This diff allow us to create sst files, and then bulk load them later, right now the restrictions for loading an sst file are
(1) Memtables are empty
(2) Added sst files have sequence number = 0, and existing values in database have sequence number = 0
(3) Added sst files values are not overlapping

Test Plan: unit testing

Reviewers: igor, ott, sdong

Reviewed By: sdong

Subscribers: leveldb, ott, dhruba

Differential Revision: https://reviews.facebook.net/D39081
2015-09-23 12:42:43 -07:00
Andres Noetzli 014fd55adc Support for SingleDelete()
Summary:
This patch fixes #7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).

In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.

Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.

Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
  deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
  this patch supported this so it could be resurrected if needed)

Test Plan: make all check

Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor

Reviewed By: igor

Subscribers: maykov, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43179
2015-09-17 11:42:56 -07:00
Islam AbdelRahman f3f2032c46 Release RocksDB 4.0.0
Summary: Release RocksDB 4.0.0

Test Plan: no test

Reviewers: sdong, yhchiang, anthony, rven, kradhakrishnan, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D46515
2015-09-09 16:01:03 -07:00
agiardullo 25dbc579f5 Update HISTORY file for transactions
Summary: Move line about pessimistic transactions to the correct place in the history file and no longer refer to it as 'experimental'

Test Plan: n/a

Reviewers: igor, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46449
2015-09-09 13:27:19 -07:00
Andres Noetzli 6bdc484fd8 Added Equal method to Comparator interface
Summary:
In some cases, equality comparisons can be done more efficiently than three-way
comparisons. There are quite a few places in the code where we only care about
equality. This patch adds an Equal() method that defaults to using the
Compare() method.

Test Plan: make clean all check

Reviewers: rven, anthony, yhchiang, igor, sdong

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46233
2015-09-08 15:30:49 -07:00
Igor Canadi b42cd6bed5 Remove the need for LATEST_BACKUP in BackupEngine
Summary:
In the first implementation of BackupEngine, LATEST_BACKUP was the commit point. The backup became committed after the write to LATEST_BACKUP completed.

However, we can avoid the need for LATEST_BACKUP. Instead of write to LATEST_BACKUP, the commit point can be the rename from `meta/<backup_id>.tmp` to `meta/<backup_id>`. Once we see that there exists a file `meta/<backup_id>` (without tmp), we can assume that backup is valid.

In this diff, we still write out the file LATEST_BACKUP. We need to do this so that we can maintain backward compatibility. However, the new version doesn't depend on this file anymore. We get the latest backup by `ls`-ing `meta` directory.

This diff depends on D41925

Test Plan: Adjusted backupable_db_test to this new behavior

Reviewers: benj, yhchiang, sdong, AaronFeldman

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42069
2015-09-02 11:49:49 -07:00
sdong 7a0dbdf3ac Add ZSTD (not final format) compression type
Summary: Add ZSTD compression type. The same way as adding LZ4.

Test Plan: run all tests. Generate files in db_bench. Make sure reads succeed. But the SST files cannot be opened in older versions. Also some other adhoc tests.

Reviewers: rven, anthony, IslamAbdelRahman, kradhakrishnan, igor

Reviewed By: igor

Subscribers: MarkCallaghan, maykov, yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D45747
2015-08-28 11:01:13 -07:00
Ari Ekmekji 2f8d71ec05 Moving sequence number compaction variables from SubCompactionState to CompactionJob
Summary:
It was pointed out to me that the members of SubCompactionState
'earliest_snapshot', 'latest_snapshot' and 'visible_at_tip' are never
modified by the subcompactions, so they can stay as global varaibles
instead to make things simpler.

Test Plan: make all && make check

Reviewers: sdong, igor, noetzli, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D45477
2015-08-25 14:03:10 -07:00
Yueh-Hsuan Chiang 6996de87af Expose per-level aggregated table properties via GetProperty()
Summary:
This patch adds "rocksdb.aggregated-table-properties"
and "rocksdb.aggregated-table-properties-at-levelN", the former
returns the aggreated table properties of a column family,
while the later returns the aggregated table properties
of the specified level N.

Test Plan: Added tests in db_test

Reviewers: igor, sdong, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45087
2015-08-25 12:03:54 -07:00
Ari Ekmekji b6def58f73 Changed 'num_subcompactions' to the more accurate 'max_subcompactions'
Summary:
Up until this point we had DbOptions.num_subcompactions, but
it is semantically more correct to call this max_subcompactions since
we will schedule *up to* DbOptions.max_subcompactions smaller compactions
at a time during a compaction job.

I also added a --subcompactions option to db_bench

Test Plan: make all   make check

Reviewers: sdong, igor, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D45069
2015-08-21 14:25:34 -07:00
Islam AbdelRahman 3bd9db420e [Cleanup] Remove RandomRWFile
Summary: RandomRWFile is not used anywhere in out code base, this patch remove RandomRWFile

Test Plan:
make check -j64
USE_CLANG=1 make all -j64
OPT=-DROCKSDB_LITE make release -j64

Reviewers: sdong, yhchiang, anthony, kradhakrishnan, rven, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D44091
2015-08-12 10:18:59 -07:00
agiardullo c2f2cb0214 Pessimistic Transactions
Summary:
Initial implementation of Pessimistic Transactions.  This diff contains the api changes discussed in D38913.  This diff is pretty large, so let me know if people would prefer to meet up to discuss it.

MyRocks folks:  please take a look at the API in include/rocksdb/utilities/transaction[_db].h and let me know if you have any issues.

Also, you'll notice a couple of TODOs in the implementation of RollbackToSavePoint().  After chatting with Siying, I'm going to send out a separate diff for an alternate implementation of this feature that implements the rollback inside of WriteBatch/WriteBatchWithIndex.  We can then decide which route is preferable.

Next, I'm planning on doing some perf testing and then integrating this diff into MongoRocks for further testing.

Test Plan: Unit tests, db_bench parallel testing.

Reviewers: igor, rven, sdong, yhchiang, yoshinorim

Reviewed By: sdong

Subscribers: hermanlee4, maykov, spetrunia, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D40869
2015-08-11 17:52:23 -07:00
sdong 254c4fb88f In HISTORY.md Switch unreleased notes to 3.13
Summary: Prepare release notes for 3.13.

Test Plan: Text only

Reviewers: yhchiang, IslamAbdelRahman, rven, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43665
2015-08-06 14:02:44 -07:00
sdong c7742452eb Add Statistics.getHistogramString() to print more detailed outputs of a histogram
Summary:
Provide a way for users to know more detailed ditribution of a histogram metrics. Example outputs:

Manually add statement
  fprintf(stdout, "%s\n", dbstats->getHistogramString(SST_READ_MICROS).c_str());
Will print out something like:

Count: 989151  Average: 1.7659  StdDev: 1.52
Min: 0.0000  Median: 1.2071  Max: 860.0000
Percentiles: P50: 1.21 P75: 1.70 P99: 5.12 P99.9: 13.67 P99.99: 21.70
------------------------------------------------------
[       0,       1 )   390839  39.513%  39.513% ########
[       1,       2 )   500918  50.641%  90.154% ##########
[       2,       3 )    79358   8.023%  98.177% ##
[       3,       4 )     6297   0.637%  98.813%
[       4,       5 )     1712   0.173%  98.986%
[       5,       6 )     1134   0.115%  99.101%
[       6,       7 )     1222   0.124%  99.224%
[       7,       8 )     1529   0.155%  99.379%
[       8,       9 )     1264   0.128%  99.507%
[       9,      10 )      988   0.100%  99.607%
[      10,      12 )     1378   0.139%  99.746%
[      12,      14 )     1828   0.185%  99.931%
[      14,      16 )      410   0.041%  99.972%
[      16,      18 )       72   0.007%  99.980%
[      18,      20 )       67   0.007%  99.986%
[      20,      25 )      106   0.011%  99.997%
[      25,      30 )       24   0.002%  99.999%
[      30,      35 )        1   0.000% 100.000%
[     250,     300 )        2   0.000% 100.000%
[     300,     350 )        1   0.000% 100.000%
[     800,     900 )        1   0.000% 100.000%

Test Plan: Manually add a print in db_bench and make sure it prints out as expected. Will add some codes to cover the function

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43611
2015-08-05 20:05:56 -07:00
Mike Kolupaev e06cf1a098 [wal changes 3/3] method in DB to sync WAL without blocking writers
Summary:
Subj. We really need this feature.

Previous diff D40899 has most of the changes to make this possible, this diff just adds the method.

Test Plan: `make check`, the new test fails without this diff; ran with ASAN, TSAN and valgrind.

Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, tnovak, yhchiang, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, maykov, hermanlee4, yoshinorim, tnovak, dhruba

Differential Revision: https://reviews.facebook.net/D40905
2015-08-05 06:06:39 -07:00
Islam AbdelRahman c45a57b41e Support delete rate limiting
Summary:
Introduce DeleteScheduler that allow enforcing a rate limit on file deletion
Instead of deleting files immediately, files are moved to trash directory and deleted in a background thread that apply sleep penalty between deletes if needed.

I have updated PurgeObsoleteFiles and PurgeObsoleteWALFiles to use the delete_scheduler instead of env_->DeleteFile

Test Plan:
added delete_scheduler_test
existing unit tests

Reviewers: kradhakrishnan, anthony, rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43221
2015-08-04 20:45:27 -07:00
Yueh-Hsuan Chiang 3424eeb1ed Polish HISTORY.md
Summary: Polish HISTORY.md

Test Plan: no code change.

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43527
2015-08-04 15:04:19 -07:00
Yueh-Hsuan Chiang 24daff6d7a Fix a typo and update HISTORY.md for NewCompactOnDeletionCollectorFactory().
Summary: Fix a typo and update HISTORY.md for NewCompactOnDeletionCollectorFactory().

Test Plan: no code change.

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43521
2015-08-04 13:51:05 -07:00
Yueh-Hsuan Chiang 14d0bfa429 Add DBOptions::skip_sats_update_on_db_open
Summary:
UpdateAccumulatedStats() is used to optimize compaction decision
esp. when the number of deletion entries are high, but this function
can slowdown DBOpen esp. in disk environment.

This patch adds DBOptions::skip_sats_update_on_db_open, which skips
UpdateAccumulatedStats() in DB::Open() time when it's set to true.

Test Plan: Add DBCompactionTest.SkipStatsUpdateTest

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: tnovak, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42843
2015-08-04 13:48:16 -07:00
agiardullo 8161bdb5a0 WriteBatch Save Points
Summary:
Support RollbackToSavePoint() in WriteBatch and WriteBatchWithIndex.  Support for partial transaction rollback is needed for MyRocks.

An alternate implementation of Transaction::RollbackToSavePoint() exists in D40869.  However, the other implementation is messier because it is implemented outside of WriteBatch.  This implementation is much cleaner and also exposes a potentially useful feature to WriteBatch.

Test Plan: Added unit tests

Reviewers: IslamAbdelRahman, kradhakrishnan, maykov, yoshinorim, hermanlee4, spetrunia, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42723
2015-07-29 16:54:23 -07:00
agiardullo 064294081b Improved FileExists API
Summary: Add new CheckFileExists method.  Considered changing the FileExists api but didn't want to break anyone's builds.

Test Plan: unit tests

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42003
2015-07-20 17:20:40 -07:00
Igor Canadi a96fcd09b7 Deprecate CompactionFilterV2
Summary: It has been around for a while and it looks like it never found any uses in the wild. It's also complicating our compaction_job code quite a bit. We're deprecating it in 3.13, but will put it back in 3.14 if we actually find users that need this feature.

Test Plan: make check

Reviewers: noetzli, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42405
2015-07-17 18:59:11 +02:00
Igor Canadi 8a9fca2619 Better error handling in BackupEngine
Summary:
Couple of changes here:
* NewBackupEngine() and NewReadOnlyBackupEngine() are now removed. They were deprecated since RocksDB 3.8. Changing these to new functions should be pretty straight-forward. As a followup, I'll fix all fbcode callsights
* Instead of initializing backup engine in the constructor, we initialize it in a separate function now. That way, we can catch all errors and return appropriate status code.
* We catch all errors during initializations and return them to the client properly.
* Added new tests to backupable_db_test, to make sure that we can't open BackupEngine when there are Env errors.
* Transitioned backupable_db_test to use BackupEngine rather than BackupableDB. From the two available APIs, judging by the current use-cases, it looks like BackupEngine API won. It's much more flexible since it doesn't require StackableDB.

Test Plan: Added a new unit test to backupable_db_test

Reviewers: yhchiang, sdong, AaronFeldman

Reviewed By: AaronFeldman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41925
2015-07-14 20:51:36 +02:00
Igor Canadi a9c5109515 Deprecate purge_redundant_kvs_while_flush
Summary: This option is guarding the feature implemented 2 and a half years ago: D8991. The feature was enabled by default back then and has been running without issues. There is no reason why any client would turn this feature off. I found no reference in fbcode.

Test Plan: none

Reviewers: sdong, yhchiang, anthony, dhruba

Reviewed By: dhruba

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42063
2015-07-14 13:07:02 +02:00
Igor Canadi 5aea98ddd8 Deprecate WriteOptions::timeout_hint_us
Summary:
In one of our recent meetings, we discussed deprecating features that are not being actively used. One of those features, at least within Facebook, is timeout_hint. The feature is really nicely implemented, but if nobody needs it, we should remove it from our code-base (until we get a valid use-case). Some arguments:
* Less code == better icache hit rate, smaller builds, simpler code
* The motivation for adding timeout_hint_us was to work-around RocksDB's stall issue. However, we're currently addressing the stall issue itself (see @sdong's recent work on stall write_rate), so we should never see sharp lock-ups in the future.
* Nobody is using the feature within Facebook's code-base. Googling for `timeout_hint_us` also doesn't yield any users.

Test Plan: make check

Reviewers: anthony, kradhakrishnan, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: sdong, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41937
2015-07-14 09:35:48 +02:00
Poornima Chozhiyath Raman c0b23dd5b0 Enabling trivial move in universal compaction
Summary: This change enables trivial move if all the input files are non onverlapping while doing Universal Compaction.

Test Plan: ./compaction_picker_test and db_test ran successfully with the new testcases.

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D40875
2015-07-07 14:18:55 -07:00
Yueh-Hsuan Chiang 59b50dcef9 Update HISTORY.md for Listener
Summary: Update HISTORY.md for Listener

Test Plan: no code change

Reviewers: igor, sdong, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41325
2015-07-07 12:39:36 -07:00
agiardullo 4159f5b87b Prepare 3.12
Summary: About to cut release

Test Plan: none

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41061
2015-07-02 12:20:36 -07:00
Aaron Feldman a69bc91e37 Multithreaded backup and restore in BackupEngineImpl
Summary:
Add a new field: BackupableDBOptions.max_background_copies.
CreateNewBackup() and RestoreDBFromBackup() will use this number of threads to perform copies.
If there is a backup rate limit, then max_background_copies must be 1.
Update backupable_db_test.cc to test multi-threaded backup and restore.
Update backupable_db_test.cc to test backups when the backup environment is not the same as the database environment.

Test Plan:
Run ./backupable_db_test
Run valgrind ./backupable_db_test
Run with TSAN and ASAN

Reviewers: yhchiang, rven, anthony, sdong, igor

Reviewed By: igor

Subscribers: yhchiang, anthony, sdong, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D40725
2015-07-02 11:35:51 -07:00
Igor Canadi 0a019d74a0 Use malloc_usable_size() for accounting block cache size
Summary:
Currently, when we insert something into block cache, we say that the block cache capacity decreased by the size of the block. However, size of the block might be less than the actual memory used by this object. For example, 4.5KB block will actually use 8KB of memory. So even if we configure block cache to 10GB, our actually memory usage of block cache will be 20GB!

This problem showed up a lot in testing and just recently also showed up in MongoRocks production where we were using 30GB more memory than expected.

This diff will fix the problem. Instead of counting the block size, we will count memory used by the block. That way, a block cache configured to be 10GB will actually use only 10GB of memory.

I'm using non-portable function and I couldn't find info on portability on Google. However, it seems to work on Linux, which will cover majority of our use-cases.

Test Plan:
1. fill up mongo instance with 80GB of data
2. restart mongo with block cache size configured to 10GB
3. do a table scan in mongo
4. memory usage before the diff: 12GB. memory usage after the diff: 10.5GB

Reviewers: sdong, MarkCallaghan, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40635
2015-06-26 11:48:09 -07:00
Islam AbdelRahman 674b1181cf Bottommost level compaction option
Summary: Replace force_bottommost_level_compaction in CompactRangeOption with an option that allow the user to (always skip, always compact, compact if compaction filter is present) the bottommost level for level based compaction.

Test Plan: make check

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D40527
2015-06-23 13:32:40 -07:00
Giuseppe Ottaviano 782a1590f9 Implement a table-level row cache
Summary:
Implementation of a table-level row cache.
It only caches point queries done through the `DB::Get` interface, queries done through the `Iterator` interface will completely skip the cache.

Supports snapshots and merge operations.

Test Plan: Ran `make valgrind_check commit-prereq`

Reviewers: igor, philipp, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D39849
2015-06-23 10:25:45 -07:00
Igor Canadi 760e9a94de Fail DB::Open() when the requested compression is not available
Summary:
Currently RocksDB silently ignores this issue and doesn't compress the data. Based on discussion, we agree that this is pretty bad because it can cause confusion for our users.

This patch fails DB::Open() if we don't support the compression that is specified in the options.

Test Plan: make check with LZ4 not present. If Snappy is not present all tests will just fail because Snappy is our default library. We should make Snappy the requirement, since without it our default DB::Open() fails.

Reviewers: sdong, MarkCallaghan, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39687
2015-06-18 14:55:05 -07:00
Aaron Feldman 69bb210d58 Add Cache.GetPinnedUsageUsage()
Summary:
  Add the funcion Cache.GetPinnedUsage() to return the memory size of entries
  that are in use by the system (that is, all the entries not in the LRU list).

Test Plan:
  Run ./cache_test and examine PinnedUsageTest.

Reviewers: tnovak, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D40305
2015-06-18 13:56:31 -07:00
Islam AbdelRahman 4eabbdb7ec Skip bottommost level compaction if possible
Summary:
This is https://reviews.facebook.net/D39999 but after introducing an option to force compaction the bottom most level

Changes in this patch
- Introduce force_bottommost_level_compaction to CompactRangeOptions that force compacting bottommost level during compaction
- Skip bottommost level compaction if we dont have a compaction filter and force_bottommost_level_compaction options is not set

Although tests pass on my machine but I suspect that there maybe some tests that I am not aware of that  should use force_bottommost_level_compaction to pass in a deterministic way

Test Plan:
make check
adding new tests

Reviewers: igor, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D40059
2015-06-18 11:03:31 -07:00
Islam AbdelRahman 12e030a992 Use CompactRangeOptions for CompactRange
Summary:
This diff update DB::CompactRange to use RangeCompactionOptions instead of using multiple parameters
Old CompactRange is still available but deprecated

Test Plan:
make all check
make rocksdbjava
USE_CLANG=1 make all
OPT=-DROCKSDB_LITE make release

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D40209
2015-06-17 14:36:14 -07:00
sdong 40f562e747 Allow GetApproximateSize() to include mem table size if it is skip list memtable
Summary:
Add an option in GetApproximateSize() so that the result will include estimated sizes in mem tables.
To implement it, implement an estimated count from the beginning to a key in skip list. The approach is to count to find the entry, how many Next() is issued from each level, and sum them with a weight that is <branching factor> ^ <level>.

Test Plan: Add a test case

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D40119
2015-06-16 18:13:23 -07:00
Igor Canadi d59d90bb1f db_bench periodically writes QPS to CSV file
Summary:
This is part of an effort to better understand and optimize RocksDB stalls under high load. I added a feature to db_bench to periodically write QPS to CSV files. That way we can nicely see how our QPS changes in time (especially when DB is stalled) and can do a better job of evaluating our stall system (i.e. we want the QPS to be as constant as possible, as opposed to having bunch of stalls)

Cool part of CSV files is that we can easily graph them -- there are a bunch of tools available.

Test Plan:
Ran ./db_bench --report_interval_seconds=10 --benchmarks=fillrandom --num=10000000
and observed this in report.csv:

secs_elapsed,interval_qps
10,2725860
20,1980480
30,1863456
40,1454359
50,1460389

Reviewers: sdong, MarkCallaghan, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40047
2015-06-12 14:31:53 -07:00
sdong 7842920be5 Slow down writes by bytes written
Summary:
We slow down data into the database to the rate of options.delayed_write_rate (a new option) with this patch.

The thread synchronization approach I take is to still synchronize write controller by DB mutex and GetDelay() is inside DB mutex. Try to minimize the frequency of getting time in GetDelay(). I verified it through db_bench and it seems to work

hard_rate_limit is deprecated.

options.delayed_write_rate is still not dynamically changeable. Need to work on it as a follow-up.

Test Plan: Add new unit tests in db_test

Reviewers: yhchiang, rven, kradhakrishnan, anthony, MarkCallaghan, igor

Reviewed By: igor

Subscribers: ikabiljo, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D36351
2015-06-11 20:42:18 -07:00
Igor Canadi 821cff114e Re-generate WriteEntry on WBWIIterator::Entry()
Summary:
[This is the resubmit of D39813. Tests were failing, so I reverted the diff. I found the bug and I'm now resubmitting]

If we don't do this, any calls to Entry() after WBWI mutation will result in undefined behavior. We need to re-fetch the offset from the skip list and regenerate the new pointer (because string's base pointer can change while mutating).

Test Plan: COMPILE_WITH_ASAN=1 make write_batch_with_index_test && ./write_batch_with_index_test

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D39897
2015-06-10 12:57:38 -07:00
Igor Canadi 75222d130e Revert "Fix compile"
This reverts commit 51440f83ec.

Revert "Re-generate WriteEntry on WBWIIterator::Entry()"

This reverts commit 4949ef08db.
2015-06-10 11:05:27 -07:00
Igor Canadi 4949ef08db Re-generate WriteEntry on WBWIIterator::Entry()
Summary: If we don't do this, any calls to Entry() after WBWI mutation will result in undefined behavior. We need to re-fetch the offset from the skip list and regenerate the new pointer (because string's base pointer can change while mutating).

Test Plan: COMPILE_WITH_ASAN=1 make write_batch_with_index_test && ./write_batch_with_index_test

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39813
2015-06-10 10:35:19 -07:00
Venkatesh Radhakrishnan 406a5682eb Fix hang when closing a DB after doing loads with WAL disabled.
Summary:
There is a hang during DB close in the following scenario:
a) a load with WAL disabled was done,
b) CancelAllBackgroundWork was called,
c) DB Close was called
This was because in that we will wait for a flush but we cannot do a
background flush because we have called CancelAllBackgroundWork which
marks the DB as shutting downn.

Test Plan: Added DBTest FlushOnDestroy

Reviewers: sdong

Reviewed By: sdong

Subscribers: yoshinorim, hermanlee4, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39747
2015-06-09 10:39:49 -07:00
Islam AbdelRahman 643bbbf081 Use nullptr for default compaction_filter_factory
Summary:
Replacing the default value for compaction_filter_factory and compaction_filter_factory_v2 to be nullptr instead of DefaultCompactionFilterFactory / DefaultCompactionFilterFactoryV2
The reason for this is to be able to determine easily if we have compaction filter factory or not without depending on RTTI

Test Plan: make check

Reviewers: yoshinorim, ott, igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D39693
2015-06-08 16:34:26 -07:00
Yueh-Hsuan Chiang 2e764f06ea [API Change] Improve EventListener::OnFlushCompleted interface
Summary:
EventListener::OnFlushCompleted() now passes a structure instead
of a list of parameters.  This minimizes the API change in the
future.

Test Plan:
listener_test
compact_files_test
example/compact_files_example

Reviewers: kradhakrishnan, sdong, IslamAbdelRahman, rven, igor

Reviewed By: rven, igor

Subscribers: IslamAbdelRahman, rven, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39543
2015-06-05 12:28:51 -07:00
sdong 4266d4fd90 Allow users to migrate to options.level_compaction_dynamic_level_bytes=true using CompactRange()
Summary: In DB::CompactRange(), change parameter "reduce_level" to "change_level". Users can compact all data to the last level if needed. By doing it, users can migrate the DB to options.level_compaction_dynamic_level_bytes=true.

Test Plan: Add a unit test for it.

Reviewers: yhchiang, anthony, kradhakrishnan, igor, rven

Reviewed By: rven

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D39099
2015-06-01 18:21:14 -07:00
agiardullo dc9d70de65 Optimistic Transactions
Summary: Optimistic transactions supporting begin/commit/rollback semantics.  Currently relies on checking the memtable to determine if there are any collisions at commit time.  Not yet implemented would be a way of enuring the memtable has some minimum amount of history so that we won't fail to commit when the memtable is empty.  You should probably start with transaction.h to get an overview of what is currently supported.

Test Plan: Added a new test, but still need to look into stress testing.

Reviewers: yhchiang, igor, rven, sdong

Reviewed By: sdong

Subscribers: adamretter, MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D33435
2015-05-29 14:36:35 -07:00
agiardullo c815351038 Support saving history in memtable_list
Summary:
For transactions, we are using the memtables to validate that there are no write conflicts.  But after flushing, we don't have any memtables, and transactions could fail to commit.  So we want to someone keep around some extra history to use for conflict checking.  In addition, we want to provide a way to increase the size of this history if too many transactions fail to commit.

After chatting with people, it seems like everyone prefers just using Memtables to store this history (instead of a separate history structure).  It seems like the best place for this is abstracted inside the memtable_list.  I decide to create a separate list in MemtableListVersion as using the same list complicated the flush/installalflushresults logic too much.

This diff adds a new parameter to control how much memtable history to keep around after flushing.  However, it sounds like people aren't too fond of adding new parameters.  So I am making the default size of flushed+not-flushed memtables be set to max_write_buffers.  This should not change the maximum amount of memory used, but make it more likely we're using closer the the limit.  (We are now postponing deleting flushed memtables until the max_write_buffer limit is reached).  So while we might use more memory on average, we are still obeying the limit set (and you could argue it's better to go ahead and use up memory now instead of waiting for a write stall to happen to test this limit).

However, if people are opposed to this default behavior, we can easily set it to 0 and require this parameter be set in order to use transactions.

Test Plan: Added a xfunc test to play around with setting different values of this parameter in all tests.  Added testing in memtablelist_test and planning on adding more testing here.

Reviewers: sdong, rven, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D37443
2015-05-28 16:34:24 -07:00
Yueh-Hsuan Chiang 672dda9b3b [API Change] Move listeners from ColumnFamilyOptions to DBOptions
Summary: Move listeners from ColumnFamilyOptions to DBOptions

Test Plan:
listener_test
compact_files_test

Reviewers: rven, anthony, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39087
2015-05-28 13:21:39 -07:00
Yueh-Hsuan Chiang e2c1d4b57f [Public API Change] Make DB::GetDbIdentity() be const function.
Summary: Make DB::GetDbIdentity() be const function.

Test Plan: make db_test

Reviewers: igor, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38745
2015-05-21 11:01:48 -07:00
Karthikeyan Radhakrishnan eaf61ba9f3 Minor text correction
New features title was repeated twice. Fixed it.
2015-05-21 10:55:58 -07:00
Yueh-Hsuan Chiang b588505a7f Update HISTORY.md for GetThreadList() update.
Summary: Update HISTORY.md for GetThreadList() update.

Test Plan: no code change

Reviewers: sdong, rven, anthony, krishnanm86, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38685
2015-05-19 18:41:57 -07:00
Karthikeyan Radhakrishnan d5de04d20e Update history for 3.11
Flipped the unreleased section to 3.11
2015-05-19 14:19:11 -07:00
Igor Canadi 4a855c0799 Add an option wal_bytes_per_sync to control sync_file_range for WAL files
Summary:
sync_file_range is not always asyncronous and thus can block writes if we do this for WAL in the foreground thread. See more here: http://yoshinorimatsunobu.blogspot.com/2014/03/how-syncfilerange-really-works.html

Some users don't want us to call sync_file_range on WALs. Some other do.
Thus, I'm adding a separate option wal_bytes_per_sync to control calling
sync_file_range on WAL files. bytes_per_sync will apply only to table
files now.

Test Plan: no more sync_file_range for WAL as evidenced by strace

Reviewers: yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38253
2015-05-18 17:03:59 -07:00
Aashish Pant 794ccfde89 Task 6532943: Rocksdb - SetCapacity() can dynamically change cache capacity if feasible
Summary:
When new capacity is larger than existing capacity, simply update the capacity to the new valie
When new capacity is less than existing capacity, but more than the usage, simply update the capacity to new value
When new capacity is less than the existing capacity and existing usage both, try to purge entries in LRU if feasible to make usage < capacity

Test Plan: Created unit tests in cache_test.cc

Reviewers: sdong, rven, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D37527
2015-04-24 14:12:58 -07:00
sdong 953a885ebf A new call back to TablePropertiesCollector to allow users know the entry is add, delete or merge
Summary:
Currently users have no idea a key is add, delete or merge from TablePropertiesCollector call back. Add a new function to add it.

Also refactor the codes so that
(1) make table property collector and internal table property collector two separate data structures with the later one now exposed
(2) table builders only receive internal table properties

Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector.

Reviewers: yhchiang, igor.sugak, rven, igor

Reviewed By: rven, igor

Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D35373
2015-04-06 10:27:21 -07:00
sdong b23bbaa82a Universal Compactions with Small Files
Summary:
With this change, we use L1 and up to store compaction outputs in universal compaction.
The compaction pick logic stays the same. Outputs are stored in the largest "level" as possible.

If options.num_levels=1, it behaves all the same as now.

Test Plan:
1) convert most of existing unit tests for universal comapaction to include the option of one level and multiple levels.
2) add a unit test to cover parallel compaction in universal compaction and run it in one level and multiple levels
3) add unit test to migrate from multiple level setting back to one level setting
4) add a unit test to insert keys to trigger multiple rounds of compactions and verify results.

Reviewers: rven, kradhakrishnan, yhchiang, igor

Reviewed By: igor

Subscribers: meyering, leveldb, MarkCallaghan, dhruba

Differential Revision: https://reviews.facebook.net/D34539
2015-03-30 15:12:02 -07:00
Igor Canadi d61cb0b9de db_bench can now disable flashcache for background threads
Summary: Most of the approach is copied from WebSQL's MySQL branch. It's nice that we can do this without touching core RocksDB code.

Test Plan: Compiles and runs. Didn't test flashback code, as I don't have flashback device and most if it is c/p

Reviewers: MarkCallaghan, sdong

Reviewed By: sdong

Subscribers: rven, lgalanis, kradhakrishnan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D35391
2015-03-30 09:51:11 -07:00
Yueh-Hsuan Chiang 39d508e34c Add a missing section title in HISTORY.md
Summary:
Add a missing section title in HISTORY.md

Test Plan:
no code change
2015-03-25 14:14:26 -07:00
Yueh-Hsuan Chiang 2d417e52df Update HISTORY.md for 3.10.0
Summary: Update HISTORY.md for 3.10.0

Test Plan: no code chagne.

Reviewers: sdong, rven, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D35871
2015-03-24 16:39:39 -07:00
Igor Canadi b088c83e6e Don't delete files when column family is dropped
Summary:
To understand the bug read t5943287 and check out the new test in column_family_test (ReadDroppedColumnFamily), iter 0.

RocksDB contract allowes you to read a drop column family as long as there is a live reference. However, since our iteration ignores dropped column families, AddLiveFiles() didn't mark files of a dropped column families as live. So we deleted them.

In this patch I no longer ignore dropped column families in the iteration. I think this behavior was confusing and it also led to this bug. Now if an iterator client wants to ignore dropped column families, he needs to do it explicitly.

Test Plan: Added a new unit test that is failing on master. Unit test succeeds now.

Reviewers: sdong, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D32535
2015-03-19 17:04:29 -07:00
agiardullo 81345b90f9 Create an abstract interface for write batches
Summary: WriteBatch and WriteBatchWithIndex now both inherit from a common abstract base class.  This makes it easier to write code that is agnostic toward the implementation of the particular write batch.  In particular, I plan on utilizing this abstraction to allow transactions to support using either implementation of a write batch.

Test Plan: modified existing WriteBatchWithIndex tests to test new functions.  Running all tests.

Reviewers: igor, rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D34017
2015-03-17 19:23:08 -07:00
Igor Canadi c88ff4ca76 Deprecate removeScanCountLimit in NewLRUCache
Summary: It is no longer used by the implementation, so we should also remove it from the public API.

Test Plan: make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D34971
2015-03-17 15:04:37 -07:00
Igor Canadi db03739340 options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.

In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.

Test Plan: New unit tests and pass tests suites including valgrind.

Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo

Reviewed By: ikabiljo

Subscribers: yoshinorim, ikabiljo, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D31437
2015-03-02 22:40:41 -08:00
Igor Canadi b9ff6b050d Fix a bug in ReadOnlyBackupEngine
Summary:
This diff fixes a bug introduced by D28521. Read-only backup engine can delete a backup that is later than the latest -- we never check the condition.

I also added a bunch of logging that will help with debugging cases like this in the future.

See more discussion at t6218248.

Test Plan: Added a unit test that was failing before the change. Also, see new LOG file contents: https://phabricator.fb.com/P19738984

Reviewers: benj, sanketh, sumeet, yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D33897
2015-02-27 14:03:56 -08:00
sdong 68af7811ea Remember whole key/prefix filtering on/off in SST file
Summary: Remember whole key or prefix filtering on/off in SST files. If user opens the DB with a different setting that cannot be satisfied while reading the SST file, ignore the bloom filter.

Test Plan: Add a unit test for it

Reviewers: yhchiang, igor, rven

Reviewed By: rven

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D32889
2015-02-11 11:20:04 -08:00
fyrz cfe8837e43 Switch logv with loglevel to virtual 2015-02-09 20:59:29 +01:00
sdong e63140d52b Get() to use prefix bloom filter when filter is not block based
Summary:
Get() now doesn't make use of bloom filter if it is prefix based. Add the check.
Didn't touch block based bloom filter. I can't fully reason whether it is correct to do that. But it's straight-forward to for full bloom filter.

Test Plan:
make all check
Add a test case in DBTest

Reviewers: rven, yhchiang, igor

Reviewed By: igor

Subscribers: MarkCallaghan, leveldb, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D31941
2015-02-04 15:15:41 -08:00
sdong 5917de0bae CappedFixTransform: return fixed length prefix, or full key if key is shorter than the fixed length
Summary: Add CappedFixTransform, which is the same as fixed length prefix extractor, except that when slice is shorter than the fixed length, it will use the full key.

Test Plan:
Add a test case for
db_test
options_test
and a new test

Reviewers: yhchiang, rven, igor

Reviewed By: igor

Subscribers: MarkCallaghan, leveldb, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D31887
2015-01-30 16:04:30 -08:00
Igor Canadi 2fd8f750ab Compile MemEnv with standard RocksDB library
Summary: This was a feature request by osquery. See task t5617758

Test Plan: compiles and memenv_test runs

Reviewers: yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D32115
2015-01-29 16:33:11 -08:00
Yueh-Hsuan Chiang cd4c071973 Update HISTORY.md for GetThreadStatus() support on compaction. 2015-01-22 15:46:56 -08:00
Igor Canadi 9ab5adfc59 New BlockBasedTable version -- better compressed block format
Summary:
This diff adds BlockBasedTable format_version = 2. New format version brings better compressed block format for these compressions:
1) Zlib -- encode decompressed size in compressed block header
2) BZip2 -- encode decompressed size in compressed block header
3) LZ4 and LZ4HC -- instead of doing memcpy of size_t encode size as varint32. memcpy is very bad because the DB is not portable accross big/little endian machines or even platforms where size_t might be 8 or 4 bytes.

It does not affect format for snappy.

If you write a new database with format_version = 2, it will not be readable by RocksDB versions before 3.10. DB::Open() will return corruption in that case.

Test Plan:
Added a new test in db_test.
I will also run db_bench and verify VSIZE when block_cache == 1GB

Reviewers: yhchiang, rven, MarkCallaghan, dhruba, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D31461
2015-01-14 16:24:24 -08:00
Igor Canadi 62ad0a9b19 Deprecating skip_log_error_on_recovery
Summary:
Since https://reviews.facebook.net/D16119, we ignore partial tailing writes. Because of that, we no longer need skip_log_error_on_recovery.

The documentation says "Skip log corruption error on recovery (If client is ok with losing most recent changes)", while the option actually ignores any corruption of the WAL (not only just the most recent changes). This is very dangerous and can lead to DB inconsistencies. This was originally set up to ignore partial tailing writes, which we now do automatically (after D16119). I have digged up old task t2416297 which confirms my findings.

Test Plan: There was actually no tests that verified correct behavior of skip_log_error_on_recovery.

Reviewers: yhchiang, rven, dhruba, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D30603
2015-01-05 13:35:56 -08:00
Lei Jin 5045c43944 add support for nested BlockBasedTableOptions in config string
Summary:
Add support to allow nested config for block-based table factory. The format looks like this:

"write_buffer_size=1024;block_based_table_factory={block_size=4k};max_write_buffer_num=2"

Test Plan: unit test

Reviewers: yhchiang, rven, igor, ljin, jonahcohen

Reviewed By: jonahcohen

Subscribers: jonahcohen, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D29223
2014-12-22 16:34:21 -08:00
Igor Canadi 0acc738810 Speed up FindObsoleteFiles()
Summary:
There are two versions of FindObsoleteFiles():
* full scan, which is executed every 6 hours (and it's terribly slow)
* no full scan, which is executed every time a background process finishes and iterator is deleted

This diff is optimizing the second case (no full scan). Here's what we do before the diff:
* Get the list of obsolete files (files with ref==0). Some files in obsolete_files set might actually be live.
* Get the list of live files to avoid deleting files that are live.
* Delete files that are in obsolete_files and not in live_files.

After this diff:
* The only files with ref==0 that are still live are files that have been part of move compaction. Don't include moved files in obsolete_files.
* Get the list of obsolete files (which exclude moved files).
* No need to get the list of live files, since all files in obsolete_files need to be deleted.

I'll post the benchmark results, but you can get the feel of it here: https://reviews.facebook.net/D30123

This depends on D30123.

P.S. We should do full scan only in failure scenarios, not every 6 hours. I'll do this in a follow-up diff.

Test Plan:
One new unit test. Made sure that unit test fails if we don't have a `if (!f->moved)` safeguard in ~Version.

make check

Big number of compactions and flushes:

  ./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0  --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000

Reviewers: yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D30249
2014-12-22 12:04:45 +01:00
Venkatesh Radhakrishnan 7661e5a76e Move the file copy out of the mutex.
Summary:
We now release the mutex before copying the files in the case
of the trivial move. This path does not use the compaction job.

Test Plan: DBTest.LevelCompactionThirdPath

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D30381
2014-12-16 16:57:22 -08:00
Venkatesh Radhakrishnan 153f4f0719 RocksDB: Allow Level-Style Compaction to Place Files in Different Paths
Summary:
Allow Level-style compaction to place files in different paths
This diff provides the code for task 4854591. We now support level-compaction
to place files in different paths by specifying  them in db_paths  along with
the minimum level for files to store in that path.

Test Plan: ManualLevelCompactionOutputPathId in db_test.cc

Reviewers: yhchiang, MarkCallaghan, dhruba, yoshinorim, sdong

Reviewed By: sdong

Subscribers: yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D29799
2014-12-15 21:48:16 -08:00
Igor Canadi 06eed650a0 Optimize default compile to compilation platform by default
Summary:
This diff changes compile to optimize for native platform by default. This will automatically turn on crc32 optimizations for modern processors, which greatly improves rocksdb's performance.

I also did some more changes to compilation documentation.

Test Plan:
compile with `make`, observe -march=native
compile with `PORTABLE=1 make`, observe no -march=native

Reviewers: sdong, rven, yhchiang, MarkCallaghan

Reviewed By: MarkCallaghan

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D30225
2014-12-15 11:29:41 +01:00
Alexey Maykov ee95cae9a4 Modifed the LRU cache eviction code so that it doesn't evict blocks which have exteranl references
Summary:
Currently, blocks which have more than one reference (ie referenced by something other than cache itself) are evicted from cache. This doesn't make much sense:
- blocks are still in RAM, so the RAM usage reported by the cache is incorrect
- if the same block is needed by another iterator, it will be loaded and decompressed again

This diff changes the reference counting scheme a bit. Previously, if the cache contained the block, this was accounted for in its refcount. After this change, the refcount is only used to track external references. There is a boolean flag which indicates whether or not the block is contained in the cache.
This diff also changes how LRU list is used. Previously, both hashtable and the LRU list contained all blocks. After this change, the LRU list contains blocks with the refcount==0, ie those which can be evicted from the cache.

Note that this change still allows for cache to grow beyond its capacity. This happens when all blocks are pinned (ie refcount>0). This is consistent with the current behavior. The cache's insert function never fails. I spent lots of time trying to make table_reader and other places work with the insert which might failed. It turned out to be pretty hard. It might really destabilize some customers, so finally, I decided against doing this.

table_cache_remove_scan_count_limit option will be unneeded after this change, but I will remove it in the following diff, if this one gets approved

Test Plan: Ran tests, made sure they pass

Reviewers: sdong, ljin

Differential Revision: https://reviews.facebook.net/D25503
2014-12-10 22:28:53 -08:00
Yueh-Hsuan Chiang 1b7fbb9e82 Update HISTORY.md for release 3.9 2014-12-08 15:19:48 -08:00
Jonah Cohen a14b7873ee Enforce write buffer memory limit across column families
Summary:
Introduces a new class for managing write buffer memory across column
families.  We supplement ColumnFamilyOptions::write_buffer_size with
ColumnFamilyOptions::write_buffer, a shared pointer to a WriteBuffer
instance that enforces memory limits before flushing out to disk.

Test Plan: Added SharedWriteBuffer unit test to db_test.cc

Reviewers: sdong, rven, ljin, igor

Reviewed By: igor

Subscribers: tnovak, yhchiang, dhruba, xjin, MarkCallaghan, yoshinorim

Differential Revision: https://reviews.facebook.net/D22581
2014-12-02 12:09:20 -08:00
Venkatesh Radhakrishnan 004f416b77 Moved checkpoint to utilities
Summary:
Moved checkpoint to utilities.
Addressed comments by Igor, Siying, Dhruba

Test Plan: db_test/SnapshotLink

Reviewers: dhruba, igor, sdong

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D29079
2014-11-20 15:54:47 -08:00
Yueh-Hsuan Chiang d0c5f28a5c Introduce GetThreadList API
Summary:
Add GetThreadList API, which allows developer to track the
status of each process.  Currently, calling GetThreadList will
only get the list of background threads in RocksDB with their
thread-id and thread-type (priority) set.  Will add more support
on this in the later diffs.

ThreadStatus currently has the following properties:

  // An unique ID for the thread.
  const uint64_t thread_id;

  // The type of the thread, it could be ROCKSDB_HIGH_PRIORITY,
  // ROCKSDB_LOW_PRIORITY, and USER_THREAD
  const ThreadType thread_type;

  // The name of the DB instance where the thread is currently
  // involved with.  It would be set to empty string if the thread
  // does not involve in any DB operation.
  const std::string db_name;

  // The name of the column family where the thread is currently
  // It would be set to empty string if the thread does not involve
  // in any column family.
  const std::string cf_name;

  // The event that the current thread is involved.
  // It would be set to empty string if the information about event
  // is not currently available.

Test Plan:
./thread_list_test
export ROCKSDB_TESTS=GetThreadList
./db_test

Reviewers: rven, igor, sdong, ljin

Reviewed By: ljin

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D25047
2014-11-20 10:49:32 -08:00
Yueh-Hsuan Chiang 7fe247080f Update HISTORY.md for RocksJava 2014-11-14 11:23:00 -08:00
Igor Canadi 94fa542f82 Update HISTROY.md for 3.8 release 2014-11-14 10:43:12 -08:00