Commit Graph

302 Commits

Author SHA1 Message Date
Vijay Nadimpalli 49c5a12dbe Organizing rocksdb/db directory
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5390

Differential Revision: D15579388

Pulled By: vjnadimpalli

fbshipit-source-id: 5bfc95e31554b8ff05b97b76d6534113f527f366
2019-05-31 11:57:01 -07:00
Zhongyi Xie ab8f6c01a6 move LevelCompactionPicker to a separate file (#5369)
Summary:
In order to improve code readability, this PR moves LevelCompactionBuilder and LevelCompactionPicker to compaction_picker_level.h and .cc
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5369

Differential Revision: D15540172

Pulled By: miasantreble

fbshipit-source-id: c1a578b93f127cd63661b53f32b356e6edd349af
2019-05-30 21:38:24 -07:00
Vijay Nadimpalli 50e470791d Organizing rocksdb/table directory by format
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5373

Differential Revision: D15559425

Pulled By: vjnadimpalli

fbshipit-source-id: 5d6d6d615582bedd96a4b879bb25d429a6de8b55
2019-05-30 14:51:11 -07:00
Siying Dong 545d206040 Move some file related files outside util/ (#5375)
Summary:
util/ means for lower level libraries, so it's a good idea to move the files which requires knowledge to DB out. Create a file/ and move some files there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5375

Differential Revision: D15550935

Pulled By: siying

fbshipit-source-id: 61a9715dcde5386eebfb43e93f847bba1ae0d3f2
2019-05-29 20:47:06 -07:00
Mike Kolupaev cd77d3c558 Don't call FindObsoleteFiles() in ~ColumnFamilyHandleImpl() if CF is not dropped (#5238)
Summary:
We have a DB with ~4k column families and ~70k files. On shutdown, destroying the 4k ColumnFamilyHandle-s takes over 2 minutes. Most of this time is spent in VersionSet::AddLiveFiles() called from FindObsoleteFiles() from ~ColumnFamilyHandleImpl(). It's just iterating over the list of files in memory. This seems completely unnecessary as no obsolete files are actually found since the CFs are not even dropped. This PR fixes that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5238

Differential Revision: D15056342

Pulled By: siying

fbshipit-source-id: 2aa342ef3770b4aa384ce81f8768e485480e4f08
2019-04-24 17:11:36 -07:00
Zhongyi Xie baa5302447 Avoid double-compacting data in bottom level in manual compactions (#5138)
Summary:
Depending on the config, manual compaction (leveled compaction style) does following compactions:
L0->L1
L1->L2
...
Ln-1 -> Ln
Ln -> Ln
The final Ln -> Ln compaction is partly unnecessary as it recompacts all the files that were just generated by the Ln-1 -> Ln. We should avoid recompacting such files. This rule should be applied to Lmax only.
Resolves issue https://github.com/facebook/rocksdb/issues/4995
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5138

Differential Revision: D14940106

Pulled By: miasantreble

fbshipit-source-id: 8d3cf5507a17e76f3333cfd4bac5256d005636e5
2019-04-16 23:32:20 -07:00
Mike Kolupaev 120bc4715b Add DBOptions. avoid_unnecessary_blocking_io to defer file deletions (#5043)
Summary:
Just like ReadOptions::background_purge_on_iterator_cleanup but for ColumnFamilyHandle instead of Iterator.

In our use case we sometimes call ColumnFamilyHandle's destructor from low-latency threads, and sometimes it blocks the thread for a few seconds deleting the files. To avoid that, we can either offload ColumnFamilyHandle's destruction to a background thread on our side, or add this option on rocksdb side. This PR does the latter, to be consistent with how we solve exactly the same problem for iterators using background_purge_on_iterator_cleanup option.

(EDIT: It's avoid_unnecessary_blocking_io now, and affects both CF drops and iterator destructors.)
I'm not quite comfortable with having two separate options (background_purge_on_iterator_cleanup and background_purge_on_cf_cleanup) for such a rarely used thing. Maybe we should merge them? Rename background_purge_on_cf_cleanup to something like delete_files_on_background_threads_only or avoid_blocking_io_in_unexpected_places, and make iterators use it instead of the one in ReadOptions? I can do that here if you guys think it's better.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5043

Differential Revision: D14339233

Pulled By: al13n321

fbshipit-source-id: ccf7efa11c85c9a5b91d969bb55627d0fb01e7b8
2019-04-01 17:10:40 -07:00
Andrew Kryczka 01013ae766 Digest ZSTD compression dictionary once when writing SST file (#4849)
Summary:
This is essentially a re-submission of #4251 with a few improvements:

- Split `CompressionDict` into two separate classes: `CompressionDict` and `UncompressionDict`
- Eliminated `Init` functions. Instead do all initialization work in constructors.
- Added test case for parallel DB open, which is the scenario where #4251 failed under TSAN.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4849

Differential Revision: D13606039

Pulled By: ajkr

fbshipit-source-id: 08c236059798c710db9cbf545fce0f371232d447
2019-01-18 19:12:57 -08:00
Abhishek Madan 81b6b09f6b Remove v1 RangeDelAggregator (#4778)
Summary:
Now that v2 is fully functional, the v1 aggregator is removed.
The v2 aggregator has been renamed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4778

Differential Revision: D13495930

Pulled By: abhimadan

fbshipit-source-id: 9d69500a60a283e79b6c4fa938fc68a8aa4d40d6
2018-12-17 17:33:46 -08:00
Abhishek Madan abf931afa6 Add compaction logic to RangeDelAggregatorV2 (#4758)
Summary:
RangeDelAggregatorV2 now supports ShouldDelete calls on
snapshot stripes and creation of range tombstone compaction iterators.
RangeDelAggregator is no longer used on any non-test code path, and will
be removed in a future commit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4758

Differential Revision: D13439254

Pulled By: abhimadan

fbshipit-source-id: fe105bcf8e3d4a2df37a622d5510843cd71b0401
2018-12-17 13:20:51 -08:00
Sagar Vemuri 70645355ad Move FIFOCompactionPicker to a separate file (#4724)
Summary:
**Summary:**
Simplified the code layout by moving FIFOCompactionPicker to a separate file.
**Why?:**
While trying to add ttl functionality to universal compaction, I found that `FIFOCompactionPicker` class and its impl methods to be interspersed between `LevelCompactionPicker` methods which kind-of made the code a little hard to traverse. So I moved `FIFOCompactionPicker` to a separate compaction_picker_fifo.h/cc file, similar to `UniversalCompactionPicker`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4724

Differential Revision: D13227914

Pulled By: sagar0

fbshipit-source-id: 89471766ea67fa4d87664a41c057dd7df4b3d4e3
2018-11-29 16:04:52 -08:00
Abhishek Madan 8fe1e06ca0 Clean up FragmentedRangeTombstoneList (#4692)
Summary:
Removed `one_time_use` flag, which removed the need for some
tests, and changed all `NewRangeTombstoneIterator` methods to return
`FragmentedRangeTombstoneIterators`.

These changes also led to removing `RangeDelAggregatorV2::AddUnfragmentedTombstones`
and one of the `MemTableListVersion::AddRangeTombstoneIterators` methods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4692

Differential Revision: D13106570

Pulled By: abhimadan

fbshipit-source-id: cbab5432d7fc2d9cdfd8d9d40361a1bffaa8f845
2018-11-28 15:29:02 -08:00
Abhishek Madan 457f77b9ff Introduce RangeDelAggregatorV2 (#4649)
Summary:
The old RangeDelAggregator did expensive pre-processing work
to create a collapsed, binary-searchable representation of range
tombstones. With FragmentedRangeTombstoneIterator, much of this work is
now unnecessary. RangeDelAggregatorV2 takes advantage of this by seeking
in each iterator to find a covering tombstone in ShouldDelete, while
doing minimal work in AddTombstones. The old RangeDelAggregator is still
used during flush/compaction for now, though RangeDelAggregatorV2 will
support those uses in a future PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4649

Differential Revision: D13146964

Pulled By: abhimadan

fbshipit-source-id: be29a4c020fc440500c137216fcc1cf529571eb3
2018-11-21 10:56:45 -08:00
Abhishek Madan eaaf1a6f05 Promote rocksdb.{deleted.keys,merge.operands} to main table properties (#4594)
Summary:
Since the number of range deletions are reported in
TableProperties, it is confusing to not report the number of merge
operands and point deletions as top-level properties; they are
accessible through the public API, but since they are not the "main"
properties, they do not appear in aggregated table properties, or the
string representation of table properties.

This change promotes those two property keys to
`rocksdb/table_properties.h`, adds corresponding uint64 members for
them, deprecates the old access methods `GetDeletedKeys()` and
`GetMergeOperands()` (though they are still usable for now), and removes
`InternalKeyPropertiesCollector`. The property key strings are the same
as before this change, so this should be able to read DBs written from older
versions (though I haven't tested this yet).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4594

Differential Revision: D12826893

Pulled By: abhimadan

fbshipit-source-id: 9e4e4fbdc5b0da161c89582566d184101ba8eb68
2018-10-30 15:34:27 -07:00
Siying Dong d59549298f Skip deleted WALs during recovery
Summary:
This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic.

Before the commit, for 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), record this information to the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we loss with this new approach is earlier log deletion between two flushes, which does not guarantee to happen anyway because the obsolete file clean-up function is only executed after flush or compaction)

This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2.
Closes https://github.com/facebook/rocksdb/pull/3765

Differential Revision: D7747618

Pulled By: siying

fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
2018-05-03 15:43:09 -07:00
Huachao Huang ed7a95b28c Add max_subcompactions as a compaction option
Summary:
Sometimes we want to compact files as fast as possible, but don't want to set a large `max_subcompactions` in the `DBOptions` by default.
I add a `max_subcompactions` options to `CompactionOptions` so that we can choose a proper concurrency dynamically.
Closes https://github.com/facebook/rocksdb/pull/3775

Differential Revision: D7792357

Pulled By: ajkr

fbshipit-source-id: 94f54c3784dce69e40a229721a79a97e80cd6a6c
2018-04-27 11:57:39 -07:00
Yanqin Jin 7dfbe33532 Rename pending_compaction_ to queued_for_compaction_.
Summary:
We use `queued_for_flush_` to indicate a column family has been added to the
flush queue. Similarly and to be consistent in our naming, we need to use `queued_for_compaction_` to indicate a column family has been added to the compaction queue. In the past we used
`pending_compaction_` which can also be ambiguous.
Closes https://github.com/facebook/rocksdb/pull/3781

Differential Revision: D7790063

Pulled By: riversand963

fbshipit-source-id: 6786b11a4fcaea36dc9b4672233dbe042f921804
2018-04-27 11:12:01 -07:00
Yanqin Jin 513b5ce618 Rename pending_flush_ to queued_for_flush_.
Summary:
With ColumnFamilyData::pending_flush_, we have the following code snippet in DBImpl::ScheedulePendingFlush

```
if (!cfd->pending_flush() && cfd->imm()->IsFlushPending()) {
...
}
```

`Pending` is ambiguous, and I feel `queued_for_flush` is a better name,
especially for the sake of readability.
Closes https://github.com/facebook/rocksdb/pull/3777

Differential Revision: D7783066

Pulled By: riversand963

fbshipit-source-id: f1bd8c8bfe5eafd2c94da0d8566c9b2b6bb57229
2018-04-26 21:12:51 -07:00
David Lai 3be9b36453 comment unused parameters to turn on -Wunused-parameter flag
Summary:
This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557.
Closes https://github.com/facebook/rocksdb/pull/3662

Differential Revision: D7426121

Pulled By: Dayvedde

fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352
2018-04-12 17:59:16 -07:00
Andrew Kryczka 019d7894eb fix calling SetOptions on deprecated options
Summary:
In `cf_options_type_info`, the deprecated options are all considered to have offset zero in the `MutableCFOptions` struct. Previously we weren't checking in `GetMutableOptionsFromStrings` whether the provided option was deprecated or not and simply writing the provided value to the offset specified by `cf_options_type_info`. That meant setting any deprecated option would overwrite the first element in the struct, which is `write_buffer_size`. `db_stress` hit this often since it calls `SetOptions` with `soft_rate_limit=0` and `hard_rate_limit=0`, which are both deprecated so cause `write_buffer_size` to be set to zero, which causes it to crash on the following assertion:

```
db_stress: db/memtable.cc:106: rocksdb::MemTable::MemTable(const rocksdb::InternalKeyComparator&, const rocksdb::ImmutableCFOptions&, const rocksdb::MutableCFOptions&, rocksdb::WriteBufferManager*, rocksdb::SequenceNumber, uint32_t): Assertion `!ShouldScheduleFlush()' failed.
```

We fix it by skipping deprecated options (and logging a warning) when users provide them to `SetOptions`. I didn't want to fail the call for compatibility reasons.
Closes https://github.com/facebook/rocksdb/pull/3700

Differential Revision: D7572596

Pulled By: ajkr

fbshipit-source-id: bd5d84e14c0c39f30c5d4c6df7c1503d2c28ecf1
2018-04-10 19:02:09 -07:00
Phani Shekhar Mantripragada 446b32cfc3 Support for Column family specific paths.
Summary:
In this change, an option to set different paths for different column families is added.
This option is set via cf_paths setting of ColumnFamilyOptions. This option will work in a similar fashion to db_paths setting. Cf_paths is a vector of Dbpath values which contains a pair of the absolute path and target size. Multiple levels in a Column family can go to different paths if cf_paths has more than one path.
To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path.

Changes :
1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions.  This member is used to identify the path information whenever files are accessed.
2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting.
3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths.
4) Unit tests are added appropriately.
Closes https://github.com/facebook/rocksdb/pull/3102

Differential Revision: D6951697

Pulled By: ajkr

fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d
2018-04-05 19:58:20 -07:00
Zhongyi Xie 1cbc96d236 FlushReason improvement
Summary:
Right now flush reason "SuperVersion Change" covers a few different scenarios which is a bit vague. For example, the following db_bench job should trigger "Write Buffer Full"

> $ TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304
$ grep 'flush_reason' /dev/shm/dbbench/LOG
...
2018/03/06-17:30:42.543638 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242543634, "job": 192, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018024, "flush_reason": "SuperVersion Change"}
2018/03/06-17:30:42.569541 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242569536, "job": 193, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "SuperVersion Change"}
2018/03/06-17:30:42.596396 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242596392, "job": 194, "event": "flush_started", "num_memtables": 1, "num_entries": 7008, "num_deletes": 0, "memory_usage": 1018048, "flush_reason": "SuperVersion Change"}
2018/03/06-17:30:42.622444 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242622440, "job": 195, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "SuperVersion Change"}

With the fix:
> 2018/03/19-14:40:02.341451 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602341444, "job": 98, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018008, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.379655 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602379642, "job": 100, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018016, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.418479 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602418474, "job": 101, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.455084 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602455079, "job": 102, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018048, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.492293 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602492288, "job": 104, "event": "flush_started", "num_memtables": 1, "num_entries": 7007, "num_deletes": 0, "memory_usage": 1018056, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.528720 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602528715, "job": 105, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "Write Buffer Full"}
2018/03/19-14:40:02.566255 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602566238, "job": 107, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018112, "flush_reason": "Write Buffer Full"}
Closes https://github.com/facebook/rocksdb/pull/3627

Differential Revision: D7328772

Pulled By: miasantreble

fbshipit-source-id: 67c94065fbdd36930f09930aad0aaa6d2c152bb8
2018-03-22 18:42:18 -07:00
Siying Dong 93d52696bf Memory Problem Of Destorying ColumnFamilyHandle after deleting the CF
Summary:
When destorying column family handle after the column family has been deleted, the handle may hold share pointers of some objects in ColumnFamilyOptions, but in the destructor, the destructing order may cause some of the objects to be destoryed before being used by the following steps. Fix it by making a copy of the option object and destory it as the last step.
Closes https://github.com/facebook/rocksdb/pull/3610

Differential Revision: D7281025

Pulled By: siying

fbshipit-source-id: ac18f3b2841788cba4ccfa1abd8d59158c1113bc
2018-03-20 17:13:12 -07:00
Yi Wu bf937cf15b Add "rocksdb.live-sst-files-size" DB property
Summary:
Add "rocksdb.live-sst-files-size" DB property which only include files of latest version. Existing "rocksdb.total-sst-files-size" include files from all versions and thus include files that's obsolete but not yet deleted. I'm going to use this new property to cap blob db sst + blob files size.
Closes https://github.com/facebook/rocksdb/pull/3548

Differential Revision: D7116939

Pulled By: yiwu-arbug

fbshipit-source-id: c6a52e45ce0f24ef78708156e1a923c1dd6bc79a
2018-03-01 18:01:10 -08:00
Andrew Kryczka 3ae0047278 skip CompactRange flush based on memtable contents
Summary:
CompactRange has a call to Flush because we guarantee that, at the time it's called, all existing keys in the range will be pushed through the user's compaction filter. However, previously the flush was done blindly, so it'd happen even if the memtable does not contain keys in the range specified by the user. This caused unnecessarily many L0 files to be created, leading to write stalls in some cases. This PR checks the memtable's contents, and decides to flush only if it overlaps with `CompactRange`'s range.

- Move the memtable overlap check logic from `ExternalSstFileIngestionJob` to `ColumnFamilyData::RangesOverlapWithMemtables`
- Reuse the above logic in `CompactRange` and skip flushing if no overlap
Closes https://github.com/facebook/rocksdb/pull/3520

Differential Revision: D7018897

Pulled By: ajkr

fbshipit-source-id: a3c6b1cfae56687b49dd89ccac7c948e53545934
2018-02-27 17:12:44 -08:00
Mike Kolupaev 97307d888f Fix deadlock in ColumnFamilyData::InstallSuperVersion()
Summary:
Deadlock: a memtable flush holds DB::mutex_ and calls ThreadLocalPtr::Scrape(), which locks ThreadLocalPtr mutex; meanwhile, a thread exit handler locks ThreadLocalPtr mutex and calls SuperVersionUnrefHandle, which tries to lock DB::mutex_.

This deadlock is hit all the time on our workload. It blocks our release.

In general, the problem is that ThreadLocalPtr takes an arbitrary callback and calls it while holding a lock on a global mutex. The same global mutex is (at least in some cases) locked by almost all ThreadLocalPtr methods, on any instance of ThreadLocalPtr. So, there'll be a deadlock if the callback tries to do anything to any instance of ThreadLocalPtr, or waits for another thread to do so.

So, probably the only safe way to use ThreadLocalPtr callbacks is to do only do simple and lock-free things in them.

This PR fixes the deadlock by making sure that local_sv_ never holds the last reference to a SuperVersion, and therefore SuperVersionUnrefHandle never has to do any nontrivial cleanup.

I also searched for other uses of ThreadLocalPtr to see if they may have similar bugs. There's only one other use, in transaction_lock_mgr.cc, and it looks fine.
Closes https://github.com/facebook/rocksdb/pull/3510

Reviewed By: sagar0

Differential Revision: D7005346

Pulled By: al13n321

fbshipit-source-id: 37575591b84f07a891d6659e87e784660fde815f
2018-02-16 08:13:34 -08:00
jsteemann 4e7a182d09 Several small "fixes"
Summary:
- removed a few unneeded variables
- fused some variable declarations and their assignments
- fixed right-trimming code in string_util.cc to not underflow
- simplifed an assertion
- move non-nullptr check assertion before dereferencing of that pointer
- pass an std::string function parameter by const reference instead of by value (avoiding potential copy)
Closes https://github.com/facebook/rocksdb/pull/3507

Differential Revision: D7004679

Pulled By: sagar0

fbshipit-source-id: 52944952d9b56dfcac3bea3cd7878e315bb563c4
2018-02-15 16:57:37 -08:00
Andrew Kryczka ee1c802675 Add delay before flush in CompactRange to avoid write stalling
Summary:
- Refactored logic for checking write stall condition to a helper function: `GetWriteStallConditionAndCause`. Now it is decoupled from the logic for updating WriteController / stats in `RecalculateWriteStallConditions`, so we can reuse it for predicting whether write stall will occur.
- Updated `CompactRange` to first check whether the one additional immutable memtable / L0 file would cause stalling before it flushes. If so, it waits until that is no longer true.
- Updated `bg_cv_` to be signaled on `SetOptions` calls. The stall conditions `CompactRange` cares about can change when (1) flush finishes, (2) compaction finishes, or (3) options dynamically change. The cv was already signaled for (1) and (2) but not yet for (3).
Closes https://github.com/facebook/rocksdb/pull/3381

Differential Revision: D6754983

Pulled By: ajkr

fbshipit-source-id: 5613e03f1524df7192dc6ae885d40fd8f091d972
2018-02-12 15:42:47 -08:00
Zhongyi Xie 945f618ba5 log flush reason for better debugging experience
Summary:
It's always a mystery from the logs why flush was triggered -- user triggered it manually, WriteBufferManager triggered it,  logs were full, write buffer was full, etc.
This PR logs Flush reason whenever a flush is scheduled.
Closes https://github.com/facebook/rocksdb/pull/3401

Differential Revision: D6788142

Pulled By: miasantreble

fbshipit-source-id: a867e54d493c06adf5172bd36a180fb3faae3511
2018-02-09 12:12:43 -08:00
Tamir Duberstein cd5092e168 Suppress unused warnings
Summary:
- Use `__unused__` everywhere
- Suppress unused warnings in Release mode
    + This currently affects non-MSVC builds (e.g. mingw64).
Closes https://github.com/facebook/rocksdb/pull/3448

Differential Revision: D6885496

Pulled By: miasantreble

fbshipit-source-id: f2f6adacec940cc3851a9eee328fafbf61aad211
2018-02-02 12:27:07 -08:00
Fosco Marotto 5400800a56 Work around VS2017 warning for unused reference
Summary:
For #3407
Closes https://github.com/facebook/rocksdb/pull/3425

Differential Revision: D6836900

Pulled By: gfosco

fbshipit-source-id: 7bcaf7a1beeeeabb7c05584f2745e7b4a2473497
2018-01-31 11:58:10 -08:00
Yi Wu d46e832e94 Assert last reference before destroy ColumnFamilyData
Summary:
In ColumnFamilySet destructor, assert it hold the last reference to cfd before destroy them.

Closes #3112
Closes https://github.com/facebook/rocksdb/pull/3397

Differential Revision: D6777967

Pulled By: yiwu-arbug

fbshipit-source-id: 60b19070e0c194b3b6146699140c1d68777866cb
2018-01-23 15:12:28 -08:00
Yi Wu f1cb83fcf4 Fix Flush() keep waiting after flush finish
Summary:
Flush() call could be waiting indefinitely if min_write_buffer_number_to_merge is used. Consider the sequence:
1. User call Flush() with flush_options.wait = true
2. The manual flush started in the background
3. New memtable become immutable because of writes. The new memtable will not trigger flush if min_write_buffer_number_to_merge is not reached.
4. The manual flush finish.

Because of the new memtable created at step 3 not being flush, previous logic of WaitForFlushMemTable() keep waiting, despite the memtables it intent to flush has been flushed.

Here instead of checking if there are any more memtables to flush, WaitForFlushMemTable() also check the id of the earliest memtable. If the id is larger than that of latest memtable at the time flush was initiated, it means all the memtable at the time of flush start has all been flush.
Closes https://github.com/facebook/rocksdb/pull/3378

Differential Revision: D6746789

Pulled By: yiwu-arbug

fbshipit-source-id: 35e698f71c7f90b06337a93e6825f4ea3b619bfa
2018-01-18 17:45:16 -08:00
Shaohua Li eefd75a228 Stream
Summary:
Add a simple policy for NVMe write time life hint
Closes https://github.com/facebook/rocksdb/pull/3095

Differential Revision: D6298030

Pulled By: shligit

fbshipit-source-id: 9a72a42e32e92193af11599eb71f0cf77448e24d
2017-11-10 09:26:24 -08:00
Andrew Kryczka 24ad430600 pass key/value samples through zstd compression dictionary generator
Summary:
Instead of using samples directly, we now support passing the samples through zstd's dictionary generator when `CompressionOptions::zstd_max_train_bytes` is set to nonzero. If set to zero, we will use the samples directly as the dictionary -- same as before.

Note this is the first step of #2987, extracted into a separate PR per reviewer request.
Closes https://github.com/facebook/rocksdb/pull/3057

Differential Revision: D6116891

Pulled By: ajkr

fbshipit-source-id: 70ab13cc4c734fa02e554180eed0618b75255497
2017-11-02 22:56:36 -07:00
Andrew Kryczka c4c1f961e7 dynamically change current memtable size
Summary:
Previously setting `write_buffer_size` with `SetOptions` would only apply to new memtables. An internal user wanted it to take effect immediately, instead of at an arbitrary future point, to prevent OOM.

This PR makes the memtable's size mutable, and makes `SetOptions()` mutate it. There is one case when we preserve the old behavior, which is when memtable prefix bloom filter is enabled and the user is increasing the memtable's capacity. That's because the prefix bloom filter's size is fixed and wouldn't work as well on a larger memtable.
Closes https://github.com/facebook/rocksdb/pull/3119

Differential Revision: D6228304

Pulled By: ajkr

fbshipit-source-id: e44bd9d10a5f8c9d8c464bf7436070bb3eafdfc9
2017-11-02 22:28:10 -07:00
Dmitri Smirnov d2a65c59e1 Fix unused var warnings in Release mode
Summary:
MSVC does not support unused attribute at this time. A separate assignment line fixes the issue probably by being counted as usage for MSVC and it no longer complains about unused var.
Closes https://github.com/facebook/rocksdb/pull/3048

Differential Revision: D6126272

Pulled By: maysamyabandeh

fbshipit-source-id: 4907865db45fd75a39a15725c0695aaa17509c1f
2017-10-23 14:27:04 -07:00
Andrew Kryczka 8dd0a7e11a add comment in SuperVersion referencing logic
Summary:
The referencing logic is super confusing so added a comment at the part that took me longest to figure out.
Closes https://github.com/facebook/rocksdb/pull/2996

Differential Revision: D6034969

Pulled By: ajkr

fbshipit-source-id: 9cc2e744c1f79d6d57d378f86ed59238a5f583db
2017-10-11 15:12:31 -07:00
Adrien Schildknecht 01542400a8 Inform caller when rocksdb is stalling writes
Summary:
Add a new function in Listener to let the caller know when rocksdb
is stalling writes.
Closes https://github.com/facebook/rocksdb/pull/2897

Differential Revision: D5860124

Pulled By: schischi

fbshipit-source-id: ee791606169aa64f772c86f817cebf02624e05e1
2017-10-05 18:11:43 -07:00
Andrew Kryczka 3cd7ea2e8a rename stall-related internal stats
Summary:
Some of these names, like `MEMTABLE_COMPACTION`, did not mean anything. Tried to give them descriptive names.
Closes https://github.com/facebook/rocksdb/pull/2852

Differential Revision: D5782822

Pulled By: ajkr

fbshipit-source-id: f2695c4124af4073da4492d7135bae2411220f3a
2017-09-07 18:26:18 -07:00
Siying Dong 3c327ac2d0 Change RocksDB License
Summary: Closes https://github.com/facebook/rocksdb/pull/2589

Differential Revision: D5431502

Pulled By: siying

fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75
2017-07-15 16:11:23 -07:00
奏之章 70440f7a63 Add virtual func IsDeleteRangeSupported
Summary:
this modify allows third-party tables able to support delete range
Closes https://github.com/facebook/rocksdb/pull/2035

Differential Revision: D5407973

Pulled By: ajkr

fbshipit-source-id: 82e364b7dd5a198660788d59543f15b8f95cc418
2017-07-12 16:58:45 -07:00
Siying Dong 6837a17621 Fix Data Race Between CreateColumnFamily() and GetAggregatedIntProperty()
Summary:
CreateColumnFamily() releases DB mutex after adding column family to the set and install super version (to write option file), so if users call GetAggregatedIntProperty() in the middle, then super version will be null and the process will crash. Fix it by skipping those column families without super version installed.

Maybe we should also fix the problem of releasing the lock when reading option file, but it is more risky. so I'm doing a quick and safer fix and we can investigate it later.
Closes https://github.com/facebook/rocksdb/pull/2475

Differential Revision: D5298053

Pulled By: siying

fbshipit-source-id: 4b3c8f91c60400b163fcc6cda8a0c77723be0ef6
2017-06-22 15:56:47 -07:00
Siying Dong 52a7f38b19 WriteOptions.low_pri which can throttle low pri writes if needed
Summary:
If ReadOptions.low_pri=true and compaction is behind, the write will either return immediate or be slowed down based on ReadOptions.no_slowdown.
Closes https://github.com/facebook/rocksdb/pull/2369

Differential Revision: D5127619

Pulled By: siying

fbshipit-source-id: d30e1cff515890af0eff32dfb869d2e4c9545eb0
2017-06-05 15:02:35 -07:00
hyunwoo c7662a44a4 fixed typo
Summary:
fixed typo
Closes https://github.com/facebook/rocksdb/pull/2376

Differential Revision: D5183630

Pulled By: ajkr

fbshipit-source-id: 133cfd0445959e70aa2cd1a12151bf3c0c5c3ac5
2017-06-05 11:27:34 -07:00
Andrew Kryczka a4d9c02511 Pass CF ID to MemTableRepFactory
Summary:
Some users want to monitor column family activity in their custom memtable implementations. Previously there was no way to figure out with which column family a memtable is associated. This diff:

- adds an overload to MemTableRepFactory::CreateMemTableRep() that provides the CF ID. For compatibility, its default implementation calls the old overload.
- updates MemTable to create MemTableRep's using the new overload.
Closes https://github.com/facebook/rocksdb/pull/2346

Differential Revision: D5108061

Pulled By: ajkr

fbshipit-source-id: 3a1921214a348dd8ea0f54e1cab3b71c3d46d616
2017-06-02 12:12:06 -07:00
Sagar Vemuri 7bb1f5d483 Increase of compaction threads should be logged at info level instead of a warning
Summary:
This log message shouldn't be a warning; some services are seeing high warning count due to this.

The count for the below line is a few hundreds of millions, as per Logview:
```
[rocksdb/src/db/column_family.cc:729] [checkpoints] Increasing compaction threads because we have 2 level-0 files
```
Closes https://github.com/facebook/rocksdb/pull/2364

Differential Revision: D5123565

Pulled By: sagar0

fbshipit-source-id: a07ce499a4f82f0ebde9cda9f4948fb9df6a734c
2017-05-26 09:56:13 -07:00
Mikhail Antonov ba685a472a Support ingest_behind for IngestExternalFile
Summary:
First cut for early review; there are few conceptual points to answer and some code structure issues.

For conceptual points -

 - restriction-wise, we're going to disallow ingest_behind if (use_seqno_zero_out=true || disable_auto_compaction=false), the user is responsible to properly open and close DB with required params
 - we wanted to ingest into reserved bottom most level. Should we fail fast if bottom level isn't empty, or should we attempt to ingest if file fits there key-ranges-wise?
 - Modifying AssignLevelForIngestedFile seems the place we we'd handle that.

On code structure - going to refactor GenerateAndAddExternalFile call in the test class to allow passing instance of IngestionOptions, that's just going to incur lots of changes at callsites.
Closes https://github.com/facebook/rocksdb/pull/2144

Differential Revision: D4873732

Pulled By: lightmark

fbshipit-source-id: 81cb698106b68ef8797f564453651d50900e153a
2017-05-17 11:42:42 -07:00
Siying Dong 264d3f540c Allow IntraL0 compaction in FIFO Compaction
Summary:
Allow an option for users to do some compaction in FIFO compaction, to pay some write amplification for fewer number of files.
Closes https://github.com/facebook/rocksdb/pull/2163

Differential Revision: D4895953

Pulled By: siying

fbshipit-source-id: a1ab608dd0627211f3e1f588a2e97159646e1231
2017-05-04 18:16:13 -07:00
Siying Dong aeaba07b2a Remove an assert that causes TSAN failure.
Summary:
ColumnFamilyData::ConstructNewMemtable is called out of DB mutex, and it asserts current_ is not empty, but current_ should only be accessed inside DB mutex. Remove this assert to make TSAN happy.
Closes https://github.com/facebook/rocksdb/pull/2235

Differential Revision: D4978531

Pulled By: siying

fbshipit-source-id: 423685a7dae88ed3faaa9e1b9ccb3427ac704a4b
2017-05-01 16:35:15 -07:00
Siying Dong d616ebea23 Add GPLv2 as an alternative license.
Summary: Closes https://github.com/facebook/rocksdb/pull/2226

Differential Revision: D4967547

Pulled By: siying

fbshipit-source-id: dd3b58ae1e7a106ab6bb6f37ab5c88575b125ab4
2017-04-27 18:06:12 -07:00
Siying Dong ff97287016 Refactor compaction picker code
Summary:
1. Move universal compaction picker to separate files compaction_picker_universal.cc and compaction_picker_universal.h.
2. Rename some functions to make the code easier to understand.
3. Move leveled compaction picking code to a dedicated class, so that we we don't need to pass some common variable around when calling functions. It also allowed us to break down LevelCompactionPicker::PickCompaction() to smaller functions.
Closes https://github.com/facebook/rocksdb/pull/2100

Differential Revision: D4845948

Pulled By: siying

fbshipit-source-id: efa0ab4
2017-04-06 20:09:34 -07:00
Siying Dong d2dce5611a Move some files under util/ to separate dirs
Summary:
Move some files under util/ to new directories env/, monitoring/ options/ and cache/
Closes https://github.com/facebook/rocksdb/pull/2090

Differential Revision: D4833681

Pulled By: siying

fbshipit-source-id: 2fd8bef
2017-04-05 19:09:16 -07:00
Islam AbdelRahman e19163688b Add macros to include file name and line number during Logging
Summary:
current logging
```
2017/03/14-14:20:30.393432 7fedde9f5700 (Original Log Time 2017/03/14-14:20:30.393414) [default] Level summary: base level 1 max bytes base 268435456 files[1 0 0 0 0 0 0] max score 0.25
2017/03/14-14:20:30.393438 7fedde9f5700 [JOB 2] Try to delete WAL files size 61417909, prev total WAL file size 73820858, number of live WAL files 2.
2017/03/14-14:20:30.393464 7fedde9f5700 [DEBUG] [JOB 2] Delete /dev/shm/old_logging//MANIFEST-000001 type=3 #1 -- OK
2017/03/14-14:20:30.393472 7fedde9f5700 [DEBUG] [JOB 2] Delete /dev/shm/old_logging//000003.log type=0 #3 -- OK
2017/03/14-14:20:31.427103 7fedd49f1700 [default] New memtable created with log file: #9. Immutable memtables: 0.
2017/03/14-14:20:31.427179 7fedde9f5700 [JOB 3] Syncing log #6
2017/03/14-14:20:31.427190 7fedde9f5700 (Original Log Time 2017/03/14-14:20:31.427170) Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots allowed 1, compaction slots scheduled 1
2017/03/14-14:20:31.
Closes https://github.com/facebook/rocksdb/pull/1990

Differential Revision: D4708695

Pulled By: IslamAbdelRahman

fbshipit-source-id: cb8968f
2017-03-15 19:39:12 -07:00
Siying Dong 1ba2804b7f Remove XFunc tests
Summary:
Xfunc is hardly used. Remove it to keep the code simple.
Closes https://github.com/facebook/rocksdb/pull/1905

Differential Revision: D4603220

Pulled By: siying

fbshipit-source-id: 731f96d
2017-02-23 12:09:11 -08:00
Reid Horuff 5cf176ca15 Fix for 2PC causing WAL to grow too large
Summary:
Consider the following single column family scenario:
prepare in log A
commit in log B
*WAL is too large, flush all CFs to releast log A*
*CFA is on log B so we do not see CFA is depending on log A so no flush is requested*

To fix this we must also consider the log containing the prepare section when determining what log a CF is dependent on.
Closes https://github.com/facebook/rocksdb/pull/1768

Differential Revision: D4403265

Pulled By: reidHoruff

fbshipit-source-id: ce800ff
2017-01-19 15:39:12 -08:00
Islam AbdelRahman 3c6b49ed66 Fix implicit conversion between int64_t to int
Summary:
Make conversion explicit, implicit conversion breaks the build
Closes https://github.com/facebook/rocksdb/pull/1589

Differential Revision: D4245158

Pulled By: IslamAbdelRahman

fbshipit-source-id: aaec00d
2016-11-29 10:54:15 -08:00
Islam AbdelRahman a2bf265a39 Avoid intentional overflow in GetL0ThresholdSpeedupCompaction
Summary:
99c052a34f fixes integer overflow in GetL0ThresholdSpeedupCompaction() by checking if int become -ve.
UBSAN will complain about that since this is still an overflow, we can fix the issue by simply using int64_t
Closes https://github.com/facebook/rocksdb/pull/1582

Differential Revision: D4241525

Pulled By: IslamAbdelRahman

fbshipit-source-id: b3ae21f
2016-11-28 18:39:13 -08:00
Siying Dong cd7c4143d7 Improve Write Stalling System
Summary:
Current write stalling system has the problem of lacking of positive feedback if the restricted rate is already too low. Users sometimes stack in very low slowdown value. With the diff, we add a positive feedback (increasing the slowdown value) if we recover from slowdown state back to normal. To avoid the positive feedback to keep the slowdown value to be to high, we add issue a negative feedback every time we are close to the stop condition. Experiments show it is easier to reach a relative balance than before.

Also increase level0_stop_writes_trigger default from 24 to 32. Since level0_slowdown_writes_trigger default is 20, stop trigger 24 only gives four files as the buffer time to slowdown writes. In order to avoid stop in four files while 20 files have been accumulated, the slowdown value must be very low, which is amost the same as stop. It also doesn't give enough time for the slowdown value to converge. Increase it to 32 will smooth out the system.
Closes https://github.com/facebook/rocksdb/pull/1562

Differential Revision: D4218519

Pulled By: siying

fbshipit-source-id: 95e4088
2016-11-23 09:24:15 -08:00
Yi Wu dfb6fe6755 Unified InlineSkipList::Insert algorithm with hinting
Summary:
This PR is based on nbronson's diff with small
modifications to wire it up with existing interface. Comparing to
previous version, this approach works better for inserting keys in
decreasing order or updating the same key, and impose less restriction
to the prefix extractor.

---- Summary from original diff ----

This diff introduces a single InlineSkipList::Insert that unifies
the existing sequential insert optimization (prev_), concurrent insertion,
and insertion using externally-managed insertion point hints.

There's a deep symmetry between insertion hints (cursors) and the
concurrent algorithm.  In both cases we have partial information from
the recent past that is likely but not certain to be accurate.  This diff
introduces the struct InlineSkipList::Splice, which encodes predecessor
and successor information in the same form that was previously only used
within a single call to InsertConcurrently.  Splice holds information
about an insertion point that can be used to levera
Closes https://github.com/facebook/rocksdb/pull/1561

Differential Revision: D4217283

Pulled By: yiwu-arbug

fbshipit-source-id: 33ee437
2016-11-22 14:09:13 -08:00
Andrew Kryczka 661e4c9267 DeleteRange unsupported in non-block-based tables
Summary:
Return an error from DeleteRange() (or Write() if the user is using the
low-level WriteBatch API) if an unsupported table type is configured.
Closes https://github.com/facebook/rocksdb/pull/1519

Differential Revision: D4185933

Pulled By: ajkr

fbshipit-source-id: abcdf84
2016-11-15 15:24:16 -08:00
Islam AbdelRahman eba99c28e4 Fix min_write_buffer_number_to_merge = 0 bug
Summary:
It's possible that we set min_write_buffer_number_to_merge to 0.
This should never happen
Closes https://github.com/facebook/rocksdb/pull/1515

Differential Revision: D4183356

Pulled By: yiwu-arbug

fbshipit-source-id: c9d39d7
2016-11-15 13:54:08 -08:00
Yi Wu 1ea79a78c9 Optimize sequential insert into memtable - Part 1: Interface
Summary:
Currently our skip-list have an optimization to speedup sequential
inserts from a single stream, by remembering the last insert position.
We extend the idea to support sequential inserts from multiple streams,
and even tolerate small reordering wihtin each stream.

This PR is the interface part adding the following:
- Add `memtable_insert_prefix_extractor` to allow specifying prefix for each key.
- Add `InsertWithHint()` interface to memtable, to allow underlying
  implementation to return a hint of insert position, which can be later
  pass back to optimize inserts.
- Memtable will maintain a map from prefix to hints and pass the hint
  via `InsertWithHint()` if `memtable_insert_prefix_extractor` is non-null.
Closes https://github.com/facebook/rocksdb/pull/1419

Differential Revision: D4079367

Pulled By: yiwu-arbug

fbshipit-source-id: 3555326
2016-11-13 19:09:18 -08:00
Lijun Tang adb665e0bf Allowed delayed_write_rate option to be dynamically set.
Summary: Closes https://github.com/facebook/rocksdb/pull/1488

Differential Revision: D4157784

Pulled By: siying

fbshipit-source-id: f150081
2016-11-12 15:54:11 -08:00
Edouard A 99c052a34f Fix integer overflow in GetL0ThresholdSpeedupCompaction (#1378) 2016-10-23 18:43:29 -07:00
Aaron Gao 59a7c0337b Change ioptions to store user_comparator, fix bug
Summary:
change ioptions.comparator to user_comparator instread of internal_comparator.
Also change Comparator* to InternalKeyComparator* to make its type explicitly.

Test Plan: make all check -j64

Reviewers: andrewkr, sdong, yiwu

Reviewed By: yiwu

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D65121
2016-10-21 11:31:42 -07:00
Islam AbdelRahman 5691a1d8a4 Fix compaction conflict with running compaction
Summary:
Issue scenario:
(1) We have 3 files in L1 and we issue a compaction that will compact them into 1 file in L2
(2) While compaction (1) is running, we flush a file into L0 and trigger another compaction that decide to move this file to L1 and then move it again to L2 (this file don't overlap with any other files)
(3) compaction (1) finishes and install the file it generated in L2, but this file overlap with the file we generated in (2) so we break the LSM consistency

Looks like this issue can be triggered by using non-exclusive manual compaction or AddFile()

Test Plan: unit tests

Reviewers: sdong

Reviewed By: sdong

Subscribers: hermanlee4, jkedgar, andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D64947
2016-10-13 10:49:06 -07:00
Yi Wu 9ed928e7a9 Split DBOptions into ImmutableDBOptions and MutableDBOptions
Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs.

Test Plan:
  make all check

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64065
2016-09-23 16:34:04 -07:00
Aaron Gao 0a1bd9c509 add cfh deletion started listener
Summary: add ColumnFamilyHandleDeletionStarted listener which can be called when user deletes handler.

Test Plan: ./listener_test

Reviewers: yiwu, IslamAbdelRahman, sdong, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60717
2016-09-22 11:56:18 -07:00
Yi Wu 0a88f38b7e Remove ColumnFamilyData::options()
Summary: One more small refactor before I split DBOptions into mutable and immutable parts.

Test Plan: existing unit tests.

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64047
2016-09-16 15:09:14 -07:00
Yi Wu 17f76fc564 DB::GetOptions() reflect dynamic changed options
Summary: DB::GetOptions() reflect dynamic changed options.

Test Plan: See the new unit test.

Reviewers: yhchiang, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63903
2016-09-14 22:10:28 -07:00
Yi Wu 81747f1be6 Refactor MutableCFOptions
Summary:
* Change constructor of MutableCFOptions to depends only on ColumnFamilyOptions.
* Move `max_subcompactions`, `compaction_options_fifo` and `compaction_pri` to ImmutableCFOptions to make it clear that they are immutable.

Test Plan: existing unit tests.

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63945
2016-09-13 21:11:59 -07:00
sdong 32149059f9 Merge options source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes
Summary: To reduce number of options, merge source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes.

Test Plan: Add two new unit tests. Run all existing tests, including jtest.

Reviewers: yhchiang, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59829
2016-09-01 14:33:24 -07:00
Yi Wu ee027fc19f Ignore write stall triggers when auto-compaction is disabled
Summary:
My understanding is that the purpose of write stall triggers are to wait for auto-compaction to catch up. Without auto-compaction, we don't need to stall writes.

Also with this diff, flush/compaction conditions are recalculated on dynamic option change. Previously the conditions are recalculate only when write stall options are changed.

Test Plan: See the new test. Removed two tests that are no longer valid.

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61437
2016-08-02 21:55:26 -07:00
Jay Edgar cdc4eb6892 Add a GetComparator() function to the ColumnFamilyHandle base class so that the user's comparator can be retrieved.
Summary: MyRocks is adding support for the user of the SstFileWriter which needs a comparator.  It would be more convenient to get the comparator from the column family (which already has to have it) than to have caller keep track of it.

Test Plan: Standard tests (adding one for the new method)

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61155
2016-08-02 14:34:57 -07:00
sdong 32df9733d1 Add options.write_buffer_manager: control total memtable size across DB instances
Summary: Add option write_buffer_manager to help users control total memory spent on memtables across multiple DB instances.

Test Plan: Add a new unit test.

Reviewers: yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: adela, benj, sumeet, muthu, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59925
2016-07-05 18:11:25 -07:00
sdong 7b79238b65 Deprectate filter_deletes
Summary: filter_deltes is not a frequently used feature. Remove it.

Test Plan: Run all test suites.

Reviewers: igor, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59427
2016-06-17 10:30:47 -07:00
sdong 20699df843 memtable_prefix_bloom_bits -> memtable_prefix_bloom_bits_ratio and deprecate memtable_prefix_bloom_probes
Summary:
memtable_prefix_bloom_probes is not a critical option. Remove it to reduce number of options.
It's easier for users to make mistakes with memtable_prefix_bloom_bits, turn it to memtable_prefix_bloom_bits_ratio

Test Plan: Run all existing tests

Reviewers: yhchiang, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: gunnarku, yoshinorim, MarkCallaghan, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59199
2016-06-10 12:12:10 -07:00
Warren Falk b8cf9130f8 Fix #1110, 32-bit build failure on Mac OSX (#1112)
Using explicit 64-bit type in conditional in platforms above 32-bits
This appears to be necessary on Mac OSX as std::conditional does not appear to short circuit and evaluates the third template arg
Making the third template arg be 64 bits explicitly works around this problem and will work on both 32 bit and 64+ bit platforms.
2016-05-02 10:04:37 -07:00
Dhruba Borthakur 1a2cc27e01 ColumnFamilyOptions SanitizeOptions is buggy on 32-bit platforms.
Summary:
The pre-existing code is trying to clamp between 65,536 and 0,
resulting in clamping to 65,536, resulting in very small buffers,
resulting in ShouldFlushNow() being true quite easily,
resulting in assertion failing and database performance
being "not what it should be".

https://github.com/facebook/rocksdb/issues/1018

Test Plan: make check

Reviewers: sdong, andrewkr, IslamAbdelRahman, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55455
2016-03-14 16:21:54 -07:00
Baraa Hamodi 21e95811d1 Updated all copyright headers to the new format. 2016-02-09 15:12:00 -08:00
sdong b1887c5dd9 Explictly fail when memtable doesn't support concurrent insert
Summary: If users turn on concurrent insert but the memtable doesn't support it, they might see unexcepted crash. Fix it by explicitly fail.

Test Plan:
Run different setting of stress_test and make sure it fails correctly.
Will add a unit test too.

Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, andrewkr, ngbronson

Reviewed By: ngbronson

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D53895
2016-02-05 14:15:50 -08:00
Venkatesh Radhakrishnan 3b2a1ddd2e Add options.base_background_compactions as a number of compaction threads for low compaction debt
Summary:
If options.base_background_compactions is given, we try to schedule number of compactions not existing this number, only when L0 files increase to certain number, or pending compaction bytes more than certain threshold, we schedule compactions based on options.max_background_compactions.

The watermarks are calculated based on slowdown thresholds.

Test Plan:
Add new test cases in column_family_test.
Adding more unit tests.

Reviewers: IslamAbdelRahman, yhchiang, kradhakrishnan, rven, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D53409
2016-01-29 16:15:53 -08:00
sdong 6ee38bb15c Slowdown of writing to the last memtable should not override stopping
Summary: Now slowing down for the last mem table takes priority against some stopping conditions. This is logically confusing. Fix it.

Test Plan: Run all existing tests.

Reviewers: yhchiang, IslamAbdelRahman, kradhakrishnan, andrewkr, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D53529
2016-01-28 21:52:29 -08:00
Yueh-Hsuan Chiang 6935eb24e0 Add ColumnFamilyHandle::GetDescriptor()
Summary:
This patch addes ColumnFamilyHandle::GetDescriptor(), which allows
developers to obtain the CF options and names of the associated column
family given its handle.

  // Returns the up-to-date descriptor used by the current handle.  Since it
  // returns the up-to-date information, this call might internally locks
  // and releases DB mutex to access the up-to-date CF options.
  virtual ColumnFamilyDescriptor GetDescriptor() = 0;

Test Plan: augment column_family_test

Reviewers: sdong, yoshinorim, IslamAbdelRahman, rven, kradhakrishnan, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51543
2016-01-06 18:14:01 -08:00
Siying Dong 22c0ed8a5f Disable Visual Studio Warning C4351
Currently Windows build is broken because of Warning C4351. Disable the warning before figuring out the right way to fix it.
2015-12-28 15:06:34 -08:00
Nathan Bronson 7d87f02799 support for concurrent adds to memtable
Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations.  Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention.  Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.

Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off).  This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex.  If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided.  This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).

Parallel writes are not currently compatible with
inplace updates, update callbacks, or delete filtering.
Enable it with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield).  Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.

Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.

This diff was motivated and inspired by Yahoo's cLSM work.  It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.

My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive.  With 1
thread I get ~440Kops/sec.  Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads.  Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.

Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba

Differential Revision: https://reviews.facebook.net/D50589
2015-12-25 11:03:40 -08:00
sdong b9f77ba12b When slowdown is triggered, reduce the write rate
Summary: It's usually hard for users to set a value of options.delayed_write_rate. With this diff, after slowdown condition triggers, we greedily reduce write rate if estimated pending compaction bytes increase. If estimated compaction pending bytes drop, we increase the write rate.

Test Plan:
Add a unit test
Test with db_bench setting:
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -num=10000000 --soft_pending_compaction_bytes_limit=1000000000 --hard_pending_compaction_bytes_limit=3000000000 --delayed_write_rate=100000000

and make sure without the commit, write stop will happen, but with the commit, it will not happen.

Reviewers: igor, anthony, rven, yhchiang, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52131
2015-12-23 11:33:15 -08:00
sdong d72b31774e Slowdown when writing to the last write buffer
Summary: Now if inserting to mem table is much faster than writing to files, there is no mechanism users can rely on to avoid stopping for reaching options.max_write_buffer_number. With the commit, if there are more than four maximum write buffers configured, we slow down to the rate of options.delayed_write_rate while we reach the last one.

Test Plan:
1. Add a new unit test.
2. Run db_bench with

./db_bench --benchmarks=fillrandom --num=10000000 --max_background_flushes=6 --batch_size=32 -max_write_buffer_number=4 --delayed_write_rate=500000 --statistics

based on hard drive and see stopping is avoided with the commit.

Reviewers: yhchiang, IslamAbdelRahman, anthony, rven, kradhakrishnan, igor

Reviewed By: igor

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52047
2015-12-17 10:49:08 -08:00
Venkatesh Radhakrishnan 030215bf01 Running manual compactions in parallel with other automatic or manual compactions in restricted cases
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.

Test Plan: Rocksdb regression + new tests + valgrind

Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong

Reviewed By: sdong

Subscribers: yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47973
2015-12-14 11:20:34 -08:00
sdong 56e77f0967 Deprecate options.soft_rate_limit and add options.soft_pending_compaction_bytes_limit
Summary: Deprecate options.soft_rate_limit, which is hard to tune, with options.soft_pending_compaction_bytes_limit, which would trigger the slowdown if estimated pending compaction bytes exceeds the threshold. The hope is to make it more striaght-forward to tune.

Test Plan: Modify DBTest.SoftLimit to cover options.soft_pending_compaction_bytes_limit instead; run all unit tests.

Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, igor, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51117
2015-12-09 18:22:45 -08:00
sdong dac5b248b1 UniversalCompactionPicker::PickCompaction(): avoid to form compactions if there is no file
Summary:
Currently RocksDB may break in lines like this:

for (size_t i = sorted_runs.size() - 1; i >= first_index_after; i--) {

if options.level0_file_num_compaction_trigger=0.

Fix it by not executing the logic of picking compactions if there is no file (sorted_runs.size() = 0). Also internally set options.level0_file_num_compaction_trigger=1 if users give a 0. 0 is a value makes no sense in RocksDB.

Test Plan: Run all tests. Will add a unit test too.

Reviewers: yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, rven

Reviewed By: rven

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D50727
2015-11-16 10:32:45 -08:00
Venkatesh Radhakrishnan a98fbacfa0 Moving memtable related files from util to a new directory memtable
Summary:
We are cleaning up dependencies.
This diff takes a first step at moving memtable files to their own
directory called memtable. In future diffs, we will move other memtable
files from db to memtable.

Test Plan: make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48915
2015-10-16 14:10:33 -07:00
sdong 5de807ac16 Add options.hard_pending_compaction_bytes_limit to stop writes if compaction lagging behind
Summary: Add an option to stop writes if compaction lefts behind. If estimated pending compaction bytes is more than threshold specified by options.hard_pending_compaction_bytes_liimt, writes will stop until compactions are cleared to under the threshold.

Test Plan: Add unit test DBTest.HardLimit

Reviewers: rven, kradhakrishnan, anthony, IslamAbdelRahman, yhchiang, igor

Reviewed By: igor

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D45999
2015-09-14 12:51:16 -07:00
Ari Ekmekji 03ddce9a01 Add counters for L0 stall while L0-L1 compaction is taking place
Summary:
Although there are currently counters to keep track of the
stall caused by having too many L0 files, there is no distinction as
to whether when that stall occurs either (A) L0-L1 compaction is taking
place to try and mitigate it, or (B) no L0-L1 compaction has been scheduled
at the moment. This diff adds a counter for (A) so that the nature of L0
stalls can be better understood.

Test Plan: make all && make check

Reviewers: sdong, igor, anthony, noetzli, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, dhruba

Differential Revision: https://reviews.facebook.net/D46749
2015-09-14 11:03:37 -07:00
agiardullo b5b2b75e52 better tuning of arena block size
Summary: Currently, if users didn't set options.arena_block_size, we set "result.arena_block_size = result.write_buffer_size / 10". It makes result.arena_block_size not a multiplier of 4KB, even if options.write_buffer_size is a multiplier of MBs. When calling malloc to arena_block_size, we may waste a small amount of memory for it. We now make the default to be /8 or /16 and align it to 4KB.

Test Plan: unit tests

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46467
2015-09-08 20:53:32 -07:00
Islam AbdelRahman 027ca5b2cd Total SST files size DB Property
Summary: Add a new DB property that calculate the total size of files used by all RocksDB Versions

Test Plan: Unittests for the new property

Reviewers: igor, yhchiang, anthony, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D44799
2015-08-20 11:47:19 -07:00
Yueh-Hsuan Chiang df79eafcb3 Introduce GetIntProperty("rocksdb.size-all-mem-tables")
Summary:
Currently, GetIntProperty("rocksdb.cur-size-all-mem-tables") only returns
the memory usage by those memtables which have not yet been flushed.

This patch introduces GetIntProperty("rocksdb.size-all-mem-tables"),
which includes the memory usage by all the memtables, includes those
have been flushed but pinned by iterators.

Test Plan: Added a test in db_test

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D44229
2015-08-19 13:32:09 -07:00
Igor Canadi 35ca59364c Don't let flushes preempt compactions
Summary:
When we first started, max_background_flushes was 0 by default and compaction thread was executing flushes (since there was no flush thread). Then, we switched the default max_background_flushes to 1. However, we still support the case where there is no flush thread and flushes are done in compaction. This is making our code a bit more complicated. By not supporting this use-case we can make our code simpler.

We have a special case that when you set max_background_flushes to 0, we
schedule the flush to execute on the compaction thread.

Test Plan: make check (there might be some unit tests that depend on this behavior)

Reviewers: IslamAbdelRahman, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41931
2015-07-17 12:02:52 -07:00
Yueh-Hsuan Chiang 4ce5be4255 fixed leaking log::Writers
Summary: Fixes valgrind errors in column_family_test.

Test Plan: `make check`, `make valgrind_check`

Reviewers: igor, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D41181
2015-07-07 12:10:10 -07:00
Igor Canadi 760e9a94de Fail DB::Open() when the requested compression is not available
Summary:
Currently RocksDB silently ignores this issue and doesn't compress the data. Based on discussion, we agree that this is pretty bad because it can cause confusion for our users.

This patch fails DB::Open() if we don't support the compression that is specified in the options.

Test Plan: make check with LZ4 not present. If Snappy is not present all tests will just fail because Snappy is our default library. We should make Snappy the requirement, since without it our default DB::Open() fails.

Reviewers: sdong, MarkCallaghan, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39687
2015-06-18 14:55:05 -07:00
Igor Canadi 4b8bb62f0a Don't dump DBOptions for each column family
Summary: Currently we dump DBOptions for each column family options we dump. This leads to duplicate lines in our LOG file. This diff fixes that.

Test Plan: Check out the LOG

Reviewers: sdong, rven, yhchiang

Reviewed By: yhchiang

Subscribers: IslamAbdelRahman, yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39729
2015-06-18 10:15:54 -07:00
sdong 7842920be5 Slow down writes by bytes written
Summary:
We slow down data into the database to the rate of options.delayed_write_rate (a new option) with this patch.

The thread synchronization approach I take is to still synchronize write controller by DB mutex and GetDelay() is inside DB mutex. Try to minimize the frequency of getting time in GetDelay(). I verified it through db_bench and it seems to work

hard_rate_limit is deprecated.

options.delayed_write_rate is still not dynamically changeable. Need to work on it as a follow-up.

Test Plan: Add new unit tests in db_test

Reviewers: yhchiang, rven, kradhakrishnan, anthony, MarkCallaghan, igor

Reviewed By: igor

Subscribers: ikabiljo, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D36351
2015-06-11 20:42:18 -07:00
Yueh-Hsuan Chiang fe5c6321cb Allow EventListener::OnCompactionCompleted to return CompactionJobStats.
Summary:
Allow EventListener::OnCompactionCompleted to return CompactionJobStats,
which contains useful information about a compaction.

Example CompactionJobStats returned by OnCompactionCompleted():
    smallest_output_key_prefix 05000000
    largest_output_key_prefix 06990000
    elapsed_time 42419
    num_input_records 300
    num_input_files 3
    num_input_files_at_output_level 2
    num_output_records 200
    num_output_files 1
    actual_bytes_input 167200
    actual_bytes_output 110688
    total_input_raw_key_bytes 5400
    total_input_raw_value_bytes 300000
    num_records_replaced 100
    is_manual_compaction 1

Test Plan: Developed a mega test in db_test which covers 20 variables in CompactionJobStats.

Reviewers: rven, igor, anthony, sdong

Reviewed By: sdong

Subscribers: tnovak, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38463
2015-06-02 17:07:16 -07:00
agiardullo dc9d70de65 Optimistic Transactions
Summary: Optimistic transactions supporting begin/commit/rollback semantics.  Currently relies on checking the memtable to determine if there are any collisions at commit time.  Not yet implemented would be a way of enuring the memtable has some minimum amount of history so that we won't fail to commit when the memtable is empty.  You should probably start with transaction.h to get an overview of what is currently supported.

Test Plan: Added a new test, but still need to look into stress testing.

Reviewers: yhchiang, igor, rven, sdong

Reviewed By: sdong

Subscribers: adamretter, MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D33435
2015-05-29 14:36:35 -07:00
agiardullo c815351038 Support saving history in memtable_list
Summary:
For transactions, we are using the memtables to validate that there are no write conflicts.  But after flushing, we don't have any memtables, and transactions could fail to commit.  So we want to someone keep around some extra history to use for conflict checking.  In addition, we want to provide a way to increase the size of this history if too many transactions fail to commit.

After chatting with people, it seems like everyone prefers just using Memtables to store this history (instead of a separate history structure).  It seems like the best place for this is abstracted inside the memtable_list.  I decide to create a separate list in MemtableListVersion as using the same list complicated the flush/installalflushresults logic too much.

This diff adds a new parameter to control how much memtable history to keep around after flushing.  However, it sounds like people aren't too fond of adding new parameters.  So I am making the default size of flushed+not-flushed memtables be set to max_write_buffers.  This should not change the maximum amount of memory used, but make it more likely we're using closer the the limit.  (We are now postponing deleting flushed memtables until the max_write_buffer limit is reached).  So while we might use more memory on average, we are still obeying the limit set (and you could argue it's better to go ahead and use up memory now instead of waiting for a write stall to happen to test this limit).

However, if people are opposed to this default behavior, we can easily set it to 0 and require this parameter be set in order to use transactions.

Test Plan: Added a xfunc test to play around with setting different values of this parameter in all tests.  Added testing in memtablelist_test and planning on adding more testing here.

Reviewers: sdong, rven, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D37443
2015-05-28 16:34:24 -07:00
Yueh-Hsuan Chiang 672dda9b3b [API Change] Move listeners from ColumnFamilyOptions to DBOptions
Summary: Move listeners from ColumnFamilyOptions to DBOptions

Test Plan:
listener_test
compact_files_test

Reviewers: rven, anthony, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39087
2015-05-28 13:21:39 -07:00
Yueh-Hsuan Chiang 687214f878 Ensure ColumnFamilyOptions.num_levels >= 2 when level compaction is used.
Summary: Ensure ColumnFamilyOptions.num_levels >= 2 when level compaction is used.

Test Plan: Extend SanitizeOptions test in column_family_test

Reviewers: sdong, rven, anthony, krishnanm86, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38829
2015-05-22 11:35:40 -07:00
sdong 6fa7085121 CompactRange skips levels 1 to base_level -1 for dynamic level base size
Summary: CompactRange() now is much more expensive for dynamic level base size as it goes through all the levels. Skip those not used levels between level 0 an base level.

Test Plan: Run all unit tests

Reviewers: yhchiang, rven, anthony, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D37125
2015-05-18 10:54:11 -07:00
Yueh-Hsuan Chiang df1f87a882 Fixed compile error in db/column_family.cc
Summary:
Fixed the following compile error in db/column_family.cc
    db/column_family.cc:633:33: error: ‘ASSERT_GT’ was not declared in this scope
    16:14:45    ASSERT_GT(listeners.size(), 0U);

Test Plan: make db_test

Reviewers: igor, sdong, rven

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38367
2015-05-12 16:20:03 -07:00
Yueh-Hsuan Chiang 14431e971d Fixed a bug in EventListener::OnCompactionCompleted().
Summary:
Fixed a bug in EventListener::OnCompactionCompleted() that returns
incorrect list of input / output file names.

Test Plan: Extend existing test in listener_test.cc

Reviewers: sdong, rven, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38349
2015-05-12 16:10:23 -07:00
clark.kang 6ede020dc4 fix typos 2015-04-25 18:14:27 +09:00
sdong d01bbb53ae Fix CompactRange for universal compaction with num_levels > 1
Summary:
CompactRange for universal compaction with num_levels > 1 seems to have a bug. The unit test also has a bug so it doesn't capture the problem.
Fix it. Revert the compact range to the logic equivalent to num_levels=1. Always compact all files together.

It should also fix DBTest.IncreaseUniversalCompactionNumLevels. The issue was that options.write_buffer_size = 100 << 10 and options.write_buffer_size = 100 << 10 are not used in later test scenarios. So write_buffer_size of 4MB was used. The compaction trigger condition is not anymore obvious as expected.

Test Plan: Run the new test and all test suites

Reviewers: yhchiang, rven, kradhakrishnan, anthony, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D37551
2015-04-23 19:12:31 -07:00
Igor Canadi 5e067a7b19 Clean up compression logging
Summary: Now we add warnings when user configures compression and the compression is not supported.

Test Plan:
Configured compression to non-supported values. Observed messages in my log:

    2015/03/26-12:17:57.586341 7ffb8a496840 [WARN] Compression type chosen for level 2 is not supported: LZ4. RocksDB will not compress data on level 2.

    2015/03/26-12:19:10.768045 7f36f15c5840 [WARN] Compression type chosen is not supported: LZ4. RocksDB will not compress data.

Reviewers: rven, sdong, yhchiang

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D35979
2015-04-06 12:50:44 -07:00
sdong 953a885ebf A new call back to TablePropertiesCollector to allow users know the entry is add, delete or merge
Summary:
Currently users have no idea a key is add, delete or merge from TablePropertiesCollector call back. Add a new function to add it.

Also refactor the codes so that
(1) make table property collector and internal table property collector two separate data structures with the later one now exposed
(2) table builders only receive internal table properties

Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector.

Reviewers: yhchiang, igor.sugak, rven, igor

Reviewed By: rven, igor

Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D35373
2015-04-06 10:27:21 -07:00
sdong b23bbaa82a Universal Compactions with Small Files
Summary:
With this change, we use L1 and up to store compaction outputs in universal compaction.
The compaction pick logic stays the same. Outputs are stored in the largest "level" as possible.

If options.num_levels=1, it behaves all the same as now.

Test Plan:
1) convert most of existing unit tests for universal comapaction to include the option of one level and multiple levels.
2) add a unit test to cover parallel compaction in universal compaction and run it in one level and multiple levels
3) add unit test to migrate from multiple level setting back to one level setting
4) add a unit test to insert keys to trigger multiple rounds of compactions and verify results.

Reviewers: rven, kradhakrishnan, yhchiang, igor

Reviewed By: igor

Subscribers: meyering, leveldb, MarkCallaghan, dhruba

Differential Revision: https://reviews.facebook.net/D34539
2015-03-30 15:12:02 -07:00
Mark Callaghan c8da670325 Stop printing per-level stall times.
Summary:
Per-level stall times are the suggested stall time, not the actual stall time so this change stops printing them
both in the per-level output lines and in the summary. Also changed output for total stall time to include units
in all cases. The new output looks like:
Level   Files   Size(MB) Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) Stall(cnt)    RecordIn   RecordDrop
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0     4/1          7   0.8      0.0     0.0      0.0       0.6      0.6       0.0   0.0      0.0     12.9        50       352    0.141        882            0            0
  L1     5/0          9   0.9      0.0     0.0      0.0       0.0      0.0       0.6   0.0      0.0      0.0         0         0    0.000          0            0            0
  L2    54/0         99   1.0      0.0     0.0      0.0       0.0      0.0       0.6   0.0      0.0      0.0         0         0    0.000          0            0            0
  L3   289/0        527   0.5      0.0     0.0      0.0       0.0      0.0       0.5   0.0      0.0      0.0         0         0    0.000          0            0            0
 Sum   352/1        642   0.0      0.0     0.0      0.0       0.6      0.6       1.7   1.0      0.0     12.9        50       352    0.141        882            0            0
 Int     0/0          0   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0     15.5         0         3    0.118          7            0            0
Flush(GB): accumulative 0.627, interval 0.005
Stalls(count): 0 level0_slowdown, 0 level0_numfiles, 882 memtable_compaction, 0 leveln_slowdown_soft, 0 leveln_slowdown_hard

Task ID: #6493861

Blame Rev:

Test Plan:
run db_bench, look at output

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D35085
2015-03-14 15:01:43 -07:00
Igor Canadi db03739340 options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.

In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.

Test Plan: New unit tests and pass tests suites including valgrind.

Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo

Reviewed By: ikabiljo

Subscribers: yoshinorim, ikabiljo, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D31437
2015-03-02 22:40:41 -08:00
Jinfu Leng 96d989f70d catch config errors with L0 file count triggers
Test Plan: Run "make clean && make all check"

Reviewers: rven, igor, yhchiang, kradhakrishnan, MarkCallaghan, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D33627
2015-02-23 16:08:27 -08:00
sdong d45a6a4002 Add rocksdb.num-live-versions: number of live versions
Summary: Add a DB property about live versions. It can be helpful to figure out whether there are files not live but not yet deleted, in some use cases.

Test Plan: make all check

Reviewers: rven, yhchiang, igor

Reviewed By: igor

Subscribers: yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D33327
2015-02-19 13:10:37 -08:00
Igor Canadi e7ea51a8e7 Introduce job_id for flush and compaction
Summary:
It would be good to assing background job their IDs. Two benefits:
1) makes LOGs more readable
2) I might use it in my EventLogger, which will try to make our LOG easier to read/query/visualize

Test Plan: ran rocksdb, read the LOG

Reviewers: sdong, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D31617
2015-02-12 09:54:48 -08:00
Yueh-Hsuan Chiang 181191a1e4 Add a counter for collecting the wait time on db mutex.
Summary:
Add a counter for collecting the wait time on db mutex.
Also add MutexWrapper and CondVarWrapper for measuring wait time.

Test Plan:
./db_test
export ROCKSDB_TESTS=MutexWaitStats
./db_test

verify stats output using db_bench
make clean
make release
./db_bench --statistics=1 --benchmarks=fillseq,readwhilewriting --num=10000 --threads=10

Sample output:
    rocksdb.db.mutex.wait.micros COUNT : 7546866

Reviewers: MarkCallaghan, rven, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D32787
2015-02-04 21:39:45 -08:00
Ori Bernstein f9758e0129 Add compaction listener.
Summary: This adds a listener for compactions, and gives some useful statistics on each compaction pass.

Test Plan: Unit tests.

Reviewers: sdong, igor, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D31641
2015-01-27 14:44:02 -08:00
sdong e919ecedfc SuperVersion::Unref() to use sequential consistency to decrease ref counting
Summary: I'm not sure the expected results of std::atomic::fetch_sub() when using memory_order_relaxed, and I suspect TSAN complains.

Test Plan: make all check

Reviewers: rven, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D32259
2015-01-27 14:08:08 -08:00
Igor Canadi f1c8862479 Fix data race #1
Summary:
This is first in a series of diffs that fixes data races detected by thread sanitizer.

Here the problem is that we call Ref() on a column family during a single-threaded write, without holding a mutex.

Test Plan: TSAN is no longer complaining about LevelLimitReopen.

Reviewers: yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D32121
2015-01-26 11:48:07 -08:00
Igor Canadi 7731d51c82 Simplify column family concurrency
Summary:
This patch changes concurrency guarantees around ColumnFamilySet::column_families_ and ColumnFamilySet::column_families_data_.

Before:
* When mutating: lock DB mutex and spin lock
* When reading: lock DB mutex OR spin lock

After:
* When mutating: lock DB mutex and be in write thread
* When reading: lock DB mutex or be in write thread

That way, we eliminate the spin lock that protects these hash maps and  simplify concurrency. That means we don't need to lock the spin lock during writing, since writing is mutually exclusive with column family create/drop (the only operations that mutate those hash maps).

With these new restrictions, I also needed to move column family create to the write thread (column family drop was already in the write thread).

Even though we don't need to lock the spin lock during write, impact on performance should be minimal -- the spin lock is almost never busy, so locking it is almost free.

This addresses task t5116919.

Test Plan:
make check

Stress test with lots and lots of column family drop and create:

   time ./db_stress --threads=30 --ops_per_thread=5000000 --max_key=5000 --column_families=200 --clear_column_family_one_in=100000 --verify_before_write=0  --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress/

Reviewers: yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D30651
2015-01-06 12:44:21 -08:00
Igor Canadi fdb6be4e24 Rewritten system for scheduling background work
Summary:
When scaling to higher number of column families, the worst bottleneck was MaybeScheduleFlushOrCompaction(), which did a for loop over all column families while holding a mutex. This patch addresses the issue.

The approach is similar to our earlier efforts: instead of a pull-model, where we do something for every column family, we can do a push-based model -- when we detect that column family is ready to be flushed/compacted, we add it to the flush_queue_/compaction_queue_. That way we don't need to loop over every column family in MaybeScheduleFlushOrCompaction.

Here are the performance results:

Command:

    ./db_bench --write_buffer_size=268435456 --db_write_buffer_size=268435456 --db=/fast-rocksdb-tmp/rocks_lots_of_cf --use_existing_db=0 --open_files=55000 --statistics=1 --histogram=1 --disable_data_sync=1 --max_write_buffer_number=2 --sync=0 --benchmarks=fillrandom --threads=16 --num_column_families=5000  --disable_wal=1 --max_background_flushes=16 --max_background_compactions=16 --level0_file_num_compaction_trigger=2 --level0_slowdown_writes_trigger=2 --level0_stop_writes_trigger=3 --hard_rate_limit=1 --num=33333333 --writes=33333333

Before the patch:

     fillrandom   :      26.950 micros/op 37105 ops/sec;    4.1 MB/s

After the patch:

      fillrandom   :      17.404 micros/op 57456 ops/sec;    6.4 MB/s

Next bottleneck is VersionSet::AddLiveFiles, which is painfully slow when we have a lot of files. This is coming in the next patch, but when I removed that code, here's what I got:

      fillrandom   :       7.590 micros/op 131758 ops/sec;   14.6 MB/s

Test Plan:
make check

two stress tests:

Big number of compactions and flushes:

    ./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0  --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000

max_background_flushes=0, to verify that this case also works correctly

    ./db_stress --threads=30 --ops_per_thread=2000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0  --reopen=3 --max_background_compactions=3 --max_background_flushes=0 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000

Reviewers: ljin, rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D30123
2014-12-19 20:38:12 +01:00
Jonah Cohen a14b7873ee Enforce write buffer memory limit across column families
Summary:
Introduces a new class for managing write buffer memory across column
families.  We supplement ColumnFamilyOptions::write_buffer_size with
ColumnFamilyOptions::write_buffer, a shared pointer to a WriteBuffer
instance that enforces memory limits before flushing out to disk.

Test Plan: Added SharedWriteBuffer unit test to db_test.cc

Reviewers: sdong, rven, ljin, igor

Reviewed By: igor

Subscribers: tnovak, yhchiang, dhruba, xjin, MarkCallaghan, yoshinorim

Differential Revision: https://reviews.facebook.net/D22581
2014-12-02 12:09:20 -08:00
Igor Canadi 37d73d597e Fix linters
Summary:
Two fixes:
1. if cpplint is not present on the system, don't return a confusing error in the linter
2. Add include_alpha, which means our includes should be sorted lexicographically

Test Plan: Tried unsorting our includes, lint complained

Reviewers: rven, ljin, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D28845
2014-12-02 13:53:39 -05:00
Yueh-Hsuan Chiang bcf9086899 Block Universal and FIFO compactions in ROCKSDB_LITE
Summary: Block Universal and FIFO compactions in ROCKSDB_LITE

Test Plan:
make shared_lib -j32
make OPT=-DROCKSDB_LITE shared_lib

Reviewers: ljin, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D29589
2014-11-26 15:45:11 -08:00
Lei Jin 8d3f8f9696 remove all remaining references to cfd->options()
Summary:
The very last reference happens in DBImpl::GetOptions()
I built with both DBImpl::GetOptions() and ColumnFamilyData::options() commented out

Test Plan: make all check

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D29073
2014-11-18 10:20:10 -08:00
Igor Canadi 772bc97f13 No CompactFiles in ROCKSDB_LITE
Summary: It adds lots of code.

Test Plan: compile for iOS, compile for mac. works.

Reviewers: rven, sdong, ljin, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D28857
2014-11-13 16:45:33 -05:00
Igor Canadi 25f273027b Fix iOS compile with -Wshorten-64-to-32
Summary: So iOS size_t is 32-bit, so we need to static_cast<size_t> any uint64_t :(

Test Plan: TARGET_OS=IOS make static_lib

Reviewers: dhruba, ljin, yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D28743
2014-11-13 14:39:30 -05:00
Yueh-Hsuan Chiang 28c82ff1b3 CompactFiles, EventListener and GetDatabaseMetaData
Summary:
This diff adds three sets of APIs to RocksDB.

= GetColumnFamilyMetaData =
* This APIs allow users to obtain the current state of a RocksDB instance on one column family.
* See GetColumnFamilyMetaData in include/rocksdb/db.h

= EventListener =
* A virtual class that allows users to implement a set of
  call-back functions which will be called when specific
  events of a RocksDB instance happens.
* To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners

= CompactFiles =
* CompactFiles API inputs a set of file numbers and an output level, and RocksDB
  will try to compact those files into the specified level.

= Example =
* Example code can be found in example/compact_files_example.cc, which implements
  a simple external compactor using EventListener, GetColumnFamilyMetaData, and
  CompactFiles API.

Test Plan:
listener_test
compactor_test
example/compact_files_example
export ROCKSDB_TESTS=CompactFiles
db_test
export ROCKSDB_TESTS=MetaData
db_test

Reviewers: ljin, igor, rven, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D24705
2014-11-07 14:45:18 -08:00
Igor Canadi 9f20395cd6 Turn -Wshadow back on
Summary: It turns out that -Wshadow has different rules for gcc than clang. Previous commit fixed clang. This commits fixes the rest of the warnings for gcc.

Test Plan: compiles

Reviewers: ljin, yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D28131
2014-11-06 11:14:28 -08:00
Lei Jin fd24ae9d05 SetOptions() to return status and also add it to StackableDB
Summary: as title

Test Plan: ./db_test

Reviewers: sdong, yhchiang, rven, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D28269
2014-11-04 16:23:05 -08:00
sdong ac6afaf9ef Enforce naming convention of getters in version_set.h
Summary: Enforce the accessier naming convention in functions in version_set.h

Test Plan: make all check

Reviewers: ljin, yhchiang, rven, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D28143
2014-11-04 09:59:05 -08:00
Igor Canadi 9f7fc3ac45 Turn on -Wshadow
Summary:
...and fix all the errors :)

Jim suggested turning on -Wshadow because it helped him fix number of critical bugs in fbcode. I think it's a good idea to be -Wshadow clean.

Test Plan: compiles

Reviewers: yhchiang, rven, sdong, ljin

Reviewed By: ljin

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D27711
2014-10-31 11:59:54 -07:00
sdong 4d2ba38b65 Make VersionBuilder unit testable
Summary:
Rename Version::Builder to VersionBuilder and expose its definition to a header.
Make VerisonBuilder not reference Version or ColumnFamilyData, only working with VersionStorageInfo.
Add version_builder_test which has a simple test.

Test Plan: make all check

Reviewers: rven, yhchiang, igor, ljin

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D27969
2014-10-31 10:44:06 -07:00
sdong 76d1c28e82 Make CompactionPicker more easily tested
Summary:
Make compaction picker easier to test.
The basic idea is to separate a minimum subcomponent of Version to VersionStorageInfo, which just responsible to LSM tree. A stub VersionStorageInfo can then be easily created and passed into compaction picker so that we can check the outputs.

It now passes most tests. Still two things need to be done:
(1) deal with the FIFO compaction's file size.
(2) write an example test to make sure the interface can do the job.

Add a compaction_picker_test to make sure compaction picker codes can be easily unit tested.

Test Plan:
Pass all unit tests and compaction_picker_test

Reviewers: yhchiang, rven, igor, ljin

Reviewed By: ljin

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D27639
2014-10-29 15:16:53 -07:00
Yueh-Hsuan Chiang 34d436b7db Apply InfoLogLevel to the logs in db/column_family.cc
Summary: Apply InfoLogLevel to the logs in db/column_family.cc

Test Plan: make

Reviewers: ljin, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D27843
2014-10-29 15:11:32 -07:00
Igor Canadi a39e931e50 FlushProcess
Summary:
Abstract out FlushProcess and take it out of DBImpl.
This also includes taking DeletionState outside of DBImpl.

Currently this diff is only doing the refactoring. Future work includes:
1. Decoupling flush_process.cc, make it depend on less state
2. Write flush_process_test, which will mock out everything that FlushProcess depends on and test it in isolation

Test Plan: make check

Reviewers: rven, yhchiang, sdong, ljin

Reviewed By: ljin

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D27561
2014-10-28 11:54:33 -07:00
Lei Jin f981e08139 unfriend ColumnFamilyData from VersionSet
Summary: as title

Test Plan:
make release
will run full test on all stacked diffs before committing

Reviewers: sdong, yhchiang, rven, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D27591
2014-10-28 10:04:38 -07:00
Lei Jin f1841985e4 dynamic inplace_update options
Summary:
Make inplace_update_support and inplace_update_num_locks dynamic.
inplace_callback becomes immutable
We are almost free of references to cfd->options() in db_impl

Test Plan: unit test

Reviewers: igor, yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D25293
2014-10-27 12:10:13 -07:00
Lei Jin 4d5708aa56 dynamic soft_rate_limit and hard_rate_limit
Summary: as title

Test Plan:
unit test
I am only able to build the test case for hard_rate_limit.
soft_rate_limit is essentially the same thing as hard_rate_limit

Reviewers: igor, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D24759
2014-10-16 17:21:31 -07:00
Lei Jin dc50a1a593 make max_write_buffer_number dynamic
Summary: as title

Test Plan: unit test

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D24729
2014-10-16 16:57:59 -07:00
Lei Jin 5ec53f3edf make compaction related options changeable
Summary:
make compaction related options changeable. Most of changes are tedious,
following the same convention: grabs MutableCFOptions at the beginning
of compaction under mutex, then pass it throughout the job and register
it in SuperVersion at the end.

Test Plan: make all check

Reviewers: igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23349
2014-10-01 16:19:16 -07:00
Igor Canadi 21ddcf6e4f Remove allow_thread_local
Summary: See https://reviews.facebook.net/D19365

Test Plan: compiles

Reviewers: sdong, yhchiang, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23907
2014-09-24 13:12:16 -07:00
sdong d0de413f4d WriteBatchWithIndex to allow different Comparators for different column families
Summary:
Previously, one single column family is given to WriteBatchWithIndex to index keys for all column families. An extra map from column family ID to comparator is maintained which can override the default comparator given in the constructor. A WriteBatchWithIndex::SetComparatorForCF() is added for user to add comparators per column family.

Also move more codes into anonymous namespace.

Test Plan: Add a unit test

Reviewers: ljin, igor

Reviewed By: igor

Subscribers: dhruba, leveldb, yhchiang

Differential Revision: https://reviews.facebook.net/D23355
2014-09-22 13:47:39 -07:00
Lei Jin a062e1f2c4 SetOptions() for memtable related options
Summary: as title

Test Plan:
make all check
I will think a way to set up stress test for this

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23055
2014-09-17 12:49:13 -07:00