Commit graph

1188 commits

Author SHA1 Message Date
Andrew Kryczka 7ffb10fc1a DeleteRange compaction statistics
Summary:
- "rocksdb.compaction.key.drop.range_del" - number of keys dropped during compaction due to a range tombstone covering them
- "rocksdb.compaction.range_del.drop.obsolete" - number of range tombstones dropped due to compaction to bottom level and no snapshot saving them
- s/CompactionIteratorStats/CompactionIterationStats/g since this class is no longer specific to CompactionIterator -- it's also updated for range tombstone iteration during compaction
- Move the above class into a separate .h file to avoid circular dependency.
Closes https://github.com/facebook/rocksdb/pull/1520

Differential Revision: D4187179

Pulled By: ajkr

fbshipit-source-id: 10c2103
2016-11-28 11:54:12 -08:00
Mike Kolupaev 236d4c67e9 Less linear search in DBIter::Seek() when keys are overwritten a lot
Summary:
In one deployment we saw high latencies (presumably from slow iterator operations) and a lot of CPU time reported by perf with this stack:

```
  rocksdb::MergingIterator::Next
  rocksdb::DBIter::FindNextUserEntryInternal
  rocksdb::DBIter::Seek
```

I think what's happening is:
1. we create a snapshot iterator,
2. we do lots of Put()s for the same key x; this creates lots of entries in memtable,
3. we seek the iterator to a key slightly smaller than x,
4. the seek walks over lots of entries in memtable for key x, skipping them because of high sequence numbers.

CC IslamAbdelRahman
Closes https://github.com/facebook/rocksdb/pull/1413

Differential Revision: D4083879

Pulled By: IslamAbdelRahman

fbshipit-source-id: a83ddae
2016-11-28 10:24:11 -08:00
Yi Wu dfb6fe6755 Unified InlineSkipList::Insert algorithm with hinting
Summary:
This PR is based on nbronson's diff with small
modifications to wire it up with existing interface. Comparing to
previous version, this approach works better for inserting keys in
decreasing order or updating the same key, and impose less restriction
to the prefix extractor.

---- Summary from original diff ----

This diff introduces a single InlineSkipList::Insert that unifies
the existing sequential insert optimization (prev_), concurrent insertion,
and insertion using externally-managed insertion point hints.

There's a deep symmetry between insertion hints (cursors) and the
concurrent algorithm.  In both cases we have partial information from
the recent past that is likely but not certain to be accurate.  This diff
introduces the struct InlineSkipList::Splice, which encodes predecessor
and successor information in the same form that was previously only used
within a single call to InsertConcurrently.  Splice holds information
about an insertion point that can be used to levera
Closes https://github.com/facebook/rocksdb/pull/1561

Differential Revision: D4217283

Pulled By: yiwu-arbug

fbshipit-source-id: 33ee437
2016-11-22 14:09:13 -08:00
Maysam Yabandeh 182b940e70 Add WriteOptions.no_slowdown
Summary:
If the WriteOptions.no_slowdown flag is set AND we need to wait or sleep for
the write request, then fail immediately with Status::Incomplete().
Closes https://github.com/facebook/rocksdb/pull/1527

Differential Revision: D4191405

Pulled By: maysamyabandeh

fbshipit-source-id: 7f3ce3f
2016-11-21 18:09:13 -08:00
Karthikeyan Radhakrishnan 4118e13330 Persistent Cache: Expose stats to user via public API
Summary:
Exposing persistent cache stats (counters) to the user via public API.
Closes https://github.com/facebook/rocksdb/pull/1485

Differential Revision: D4155274

Pulled By: siying

fbshipit-source-id: 30a9f50
2016-11-21 17:39:13 -08:00
Maysam Yabandeh 9d60151b04 Implement PositionedAppend for PosixWritableFile
Summary:
This patch clarifies the contract of PositionedAppend with some unit
tests and also implements it for PosixWritableFile. (Tasks: 14524071)
Closes https://github.com/facebook/rocksdb/pull/1514

Differential Revision: D4204907

Pulled By: maysamyabandeh

fbshipit-source-id: 06eabd2
2016-11-18 17:24:13 -08:00
Islam AbdelRahman c1038d2837 Release RocksDB 5.0
Summary:
Update HISTORY.md and version.h
Closes https://github.com/facebook/rocksdb/pull/1536

Differential Revision: D4202987

Pulled By: IslamAbdelRahman

fbshipit-source-id: 94985e3
2016-11-17 18:39:15 -08:00
Islam AbdelRahman f39452e81f Fix heap use after free ASAN/Valgrind
Summary:
Dont use c_str() of temp std::string in RocksLuaCompactionFilter::Name()
Closes https://github.com/facebook/rocksdb/pull/1535

Differential Revision: D4199094

Pulled By: IslamAbdelRahman

fbshipit-source-id: e56ce62
2016-11-17 12:24:12 -08:00
Yi Wu 36e4762ce0 Remove Ticker::SEQUENCE_NUMBER
Summary:
Remove the ticker count because:
* Having to reset the ticker count in WriteImpl is ineffiecent;
* It doesn't make sense to have it as a ticker count if multiple db
  instance share a statistics object.
Closes https://github.com/facebook/rocksdb/pull/1531

Differential Revision: D4194442

Pulled By: yiwu-arbug

fbshipit-source-id: e2110a9
2016-11-16 22:39:09 -08:00
Yueh-Hsuan Chiang 647eafdc21 Introduce Lua Extension: RocksLuaCompactionFilter
Summary:
This diff includes an implementation of CompactionFilter that allows
users to write CompactionFilter in Lua.  With this ability, users can
dynamically change compaction filter logic without requiring building
the rocksdb binary and restarting the database.

To compile, WITH_LUA_PATH must be specified to the base directory
of lua.
Closes https://github.com/facebook/rocksdb/pull/1478

Differential Revision: D4150138

Pulled By: yhchiang

fbshipit-source-id: ed84222
2016-11-16 15:39:12 -08:00
Siying Dong 972e3ff295 Enable allow_concurrent_memtable_write and enable_write_thread_adaptive_yield by default
Summary: Closes https://github.com/facebook/rocksdb/pull/1496

Differential Revision: D4168080

Pulled By: siying

fbshipit-source-id: 056ae62
2016-11-16 09:39:09 -08:00
Andrew Kryczka 489d142808 DeleteRange interface
Summary:
Expose DeleteRange() interface since we think the implementation is functionally correct now.
Closes https://github.com/facebook/rocksdb/pull/1503

Differential Revision: D4171921

Pulled By: ajkr

fbshipit-source-id: 5e21c98
2016-11-15 15:24:16 -08:00
Yi Wu 1ea79a78c9 Optimize sequential insert into memtable - Part 1: Interface
Summary:
Currently our skip-list have an optimization to speedup sequential
inserts from a single stream, by remembering the last insert position.
We extend the idea to support sequential inserts from multiple streams,
and even tolerate small reordering wihtin each stream.

This PR is the interface part adding the following:
- Add `memtable_insert_prefix_extractor` to allow specifying prefix for each key.
- Add `InsertWithHint()` interface to memtable, to allow underlying
  implementation to return a hint of insert position, which can be later
  pass back to optimize inserts.
- Memtable will maintain a map from prefix to hints and pass the hint
  via `InsertWithHint()` if `memtable_insert_prefix_extractor` is non-null.
Closes https://github.com/facebook/rocksdb/pull/1419

Differential Revision: D4079367

Pulled By: yiwu-arbug

fbshipit-source-id: 3555326
2016-11-13 19:09:18 -08:00
Maysam Yabandeh 361010d447 Exporting compaction stats in the form of a map
Summary:
Currently the compaction stats are printed to stdout. We want to export the compaction stats in a map format so that the upper layer apps (e.g., MySQL) could present
the stats in any format required by the them.
Closes https://github.com/facebook/rocksdb/pull/1477

Differential Revision: D4149836

Pulled By: maysamyabandeh

fbshipit-source-id: b3df19f
2016-11-11 20:54:14 -08:00
Jay Lee a7875272d7 c: support seek_for_prev
Summary:
support seek_for_prev in c abi.
Closes https://github.com/facebook/rocksdb/pull/1457

Differential Revision: D4135360

Pulled By: lightmark

fbshipit-source-id: 61256b0
2016-11-08 12:54:13 -08:00
Andrew Kryczka f998c9790f DeleteRange Get support
Summary:
During Get()/MultiGet(), build up a RangeDelAggregator with range
tombstones as we search through live memtable, immutable memtables, and
SST files. This aggregator is then used by memtable.cc's SaveValue() and
GetContext::SaveValue() to check whether keys are covered.

added tests for Get on memtables/files; end-to-end tests mainly in https://reviews.facebook.net/D64761
Closes https://github.com/facebook/rocksdb/pull/1456

Differential Revision: D4111271

Pulled By: ajkr

fbshipit-source-id: 6e388d4
2016-11-03 18:54:20 -07:00
zhangjinpeng1987 879f366366 Add C api for RateLimiter
Summary:
Add C api for RateLimiter.
Closes https://github.com/facebook/rocksdb/pull/1455

Differential Revision: D4116362

Pulled By: yiwu-arbug

fbshipit-source-id: cb05a8d
2016-11-03 11:09:17 -07:00
Yi Wu 437942e481 Add avoid_flush_during_shutdown DB option
Summary:
Add avoid_flush_during_shutdown DB option.
Closes https://github.com/facebook/rocksdb/pull/1451

Differential Revision: D4108643

Pulled By: yiwu-arbug

fbshipit-source-id: abdaf4d
2016-11-02 15:39:18 -07:00
Benoit Girard 2b16d664cb Change max_bytes_for_level_multiplier to double
Summary: Closes https://github.com/facebook/rocksdb/pull/1427

Differential Revision: D4094732

Pulled By: yiwu-arbug

fbshipit-source-id: b9b79e9
2016-11-01 21:09:23 -07:00
Jay Lee 16fb04434f expose IngestExternalFile to c abi
Summary:
IngestExternalFile is very useful when doing bulk load. This pr expose this API to c so many bindings can benefit from it too.
Closes https://github.com/facebook/rocksdb/pull/1454

Differential Revision: D4113420

Pulled By: yiwu-arbug

fbshipit-source-id: 307c6ae
2016-11-01 17:09:39 -07:00
Kien-hung Li eeb27e1bbd Add handy option to turn on direct I/O in db_bench (#1424) 2016-10-28 10:36:05 -07:00
Jan Doms c6168d13ab removed some declarations from c.h which resulted in undefined symbols (#1407) 2016-10-28 10:33:49 -07:00
Reid Horuff 4dfaa6610a Make IsDeadlockDetect() virtual member of Transaction
Summary: Make `IsDeadlockDetect()` virtual member of base class `Transaction` for ease of use in MyRocks

Test Plan: compiles. compiles into MyRocks call-site.

Reviewers: mung

Reviewed By: mung

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65385
2016-10-21 14:47:59 -07:00
Islam AbdelRahman 869ae5d786 Support IngestExternalFile (remove AddFile restrictions)
Summary:
Changes in the diff

API changes:
- Introduce IngestExternalFile to replace AddFile (I think this make the API more clear)
- Introduce IngestExternalFileOptions (This struct will encapsulate the options for ingesting the external file)
- Deprecate AddFile() API

Logic changes:
- If our file overlap with the memtable we will flush the memtable
- We will find the first level in the LSM tree that our file key range overlap with the keys in it
- We will find the lowest level in the LSM tree above the the level we found in step 2 that our file can fit in and ingest our file in it
- We will assign a global sequence number to our new file
- Remove AddFile restrictions by using global sequence numbers

Other changes:
- Refactor all AddFile logic to be encapsulated in ExternalSstFileIngestionJob

Test Plan:
unit tests (still need to add more)
addfile_stress (https://reviews.facebook.net/D65037)

Reviewers: yiwu, andrewkr, lightmark, yhchiang, sdong

Reviewed By: sdong

Subscribers: jkedgar, hcz, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65061
2016-10-20 17:05:32 -07:00
Manuel Ung 4edd39fda2 Implement deadlock detection
Summary: Implement deadlock detection. This is done by maintaining a TxnID -> TxnID map which represents the edges in the wait for graph (this is named `wait_txn_map_`).

Test Plan: transaction_test

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D64491
2016-10-19 19:45:57 -07:00
Islam AbdelRahman b88f8e87c5 Support SST files with Global sequence numbers [reland]
Summary:
reland https://reviews.facebook.net/D62523

- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file

Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks

Test Plan: unit tests

Reviewers: sdong, yhchiang

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65211
2016-10-18 16:59:37 -07:00
Yueh-Hsuan Chiang ab53998372 Bump RocksDB version to 4.13 (#1405)
Summary:
Bump RocksDB version to 4.13

Test Plan:
unit tests

Reviewers: sdong, IslamAbdelRahman, andrewkr, lightmark

Subscribers: leveldb
2016-10-18 15:39:10 -07:00
Aaron Gao d88dff4ef2 add seeforprev in history
Summary: update new feature in history and avoid breaking mongorocks

Test Plan: make check

Reviewers: sdong, yiwu, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64611
2016-10-17 15:34:13 -07:00
Yi Wu e29d3b67c2 Make max_background_compactions and base_background_compactions dynamic changeable
Summary:
Add DB::SetDBOptions to dynamic change max_background_compactions and base_background_compactions.
I'll add more dynamic changeable options soon.

Test Plan: unit test.

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64749
2016-10-14 12:25:39 -07:00
Dmitri Smirnov b9311aa65c Implement WinRandomRW file and improve code reuse (#1388) 2016-10-13 16:36:34 -07:00
Reid Horuff 02b3e3985c Make txn->GetState() const
Summary: makes Transaction::GetState() a const function.

Test Plan: compiles.

Reviewers: mung

Reviewed By: mung

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D64929
2016-10-11 15:48:50 -07:00
Yi Wu 991b585ee0 More block cache tickers
Summary: Adding several missing block cache tickers.

Test Plan:
  make all check

Reviewers: IslamAbdelRahman, yhchiang, lightmark

Reviewed By: lightmark

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64881
2016-10-11 11:59:05 -07:00
Yi Wu d6ae6dec69 Add Statistics::getAndResetTickerCount().
Summary: A convience method to atomically get and reset ticker count. I'm wanting to use it to have a thin wrapper to the statistics object to export ticker counts to ODS for LogDevice (since they don't even use fb303).

Test Plan:
test in LogDevice shadow cluster.
https://fburl.com/461868822

Reviewers: andrewkr, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64869
2016-10-11 10:54:11 -07:00
Islam AbdelRahman 2ad68b971a Support running consistency checks in release mode
Summary:
We always run consistency checks when compiling in debug mode
allow users to set Options::force_consistency_checks to true to be able to run such checks even when compiling in release mode

Test Plan:
make check -j64
make release

Reviewers: lightmark, sdong, yiwu

Reviewed By: yiwu

Subscribers: hermanlee4, andrewkr, yoshinorim, jkedgar, dhruba

Differential Revision: https://reviews.facebook.net/D64701
2016-10-07 17:21:45 -07:00
Islam AbdelRahman d062328977 Revert "Support SST files with Global sequence numbers"
This reverts commit ab01da5437.
2016-10-07 14:05:12 -07:00
Reid Horuff 37737c3a6b Expose Transaction State Publicly
Summary:
This exposes a transactions state through a public api rather than through a public member variable. I also do some name refactoring.
ExecutionStatus => TransactionState
exec_status_ => trx_state_

Test Plan: It compiles and transaction_test passes.

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, mung, dhruba, sdong

Differential Revision: https://reviews.facebook.net/D64689
2016-10-07 11:58:53 -07:00
Reid Horuff 2c1f95291d Add facility to write only a portion of WriteBatch to WAL
Summary:
When constructing a write batch a client may now call MarkWalTerminationPoint() on that batch. No batch operations after this call will be added written to the WAL but will still be inserted into the Memtable. This facility is used to remove one of the three WriteImpl calls in 2PC transactions. This produces a ~1% perf improvement.

```
RocksDB - unoptimized 2pc, sync_binlog=1, disable_2pc=off
INFO 2016-08-31 14:30:38,814 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2619 seconds. Requests/second = 28628

RocksDB - optimized 2pc , sync_binlog=1, disable_2pc=off
INFO 2016-08-31 16:26:59,442 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2581 seconds. Requests/second = 29054
```

Test Plan: Two unit tests added.

Reviewers: sdong, yiwu, IslamAbdelRahman

Reviewed By: yiwu

Subscribers: hermanlee4, dhruba, andrewkr

Differential Revision: https://reviews.facebook.net/D64599
2016-10-07 11:32:10 -07:00
Sage Weil 4985f60fc8 env_mirror: fix a few leaks (#1363)
* env_mirror: fix leak from LockFile

Signed-off-by: Sage Weil <sage@redhat.com>

* env_mirror: instruct EnvMirror whether mirrored Envs should be destroyed

The lifecycle rules for Env are frustrating and undocumented.  Notably,
Env::Default() should *not* be freed, but any Env instances we created
should be.

Explicitly instruct EnvMirror whether to clean up child Env instances.
Default to false so that we do not affect existing callers.

Signed-off-by: Sage Weil <sage@redhat.com>
2016-10-06 10:43:05 -07:00
Igor Mihalik 5aded67ddb update of c.h (#1371)
Added rocksdb_options_set_memtable_prefix_bloom_size_ratio function implemented in c.cc but not exported via c.h
2016-10-06 10:37:19 -07:00
Islam AbdelRahman ab01da5437 Support SST files with Global sequence numbers
Summary:
- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file

Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks

Test Plan: unit tests

Reviewers: andrewkr, yhchiang, yiwu, sdong

Reviewed By: sdong

Subscribers: hcz, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62523
2016-10-03 16:12:39 -07:00
krad e91b4d0cf6 Add factory method for creating persistent cache that is accessible from public
Summary:
Currently there is no mechanism to create persistent cache from
headers. Adding a simple factory method to create a simple persistent cache with
default or NVM optimized settings.

note: Any idea to test this factory is appreciated.

Test Plan: None

Reviewers: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64527
2016-10-03 10:55:46 -07:00
Manuel Ung be1f1092c9 Expose transaction id, lock state information and transaction wait information
Summary:
This diff does 3 things:

Expose TransactionID so that we can identify transactions when we retrieve locking and lock wait information. This is exposed as `Transaction::GetID`.

Expose lock state information by locking all stripes in all column families and copying their contents to a data structure. This is exposed as `TransactionDB::GetLockStatusData`.

Adds support for tracking the transaction and the key being waited on, and exposes this as `Transaction::GetWaitingTxn`.

Test Plan: unit tests

Reviewers: horuff, sdong

Reviewed By: sdong

Subscribers: vasilep, hermanlee4, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D64413
2016-09-30 11:41:21 -07:00
Aaron Gao f517d9dd09 Add SeekForPrev() to Iterator
Summary:
Add new Iterator API, `SeekForPrev`: find the last key that <= target key
support prefix_extractor
support prefix_same_as_start
support upper_bound
not supported in iterators without Prev()

Also add tests in db_iter_test and db_iterator_test

Pass all tests
Cheers!

Test Plan: make all check -j64

Reviewers: andrewkr, yiwu, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64149
2016-09-27 18:20:57 -07:00
Yi Wu 9ed928e7a9 Split DBOptions into ImmutableDBOptions and MutableDBOptions
Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs.

Test Plan:
  make all check

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64065
2016-09-23 16:34:04 -07:00
Aaron Gao 0a1bd9c509 add cfh deletion started listener
Summary: add ColumnFamilyHandleDeletionStarted listener which can be called when user deletes handler.

Test Plan: ./listener_test

Reviewers: yiwu, IslamAbdelRahman, sdong, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60717
2016-09-22 11:56:18 -07:00
Giuseppe Ottaviano d45eb6c6d2 Fix typo (#1349) 2016-09-21 20:06:56 -07:00
Yi Wu 17f76fc564 DB::GetOptions() reflect dynamic changed options
Summary: DB::GetOptions() reflect dynamic changed options.

Test Plan: See the new unit test.

Reviewers: yhchiang, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63903
2016-09-14 22:10:28 -07:00
Islam AbdelRahman ba65c816bb Support POSIX RandomRWFile
Summary:
Add Env::RandomRWFile in env.h and implement it for POSIX
RandomRWFile is a file that allow us to read from / write to random offsets in the file

I will implement it for other Envs later after finishing the whole task for AddFile()

Test Plan: unit tests

Reviewers: andrewkr, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62433
2016-09-13 12:08:22 -07:00
zhangjinpeng1987 b06b191362 add C api for set wal_recovery_mode (#1327)
* add C api for set wal recovery mode

* add test
2016-09-09 10:11:30 -07:00
ammongit 0fcb6dbed7 Remove extraneous function prototypes from c.h (#1326)
* Fix function prototypes from upstream commit 32149059.

* Fix removed function.

* Readd removed function.
2016-09-08 11:31:06 -07:00