Commit Graph

1160 Commits

Author SHA1 Message Date
Islam AbdelRahman fce7a6e196 Fail IngestExternalFile when bg_error_ exists
Summary:
Fail IngestExternalFile() when bg_error_ exists
Closes https://github.com/facebook/rocksdb/pull/1881

Differential Revision: D4580621

Pulled By: IslamAbdelRahman

fbshipit-source-id: 1194913
2017-02-17 13:39:17 -08:00
Yi Wu c2247dc1c7 Make DBImpl::has_unpersisted_data_ atomic
Summary:
Seems to me `has_unpersisted_data_` is read from read thread and write
from write thread concurrently without synchronization. Making it an
atomic.

I update the logic not because seeing any problem with it, but it just
feel confusing.
Closes https://github.com/facebook/rocksdb/pull/1869

Differential Revision: D4555837

Pulled By: yiwu-arbug

fbshipit-source-id: eff2ab8
2017-02-13 18:54:13 -08:00
Sagar Vemuri eb912a927e Remove disableDataSync option
Summary:
Remove disableDataSync, and another similarly named disable_data_sync options.
This is being done to simplify options, and also because the performance gains of this feature can be achieved by other methods.
Closes https://github.com/facebook/rocksdb/pull/1859

Differential Revision: D4541292

Pulled By: sagar0

fbshipit-source-id: 5b3a6ca
2017-02-13 11:09:13 -08:00
Vitaliy Liptchinsky 1aaa898cf1 Adding GetApproximateMemTableStats method
Summary:
Added method that returns approx num of entries as well as size for memtables.
Closes https://github.com/facebook/rocksdb/pull/1841

Differential Revision: D4511990

Pulled By: VitaliyLi

fbshipit-source-id: 9a4576e
2017-02-06 14:54:16 -08:00
Siying Dong 036d668b19 Fix wrong result in data race case related to Get()
Summary:
In theory, Get() can get a wrong result, if it races in a special with with flush. The bug can be reproduced in DBTest2.GetRaceFlush. Fix this bug by getting snapshot after referencing the super version.
Closes https://github.com/facebook/rocksdb/pull/1816

Differential Revision: D4475958

Pulled By: siying

fbshipit-source-id: bd9e67a
2017-02-03 11:39:15 -08:00
Islam AbdelRahman 574b543f80 Rename merger.h -> merging_iterator.h
Summary:
merger.h was always a confusing name for me, simply give the file a better name
Closes https://github.com/facebook/rocksdb/pull/1836

Differential Revision: D4505357

Pulled By: IslamAbdelRahman

fbshipit-source-id: 07b28d8
2017-02-02 16:54:19 -08:00
Siying Dong f25f1ec60b Add test DBTest2.GetRaceFlush which can expose a data race bug
Summary:
A current data race issue in Get() and Flush() can cause a Get() to return wrong results when a flush happened in the middle. Disable the test for now.
Closes https://github.com/facebook/rocksdb/pull/1813

Differential Revision: D4472310

Pulled By: siying

fbshipit-source-id: 5755ebd
2017-01-26 16:39:14 -08:00
sdong 5dad9d6d28 Avoid logs_ operation out of DB mutex
Summary:
logs_.back() is called out of DB mutex, which can cause data race. We move the access into the DB mutex protection area.
Closes https://github.com/facebook/rocksdb/pull/1774

Reviewed By: AsyncDBConnMarkedDownDBException

Differential Revision: D4417472

Pulled By: AsyncDBConnMarkedDownDBException

fbshipit-source-id: 2da1f1e
2017-01-25 15:54:13 -08:00
Islam AbdelRahman a7b13919bf Fix CompactFiles() bug when used with CompactionFilter using SuperVersion
Summary:
GetAndRefSuperVersion() should not be called again in the same thread before ReturnAndCleanupSuperVersion() is called.

If we have a compaction filter that is using DB::Get, This will happen
```
CompactFiles() {
  GetAndRefSuperVersion() // -- first call
    ..
    CompactionFilter() {
      GetAndRefSuperVersion() // -- second call
      ReturnAndCleanupSuperVersion()
    }
    ..
  ReturnAndCleanupSuperVersion()
}
```

We solve this issue in the same way Iterator is solving it, but using GetReferencedSuperVersion()

This was discovered in https://github.com/facebook/mysql-5.6/issues/427 by alxyang
Closes https://github.com/facebook/rocksdb/pull/1803

Differential Revision: D4460155

Pulled By: IslamAbdelRahman

fbshipit-source-id: 5e54322
2017-01-25 14:09:13 -08:00
Islam AbdelRahman 03ca2ac8a9 Remove function from DBImpl that are not used anywhere
Summary:
GetAndRefSuperVersionUnlocked
ReturnAndCleanupSuperVersionUnlocked
GetColumnFamilyHandleUnlocked

Are dead code that are not used any where
Closes https://github.com/facebook/rocksdb/pull/1802

Differential Revision: D4459948

Pulled By: IslamAbdelRahman

fbshipit-source-id: 30fa89d
2017-01-24 19:24:13 -08:00
Vitaliy Liptchinsky 753ff84a3d Fix get approx size
Summary:
Fixing GetApproximateSize bug for the case of computing stats for mem tables only.
Closes https://github.com/facebook/rocksdb/pull/1795

Differential Revision: D4445507

Pulled By: IslamAbdelRahman

fbshipit-source-id: 3905846
2017-01-20 15:54:12 -08:00
Changli Gao 5ac97314e7 Fix std::out_of_range when DBOptions::keep_log_file_num is zero
Summary:
We should validate this option, otherwise we may see
std::out_of_range thrown at: db/db_impl.cc:1124

1123     for (unsigned int i = 0; i <= end; i++) {
1124       std::string& to_delete = old_info_log_files.at(i);
1125       std::string full_path_to_delete =
1126           (immutable_db_options_.db_log_dir.empty()
Closes https://github.com/facebook/rocksdb/pull/1722

Differential Revision: D4379495

Pulled By: yiwu-arbug

fbshipit-source-id: e136552
2017-01-20 13:24:12 -08:00
Vitaliy Liptchinsky e840213d6e Change DB::GetApproximateSizes for more flexibility needed for MyRocks
Summary:
Added an option to GetApproximateSizes to exclude file stats, as MyRocks has those counted exactly and we need only stats from memtables.
Closes https://github.com/facebook/rocksdb/pull/1787

Differential Revision: D4441111

Pulled By: IslamAbdelRahman

fbshipit-source-id: c11f4c3
2017-01-20 09:39:11 -08:00
Yi Wu 9239103cd4 Flush job should release reference current version if sync log failed
Summary:
Fix the bug when sync log fail, FlushJob::Run() will not be execute and
reference to cfd->current() will not be release.
Closes https://github.com/facebook/rocksdb/pull/1792

Differential Revision: D4441316

Pulled By: yiwu-arbug

fbshipit-source-id: 5523e28
2017-01-19 23:09:15 -08:00
Reid Horuff 5cf176ca15 Fix for 2PC causing WAL to grow too large
Summary:
Consider the following single column family scenario:
prepare in log A
commit in log B
*WAL is too large, flush all CFs to releast log A*
*CFA is on log B so we do not see CFA is depending on log A so no flush is requested*

To fix this we must also consider the log containing the prepare section when determining what log a CF is dependent on.
Closes https://github.com/facebook/rocksdb/pull/1768

Differential Revision: D4403265

Pulled By: reidHoruff

fbshipit-source-id: ce800ff
2017-01-19 15:39:12 -08:00
Mike Kolupaev d18dd2c41f Abort compactions more reliably when closing DB
Summary:
DB shutdown aborts running compactions by setting an atomic shutting_down=true that CompactionJob periodically checks. Without this PR it checks it before processing every _output_ value. If compaction filter filters everything out, the compaction is uninterruptible. This PR adds checks for shutting_down on every _input_ value (in CompactionIterator and MergeHelper).

There's also some minor code cleanup along the way.
Closes https://github.com/facebook/rocksdb/pull/1639

Differential Revision: D4306571

Pulled By: yiwu-arbug

fbshipit-source-id: f050890
2017-01-11 15:09:21 -08:00
Maysam Yabandeh d0ba8ec8f9 Revert "PinnableSlice"
Summary:
This reverts commit 54d94e9c2c.

The pull request was landed by mistake.
Closes https://github.com/facebook/rocksdb/pull/1755

Differential Revision: D4391678

Pulled By: maysamyabandeh

fbshipit-source-id: 36d5149
2017-01-08 14:24:12 -08:00
Maysam Yabandeh 54d94e9c2c PinnableSlice
Summary:
Currently the point lookup values are copied to a string provided by the user.
This incures an extra memcpy cost. This patch allows doing point lookup
via a PinnableSlice which pins the source memory location (instead of
copying their content) and releases them after the content is consumed
by the user. The old API of Get(string) is translated to the new API
underneath.

 Here is the summary for improvements:
 1. value 100 byte: 1.8%  regular, 1.2% merge values
 2. value 1k   byte: 11.5% regular, 7.5% merge values
 3. value 10k byte: 26% regular,    29.9% merge values

 The improvement for merge could be more if we extend this approach to
 pin the merge output and delay the full merge operation until the user
 actually needs it. We have put that for future work.

PS:
Sometimes we observe a small decrease in performance when switching from
t5452014 to this patch but with the old Get(string) API. The difference
is a little and could be noise. More importantly it is safely
cancelled
Closes https://github.com/facebook/rocksdb/pull/1732

Differential Revision: D4374613

Pulled By: maysamyabandeh

fbshipit-source-id: a077f1a
2017-01-08 13:54:13 -08:00
Siying Dong 438f22bc56 Fix bug of Checkpoint loses recent transactions with 2PC
Summary:
If 2PC is enabled, checkpoint may not copy previous log files that contain uncommitted prepare records. In this diff we keep those files.
Closes https://github.com/facebook/rocksdb/pull/1724

Differential Revision: D4368319

Pulled By: siying

fbshipit-source-id: cc2c746
2016-12-28 12:24:16 -08:00
Aaron Gao 972f96b3fb direct io write support
Summary:
rocksdb direct io support

```
[gzh@dev11575.prn2 ~/rocksdb] ./db_bench -benchmarks=fillseq --num=1000000
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 5.0
Date:       Wed Nov 23 13:17:43 2016
CPU:        40 * Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
CPUCache:   25600 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Compression: Snappy
Memtablerep: skip_list
Perf Level: 1
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
DB path: [/tmp/rocksdbtest-112628/dbbench]
fillseq      :       4.393 micros/op 227639 ops/sec;   25.2 MB/s

[gzh@dev11575.prn2 ~/roc
Closes https://github.com/facebook/rocksdb/pull/1564

Differential Revision: D4241093

Pulled By: lightmark

fbshipit-source-id: 98c29e3
2016-12-22 13:09:19 -08:00
Islam AbdelRahman 989e644ed8 Remove sst_file_manager option from LITE
Summary:
Remove sst_file_manager option from LITE
Closes https://github.com/facebook/rocksdb/pull/1690

Differential Revision: D4341331

Pulled By: IslamAbdelRahman

fbshipit-source-id: 9f9328d
2016-12-21 17:54:21 -08:00
Daniel Black bfbcec2339 Gcc 7 error expansion to defined
Summary:
sorry if these gcc-7/clang-4 cleanups are getting tedious.
Closes https://github.com/facebook/rocksdb/pull/1658

Differential Revision: D4318792

Pulled By: yiwu-arbug

fbshipit-source-id: 8e85891
2016-12-13 18:39:14 -08:00
Islam AbdelRahman 1a146f89c7 break Flush wait for dropped CF
Summary:
In FlushJob we dont do the Flush if the CF is dropped
https://github.com/facebook/rocksdb/blob/master/db/flush_job.cc#L184-L188

but inside WaitForFlushMemTable we keep waiting forever even if the CF is dropped.
Closes https://github.com/facebook/rocksdb/pull/1664

Differential Revision: D4321032

Pulled By: IslamAbdelRahman

fbshipit-source-id: 6e2b25d
2016-12-13 14:09:12 -08:00
Islam AbdelRahman 2ba59b5a1e Disallow ingesting files into dropped CFs
Summary:
This PR update IngestExternalFile to return an error if we try to ingest a file into a dropped CF.

Right now if IngestExternalFile want to flush a memtable, and it's ingesting a file into a dropped CF, it will wait forever since flushing is not possible for the dropped CF
Closes https://github.com/facebook/rocksdb/pull/1657

Differential Revision: D4318657

Pulled By: IslamAbdelRahman

fbshipit-source-id: ed6ea2b
2016-12-13 00:54:14 -08:00
Islam AbdelRahman ed8fbdb560 Add EventListener::OnExternalFileIngested() event
Summary:
Add EventListener::OnExternalFileIngested() to allow user to subscribe to external file ingestion events
Closes https://github.com/facebook/rocksdb/pull/1623

Differential Revision: D4285844

Pulled By: IslamAbdelRahman

fbshipit-source-id: 0b95a88
2016-12-06 14:09:17 -08:00
Anton Safonov 9053fe2a5c Made delete_obsolete_files_period_micros option dynamic
Summary:
Made delete_obsolete_files_period_micros option dynamic. It can be updating using DB::SetDBOptions().
Closes https://github.com/facebook/rocksdb/pull/1595

Differential Revision: D4246569

Pulled By: tonek

fbshipit-source-id: d23f560
2016-12-05 14:24:16 -08:00
Maysam Yabandeh 182b940e70 Add WriteOptions.no_slowdown
Summary:
If the WriteOptions.no_slowdown flag is set AND we need to wait or sleep for
the write request, then fail immediately with Status::Incomplete().
Closes https://github.com/facebook/rocksdb/pull/1527

Differential Revision: D4191405

Pulled By: maysamyabandeh

fbshipit-source-id: 7f3ce3f
2016-11-21 18:09:13 -08:00
Andrew Kryczka fe349db57b Remove Arena in RangeDelAggregator
Summary:
The Arena construction/destruction introduced significant overhead to read-heavy workload just by creating empty vectors for its blocks, so avoid it in RangeDelAggregator.
Closes https://github.com/facebook/rocksdb/pull/1547

Differential Revision: D4207781

Pulled By: ajkr

fbshipit-source-id: 9d1c130
2016-11-19 14:24:12 -08:00
Andrew Kryczka 3f62215210 Lazily initialize RangeDelAggregator's map and pinning manager
Summary:
Since a RangeDelAggregator is created for each read request, these heap-allocating member variables were consuming significant CPU (~3% total) which slowed down request throughput. The map and pinning manager are only necessary when range deletions exist, so we can defer their initialization until the first range deletion is encountered. Currently lazy initialization is done for reads only since reads pass us a single snapshot, which is easier to store on the stack for later insertion into the map than the vector passed to us by flush or compaction.

Note the Arena member variable is still expensive, I will figure out what to do with it in a subsequent diff. It cannot be lazily initialized because we currently use this arena even to allocate empty iterators, which is necessary even when no range deletions exist.
Closes https://github.com/facebook/rocksdb/pull/1539

Differential Revision: D4203488

Pulled By: ajkr

fbshipit-source-id: 3b36279
2016-11-18 17:09:11 -08:00
Yi Wu 36e4762ce0 Remove Ticker::SEQUENCE_NUMBER
Summary:
Remove the ticker count because:
* Having to reset the ticker count in WriteImpl is ineffiecent;
* It doesn't make sense to have it as a ticker count if multiple db
  instance share a statistics object.
Closes https://github.com/facebook/rocksdb/pull/1531

Differential Revision: D4194442

Pulled By: yiwu-arbug

fbshipit-source-id: e2110a9
2016-11-16 22:39:09 -08:00
Andrew Kryczka 489d142808 DeleteRange interface
Summary:
Expose DeleteRange() interface since we think the implementation is functionally correct now.
Closes https://github.com/facebook/rocksdb/pull/1503

Differential Revision: D4171921

Pulled By: ajkr

fbshipit-source-id: 5e21c98
2016-11-15 15:24:16 -08:00
Artemiy Kolesnikov 91300d01f6 Dynamic max_total_wal_size option
Summary: Closes https://github.com/facebook/rocksdb/pull/1509

Differential Revision: D4176426

Pulled By: yiwu-arbug

fbshipit-source-id: b57689d
2016-11-14 22:54:17 -08:00
Lijun Tang adb665e0bf Allowed delayed_write_rate option to be dynamically set.
Summary: Closes https://github.com/facebook/rocksdb/pull/1488

Differential Revision: D4157784

Pulled By: siying

fbshipit-source-id: f150081
2016-11-12 15:54:11 -08:00
Maysam Yabandeh 361010d447 Exporting compaction stats in the form of a map
Summary:
Currently the compaction stats are printed to stdout. We want to export the compaction stats in a map format so that the upper layer apps (e.g., MySQL) could present
the stats in any format required by the them.
Closes https://github.com/facebook/rocksdb/pull/1477

Differential Revision: D4149836

Pulled By: maysamyabandeh

fbshipit-source-id: b3df19f
2016-11-11 20:54:14 -08:00
Reid Horuff 1ca5f6d132 Fix 2PC Recovery SeqId Miscount
Summary:
Originally sequence ids were calculated, in recovery, based off of the first seqid found if the first log recovered. The working seqid was then incremented from that value based on every insertion that took place. This was faulty because of the potential for missing log files or inserts that skipped the WAL. The current recovery scheme grabs sequence from current recovering batch and increments using memtableinserter to track how many actual inserts take place. This works for 2PC batches as well scenarios where some logs are missing or inserts that skip the WAL.
Closes https://github.com/facebook/rocksdb/pull/1486

Differential Revision: D4156064

Pulled By: reidHoruff

fbshipit-source-id: a6da8d9
2016-11-10 11:09:22 -08:00
Andrew Kryczka c90fef88b1 fix open failure with empty wal
Summary: Closes https://github.com/facebook/rocksdb/pull/1490

Differential Revision: D4158821

Pulled By: IslamAbdelRahman

fbshipit-source-id: 59b73f4
2016-11-09 22:24:26 -08:00
Reid Horuff d133b08f68 Use correct sequence number when creating memtable
Summary:
copied from: 5ebfd2623a

Opening existing RocksDB attempts recovery from log files, which uses
wrong sequence number to create the memtable. This is a regression
introduced in change a400336.

This change includes a test demonstrating the problem, without the fix
the test fails with "Operation failed. Try again.: Transaction could not
check for conflicts for operation at SequenceNumber 1 as the MemTable
only contains changes newer than SequenceNumber 2.  Increasing the value
of the max_write_buffer_number_to_maintain option could reduce the
frequency of this error"

This change is a joint effort by Peter 'Stig' Edwards thatsafunnyname
and me.
Closes https://github.com/facebook/rocksdb/pull/1458

Differential Revision: D4143791

Pulled By: reidHoruff

fbshipit-source-id: 5a25033
2016-11-09 12:24:17 -08:00
Islam AbdelRahman 9bd191d2f4 Fix deadlock between (WriterThread/Compaction/IngestExternalFile)
Summary:
A deadlock is possible if this happen

(1) Writer thread is stopped because it's waiting for compaction to finish
(2) Compaction is waiting for current IngestExternalFile() calls to finish
(3) IngestExternalFile() is waiting to be able to acquire the writer thread
(4) WriterThread is held by stopped writes that are waiting for compactions to finish

This patch fix the issue by not incrementing num_running_ingest_file_ except when we acquire the writer thread.

This patch include a unittest to reproduce the described scenario
Closes https://github.com/facebook/rocksdb/pull/1480

Differential Revision: D4151646

Pulled By: IslamAbdelRahman

fbshipit-source-id: 09b39db
2016-11-09 10:54:10 -08:00
Andrew Kryczka 9e7cf3469b DeleteRange user iterator support
Summary:
Note: reviewed in  https://reviews.facebook.net/D65115

- DBIter maintains a range tombstone accumulator. We don't cleanup obsolete tombstones yet, so if the user seeks back and forth, the same tombstones would be added to the accumulator multiple times.
- DBImpl::NewInternalIterator() (used to make DBIter's underlying iterator) adds memtable/L0 range tombstones, L1+ range tombstones are added on-demand during NewSecondaryIterator() (see D62205)
- DBIter uses ShouldDelete() when advancing to check whether keys are covered by range tombstones
Closes https://github.com/facebook/rocksdb/pull/1464

Differential Revision: D4131753

Pulled By: ajkr

fbshipit-source-id: be86559
2016-11-04 12:09:22 -07:00
Andrew Kryczka f998c9790f DeleteRange Get support
Summary:
During Get()/MultiGet(), build up a RangeDelAggregator with range
tombstones as we search through live memtable, immutable memtables, and
SST files. This aggregator is then used by memtable.cc's SaveValue() and
GetContext::SaveValue() to check whether keys are covered.

added tests for Get on memtables/files; end-to-end tests mainly in https://reviews.facebook.net/D64761
Closes https://github.com/facebook/rocksdb/pull/1456

Differential Revision: D4111271

Pulled By: ajkr

fbshipit-source-id: 6e388d4
2016-11-03 18:54:20 -07:00
Yi Wu 437942e481 Add avoid_flush_during_shutdown DB option
Summary:
Add avoid_flush_during_shutdown DB option.
Closes https://github.com/facebook/rocksdb/pull/1451

Differential Revision: D4108643

Pulled By: yiwu-arbug

fbshipit-source-id: abdaf4d
2016-11-02 15:39:18 -07:00
Andrew Kryczka 40a2e406f8 DeleteRange flush support
Summary:
Changed BuildTable() (used for flush) to (1) add range
tombstones to the aggregator, which is used by CompactionIterator to
determine which keys can be removed; and (2) add aggregator's range
tombstones to the table that is output for the flush.
Closes https://github.com/facebook/rocksdb/pull/1438

Differential Revision: D4100025

Pulled By: ajkr

fbshipit-source-id: cb01a70
2016-10-31 20:54:18 -07:00
Siying Dong da61f348d3 Print compression and Fast CRC support info as Header level
Summary:
Currently the compression suppport and fast CRC support information is printed as info level. They should be in the same level as options, which is header level.

Also add ZSTD to this printing.
Closes https://github.com/facebook/rocksdb/pull/1448

Differential Revision: D4106608

Pulled By: yiwu-arbug

fbshipit-source-id: cb9a076
2016-10-31 16:09:13 -07:00
Siying Dong c90c48d3c8 Show More DB Stats in info logs
Summary:
DB Stats now are truncated if there are too many CFs. Extend the buffer size to allow more to be printed out. Also, separate out malloc to another log line.
Closes https://github.com/facebook/rocksdb/pull/1439

Differential Revision: D4100943

Pulled By: yiwu-arbug

fbshipit-source-id: 79f7218
2016-10-29 16:09:18 -07:00
Siying Dong 04751d5345 L0 compression should follow options.compression_per_level if not empty
Summary:
Currently, we don't use options.compression_per_level[0] as the compression style for L0 compression type, unless it is None. This behavior
 doesn't look like on purpose. This diff will make sure L0 compress using the style of options.compression_per_level[0].

Reviewed and accepted in: https://reviews.facebook.net/D65607
Closes https://github.com/facebook/rocksdb/pull/1435

Differential Revision: D4099368

Pulled By: siying

fbshipit-source-id: cfbbdcd
2016-10-28 17:39:20 -07:00
Aaron Gao 9de2f75216 revert Prev() in MergingIterator to use previous code in non-prefix-seek mode
Summary: Siying suggested to keep old code for normal mode prev() for safety

Test Plan: make check -j64

Reviewers: yiwu, andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D65439
2016-10-24 13:13:01 -07:00
Aaron Gao 59a7c0337b Change ioptions to store user_comparator, fix bug
Summary:
change ioptions.comparator to user_comparator instread of internal_comparator.
Also change Comparator* to InternalKeyComparator* to make its type explicitly.

Test Plan: make all check -j64

Reviewers: andrewkr, sdong, yiwu

Reviewed By: yiwu

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D65121
2016-10-21 11:31:42 -07:00
Islam AbdelRahman 869ae5d786 Support IngestExternalFile (remove AddFile restrictions)
Summary:
Changes in the diff

API changes:
- Introduce IngestExternalFile to replace AddFile (I think this make the API more clear)
- Introduce IngestExternalFileOptions (This struct will encapsulate the options for ingesting the external file)
- Deprecate AddFile() API

Logic changes:
- If our file overlap with the memtable we will flush the memtable
- We will find the first level in the LSM tree that our file key range overlap with the keys in it
- We will find the lowest level in the LSM tree above the the level we found in step 2 that our file can fit in and ingest our file in it
- We will assign a global sequence number to our new file
- Remove AddFile restrictions by using global sequence numbers

Other changes:
- Refactor all AddFile logic to be encapsulated in ExternalSstFileIngestionJob

Test Plan:
unit tests (still need to add more)
addfile_stress (https://reviews.facebook.net/D65037)

Reviewers: yiwu, andrewkr, lightmark, yhchiang, sdong

Reviewed By: sdong

Subscribers: jkedgar, hcz, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65061
2016-10-20 17:05:32 -07:00
Andrew Kryczka f47054015d Handle WAL deletion when using avoid_flush_during_recovery
Summary:
Previously the WAL files that were avoided during recovery would never
be considered for deletion. That was because alive_log_files_ was only
populated when log files are created. This diff further populates
alive_log_files_ with existing log files that aren't flushed during recovery,
such that FindObsoleteFiles() can find them later.

Depends on D64053.

Test Plan: new unit test, verifies it fails before this change and passes after

Reviewers: sdong, IslamAbdelRahman, yiwu

Reviewed By: yiwu

Subscribers: leveldb, dhruba, andrewkr

Differential Revision: https://reviews.facebook.net/D64059
2016-10-14 12:59:51 -07:00
Yi Wu e29d3b67c2 Make max_background_compactions and base_background_compactions dynamic changeable
Summary:
Add DB::SetDBOptions to dynamic change max_background_compactions and base_background_compactions.
I'll add more dynamic changeable options soon.

Test Plan: unit test.

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64749
2016-10-14 12:25:39 -07:00
Islam AbdelRahman 5691a1d8a4 Fix compaction conflict with running compaction
Summary:
Issue scenario:
(1) We have 3 files in L1 and we issue a compaction that will compact them into 1 file in L2
(2) While compaction (1) is running, we flush a file into L0 and trigger another compaction that decide to move this file to L1 and then move it again to L2 (this file don't overlap with any other files)
(3) compaction (1) finishes and install the file it generated in L2, but this file overlap with the file we generated in (2) so we break the LSM consistency

Looks like this issue can be triggered by using non-exclusive manual compaction or AddFile()

Test Plan: unit tests

Reviewers: sdong

Reviewed By: sdong

Subscribers: hermanlee4, jkedgar, andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D64947
2016-10-13 10:49:06 -07:00
Andrew Kryczka 017de666c7 fixup commit
Summary: I accidentally left out these changes from my commit of D64053 due to
messing up the merge conflict resolution.

Test Plan: ./db_wal_test

Reviewers:

Subscribers:

Tasks:

Blame Revision: D64053
2016-10-13 08:48:40 -07:00
Andrew Kryczka 1b7af5fb1a Redo handling of recycled logs in full purge
Summary:
This reverts commit 9e4aa798c3,
which doesn't handle all cases (see inline comment).

I reimplemented the logic as suggested in the initial PR: https://github.com/facebook/rocksdb/pull/1313.

This approach has two benefits:

- All the parsing/filtering of full_scan_candidate_files is kept together in PurgeObsoleteFiles.
- We only need to check whether log file is recycled in one place where we've already determined it's a log file

Test Plan:
new unit test, verified fails before the original fix, still passes
now.

Reviewers: IslamAbdelRahman, yiwu, sdong

Reviewed By: yiwu, sdong

Subscribers: leveldb, dhruba, andrewkr

Differential Revision: https://reviews.facebook.net/D64053
2016-10-12 23:13:09 -07:00
Aaron Gao 447f17127c new Prev() prefix support using SeekForPrev()
Summary:
1) The previous solution for Prev() prefix support is not clean.
Since I add api SeekForPrev(), now the Prev() can be symmetric to Next().
and we do not need SeekToLast() to be called in Prev() any more.

Also, Next() will Seek(prefix_seek_key_) to solve the problem of possible inconsistency between db_iter and merge_iter when
there is merge_operator. And prefix_seek_key is only refreshed when change direction to forward.

2) This diff also solves the bug of Iterator::SeekToLast() with iterate_upper_bound_ with prefix extractor.

add test cases for the above two cases.

There are some tests for the SeekToLast() in Prev(), I will clean them later.

Test Plan: make all check

Reviewers: IslamAbdelRahman, andrewkr, yiwu, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63933
2016-10-11 13:54:26 -07:00
Reid Horuff 2c1f95291d Add facility to write only a portion of WriteBatch to WAL
Summary:
When constructing a write batch a client may now call MarkWalTerminationPoint() on that batch. No batch operations after this call will be added written to the WAL but will still be inserted into the Memtable. This facility is used to remove one of the three WriteImpl calls in 2PC transactions. This produces a ~1% perf improvement.

```
RocksDB - unoptimized 2pc, sync_binlog=1, disable_2pc=off
INFO 2016-08-31 14:30:38,814 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2619 seconds. Requests/second = 28628

RocksDB - optimized 2pc , sync_binlog=1, disable_2pc=off
INFO 2016-08-31 16:26:59,442 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2581 seconds. Requests/second = 29054
```

Test Plan: Two unit tests added.

Reviewers: sdong, yiwu, IslamAbdelRahman

Reviewed By: yiwu

Subscribers: hermanlee4, dhruba, andrewkr

Differential Revision: https://reviews.facebook.net/D64599
2016-10-07 11:32:10 -07:00
Islam AbdelRahman 87dfc1d23e Fix conflict between AddFile() and CompactRange()
Summary:
Fix the conflict bug between AddFile() and CompactRange() by
- Make sure that no AddFile calls are running when asking CompactionPicker to pick compaction for manual compaction
- If AddFile() run after we pick the compaction for the manual compaction it will be aware of it since we will add the manual compaction to running_compactions_ after picking it

This will solve these 2 scenarios
- If AddFile() is running, we will wait for it to finish before we pick a compaction for the manual compaction
- If we already picked a manual compaction and then AddFile() started ... we ensure that it never ingest a file in a level that will overlap with the manual compaction

Test Plan: unit tests

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, yoshinorim, jkedgar, dhruba

Differential Revision: https://reviews.facebook.net/D64449
2016-09-28 15:42:06 -07:00
yiwu-arbug eb3894cf42 Recompute compaction score on SetOptions (#1346)
Summary: We didn't recompute compaction score on SetOptions, and end up not having compaction if no flush happens afterward. The PR fixing it.

Test Plan: See unit test.

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D64167
2016-09-27 11:17:15 -07:00
Islam AbdelRahman 5c64fb67d2 Fix AddFile() conflict with compaction output [WaitForAddFile()]
Summary:
Since AddFile unlock/lock the mutex inside LogAndApply() we need to ensure that during this period other compactions cannot run since such compactions are not aware of the file we are ingesting and could create a compaction that overlap wit this file

this diff add
- WaitForAddFile() call that will ensure that no AddFile() calls are being processed right now
- Call `WaitForAddFile()` in 3 locations
-- When doing manual Compaction
-- When starting automatic Compaction
-- When  doing CompactFiles()

Test Plan: unit test

Reviewers: lightmark, yiwu, andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, yoshinorim, jkedgar, dhruba

Differential Revision: https://reviews.facebook.net/D64383
2016-09-27 00:14:55 -07:00
Yi Wu 9ed928e7a9 Split DBOptions into ImmutableDBOptions and MutableDBOptions
Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs.

Test Plan:
  make all check

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64065
2016-09-23 16:34:04 -07:00
yiwu-arbug 4bc8c88e6b Recover same sequence id from WAL (#1350)
Summary:
Revert the behavior where we don't read sequence id from WAL, but increase it as we replay the log. We still keep the behave for 2PC for now but will fix later.

This change fixes github issue 1339, where some writes come with WAL disabled and we may recover records with wrong sequence id.

Test Plan: Added unit test.

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D64275
2016-09-23 16:15:14 -07:00
Aaron Gao 715256338a forbid merge during recovery
Summary:
Mitigate regression bug of options.max_successive_merges hit during DB Recovery
For https://reviews.facebook.net/D62625

Test Plan: make all check

Reviewers: horuff, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62655
2016-09-21 11:05:07 -07:00
Yi Wu e4d3f5d9b8 Fix DBImpl::GetWalPreallocateBlockSize Mac build error
Summary: Specify type param with std::min to resolve compile error on Mac.

Test Plan: https://travis-ci.org/facebook/rocksdb/builds/161223845

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64143
2016-09-20 10:17:28 -07:00
sdong d78a4401b5 DBImpl::GetWalPreallocateBlockSize() should return size_t
Summary: WritableFile::SetPreallocationBlockSize() requires parameter as size_t, and options used in DBImpl::GetWalPreallocateBlockSize() are all size_t. WritableFile::SetPreallocationBlockSize() should return size_t to avoid build break if size_t is not uint64_t.

Test Plan: Run existing tests.

Reviewers: andrewkr, IslamAbdelRahman, yiwu

Reviewed By: yiwu

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D64137
2016-09-19 16:51:38 -07:00
sdong b666f85445 Consider more factors when determining preallocation size of WAL files
Summary: Currently the WAL file preallocation size is 1.1 * write_buffer_size. This, however, will be over-estimated if options.db_write_buffer_size or options.max_total_wal_size is set and is much smaller.

Test Plan: Add a unit test.

Reviewers: andrewkr, yiwu

Reviewed By: yiwu

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D63957
2016-09-19 12:04:35 -07:00
Yi Wu 0a88f38b7e Remove ColumnFamilyData::options()
Summary: One more small refactor before I split DBOptions into mutable and immutable parts.

Test Plan: existing unit tests.

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64047
2016-09-16 15:09:14 -07:00
Andrew Kryczka 06b4785fec Fix recovery for WALs without data for all CFs
Summary:
if one or more CFs had no data in the WAL, the log number that's used
by FindObsoleteFiles() wasn't updated. We need to treat this case the same as
if the data for that WAL had been flushed.

Test Plan: new unit test

Reviewers: IslamAbdelRahman, yiwu, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63963
2016-09-15 11:40:48 -07:00
Yi Wu 17f76fc564 DB::GetOptions() reflect dynamic changed options
Summary: DB::GetOptions() reflect dynamic changed options.

Test Plan: See the new unit test.

Reviewers: yhchiang, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63903
2016-09-14 22:10:28 -07:00
Yi Wu 81747f1be6 Refactor MutableCFOptions
Summary:
* Change constructor of MutableCFOptions to depends only on ColumnFamilyOptions.
* Move `max_subcompactions`, `compaction_options_fifo` and `compaction_pri` to ImmutableCFOptions to make it clear that they are immutable.

Test Plan: existing unit tests.

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63945
2016-09-13 21:11:59 -07:00
somnathr 9e4aa798c3 Summary: (#1313)
If log recycling is enabled with the rocksdb (recycle_log_file_num=16)
 db->Writebatch is erroring out with keynotfound after ~5-6 hours of run
 (1M seq but can happen to any workload I guess).See my detailed bug
 report here (https://github.com/facebook/rocksdb/issues/1303).
 This commit is the fix for this, a check is been added not to delete
 the log file if it is already there in the recycle list.

Test Plan:
 Unit tested it and ran the similar profile. Not reproducing anymore.
2016-09-12 16:53:42 -07:00
Islam AbdelRahman 1cca091298 Temporarily revert Prev() prefix support
Summary:
Temporarily revert commits for supporting prefix Prev() to unblock MyRocks and RocksDB release

These are the commits reverted

  - 6a14d55bd9
  - b18f9c9eac
  - db74b1a219
  - 2482d5fb45

Test Plan: make check -j64

Reviewers: sdong, lightmark

Reviewed By: lightmark

Subscribers: andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D63789
2016-09-08 14:45:32 -07:00
Injun Song ce1be2ce37 Fix build error on Windows (AppVeyor) (#1315)
Add 'cf_options' to source list and db_imple.cc

fix casting
2016-09-06 08:41:43 -07:00
Aaron Gao 2482d5fb45 support Prev() in prefix seek mode
Summary: As title, make sure Prev() works as expected with Next() when the current iter->key() in the range of the same prefix in prefix seek mode

Test Plan: make all check -j64 (add prefix_test with PrefixSeekModePrev test case)

Reviewers: andrewkr, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: yoshinorim, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61419
2016-08-29 20:55:39 -07:00
sdong dade61ac26 Mitigate regression bug of options.max_successive_merges hit during DB Recovery
Summary:
After 1b8a2e8fdd, DB Pointer is passed to WriteBatchInternal::InsertInto() while DB recovery. This can cause deadlock if options.max_successive_merges hits. In that case DB::Get() will be called. Get() will try to acquire the DB mutex, which is already held by the DB::Open(), causing a deadlock condition.

This commit mitigates the problem by not passing the DB pointer unless 2PC is allowed.

Test Plan: Add a new test and run it.

Reviewers: IslamAbdelRahman, andrewkr, kradhakrishnan, horuff

Reviewed By: kradhakrishnan

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62625
2016-08-25 17:30:34 -07:00
Justin Gibbs b2ce59537c Persist data during user initiated shutdown
Summary:
Move the manual memtable flush for databases containing data that has
bypassed the WAL from DBImpl's destructor to CancleAllBackgroundWork().

CancelAllBackgroundWork() is a publicly exposed API which allows
async operations performed by background threads to be disabled on a
database. In effect, this places the database into a "shutdown" state
in advance of calling the database object's destructor. No compactions
or flushing of SST files can occur once a call to this API completes.

When writes are issued to a database with WriteOptions::disableWAL
set to true, DBImpl::has_unpersisted_data_ is set so that
memtables can be flushed when the database object is destroyed. If
CancelAllBackgroundWork() has been called prior to DBImpl's destructor,
this flush operation is not possible and is skipped, causing unnecessary
loss of data.

Since CancelAllBackgroundWork() is already invoked by DBImpl's destructor
in order to perform the thread join portion of its cleanup processing,
moving the manual memtable flush to CancelAllBackgroundWork() ensures
data is persisted regardless of client behavior.

Test Plan:
Write an amount of data that will not cause a memtable flush to a rocksdb
database with all writes marked with WriteOptions::disableWAL. Properly
"close" the database. Reopen database and verify that the data was
persisted.

Reviewers: IslamAbdelRahman, yiwu, yoshinorim, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62277
2016-08-25 12:24:22 -07:00
sdong 56dd034115 read_options.background_purge_on_iterator_cleanup to cover forward iterator and log file closing too.
Summary: With read_options.background_purge_on_iterator_cleanup=true, File deletion and closing can still happen in forward iterator, or WAL file closing. Cover those cases too.

Test Plan: I am adding unit tests.

Reviewers: andrewkr, IslamAbdelRahman, yiwu

Reviewed By: yiwu

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61503
2016-08-10 13:16:41 -07:00
Yi Wu ee027fc19f Ignore write stall triggers when auto-compaction is disabled
Summary:
My understanding is that the purpose of write stall triggers are to wait for auto-compaction to catch up. Without auto-compaction, we don't need to stall writes.

Also with this diff, flush/compaction conditions are recalculated on dynamic option change. Previously the conditions are recalculate only when write stall options are changed.

Test Plan: See the new test. Removed two tests that are no longer valid.

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61437
2016-08-02 21:55:26 -07:00
sdong 2a6d0cde72 Ignore stale logs while restarting DBs
Summary:
Stale log files can be deleted out of order. This can happen for various reasons. One of the reason is that no data is ever inserted to a column family and we have an optimization to update its log number, but not all the old log files are cleaned up (the case shown in the unit tests added). It can also happen when we simply delete multiple log files out of order.

This causes data corruption because we simply increase seqID after processing the next row and we may end up with writing data with smaller seqID than what is already flushed to memtables.

In DB recovery, for the oldest files we are replaying, if there it contains no data for any column family, we ignore the sequence IDs in the file.

Test Plan: Add two unit tests that fail without the fix.

Reviewers: IslamAbdelRahman, igor, yiwu

Reviewed By: yiwu

Subscribers: hermanlee4, yoshinorim, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60891
2016-07-25 11:47:31 -07:00
sdong d5a51d4de3 Need to make sure log file synced before flushing memtable of one column family
Summary: Multiput atomiciy is broken across multiple column families if we don't sync WAL before flushing one column family. The WAL file may contain a write batch containing writes to a key to the CF to be flushed and a key to other CF. If we don't sync WAL before flushing, if machine crashes after flushing, the write batch will only be partial recovered. Data to other CFs are lost.

Test Plan: Add a new unit test which will fail without the diff.

Reviewers: yhchiang, IslamAbdelRahman, igor, yiwu

Reviewed By: yiwu

Subscribers: yiwu, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60915
2016-07-21 16:29:06 -07:00
Aaron Gao dda6c72ac8 Add DestroyColumnFamilyHandle(ColumnFamilyHandle**) to db.h
Summary:
add DestroyColumnFamilyHandle(ColumnFamilyHandle**) to close column family instead of deleting cfh*
User should call this to close a cf and then we can detect the deletion in this function.

Test Plan: make all check -j64

Reviewers: andrewkr, yiwu, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60765
2016-07-13 17:59:25 -07:00
Yi Wu 6ea41f8527 Fix deadlock when trying update options when write stalls
Summary:
When write stalls because of auto compaction is disabled, or stop write trigger is reached,
user may change these two options to unblock writes. Unfortunately we had issue where the write
thread will block the attempt to persist the options, thus creating a deadlock. This diff
fix the issue and add two test cases to detect such deadlock.

Test Plan:
Run unit tests.

Also, revert db_impl.cc to master (but don't revert `DBImpl::BackgroundCompaction:Finish` sync point) and run db_options_test. Both tests should hit deadlock.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60627
2016-07-12 15:30:38 -07:00
Aaron Gao 8e6b38d895 update DB::AddFile to ingest list of sst files
Summary:
DB::AddFile(std::string file_path) API that allow them to ingest an SST file created using SstFileWriter
We want to update this interface to be able to accept a list of files that will be ingested, DB::AddFile(std::vector<std::string> file_path_list).

Test Plan:
Add test case `AddExternalSstFileList` in `DBSSTTest`. To make sure:
1. files key ranges are not overlapping with each other
2. each file key range dont overlap with the DB key range
3. make sure no snapshots are held

Reviewers: andrewkr, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D58587
2016-07-11 10:43:12 -07:00
sdong a00bf1b3cf Add More Logging to track total_log_size
Summary: We saw instances where total_log_size is off the real value, but I'm not able to reproduce it. Add more logging to help debugging when it happens again.

Test Plan: Run the unit test and see the logging.

Reviewers: andrewkr, yhchiang, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60081
2016-07-06 14:29:18 -07:00
sdong 32df9733d1 Add options.write_buffer_manager: control total memtable size across DB instances
Summary: Add option write_buffer_manager to help users control total memory spent on memtables across multiple DB instances.

Test Plan: Add a new unit test.

Reviewers: yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: adela, benj, sumeet, muthu, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59925
2016-07-05 18:11:25 -07:00
omegaga a45ee83181 Fix a bug that accesses invalid address in iterator cleanup function
Summary: Reported in T11889874. When registering the cleanup function we should copy the option so that we can still access it if ReadOptions is deleted.

Test Plan: Add a unit test to reproduce this bug.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60087
2016-07-05 11:57:14 -07:00
omegaga c4e19b77e8 Add a read option to enable background purge when cleaning up iterators
Summary:
Add a read option `background_purge_on_iterator_cleanup` to avoid deleting files in foreground when destroying iterators.
Instead, a job is scheduled in high priority queue and would be executed in a separate background thread.

Test Plan: Add a variant of PurgeObsoleteFileTest. Turn on background purge option in the new test, and use sleeping task to ensure files are deleted in background.

Reviewers: IslamAbdelRahman, sdong

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59499
2016-06-21 18:41:23 -07:00
Islam AbdelRahman fa813f7478 Update DB::AddFile() to ingest the file to the lowest possible level
Summary:
DB::AddFile() right now always add the ingested file to L0
update the logic to add the file to the lowest possible level

Test Plan: unit tests

Reviewers: jkedgar, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D59637
2016-06-21 17:57:59 -07:00
sdong 7b79238b65 Deprectate filter_deletes
Summary: filter_deltes is not a frequently used feature. Remove it.

Test Plan: Run all test suites.

Reviewers: igor, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59427
2016-06-17 10:30:47 -07:00
Islam AbdelRahman 30a24f2d3d Add InternalStats and logging for AddFile()
Summary:
We dont report the bytes that we ingested from AddFile which make the write amplification numbers incorrect
Update InternalStats and add logging for AddFile()

Test Plan: Make sure the code compile and existing tests pass

Reviewers: lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59763
2016-06-16 16:21:41 -07:00
Yi Wu bc8af90e8c add option to not flush memtable on open()
Summary:
Add option to not flush memtable on open()
In case the option is enabled, don't delete existing log files by not updating log numbers to MANIFEST.
Will still flush if we need to (e.g. memtable full in the middle). In that case we also flush final memtable.
If wal_recovery_mode = kPointInTimeRecovery, do not halt immediately after encounter corruption. Instead, check if seq id of next log file is last_log_sequence + 1. In that case we continue recovery.

Test Plan: See unit test.

Reviewers: dhruba, horuff, sdong

Reviewed By: sdong

Subscribers: benj, yhchiang, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57813
2016-06-13 11:34:16 -07:00
Wanning Jiang 56887f6cb8 Backup Options
Summary: Backup options file to private directory

Test Plan:
backupable_db_test.cc, BackupOptions
	   Modify DB options by calling OpenDB for 3 times. Check the latest options file is in the right place. Also check no redundent files are backuped.

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, dhruba, andrewkr

Differential Revision: https://reviews.facebook.net/D59373
2016-06-09 19:03:10 -07:00
Andrew Kryczka 842958651f Fix race condition in SwitchMemtable
Summary:
MemTableList::current_ could be written by background flush thread and
simultaneously read in the user thread (NumNotFlushed() is used in
SwitchMemtable()). Use the lock to prevent this case. Found the error from tsan.

Related: D58833

Test Plan:
  $ OPT=-g COMPILE_WITH_TSAN=1 make -j64 db_test
  $ TEST_TMPDIR=/dev/shm/rocksdb ./db_test --gtest_filter=DBTest.RepeatedWritesToSameKey

Reviewers: lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59139
2016-06-02 17:11:45 -07:00
PraveenSinghRao 3a276b0cbe Add a callback for when memtable is moved to immutable (#1137)
* Create a callback for memtable becoming immutable

Create a callback for memtable becoming immutable

Create a callback for memtable becoming immutable

moved notification outside the lock

Move sealed notification to unlocked portion of SwitchMemtable

* fix lite build
2016-06-02 11:57:31 -07:00
Mike Kolupaev 936973d145 Small tweaks to logging to track the number of immutable memtables
Summary:
We see some write stalls because of number of unflushed memtables. With existing logging I couldn't figure out what's happening exactly. See internal task t11446054 for details if interested. This diff adds:
- logging of memtable creation at info level; I wanted it on multiple occasions for different reasons; also include number of immutable memtables,
- logging of number of remaining immutable memtables after a flush.

Test Plan: ran tests

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58833
2016-06-01 11:11:33 -07:00
Reid Horuff 5d85fdb2c5 add missing lock 2016-05-31 12:26:48 -07:00
Ashish Shenoy 99765ed855 Clean up the ComputeCompactionScore() API
Summary: Make CompactionOptionsFIFO a part of mutable_cf_options

Test Plan: UT

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, lgalanis, dhruba

Differential Revision: https://reviews.facebook.net/D58653
2016-05-23 15:55:29 -07:00
Sage Weil 11f329bd40 db/db_impl: restrict WALRecoveryMode when using recycled log files
kPointInTimeRecovery is indistinguishable from
kTolerateCorruptedTailRecords in recycle mode since we define
the "end" of the log as the first corrupt record we encounter.

kAbsoluteConsistency doesn't make sense because even a clean
shutdown leaves old junk at the end of the log file.

Signed-off-by: Sage Weil <sage@redhat.com>
2016-05-22 22:00:15 -07:00
Aaron Orenstein 2073cf3775 Eliminate use of 'using namespace std'. Also remove a number of ADL references to std functions.
Summary: Reduce use of argument-dependent name lookup in RocksDB.

Test Plan: 'make check' passed.

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58203
2016-05-20 07:42:18 -07:00
Islam AbdelRahman c70a9335de Fix mutex unlock issue between scheduled compaction and ReleaseCompactionFiles()
Summary:
NotifyOnCompactionCompleted can unlock the mutex.
That mean that we can schedule a background compaction that will start before we ReleaseCompactionFiles().

Test Plan:
added unittest
existing unittest

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: yoshinorim, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58065
2016-05-18 14:56:30 -07:00
Aaron Gao 43afd72bee [rocksdb] make more options dynamic
Summary:
make more ColumnFamilyOptions dynamic:
- compression
- soft_pending_compaction_bytes_limit
- hard_pending_compaction_bytes_limit
- min_partial_merge_operands
- report_bg_io_stats
- paranoid_file_checks

Test Plan:
Add sanity check in `db_test.cc` for all above options except for soft_pending_compaction_bytes_limit and hard_pending_compaction_bytes_limit.
All passed.

Reviewers: andrewkr, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57519
2016-05-17 13:11:56 -07:00
Islam AbdelRahman f6aedb62c0 Fix Transaction memory leak
Summary:
- Make sure we clean up recovered_transactions_ on DBImpl destructor
- delete leaked txns and env in TransactionTest

Test Plan: Run transaction_test under valgrind

Reviewers: sdong, andrewkr, yhchiang, horuff

Reviewed By: horuff

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58263
2016-05-16 16:32:55 -07:00
Reid Horuff a400336398 TransactionLogIterator sequence gap fix
Summary: DBTestXactLogIterator.TransactionLogIterator was failing due the sequence gaps. This was caused by an off-by-one error when calculating the new sequence number after recovering from logs.

Test Plan: db_log_iter_test

Reviewers: andrewkr

Subscribers: andrewkr, hermanlee4, dhruba, IslamAbdelRahman

Differential Revision: https://reviews.facebook.net/D58053
2016-05-12 13:54:08 -07:00
Reid Horuff c27061dae7 [rocksdb] 2PC double recovery bug fix
Summary:
1. prepare()
2. crash
3. recover
4. commit()
5. crash
6. data is lost

This is due to the transaction data still only residing in the WAL but because the logs were flushed on the first recovery the data is ignored on the second recovery. We must scan all logs found on recovery and only ignore redundant data at the time of replay. It is not possible to know which logs still contain relevant data at time of recovery. We cannot simply ignore a log because all of the non-2pc data it contains has already been written to L0.

The changes made to MemTableInserter are to ensure that prepared sections are still recovered even if all of the non-2pc data in that log has already been flushed to L0.

Test Plan: Provided test.

Reviewers: sdong

Subscribers: andrewkr, hermanlee4, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57729
2016-05-10 14:06:07 -07:00
Reid Horuff a657ee9a9c [rocksdb] Recovery path sequence miscount fix
Summary:
Consider the following WAL with 4 batch entries prefixed with their sequence at time of memtable insert.
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(a)]
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(b)]
[4: COMMIT(a)]
[7: COMMIT(b)]

The first two batches do not consume any sequence numbers so are both prefixed with seq=1.
For 2pc commit, memtable insertion takes place before COMMIT batch is written to WAL.
We can see that sequence number consumption takes place between WAL entries giving us the seemingly sparse sequence prefix for WAL entries.
This is a valid WAL.

Because with 2PC markers one WriteBatch points to another batch containing its inserts a writebatch can consume more or less sequence numbers than the number of sequence consuming entries that it contains.

We can see that, given the entries in the WAL, 6 sequence ids were consumed. Yet on recovery the maximum sequence consumed would be 7 + 3 (the number of sequence numbers consumed by COMMIT(b))

So, now upon recovery we must track the actual consumption of sequence numbers.
In the provided scenario there will be no sequence gaps, but it is possible to produce a sequence gap. This should not be a problem though. correct?

Test Plan: provided test.

Reviewers: sdong

Subscribers: andrewkr, leveldb, dhruba, hermanlee4

Differential Revision: https://reviews.facebook.net/D57645
2016-05-10 14:06:07 -07:00
Reid Horuff 1b8a2e8fdd [rocksdb] Memtable Log Referencing and Prepared Batch Recovery
Summary:
This diff is built on top of WriteBatch modification: https://reviews.facebook.net/D54093 and adds the required functionality to rocksdb core necessary for rocksdb to support 2PC.

modfication of DBImpl::WriteImpl()
- added two arguments *uint64_t log_used = nullptr, uint64_t log_ref = 0;
- *log_used is an output argument which will return the log number which the incoming batch was inserted into, 0 if no WAL insert took place.
-  log_ref is a supplied log_number which all memtables inserted into will reference after the batch insert takes place. This number will reside in 'FindMinPrepLogReferencedByMemTable()' until all Memtables insertinto have flushed.

- Recovery/writepath is now aware of prepared batches and commit and rollback markers.

Test Plan: There is currently no test on this diff. All testing of this functionality takes place in the Transaction layer/diff but I will add some testing.

Reviewers: IslamAbdelRahman, sdong

Subscribers: leveldb, santoshb, andrewkr, vasilep, dhruba, hermanlee4

Differential Revision: https://reviews.facebook.net/D56919
2016-05-10 14:06:07 -07:00
Islam AbdelRahman 4b31723433 Add bottommost_compression option
Summary:
Add a new option that can be used to set a specific compression algorithm for bottommost level.
This option will only affect levels larger than base level.

I have also updated CompactionJobInfo to include the compression algorithm used in compaction

Test Plan:
added new unittest
existing unittests

Reviewers: andrewkr, yhchiang, sdong

Reviewed By: sdong

Subscribers: lightmark, andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D57669
2016-05-09 15:57:19 -07:00
Yi Wu a92049e3e7 Added EventListener::OnTableFileCreationStarted() callback
Summary: Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status.

Test Plan: unit test.

Reviewers: dhruba, yhchiang, ott, sdong

Reviewed By: sdong

Subscribers: sdong, kradhakrishnan, IslamAbdelRahman, andrewkr, yhchiang, leveldb, ott, dhruba

Differential Revision: https://reviews.facebook.net/D56337
2016-04-29 11:35:00 -07:00
sdong 992a8f83b7 Not enable jemalloc status printing if USE_CLANG=1
Summary: Warning is printed out with USE_CLANG=1 when including jemalloc.h. Disable it in that case.

Test Plan: Run db_bench with USE_CLANG=1 and not. Make sure they can all build and jemalloc status is printed out in the case where USE_CLANG is not set.

Reviewers: andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57399
2016-04-28 16:16:14 -07:00
Islam AbdelRahman 0850bc5147 Fix build on machines without jemalloc
Summary: It looks like we mistakenly enable JEMALLOC even if it's not available on the machine, that's why travis is failing

Test Plan:
check on my devserver
check on my mac

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57345
2016-04-27 18:25:19 -07:00
Sergey Makarenko 1c80dfab24 Print memory allocation counters
Summary:
Introduced option to dump malloc statistics using new option flag.
    Added new command line option to db_bench tool to enable this
    funtionality.
    Also extended build to support environments with/without jemalloc.

Test Plan:
1) Build rocksdb using `make` command. Launch the following command
    `./db_bench --benchmarks=fillrandom --dump_malloc_stats=true
    --num=10000000` end verified that jemalloc dump is present in LOG file.
    2) Build rocksdb using `DISABLE_JEMALLOC=1  make db_bench -j32` and ran
    the same db_bench tool and found the following message in LOG file:
    "Please compile with jemalloc to enable malloc dump".
    3) Also built rocksdb using `make` command on MacOS to verify behavior
    in non-FB environment.
    Also to debug build configuration change temporary changed
    AM_DEFAULT_VERBOSITY = 1 in Makefile to see compiler and build
    tools output. For case 1) -DROCKSDB_JEMALLOC was present in compiler
    command line. For both 2) and 3) this flag was not present.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57321
2016-04-27 16:23:33 -07:00
sdong ac0e54b4c6 CompactedDB should not be used if there is outstanding WAL files
Summary: CompactedDB skips memtable. So we shouldn't use compacted DB if there is outstanding WAL files.

Test Plan: Change to options.max_open_files = -1 perf context test to create a compacted DB, which we shouldn't do.

Reviewers: yhchiang, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57057
2016-04-26 14:22:07 -07:00
Yueh-Hsuan Chiang 24110ce90c Correct Statistics FLUSH_WRITE_BYTES
Summary:
In https://reviews.facebook.net/D56271, we fixed an issue where
we consider flush as compaction.  However, that makes us mistakenly
count FLUSH_WRITE_BYTES twice (one in flush_job and one in db_impl.)

This patch removes the one incremented in db_impl.

Test Plan: db_test

Reviewers: yiwu, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57111
2016-04-25 12:01:01 -07:00
sdong 4b6833aec1 Rename options.compaction_measure_io_stats to options.report_bg_io_stats and include flush too.
Summary: It is useful to print out IO stats in flush jobs too. Extend options.compaction_measure_io_stats to flush jobs and raname it.

Test Plan: Try db_bench and see the stats are printed out.

Reviewers: yhchiang

Reviewed By: yhchiang

Subscribers: kradhakrishnan, yiwu, IslamAbdelRahman, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56769
2016-04-15 10:22:18 -07:00
Jay Edgar 2448f80375 Make sure that if use_mmap_reads is on use_os_buffer is also on
Summary: The code assumes that if use_mmap_reads is on then use_os_buffer is also on.  This make sense as by using memory mapped files for reading you are expecting the OS to cache what it needs.  Add code to make sure the user does not turn off use_os_buffer when they turn on use_mmap_reads

Test Plan: New test: DBTest.MMapAndBufferOptions

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56397
2016-04-08 14:30:15 -07:00
Andrew Kryczka 2391ef7214 Embed column family name in SST file
Summary:
Added the column family name to the properties block. This property
is omitted only if the property is unavailable, such as when RepairDB()
writes SST files.

In a next diff, I will change RepairDB to use this new property for
deciding to which column family an existing SST file belongs. If this
property is missing, it will add it to the "unknown" column family (same
as its existing behavior).

Test Plan:
New unit test:

  $ ./db_table_properties_test --gtest_filter=DBTablePropertiesTest.GetColumnFamilyNameProperty

Reviewers: IslamAbdelRahman, yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55605
2016-04-06 23:10:32 -07:00
Marton Trencseni 9b51987521 Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes.
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.

Test Plan:
'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK.
I didn't run the Java tests, I don't have Java set up on my devserver.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56133
2016-04-01 10:42:39 -07:00
zensan 78711524b7 In all the places where log records are read, there was a check that
record.size() should not be less than 12.

This "magic number" seems to be the WriteBatch header (8 byte sequence
and 4 byte count).   Replaced all the places where "12" was used
by WriteBatchInternal::kHeader.
2016-03-30 23:05:22 +05:30
Praveen Rao 583157f710 Avoid overloaded virtual function 2016-03-22 17:10:31 -07:00
Praveen Rao 4f1c74a46e merge from master 2016-03-18 12:48:01 -07:00
Praveen Rao f8c2189307 Publish log numbers for column family to wal_filter, and provide log number in the record callback 2016-03-18 12:32:15 -07:00
Baris Yazici e8e6cf0173 fix: handle_fatal_signal (sig=6) in std::vector<std::string, std::allocator<std::string> >::_M_range_check | c++/4.8.2/bits/stl_vector.h:794 #174
Summary:
Fix for https://github.com/facebook/mysql-5.6/issues/174

When there is no old files to purge, vector.at(i) function was crashing

if (old_info_log_file_count != 0 &&
      old_info_log_file_count >= db_options_.keep_log_file_num) {
    std::sort(old_info_log_files.begin(), old_info_log_files.end());
    size_t end = old_info_log_file_count - db_options_.keep_log_file_num;
    for (unsigned int i = 0; i <= end; i++) {
      std::string& to_delete = old_info_log_files.at(i);

Added check to old_info_log_file_count be non zero.

Test Plan: run existing tests

Reviewers: gunnarku, vasilep, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: andrewkr, webscalesql-eng, dhruba

Differential Revision: https://reviews.facebook.net/D55245
2016-03-11 11:11:45 -08:00
Andrew Kryczka d9620239d2 Cleanup stale manifests outside of full purge
Summary:
- Keep track of obsolete manifests in VersionSet
- Updated FindObsoleteFiles() to put obsolete manifests in the JobContext for later use by PurgeObsoleteFiles()
- Added test case that verifies a stale manifest is deleted by a non-full purge

Test Plan:
  $ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation

Reviewers: IslamAbdelRahman, yoshinorim, sdong

Reviewed By: sdong

Subscribers: andrewkr, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D55269
2016-03-10 18:16:21 -08:00
Yueh-Hsuan Chiang 765597fa78 Update compaction score right after CompactFiles forms a compaction
Summary:
This is a follow-up patch of https://reviews.facebook.net/D54891.
As the information about files being compacted will also be used
when making compaction decision, it is necessary to update the compaction
score when a compaction plan has been made but not yet execute.

This patch adds a missing call to update the compaction score in
CompactFiles().

Test Plan: compact_files_test

Reviewers: sdong, IslamAbdelRahman, kradhakrishnan, yiwu, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55227
2016-03-10 14:34:28 -08:00
Yueh-Hsuan Chiang a7d4eb2f34 Fix a bug where flush does not happen when a manual compaction is running
Summary:
Currently, when rocksdb tries to run manual compaction to refit data into a level,
there's a ReFitLevel() process that requires no bg work is currently running.
When RocksDB plans to ReFitLevel(), it will do the following:

 1. pause scheduling new bg work.
 2. wait until all bg work finished
 3. do the ReFitLevel()
 4. unpause scheduling new bg work.

However, as it pause scheduling new bg work at step one and waiting for all bg work
finished in step 2, RocksDB will stop flushing until all bg work is done (which
could take a long time.)

This patch fix this issue by changing the way ReFitLevel() pause the background work:

1. pause scheduling compaction.
2. wait until all bg work finished.
3. pause scheduling flush
4. do ReFitLevel()
5. unpause both flush and compaction.

The major difference is that.  We only pause scheduling compaction in step 1 and wait
for all bg work finished in step 2.  This prevent flush being blocked for a long time.
Although there's a very rare case that ReFitLevel() might be in starvation in step 2,
but it's less likely the case as flush typically finish very fast.

Test Plan: existing test.

Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55029
2016-03-04 14:24:52 -08:00
Islam AbdelRahman dfe96c72c3 Fix WriteLevel0TableForRecovery file delete protection
Summary:
The call to

```
CaptureCurrentFileNumberInPendingOutputs()
```

should be before

```
versions_->NewFileNumber()
```
Right now we are not actually protecting the file from being deleted

Test Plan: make check

Reviewers: sdong, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D54645
2016-03-03 18:25:07 -08:00
sdong e79ad9e184 Add Iterator Property rocksdb.iterator.version_number
Summary: We want to provide a way to detect whether an iterator is stale and needs to be recreated. Add a iterator property to return version number.

Test Plan: Add two unit tests for it.

Reviewers: IslamAbdelRahman, yhchiang, anthony, kradhakrishnan, andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54921
2016-03-02 16:23:59 -08:00
Islam AbdelRahman 6743135ea1 Fix DB::AddFile() issue when PurgeObsoleteFiles() is called
Summary:
In some situations the DB will scan all existing files in the DB path and delete the ones that are Obsolete.
If this happen during adding an external sst file. this could cause the file to be deleted while we are adding it.
This diff fix this issue

Test Plan:
unit test to reproduce the bug
existing unit tests

Reviewers: sdong, yhchiang, andrewkr

Reviewed By: andrewkr

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D54627
2016-03-01 12:05:29 -08:00
sdong 38201b3599 Fix assert failure when DBImpl::SyncWAL() conflicts with log rolling
Summary: DBImpl::SyncWAL() releases db mutex before calling DBImpl::MarkLogsSynced(), while inside DBImpl::MarkLogsSynced() we assert there is none or one outstanding log file. However, a memtable switch can happen in between and causing two or outstanding logs there, failing the assert. The diff adds a unit test that repros the issue and fix the assert so that the unit test passes.

Test Plan: Run the new tests.

Reviewers: anthony, kolmike, yhchiang, IslamAbdelRahman, kradhakrishnan, andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54621
2016-02-23 11:42:15 -08:00
Mike Kolupaev eef63ef807 Fixed CompactFiles() spuriously failing or corrupting DB
Summary:
We started getting two kinds of crashes since we started using `DB::CompactFiles()`:
(1) `CompactFiles()` fails saying something like "/data/logdevice/4440/shard12/012302.sst: No such file or directory", and presumably makes DB read-only,
(2) DB fails to open saying "Corruption: Can't access /267000.sst: IO error: /data/logdevice/4440/shard1/267000.sst: No such file or directory".

AFAICT, both can be explained by background thread deleting compaction output as "obsolete" while it's being written, before it's committed to manifest. If it ends up committed to the manifest, we get (2); if compaction notices the disappearance and fails, we get (1). The internal tasks t10068021 and t10134177 have some details about the investigation that led to this.

Test Plan: `make -j check`; the new test fails to reopen the DB without the fix

Reviewers: yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, sdong

Differential Revision: https://reviews.facebook.net/D54561
2016-02-22 13:54:58 -08:00
Islam AbdelRahman df9ba6df62 Introduce SstFileManager::SetMaxAllowedSpaceUsage() to cap disk space usage
Summary:
Introude SstFileManager::SetMaxAllowedSpaceUsage() that can be used to limit the maximum space usage allowed for RocksDB.
When this limit is exceeded WriteImpl() will fail and return Status::Aborted()

Test Plan: unit testing

Reviewers: yhchiang, anthony, andrewkr, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D53763
2016-02-17 15:20:23 -08:00
reid horuff 5bcf952a87 Fix WriteImpl empty batch hanging issue
Summary: There is an issue in DBImpl::WriteImpl where if an empty writebatch comes in and sync=true then the logs will be marked as being synced yet the sync never actually happens because there is no data in the writebatch. This causes the next incoming batch to hang while waiting for the logs to complete syncing. This fix syncs logs even if the writebatch is empty.

Test Plan: DoubleEmptyBatch unit test in transaction_test.

Reviewers: yoshinorim, hermanlee4, sdong, ngbronson, anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54057
2016-02-16 12:21:33 -08:00
Mike Kolupaev 44371501f0 Fixed a segfault when compaction fails
Summary: We've hit it today.

Test Plan: `make -j check`; didn't reproduce the issue

Reviewers: yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D54219
2016-02-16 11:11:16 -08:00
Baraa Hamodi 21e95811d1 Updated all copyright headers to the new format. 2016-02-09 15:12:00 -08:00
Yueh-Hsuan Chiang 4a8cbf4e31 Allows Get and MultiGet to read directly from SST files.
Summary:
Add kSstFileTier to ReadTier, which allows Get and MultiGet to
read only directly from SST files and skip mem-tables.

    kSstFileTier = 0x2      // data in SST files.
                          // Note that this ReadTier currently only supports
                          // Get and MultiGet and does not support iterators.

Test Plan: add new test in db_test.

Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: igor, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53511
2016-02-09 11:20:22 -08:00
reid horuff 6f71d3b68b Improve perf of Pessimistic Transaction expirations (and optimistic transactions)
Summary:
copy from task 8196669:

1) Optimistic transactions do not support batching writes from different threads.
2) Pessimistic transactions do not support batching writes if an expiration time is set.

In these 2 cases, we currently do not do any write batching in DBImpl::WriteImpl() because there is a WriteCallback that could decide at the last minute to abort the write.  But we could support batching write operations with callbacks if we make sure to process the callbacks correctly.

To do this, we would first need to modify write_thread.cc to stop preventing writes with callbacks from being batched together.  Then we would need to change DBImpl::WriteImpl() to call all WriteCallback's in a batch, only write the batches that succeed, and correctly set the state of each batch's WriteThread::Writer.

Test Plan: Added test WriteWithCallbackTest to write_callback_test.cc which creates multiple client threads and verifies that writes are batched and executed properly.

Reviewers: hermanlee4, anthony, ngbronson

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52863
2016-02-05 10:44:13 -08:00
Andrew Kryczka 284aa613a7 Eliminate duplicated property constants
Summary:
Before this diff, there were duplicated constants to refer to properties (user-
facing API had strings and InternalStats had an enum). I noticed these were
inconsistent in terms of which constants are provided, names of constants, and
documentation of constants. Overall it seemed annoying/error-prone to maintain
these duplicated constants.

So, this diff gets rid of InternalStats's constants and replaces them with a map
keyed on the user-facing constant. The value in that map contains a function
pointer to get the property value, so we don't need to do string matching while
holding db->mutex_. This approach has a side benefit of making many small
handler functions rather than a giant switch-statement.

Test Plan: db_properties_test passes, running "make commit-prereq -j32"

Reviewers: sdong, yhchiang, kradhakrishnan, IslamAbdelRahman, rven, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53253
2016-02-02 19:14:56 -08:00
Nathan Bronson 9c2cf9479b Fix for --allow_concurrent_memtable_write with batching
Summary:
Concurrent memtable adds were incorrectly computing
the last sequence number for a write batch group when the
write batches were not solitary.  This is the cause of
https://github.com/facebook/mysql-5.6/issues/155

Test Plan:
1. unit tests
2. new unit test
3. parallel db_bench stress tests with batch size of 10 and asserts enabled

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: IslamAbdelRahman, MarkCallaghan, dhruba

Differential Revision: https://reviews.facebook.net/D53595
2016-02-01 20:41:57 -08:00
SherlockNoMad 37159a6448 Add histogram for value size per operation 2016-01-31 18:09:24 -08:00
Venkatesh Radhakrishnan 3b2a1ddd2e Add options.base_background_compactions as a number of compaction threads for low compaction debt
Summary:
If options.base_background_compactions is given, we try to schedule number of compactions not existing this number, only when L0 files increase to certain number, or pending compaction bytes more than certain threshold, we schedule compactions based on options.max_background_compactions.

The watermarks are calculated based on slowdown thresholds.

Test Plan:
Add new test cases in column_family_test.
Adding more unit tests.

Reviewers: IslamAbdelRahman, yhchiang, kradhakrishnan, rven, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D53409
2016-01-29 16:15:53 -08:00
Islam AbdelRahman d6c838f1e1 Add SstFileManager (component tracking all SST file in DBs and control the deletion rate)
Summary:
Add a new class SstFileTracker that will be notified whenever a DB add/delete/move and sst file, it will also replace DeleteScheduler
SstFileTracker can be used later to abort writes when we exceed a specific size

Test Plan: unit tests

Reviewers: rven, anthony, yhchiang, sdong

Reviewed By: sdong

Subscribers: igor, lovro, march, dhruba

Differential Revision: https://reviews.facebook.net/D50469
2016-01-28 18:35:01 -08:00
Andrew Kryczka 167bd8856d [directory includes cleanup] Finish removing util->db dependencies 2016-01-26 10:49:24 -08:00
Andrew Kryczka acd7d58695 [directory includes cleanup] Remove util->db dependency for ThreadStatusUtil
Summary:
We can avoid the dependency by forward-declaring ColumnFamilyData and
then treating it as a black box. That means callers of ThreadStatusUtil need to
explicitly provide more options, even if they can be derived from the
ColumnFamilyData, since ThreadStatusUtil doesn't include the definition.

This is part of a series of diffs to eliminate circular dependencies between
directories (e.g., db/* files depending on util/* files and vice-versa).

Test Plan:
  $ ./db_test --gtest_filter=DBTest.GetThreadStatus
  $ make -j32 commit-prereq

Reviewers: sdong, yhchiang, IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53361
2016-01-26 10:49:16 -08:00
Andrew Kryczka 46f9cd46af [directory includes cleanup] Move cross-function test points
Summary:
I split the db-specific test points out into a separate file under db/
directory. There were also a few bugs to fix in xfunc.{h,cc} that prevented it
from compiling previously; see https://reviews.facebook.net/D36825.

Test Plan:
compilation works now, below command works, will also run "make xfunc".

  $ make check ROCKSDB_XFUNC_TEST='managed_new' tests-regexp='DBTest' -j32

Reviewers: sdong, yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D53343
2016-01-26 10:49:05 -08:00
Islam AbdelRahman 2fbc59a348 Disallow SstFileWriter from creating empty sst files
Summary:
SstFileWriter may create an sst file with no entries
Right now this will fail when being ingested using DB::AddFile() saying that the keys are corrupted

Test Plan: make check

Reviewers: yhchiang, rven, anthony, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D52815
2016-01-25 13:47:07 -08:00
David Bernard 3f12e16f27 Make alloca.h optional 2016-01-19 06:17:31 +00:00
David Bernard d78c6b28c4 Changes for build on solaris
Makefile adjust paths for solaris build
Makefile enable _GLIBCXX_USE_C99 so that std::to_string is available
db_compaction_test.cc Initialise a variable to avoid a compilation error
db_impl.cc Include <alloca.h>
db_test.cc Include <alloca.h>
Environment.java recognise solaris envrionment
options_bulder.cc Make log unambiguous
geodb_impl.cc Make log and floor unambiguous
2016-01-19 04:45:21 +00:00
Venkatesh Radhakrishnan 7ece10ecb6 DeleteFilesInRange: Mark files to be deleted as being compacted before applying change
Summary:
While running the myrocks regression suite, I found that while
dropping a table soon after inserting rows into it resulted in an
assertion failure in CheckConsistencyForDeletes for not finding
a file which was recently added or moved. Marking the files to be
deleted as being compacted before calling LogAndApplyChange
fixed the assertion failures.

Test Plan: DBCompactionTest.DeleteFileRange

Reviewers: IslamAbdelRahman, anthony, yhchiang, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, yoshinorim, leveldb

Differential Revision: https://reviews.facebook.net/D52599
2016-01-07 14:48:45 -08:00
Reid Horuff da032495d3 Optimize GetLatestSequenceForKey
Summary: DBImpl::GetLatestSequenceForKey() can do memcpy's to load a value that will never be used.  This can be optimized by changing all the Get() functions called to optionally not fetch the value (and only fetch the sequencenumber).

Test Plan: optimistic_transaction_test and transaction_test

Reviewers: anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba, hermanlee4

Differential Revision: https://reviews.facebook.net/D52227
2016-01-06 13:43:22 -08:00
Venkatesh Radhakrishnan d74c9f0a57 DeleteFilesInRange: Clean job context if no files deleted
Summary:
We need to clean the job context if we end up not deleting any
files because no files are in the range specified.

Test Plan: DBCompactionTest.DeleteFileRange

Reviewers: sdong, anthony, yhchiang, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D52467
2016-01-04 10:55:31 -08:00
Nathan Bronson ac16663bd6 use -Werror=missing-field-initializers, to closer match MyRocks build
Summary:
myrocks seems to build rocksdb using
-Wmissing-field-initializers (and treats warnings as errors).  This diff
adds that flag to the rocksdb build, and fixes the compilation failures
that result.  I have not checked for any other differences in the build
flags for rocksdb build as part of myrocks.

Test Plan: make check

Reviewers: sdong, rven

Reviewed By: rven

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D52443
2015-12-30 14:56:18 -08:00
Venkatesh Radhakrishnan 63ddb783db Delete files in given key range
Summary:
This is an initial diff for providing the ability to delete
files which are completely within a given range of keys.

Test Plan: DBCompactionTest.DeleteRange

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D52293
2015-12-29 13:22:13 -08:00
Siying Dong 22c0ed8a5f Disable Visual Studio Warning C4351
Currently Windows build is broken because of Warning C4351. Disable the warning before figuring out the right way to fix it.
2015-12-28 15:06:34 -08:00
sdong 11672df19a Fix CLANG errors introduced by 7d87f02799
Summary: Fix some CLANG errors introduced in 7d87f02799

Test Plan: Build with both of CLANG and gcc

Reviewers: rven, yhchiang, kradhakrishnan, anthony, IslamAbdelRahman, ngbronson

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52329
2015-12-28 10:00:58 -08:00
Nathan Bronson 7d87f02799 support for concurrent adds to memtable
Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations.  Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention.  Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.

Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off).  This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex.  If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided.  This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).

Parallel writes are not currently compatible with
inplace updates, update callbacks, or delete filtering.
Enable it with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield).  Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.

Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.

This diff was motivated and inspired by Yahoo's cLSM work.  It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.

My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive.  With 1
thread I get ~440Kops/sec.  Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads.  Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.

Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba

Differential Revision: https://reviews.facebook.net/D50589
2015-12-25 11:03:40 -08:00
Siying Dong 298ba27ae2 Merge pull request #846 from yuslepukhin/enble_c4244_lossofdata
Enable MS compiler warning c4244.
2015-12-23 22:59:42 -08:00
Islam AbdelRahman d005c66faf Report compaction reason in CompactionListener
Summary:
Add CompactionReason to CompactionJobInfo
This will allow users to understand why compaction started which will help options tuning

Test Plan:
added new tests
make check -j64

Reviewers: yhchiang, anthony, kradhakrishnan, sdong, rven

Reviewed By: rven

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D51975
2015-12-22 11:37:19 -08:00
Alex Yang 33e09c0e19 add call to install superversion and schedule work in enableautocompactions
Summary:
This patch fixes https://github.com/facebook/mysql-5.6/issues/121

There is a recent change in rocksdb to disable auto compactions on startup: https://reviews.facebook.net/D51147. However, there is a small timing window where a column family needs to be compacted and schedules a compaction, but the scheduled compaction fails when it checks the disable_auto_compactions setting. The expectation is once the application is ready, it will call EnableAutoCompactions() to allow new compactions to go through. However, if the Column family is stalled because L0 is full, and no writes can go through, it is possible the column family may never have a new compaction request get scheduled. EnableAutoCompaction() should probably schedule an new flush and compaction event when it resets disable_auto_compaction.

Using InstallSuperVersionAndScheduleWork, we call SchedulePendingFlush,
SchedulePendingCompaction, as well as MaybeScheduleFlushOrcompaction on all the
column families to avoid the situation above.

This is still a first pass for feedback.
Could also just call SchedePendingFlush and SchedulePendingCompaction directly.

Test Plan:
Run on Asan build
cd _build-5.6-ASan/ && ./mysql-test/mtr --mem --big --testcase-timeout=36000 --suite-timeout=12000 --parallel=16 --suite=rocksdb,rocksdb_rpl,rocksdb_sys_vars --mysqld=--default-storage-engine=rocksdb --mysqld=--skip-innodb --mysqld=--default-tmp-storage-engine=MyISAM --mysqld=--rocksdb rocksdb_rpl.rpl_rocksdb_stress_crash --repeat=1000

Ensure that it no longer hangs during the test.

Reviewers: hermanlee4, yhchiang, anthony

Reviewed By: anthony

Subscribers: leveldb, yhchiang, dhruba

Differential Revision: https://reviews.facebook.net/D51747
2015-12-21 10:06:49 -08:00
Venkatesh Radhakrishnan 7b12ae97d4 Add signalall after removing item from manual_compaction deque
Summary:
When there are waiting manual compactions, we need to signal
them after removing the current manual compaction from the deque.

Test Plan: ColumnFamilytTest.SameCFManualManualCommaction

Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D52119
2015-12-17 16:59:00 -08:00
Islam AbdelRahman aececc209e Introduce ReadOptions::pin_data (support zero copy for keys)
Summary:
This patch update the Iterator API to introduce new functions that allow users to keep the Slices returned by key() valid as long as the Iterator is not deleted

ReadOptions::pin_data : If true keep loaded blocks in memory as long as the iterator is not deleted
Iterator::IsKeyPinned() : If true, this mean that the Slice returned by key() is valid as long as the iterator is not deleted

Also add a new option BlockBasedTableOptions::use_delta_encoding to allow users to disable delta_encoding if needed.

Benchmark results (using https://phabricator.fb.com/P20083553)

```
// $ du -h /home/tec/local/normal.4K.Snappy/db10077
// 6.1G    /home/tec/local/normal.4K.Snappy/db10077

// $ du -h /home/tec/local/zero.8K.LZ4/db10077
// 6.4G    /home/tec/local/zero.8K.LZ4/db10077

// Benchmarks for shard db10077
// _build/opt/rocks/benchmark/rocks_copy_benchmark \
//      --normal_db_path="/home/tec/local/normal.4K.Snappy/db10077" \
//      --zero_db_path="/home/tec/local/zero.8K.LZ4/db10077"

// First run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp          relative  time/iter  iters/s
// ============================================================================
// BM_StringCopy                                                 1.73s  576.97m
// BM_StringPiece                                   103.74%      1.67s  598.55m
// ============================================================================
// Match rate : 1000000 / 1000000

// Second run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp          relative  time/iter  iters/s
// ============================================================================
// BM_StringCopy                                              611.99ms     1.63
// BM_StringPiece                                   203.76%   300.35ms     3.33
// ============================================================================
// Match rate : 1000000 / 1000000
```

Test Plan: Unit tests

Reviewers: sdong, igor, anthony, yhchiang, rven

Reviewed By: rven

Subscribers: dhruba, lovro, adsharma

Differential Revision: https://reviews.facebook.net/D48999
2015-12-16 12:08:30 -08:00
Gunnar Kudrjavets 97265f5f14 Fix minor bugs in delete operator, snprintf, and size_t usage
Summary:
List of changes:

1) Fix the snprintf() usage in cases where wrong variable was used to determine the output buffer size.

2) Remove unnecessary checks before calling delete operator.

3) Increase code correctness by using size_t type when getting vector's size.

4) Unify the coding style by removing namespace::std usage at the top of the file to confirm to the majority usage.

5) Fix various lint errors pointed out by 'arc lint'.

Test Plan:
Code review and build:

git diff
make clean
make -j 32 commit-prereq
arc lint

Reviewers: kradhakrishnan, sdong, rven, anthony, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51849
2015-12-15 15:26:20 -08:00
Venkatesh Radhakrishnan 030215bf01 Running manual compactions in parallel with other automatic or manual compactions in restricted cases
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.

Test Plan: Rocksdb regression + new tests + valgrind

Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong

Reviewed By: sdong

Subscribers: yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47973
2015-12-14 11:20:34 -08:00
Dmitri Smirnov 236fe21c92 Enable MS compiler warning c4244.
Mostly due to the fact that there are differences in sizes of int,long
  on 64 bit systems vs GNU.
2015-12-11 16:47:34 -08:00
agiardullo 3bfd3d39a3 Use SST files for Transaction conflict detection
Summary:
Currently, transactions can fail even if there is no actual write conflict.  This is due to relying on only the memtables to check for write-conflicts.  Users have to tune memtable settings to try to avoid this, but it's hard to figure out exactly how to tune these settings.

With this diff, TransactionDB will use both memtables and SST files to determine if there are any write conflicts.  This relies on the fact that BlockBasedTable stores sequence numbers for all writes that happen after any open snapshot.  Also, D50295 is needed to prevent SingleDelete from disappearing writes (the TODOs in this test code will be fixed once the other diff is approved and merged).

Note that Optimistic transactions will still rely on tuning memtable settings as we do not want to read from SST while on the write thread.  Also, memtable settings can still be used to reduce how often TransactionDB needs to read SST files.

Test Plan: unit tests, db bench

Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb, yoshinorim

Differential Revision: https://reviews.facebook.net/D50475
2015-12-11 12:34:11 -08:00
agiardullo 9e44629061 Change SingleDelete to support conflict checking
Summary: For Transactions, we want to start using the SST files to do write conflict checking.  To do this, we need to make sure that compaction never removes all writes if an earlier snapshot exists.  So I had to change the way we process SingleDeletes to sometimes leave a SingleDelete behind when we encounter a Put followed by a SingleDelete.  See the comments in this diff for a more detailed explanation.

Test Plan: added more unit tests

Reviewers: rven, igor, kradhakrishnan, IslamAbdelRahman, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50295
2015-12-10 11:35:38 -08:00
Yueh-Hsuan Chiang 774b80e99e Resubmit the fix for a race condition in persisting options
Summary:
This patch fix a race condition in persisting options which will cause a crash when:

* Thread A obtain cf options and start to persist options based on that cf options.
* Thread B kicks in and finish DropColumnFamily and delete cf_handle.
* Thread A wakes up and tries to finish the persisting options and crashes.

Test Plan: Add a test in column_family_test that can reproduce the crash

Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51717
2015-12-08 17:01:02 -08:00
agiardullo e5c5f23814 Support marking snapshots for write-conflict checking - Take 2
Summary:
D51183 was reverted due to breaking the LITE build.

This diff is the same as D51183 but with a fix for the LITE BUILD(D51693)

Test Plan: run all unit tests

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51711
2015-12-08 16:47:31 -08:00
sdong 1d63c3d610 Revert "Support marking snapshots for write-conflict checking"
This reverts commit ec704aafdc for it broke RocksDB LITE build.
2015-12-08 09:27:17 -08:00
agiardullo ec704aafdc Support marking snapshots for write-conflict checking
Summary:
D50475 enables using SST files for transaction write-conflict checking.  In order for this to work, we need to make sure not to compact out SingleDeletes when there is an earlier transaction snapshot(D50295).  If there is a long-held snapshot, this could reduce the benefit of the SingleDelete optimization.

This diff allows Transactions to mark snapshots as being used for write-conflict checking.  Then, during compaction, we will be able to optimize SingleDeletes better in the future.

This diff adds a flag to SnapshotImpl which is used by Transactions.  This diff also passes the earliest write-conflict snapshot's sequence number to CompactionIterator.  This diff does not actually change Compaction (after this diff is pushed, D50295 will be able to use this information).

Test Plan: no behavior change, ran existing tests

Reviewers: rven, kradhakrishnan, yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51183
2015-12-07 19:40:51 -08:00
sdong f307036bde Revert "Fix a race condition in persisting options"
This reverts commit 2fa3ed5180. It breaks RocksDB lite build
2015-12-07 17:09:12 -08:00
Yueh-Hsuan Chiang 2fa3ed5180 Fix a race condition in persisting options
Summary:
This patch fix a race condition in persisting options which will cause a crash when:

* Thread A obtain cf options and start to persist options based on that cf options.
* Thread B kicks in and finish DropColumnFamily and delete cf_handle.
* Thread A wakes up and tries to finish the persisting options and crashes.

Test Plan: Add a test in column_family_test that can reproduce the crash

Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51609
2015-12-07 15:25:12 -08:00
Alex Yang e8180f9901 added public api to schedule flush/compaction, code to prevent race with db::open
Summary:
Fixes T8781168.

Added a new function EnableAutoCompactions in db.h to be publicly
avialable.  This allows compaction to be re-enabled after disabling it via
SetOptions

Refactored code to set the dbptr earlier on in TransactionDB::Open and DB::Open
Temporarily disable auto_compaction in TransactionDB::Open until dbptr is set to
prevent race condition.

Test Plan:
Ran make all check

verified fix on myrocks side:
was able to reproduce the seg fault with
../tools/mysqltest.sh --mem --force rocksdb.drop_table

method was to manually sleep the thread after DB::Open but before TransactionDB ptr was
assigned in transaction_db_impl.cc:
  DB::Open(db_options, dbname, column_families_copy, handles, &db);
  clock_t goal = (60000 * 10) + clock();
  while (goal > clock());
  ...dbptr(aka rdb) gets assigned below

verified my changes fixed the issue.

Also added unit test 'ToggleAutoCompaction' in transaction_test.cc

Reviewers: hermanlee4, anthony

Reviewed By: anthony

Subscribers: alex, dhruba

Differential Revision: https://reviews.facebook.net/D51147
2015-12-03 22:59:44 -08:00
sdong db320b1b82 DB to only flush the column family with the largest memtable while option.db_write_buffer_size is hit
Summary: When option.db_write_buffer_size is hit, we currently flush all column families. Move to flush the column family with the largest active memt table instead. In this way, we can avoid too many small files in some cases.

Test Plan: Modify test DBTest.SharedWriteBuffer to work with the updated behavior

Reviewers: kradhakrishnan, yhchiang, rven, anthony, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: march, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51291
2015-11-30 13:36:57 -08:00
Reid Horuff 3381e2c3e7 Handle multiple calls to DBImpl::PauseBackgroundWork() and DBImpl::ContinueBackgroundWork()
Summary: Handle multiple calls to DBImpl::PauseBackgroundWork() and DBImpl::ContinueBackgroundWork()

Test Plan: rocksdb.information_schema handles this case.

Reviewers: igor

Reviewed By: igor

Subscribers: hermanlee4, jkedgar, dhruba

Differential Revision: https://reviews.facebook.net/D50781
2015-11-16 14:20:18 -08:00
Venkatesh Radhakrishnan 2ae4d7d708 Make sure that CompactFiles does not run two parallel Level 0 compactions
Summary:
Since level 0 files can overlap, two level 0 compactions cannot
run in parallel. Compact files needs to check this before running a
compaction.

Test Plan: CompactFilesTest.L0ConflictsFiles

Reviewers: igor, IslamAbdelRahman, anthony, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50079
2015-11-13 12:01:00 -08:00
Nathan Bronson 6ce42dd075 Don't merge WriteBatch-es if WAL is disabled
Summary:
There's no need for WriteImpl to flatten the write batch group
into a single WriteBatch if the WAL is disabled.  This diff moves the
flattening into the WAL step, and skips flattening entirely if it isn't
needed.  It's good for about 5% speedup on a multi-threaded workload
with no WAL.

This diff also adds clarifying comments about the chance for partial
failure of WriteBatchInternal::InsertInto, and always sets bg_error_ if
the memtable state diverges from the logged state or if a WriteBatch
succeeds only partially.

Benchmark for speedup:
  db_bench -benchmarks=fillrandom -threads=16 -batch_size=1 -memtablerep=skip_list -value_size=0 --num=200000 -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000

Test Plan: asserts + make check

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50583
2015-11-12 10:50:38 -08:00
Yueh-Hsuan Chiang e114f0abb8 Enable RocksDB to persist Options file.
Summary:
This patch allows rocksdb to persist options into a file on
DB::Open, SetOptions, and Create / Drop ColumnFamily.
Options files are created under the same directory as the rocksdb
instance.

In addition, this patch also adds a fail_if_missing_options_file in DBOptions
that makes any function call return non-ok status when it is not able to
persist options properly.

  // If true, then DB::Open / CreateColumnFamily / DropColumnFamily
  // / SetOptions will fail if options file is not detected or properly
  // persisted.
  //
  // DEFAULT: false
  bool fail_if_missing_options_file;

Options file names are formatted as OPTIONS-<number>, and RocksDB
will always keep the latest two options files.

Test Plan:
Add options_file_test.

options_test
column_family_test

Reviewers: igor, IslamAbdelRahman, sdong, anthony

Reviewed By: anthony

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48285
2015-11-10 22:58:01 -08:00
Venkatesh Radhakrishnan 9d50afc3b9 Prefix-based iterating only shows keys in prefix
Summary:
MyRocks testing found an issue that while iterating over keys
that are outside the prefix, sometimes wrong results were seen for keys
outside the prefix. We now tighten the range of keys seen with a new
read option called prefix_seen_at_start. This remembers the starting
prefix and then compares it on a Next for equality of prefix. If they
are from a different prefix, it sets valid to false.

Test Plan: PrefixTest.PrefixValid

Reviewers: IslamAbdelRahman, sdong, yhchiang, anthony

Reviewed By: anthony

Subscribers: spetrunia, hermanlee4, yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50211
2015-11-05 13:24:05 -08:00
Yueh-Hsuan Chiang 3ecbab0040 Add GetAggregatedIntProperty(): returns the aggregated value from all CFs
Summary:
This patch adds GetAggregatedIntProperty() that returns the aggregated
value from all CFs

Test Plan: Added a test in db_test

Reviewers: igor, sdong, anthony, IslamAbdelRahman, rven

Reviewed By: rven

Subscribers: rven, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D49497
2015-11-03 15:54:18 -08:00
Islam AbdelRahman ff4499e297 Update DB::AddFile() to have less restrictions
Summary:
Update DB::AddFile() restrictions to be
  - Key range in loaded table file don't overlap with existing keys or tombstones in DB.
  - No other writes happen during AddFile call.

The updated AddFile() will verify that the file key range don't overlap with any keys or tombstones in the DB, and then add the file to L0

Test Plan: unit tests

Reviewers: igor, rven, anthony, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: adsharma, ameyag, dhruba

Differential Revision: https://reviews.facebook.net/D49233
2015-10-30 16:38:10 -07:00
Islam AbdelRahman 2872e0c8c2 Clean and expose CreateLoggerFromOptions
Summary:
CreateLoggerFromOptions have some parameters like  db_log_dir and env, these parameters are redundant since they already exist in DBOptions

this patch remove the redundant parameters and expose CreateLoggerFromOptions to users

Test Plan: make check

Reviewers: igor, anthony, yhchiang, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, hermanlee4

Differential Revision: https://reviews.facebook.net/D49713
2015-10-29 18:07:37 -07:00
sdong 296c3a1f94 "make format" in some recent commits
Summary: Run "make format" for some recent commits.

Test Plan: Build and run tests

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D49707
2015-10-29 17:11:14 -07:00
Herman Lee 0d720dfc17 Use the correct variable when fetching table properties.
Summary:
An uninitialized parameter was being passed into the call to fetch the table
properties during the compaction notification callbacks.

Test Plan:
Build it with myrocks and verify unit test passed.
Run unit tests.

Reviewers: rven, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D49635
2015-10-28 16:28:11 -07:00
Praveen Rao 4ce117c4d5 Merge branch 'master' into wal_filter 2015-10-26 19:03:34 -07:00
Praveen Rao 32cdec634e Fail recovery if filter provides more records than original and corresponding unit-test, fix naming conventions 2015-10-26 18:11:18 -07:00
Siying Dong 138876a62c Merge pull request #746 from ceph/wip-recycle
Add Options.recycle_log_file_num for Recycling WAL Files
2015-10-26 15:01:28 -07:00
Praveen Rao 2938c5c137 merge upstream changes 2015-10-19 15:21:33 -07:00
Praveen Rao 0c59691dde Handle multiple batches in single log record - allow app to return a new batch + allow app to return corrupted record status 2015-10-19 13:27:40 -07:00
Alexey Maykov f18acd8875 Fixed the clang compilation failure
Summary: As above.

Test Plan: USE_CLANG=1 make check -j

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48981
2015-10-19 10:38:50 -07:00
Sage Weil 9c33f64d19 log_reader: pass in WALRecoveryMode instead of bool report_eof_inconsistency
Soon our behavior will depend on more than just whther we are in
kAbsoluteConsistency or not.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-18 21:24:32 -04:00
Sage Weil 3ac13c99d1 log_reader: pass log_number and optional info_log to ctor
We will need the log number to validate the recycle-style CRCs.  The log
is helpful for debugging, but optional, as not all callers have it.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-18 21:24:32 -04:00
Sage Weil 5830c699f2 log_writer: pass log number and whether recycling is enabled to ctor
When we recycle log files, we need to mix the log number into the CRC
for each record.  Note that for logs that don't get recycled (like the
manifest), we always pass a log_number of 0 and false.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-18 21:24:32 -04:00
Sage Weil 666376150c db_impl: recycle log files
If log recycling is enabled, put old WAL files on a recycle queue instead of
deleting them.  When we need a new log file, take a recycled file off the
list if one is available.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-18 21:24:32 -04:00
Sage Weil d666225a0a db_impl: disable recycle_log_files if WAL archive is enabled
We can't recycle the files if they are being archived.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-18 21:21:24 -04:00
Alexey Maykov e1a09a7703 Implementation for GetPropertiesOfTablesInRange
Summary: In MyRocks, it is sometimes important to get propeties only for the subset of the database. This diff implements the API in RocksDB.

Test Plan: ran the GetPropertiesOfTablesInRange

Reviewers: rven, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48651
2015-10-17 13:34:43 -07:00
Yueh-Hsuan Chiang ad471453e8 Allow GetProperty to report the number of currently running flushes / compactions.
Summary:
Add rocksdb.num-running-compactions and rocksdb.num-running-flushes
to GetIntProperty() that reports the number of currently running
compactions / flushes.

Test Plan: augmented existing tests in db_test

Reviewers: igor, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48693
2015-10-17 00:16:36 -07:00
sdong 277dea78f0 Add more kill points
Summary:
Add kill points in:
1. after creating a file
2. before writing a manifest record
3. before syncing manifest
4. before creating a new current file
5. after creating a new current file

Test Plan: Run all current tests.

Reviewers: yhchiang, igor, anthony, IslamAbdelRahman, rven, kradhakrishnan

Reviewed By: kradhakrishnan

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48855
2015-10-16 14:35:12 -07:00
Venkatesh Radhakrishnan a98fbacfa0 Moving memtable related files from util to a new directory memtable
Summary:
We are cleaning up dependencies.
This diff takes a first step at moving memtable files to their own
directory called memtable. In future diffs, we will move other memtable
files from db to memtable.

Test Plan: make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48915
2015-10-16 14:10:33 -07:00
Islam AbdelRahman f55d3009c0 Make db_test_util compile under ROCKSDB_LITE
Summary: db_test_util is used in multiple test files but it dont compile under ROCKSDB_LITE

Test Plan:
make check
make static_lib
OPT=-DROCKSDB_LITE make db_wal_test

Reviewers: igor, yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48579
2015-10-13 17:33:23 -07:00
sdong 35ad531be3 Seperate InternalIterator from Iterator
Summary:
Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type.

This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's.
At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it.

Test Plan: Run all existing tests.

Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven

Reviewed By: rven

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48549
2015-10-13 15:32:13 -07:00
Praveen Rao cc4d13e0a8 Put wal_filter under #ifndef ROCKSDB_LITE 2015-10-13 11:10:14 -07:00
Praveen Rao eb24178553 merge from master 2015-10-12 17:24:21 -07:00