Commit Graph

482 Commits

Author SHA1 Message Date
Yanqin Jin fdf2a0d7eb Fix a compaction bug for write-prepared txn (#9061)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9061

In write-prepared txn, checking a sequence's visibility in a released (old)
snapshot may return "Snapshot released". Suppose we have two snapshots:

```
earliest_snap < earliest_write_conflict_snap
```

If we release `earliest_write_conflict_snap` but keep `earliest_snap` during
bottommost level compaction, then it is possible that a certain sequence of
events can lead to a PUT being seq-zeroed followed by a SingleDelete of the
same key. This violates the ascending order of keys, and will cause data
inconsistency.

Reviewed By: ltamasi

Differential Revision: D31813017

fbshipit-source-id: dc68ba2541d1228489b93cf3edda5f37ed06f285
2021-10-29 15:23:17 -07:00
Jonathan Albrecht e970248602 Add support for building on s390x platform (#8962)
Summary:
This PR adds support for building on s390x including updating travis CI. It uses the previous work in https://github.com/facebook/rocksdb/pull/6168 and adds some more changes to get all current tests (make check and jni tests) to pass. The tests were run with snappy, lz4, bzip2 and zstd all compiled in.

There are a few pieces still needed to get the travis build working that I don't think I can do. adamretter is this something you could help with?

1. A prebuilt https://rocksdb-deps.s3-us-west-2.amazonaws.com/cmake/cmake-3.14.5-Linux-s390x.deb package
2. A https://hub.docker.com/r/evolvedbinary/rocksjava s390x image

Not sure if there is more required for travis. Happy to help in any way I can.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8962

Reviewed By: mrambacher

Differential Revision: D31802198

Pulled By: pdillinger

fbshipit-source-id: 683511466fa6b505f85ba5a9964a268c6151f0c2
2021-10-22 10:13:15 -07:00
Yanqin Jin e1139167ae Inline an empty destructor (#9004)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9004

Inline an empty destructor

Reviewed By: ltamasi

Differential Revision: D31525561

fbshipit-source-id: 3b9e37f06b0c70529a5d2d660de21ea335c73611
2021-10-11 18:14:10 -07:00
Yanqin Jin 1a79839c59 Some code cleanup (#9003)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9003

Clean up some code before the real work.

Reviewed By: ltamasi

Differential Revision: D31525563

fbshipit-source-id: 44558b3594f2200adc7d8621b08b06c77e358a27
2021-10-11 18:14:10 -07:00
mrambacher 13ae16c315 Cleanup includes in dbformat.h (#8930)
Summary:
This header file was including everything and the kitchen sink when it did not need to.  This resulted in many places including this header when they needed other pieces instead.

Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing.

Hopefully, this sort of code hygiene cleanup will speed up the builds...

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930

Reviewed By: pdillinger

Differential Revision: D31142788

Pulled By: mrambacher

fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d
2021-09-29 04:04:40 -07:00
Levi Tamasi be206db351 Deflake MySQLStyleTransactionTest.TransactionStressTest in "status checked" mode (#8947)
Summary:
There is a corner case when using WriteUnprepared transactions when
`WriteUnpreparedTxn::Get` returns `Status::TryAgain` instead of
propagating the result of `GetFromBatchAndDB`. The patch adds
`PermitUncheckedError` to make the `ASSERT_STATUS_CHECKED` build pass in
this case as well.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8947

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D31125422

Pulled By: ltamasi

fbshipit-source-id: 42de51dcfa9384e032244c2b4d3f40e9a4111194
2021-09-22 16:40:25 -07:00
Peter Dillinger 4750421ece Replace most typedef with using= (#8751)
Summary:
Old typedef syntax is confusing

Most but not all changes with

    perl -pi -e 's/typedef (.*) ([a-zA-Z0-9_]+);/using $2 = $1;/g' list_of_files
    make format
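
As a minimal illustration of the rewrite the one-liner performs (the alias names below are made up for this sketch and are not taken from the RocksDB sources), both spellings declare the same type:

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Old style: the alias name is the last token of the declaration.
typedef uint64_t SeqNumOld;
// New style: the alias name comes first, followed by the aliased type.
using SeqNumNew = uint64_t;

// Function-pointer aliases bury the name in the middle with typedef.
typedef int (*CompareFnOld)(const char*, const char*);
using CompareFnNew = int (*)(const char*, const char*);

// Both spellings name exactly the same types.
static_assert(std::is_same<SeqNumOld, SeqNumNew>::value, "same type");
static_assert(std::is_same<CompareFnOld, CompareFnNew>::value, "same type");
```

The regular expression above requires the alias name to sit directly before the semicolon, so it does not match the function-pointer form; such declarations have to be converted by hand.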

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8751

Test Plan: existing

Reviewed By: zhichao-cao

Differential Revision: D30745277

Pulled By: pdillinger

fbshipit-source-id: 6f65f0631c3563382d43347896020413cc2366d9
2021-09-07 11:31:59 -07:00
Yanqin Jin 2b367fa8cc Fix bug caused by releasing snapshot(s) during compaction (#8608)
Summary:
In debug mode, we are seeing an assertion failure as follows:

```
db/compaction/compaction_iterator.cc:980: void rocksdb::CompactionIterator::PrepareOutput(): \
Assertion `ikey_.type != kTypeDeletion && ikey_.type != kTypeSingleDeletion' failed.
```

It is caused by releasing earliest snapshot during compaction between the execution of
`NextFromInput()` and `PrepareOutput()`.

In one case, as demonstrated in unit test `WritePreparedTransaction.ReleaseEarliestSnapshotDuringCompaction_WithSD2`,
incorrect result may be returned by a following range scan if we disable assertion, as in opt compilation
level: the SingleDelete marker's sequence number is zeroed out, but the preceding PUT is also
outputted to the SST file after compaction. Due to the logic of DBIter, the PUT will not be
skipped and will be returned by iterator in range scan. https://github.com/facebook/rocksdb/issues/8661 illustrates what happened.

Fix by taking a more conservative approach: make compaction zero out sequence number only
if key is in the earliest snapshot when the compaction starts.

Another assertion failure is
```
Assertion `current_user_key_snapshot_ == last_snapshot' failed.
```

It's caused by releasing the snapshot between the PUT and SingleDelete during compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8608

Test Plan: make check

Reviewed By: jay-zhuang

Differential Revision: D30145645

Pulled By: riversand963

fbshipit-source-id: 699f58e66faf70732ad53810ccef43935d3bbe81
2021-08-17 22:14:20 -07:00
Yanqin Jin 066b51126d Several simple local code clean-ups (#8565)
Summary:
This PR tries to remove some unnecessary checks as well as unreachable code blocks to
improve readability. An obvious non-public API method naming typo is also corrected.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8565

Test Plan: make check

Reviewed By: lth

Differential Revision: D29963984

Pulled By: riversand963

fbshipit-source-id: cc96e8f09890e5cfe9b20eadb63bdca5484c150a
2021-07-30 12:07:49 -07:00
mrambacher 3aee4fbd41 Make EventListener into a Customizable Class (#8473)
Summary:
- Added Type/CreateFromString
- Added ability to load EventListeners to DBOptions
- Since EventListeners did not previously have a Name(), defaulted to "".  If there is no name, the listener cannot be loaded from the ObjectRegistry.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8473

Reviewed By: zhichao-cao

Differential Revision: D29901488

Pulled By: mrambacher

fbshipit-source-id: 2d3a4aa6db1562ac03e7ad41b360e3521d486254
2021-07-27 07:47:02 -07:00
longlijian 4e4ec16957 Replace the namespace "rocksdb" to "ROCKSDB_NAMESPACE" (#8531)
Summary:
For more details, see https://github.com/facebook/rocksdb/issues/6433
(https://github.com/facebook/rocksdb/pull/6433)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8531

Reviewed By: siying

Differential Revision: D29717057

Pulled By: ajkr

fbshipit-source-id: 3ccad9501e5612590e54a7cf8c447118f323c7f4
2021-07-15 17:23:39 -07:00
Andrew Kryczka ed8eb436db Move slow valgrind tests behind -DROCKSDB_FULL_VALGRIND_RUN (#8475)
Summary:
Various tests had disabled valgrind due to it slowing down and timing
out (as is the case right now) the CI runs. Where a test was disabled with no comment,
I assumed slowness was the cause. For these tests that were slow under
valgrind, as well as the ones identified in https://github.com/facebook/rocksdb/issues/8352, this PR moves them
behind the compiler flag `-DROCKSDB_FULL_VALGRIND_RUN`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8475

Test Plan: running `make full_valgrind_test`, `make valgrind_test`, `make check`; will verify they appear working correctly

Reviewed By: jay-zhuang

Differential Revision: D29504843

Pulled By: ajkr

fbshipit-source-id: 2aac90749cfbd30d5ce11cb29a07a1b9314eeea7
2021-07-07 11:14:05 -07:00
Jay Zhuang f89423a57a Revert "Revert "Snapshot release triggered compaction without multiple tombstones (#8357)" (#8410)" (#8438)
Summary:
This reverts commit 25be1ed66a.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8438

Test Plan: Run the impacted mysql test 40 times

Reviewed By: ajkr

Differential Revision: D29286247

Pulled By: jay-zhuang

fbshipit-source-id: d3bd056971a19a8b012d5d0295fa045c012b3c04
2021-06-22 11:10:03 -07:00
Andrew Kryczka 25be1ed66a Revert "Snapshot release triggered compaction without multiple tombstones (#8357)" (#8410)
Summary:
This reverts commit 9167ece586.

It was found to reliably trip a compaction picking conflict assertion in a MyRocks unit test. We don't understand why yet so reverting in the meantime.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8410

Test Plan: `make check -j48`

Reviewed By: jay-zhuang

Differential Revision: D29150300

Pulled By: ajkr

fbshipit-source-id: 2de8664f355d6da015e84e5fec2e3f90f49741c8
2021-06-15 18:15:15 -07:00
Andrew Kryczka 9167ece586 Snapshot release triggered compaction without multiple tombstones (#8357)
Summary:
This is a duplicate of https://github.com/facebook/rocksdb/issues/4948 by mzhaom to fix tests after rebase.

This change is a follow-up to https://github.com/facebook/rocksdb/issues/4927, which made this possible by allowing tombstone dropping/seqnum zeroing optimizations on the last key in the compaction. Now the `largest_seqno != 0` condition suffices to prevent snapshot release triggered compaction from entering an infinite loop.

The issues caused by the extraneous condition `level_and_file.second->num_deletions > 1` are:

- files could have `largest_seqno > 0` forever making it impossible to tell they cannot contain any covering keys
- it doesn't trigger compaction when there are many overwritten keys. Some MyRocks use cases don't actually use Delete but instead call Put with an empty value to "delete" keys, so we'd like to be able to trigger compaction in this case too.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8357

Test Plan: - make check

Reviewed By: jay-zhuang

Differential Revision: D28855340

Pulled By: ajkr

fbshipit-source-id: a261b51eecafec492499e6d01e8e43112f801798
2021-06-04 00:21:40 -07:00
Andrew Kryczka d76c46e6a0 Deflake TransactionStressTest.ExpiredTransactionDataRace1 (#8258)
Summary:
We saw the `Commit()` fail with "Operation expired" so apparently the
expiration time is too short. Increased the magnitude of the times in
this test to make flakiness less likely.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8258

Reviewed By: jay-zhuang

Differential Revision: D28177033

Pulled By: ajkr

fbshipit-source-id: 0357acee6cc14c104b6ccd39231a683a606ab130
2021-05-12 15:49:05 -07:00
mrambacher ff463742b5 Add Merge Operator support to WriteBatchWithIndex (#8135)
Summary:
The WBWI has two differing modes of operation dependent on the value
of the constructor parameter `overwrite_key`.
Currently, regardless of the parameter, neither mode performs as
expected when using Merge. This PR remedies this by correctly invoking
the appropriate Merge Operator before returning results from the WBWI.

Examples of issues that exist which are solved by this PR:

## Example 1 with `overwrite_key=false`
Currently, from an empty database, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
Get('k1')
```
Incorrectly yields `v2`, that is to say that the Merge behaves like a Put.

## Example 2 with `overwrite_key=true`
Currently, from an empty database, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
Get('k1')
```
Incorrectly yields `ERROR: kMergeInProgress`.

## Example 3 with `overwrite_key=false`
Currently, with a database containing `('k1' -> 'v1')`, the following sequence:
```
Merge('k1', 'v2')
GetFromBatchAndDB('k1')
```
Incorrectly yields `v1,v2`

## Example 4 with `overwrite_key=true`
Currently, with a database containing `('k1' -> 'v1')`, the following sequence:
```
Merge('k1', 'v1')
GetFromBatchAndDB('k1')
```
Incorrectly yields `ERROR: kMergeInProgress`.

## Example 5 with `overwrite_key=false`
Currently, from an empty database, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
GetFromBatchAndDB('k1')
```
Incorrectly yields `v1,v2`

## Example 6 with `overwrite_key=true`
Currently, from an empty database, `('k1' -> 'v1')`, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
GetFromBatchAndDB('k1')
```
Incorrectly yields `ERROR: kMergeInProgress`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8135

Reviewed By: pdillinger

Differential Revision: D27657938

Pulled By: mrambacher

fbshipit-source-id: 0fbda6bbc66bedeba96a84786d90141d776297df
2021-05-10 12:50:25 -07:00
Yanqin Jin d6052d381e Remove duplicate code (#8079)
Summary:
The implementation of TransactionDB::WrapDB() and
TransactionDB::WrapStackableDB() are almost identical, except for the
type of the first argument `db`. This PR adds a new template function in
anonymous namespace, and calls it in the above two functions.
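
The pattern described here can be sketched as follows. This is only an illustration under assumed names: `PlainDbSketch`, `StackableDbSketch`, and `WrapSketch` are hypothetical stand-ins, not the RocksDB classes or the function added by the PR.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for the two argument types; not RocksDB classes.
struct PlainDbSketch { std::string name; };
struct StackableDbSketch { std::string name; };

namespace {
// Shared body, written once. The anonymous namespace gives it internal
// linkage, so it stays private to this translation unit.
template <typename DbType>
std::string WrapSketch(DbType* db) {
  return "txn(" + db->name + ")";
}
}  // namespace

// The two public entry points differ only in the type of `db`.
std::string WrapPlain(PlainDbSketch* db) { return WrapSketch(db); }
std::string WrapStackable(StackableDbSketch* db) { return WrapSketch(db); }
```

Both wrappers forward to the same template, so a later fix to the shared logic only has to be made in one place.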

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8079

Test Plan: make check

Reviewed By: lth

Differential Revision: D27184575

Pulled By: riversand963

fbshipit-source-id: f2855a6db3a7e897d0d611f7050ca4b696c56a7a
2021-03-22 12:29:21 -07:00
mrambacher 3dff28cf9b Use SystemClock* instead of std::shared_ptr<SystemClock> in lower level routines (#8033)
Summary:
For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>.  The shared ptr has some performance degradation on certain hardware classes.

For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere.  For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it.  The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource.

There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold.  In those cases, the shared pointer was preserved.
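
A rough sketch of the ownership arrangement described above, using hypothetical types (`ClockSketch`, `OptionsSketch`) rather than the real `SystemClock` and `ImmutableDBOptions`: one object keeps the `shared_ptr` alive, and lower-level code borrows a raw pointer to the same clock.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>

// Hypothetical clock interface for this sketch only.
class ClockSketch {
 public:
  virtual ~ClockSketch() = default;
  virtual uint64_t NowMicros() const = 0;
};

// A clock that always reports the same time, to keep the sketch deterministic.
class FixedClock : public ClockSketch {
 public:
  explicit FixedClock(uint64_t t) : t_(t) {}
  uint64_t NowMicros() const override { return t_; }

 private:
  uint64_t t_;
};

// Owner: keeps the clock alive for as long as the options object lives.
struct OptionsSketch {
  explicit OptionsSketch(std::shared_ptr<ClockSketch> c)
      : clock_owner(std::move(c)), clock(clock_owner.get()) {}
  std::shared_ptr<ClockSketch> clock_owner;
  ClockSketch* clock;  // non-owning "short cut" to the same object
};

// Lower-level routine: takes a raw pointer, so calling it involves no
// reference-count updates.
uint64_t ElapsedMicros(const ClockSketch* clock, uint64_t start) {
  return clock->NowMicros() - start;
}
```

The raw pointer is only safe while the owning object outlives every user of it, which is the property the message says did not hold for a few classes.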

Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well as or better than 6.17:

6.17: readrandom   :      28.046 micros/op 854902 ops/sec;   61.3 MB/s (355999 of 355999 found)
6.18: readrandom   :      32.615 micros/op 735306 ops/sec;   52.7 MB/s (290999 of 290999 found)
PR: readrandom   :      27.500 micros/op 871909 ops/sec;   62.5 MB/s (367999 of 367999 found)

(Note that the times for 6.18 are prior to revert of the SystemClock).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033

Reviewed By: pdillinger

Differential Revision: D27014563

Pulled By: mrambacher

fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67
2021-03-15 04:34:11 -07:00
Sergei Petrunia c9878baa87 Fix an assertion failure in range locking, locktree code. (#7938)
Summary:
Fix this scenario:
trx1> acquire shared lock on $key
trx2> acquire shared lock on the same $key
trx1> attempt to acquire a unique lock on $key.

Lock acquisition will fail, and deadlock detection will start.
It will call iterate_and_get_overlapping_row_locks() which will
produce a list with two locks (shared locks by trx1 and trx2).

However the code in lock_request::build_wait_graph() was not prepared
to find the lock by the same transaction in the list of conflicting
locks. Fix it to ignore it.

(One may suggest to fix iterate_and_get_overlapping_row_locks() to not
include locks by trx1. This is not a good idea, because that function
is also used to report all locks currently held)
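
The idea of the fix can be sketched in isolation. The code below is not the locktree code; `HeldLockSketch` and `BuildWaitEdges` are hypothetical names, and the sketch only shows a wait-graph builder that skips locks owned by the requesting transaction.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

using TxnId = uint64_t;

// One entry per lock that overlaps the requested key or range.
struct HeldLockSketch {
  TxnId owner;
  bool shared;
};

// Returns the (waiter, holder) edges for a lock request by `requester`.
// Locks already held by the requester itself are skipped: a transaction
// does not wait on itself.
std::set<std::pair<TxnId, TxnId>> BuildWaitEdges(
    TxnId requester, const std::vector<HeldLockSketch>& overlapping) {
  std::set<std::pair<TxnId, TxnId>> edges;
  for (const HeldLockSketch& lock : overlapping) {
    if (lock.owner == requester) continue;  // own lock: ignore it
    edges.insert(std::make_pair(requester, lock.owner));
  }
  return edges;
}
```

For the scenario above (trx1 and trx2 both hold shared locks and trx1 requests an exclusive one), the overlapping list contains both transactions but the only resulting edge is trx1 waiting on trx2.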

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7938

Reviewed By: zhichao-cao

Differential Revision: D26529374

Pulled By: ajkr

fbshipit-source-id: d89cbed008db1a97a8f2351b9bfb75310750d16a
2021-02-18 18:15:19 -08:00
Jay Zhuang 59ba104e4a Fix txn `MultiGet()` return un-committed data with snapshot (#7963)
Summary:
TransactionDB uses read callback to filter out un-committed data before
a snapshot. But `MultiGet()` API doesn't use that at all, which causes
returning unwanted data.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7963

Test Plan: Added unittest to reproduce

Reviewed By: anand1976

Differential Revision: D26455851

Pulled By: jay-zhuang

fbshipit-source-id: 265276698cf9d8c4cd79e3250ef10d14375bac55
2021-02-18 08:49:00 -08:00
Levi Tamasi 974458891c Revert "Turn on memtable bloom filter by default. (#6584)" (#7939)
Summary:
This reverts commit ee79a28963.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7939

Reviewed By: siying

Differential Revision: D26298564

Pulled By: ltamasi

fbshipit-source-id: 6d663516e82e6de436f8d5317932ca9a98e152bd
2021-02-06 22:34:30 -08:00
Andrew Kryczka 8d2bbdd04f Allow range deletions in `*TransactionDB` only when safe (#7929)
Summary:
Explicitly reject all range deletions on `TransactionDB` or `OptimisticTransactionDB`, except when the user provides sufficient promises that allow us to proceed safely. The necessary promises are described in the API doc for `TransactionDB::DeleteRange()`. There is currently no way to provide enough promises to make it safe in `OptimisticTransactionDB`.

Fixes https://github.com/facebook/rocksdb/issues/7913.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7929

Test Plan: unit tests covering the cases it's permitted/rejected

Reviewed By: ltamasi

Differential Revision: D26240254

Pulled By: ajkr

fbshipit-source-id: 2834a0ce64cc3e4c3799e35b885a5e79c2f4f6d9
2021-02-05 15:57:26 -08:00
sdong ee79a28963 Turn on memtable bloom filter by default. (#6584)
Summary:
Memtable bloom filter is useful in many use cases. Turning it on by default with a conservative 1.5% memory budget should benefit more use cases than it hurts.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6584

Test Plan: Run all existing tests.

Reviewed By: pdillinger

Differential Revision: D20626739

fbshipit-source-id: 1dd45532b932139552519b8c2682bd954550c2f9
2021-02-05 12:59:46 -08:00
Adam Retter d5f5d6579a Fix compilation against musl lib C (#7875)
Summary:
See https://github.com/percona/PerconaFT/pull/450

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7875

Reviewed By: ajkr

Differential Revision: D25938020

Pulled By: jay-zhuang

fbshipit-source-id: 9014dbc7b23bf92c5e63bfbdda4565bb0d2f2b58
2021-01-21 08:39:42 -08:00
cheng-chang 736c6dc59f Disable BasicLockEscalation if cannot determine whether TSAN is enabled (#7814)
Summary:
BasicLockEscalation will cause false-positive warnings under TSAN (this is a known issue in TSAN, see details in https://gist.github.com/spetrunia/77274cf2d5848e0a7e090d622695ed4e), skip this test if TSAN is enabled, or if we are not sure whether TSAN is enabled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7814

Test Plan: watch the tsan contrun test to pass.

Reviewed By: zhichao-cao

Differential Revision: D25708094

Pulled By: cheng-chang

fbshipit-source-id: 4fc813ff373301d033d086154cc7bb60a5e95889
2020-12-27 16:18:00 -08:00
Sergei Petrunia daab7603f6 Range Locking: Implementation of range locking (#7506)
Summary:
Range Locking - an implementation based on the locktree library

- Add a RangeTreeLockManager and RangeTreeLockTracker which implement
  range locking using the locktree library.
- Point locks are handled as locks on single-point ranges.
- Add a unit test: range_locking_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7506

Reviewed By: akankshamahajan15

Differential Revision: D25320703

Pulled By: cheng-chang

fbshipit-source-id: f86347384b42ba2b0257d67eca0f45f806b69da7
2020-12-22 19:12:36 -08:00
Sergei Petrunia 1022090981 Apply the changes from: PS-5501 : Re-license PerconaFT 'locktree' to Apache V2 (#7801)
Summary:
commit d5178f513c0b4144a5ac9358ec0f6a3b54a28e76
Author: George O. Lorch III <george.lorch@percona.com>
Date:   Tue Mar 19 12:18:40 2019 -0700

    PS-5501 : Re-license PerconaFT 'locktree' to Apache V2

    - Fixed some incomplete relicensed files from previous round.

    - Added missing license text to some.

    - Relicensed more files to Apache V2 that locktree depends on.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7801

Reviewed By: jay-zhuang

Differential Revision: D25682430

Pulled By: cheng-chang

fbshipit-source-id: deb8a0de3e76f3638672997bfbd300e2fffbe5f5
2020-12-22 14:47:41 -08:00
Peter Dillinger 4d897e51df Migrate away from Travis+Linux+amd64 (#7791)
Summary:
This disables Linux/amd64 builds in Travis for PRs, and adds a
gcc-10+c++20 build in CircleCI, which should fill out sufficient coverage
vs. what we had in Travis

Fixed a use of std::is_pod, which is deprecated in c++20
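
The message does not say which replacement was used. One common substitution, shown here only as a sketch with a hypothetical helper name (`IsPodLike`), relies on the standard's definition of a POD type as one that is both trivial and standard-layout:

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical helper: reports the same property as std::is_pod, spelled
// with traits that are not deprecated in C++20.
template <typename T>
constexpr bool IsPodLike() {
  return std::is_standard_layout<T>::value && std::is_trivial<T>::value;
}

// Plain aggregate: trivial and standard-layout.
struct PlainRecord {
  int id;
  double value;
};

// A user-provided default constructor makes the type non-trivial.
struct WithCtor {
  WithCtor() : id(0) {}
  int id;
};
```

Code that only needs to know whether a type can be copied with `memcpy` can often use the narrower `std::is_trivially_copyable` instead.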

Fixed ++ on a volatile in db_repl_stress.cc, with bigger refactoring.
Although ++ on this volatile was probably ok with one thread writer and
one thread reader, the code was still overly complex. There was a
deadcode check for error
`if (replThread.no_read < dataPump.no_records)` which can be proven
never to happen based on the structure of the code. It infinite loops
instead for the case intended to be checked. I just simplified the code
for what should be the same checking power.

Also most configurations seem to be using make parallelism = 2 * vcores,
so fixing / using that.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7791

Test Plan:
CI
and `while ./db_repl_stress; do echo again; done` for a while

Reviewed By: siying

Differential Revision: D25669834

Pulled By: pdillinger

fbshipit-source-id: b2c688053d0b1d52c989903449d3cd27a04130d6
2020-12-22 00:20:57 -08:00
Akanksha Mahajan 99f5a800c3 Fix clang_analyze error (#7777)
Summary:
Fix clang_analyze error

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7777

Test Plan:
USE_CLANG=1 TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make -j64 analyze

Reviewed By: jay-zhuang

Differential Revision: D25601675

Pulled By: akankshamahajan15

fbshipit-source-id: 30f58cf4d575a2d546c455fb43e856455eb72a07
2020-12-16 21:34:41 -08:00
Adam Retter 8ff6557e7f Add further tests to ASSERT_STATUS_CHECKED (2) (#7698)
Summary:
Second batch of adding more tests to ASSERT_STATUS_CHECKED.

* external_sst_file_basic_test
* checkpoint_test
* db_wal_test
* db_block_cache_test
* db_logical_block_size_cache_test
* db_blob_index_test
* optimistic_transaction_test
* transaction_test
* point_lock_manager_test
* write_prepared_transaction_test
* write_unprepared_transaction_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7698

Reviewed By: cheng-chang

Differential Revision: D25441664

Pulled By: pdillinger

fbshipit-source-id: 9e78867f32321db5d4833e95eb96c5734526ef00
2020-12-09 21:21:16 -08:00
Manuel Ung 71239908cf Invalidate iterator on transaction clear (#7733)
Summary:
Some clients do not close their iterators until after the transaction finishes. To handle this case, we will invalidate any iterators on transaction clear.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7733

Reviewed By: cheng-chang

Differential Revision: D25261158

Pulled By: lth

fbshipit-source-id: b91320f00c54cbe0e6882b794b34f3bb5640dbc0
2020-12-09 19:13:22 -08:00
Sergei Petrunia 98236fb10e LockTree library, originally from PerconaFT (#7753)
Summary:
To be used for implementing Range Locking.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7753

Reviewed By: zhichao-cao

Differential Revision: D25378980

Pulled By: cheng-chang

fbshipit-source-id: 801a9c5cd92a84654ca2586b73e8f69001e89320
2020-12-09 12:10:57 -08:00
Sergei Petrunia d8bd9fc7b3 Range Locking: Allow different LockManagers, add Range Lock definitions (#7443)
Summary:
This PR has two commits:
1.  Modify the code to allow different Lock Managers (of any kind) to be used.  It is implied that a LockManager uses its own custom LockTracker.
2.  Add definitions for Range Locking (class Endpoint and GetRangeLock() function).

cheng-chang, is this what you've had in mind (should the PR have both item 1 and item 2?)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7443

Reviewed By: zhichao-cao

Differential Revision: D24123172

Pulled By: cheng-chang

fbshipit-source-id: c6548ad6d4cc3c25f68d13b29147bc6fdf357185
2020-12-07 20:18:07 -08:00
Levi Tamasi 61932cdf1d Add blob support to DBIter (#7731)
Summary:
The patch adds iterator support to the integrated BlobDB implementation.
Whenever a blob reference is encountered during iteration, the corresponding
blob is retrieved by calling `Version::GetBlob`, assuming the `expose_blob_index`
(formerly `allow_blob`) flag is *not* set. (Note: the flag is set by the old stacked
BlobDB implementation, which has its own blob file handling/blob retrieval logic.)

In addition, `DBIter` now uniformly returns `Status::NotSupported` with the error
message `"BlobDB does not support merge operator."` when encountering a
blob reference while performing a merge (instead of potentially returning a
message that implies the database should be opened using the stacked BlobDB's
`Open`.)

TODO: We can implement support for lazily retrieving the blob value (or in other
words, bypassing the retrieval of blob values based on key) by extending the `Iterator`
API with a new `PrepareValue` method (similarly to `InternalIterator`, which already
supports lazy values).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7731

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D25256293

Pulled By: ltamasi

fbshipit-source-id: c39cd782011495a526cdff99c16f5fca400c4811
2020-12-04 21:29:38 -08:00
Andrew Kryczka 1c5f13f2a5 Fail early when `merge_operator` not configured (#7667)
Summary:
An application may accidentally write merge operands without properly configuring `merge_operator`. We should alert them as early as possible that there's an API misuse. Previously RocksDB only notified them when a query or background operation needed to merge but couldn't. With this PR, RocksDB notifies them of the problem before applying the merge operand to the memtable (although it may already be in WAL, which seems it'd cause a crash loop until they enable `merge_operator`).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7667

Reviewed By: riversand963

Differential Revision: D24933360

Pulled By: ajkr

fbshipit-source-id: 3a4a2ceb0b7aed184113dd03b8efd735a8332f7f
2020-11-16 20:39:01 -08:00
Cheng Chang da42eceabc Skip fsync in txn tests (#7641)
Summary:
The tests often time out in internal infra; skipping fsync should reduce test time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7641

Test Plan: watch existing tests to pass

Reviewed By: anand1976

Differential Revision: D24765098

Pulled By: cheng-chang

fbshipit-source-id: c62bf8110361aee901918d632cf4772435d05e8d
2020-11-06 14:25:14 -08:00
Yanqin Jin 394210f280 Remove unused includes (#7604)
Summary:
This is a PR generated **semi-automatically** by an internal tool to remove unused includes and `using` statements.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7604

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D24579392

Pulled By: riversand963

fbshipit-source-id: c4bfa6c6b08da1de186690d37eb73d8fff45aecd
2020-10-28 23:22:27 -07:00
Cheng Chang 5227b315ec Fix unchecked statuses for transaction_test (#7572)
Summary:
When `ASSERT_STATUS_CHECKED` is enabled, `transaction_test` does not pass without this PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7572

Test Plan: `ASSERT_STATUS_CHECKED=1 make   -j32 transaction_test && ./transaction_test`

Reviewed By: zhichao-cao

Differential Revision: D24404319

Pulled By: cheng-chang

fbshipit-source-id: 13689035995366ab06d8eada3ea404e45fef8bc5
2020-10-21 14:03:59 -07:00
Cheng Chang 0ea7db768e Abstract out LockManager interface (#7532)
Summary:
In order to be able to introduce more locking protocols, we need to abstract out the locking subsystem in TransactionDB into a set of interfaces.

PR https://github.com/facebook/rocksdb/pull/7013 introduces interface `LockTracker`. This PR is a follow up to take the first step to abstract out a `LockManager` interface.

Further modifications to the interface may be needed when introducing the first implementation of range lock. But the idea here is to put the range lock implementation based on range tree under the `utilities/transactions/lock/range/range_tree`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7532

Test Plan: point_lock_manager_test

Reviewed By: ajkr

Differential Revision: D24238731

Pulled By: cheng-chang

fbshipit-source-id: 2a9458cd8b3fb008d9529dbc4d3b28c24631f463
2020-10-19 10:14:42 -07:00
Peter Dillinger 08552b19d3 Genericize and clean up FastRange (#7436)
Summary:
A generic algorithm in progress depends on a templatized
version of fastrange, so this change generalizes it and renames
it to fit our style guidelines, FastRange32, FastRange64, and now
FastRangeGeneric.
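
For readers unfamiliar with the technique, the 32-bit case can be sketched as below. This is the well-known multiply-and-shift range reduction; the function name is made up for this sketch, and the exact signatures in RocksDB may differ.

```cpp
#include <cassert>
#include <cstdint>

// Maps a 32-bit hash onto [0, range) with one multiply and one shift,
// avoiding the slower `hash % range`. The 64-bit product of hash and range
// is less than range * 2^32, so its upper 32 bits are less than range.
inline uint32_t FastRange32Sketch(uint32_t hash, uint32_t range) {
  uint64_t product = static_cast<uint64_t>(hash) * range;
  return static_cast<uint32_t>(product >> 32);
}
```

Unlike the modulo, the result depends mostly on the high bits of the hash, so the hash should be well mixed across all of its bits.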

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7436

Test Plan: added a few more test cases

Reviewed By: jay-zhuang

Differential Revision: D23958153

Pulled By: pdillinger

fbshipit-source-id: 8c3b76101653417804997e5f076623a25586f3e8
2020-09-28 11:35:00 -07:00
sdong 5aacef9712 Disable fsync in SeqAdvanceConcurrentTest (#7302)
Summary:
SeqAdvanceConcurrentTest sometimes runs too long on some platforms. Disable fsync to speed it up.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7302

Test Plan: Run the tests and watch CI.

Reviewed By: ajkr

Differential Revision: D23298192

fbshipit-source-id: 2185eed4e0958c3de5e8a3f94ceed5be5945ed37
2020-08-24 13:22:06 -07:00
Cheng Chang 71c7e4935e Replace tracked_keys with a new LockTracker interface in TransactionDB (#7013)
Summary:
We're going to support more locking protocols such as range lock in transaction.

However, in current design, `TransactionBase` has a member `tracked_keys` which assumes that point lock (lock a single key) is used, and is used in snapshot checking (isolation protocol). When using range lock, we may use read committed instead of snapshot checking as the isolation protocol.

The most significant usage scenarios of `tracked_keys` are:
1. A pessimistic transaction uses it to track the locked keys, and unlocks these keys on commit or rollback.
2. An optimistic transaction does not lock keys upfront; it only tracks the lock intentions in `tracked_keys` and performs write conflict checking at commit.
3. Each `SavePoint` tracks the keys that are locked since the `SavePoint`; `RollbackToSavePoint` and `PopSavePoint` rely on both the tracked keys in `SavePoint`s and `tracked_keys`.

Based on these scenarios, if we can abstract out a `LockTracker` interface to hold a set of tracked locks (can be keys or key ranges), and have methods that can be composed together to implement the scenarios, then `tracked_keys` can be an internal data structure of one implementation of `LockTracker`. See `utilities/transactions/lock/lock_tracker.h` for the detailed interface design, and `utilities/transactions/lock/point_lock_tracker.cc` for the implementation.

In the future, a `RangeLockTracker` can be implemented to track range locks without affecting other components.

After this PR, a clean interface for lock manager should be possible, and then ideally, we can have pluggable locking protocols.
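
To make the shape of the abstraction concrete, here is a much-reduced sketch. All names are hypothetical; the real interface in `utilities/transactions/lock/lock_tracker.h` is richer, and its method names and signatures are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <set>
#include <string>

// Hypothetical, simplified interface: callers see tracked locks only
// through these methods, never through the underlying container.
class LockTrackerSketch {
 public:
  virtual ~LockTrackerSketch() = default;
  virtual void Track(const std::string& key) = 0;
  virtual void Untrack(const std::string& key) = 0;
  virtual bool IsTracked(const std::string& key) const = 0;
  virtual std::size_t Count() const = 0;
};

// Point-lock implementation: the key set is now a private detail.
class PointLockTrackerSketch : public LockTrackerSketch {
 public:
  void Track(const std::string& key) override { keys_.insert(key); }
  void Untrack(const std::string& key) override { keys_.erase(key); }
  bool IsTracked(const std::string& key) const override {
    return keys_.count(key) > 0;
  }
  std::size_t Count() const override { return keys_.size(); }

 private:
  std::set<std::string> keys_;  // plays the role of the old tracked_keys
};

// Transaction code would obtain a tracker through a factory like this one
// and depend only on the interface.
std::unique_ptr<LockTrackerSketch> NewTrackerSketch() {
  return std::unique_ptr<LockTrackerSketch>(new PointLockTrackerSketch());
}
```

A range-lock tracker could then be added as another subclass without touching code that only uses the interface.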

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7013

Test Plan: Run `transaction_test` and `optimistic_transaction_test`.

Reviewed By: ajkr

Differential Revision: D22163706

Pulled By: cheng-chang

fbshipit-source-id: f2860577b5334e31dd2994f5bc6d7c40d502b1b4
2020-08-06 12:38:00 -07:00
sdong 43cc622d09 Add CLANG analyze to CircleCI (#7114)
Summary:
CLANG analyze is useful before pull request. Add it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7114

Test Plan: Watch the CI results to succeed.

Reviewed By: riversand963

Differential Revision: D22491942

fbshipit-source-id: 9ccad91c6142fedc3d3dd491cf55054827908f36
2020-07-13 12:33:16 -07:00
mrambacher c7c7b07f06 More Makefile Cleanup (#7097)
Summary:
Cleans up some of the dependencies on test code in the Makefile while building tools:
- Moves the test::RandomString, DBBaseTest::RandomString into Random
- Moves the test::RandomHumanReadableString into Random
- Moves the DestroyDir method into file_utils
- Moves the SetupSyncPointsToMockDirectIO into sync_point.
- Moves the FaultInjection Env and FS classes under env

These changes allow all of the tools to build without dependencies on test_util, thereby simplifying the build dependencies.  By moving the FaultInjection code, the dependency in db_stress on different libraries for debug vs release was eliminated.

Tested both release and debug builds via Make and CMake for both static and shared libraries.

More work remains to clean up how the tools are built and remove some unnecessary dependencies.  There is also more work that should be done to get the Makefile and CMake to align in their builds -- what is in the libraries and the sizes of the executables are different.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7097

Reviewed By: riversand963

Differential Revision: D22463160

Pulled By: pdillinger

fbshipit-source-id: e19462b53324ab3f0b7c72459dbc73165cc382b2
2020-07-09 14:35:17 -07:00
Jay Zhuang 00de699096 Replace reinterpret_cast with static_cast_with_check (#7067)
Summary:
Replace `reinterpret_cast` with `static_cast_with_check` for `DBImpl` and `ColumnFamilyHandleImpl`.
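
For readers unfamiliar with the idea, a checked static cast can be sketched as below. The helper name `checked_static_cast`, the macro `SKETCH_USE_RTTI`, and the `Base`/`Impl` types are hypothetical stand-ins for illustration, not the RocksDB definitions.

```cpp
#include <cassert>

// Hypothetical sketch: static_cast in all builds, plus a dynamic_cast
// cross-check in builds where RTTI is available (assumed to be signalled
// by the made-up macro SKETCH_USE_RTTI).
template <class Dest, class Src>
inline Dest* checked_static_cast(Src* x) {
  Dest* ret = static_cast<Dest*>(x);
#ifdef SKETCH_USE_RTTI
  assert(ret == dynamic_cast<Dest*>(x));
#endif
  return ret;
}

// Illustrative types standing in for an interface and its implementation.
struct Base {
  virtual ~Base() = default;
};
struct Impl : Base {
  int id = 7;
};
```

Unlike `reinterpret_cast`, this still adjusts the pointer correctly across the class hierarchy, and debug builds with RTTI can catch a wrong downcast.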
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7067

Reviewed By: siying

Differential Revision: D22361587

Pulled By: jay-zhuang

fbshipit-source-id: dfe9e8f3af39c3d27cc372c55ab9ad905eb0a5a1
2020-07-02 19:25:41 -07:00
Peter Dillinger aaece2a98d Fix some defects reported by Coverity Scan (#6933)
Summary:
Confusing checks for null that are never null
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6933

Test Plan: make check

Reviewed By: cheng-chang

Differential Revision: D21885466

Pulled By: pdillinger

fbshipit-source-id: 4b48e03c2a33727f2702b0d12292f9fda5a3c475
2020-06-04 15:46:27 -07:00
sdong afa3518839 Revert "Update googletest from 1.8.1 to 1.10.0 (#6808)" (#6923)
Summary:
This reverts commit 8d87e9cea1.

Based on offline discussions, it's too early to upgrade to gtest 1.10, as it prevents some developers from using an older version of gtest to integrate to some other systems. Revert it for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6923

Reviewed By: pdillinger

Differential Revision: D21864799

fbshipit-source-id: d0726b1ff649fc911b9378f1763316200bd363fc
2020-06-03 15:55:03 -07:00
Zhichao Cao 2adb7e3768 Fix potential overflow of unsigned type in for loop (#6902)
Summary:
`x.size() - 1` or `y - 1` can wrap around to an extremely large value when `x.size()` or `y` is 0 and the type is unsigned. The end condition of `i` in the for loop then becomes extremely large, potentially causing a segmentation fault. Fix them.
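
A small sketch of the failure mode and one common way to avoid it; the function names are hypothetical and the loop body is only an example, not code from this change.

```cpp
#include <cstddef>
#include <vector>

// Shows the wraparound: for an empty vector this returns the largest
// value of the unsigned size type instead of -1.
inline std::size_t WrappedUpperBound(const std::vector<int>& v) {
  return v.size() - 1;
}

// Safe loop shape: move the "- 1" to the other side as "+ 1", so the
// bound is never computed by subtracting from an unsigned zero.
inline std::size_t CountEqualNeighbors(const std::vector<int>& v) {
  std::size_t pairs = 0;
  for (std::size_t i = 0; i + 1 < v.size(); ++i) {
    if (v[i] == v[i + 1]) ++pairs;
  }
  return pairs;
}
```

With the buggy form `i < v.size() - 1`, an empty vector would make the loop read far past the end of the buffer.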
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6902

Test Plan: pass make asan_check

Reviewed By: ajkr

Differential Revision: D21843767

Pulled By: zhichao-cao

fbshipit-source-id: 5b8b88155ac5a93d86246d832e89905a783bb5a1
2020-06-02 15:05:07 -07:00
Adam Retter 8d87e9cea1 Update googletest from 1.8.1 to 1.10.0 (#6808)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6808

Reviewed By: anand1976

Differential Revision: D21483984

Pulled By: pdillinger

fbshipit-source-id: 70c5eff2bd54ddba469761d95e4cd4611fb8e598
2020-06-01 20:33:42 -07:00
Cheng Chang 211088df6e Remove redundant update of txn_state_ in transaction Prepare (#6778)
Summary:
When expiration is set in a pessimistic transaction, `txn_state_` is already updated to `AWAITING_PREPARE` in the `if (expiration_time_ > 0)` block, so there is no need to update the state again in the `if (can_prepare)` block.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6778

Test Plan: make check

Reviewed By: lth

Differential Revision: D21335319

Pulled By: cheng-chang

fbshipit-source-id: 251d634cc7d1a0e86e673a59f0bda8584da5a35f
2020-05-01 17:37:33 -07:00
Cheng Chang ef0c3eda27 Make users explicitly be aware of prepare before commit (#6775)
Summary:
In current commit protocol of pessimistic transaction, if the transaction is not prepared before commit, the commit protocol implicitly assumes that the user wants to commit without prepare.

This PR adds TransactionOptions::skip_prepare, the default value is `true` because if set to `false`, all existing users who commit without prepare need to update their code to set skip_prepare to true. Although this does not force the user to explicitly express their intention of skip_prepare, it at least lets the user be aware of the assumption of being able to commit without prepare.
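
The intended behavior can be sketched with a toy state machine. All types below (`TxnOptionsSketch`, `TxnSketch`, the enums) are hypothetical stand-ins and not the RocksDB classes; only the `skip_prepare` name and its default of `true` come from the description above.

```cpp
// Hypothetical model of the commit protocol described above.
enum class TxnStateSketch { kStarted, kPrepared, kCommitted };
enum class CommitResultSketch { kOk, kNotPrepared, kInvalidState };

struct TxnOptionsSketch {
  bool skip_prepare = true;  // default keeps existing callers working
};

class TxnSketch {
 public:
  explicit TxnSketch(const TxnOptionsSketch& opts) : opts_(opts) {}

  void Prepare() {
    if (state_ == TxnStateSketch::kStarted) state_ = TxnStateSketch::kPrepared;
  }

  CommitResultSketch Commit() {
    if (state_ == TxnStateSketch::kCommitted) {
      return CommitResultSketch::kInvalidState;
    }
    // Committing without prepare is only allowed when the user opted in.
    if (state_ == TxnStateSketch::kStarted && !opts_.skip_prepare) {
      return CommitResultSketch::kNotPrepared;
    }
    state_ = TxnStateSketch::kCommitted;
    return CommitResultSketch::kOk;
  }

 private:
  TxnOptionsSketch opts_;
  TxnStateSketch state_ = TxnStateSketch::kStarted;
};
```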
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6775

Test Plan: added a new unit test TransactionTest::CommitWithoutPrepare

Reviewed By: lth

Differential Revision: D21313270

Pulled By: cheng-chang

fbshipit-source-id: 3d95b7c9b2d6cdddc09bdd66c561bc4fae8c3251
2020-04-30 16:24:20 -07:00
Derrick Pallas 5272305437 Fix FilterBench when RTTI=0 (#6732)
Summary:
The dynamic_cast in the filter benchmark causes release mode to fail due to
no-rtti.  Replace with static_cast_with_check.

Signed-off-by: Derrick Pallas <derrick@pallas.us>

Addition by peterd: Remove unnecessary 2nd template arg on all static_cast_with_check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6732

Reviewed By: ltamasi

Differential Revision: D21304260

Pulled By: pdillinger

fbshipit-source-id: 6e8eb437c4ca5a16dbbfa4053d67c4ad55f1608c
2020-04-29 13:09:23 -07:00
Peter Dillinger 31da5e34c1 C++20 compatibility (#6697)
Summary:
Based on https://github.com/facebook/rocksdb/issues/6648 (CLA Signed), but heavily modified / extended:

* Implicit capture of this via [=] deprecated in C++20, and [=,this] not standard before C++20 -> now using explicit capture lists
* Implicit copy operator deprecated in gcc 9 -> add explicit '= default' definition
* std::random_shuffle deprecated in C++17 and removed in C++20 -> migrated to a replacement in RocksDB random.h API
* Add the ability to build with different std version through -DCMAKE_CXX_STANDARD=11/14/17/20 on the cmake command line
* Minimal rebuild flag of MSVC is deprecated and is forbidden with /std:c++latest (C++20)
* Added MSVC 2019 C++11 & MSVC 2019 C++20 in AppVeyor
* Added GCC 9 C++11 & GCC9 C++20 in Travis
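
The first and third items can be illustrated with a short sketch. The `Counter` class and `ShuffledCopy` helper are hypothetical, and `std::shuffle` is used here as the standard-library replacement for `std::random_shuffle`; the change itself migrated to a helper in RocksDB's random.h instead.

```cpp
#include <algorithm>
#include <functional>
#include <random>
#include <vector>

class Counter {
 public:
  // Explicit capture list: valid in C++11 through C++20, unlike relying on
  // [=] to capture `this` implicitly (deprecated in C++20).
  std::function<int()> MakeIncrementer(int step) {
    return [this, step]() {
      value_ += step;
      return value_;
    };
  }
  int value() const { return value_; }

 private:
  int value_ = 0;
};

// std::random_shuffle is gone in C++20; std::shuffle takes an explicit
// random engine instead of hidden global state.
inline std::vector<int> ShuffledCopy(std::vector<int> v, unsigned seed) {
  std::mt19937 rng(seed);
  std::shuffle(v.begin(), v.end(), rng);
  return v;
}
```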
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6697

Test Plan: make check and CI

Reviewed By: cheng-chang

Differential Revision: D21020318

Pulled By: pdillinger

fbshipit-source-id: 12311be5dbd8675a0e2c817f7ec50fa11c18ab91
2020-04-20 13:24:25 -07:00
sdong 1be3be5522 Auto-Format two recent diffs and add HISTORY.md (#6685)
Summary:
Two recent diffs can be autoformatted.
Also add HISTORY.md entry for https://github.com/facebook/rocksdb/pull/6214
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6685

Test Plan: Run all existing tests

Reviewed By: cheng-chang

Differential Revision: D20965780

fbshipit-source-id: 195b08d7849513d42fe14073112cd19fdda6af95
2020-04-10 11:32:44 -07:00
Cheng Chang d648a0e17f Add unit test for TransactionLockMgr (#6599)
Summary:
Although there are tests related to locking in transaction_test, this new test directly tests against TransactionLockMgr.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6599

Test Plan: make transaction_lock_mgr_test && ./transaction_lock_mgr_test

Reviewed By: lth

Differential Revision: D20673749

Pulled By: cheng-chang

fbshipit-source-id: 1fa4a13218e68d785f5a99924556751a8c5c0f31
2020-04-08 13:51:51 -07:00
Cheng Chang 3881a678d5 Refactor IsLockExpired (#6586)
Summary:
1. If expiration_time is non-positive, no need to call NowMicros, save a syscall.
2. expire_time should only be set when expired is false.
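
A minimal sketch of the two points, assuming a free function with an injected clock. The name `IsLockExpiredSketch` and its signature are hypothetical; the real function takes different arguments.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical sketch of the refactored check.
// `now_micros` stands in for the clock call; `expire_time` is an out-param.
inline bool IsLockExpiredSketch(int64_t expiration_time,
                                const std::function<uint64_t()>& now_micros,
                                uint64_t* expire_time) {
  // 1. A non-positive expiration means "never expires": skip the clock read.
  if (expiration_time <= 0) {
    return false;
  }
  const uint64_t expiration = static_cast<uint64_t>(expiration_time);
  const bool expired = expiration <= now_micros();
  // 2. Only report an expiry time while the lock is still live.
  if (!expired) {
    *expire_time = expiration;
  }
  return expired;
}
```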
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6586

Test Plan: make check

Reviewed By: lth

Differential Revision: D20673730

Pulled By: cheng-chang

fbshipit-source-id: a69e8d7b16dc6d0d00487bb1c19f0710d79482e2
2020-03-27 16:14:22 -07:00
Cheng Chang 2e276973e4 Compute cv_end_time with simpler logic (#6585)
Summary:
The refactored logic is easier to read.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6585

Test Plan: make check

Reviewed By: lth

Differential Revision: D20663225

Pulled By: cheng-chang

fbshipit-source-id: cfd28955cd03b0a71d9087085170875f6dd0be9e
2020-03-27 16:01:23 -07:00
Burton Li 8abd41a544 Fix write_unprepared_transaction_test crash on debug version. (#6574)
Summary:
The last key may hit an index out-of-bounds exception when id = 9.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6574

Reviewed By: riversand963

Differential Revision: D20699791

Pulled By: cheng-chang

fbshipit-source-id: 8e2c5be5ff0e53e9857cfd59cea97cff21446819
2020-03-27 11:12:23 -07:00
Otto Kekäläinen f6c2777d95 Fix spelling: commited -> committed (#6481)
Summary:
In most places in the code the variable names are spelled correctly as
COMMITTED but in a couple places not. This fixes them and ensures the
variable is always called COMMITTED everywhere.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6481

Differential Revision: D20306776

Pulled By: pdillinger

fbshipit-source-id: b6c1bfe41db559b4bc6955c530934460c07f7022
2020-03-06 12:45:20 -08:00
Michael R. Crusoe 051696bf98 fix some spelling typos (#6464)
Summary:
Found from Debian's "Lintian" program
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6464

Differential Revision: D20162862

Pulled By: zhichao-cao

fbshipit-source-id: 06941ee2437b038b2b8045becbe9d2c6fbff3e12
2020-02-28 14:14:03 -08:00
Manuel Ung 41535d0218 WriteUnPrepared: Pass in correct subbatch count during rollback (#6463)
Summary:
Today `WriteUnpreparedTxn::RollbackInternal` will write the rollback batch assuming that there is only a single subbatch. However, because untracked_keys_ are currently not deduplicated, it's possible for duplicate keys to exist, and thus split the batch. In addition, tracked_keys_ does not support comparators other than the bytewise comparator, so it's possible for duplicates to occur there as well.

To solve this, just pass in the correct subbatch count.

Also, removed `WriteUnpreparedRollbackPreReleaseCallback` to unify the Commit/Rollback codepaths some more.

Also, fixed a bug in `CommitInternal` where if 1. two_write_queue is true and 2. include_data is true, then `WriteUnpreparedCommitEntryPreReleaseCallback` ends up calling `AddCommitted` on the commit time write batch a second time on the second write. To fix, `WriteUnpreparedCommitEntryPreReleaseCallback` is re-initialized.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6463

Differential Revision: D20150153

Pulled By: lth

fbshipit-source-id: df0b42d39406c75af73df995aa1138f0db539cd1
2020-02-28 11:19:32 -08:00
sdong fdf882ded2 Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433)
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To give users a way to solve the problem, the RocksDB namespace is changed to a macro that can be overridden at build time.
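
The general pattern can be sketched as follows. The macro name `SKETCH_NAMESPACE`, the default `sketchdb`, and the `Version()` function are made up for the example and are not the names used in the code base.

```cpp
// Hypothetical sketch of a build-time overridable namespace.
// A build could pass -DSKETCH_NAMESPACE=my_private_copy to rename it.
#ifndef SKETCH_NAMESPACE
#define SKETCH_NAMESPACE sketchdb
#endif

namespace SKETCH_NAMESPACE {
inline int Version() { return 6; }
}  // namespace SKETCH_NAMESPACE
```

Callers refer to `SKETCH_NAMESPACE::Version()`, so two copies compiled with different macro values end up with distinct symbol names and no longer collide at link time.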
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433

Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.

Differential Revision: D19977691

fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
2020-02-20 12:09:57 -08:00
Manuel Ung dc23c125c3 WriteUnPrepared: Untracked keys (#6404)
Summary:
For write unprepared, some applications may bypass the transaction api, and write keys directly into the write batch. However, since they are not tracked, rollbacks (both for savepoint and transaction) are not aware that these keys have to be rolled back.

The fix is to track them in `WriteUnpreparedTxn::untracked_keys_`. This is populated whenever we flush unprepared batches into the DB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6404

Differential Revision: D19842023

Pulled By: lth

fbshipit-source-id: a9edfc643d5c905fc89da9a9a9094d30c9b70108
2020-02-14 11:31:39 -08:00
wolfkdy 29e24434fe refine code (#6420)
Summary:
I created a new branch from the new upstream/master branch and ran "git merge --squash".
Maybe it will fix everything.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6420

Differential Revision: D19897152

Pulled By: zhichao-cao

fbshipit-source-id: 6575d9e3b23e360f42ee1480b43028b5fcc20136
2020-02-13 18:55:02 -08:00
Manuel Ung fb571509a7 WriteUnPrepared: Enable WAL during crash recovery (#6418)
Summary:
Unfortunately, it seems like mysqld reuses xids across machine restarts. When that happens, we could have something like the following happening:

```
BEGIN_PREPARE(unprepared) Put(a) END_PREPARE(xid = 1)
-- crash and recover with Put(a) rolled back as it was not prepared
BEGIN_PREPARE(prepared) Put(b) END_PREPARE(xid = 1)
COMMIT(xid = 1)
-- crash and recover with both a, b
```

To solve this, we will have to log the rollback batch into the WAL during recovery.

WritePrepared already logs the rollback batch into the WAL, if a rollback happens after prepare, so there is no problem there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6418

Differential Revision: D19896151

Pulled By: lth

fbshipit-source-id: 2ff65ddc5fe75efd57736fed4b7cd7a109d26609
2020-02-13 18:44:39 -08:00
sdong ac8e89a443 Should flush and sync WAL when writing it in DB::Open() (#6417)
Summary:
A recent fix related to 2pc https://github.com/facebook/rocksdb/pull/6313/ writes something to WAL, but does not flush or sync. This causes assertion failure "impl->TEST_WALBufferIsEmpty()" if manual_wal_flush = true. We should fsync the entry to make sure a second power reset can recover.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6417

Test Plan: Add manual_wal_flush=true case in TransactionTest.DoubleCrashInRecovery and fix a bug in the test so that the bug can be reproduced. It passes with the fix.

Differential Revision: D19894537

fbshipit-source-id: f1e84e49e2269f583c6019743118292cd8b6598e
2020-02-13 18:41:04 -08:00
Yanqin Jin f2fbc5d668 Shorten certain test names to avoid infra failure (#6352)
Summary:
Unit test names, together with other components,  are used to create log files
during some internal testing. Overly long names cause infra failure due to file
names being too long.

Look for internal tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6352

Differential Revision: D19649307

Pulled By: riversand963

fbshipit-source-id: 6f29de096e33c0eaa87d9c8702f810eda50059e7
2020-01-30 23:10:24 -08:00
Maysam Yabandeh 2f973ca96e Double Crash in kPointInTimeRecovery with TransactionDB (#6313)
Summary:
In WritePrepared there could be a gap in sequence numbers. This breaks the trick we use in kPointInTimeRecovery, which assumes the first seq in the log right after the corrupted log is one larger than the last seq we read from the logs. To let this trick keep working, we add a dummy entry with the expected sequence to the first log right after recovery.
Also in WriteCommitted, if the log right after the corrupted log is empty, it has no sequence number to let the sequential trick work, so it is treated as unexpected behavior. This is however expected to happen if we close the db after recovering from a corruption and before writing anything new to it. To remedy that, we apply the same technique by writing a dummy entry to the log that is created after the corrupted log.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6313

Differential Revision: D19458291

Pulled By: maysamyabandeh

fbshipit-source-id: 09bc49e574690085df45b034ca863ff315937e2d
2020-01-29 11:40:55 -08:00
Maysam Yabandeh eff5e076f5 unordered_write incompatible with max_successive_merges (#6284)
Summary:
unordered_write is incompatible with non-zero max_successive_merges. Although we check this at runtime, we currently don't prevent the user from setting this combination in options. This has led to stress tests failing when this combination is tried in ::SetOptions.
The patch fixes that and also reverts the changes performed by https://github.com/facebook/rocksdb/pull/6254, in which max_successive_merges was mistakenly declared incompatible with unordered_write.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6284

Differential Revision: D19356115

Pulled By: maysamyabandeh

fbshipit-source-id: f06dadec777622bd75f267361c022735cf8cecb6
2020-01-10 16:53:19 -08:00
Maysam Yabandeh 5709e97a74 Skip CancelAllBackgroundWork if DBImpl is already closed (#6268)
Summary:
WritePreparedTxnDB calls CancelAllBackgroundWork in its destructor to avoid dangling references to it from background job's SnapshotChecker callback. However, if the DBImpl is already closed, the info log might be closed with it, which causes memory leak when CancelAllBackgroundWork tries to print to the info log. The patch fixes that by calling CancelAllBackgroundWork only if the db is not closed already.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6268

Differential Revision: D19303439

Pulled By: maysamyabandeh

fbshipit-source-id: 4228a6be7e78d43c90630347baa89b008200bd15
2020-01-07 15:34:27 -08:00
wolfkdy 1ab1231acf parallel occ (#6240)
Summary:
This is a continuation of https://github.com/facebook/rocksdb/pull/5320/files
I opened a new MR for the purposes below. Half a year has passed since the old MR was posted, so it's almost impossible to fulfill some of these points on the old MR, especially 5).
1) add validation modes for optimistic txns
2) modify unittests to test both modes
3) make format
4) refine hash functor
5) push to master
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6240

Differential Revision: D19301296

fbshipit-source-id: 5b5b3cbd39558f43947f7d2dec6cd31a06386edb
2020-01-07 14:20:38 -08:00
Maysam Yabandeh 48a678b7c9 Prevent an incompatible combination of options (#6254)
Summary:
allow_concurrent_memtable_write is incompatible with non-zero max_successive_merges. Although we check this at runtime, we currently don't prevent the user from setting this combination in options. This has led to stress tests failing when this combination is tried in ::SetOptions. The patch fixes that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6254

Differential Revision: D19265819

Pulled By: maysamyabandeh

fbshipit-source-id: 47f2e2dc26fe0972c7152f4da15dadb9703f1179
2020-01-02 16:15:06 -08:00
anand1976 1be48cb895 Fix crash in Transaction::MultiGet() when num_keys > 32
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6192

Test Plan:
Add a unit test that fails without the fix and passes now
make check

Differential Revision: D19124781

Pulled By: anand1976

fbshipit-source-id: 8c8cb6fa16c3fc23ec011e168561a13f76bbd783
2019-12-16 20:39:35 -08:00
Adam Retter 6d58ea901d Fix compilation under MSVC VS2015 (#6081)
Summary:
**NOTE**: this also needs to be back-ported to 6.4.6 and possibly older branches if further releases from them are envisaged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6081

Differential Revision: D18710107

Pulled By: zhichao-cao

fbshipit-source-id: 03260f9316566e2bfc12c7d702d6338bb7941e01
2019-11-26 18:24:09 -08:00
Maysam Yabandeh 0058daef7b Disable SmallestUnCommittedSeq in Valgrind run (#6035)
Summary:
SmallestUnCommittedSeq sometimes takes too long when run under Valgrind. The patch disables it when the tests are run under Valgrind.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6035

Differential Revision: D18509198

Pulled By: maysamyabandeh

fbshipit-source-id: 1191443b9fedb6b9c50d6b76f5c92371f5030230
2019-11-14 14:41:52 -08:00
Sergei Petrunia 230bcae7b6 Add a limited support for iteration bounds into BaseDeltaIterator (#5403)
Summary:
For MDEV-19670: MyRocks: key lookups into deleted data are very slow

BaseDeltaIterator remembers iterate_upper_bound and will not let delta_iterator_
walk above the iterate_upper_bound if base_iterator_ is not valid
anymore.

== Rationale ==
The most straightforward way would be to make the delta_iterator
(which is a rocksdb::WBWIIterator) to support iterator bounds. But
checking for bounds has an extra CPU overhead.

So we put the check into BaseDeltaIterator, and only make it when
base_iterator_ is not valid.

(note: We could take it even further, and move the check a few lines
down, and only check iterator bounds ourselves if base_iterator_ is
not valid AND delta_iterator_ hit a tombstone).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5403

Differential Revision: D15863092

Pulled By: maysamyabandeh

fbshipit-source-id: 8da458e7b9af95ff49356666f69664b4a6ccf49b
2019-11-05 11:39:36 -08:00
Maysam Yabandeh 52733b4498 WritePrepared: Fix flaky test MaxCatchupWithNewSnapshot (#5850)
Summary:
MaxCatchupWithNewSnapshot tests that the snapshot sequence number will be larger than the max sequence number when the snapshot was taken. However since the test does not have access to the max sequence number when the snapshot was taken, it uses max sequence number after that, which could have advanced the snapshot by then, thus making the test flaky.
The fix is to compare with max sequence number before the snapshot was taken, which is a lower bound for the value when the snapshot was taken.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5850

Test Plan: ~/gtest-parallel/gtest-parallel --repeat=12800 ./write_prepared_transaction_test --gtest_filter="*MaxCatchupWithNewSnapshot*"

Differential Revision: D17608926

Pulled By: maysamyabandeh

fbshipit-source-id: b122ae5a27f982b290bd60da852e28d3c5eb0136
2019-11-04 16:23:57 -08:00
Peter Dillinger ca7ccbe2ea Misc hashing updates / upgrades (#5909)
Summary:
- Updated our included xxhash implementation to version 0.7.2 (== the latest dev version as of 2019-10-09).
- Using XXH_NAMESPACE (like other fb projects) to avoid potential name collisions.
- Added fastrange64, and unit tests for it and fastrange32. These are faster alternatives to hash % range.
- Use preview version of XXH3 instead of MurmurHash64A for NPHash64
-- Had to update cache_test to increase probability of passing for any given hash function.
- Use fastrange64 instead of % with uses of NPHash64
-- Had to fix WritePreparedTransactionTest.CommitOfDelayedPrepared to avoid deadlock apparently caused by new hash collision.
- Set default seed for NPHash64 because specifying a seed rarely makes sense for it.
- Removed unnecessary include xxhash.h in a popular .h file
- Rename preview version of XXH3 to XXH3p for clarity and to ease backward compatibility in case final version of XXH3 is integrated.
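
For context, the 32-bit form of the fastrange idea can be sketched as below. The function name `FastRange32Sketch` is hypothetical; the 64-bit variant needs the high half of a 128-bit product and is omitted here because that is not expressible in portable C++11.

```cpp
#include <cstdint>

// Maps a 32-bit hash onto [0, range) with a multiply and a shift instead
// of a modulo: take the high 32 bits of the 64-bit product hash * range.
inline uint32_t FastRange32Sketch(uint32_t hash, uint32_t range) {
  const uint64_t product = static_cast<uint64_t>(hash) * range;
  return static_cast<uint32_t>(product >> 32);
}
```

The result is always below `range` because `hash` is below 2^32, and the mapping uses the high bits of the hash rather than the low bits that `%` would favor.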

Relying on existing unit tests for NPHash64-related changes. Each new implementation of fastrange64 passed unit tests when manipulating my local build to select it. I haven't done any integration performance tests, but I consider the improved performance of the pieces being swapped in to be well established.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5909

Differential Revision: D18125196

Pulled By: pdillinger

fbshipit-source-id: f6bf83d49d20cbb2549926adf454fd035f0ecc0d
2019-10-24 17:16:46 -07:00
jsteemann da3b2840cb save a few redundant container lookups (#5875)
Summary:
This PR eliminates repeated lookups in associative or ordered containers when a single lookup suffices.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5875

Differential Revision: D17753172

Pulled By: anand1976

fbshipit-source-id: 796b02b760082521d8c42a1cb65a76bf0e6c1b8e
2019-10-07 12:28:09 -07:00
sdong e8263dbdaa Apply formatter to recent 200+ commits. (#5830)
Summary:
Further apply formatter to more recent commits.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5830

Test Plan: Run all existing tests.

Differential Revision: D17488031

fbshipit-source-id: 137458fd94d56dd271b8b40c522b03036943a2ab
2019-09-20 12:04:26 -07:00
sdong c06b54d0c6 Apply formatter on recent 45 commits. (#5827)
Summary:
Some recent commits might not have passed through the formatter. I formatted recent 45 commits. The script hangs for more commits so I stopped there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5827

Test Plan: Run all existing tests.

Differential Revision: D17483727

fbshipit-source-id: af23113ee63015d8a43d89a3bc2c1056189afe8f
2019-09-19 12:34:17 -07:00
anand76 83a6a614e9 Refactor ArenaWrappedDBIter into separate files (#5801)
Summary:
Move definition and implementation for ArenaWrappedDBIter into its own .h/.cc files. Also, change inlining of functions to better comply with the Google C++ style guide.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5801

Test Plan: make check

Differential Revision: D17371012

Pulled By: anand1976

fbshipit-source-id: c1361abc2851575111e357a63d88be3b3d6cb341
2019-09-13 13:50:43 -07:00
Shylock Hg 9eb3e1f77d Use delete to disable automatic generated methods. (#5009)
Summary:
Use delete to disable automatically generated methods instead of declaring them private, and group the constructors together for clarity. This modification causes an unused-field warning, so add the unused attribute to suppress it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5009

Differential Revision: D17288733

fbshipit-source-id: 8a767ce096f185f1db01bd28fc88fef1cdd921f3
2019-09-11 18:09:00 -07:00
Wilfried Goesgens fbab9913e2 upgrade gtest 1.7.0 => 1.8.1 for json result writing
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5332

Differential Revision: D17242232

fbshipit-source-id: c0d4646556a1335e51ac7382b986ca7f6ced7b64
2019-09-09 11:24:11 -07:00
Maysam Yabandeh 78b8cfc7ec WriteUnPrepared: Split ReadYourOwnWriteStress to three (#5776)
Summary:
ReadYourOwnWriteStress occasionally times out on some platforms. The patch splits it to three.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5776

Differential Revision: D17231743

Pulled By: maysamyabandeh

fbshipit-source-id: d42eeaf22f61a48d50f9c404d98b1081ae8dac94
2019-09-06 15:25:26 -07:00
Manuel Ung 2208cc0196 Fix build break in TransactionBaseImpl::TrackKey (#5771)
Summary:
Fix build broken in https://github.com/facebook/rocksdb/pull/5696.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5771

Differential Revision: D17217665

Pulled By: lth

fbshipit-source-id: 7aa84a2a9b4feb7a3ab1cab174e09276430fe042
2019-09-06 10:18:04 -07:00
jsteemann 19e8c9b64f use c++17's try_emplace if available (#5696)
Summary:
This avoids rehashing the key in TrackKey() in case the key is not already
in the map of tracked keys, which will happen at least once per key used in a
transaction.
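
A sketch of the single-lookup pattern, with a fallback for pre-C++17 compilers. The `KeyInfoSketch` struct, the `TrackKeySketch` function, and the "keep the smallest sequence number" rule are assumptions for illustration, not the actual `TrackKey()` code.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical per-key bookkeeping.
struct KeyInfoSketch {
  explicit KeyInfoSketch(uint64_t s) : seq(s) {}
  uint64_t seq;
  uint32_t num_tracked = 0;
};

using TrackedKeysSketch = std::unordered_map<std::string, KeyInfoSketch>;

inline KeyInfoSketch& TrackKeySketch(TrackedKeysSketch& keys,
                                     const std::string& key, uint64_t seq) {
  // One hash lookup either finds the entry or inserts it; no separate
  // find() followed by insert().
#if __cplusplus >= 201703L
  auto result = keys.try_emplace(key, seq);
#else
  auto result = keys.emplace(key, KeyInfoSketch(seq));
#endif
  KeyInfoSketch& info = result.first->second;
  if (!result.second && seq < info.seq) {
    info.seq = seq;  // assumption: remember the earliest sequence number
  }
  ++info.num_tracked;
  return info;
}
```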

Additionally fix two typos.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5696

Differential Revision: D17210178

Pulled By: lth

fbshipit-source-id: 7e2c28e9e505c1d1c1535d435250cf2b191a6fdf
2019-09-05 13:59:40 -07:00
Maysam Yabandeh f9fb9f1421 Add a unit test to detect infinite loops with reseek optimizations (#5727)
Summary:
Iterators reseek to the target key after iterating over max_sequential_skip_in_iterations invalid values. The logic is susceptible to an infinite loop bug, which has been present with WritePrepared Transactions up until 6.2 release. Although the bug is not present on master, the patch adds a unit test to prevent it from resurfacing again.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5727

Differential Revision: D16952759

Pulled By: maysamyabandeh

fbshipit-source-id: d0d973dddc8dfabd5a794931232aa4c862c74f51
2019-09-04 14:31:10 -07:00
Pratik Dhandharia 1b4c104a67 replace some reinterpret_cast with static_cast_with_check (#5740)
Summary:
This PR focuses on replacing some of the reinterpret_cast<DBImpl*> to static_cast_with_check<DBImpl, DB>.

Files impacted:

./db/db_impl/db_impl_compaction_flush.cc
./db/write_batch.cc
./utilities/blob_db/blob_db_impl.cc
./utilities/transactions/pessimistic_transaction_db.cc
./utilities/transactions/transaction_base.cc
./utilities/transactions/write_prepared_txn_db.cc
./utilities/transactions/write_unprepared_txn_db.cc
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5740

Differential Revision: D17055691

Pulled By: pdhandharia

fbshipit-source-id: 0f8034d1b32eade56e37d59c04b7bf236a81d8e8
2019-08-27 10:59:11 -07:00
Zhongyi Xie 2f41ecfe75 Refactor trimming logic for immutable memtables (#5022)
Summary:
MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory.
We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one.
The semantics of the old option actually work as both an upper bound and a lower bound. History trimming will start if the number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming.
In order to mimic this behavior with the new option, history trimming will stop if dropping the next immutable memtable would cause the total memory usage to go below the size limit. For example, assume the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20MB, 30MB and 30MB. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable would reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped.
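
The stopping rule and the 20/30/30 example can be sketched as a small function. `CountTrimmableSketch` is a hypothetical name, and sizes are passed oldest first in arbitrary units (MB in the example); this is not the memtable list code itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Returns how many of the oldest memtables would be dropped under the rule
// above: trim only while usage exceeds the limit, and stop as soon as the
// next drop would take usage below the limit.
inline std::size_t CountTrimmableSketch(
    const std::vector<uint64_t>& sizes_oldest_first, uint64_t limit) {
  uint64_t total = std::accumulate(sizes_oldest_first.begin(),
                                   sizes_oldest_first.end(), uint64_t{0});
  std::size_t dropped = 0;
  for (uint64_t size : sizes_oldest_first) {
    if (total <= limit) break;        // not over the limit: nothing to trim
    if (total - size < limit) break;  // dropping would undershoot the limit
    total -= size;
    ++dropped;
  }
  return dropped;
}
```

With sizes 20, 30, 30 and a limit of 64 nothing is dropped; adding a fourth memtable of size 30 makes the oldest one droppable, since 110 - 20 = 90 is still at or above 64.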
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022

Differential Revision: D14394062

Pulled By: miasantreble

fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5
2019-08-23 13:55:34 -07:00
sdong e0515607bc Blacklist TransactionTest.GetWithoutSnapshot from valgrind_test (#5715)
Summary:
In valgrind_test, TransactionTest.GetWithoutSnapshot ran for 2 hours and still didn't finish. Blacklist it from valgrind_test to prevent the timeout.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5715

Test Plan: run "make valgrind_test" and see whether the test is still generated.

Differential Revision: D16866009

fbshipit-source-id: 92c78049b0bc1c2b9a0dfc1b7c8a9206b36f02f0
2019-08-16 15:36:49 -07:00
Manuel Ung 7785f61132 WriteUnPrepared: Fix bug in savepoints (#5703)
Summary:
Fix a bug in write unprepared savepoints. When flushing the write batch according to savepoint boundaries, we were forgetting to flush the last write batch after the last savepoint, meaning that some data was not written to DB.

Also, add a small optimization where we avoid flushing empty batches.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5703

Differential Revision: D16811996

Pulled By: lth

fbshipit-source-id: 600c7e0e520ad7a8fad32d77e11d932453e68e3f
2019-08-14 16:15:46 -07:00
Manuel Ung 4c70cb7306 WriteUnPrepared: support iterating while writing to transaction (#5699)
Summary:
In MyRocks, there are cases where we write while iterating through keys. This currently breaks WBWIIterator, because if a write batch flushes during iteration, the delta iterator would point to invalid memory.

For now, fix by disallowing flush if there are active iterators. In the future, we will loop through all the iterators on a transaction, and refresh the iterators when a write batch is flushed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5699

Differential Revision: D16794157

Pulled By: lth

fbshipit-source-id: 5d5bf70688bd68fe58e8a766475ae88fd1be3190
2019-08-14 14:28:53 -07:00
Zhongyi Xie 90cd6c2bb1 Fix double deletion in transaction_test (#5700)
Summary:
Fix the following clang analyze failures:
```
In file included from utilities/transactions/transaction_test.cc:8:
./utilities/transactions/transaction_test.h:174:14: warning: Attempt to delete released memory
      delete root_db;
             ^
```
The destructor of StackableDB already deletes the root db and there is no need to delete the db separately.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5700

Test Plan: USE_CLANG=1 TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make -j24 analyze

Differential Revision: D16800579

Pulled By: maysamyabandeh

fbshipit-source-id: 64c2d70f23e07e6a15242add97c744902ea33be5
2019-08-13 21:54:55 -07:00
Manuel Ung 8a678a50ba WriteUnPrepared: Relax restriction on iterators and writes with no snapshot (#5697)
Summary:
Currently, if a write is done without a snapshot, then `largest_validated_seq_` is set to `kMaxSequenceNumber`. This is too aggressive, because an iterator with a snapshot created after this write should be valid.

Set `largest_validated_seq_` to `GetLastPublishedSequence` instead. The variable means that no keys in the current tracked key set have been changed by other transactions since `largest_validated_seq_`.

Also, do some extra cleanup in Clear() for safety.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5697

Differential Revision: D16788613

Pulled By: lth

fbshipit-source-id: f2aa40b8b12e0c0cf9e38c940fecc8f1cc0d2385
2019-08-13 13:11:51 -07:00
Maysam Yabandeh 64855979ae WriteUnPrepared: Pass snap_released to the callback (#5691)
Summary:
With changes made in https://github.com/facebook/rocksdb/pull/5664 we meant to pass snap_released parameter of ::IsInSnapshot from the read callbacks. Although the variable was defined, passing it to the callback in WritePreparedTxnReadCallback was missing, which is fixed in this PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5691

Differential Revision: D16767310

Pulled By: maysamyabandeh

fbshipit-source-id: 3bf53f5964a2756a66ceef7c8f6b3ac75f102f48
2019-08-12 12:20:46 -07:00
Manuel Ung 6f0f82de87 WriteUnPrepared: increase test coverage in transaction_test (#5658)
Summary:
This changes transaction_test to set `txn_db_options.default_write_batch_flush_threshold = 1` in order to give better test coverage for WriteUnprepared.

As part of the change, some tests had to be updated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5658

Differential Revision: D16740468

Pulled By: lth

fbshipit-source-id: 3821eec20baf13917c8c1fab444332f75a509de9
2019-08-12 12:16:04 -07:00
Maysam Yabandeh 12eaacb71d WritePrepared: Fix SmallestUnCommittedSeq bug (#5683)
Summary:
SmallestUnCommittedSeq reads two data structures, prepared_txns_ and delayed_prepared_. These two are updated in CheckPreparedAgainstMax when max_evicted_seq_ advances some prepared entries. To avoid the cost of acquiring a mutex, the read from them in SmallestUnCommittedSeq is not atomic. This creates a potential race condition.
The fix is to read the two data structures in the reverse order of their update. CheckPreparedAgainstMax copies the prepared entry to delayed_prepared_ before removing it from prepared_txns_ and SmallestUnCommittedSeq looks into prepared_txns_ before reading delayed_prepared_.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5683

Differential Revision: D16744699

Pulled By: maysamyabandeh

fbshipit-source-id: b1bdb134018beb0b9de58827f512662bea35cad0
2019-08-09 16:40:00 -07:00
Vijay Nadimpalli d150e01474 New API to get all merge operands for a Key (#5604)
Summary:
This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases:
1. Update subset of columns and read subset of columns -
Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU.
2. Updating very few attributes in a value which is a JSON-like document -
Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge.
----------------------------------------------------------------------------------------------------
API :
Status GetMergeOperands(
      const ReadOptions& options, ColumnFamilyHandle* column_family,
      const Slice& key, PinnableSlice* merge_operands,
      GetMergeOperandsOptions* get_merge_operands_options,
      int* number_of_operands)

Example usage :
int size = 100;
int number_of_operands = 0;
std::vector<PinnableSlice> values(size);
GetMergeOperandsOptions merge_operands_info;
db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), &merge_operands_info, &number_of_operands);

Description :
Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than merge_operands_options.expected_max_number_of_operands no merge operands are returned and status is Incomplete. Merge operands returned are in the order of insertion.
merge_operands: points to an array of at least merge_operands_options.expected_max_number_of_operands elements; the caller is responsible for allocating it. If the status returned is Incomplete, then number_of_operands will contain the total number of merge operands found in DB for the key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604

Test Plan:
Added unit test and perf test in db_bench that can be run using the command:
./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist

Differential Revision: D16657366

Pulled By: vjnadimpalli

fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf
2019-08-06 14:26:44 -07:00
Maysam Yabandeh 208556ee13 WritePrepared: fix Get without snapshot (#5664)
Summary:
If read_options.snapshot is not set, ::Get will take the last sequence number after taking a super-version and use that as the sequence number. Theoretically max_evicted_seq_ could advance past this sequence number. This could lead ::IsInSnapshot, which is invoked by the ReadCallback, to notice the absence of the snapshot. In this case, the ReadCallback should have passed a non-null value to snap_released so that it could be set by ::IsInSnapshot. The patch does that, and adds a unit test to verify it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5664

Differential Revision: D16614033

Pulled By: maysamyabandeh

fbshipit-source-id: 06fb3fd4aacd75806ed1a1acec7961f5d02486f2
2019-08-05 13:41:21 -07:00
Maysam Yabandeh e579e32eaa Disable ReadYourOwnWriteStress when run under Valgrind (#5671)
Summary:
It sometimes times out when run under valgrind taking around 20m. The patch skips the test under Valgrind.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5671

Differential Revision: D16652382

Pulled By: maysamyabandeh

fbshipit-source-id: 0f6f4f76d37337d56226b689e01b14523dd07aae
2019-08-05 13:35:39 -07:00
Manuel Ung f622ca2c7c WriteUnPrepared: savepoint support (#5627)
Summary:
Add savepoint support when the current transaction has flushed unprepared batches.

Rolling back to a savepoint is similar to rolling back a transaction. It requires the set of keys that have changed since the savepoint, re-reading the keys at the snapshot at that savepoint, and then restoring the old keys by writing out another unprepared batch.

For this strategy to work though, we must be capable of reading keys at a savepoint. This does not work if keys were written out using the same sequence number before and after a savepoint. Therefore, when we flush out unprepared batches, we must split the batch by savepoint if any savepoints exist.

eg. If we have the following:
```
Put(A)
Put(B)
Put(C)
SetSavePoint()
Put(D)
Put(E)
SetSavePoint()
Put(F)
```

Then we will write out 3 separate unprepared batches:
```
Put(A) 1
Put(B) 1
Put(C) 1
Put(D) 2
Put(E) 2
Put(F) 3
```

This is so that when we rollback to eg. the first savepoint, we can just read keys at snapshot_seq = 1.
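A minimal sketch of the splitting rule illustrated above, under a simplified model: a transaction is a list of written keys plus the positions at which SetSavePoint() was called. `SplitBySavePoint` and its parameters are hypothetical names chosen for illustration; this is not the WriteUnprepared flush code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model: `keys` are the keys written by Put, in order, and
// `savepoints[i]` is the number of writes that preceded the i-th
// SetSavePoint() call (ascending). Returns (key, batch number) pairs,
// with batch numbers starting at 1.
inline std::vector<std::pair<std::string, int>> SplitBySavePoint(
    const std::vector<std::string>& keys,
    const std::vector<std::size_t>& savepoints) {
  std::vector<std::pair<std::string, int>> out;
  int batch = 1;
  std::size_t next_sp = 0;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    // Cross every savepoint boundary that sits before write i.
    while (next_sp < savepoints.size() && savepoints[next_sp] <= i) {
      assert(next_sp == 0 || savepoints[next_sp - 1] <= savepoints[next_sp]);
      // Start a new batch only if the current one already holds a write,
      // so that no empty batch is produced.
      if (!out.empty() && out.back().second == batch) ++batch;
      ++next_sp;
    }
    out.emplace_back(keys[i], batch);
  }
  return out;
}
```

For the example above (keys A to F, savepoints after the 3rd and 5th writes), A, B, C land in batch 1, D and E in batch 2, and F in batch 3.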
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5627

Differential Revision: D16584130

Pulled By: lth

fbshipit-source-id: 6d100dd548fb20c4b76661bd0f8a2647e64477fa
2019-07-31 13:39:39 -07:00
Manuel Ung d599135a03 WriteUnPrepared: use WriteUnpreparedTxnReadCallback for ValidateSnapshot (#5657)
Summary:
In DeferSnapshotSavePointTest, writes were failing with snapshot validation error because the key with the latest sequence number was an unprepared key from the current transaction.

Fix this by passing down the correct read callback.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5657

Differential Revision: D16582466

Pulled By: lth

fbshipit-source-id: 11645dac0e7c1374d917ef5fdf757d13c1d1108d
2019-07-31 10:44:56 -07:00
Manuel Ung 399f477818 WriteUnPrepared: Use WriteUnpreparedTxnReadCallback for MultiGet (#5634)
Summary:
`TransactionTest.MultiGetBatchedTest` was failing with unprepared batches because we were not using the correct callbacks. Override MultiGet to pass down the correct ReadCallback. A similar problem is also fixed in WritePrepared.

This PR also fixes an issue similar to (https://github.com/facebook/rocksdb/pull/5147), but for MultiGet instead of Get.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5634

Differential Revision: D16552674

Pulled By: lth

fbshipit-source-id: 736eaf8e919c6b13d5f5655b1c0d36b57ad04804
2019-07-29 17:56:13 -07:00
Manuel Ung 80d7067cb2 Use int64_t instead of ssize_t (#5638)
Summary:
The ssize_t type was introduced in https://github.com/facebook/rocksdb/pull/5633, but it seems like it's a POSIX specific type.

I just need a signed type to represent number of bytes, so use int64_t instead. It seems like we have a typedef from SSIZE_T for Windows, but it doesn't seem like we ever include "port/port.h" in our public header files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5638

Differential Revision: D16526269

Pulled By: lth

fbshipit-source-id: 8d3a5c41003951b74b29bc5f1d949b2b22da0cee
2019-07-26 16:36:49 -07:00
Manuel Ung 41df734830 WriteUnPrepared: Add new variable write_batch_flush_threshold (#5633)
Summary:
Instead of reusing `TransactionOptions::max_write_batch_size` for determining when to flush a write batch for write unprepared, add a new variable called `write_batch_flush_threshold` for this use case instead.

Also add `TransactionDBOptions::default_write_batch_flush_threshold` which sets the default value if `TransactionOptions::write_batch_flush_threshold` is unspecified.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5633

Differential Revision: D16520364

Pulled By: lth

fbshipit-source-id: d75ae5a2141ce7708982d5069dc3f0b58d250e8c
2019-07-26 12:56:26 -07:00
Manuel Ung 230b909da8 Fix PopSavePoint to merge info into the previous savepoint (#5628)
Summary:
Transaction::RollbackToSavePoint undoes the modifications made since the beginning of the SavePoint, and also unlocks the corresponding keys, which are tracked in the last SavePoint. Currently ::PopSavePoint simply discards these tracked keys, leaving them locked in the lock manager. This breaks a subsequent ::RollbackToSavePoint, as it loses track of such keys and thus cannot unlock them. The patch fixes ::PopSavePoint by passing the tracked key information on to the previous SavePoint.
Fixes https://github.com/facebook/rocksdb/issues/5618
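The idea behind the PopSavePoint fix can be sketched with a simplified savepoint stack. `SavePointStack` and its members are hypothetical; the real transaction classes track considerably more state than a set of key names.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified model: each savepoint records the keys locked
// since it was set; `base_` holds keys locked before any savepoint.
class SavePointStack {
 public:
  void SetSavePoint() { stack_.emplace_back(); }

  void TrackKey(const std::string& key) {
    if (stack_.empty()) {
      base_.insert(key);
    } else {
      stack_.back().insert(key);
    }
  }

  // Drop the newest savepoint but hand its tracked keys to the previous
  // savepoint (or to the base state) instead of discarding them.
  void PopSavePoint() {
    assert(!stack_.empty());
    std::set<std::string> top = std::move(stack_.back());
    stack_.pop_back();
    std::set<std::string>& dest = stack_.empty() ? base_ : stack_.back();
    dest.insert(top.begin(), top.end());
  }

  // Returns the keys that must be unlocked when rolling back.
  std::set<std::string> RollbackToSavePoint() {
    assert(!stack_.empty());
    std::set<std::string> to_unlock = std::move(stack_.back());
    stack_.pop_back();
    return to_unlock;
  }

 private:
  std::set<std::string> base_;
  std::vector<std::set<std::string>> stack_;
};
```

In this model, a key tracked under a popped savepoint is still returned by a later RollbackToSavePoint on the enclosing savepoint, so it can be unlocked.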
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5628

Differential Revision: D16505325

Pulled By: lth

fbshipit-source-id: 2bc3b30963ab4d36d996d1f66543c93abf358980
2019-07-26 11:39:30 -07:00
Manuel Ung 66b524a911 Simplify WriteUnpreparedTxnReadCallback and fix some comments (#5621)
Summary:
Simplify WriteUnpreparedTxnReadCallback so we just have one function `CalcMaxVisibleSeq`. Also, there's no need for the read callback to hold onto the transaction any more, so just hold the set of unprep_seqs, reducing the amount of indirection in `IsVisibleFullCheck`.

Also, some comments about using transaction snapshot were out of date, so remove them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5621

Differential Revision: D16459883

Pulled By: lth

fbshipit-source-id: cd581323fd18982e817d99af57b6eaba59e599bb
2019-07-24 10:25:26 -07:00
Manuel Ung eae832740b WriteUnPrepared: improve read your own write functionality (#5573)
Summary:
There are a number of fixes in this PR (with most bugs found via the added stress tests):
1. Re-enable reseek optimization. This was initially disabled to avoid infinite loops in https://github.com/facebook/rocksdb/pull/3955 but this can be resolved by remembering not to reseek after a reseek has already been done. This problem only affects forward iteration in `DBIter::FindNextUserEntryInternal`, as we already disable reseeking in `DBIter::FindValueForCurrentKeyUsingSeek`.
2. Verify that ReadOption.snapshot can be safely used for iterator creation. Some snapshots would not give correct results because snapshot validation would not be enforced, breaking some assumptions in Prev() iteration.
3. In the non-snapshot Get() case, reads done at `LastPublishedSequence` may not be enough, because unprepared sequence numbers are not published. Use `std::max(published_seq, max_visible_seq)` to do lookups instead.
4. Add stress test to test reading own writes.
5. Minor bug in the allow_concurrent_memtable_write case where we forgot to pass in batch_per_txn_.
6. Minor performance optimization in `CalcMaxUnpreparedSequenceNumber` by assigning by reference instead of value.
7. Add some more comments everywhere.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5573

Differential Revision: D16276089

Pulled By: lth

fbshipit-source-id: 18029c944eb427a90a87dee76ac1b23f37ec1ccb
2019-07-23 08:08:19 -07:00
Manuel Ung 0acaa1a846 WriteUnPrepared: use tracked_keys_ to track keys needed for rollback (#5562)
Summary:
Currently, we are tracking keys we need to rollback via a separate structure specific to WriteUnprepared in write_set_keys_.

We already have a data structure called tracked_keys_ used to track which keys to unlock on transaction termination. This is exactly what we want, since we should only rollback keys that we have locked anyway.

Save some memory by reusing that data structure instead of making our own.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5562

Differential Revision: D16206484

Pulled By: lth

fbshipit-source-id: 5894d2b824a4b19062d84adbd6e6e86f00047488
2019-07-16 15:24:56 -07:00
Maysam Yabandeh 60f3ec2ca5 Fix appveyor complaint about passing const to thread (#5447)
Summary:
CLANG would complain if we pass const to lambda function and appveyor complains if we don't (https://github.com/facebook/rocksdb/pull/5443). The patch fixes that by using the default capture mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5447

Differential Revision: D15788722

Pulled By: maysamyabandeh

fbshipit-source-id: 47e7f49264afe31fdafe42cb8bf93da126abfca9
2019-06-12 15:06:22 -07:00
Maysam Yabandeh 4a285d0dd3 Remove passing const variable to thread (#5443)
Summary:
CLANG complains that passing const to thread is not necessary. The patch removes it from the PreparedHeap::Concurrent test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5443

Differential Revision: D15781598

Pulled By: maysamyabandeh

fbshipit-source-id: 3aceb05d96182fa4726d6d37eed45fd3aac4c016
2019-06-12 09:45:57 -07:00
Maysam Yabandeh 773f914a40 WritePrepared: switch PreparedHeap from priority_queue to deque (#5436)
Summary:
Internally PreparedHeap currently uses a priority_queue. The rationale was that in the initial design PreparedHeap::AddPrepared could be called in arbitrary order. With the recent optimizations, we call ::AddPrepared only from the main write queue, which results in in-order insertion into PreparedHeap. The patch thus replaces the underlying priority_queue with a more efficient deque implementation.
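Why in-order insertion makes a deque sufficient can be sketched as below. `InOrderPreparedSet` is a hypothetical, stripped-down stand-in; it omits erasing arbitrary entries and any locking, which the real PreparedHeap has to handle.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Hypothetical simplified structure: because sequence numbers arrive in
// ascending order, the smallest one is always at the front, so no heap
// reordering is needed on insertion.
class InOrderPreparedSet {
 public:
  void AddPrepared(std::uint64_t seq) {
    assert(seqs_.empty() || seqs_.back() < seq);  // in-order insertion only
    seqs_.push_back(seq);                         // amortized O(1)
  }
  bool empty() const { return seqs_.empty(); }
  std::uint64_t top() const {
    assert(!seqs_.empty());
    return seqs_.front();  // smallest prepared sequence number
  }
  void pop() {
    assert(!seqs_.empty());
    seqs_.pop_front();
  }

 private:
  std::deque<std::uint64_t> seqs_;
};
```

A std::priority_queue would pay O(log n) per push to maintain an order that the caller already guarantees here.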
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5436

Differential Revision: D15752147

Pulled By: maysamyabandeh

fbshipit-source-id: e6960f2b2097e13137dded1ceeff3b10b03b0aeb
2019-06-11 19:55:14 -07:00
Manuel Ung ca1aee2a19 WriteUnprepared: commit only from the 2nd queue (#5439)
Summary:
This is a port of this PR into WriteUnprepared:
https://github.com/facebook/rocksdb/pull/5014

This also reverts this test change to restore some flaky write unprepared
tests: https://github.com/facebook/rocksdb/pull/5315

Tested with:
$ gtest-parallel ./transaction_test --gtest_filter=MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/9 --repeat=128
[128/128] MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/9 (18250 ms)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5439

Differential Revision: D15761405

Pulled By: lth

fbshipit-source-id: ae2581fd942d8a5b3f9278fd6bc3c1ac0b2c964c
2019-06-11 18:01:39 -07:00
sdong 58c4aee42e TransactionUtil::CheckKey() to skip unnecessary history (#4941)
Summary:
If a memtable definitely covers a key, there isn't a need to check older memtables.
We can skip them by checking the earliest sequence number.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4941

Differential Revision: D13932666

fbshipit-source-id: b9d52f234b8ad9dd3bf6547645cd457175a3ca9b
2019-06-11 11:46:42 -07:00
Maysam Yabandeh c292dc8540 WritePrepared: reduce prepared_mutex_ overhead (#5420)
Summary:
The patch reduces the contention over prepared_mutex_ using these techniques:
1) Move ::RemovePrepared() to be called from the commit callback when we have two write queues.
2) Use two separate mutexes for PreparedHeap: prepared_mutex_, needed for ::RemovePrepared, and ::push_pop_mutex(), needed for ::AddPrepared(). Given that we call ::AddPrepared only from the first write queue and ::RemovePrepared mostly from the 2nd, the two write queues will not compete with each other over a single mutex. ::RemovePrepared might occasionally need to acquire ::push_pop_mutex() if ::erase() ends up calling ::pop().
3) Acquire ::push_pop_mutex() on the first callback of the write queue and release it on the last.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5420

Differential Revision: D15741985

Pulled By: maysamyabandeh

fbshipit-source-id: 84ce8016007e88bb6e10da5760ba1f0d26347735
2019-06-10 11:53:31 -07:00
Zhongyi Xie d68f9f4580 simplify include directive involving inttypes (#5402)
Summary:
When using `PRIu64` type of printf specifier, current code base does the following:
```
#ifndef __STDC_FORMAT_MACROS
#define __STDC_FORMAT_MACROS
#endif
#include <inttypes.h>
```
However, this can be simplified to
```
#include <cinttypes>
```
as long as flag `-std=c++11` is used.
This should solve issues like https://github.com/facebook/rocksdb/issues/5159
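As a small illustration of the simplified include, the following formats a 64-bit value with `PRIu64` using only `<cinttypes>`. `FormatSeq` is a hypothetical helper written for this example, not a function from the code base.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: <cinttypes> alone provides PRIu64 in C++11, with no
// need to define __STDC_FORMAT_MACROS first.
inline std::string FormatSeq(std::uint64_t seq) {
  char buf[32];  // "seq=" plus at most 20 digits plus the terminator
  int n = std::snprintf(buf, sizeof(buf), "seq=%" PRIu64, seq);
  assert(n > 0 && static_cast<std::size_t>(n) < sizeof(buf));
  (void)n;  // silence unused-variable warnings when asserts are disabled
  return std::string(buf);
}
```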
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5402

Differential Revision: D15701195

Pulled By: miasantreble

fbshipit-source-id: 6dac0a05f52aadb55e9728038599d3d2e4b59d03
2019-06-06 13:56:07 -07:00
Maysam Yabandeh ae05a83e19 Call ValidateOptions from SetOptions (#5368)
Summary:
Currently we validate options in DB::Open. However the validation step is missing when options are dynamically updated in ::SetOptions. The patch fixes that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5368

Differential Revision: D15540101

Pulled By: maysamyabandeh

fbshipit-source-id: d27bbffd8f0252d1b50bcf59e0a70a278ed937f4
2019-06-03 19:49:57 -07:00
Siying Dong 000b9ec217 Move some logging related files to logging/ (#5387)
Summary:
Many logging related source files are under util/. It will be more structured if they are together.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387

Differential Revision: D15579036

Pulled By: siying

fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f
2019-05-31 17:23:59 -07:00
Vijay Nadimpalli cae22c53fb Make format
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5395

Differential Revision: D15581698

Pulled By: vjnadimpalli

fbshipit-source-id: f415972f16e784b1361714c202b97defcab46767
2019-05-31 15:24:43 -07:00
Vijay Nadimpalli 49c5a12dbe Organizing rocksdb/db directory
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5390

Differential Revision: D15579388

Pulled By: vjnadimpalli

fbshipit-source-id: 5bfc95e31554b8ff05b97b76d6534113f527f366
2019-05-31 11:57:01 -07:00
Siying Dong 8843129ece Move some memory related files from util/ to memory/ (#5382)
Summary:
Move arena, allocator, and memory tools under util to a separate memory/ directory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382

Differential Revision: D15564655

Pulled By: siying

fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5
2019-05-30 17:44:09 -07:00
Siying Dong e9e0101ca4 Move test related files under util/ to test_util/ (#5377)
Summary:
There are too many types of files under util/. Some test related files don't belong there or are just loosely related. Move them to a new directory test_util/, so that util/ is cleaner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377

Differential Revision: D15551366

Pulled By: siying

fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c
2019-05-30 11:25:51 -07:00
Maysam Yabandeh eab4f49a2c WritePrepared: skip_concurrency_control option (#5330)
Summary:
This enables the user to set TransactionDBOptions::skip_concurrency_control so the standard `DB::Write(const WriteOptions& opts, WriteBatch* updates)` would skip the concurrency control. This would give higher throughput to the users who know their use case doesn't need concurrency control.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5330

Differential Revision: D15525932

Pulled By: maysamyabandeh

fbshipit-source-id: 68421ac1ba34f549a4a8de9ce4c2dccf6fb4b06b
2019-05-28 16:29:45 -07:00
Maysam Yabandeh f5576c3317 WritePrepared: disableWAL in commit without prepare (#5327)
Summary:
When committing a transaction without prepare, WritePrepared simply writes the batch to db and adds the commit entry to CommitCache. When two_write_queues=true, following the rule of committing only from the 2nd write queue, the first write writes the batch and the only thing the 2nd write does is write the commit entry to CommitCache. Currently the write batch in the 2nd write is set to an empty LogData entry, while the write to the WAL could simply be disabled entirely.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5327

Differential Revision: D15424546

Pulled By: maysamyabandeh

fbshipit-source-id: 3d9ea3922d5196984c584d62a3ed57e1f7ca7b9f
2019-05-28 14:21:52 -07:00
Maysam Yabandeh 5c0e304170 WritePrepared: Clarify the need for two_write_queues in unordered_write (#5313)
Summary:
WritePrepared transactions, when configured with two_write_queues=true, offer higher throughput with the unordered_write feature without compromising the rocksdb guarantees. This is because ordering among writes is performed in a 2nd step that is not tied to memtable write speed. The 2nd step is naturally provided by 2PC when the commit phase does the ordering as well. Without 2PC, the 2nd step is only provided when we use two_write_queues=true, where WritePrepared, after performing the writes, uses the 2nd queue in a 2nd step to assign order to the writes.
The patch clarifies the need for two_write_queues=true in the HISTORY and inline comments of unordered_write. Moreover, it extends the stress tests of WritePrepared to unordered_write.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5313

Differential Revision: D15379977

Pulled By: maysamyabandeh

fbshipit-source-id: 5b6f05b9b59285dcbf3b0532215ba9fe7d926e00
2019-05-20 07:49:20 -07:00
Maysam Yabandeh c71f5bb9aa Disable WriteUnPrepared stress tests (#5315)
Summary:
They are kind of flaky at the moment. Will re-enable it when flakiness is fixed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5315

Differential Revision: D15382744

Pulled By: maysamyabandeh

fbshipit-source-id: 8b2f9d81a4bb34bfd51481727a682d5cd063c5e3
2019-05-16 15:39:33 -07:00
Maysam Yabandeh f0e8216197 WritePrepared: Fix deadlock in WriteRecoverableState (#5306)
Summary:
The recent improvement in https://github.com/facebook/rocksdb/pull/3661 could cause a deadlock: when writing recoverable state, we also commit its sequence number to the commit table, which could result in evicting an existing commit entry, which could result in advancing max_evicted_seq_, which would need to get snapshots from the database, which requires obtaining the db mutex. The patch releases db_mutex before calling the callback in WriteRecoverableState to avoid the potential deadlock. It also improves the stress tests so that the issue manifests in the tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5306

Differential Revision: D15341458

Pulled By: maysamyabandeh

fbshipit-source-id: 05dcbed7e21b789fd1e5fd5ee8eea08077162323
2019-05-15 13:53:54 -07:00
Thomas Fersch a42757607d Use pre-increment instead of post-increment for iterators (#5296)
Summary:
Google C++ style guide indicates pre-increment should be used for iterators: https://google.github.io/styleguide/cppguide.html#Preincrement_and_Predecrement. Replaced all instances of ' it++' by ' ++it' (where type is iterator). So this covers the cases where iterators are named 'it'.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5296

Differential Revision: D15301256

Pulled By: tfersch

fbshipit-source-id: 2803483c1392504ad3b281d21db615429c71114b
2019-05-15 13:19:15 -07:00
Maysam Yabandeh f383641a1d Unordered Writes (#5218)
Summary:
Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees.
The patch is prepared based on an original by siying
Existing unit tests are extended to include unordered_write option.

Benchmark Results:
```
TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions  --unordered_write=1
```
With WAL
- Vanilla RocksDB: 78.6 MB/s
- WRITE_PREPARED with unordered_write: 177.8 MB/s (2.2x)
- unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees)

Without WAL
- Vanilla RocksDB: 111.3 MB/s
- WRITE_PREPARED with unordered_write: 259.3 MB/s (2.3x)
- unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees)

- WRITE_PREPARED with unordered_write, concurrency control disabled: 185.3 MB/s (2.35x)

Limitations:
- The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218

Differential Revision: D15219029

Pulled By: maysamyabandeh

fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759
2019-05-13 17:47:21 -07:00
anand76 1c8cbf315f Extend MultiGet batching to Transactions (#5210)
Summary:
MultiGet batching was implemented in #5011 in order to reduce CPU utilization when looking up multiple keys at once. This PR implements corresponding `MultiGet` and `MultiGetSingleCFForUpdate` in `rocksdb::Transaction` that call the underlying batching implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5210

Differential Revision: D15048164

Pulled By: anand1976

fbshipit-source-id: c52f6043102ab0cbc723f4cba2a7b7d1767f6f52
2019-04-23 14:11:26 -07:00
jsteemann de76909464 refactor SavePoints (#5192)
Summary:
Savepoints are assumed to be used in a stack-wise fashion (only
the top element should be used), so they were stored by `WriteBatch`
in a member variable `save_points_` using an std::stack.

Conceptually this is fine, but the implementation had a few issues:
- the `save_points_` instance variable was a plain pointer to a heap-
  allocated `SavePoints` struct. The destructor of `WriteBatch` simply
  deletes this pointer. However, the copy constructor of WriteBatch
  just copied that pointer, meaning that copying a WriteBatch with
  active savepoints will very likely have crashed before. Now a proper
  copy of the savepoints is made in the copy constructor, and not just
  a copy of the pointer
- `save_points_` was an std::stack, which defaults to `std::deque` for
  the underlying container. A deque is a bit over the top here, as we
  only need access to the most recent savepoint (i.e. stack.top()) but
  never any elements at the front. std::deque is rather expensive to
  initialize in common environments. For example, the STL implementation
  shipped with GNU g++ will perform a heap allocation of more than 500
  bytes to create an empty deque object. Although the `save_points_`
  container is created lazily by RocksDB, moving from a deque to a plain
  `std::vector` is much more memory-efficient. So `save_points_` is now
  a vector.
- `save_points_` was changed from a plain pointer to an `std::unique_ptr`,
  making ownership more explicit.
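The three points above can be sketched together in a reduced form. `Batch` and `SavePoint` here are hypothetical stand-ins for illustration only; the actual `WriteBatch` holds more state and its savepoint struct has different fields.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical savepoint record.
struct SavePoint {
  std::size_t size;
  int count;
};

// Hypothetical reduced batch: savepoints live in a lazily created vector
// owned through std::unique_ptr, and copying makes a deep copy.
class Batch {
 public:
  Batch() = default;

  Batch(const Batch& other) {
    if (other.save_points_) {
      // Copy the savepoints themselves, not just the pointer.
      save_points_.reset(new std::vector<SavePoint>(*other.save_points_));
    }
  }

  Batch& operator=(const Batch& other) {
    if (this != &other) {
      Batch tmp(other);
      save_points_.swap(tmp.save_points_);
    }
    return *this;
  }

  void SetSavePoint(std::size_t size, int count) {
    if (!save_points_) {
      save_points_.reset(new std::vector<SavePoint>());  // created lazily
    }
    save_points_->push_back(SavePoint{size, count});
  }

  void PopSavePoint() {
    assert(HasSavePoint());
    save_points_->pop_back();  // only the most recent savepoint is used
  }

  bool HasSavePoint() const { return save_points_ && !save_points_->empty(); }
  std::size_t NumSavePoints() const {
    return save_points_ ? save_points_->size() : 0;
  }

 private:
  std::unique_ptr<std::vector<SavePoint>> save_points_;
};
```

Because each copy owns its own vector, destroying or modifying one batch leaves the other's savepoints intact, which is the failure mode the shared raw pointer had.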
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5192

Differential Revision: D15024074

Pulled By: maysamyabandeh

fbshipit-source-id: 5b128786d3789cde94e46465c9e91badd07a25d7
2019-04-19 20:33:04 -07:00
jsteemann 8295d364e2 Improve transaction lock details (#5193)
Summary:
This branch contains two small improvements:
* Create `LockMap` entries using `std::make_shared`. This saves one heap allocation per LockMap entry but also locates the control block and the LockMap object closely together in memory, which can help with caching
* Reorder the members of `TrackedTrxInfo`, so that the resulting struct uses less memory (at least on 64bit systems)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5193

Differential Revision: D14934536

Pulled By: maysamyabandeh

fbshipit-source-id: f7b49812bb4b6029eef9d131e7cd56260df5b28e
2019-04-15 10:44:03 -07:00
Manuel Ung d655a3aab7 Remove extraneous call to TrackKey (#5173)
Summary:
In `PessimisticTransaction::TryLock`, we were calling `TrackKey` even when assume_tracked=true, which defeats the purpose of assume_tracked. Remove this.

For keys that are already tracked, TrackKey will actually bump some counters (num_reads/num_writes) which are consumed in `TransactionBaseImpl::GetTrackedKeysSinceSavePoint`, and this is used to determine which keys were tracked since the last savepoint. I believe this functionality should still work, since I think the user should not call GetForUpdate/Put(assume_tracked=true) across savepoints, and if they do, they should not expect the Put(assume_tracked=true) to show up as a tracked key in the second savepoint.

This is another 2-3% cpu improvement.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5173

Differential Revision: D14883809

Pulled By: lth

fbshipit-source-id: 7d09f0772da422384af0519773e310c22b0cbca3
2019-04-12 16:37:12 -07:00
Maysam Yabandeh fe642cbee6 WritePrepared: fix race condition in reading batch with duplicate keys (#5147)
Summary:
When ReadOptions doesn't specify a snapshot, WritePrepared::Get used kMaxSequenceNumber to avoid the cost of creating a new snapshot object (that requires sync over db_mutex). This creates a race condition if it is reading from the writes of a transaction that had duplicate keys: each instance of a duplicate key is inserted with a different sequence number, and depending on the ordering, ::Get might skip the newer one and read the older one, which is obsolete.
The patch fixes that by using the last published seq as the snapshot sequence number. It also adds a check after the read is done to ensure that max_evicted_seq has not advanced past the aforementioned seq, which is a very unlikely event. If it did, the read is not valid, since the seq is not backed by an actual snapshot that would let IsInSnapshot handle the case properly when an overlapping commit is evicted from the commit cache.
A unit test is added to reproduce the race condition with duplicate keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5147

Differential Revision: D14758815

Pulled By: maysamyabandeh

fbshipit-source-id: a56915657132cf6ba5e3f5ea1b5d78c803407719
2019-04-12 14:40:41 -07:00
Manuel Ung ef0fc1b461 Reduce copies of LockInfo (#5172)
Summary:
The LockInfo struct is not cheap to copy because it contains a std::vector. Reduce copies by using the move constructor and `unordered_map::emplace`.
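
As a rough illustration of the idea, assuming a hypothetical `LockInfoSketch` struct and `AddLock` helper (neither is the RocksDB definition): moving the struct into `unordered_map::emplace` transfers the vector's buffer instead of copying its elements.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a lock record that owns a vector.
struct LockInfoSketch {
  bool exclusive;
  std::vector<uint64_t> txn_ids;
  uint64_t expiration_time;
};

using LockTable = std::unordered_map<std::string, LockInfoSketch>;

// Inserts `info` under `key` by moving it into the map node.
// Returns true if a new entry was created. If the key already exists,
// the caller should not rely on the state of `info` afterwards.
inline bool AddLock(LockTable& table, const std::string& key,
                    LockInfoSketch&& info) {
  return table.emplace(key, std::move(info)).second;
}
```

After a successful insert, the map entry's vector reuses the heap buffer that the source object allocated, so no element-wise copy takes place.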
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5172

Differential Revision: D14882053

Pulled By: lth

fbshipit-source-id: 93999ec6ab1a5841fb5115abb764b6c1831a6de1
2019-04-10 15:58:58 -07:00
Siying Dong 0bb555630f Consolidate hash function used for non-persistent data in a new function (#5155)
Summary:
Create new functions NPHash64() and GetSliceNPHash64(), which are currently
implemented using murmurhash.
Replace the current direct calls to murmurhash() with the new functions
if the hash results are not used in an on-disk format.
This will make it easier to try out or switch to alternative functions
in the uses where data format compatibility doesn't need to be considered.
This part shouldn't have any performance impact.
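
The indirection can be sketched as follows. This is an illustration under assumptions, not the RocksDB implementation: `NonPersistentHash64` and `ShardFor` are hypothetical names, and the body uses 64-bit FNV-1a purely as a stand-in where the commit uses murmurhash.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Stand-in hash (64-bit FNV-1a mixed with a seed). This is NOT murmurhash;
// it only gives the wrapper below something to call.
inline uint64_t Fnv1a64(const char* data, std::size_t n, uint64_t seed) {
  uint64_t h = 14695981039346656037ULL ^ seed;
  for (std::size_t i = 0; i < n; ++i) {
    h ^= static_cast<unsigned char>(data[i]);
    h *= 1099511628211ULL;
  }
  return h;
}

// Hypothetical single entry point for hashes that never reach disk.
// Swapping the algorithm means changing only this function body.
inline uint64_t NonPersistentHash64(const char* data, std::size_t n,
                                    uint64_t seed = 0) {
  return Fnv1a64(data, n, seed);
}

inline uint64_t NonPersistentHash64(const std::string& s) {
  return NonPersistentHash64(s.data(), s.size());
}

// Example in-memory use: picking a shard. The result is never persisted,
// so a different hash function only changes the in-process distribution.
inline std::size_t ShardFor(const std::string& key, std::size_t num_shards) {
  return static_cast<std::size_t>(NonPersistentHash64(key) % num_shards);
}
```

Callers that need a stable on-disk format would keep calling their format-specific hash directly and would not go through a wrapper like this.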

Also, the sharded cache hash function is changed to the new format, because
it falls into this category. It doesn't show a visible performance impact
in db_bench results. CPU usage shown by perf increased from about 0.2% to 0.4%
in an extreme benchmark setting (4KB blocks, no compression, everything
cached in block cache). We already know that the hash function currently used,
our own Hash(), has a serious hash quality problem: it can generate a lot of
collisions with similar inputs. In this use case, that means extra lock contention
for reads from the same file. To me, this slight CPU regression is worth it
to counter the potentially bad performance with hot keys. Hopefully this
will be further improved in the future with a better hash function.

cache_test's condition is relaxed a little bit too. The new hash is slightly
more skewed in this use case, but I manually checked the data and saw that
the hash results are still in a reasonable range.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5155

Differential Revision: D14834821

Pulled By: siying

fbshipit-source-id: ec9a2c0a2f8ae4b54d08b13a5c2e9cc97aa80cb5
2019-04-08 13:32:06 -07:00
Maysam Yabandeh 7441a0ecba WriteUnPrepared: fix ubsan complaint (#5148)
Summary:
UBSan complains that in the initialization of WriteUnpreparedTxnReadCallback, a method of the child class is used before the parent class is constructed. The patch fixes that by making the aforementioned method static.
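
The pattern can be sketched like this. The class names and the `CalcMaxVisibleSeq` helper are hypothetical simplifications and their signatures are assumptions, not the RocksDB declarations.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical parent class that needs a value at construction time.
class ReadCallbackSketch {
 public:
  explicit ReadCallbackSketch(uint64_t max_visible_seq)
      : max_visible_seq_(max_visible_seq) {}
  virtual ~ReadCallbackSketch() = default;

  uint64_t max_visible_seq() const { return max_visible_seq_; }

 private:
  uint64_t max_visible_seq_;
};

// Hypothetical child class. The value passed to the parent constructor is
// computed by a static helper, so no member function is invoked on an
// object whose base subobject has not been constructed yet.
class UnpreparedCallbackSketch : public ReadCallbackSketch {
 public:
  UnpreparedCallbackSketch(uint64_t snapshot_seq, uint64_t largest_unprep_seq)
      : ReadCallbackSketch(
            CalcMaxVisibleSeq(snapshot_seq, largest_unprep_seq)) {}

  // Static: depends only on its arguments, never on `this`.
  static uint64_t CalcMaxVisibleSeq(uint64_t snapshot_seq,
                                    uint64_t largest_unprep_seq) {
    return std::max(snapshot_seq, largest_unprep_seq);
  }
};
```

Had `CalcMaxVisibleSeq` been a non-static member, the call in the initializer list would run on a partially constructed object, which is the kind of use that sanitizers flag.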
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5148

Differential Revision: D14760098

Pulled By: maysamyabandeh

fbshipit-source-id: cf19b7c1fdb5de0a54e62c1deebe09a0fa048ded
2019-04-03 15:51:30 -07:00
Maysam Yabandeh 5234fc1b70 Mark logs with prepare in PreReleaseCallback (#5121)
Summary:
In the prepare phase of 2PC, the db promises to remember the prepared data for possible future commits. To fulfill the promise, the prepared data must be persisted in the WAL so that it can be recovered after a crash. The log that contains a prepare batch that is not committed yet is marked so that it is not garbage collected before the transaction commits or rolls back. The bug was that the write to the log file and the marking of the file were not atomic, and WAL gc could have happened before the WAL log was actually marked. This patch moves the marking logic to PreReleaseCallback so that the WAL gc logic that joins both write threads sees the WAL write and the WAL mark atomically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5121

Differential Revision: D14665210

Pulled By: maysamyabandeh

fbshipit-source-id: 1d66aeb1c66a296cb4899a5a20c4d40c59e4b534
2019-04-02 15:17:47 -07:00
Maysam Yabandeh 14b3f683a1 WriteUnPrepared: less virtual in iterator callback (#5049)
Summary:
WriteUnPrepared adds a virtual function, MaxUnpreparedSequenceNumber, to ReadCallback, which returns 0 unless WriteUnPrepared is enabled and the transaction has uncommitted data written to the DB. Together with snapshot sequence number, this determines the last sequence that is visible to reads.
The patch clarifies the guarantees of the GetIterator API in WriteUnPrepared transactions and makes use of that to statically initialize the read callback and thus avoid the virtual call.
Furthermore it increases the minimum value for min_uncommitted from 0 to 1 as seq 0 is used only for last level keys that are committed in all snapshots.

The following benchmark shows +0.26% higher throughput in seekrandom benchmark.

Benchmark:
./db_bench --benchmarks=fillrandom --use_existing_db=0 --num=1000000 --db=/dev/shm/dbbench

./db_bench --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
seekrandom [AVG    10 runs] : 20355 ops/sec;  225.2 MB/sec
seekrandom [MEDIAN 10 runs] : 20425 ops/sec;  225.9 MB/sec

./db_bench_lessvirtual3 --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
seekrandom [AVG    10 runs] : 20409 ops/sec;  225.8 MB/sec
seekrandom [MEDIAN 10 runs] : 20487 ops/sec;  226.6 MB/sec
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5049

Differential Revision: D14366459

Pulled By: maysamyabandeh

fbshipit-source-id: ebaff8908332a5ae9af7defeadabcb624be660ef
2019-04-02 14:47:16 -07:00
Maysam Yabandeh a703f16da9 WriteUnPrepared: Enable auto-compaction after max_evicted_seq_ init (#5128)
Summary:
Compaction depends on the value of max_evicted_seq_. The ::Initialize method should therefore enable auto-compaction only after max_evicted_seq_ is properly initialized. The patch also backports #4853 from WritePrepared txn to WriteUnPrepared.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5128

Differential Revision: D14686562

Pulled By: maysamyabandeh

fbshipit-source-id: b2355025712a72676ac3b20a95258adcf4774490
2019-03-29 13:18:57 -07:00
Maysam Yabandeh 04d3ac4e63 Fix tsan complaint on AddPreparedBeforeMax (#5052)
Summary:
Add a mutex to the test to synchronize before accessing the shared txn object.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5052

Differential Revision: D14386861

Pulled By: maysamyabandeh

fbshipit-source-id: 5b32e209840b210c35af53848dc77f489a76c95a
2019-03-08 09:39:00 -08:00
Maysam Yabandeh 04a2631dbe WritePrepared: handle adding prepare before max_evicted_seq_ (#5025)
Summary:
The patch fixes an improbable race condition between AddPrepared from one write queue and AdvanceMaxEvictedSeq from another queue. In this scenario AddPrepared finds prepare_seq lower than max and adds it to PrepareHeap as usual, while AdvanceMaxEvictedSeq has already finished checking PrepareHeap against the future max. Thus when AdvanceMaxEvictedSeq finishes off by updating max_evicted_seq_, PrepareHeap ends up with a prepared_seq lower than it, which breaks the PrepareHeap contract. The fix is that in AddPrepared we check against future_max_evicted_seq_ instead, which is updated before AdvanceMaxEvictedSeq acquires prepare_mutex_ and looks into PrepareHeap.
A unit test is added to test for the failure scenario. The code is also refactored a bit to remove the duplicate code between AdvanceMaxEvictedSeq and AddPrepared.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5025

Differential Revision: D14249028

Pulled By: maysamyabandeh

fbshipit-source-id: 072ea56663f40359662c05fafa6ac524417b0622
2019-03-07 07:41:15 -08:00
Maysam Yabandeh 703f1375c2 WritePrepared: Add rollback batch to PreparedHeap (#5026)
Summary:
The patch adds the sequence number of the rollback batch to the PrepareHeap when two_write_queues is enabled. Although the current behavior is still correct, the change simplifies reasoning about the code by having all uncommitted batches registered with the PreparedHeap.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5026

Differential Revision: D14249401

Pulled By: maysamyabandeh

fbshipit-source-id: 1e3424edee5cd14e56ee35931ad3c93ed997cd5a
2019-03-07 07:33:31 -08:00
Maysam Yabandeh 68a2f94d5d WritePrepared: commit only from the 2nd queue (#5014)
Summary:
When two_write_queues is enabled we call ::AddPrepared only from the main queue, which writes to both WAL and memtable, and call ::AddCommitted from the 2nd queue, which writes only to WAL. This simplifies the logic by avoiding concurrency between AddPrepared calls and also between AddCommitted calls. The patch fixes one case that did not conform to the rule above. This allows future refactoring. For example AdvanceMaxEvictedSeq, which is invoked by AddCommitted, can be simplified by assuming there are no concurrent calls to it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5014

Differential Revision: D14210493

Pulled By: maysamyabandeh

fbshipit-source-id: 6db5ba372a294a568a14caa010576460917a4eab
2019-02-28 15:23:34 -08:00
Maysam Yabandeh a661c0d208 WritePrepared: optimize read path by avoiding virtual (#5018)
Summary:
The read path includes a callback function, ReadCallback, which eventually calls IsInSnapshot to figure out whether a particular seq is in the reading snapshot or not. This callback is virtual, which adds the cost of multiple virtual function calls to each read. The first few checks in IsInSnapshot, however, are quite trivial and take care of the majority of the cases. The patch moves those to a non-virtual function in the parent class, ReadCallback, to lower the virtual callback cost.
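
A sketch of the fast-path idea, under assumptions: `FastPathCallback`, `CountingCallback`, the two thresholds, and the exact checks are illustrative and are not claimed to match the RocksDB ReadCallback. The sketch assumes that anything below `min_uncommitted` is already committed and visible, and that anything above `max_visible_seq` is invisible.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical base class: cheap checks run in a non-virtual function,
// and only undecided cases pay for the virtual call.
class FastPathCallback {
 public:
  FastPathCallback(uint64_t max_visible_seq, uint64_t min_uncommitted)
      : max_visible_seq_(max_visible_seq), min_uncommitted_(min_uncommitted) {}
  virtual ~FastPathCallback() = default;

  // Non-virtual, so the compiler can inline it at the call site.
  bool IsVisible(uint64_t seq) {
    if (seq < min_uncommitted_) {
      return true;  // assumed committed before the snapshot was taken
    }
    if (seq > max_visible_seq_) {
      return false;  // written after the snapshot
    }
    return IsVisibleFullCheck(seq);  // rare case: full (virtual) check
  }

 protected:
  virtual bool IsVisibleFullCheck(uint64_t seq) = 0;

 private:
  uint64_t max_visible_seq_;
  uint64_t min_uncommitted_;
};

// Test double that counts how often the slow path is taken.
class CountingCallback : public FastPathCallback {
 public:
  using FastPathCallback::FastPathCallback;
  int slow_calls = 0;

 protected:
  bool IsVisibleFullCheck(uint64_t /*seq*/) override {
    ++slow_calls;
    return true;
  }
};
```

With `max_visible_seq = 100` and `min_uncommitted = 50`, sequence numbers below 50 or above 100 are resolved without any virtual dispatch; only values in between reach `IsVisibleFullCheck`.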
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5018

Differential Revision: D14226562

Pulled By: maysamyabandeh

fbshipit-source-id: 6feed5b34f3b082e52092c5ef143e29b49c46b44
2019-02-26 16:56:19 -08:00
Maysam Yabandeh cf98df34c1 Change random seed for txn stress tests on each run (#5004)
Summary:
Currently the transaction stress tests use the thread id as the seed. Since the thread ids are likely to be the same across multiple runs, the seed is also going to be the same. The patch includes time in calculating the seed to help cover a different part of the state space in each run of the stress tests. To be able to reproduce the bug in case the stress tests fail, it also prints out the time that was used to calculate the seed value.
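
One way to sketch this approach is shown below. `SeedFrom` and `NewLoggedSeed` are hypothetical helpers, and combining the two inputs by addition is an assumption made for the illustration, not necessarily what the stress tests do.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <thread>

// Combines a time component with the thread id. Given the same time
// component and the same thread id, the result is the same seed.
inline uint32_t SeedFrom(uint64_t time_us, std::thread::id tid) {
  const uint64_t tid_hash =
      static_cast<uint64_t>(std::hash<std::thread::id>()(tid));
  return static_cast<uint32_t>(time_us + tid_hash);
}

// Draws a fresh seed from the wall clock and logs the time component so
// that a failing run can be replayed by passing it to SeedFrom().
inline uint32_t NewLoggedSeed() {
  const auto since_epoch = std::chrono::system_clock::now().time_since_epoch();
  const uint64_t time_us = static_cast<uint64_t>(
      std::chrono::duration_cast<std::chrono::microseconds>(since_epoch)
          .count());
  std::fprintf(stderr, "seed time component: %llu\n",
               static_cast<unsigned long long>(time_us));
  return SeedFrom(time_us, std::this_thread::get_id());
}
```

Logging only the time component is enough for a replay as long as the thread ids repeat across runs, which is the premise stated in the summary.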
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5004

Differential Revision: D14144356

Pulled By: maysamyabandeh

fbshipit-source-id: 728ed522f550fc8b4f5f9f373259c05fe9a54556
2019-02-19 19:58:55 -08:00
Maysam Yabandeh 0f4244fe00 WritePrepared: Improve stress tests with slow threads (#4974)
Summary:
The transaction stress tests stress a high concurrency scenario. In WritePrepared/WriteUnPrepared we also need to stress the scenarios where an inserting/reading transaction is very slow. This would stress the corner cases where the caching is not sufficient and other, slower data structures are engaged. To emulate such cases we make use of slow inserter/verifier threads and also reduce the size of the cache data structures.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4974

Differential Revision: D14143070

Pulled By: maysamyabandeh

fbshipit-source-id: 81eb674678faf9fae0f654cd60ebcc74e26aeee7
2019-02-19 16:56:49 -08:00
Maysam Yabandeh bcdc8c8b19 WritePrepared: max_evicted_seq_ update during commit cache lookup (#4955)
Summary:
max_evicted_seq_ could be updated in the middle of the read in ::IsInSnapshot. Code that stays correct in the presence of this update would be complicated. The patch simplifies it by checking the value of max_evicted_seq_ before and after looking into commit_cache_, and retrying in the unlucky case that it has changed.
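
The check-and-retry pattern can be sketched as follows. `StableLookup` is a hypothetical helper, and the lookup is passed in as a callable standing in for the commit cache probe; the real IsInSnapshot logic is considerably more involved.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Runs `lookup` and accepts its result only if `max_evicted_seq` held the
// same value before and after the lookup; otherwise the read is retried.
// `attempts`, if provided, receives the number of lookups performed.
template <typename LookupFn>
bool StableLookup(const std::atomic<uint64_t>& max_evicted_seq,
                  LookupFn lookup, int* attempts = nullptr) {
  while (true) {
    const uint64_t before = max_evicted_seq.load(std::memory_order_acquire);
    const bool result = lookup(before);
    const uint64_t after = max_evicted_seq.load(std::memory_order_acquire);
    if (attempts != nullptr) {
      ++*attempts;
    }
    if (before == after) {
      return result;  // no concurrent update was observed during the lookup
    }
    // The value moved while we were reading; discard the result and retry.
  }
}
```

The retry keeps the common path simple: the lookup itself does not need to reason about a concurrent change, because any result computed across a change is thrown away.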
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4955

Differential Revision: D13999556

Pulled By: maysamyabandeh

fbshipit-source-id: 7a1bdfa95ea8b5d8d73ddff3263ed31d7297b39c
2019-02-19 16:14:08 -08:00