Commit Graph

80 Commits

Author SHA1 Message Date
Andrew Kryczka bf98dcf9a8 Fix kBlockCacheTier read when merge-chain base value is in a blob file (#12462)
Summary:
The original goal is to propagate failures from `GetContext::SaveValue()` -> `GetContext::GetBlobValue()` -> `BlobFetcher::FetchBlob()` up to the user. This call sequence happens when a merge chain ends with a base value in a blob file.

There's also fixes for bugs encountered along the way where non-ok statuses were ignored/overwritten, and a bit of plumbing work for functions that had no capability to return a status.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12462

Test Plan:
A repro command

```
db=/dev/shm/dbstress_db ; exp=/dev/shm/dbstress_exp ; rm -rf $db $exp ; mkdir -p $db $exp
./db_stress \
        --clear_column_family_one_in=0 \
        --test_batches_snapshots=0 \
        --write_fault_one_in=0 \
        --use_put_entity_one_in=0 \
        --prefixpercent=0 \
        --read_fault_one_in=0 \
        --readpercent=0 \
        --reopen=0 \
        --set_options_one_in=10000 \
        --delpercent=0 \
        --delrangepercent=0 \
        --open_metadata_write_fault_one_in=0 \
        --open_read_fault_one_in=0 \
        --open_write_fault_one_in=0 \
        --destroy_db_initially=0 \
        --ingest_external_file_one_in=0 \
        --iterpercent=0 \
        --nooverwritepercent=0 \
        --db=$db \
        --enable_blob_files=1 \
        --expected_values_dir=$exp \
        --max_background_compactions=20 \
        --max_bytes_for_level_base=2097152 \
        --max_key=100000 \
        --min_blob_size=0 \
        --open_files=-1 \
        --ops_per_thread=100000000 \
        --prefix_size=-1 \
        --target_file_size_base=524288 \
        --use_merge=1 \
        --value_size_mult=32 \
        --write_buffer_size=524288 \
        --writepercent=100
```

It used to fail like:

```
...
frame https://github.com/facebook/rocksdb/issues/9: 0x00007fc63903bc93 libc.so.6`__GI___assert_fail(assertion="HasDefaultColumn(columns)", file="fbcode/internal_repo_rocksdb/repo/db/wide/wide_columns_helper.h", line=33, function="static const rocksdb::Slice &rocksdb::WideColumnsHelper::GetDefaultColumn(const rocksdb::WideColumns &)") at assert.c:101:3
frame https://github.com/facebook/rocksdb/issues/10: 0x00000000006f7e92 db_stress`rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::PinnableWideColumns*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, rocksdb::Status*, rocksdb::MergeContext*, unsigned long*, rocksdb::PinnedIteratorsManager*, bool*, bool*, unsigned long*, rocksdb::ReadCallback*, bool*, bool) [inlined] rocksdb::WideColumnsHelper::GetDefaultColumn(columns=size=0) at wide_columns_helper.h:33
frame https://github.com/facebook/rocksdb/issues/11: 0x00000000006f7e76 db_stress`rocksdb::Version::Get(this=0x00007fc5ec763000, read_options=<unavailable>, k=<unavailable>, value=0x0000000000000000, columns=0x00007fc6035fd1d8, timestamp=<unavailable>, status=0x00007fc6035fd250, merge_context=0x00007fc6035fce40, max_covering_tombstone_seq=0x00007fc6035fce90, pinned_iters_mgr=0x00007fc6035fcdf0, value_found=0x0000000000000000, key_exists=0x0000000000000000, seq=0x0000000000000000, callback=0x0000000000000000, is_blob=0x0000000000000000, do_merge=<unavailable>) at version_set.cc:2492
frame https://github.com/facebook/rocksdb/issues/12: 0x000000000051e245 db_stress`rocksdb::DBImpl::GetImpl(this=0x00007fc637a86000, read_options=0x00007fc6035fcf60, key=<unavailable>, get_impl_options=0x00007fc6035fd000) at db_impl.cc:2408
frame https://github.com/facebook/rocksdb/issues/13: 0x000000000050cec2 db_stress`rocksdb::DBImpl::GetEntity(this=0x00007fc637a86000, _read_options=<unavailable>, column_family=<unavailable>, key=0x00007fc6035fd3c8, columns=0x00007fc6035fd1d8) at db_impl.cc:2109
frame https://github.com/facebook/rocksdb/issues/14: 0x000000000074f688 db_stress`rocksdb::(anonymous namespace)::MemTableInserter::MergeCF(this=0x00007fc6035fd450, column_family_id=2, key=0x00007fc6035fd3c8, value=0x00007fc6035fd3a0) at write_batch.cc:2656
frame https://github.com/facebook/rocksdb/issues/15: 0x00000000007476fc db_stress`rocksdb::WriteBatchInternal::Iterate(wb=0x00007fc6035fe698, handler=0x00007fc6035fd450, begin=12, end=<unavailable>) at write_batch.cc:607
frame https://github.com/facebook/rocksdb/issues/16: 0x000000000074d7dd db_stress`rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, rocksdb::TrimHistoryScheduler*, bool, unsigned long, rocksdb::DB*, bool, bool, bool) [inlined] rocksdb::WriteBatch::Iterate(this=<unavailable>, handler=0x00007fc6035fd450) const at write_batch.cc:505
frame https://github.com/facebook/rocksdb/issues/17: 0x000000000074d77b db_stress`rocksdb::WriteBatchInternal::InsertInto(write_group=<unavailable>, sequence=<unavailable>, memtables=<unavailable>, flush_scheduler=<unavailable>, trim_history_scheduler=<unavailable>, ignore_missing_column_families=<unavailable>, recovery_log_number=0, db=0x00007fc637a86000, concurrent_memtable_writes=<unavailable>, seq_per_batch=false, batch_per_txn=<unavailable>) at write_batch.cc:3084
frame https://github.com/facebook/rocksdb/issues/18: 0x0000000000631d77 db_stress`rocksdb::DBImpl::PipelinedWriteImpl(this=0x00007fc637a86000, write_options=<unavailable>, my_batch=0x00007fc6035fe698, callback=0x0000000000000000, log_used=<unavailable>, log_ref=0, disable_memtable=<unavailable>, seq_used=0x0000000000000000) at db_impl_write.cc:807
frame https://github.com/facebook/rocksdb/issues/19: 0x000000000062ceeb db_stress`rocksdb::DBImpl::WriteImpl(this=<unavailable>, write_options=<unavailable>, my_batch=0x00007fc6035fe698, callback=0x0000000000000000, log_used=<unavailable>, log_ref=0, disable_memtable=<unavailable>, seq_used=0x0000000000000000, batch_cnt=0, pre_release_callback=0x0000000000000000, post_memtable_callback=0x0000000000000000) at db_impl_write.cc:312
frame https://github.com/facebook/rocksdb/issues/20: 0x000000000062c8ec db_stress`rocksdb::DBImpl::Write(this=0x00007fc637a86000, write_options=0x00007fc6035feca8, my_batch=0x00007fc6035fe698) at db_impl_write.cc:157
frame https://github.com/facebook/rocksdb/issues/21: 0x000000000062b847 db_stress`rocksdb::DB::Merge(this=0x00007fc637a86000, opt=0x00007fc6035feca8, column_family=0x00007fc6370bf140, key=0x00007fc6035fe8d8, value=0x00007fc6035fe830) at db_impl_write.cc:2544
frame https://github.com/facebook/rocksdb/issues/22: 0x000000000062b6ef db_stress`rocksdb::DBImpl::Merge(this=0x00007fc637a86000, o=<unavailable>, column_family=0x00007fc6370bf140, key=0x00007fc6035fe8d8, val=0x00007fc6035fe830) at db_impl_write.cc:72
frame https://github.com/facebook/rocksdb/issues/23: 0x00000000004d6397 db_stress`rocksdb::NonBatchedOpsStressTest::TestPut(this=0x00007fc637041000, thread=0x00007fc6370dbc00, write_opts=0x00007fc6035feca8, read_opts=0x00007fc6035fe9c8, rand_column_families=<unavailable>, rand_keys=size=1, value={P\xe9_\x03\xc6\x7f\0\0}) at no_batched_ops_stress.cc:1317
frame https://github.com/facebook/rocksdb/issues/24: 0x000000000049361d db_stress`rocksdb::StressTest::OperateDb(this=0x00007fc637041000, thread=0x00007fc6370dbc00) at db_stress_test_base.cc:1148
...
```

Reviewed By: ltamasi

Differential Revision: D55157795

Pulled By: ajkr

fbshipit-source-id: 5f7c1380ead5794c29d41680028e34b839744764
2024-03-21 12:38:53 -07:00
Andrew Kryczka 3f5bd46a07 Add `ContinueCallback` to `GetMergeOperands()` (#12438)
Summary:
The use case is similar to `MergeOperator::ShouldMerge()` for `Get()`: preventing reads into LSM components for merge operands that are of no interest to the user. `MergeOperator::ShouldMerge()` cannot be reused here because:

- Its name does not make sense in the context of `GetMergeOperands()` since `GetMergeOperands()` never invokes merge
- The callback is part of the `MergeOperator`, but an option specific to the read operation makes more sense to me

If there are any ideas for an API design that covers both `MergeOperator::ShouldMerge()`'s use cases and `GetMergeOperandsOptions::continue_cb`'s use cases, that would be ideal, but for now this is what I came up with.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12438

Reviewed By: hx235

Differential Revision: D54914669

Pulled By: ajkr

fbshipit-source-id: 5f3ff78d3890adc0b1b74bedf3921221930ce63a
2024-03-15 12:25:49 -07:00
Yu Zhang 1104eaa35e Add initial support for TimedPut API (#12419)
Summary:
This PR adds support for `TimedPut` API. We introduced a new type `kTypeValuePreferredSeqno` for entries added to the DB via the `TimedPut` API.

The life cycle of such an entry on the write/flush/compaction paths are:

1) It is initially added to memtable as:
`<user_key, seq, kTypeValuePreferredSeqno>: {value, write_unix_time}`

2) When it's flushed to L0 sst files, it's converted to:
`<user_key, seq, kTypeValuePreferredSeqno>: {value, preferred_seqno}`
 when we have easy access to the seqno to time mapping.

3) During compaction, if certain conditions are met, we swap in the `preferred_seqno` and the entry will become:
`<user_key, preferred_seqno, kTypeValue>: value`. This step helps fast track these entries to the cold tier if they are eligible after the sequence number swap.

On the read path:
A `kTypeValuePreferredSeqno` entry acts the same as a `kTypeValue` entry, the unix_write_time/preferred seqno part packed in value is completely ignored.

Needed follow ups:
1) The seqno to time mapping accessible in flush needs to be extended to cover the `write_unix_time` for possible `kTypeValuePreferredSeqno` entries. This also means we need to track these `write_unix_time` in memtable.

2) Compaction filter support for the new `kTypeValuePreferredSeqno` type for feature parity with other `kTypeValue` and equivalent types.

3) Stress test coverage for the feature

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12419

Test Plan: Added unit tests

Reviewed By: pdillinger

Differential Revision: D54920296

Pulled By: jowlyzhang

fbshipit-source-id: c8b43f7a7c465e569141770e93c748371ff1da9e
2024-03-14 15:44:55 -07:00
Levi Tamasi 0ebe1614cb Eliminate some code duplication in MergeHelper (#12121)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12121

The patch eliminates some code duplication by unifying the two sets of `MergeHelper::TimedFullMerge` overloads using variadic templates. It also brings the order of parameters into sync when it comes to the various `TimedFullMerge*` methods.

Reviewed By: jaykorean

Differential Revision: D51862483

fbshipit-source-id: e3f832a6ff89ba34591451655cf11025d0a0d018
2023-12-05 14:07:42 -08:00
cz2h 324453e579 Fix rowcache get returning incorrect timestamp (#11952)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/7930.

When there is a timestamp associated with stored records, get from row cache will return the timestamp provided in query instead of the timestamp associated with the stored record.

## Cause of error:
Currently a row_handle is fetched using row_cache_key(contains a timestamp provided by user query) and the row_handle itself does not persist timestamp associated with the object. Hence the [GetContext::SaveValue()
](6e3429b8a6/table/get_context.cc (L257)) function will fetch the timestamp in row_cache_key and may return the incorrect timestamp value.

## Proposed Solution
If current cf enables ts, append a timestamp associated with stored records after the value in replay_log (equivalently the value of row cache entry).

When read, `replayGetContextLog()` will update parsed_key with the correct timestamp.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11952

Reviewed By: ajkr

Differential Revision: D51501176

Pulled By: jowlyzhang

fbshipit-source-id: 808fc943a8ae95de56ae0e82ec59a2573a031f28
2023-11-21 20:39:33 -08:00
chuhao zeng 8acf17002a Fix row cache falsely return kNotFound when timestamp enabled (#11816)
Summary:
**Summary:**
When row cache hits and a timestamp is being set in read_options, even though ROW_CACHE entry is hit, the return status is kNotFound.

**Cause of error:**
If timestamp is provided in readoptions, a callback for sequence number checking is registered [here](8fc78a3a9e/db/db_impl/db_impl.cc (L2112)).

Hence the default value set at this [line](694e49cbb1/table/get_context.cc (L611)) prevents get_context from saving value found in cache. Causing the final status to be kNotFound even though the entry exist in both cache and SST file.

**Proposed Solution**
Row cache key contains a sequence number in it. If the key for row cache lookup matches the key in cache, this cache entry should be good to be exposed to user and hence we reuse the sequence number in cache key rather than passing kMaxSequenceNumber.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11816

Reviewed By: ajkr

Differential Revision: D49419029

Pulled By: jowlyzhang

fbshipit-source-id: 6c77e9e751628d7d8e6c389f299e29a11ea824c6
2023-09-20 11:34:38 -07:00
Levi Tamasi f42e70bf56 Integrate FullMergeV3 into the query and compaction paths (#11858)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11858

The patch builds on https://github.com/facebook/rocksdb/pull/11807 and integrates the `FullMergeV3` API into the read and compaction code paths by updating and extending the logic in `MergeHelper`.

In particular, when it comes to merge inputs, the existing `TimedFullMergeWithEntity` is folded into `TimedFullMerge`, since wide-column base values are now handled the same way as plain base values (or no base values for that matter), e.g. they are passed directly to the `MergeOperator`. On the other hand, there is some new differentiation on the output side. Namely, there are now two sets of `TimedFullMerge` variants: one set for contexts where the complete merge result and its value type are needed (used by iterators and compactions), and another set where the merge result is needed in a form determined by the client (used by the point lookup APIs, where e.g. for `Get` we have to extract the value of the default column of any wide-column results).

Implementation-wise, the two sets of overloads use different visitors to process the `std::variant` produced by `FullMergeV3`. This has the benefit of eliminating some repeated code e.g. in the point lookup paths, since `TimedFullMerge` now populates the application's result object (`PinnableSlice`/`string` or `PinnableWideColumns`) directly. Moreover, within each set of variants, there is a separate overload for the no base value/plain base value/wide-column base value cases, which eliminates some repeated branching w/r/t to the type of the base value if any.

Reviewed By: jaykorean

Differential Revision: D49352562

fbshipit-source-id: c2fb9853dba3fbbc6918665bde4195c4ea150a0c
2023-09-19 17:27:04 -07:00
Peter Dillinger 206fdea3d9 Change internal headers with duplicate names (#11408)
Summary:
In IDE navigation I find it annoying that there are two statistics.h files (etc.) and often land on the wrong one. Here I migrate several headers to use the blah.h <- blah_impl.h <- blah.cc idiom. Although clang-format wants "blah.h" to be the top include for "blah.cc", I think overall this is an improvement.

No public API changes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11408

Test Plan: existing tests

Reviewed By: ltamasi

Differential Revision: D45456696

Pulled By: pdillinger

fbshipit-source-id: 809d931253f3272c908cf5facf7e1d32fc507373
2023-05-17 11:27:09 -07:00
Levi Tamasi 49881921cd Rename a recently added PerfContext counter (#11294)
Summary:
The patch renames the counter added in https://github.com/facebook/rocksdb/issues/11284 for better consistency with the existing naming scheme.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11294

Test Plan: `make check`

Reviewed By: jowlyzhang

Differential Revision: D44035964

Pulled By: ltamasi

fbshipit-source-id: 8b1a2a03ee728148365367e0ecc1fcf462f62191
2023-03-13 18:43:27 -07:00
Levi Tamasi 1d52438504 Add a PerfContext counter for merge operands applied in point lookups (#11284)
Summary:
The existing PerfContext counter `internal_merge_count` only tracks the
Merge operands applied during range scans. The patch adds a new counter
called `internal_merge_count_point_lookups` to track the same metric
for point lookups (`Get` / `MultiGet` / `GetEntity` / `MultiGetEntity`), and
also fixes a couple of cases in the iterator where the existing counter wasn't
updated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11284

Test Plan: `make check`

Reviewed By: jowlyzhang

Differential Revision: D43926082

Pulled By: ltamasi

fbshipit-source-id: 321566d8b4cf0a3b6c9b73b7a5c984fb9bb492e9
2023-03-08 18:22:11 -08:00
Levi Tamasi 3c9eed688e Enable moving a string or PinnableSlice into PinnableWideColumns (#11248)
Summary:
This makes it possible to eliminate some copies in `GetEntity` / `MultiGetEntity`,
in particular when `Merge`s or blobs are involved.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11248

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D43544215

Pulled By: ltamasi

fbshipit-source-id: bc4c8955a24bbd8bc4ab098e72133ead757f9707
2023-02-24 10:33:00 -08:00
Andrew Kryczka 25e1365227 Merge operator failed subcode (#11231)
Summary:
From HISTORY.md: Added a subcode of `Status::Corruption`, `Status::SubCode::kMergeOperatorFailed`, for users to identify corruption failures originating in the merge operator, as opposed to RocksDB's internally identified data corruptions.

This is a followup to https://github.com/facebook/rocksdb/issues/11092, where we gave users the ability to keep running a DB despite merge operator failing. Now that the DB keeps running despite such failures, they want to be able to distinguish such failures from real corruptions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11231

Test Plan: updated unit test

Reviewed By: akankshamahajan15

Differential Revision: D43396607

Pulled By: ajkr

fbshipit-source-id: 17fbcc779ad724dafada8abd73efd38e1c5208b9
2023-02-17 10:58:46 -08:00
Yu Zhang 24ac53d81a Use user key on sst file for blob verification for Get and MultiGet (#11105)
Summary:
Use the user key on sst file for blob verification for `Get` and `MultiGet` instead of the user key passed from caller.

Add tests for `Get` and `MultiGet` operations when user defined timestamp feature is enabled in a BlobDB.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11105

Test Plan:
make V=1 db_blob_basic_test
./db_blob_basic_test --gtest_filter="DBBlobTestWithTimestamp.*"

Reviewed By: ltamasi

Differential Revision: D42716487

Pulled By: jowlyzhang

fbshipit-source-id: 5987ecbb7e56ddf46d2467a3649369390789506a
2023-01-30 10:21:21 -08:00
sdong 4720ba4391 Remove RocksDB LITE (#11147)
Summary:
We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support.

Most of changes were done through following comments:

unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`

by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147

Test Plan: See CI

Reviewed By: pdillinger

Differential Revision: D42796341

fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
2023-01-27 13:14:19 -08:00
Andrew Kryczka b7fbcefda8 Add API to limit blast radius of merge operator failure (#11092)
Summary:
Prior to this PR, `FullMergeV2()` can only return `false` to indicate failure, which causes any operation invoking it to fail. During a compaction, such a failure causes the compaction to fail and causes the DB to irreversibly enter read-only mode. Some users asked for a way to allow the merge operator to fail without such widespread damage.

To limit the blast radius of merge operator failures, this PR introduces the `MergeOperationOutput::op_failure_scope` API. When unpopulated (`kDefault`) or set to `kTryMerge`, the merge operator failure handling is the same as before. When set to `kMustMerge`, merge operator failure still causes failure to operations that must merge (`Get()`, iterator, `MultiGet()`, etc.). However, under `kMustMerge`, flushes/compactions can survive merge operator failures by outputting the unmerged input operands.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11092

Reviewed By: siying

Differential Revision: D42525673

Pulled By: ajkr

fbshipit-source-id: 951dc3bf190f86347dccf3381be967565cda52ee
2023-01-20 14:40:30 -08:00
Levi Tamasi 2ea109521f Revisit the interface of MergeHelper::TimedFullMerge(WithEntity) (#10932)
Summary:
The patch refines/reworks `MergeHelper::TimedFullMerge(WithEntity)`
a bit in two ways. First, it eliminates the recently introduced `TimedFullMerge`
overload, which makes the responsibilities clearer by making sure the query
result (`value` for `Get`, `columns` for `GetEntity`) is set uniformly in
`SaveValue` and `GetContext`. Second, it changes the interface of
`TimedFullMergeWithEntity` so it exposes its result in a serialized form; this
is a more decoupled design which will come in handy when adding support
for `Merge` with wide-column entities to `DBIter`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10932

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D41129399

Pulled By: ltamasi

fbshipit-source-id: 69d8da358c77d4fc7e8c40f4dafc2c129a710677
2022-11-09 12:54:05 -08:00
Levi Tamasi fbd9077d66 Fix a bug where GetContext does not update READ_NUM_MERGE_OPERANDS (#10925)
Summary:
The patch fixes a bug where `GetContext::Merge` (and `MergeEntity`) does not update the ticker `READ_NUM_MERGE_OPERANDS` because it implicitly uses the default parameter value of `update_num_ops_stats=false` when calling `MergeHelper::TimedFullMerge`. Also, to prevent such issues going forward, the PR removes the default parameter values from the `TimedFullMerge` methods. In addition, it removes an unused/unnecessary parameter from `TimedFullMergeWithEntity`, and does some cleanup at the call sites of these methods.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10925

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D41096453

Pulled By: ltamasi

fbshipit-source-id: fc60646d32b4d516b8fe81e265c3f020a32fd7f8
2022-11-07 15:42:10 -08:00
Levi Tamasi 941d834739 Support Merge for wide-column entities during point lookups (#10916)
Summary:
The patch adds `Merge` support for wide-column entities to the point lookup
APIs, i.e. `Get`, `MultiGet`, `GetEntity`, and `GetMergeOperands`. (I plan to
update the iterator and compaction logic in separate PRs.) In terms of semantics,
the `Merge` operation is applied to the default (anonymous) column; any other
columns in the entity are unaffected.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10916

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D40962311

Pulled By: ltamasi

fbshipit-source-id: 244bc9d172be1af2f204796b2f89104e4d2fa373
2022-11-03 08:35:42 -07:00
Yanqin Jin 7d26e4c5a3 Basic Support for Merge with user-defined timestamp (#10819)
Summary:
This PR implements the originally disabled `Merge()` APIs when user-defined timestamp is enabled.

Simplest usage:
```cpp
// assume string append merge op is used with '.' as delimiter.
// ts1 < ts2
db->Put(WriteOptions(), "key", ts1, "v0");
db->Merge(WriteOptions(), "key", ts2, "1");
ReadOptions ro;
ro.timestamp = &ts2;
db->Get(ro, "key", &value);
ASSERT_EQ("v0.1", value);
```

Some code comments are added for clarity.

Note: support for timestamp in `DB::GetMergeOperands()` will be done in a follow-up PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10819

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D40603195

Pulled By: riversand963

fbshipit-source-id: f96d6f183258f3392d80377025529f7660503013
2022-10-31 22:28:58 -07:00
Levi Tamasi 7867a1112b Handle Merges correctly in GetEntity (#10894)
Summary:
The PR fixes the handling of `Merge`s in `GetEntity`. Note that `Merge` is not yet
supported for wide-column entities written using `PutEntity`; this change is
about returning correct (i.e. consistent with `Get`) results in cases like when the
base value is a plain old key-value written using `Put` or when there is no real base
value because we hit either a tombstone or the beginning of history.

Implementation-wise, the patch introduces a new wrapper around the existing
`MergeHelper::TimedFullMerge` that can store the merge result in either a string
(for the purposes of `Get`) or a `PinnableWideColumns` instance (for `GetEntity`).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10894

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D40782708

Pulled By: ltamasi

fbshipit-source-id: 3d700d56b2ef81f02ba1e2d93f6481bf13abcc90
2022-10-28 10:48:51 -07:00
Changyu Bi 9f2363f4c4 User-defined timestamp support for `DeleteRange()` (#10661)
Summary:
Add user-defined timestamp support for range deletion. The new API is `DeleteRange(opt, cf, begin_key, end_key, ts)`. Most of the change is to update the comparator to compare without timestamp. Other than that, major changes are
- internal range tombstone data structures (`FragmentedRangeTombstoneList`, `RangeTombstone`, etc.) to store timestamps.
- Garbage collection of range tombstones and range tombstone covered keys during compaction.
- Get()/MultiGet() to return the timestamp of a range tombstone when needed.
- Get/Iterator with range tombstones bounded by readoptions.timestamp.
- timestamp crash test now issues DeleteRange by default.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10661

Test Plan:
- Added unit test: `make check`
- Stress test: `python3 tools/db_crashtest.py --enable_ts whitebox --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
- Ran `db_bench` to measure regression when timestamp is not enabled. The tests are for write (with some range deletion) and iterate with DB fitting in memory: `./db_bench--benchmarks=fillrandom,seekrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=500000 --seek_nexts=10 --disable_auto_compactions -disable_wal=true --max_num_range_tombstones=1000`.  Did not see consistent regression in no timestamp case.

| micros/op | fillrandom | seekrandom |
| --- | --- | --- |
|main| 2.58 |10.96|
|PR 10661| 2.68 |10.63|

Reviewed By: riversand963

Differential Revision: D39441192

Pulled By: cbi42

fbshipit-source-id: f05aca3c41605caf110daf0ff405919f300ddec2
2022-09-30 16:13:03 -07:00
Yanqin Jin 07249fea8f Fix DBImpl::GetLatestSequenceForKey() for Merge (#10724)
Summary:
Currently, without this fix, DBImpl::GetLatestSequenceForKey() may not return the latest sequence number for merge operands of the key. This can cause conflict checking during optimistic transaction commit phase to fail. Fix it by always returning the latest sequence number of the key, also considering range tombstones.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10724

Test Plan: make check

Reviewed By: cbi42

Differential Revision: D39756847

Pulled By: riversand963

fbshipit-source-id: 0764c3dd4cb24960b37e18adccc6e7feed0e6876
2022-09-23 17:29:05 -07:00
Levi Tamasi 81388b36e0 Add support for wide-column point lookups (#10540)
Summary:
The patch adds a new API `GetEntity` that can be used to perform
wide-column point lookups. It also extends the `Get` code path and
the `MemTable` / `MemTableList` and `Version` / `GetContext` logic
accordingly so that wide-column entities can be served from both
memtables and SSTs. If the result of a lookup is a wide-column entity
(`kTypeWideColumnEntity`), it is passed to the application in deserialized
form; if it is a plain old key-value (`kTypeValue`), it is presented as a
wide-column entity with a single default (anonymous) column.
(In contrast, regular `Get` returns plain old key-values as-is, and
returns the value of the default column for wide-column entities, see
https://github.com/facebook/rocksdb/issues/10483 .)

The result of `GetEntity` is a self-contained `PinnableWideColumns` object.
`PinnableWideColumns` contains a `PinnableSlice`, which either stores the
underlying data in its own buffer or holds on to a cache handle. It also contains
a `WideColumns` instance, which indexes the contents of the `PinnableSlice`,
so applications can access the values of columns efficiently.

There are several pieces of functionality which are currently not supported
for wide-column entities: there is currently no `MultiGetEntity` or wide-column
iterator; also, `Merge` and `GetMergeOperands` are not supported, and there
is no `GetEntity` implementation for read-only and secondary instances.
We plan to implement these in future PRs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10540

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D38847474

Pulled By: ltamasi

fbshipit-source-id: 42311a34ccdfe88b3775e847a5e2a5296e002b5b
2022-08-19 11:51:12 -07:00
Levi Tamasi 24bcab7d5d Make queries return the value of the default column for wide-column entities (#10483)
Summary:
The patch adds support for wide-column entities to the existing query
APIs (`Get`, `MultiGet`, and iterator). Namely, when during a query a
wide-column entity is encountered, we will return the value of the default
(anonymous) column as the result. Later, we plan to add wide-column
specific query APIs which will enable retrieving entire wide-column entities
or a subset of their columns.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10483

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D38441881

Pulled By: ltamasi

fbshipit-source-id: 6444e79a31aff2470e866698e3a97985bc2b3543
2022-08-08 16:10:08 -07:00
Peter Dillinger e6c5e0ab9a Have Cache use Status::MemoryLimit (#10262)
Summary:
I noticed it would clean up some things to have Cache::Insert()
return our MemoryLimit Status instead of Incomplete for the case in
which the capacity limit is reached. I suspect this fixes some existing but
unknown bugs where this Incomplete could be confused with other uses
of Incomplete, especially no_io cases. This is the most suspicious case I
noticed, but was not able to reproduce a bug, in part because the existing
code is not covered by unit tests (FIXME added): 57adbf0e91/table/get_context.cc (L397)

I audited all the existing uses of IsIncomplete and updated those that
seemed relevant.

HISTORY updated with a clear warning to users of strict_capacity_limit=true
to update uses of `IsIncomplete()`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10262

Test Plan: updated unit tests

Reviewed By: hx235

Differential Revision: D37473155

Pulled By: pdillinger

fbshipit-source-id: 4bd9d9353ccddfe286b03ebd0652df8ce20f99cb
2022-07-06 14:41:46 -07:00
Levi Tamasi c73d2a9d18 Add API for writing wide-column entities (#10242)
Summary:
The patch builds on https://github.com/facebook/rocksdb/pull/9915 and adds
a new API called `PutEntity` that can be used to write a wide-column entity
to the database. The new API is added to both `DB` and `WriteBatch`. Note
that currently there is no way to retrieve these entities; more precisely, all
read APIs (`Get`, `MultiGet`, and iterator) return `NotSupported` when they
encounter a wide-column entity that is required to answer a query. Read-side
support (as well as other missing functionality like `Merge`, compaction filter,
and timestamp support) will be added in later PRs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10242

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D37369748

Pulled By: ltamasi

fbshipit-source-id: 7f5e412359ed7a400fd80b897dae5599dbcd685d
2022-06-25 15:30:47 -07:00
Yanqin Jin 3e02c6e05a Point-lookup returns timestamps of Delete and SingleDelete (#10056)
Summary:
If caller specifies a non-null `timestamp` argument in `DB::Get()` or a non-null `timestamps` in `DB::MultiGet()`,
RocksDB will return the timestamps of the point tombstones.

Note: DeleteRange is still unsupported.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10056

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D36677956

Pulled By: riversand963

fbshipit-source-id: 2d7af02cc7237b1829cd269086ea895a49d501ae
2022-06-03 20:00:42 -07:00
Levi Tamasi dc5de45af8 Support readahead during compaction for blob files (#9187)
Summary:
The patch adds a new BlobDB configuration option `blob_compaction_readahead_size`
that can be used to enable prefetching data from blob files during compaction.
This is important when using storage with higher latencies like HDDs or remote filesystems.
If enabled, prefetching is used for all cases when blobs are read during compaction,
namely garbage collection, compaction filters (when the existing value has to be read from
a blob file), and `Merge` (when the value of the base `Put` is stored in a blob file).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9187

Test Plan: Ran `make check` and the stress/crash test.

Reviewed By: riversand963

Differential Revision: D32565512

Pulled By: ltamasi

fbshipit-source-id: 87be9cebc3aa01cc227bec6b5f64d827b8164f5d
2021-11-19 17:53:47 -08:00
mrambacher 13ae16c315 Cleanup includes in dbformat.h (#8930)
Summary:
This header file was including everything and the kitchen sink when it did not need to.  This resulted in many places including this header when they needed other pieces instead.

Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing.

Hopefully, this sort of code hygiene cleanup will speed up the builds...

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930

Reviewed By: pdillinger

Differential Revision: D31142788

Pulled By: mrambacher

fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d
2021-09-29 04:04:40 -07:00
Akanksha Mahajan 3897ce3125 Support for Merge in Integrated BlobDB with base values (#8292)
Summary:
This PR add support for Merge operation in Integrated BlobDB with base values(i.e DB::Put). Merged values can be retrieved through  DB::Get, DB::MultiGet, DB::GetMergeOperands and Iterator operation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8292

Test Plan: Add new unit tests

Reviewed By: ltamasi

Differential Revision: D28415896

Pulled By: akankshamahajan15

fbshipit-source-id: e9b3478bef51d2f214fb88c31ed3c8d2f4a531ff
2021-06-10 12:58:37 -07:00
mrambacher 3dff28cf9b Use SystemClock* instead of std::shared_ptr<SystemClock> in lower level routines (#8033)
Summary:
For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>.  The shared ptr has some performance degradation on certain hardware classes.

For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere.  For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it.  The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource.

There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold.  In those cases, the shared pointer was preserved.

Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17:

6.17: readrandom   :      28.046 micros/op 854902 ops/sec;   61.3 MB/s (355999 of 355999 found)
6.18: readrandom   :      32.615 micros/op 735306 ops/sec;   52.7 MB/s (290999 of 290999 found)
PR: readrandom   :      27.500 micros/op 871909 ops/sec;   62.5 MB/s (367999 of 367999 found)

(Note that the times for 6.18 are prior to revert of the SystemClock).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033

Reviewed By: pdillinger

Differential Revision: D27014563

Pulled By: mrambacher

fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67
2021-03-15 04:34:11 -07:00
Yanqin Jin 82b3888433 Enable backward iterator for keys with user-defined timestamp (#8035)
Summary:
This PR does the following:

- Enable backward iteration for keys with user-defined timestamp. Note that merge, single delete, range delete are not supported yet.
- Introduces a new helper API `Comparator::EqualWithoutTimestamp()`.
- Fix a typo in `SetTimestamp()`.
- Add/update unit tests

Run db_bench (built with DEBUG_LEVEL=0) to demonstrate that no overhead is introduced for CPU-intensive workloads with a lot of `Prev()`. Also provided results of iterating keys with timestamps.

1. Disable timestamp, run:
```
./db_bench -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5
```
Results:
> Baseline
> - seekrandom [AVG    6 runs] : 96115 ops/sec;   53.2 MB/sec
> - seekrandom [MEDIAN 6 runs] : 98075 ops/sec;   54.2 MB/sec
>
> This PR
> - seekrandom [AVG    6 runs] : 95521 ops/sec;   52.8 MB/sec
> - seekrandom [MEDIAN 6 runs] : 96338 ops/sec;   53.3 MB/sec

2. Enable timestamp, run:
```
./db_bench -user_timestamp_size=8  -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5
```
Result:
> Baseline: not supported
>
> This PR
> - seekrandom [AVG    6 runs] : 90514 ops/sec;   50.1 MB/sec
> - seekrandom [MEDIAN 6 runs] : 90834 ops/sec;   50.2 MB/sec

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8035

Reviewed By: ltamasi

Differential Revision: D26926668

Pulled By: riversand963

fbshipit-source-id: 95330cc2242397c03e09d29e5417dfb0adc98ef5
2021-03-10 11:15:46 -08:00
mrambacher 12f1137355 Add a SystemClock class to capture the time functions of an Env (#7858)
Summary:
Introduces and uses a SystemClock class to RocksDB.  This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock.

Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead.  There are likely more places that can be changed, but this is a start to show what can/should be done.  Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock.

There are several Env classes that implement these functions.  Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR.  It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc).

Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858

Reviewed By: pdillinger

Differential Revision: D26006406

Pulled By: mrambacher

fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90
2021-01-25 22:09:11 -08:00
Levi Tamasi e8cb32ed67 Introduce BlobFileCache and add support for blob files to Get() (#7540)
Summary:
The patch adds blob file support to the `Get` API by extending `Version` so that
whenever a blob reference is read from a file, the blob is retrieved from the corresponding
blob file and passed back to the caller. (This is assuming the blob reference is valid
and the blob file is actually part of the given `Version`.) It also introduces a cache
of `BlobFileReader`s called `BlobFileCache` that enables sharing `BlobFileReader`s
between callers. `BlobFileCache` uses the same backing cache as `TableCache`, so
`max_open_files` (if specified) limits the total number of open (table + blob) files.

TODO: proactively open/cache blob files and pin the cache handles of the readers in the
metadata objects similarly to what `VersionBuilder::LoadTableHandlers` does for
table files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7540

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D24260219

Pulled By: ltamasi

fbshipit-source-id: a8a2a4f11d3d04d6082201b52184bc4d7b0857ba
2020-10-15 13:04:47 -07:00
Yanqin Jin 961c7590d6 Add timestamp to delete (#6253)
Summary:
Preliminary user-timestamp support for delete.

If ["a", ts=100] exists, you can delete it by calling `DB::Delete(write_options, key)` in which `write_options.timestamp` points to a `ts` higher than 100.

Implementation
A new ValueType, i.e. `kTypeDeletionWithTimestamp` is added for deletion marker with timestamp.
The reason for a separate `kTypeDeletionWithTimestamp`: RocksDB may drop tombstones (keys with kTypeDeletion) when compacting them to the bottom level. This is OK and useful if timestamp is disabled. When timestamp is enabled, should we still reuse `kTypeDeletion`, we may drop the tombstone with a more recent timestamp, causing deleted keys to re-appear.

Test plan (dev server)
```
make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6253

Reviewed By: ltamasi

Differential Revision: D20995328

Pulled By: riversand963

fbshipit-source-id: a9e5c22968ad76f98e3dc6ee0151265a3f0df619
2020-05-28 10:40:03 -07:00
Peter Dillinger 249eff0f30 Stats for redundant insertions into block cache (#6681)
Summary:
Since read threads do not coordinate on loading data into block
cache, two threads between Lookup and Insert can end up loading and
inserting the same data. This is particularly concerning with
cache_index_and_filter_blocks since those are hot and more likely to
be race targets if ejected from (or not pre-populated in) the cache.

Particularly with moves toward disaggregated / network storage, the cost
of redundant retrieval might be high, and we should at least have some
hard statistics from which we can estimate impact.

Example with full filter thrashing "cliff":

    $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10
    ...
    $ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((130 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 | egrep '^rocksdb.block.cache.(.*add|.*redundant)' | grep -v compress | sort
    rocksdb.block.cache.add COUNT : 14181
    rocksdb.block.cache.add.failures COUNT : 0
    rocksdb.block.cache.add.redundant COUNT : 476
    rocksdb.block.cache.data.add COUNT : 12749
    rocksdb.block.cache.data.add.redundant COUNT : 18
    rocksdb.block.cache.filter.add COUNT : 1003
    rocksdb.block.cache.filter.add.redundant COUNT : 217
    rocksdb.block.cache.index.add COUNT : 429
    rocksdb.block.cache.index.add.redundant COUNT : 241
    $ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((120 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 | egrep '^rocksdb.block.cache.(.*add|.*redundant)' | grep -v compress | sort
    rocksdb.block.cache.add COUNT : 1182223
    rocksdb.block.cache.add.failures COUNT : 0
    rocksdb.block.cache.add.redundant COUNT : 302728
    rocksdb.block.cache.data.add COUNT : 31425
    rocksdb.block.cache.data.add.redundant COUNT : 12
    rocksdb.block.cache.filter.add COUNT : 795455
    rocksdb.block.cache.filter.add.redundant COUNT : 130238
    rocksdb.block.cache.index.add COUNT : 355343
    rocksdb.block.cache.index.add.redundant COUNT : 172478
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6681

Test Plan: Some manual testing (above) and unit test covering key metrics is included

Reviewed By: ltamasi

Differential Revision: D21134113

Pulled By: pdillinger

fbshipit-source-id: c11497b5f00f4ffdfe919823904e52d0a1a91d87
2020-04-27 13:20:27 -07:00
Huisheng Liu 904a60ff63 return timestamp from get (#6409)
Summary:
Added new Get() methods that return timestamp. Dummy implementation is given so that classes derived from DB don't need to be touched to provide their implementation. MultiGet is not included.

ReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
    base line (commit 72ee067b9):
        101.712 micros/op 314602 ops/sec;   36.0 MB/s (5658999 of 5658999 found)
    This PR:
        100.288 micros/op 319071 ops/sec;   36.5 MB/s (5674999 of 5674999 found)

./db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=readrandom --use_existing_db=1 --num=25000000 --threads=32
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6409

Differential Revision: D20200086

Pulled By: riversand963

fbshipit-source-id: 490edd74d924f62bd8ae9c29c2a6bbbb8410ca50
2020-03-02 16:01:00 -08:00
sdong fdf882ded2 Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433)
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433

Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.

Differential Revision: D19977691

fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
2020-02-20 12:09:57 -08:00
Vijay Nadimpalli d150e01474 New API to get all merge operands for a Key (#5604)
Summary:
This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases:
1. Update subset of columns and read subset of columns -
Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU.
2. Updating very few attributes in a value which is a JSON-like document -
Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge.
----------------------------------------------------------------------------------------------------
API :
Status GetMergeOperands(
      const ReadOptions& options, ColumnFamilyHandle* column_family,
      const Slice& key, PinnableSlice* merge_operands,
      GetMergeOperandsOptions* get_merge_operands_options,
      int* number_of_operands)

Example usage :
int size = 100;
int number_of_operands = 0;
std::vector<PinnableSlice> values(size);
GetMergeOperandsOptions merge_operands_info;
db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), merge_operands_info, &number_of_operands);

Description :
Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than merge_operands_options.expected_max_number_of_operands no merge operands are returned and status is Incomplete. Merge operands returned are in the order of insertion.
merge_operands-> Points to an array of at-least merge_operands_options.expected_max_number_of_operands and the caller is responsible for allocating it. If the status returned is Incomplete then number_of_operands will contain the total number of merge operands found in DB for key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604

Test Plan:
Added unit test and perf test in db_bench that can be run using the command:
./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist

Differential Revision: D16657366

Pulled By: vjnadimpalli

fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf
2019-08-06 14:26:44 -07:00
haoyuhuang 6edc5d0719 Block cache tracing: Associate a unique id with Get and MultiGet (#5514)
Summary:
This PR associates a unique id with Get and MultiGet. This enables us to track how many blocks a Get/MultiGet request accesses. We can also measure the impact of row cache vs block cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5514

Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32

Differential Revision: D16032681

Pulled By: HaoyuHuang

fbshipit-source-id: 775b05f4440badd58de6667e3ec9f4fc87a0af4c
2019-07-03 19:35:41 -07:00
Yanqin Jin 340ed4fac7 Add support for timestamp in Get/Put (#5079)
Summary:
It's useful to be able to (optionally) associate key-value pairs with user-provided timestamps. This PR is an early effort towards this goal and continues the work of facebook#4942. A suite of new unit tests exist in DBBasicTestWithTimestampWithParam. Support for timestamp requires the user to provide timestamp as a slice in `ReadOptions` and `WriteOptions`. All timestamps of the same database must share the same length, format, etc. The format of the timestamp is the same throughout the same database, and the user is responsible for providing a comparator function (Comparator) to order the <key, timestamp> tuples. Once created, the format and length of the timestamp cannot change (at least for now).

Test plan (on devserver):
```
$COMPILE_WITH_ASAN=1 make -j32 all
$./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/*
$make check
```
All tests must pass.

We also run the following db_bench tests to verify whether there is regression on Get/Put while timestamp is not enabled.
```
$TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000
$TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000
```
Repeat for 6 times for both versions.

Results are as follows:
```
|        | readrandom | fillrandom |
| master | 16.77 MB/s | 47.05 MB/s |
| PR5079 | 16.44 MB/s | 47.03 MB/s |
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5079

Differential Revision: D15132946

Pulled By: riversand963

fbshipit-source-id: 833a0d657eac21182f0f206c910a6438154c742c
2019-06-05 23:10:47 -07:00
Siying Dong f184bee77b PlainTable should avoid copying Get() results from immortal source. (#4924)
Summary:
https://github.com/facebook/rocksdb/pull/4053 avoids memcopy for Get() results if files are immortable
(read-only DB, max_open_files=-1) and the file is ammaped. The same optimization is being applied to PlainTable
here.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4924

Differential Revision: D13827749

Pulled By: siying

fbshipit-source-id: 1f2cbfc530b40ce08ccd53f95f6e78de4d1c2f96
2019-01-25 17:12:19 -08:00
Andrew Kryczka 8ec3e72551 Cache dictionary used for decompressing data blocks (#4881)
Summary:
- If block cache disabled or not used for meta-blocks, `BlockBasedTableReader::Rep::uncompression_dict` owns the `UncompressionDict`. It is preloaded during `PrefetchIndexAndFilterBlocks`.
- If block cache is enabled and used for meta-blocks, block cache owns the `UncompressionDict`, which holds dictionary and digested dictionary when needed. It is never prefetched though there is a TODO for this in the code. The cache key is simply the compression dictionary block handle.
- New stats for compression dictionary accesses in block cache: "BLOCK_CACHE_COMPRESSION_DICT_*" and "compression_dict_block_read_count"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4881

Differential Revision: D13663801

Pulled By: ajkr

fbshipit-source-id: bdcc54044e180855cdcc57639b493b0e016c9a3f
2019-01-23 18:15:47 -08:00
Abhishek Madan 8c78348c77 Use only "local" range tombstones during Get (#4449)
Summary:
Previously, range tombstones were accumulated from every level, which
was necessary if a range tombstone in a higher level covered a key in a lower
level. However, RangeDelAggregator::AddTombstones's complexity is based on
the number of tombstones that are currently stored in it, which is wasteful in
the Get case, where we only need to know the highest sequence number of range
tombstones that cover the key from higher levels, and compute the highest covering
sequence number at the current level. This change introduces this optimization, and
removes the use of RangeDelAggregator from the Get path.

In the benchmark results, the following command was used to initialize the database:
```
./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8
```

...and the following command was used to measure read throughput:
```
./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32
```

The filluniquerandom command was only run once, and the resulting database was used
to measure read performance before and after the PR. Both binaries were compiled with
`DEBUG_LEVEL=0`.

Readrandom results before PR:
```
readrandom   :       4.544 micros/op 220090 ops/sec;   16.9 MB/s (63103 of 100000 found)
```

Readrandom results after PR:
```
readrandom   :      11.147 micros/op 89707 ops/sec;    6.9 MB/s (63103 of 100000 found)
```

So it's actually slower right now, but this PR paves the way for future optimizations (see #4493).

----
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4449

Differential Revision: D10370575

Pulled By: abhimadan

fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d
2018-10-24 12:31:12 -07:00
Andrey Zagrebin fa4de6e30f #3865 followup for fix performance degression introduced by switching order of operands (#4284)
Summary:
Followup for #4266. There is one more place in **get_context.cc** where **MergeOperator::ShouldMerge** should be called with reversed list of operands.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4284

Differential Revision: D9380008

Pulled By: sagar0

fbshipit-source-id: 70ec26e607e5b88465e1acbdcd6c6171bd76b9f2
2018-08-17 10:57:25 -07:00
Zhongyi Xie f95a5b2464 Avoid unnecessary big for-loop when reporting ticker stats stored in GetContext (#3490)
Summary:
Currently in `Version::Get` when reporting ticker stats stored in `GetContext`, there is a big for-loop through all `Ticker` which adds unnecessary cost to overall CPU usage. We can optimize by storing only ticker values that are used in `Get()` calls in a new struct `GetContextStats` since only a small fraction of all tickers are used in `Get()` calls. For comparison, with the new approach we only need to visit 17 values while old approach will require visiting 100+ `Ticker`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3490

Differential Revision: D6969154

Pulled By: miasantreble

fbshipit-source-id: fc27072965a3a94125a3e6883d20dafcf5b84029
2018-07-20 16:58:13 -07:00
David Lai 3be9b36453 comment unused parameters to turn on -Wunused-parameter flag
Summary:
This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557.
Closes https://github.com/facebook/rocksdb/pull/3662

Differential Revision: D7426121

Pulled By: Dayvedde

fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352
2018-04-12 17:59:16 -07:00
Maysam Yabandeh 67182678a5 Stats for false positive rate of full filtesr
Summary:
Adds two stats to allow us measuring the false positive rate of full filters:
- The total count of positives: rocksdb.bloom.filter.full.positive
- The total count of true positives: rocksdb.bloom.filter.full.true.positive
Not the term "full" in the stat name to indicate that they are meaningful in full filters. block-based filters are to be deprecated soon and supporting it is not worth the the additional cost of if-then-else branches.

Closes #3680

Tested by:
$ ./db_bench -benchmarks=fillrandom  -db /dev/shm/rocksdb-tmpdb --num=1000000 -bloom_bits=10
$ ./db_bench -benchmarks="readwhilewriting"  -db /dev/shm/rocksdb-tmpdb --statistics -bloom_bits=10 --duration=60 --num=2000000 --use_existing_db 2>&1 > /tmp/full.log
$ grep filter.full /tmp/full.log
rocksdb.bloom.filter.full.positive COUNT : 3628593
rocksdb.bloom.filter.full.true.positive COUNT : 3536026
which gives the false positive rate of 2.5%
Closes https://github.com/facebook/rocksdb/pull/3681

Differential Revision: D7517570

Pulled By: maysamyabandeh

fbshipit-source-id: 630ab1a473afdce404916d297035b6318de4c052
2018-04-05 15:58:48 -07:00
Andrew Kryczka 5d68243e61 Comment out unused variables
Summary:
Submitting on behalf of another employee.
Closes https://github.com/facebook/rocksdb/pull/3557

Differential Revision: D7146025

Pulled By: ajkr

fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e
2018-03-05 13:13:41 -08:00
Igor Sugak aba3409740 Back out "[codemod] - comment out unused parameters"
Reviewed By: igorsugak

fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37
2018-02-22 12:43:17 -08:00