Commit graph

11927 commits

Author SHA1 Message Date
Andrew Kryczka 097f9f4425 Fix CompactionIterator flag for penultimate level output (#10967)
Summary:
We were not resetting it in non-debug mode so it could be true once and then stay true for future keys where it should be false. This PR adds the reset logic.
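The failure mode described above can be sketched with a toy iterator. This is only an illustration under assumed names (`ToyCompactionIterator`, `ProcessKey`, and `RouteKeys` are hypothetical and not RocksDB's API): a per-key flag has to be reset unconditionally at the start of each key, otherwise a single `true` sticks for all later keys.

```cpp
#include <vector>

// Hypothetical stand-in for an iterator that decides, key by key, whether
// the output goes to the penultimate level. Not the real CompactionIterator.
class ToyCompactionIterator {
 public:
  // `is_recent` stands for whatever condition routes a key to the
  // penultimate level (an assumption made for this sketch).
  void ProcessKey(bool is_recent) {
    // The reset must happen in every build mode, not only in debug builds.
    output_to_penultimate_level_ = false;
    if (is_recent) {
      output_to_penultimate_level_ = true;
    }
  }

  bool output_to_penultimate_level() const {
    return output_to_penultimate_level_;
  }

 private:
  bool output_to_penultimate_level_ = false;
};

// Feeds a sequence of keys through the iterator and records the decision
// made for each one.
std::vector<bool> RouteKeys(const std::vector<bool>& recency) {
  ToyCompactionIterator iter;
  std::vector<bool> decisions;
  for (bool is_recent : recency) {
    iter.ProcessKey(is_recent);
    decisions.push_back(iter.output_to_penultimate_level());
  }
  return decisions;
}
```

Without the reset line, every key after the first recent one would also be reported as going to the penultimate level, which matches the "Y was always zero" symptom in the test plan.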

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10967

Test Plan:
- built `db_bench` with DEBUG_LEVEL=0
- ran benchmark: `TEST_TMPDIR=/dev/shm/prefix ./db_bench -benchmarks=fillrandom -compaction_style=1 -preserve_internal_time_seconds=100 -preclude_last_level_data_seconds=10 -write_buffer_size=1048576 -target_file_size_base=1048576 -subcompactions=8 -duration=120`
- compared "output_to_penultimate_level: X bytes + last: Y bytes" lines in LOG output
  - Before this fix, Y was always zero
  - After this fix, Y gradually increased throughout the benchmark

Reviewed By: riversand963

Differential Revision: D41417726

Pulled By: ajkr

fbshipit-source-id: ace1e9a289e751a5b0c2fbaa8addd4eda5525329
2022-11-21 16:14:03 -08:00
Peter Dillinger 3182beeffc Observe and warn about misconfigured HyperClockCache (#10965)
Summary:
Background. One of the core risks of choosing HyperClockCache is ending up with degraded performance if estimated_entry_charge is very significantly wrong. Too low leads to under-utilized hash table, which wastes a bit of (tracked) memory and likely increases access times due to larger working set size (more TLB misses). Too high leads to fully populated hash table (at some limit with reasonable lookup performance) and not being able to cache as many objects as the memory limit would allow. In either case, performance degradation is graceful/continuous but can be quite significant. For example, cutting block size in half without updating estimated_entry_charge could lead to a large portion of configured block cache memory (up to roughly 1/3) going unused.

Fix. This change adds a mechanism through which the DB periodically probes the block cache(s) for "problems" to report, and adds diagnostics to the HyperClockCache for bad estimated_entry_charge. The periodic probing is currently done with DumpStats / stats_dump_period_sec, and diagnostics reported to info_log (normally LOG file).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10965

Test Plan:
unit test included. Doesn't cover all the implemented subtleties of reporting, but ensures basics of when to report or not.

Also manual testing with db_bench. Create db with
```
./db_bench --benchmarks=fillrandom,flush --num=3000000 --disable_wal=1
```
Use and check LOG file for HyperClockCache for various block sizes (used as estimated_entry_charge)
```
./db_bench --use_existing_db --benchmarks=readrandom --num=3000000 --duration=20 --stats_dump_period_sec=8 --cache_type=hyper_clock_cache -block_size=XXXX
```
Seeing warnings / errors or not as expected.

Reviewed By: anand1976

Differential Revision: D41406932

Pulled By: pdillinger

fbshipit-source-id: 4ca56162b73017e4b9cec2cad74466f49c27a0a7
2022-11-21 12:08:21 -08:00
Yanqin Jin a8a4ed52a4 Test Merge with timestamps in stress test (#10948)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10948

Test Plan: make crash_test_with_ts

Reviewed By: ltamasi

Differential Revision: D41390854

Pulled By: riversand963

fbshipit-source-id: 599e114da8e2b2bbff5628fb8c67fa0393a31c05
2022-11-17 20:43:50 -08:00
Peter Dillinger 8c0f5b1fcf Mark HyperClockCache as production-ready (#10963)
Summary:
After a couple of minor bug fixes and successful production roll-outs in a few places, I think we can mark this as production-ready. It has a clear value proposition for many workloads, even if we don't have clear advice for every workload yet.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10963

Test Plan: existing tests, comment changes only

Reviewed By: siying

Differential Revision: D41384083

Pulled By: pdillinger

fbshipit-source-id: 56359f01a57bb28de8697666b342382fac72ce6d
2022-11-17 14:44:59 -08:00
Levi Tamasi 8fa8780932 Mention wide-column support in HISTORY.md (#10959)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10959

Reviewed By: akankshamahajan15

Differential Revision: D41348198

Pulled By: ltamasi

fbshipit-source-id: 51e89d03c1fe87f576a766f609a7f233a519c83d
2022-11-16 12:22:35 -08:00
Peter Dillinger 32520df1d9 Remove prototype FastLRUCache (#10954)
Summary:
This was just a stepping stone to what eventually became HyperClockCache, and is now just more code to maintain.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10954

Test Plan: tests updated

Reviewed By: akankshamahajan15

Differential Revision: D41310123

Pulled By: pdillinger

fbshipit-source-id: 618ee148a1a0a29ee756ba8fe28359617b7cd67c
2022-11-16 10:15:55 -08:00
Peter Dillinger b55e70357c Re-arrange cache.h to prepare for refactoring (#10942)
Summary:
No material changes to code or comments, just re-arranging things to prepare for a big refactoring, making it easier to see what changed. Some specifics:
* This groups things together in Cache in anticipation of secondary cache features being marked production-ready (vs. experimental).
* CacheEntryRole will be needed in definition of class Cache, so that has been moved above it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10942

Test Plan: existing tests

Reviewed By: anand1976

Differential Revision: D41205509

Pulled By: pdillinger

fbshipit-source-id: 3f2559ab1651c758918dc97056951fa2b5eb0348
2022-11-15 10:47:15 -08:00
Levi Tamasi b644baa1eb Support using GetMergeOperands for verification with wide columns (#10952)
Summary:
With the recent changes, `GetMergeOperands` is now supported for wide-column entities as well, so we can use it for verification purposes in the non-batched stress tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10952

Test Plan: Ran a simple non-batched ops blackbox crash test.

Reviewed By: riversand963

Differential Revision: D41292114

Pulled By: ltamasi

fbshipit-source-id: 70b4c756a4a1fecb445c16c7096aad805a51203c
2022-11-15 08:06:41 -08:00
Akanksha Mahajan 1562524e63 Fix db_stress failure in async_io in FilePrefetchBuffer (#10949)
Summary:
Fix db_stress failure in async_io in FilePrefetchBuffer.

From the logs, assertion was caused when
- prev_offset_ = offset but somehow prev_len != 0 and explicit_prefetch_submitted_ = true. That scenario is when we send async request to prefetch buffer during seek but in second seek that data is found in cache. prev_offset_ and prev_len_ get updated but we were not setting explicit_prefetch_submitted_ = false because of which buffers were getting out of sync.
It's possible a read by another thread might have loaded the block into the cache in the meantime.

Particular assertion example:
```
prev_offset: 0, prev_len_: 8097 , offset: 0, length: 8097, actual_length: 8097 , actual_offset: 0 ,
curr_: 0, bufs_[curr_].offset_: 4096 ,bufs_[curr_].CurrentSize(): 48541 , async_len_to_read: 278528, bufs_[curr_].async_in_progress_: false
second: 1, bufs_[second].offset_: 282624 ,bufs_[second].CurrentSize(): 0, async_len_to_read: 262144 ,bufs_[second].async_in_progress_: true ,
explicit_prefetch_submitted_: true , copy_to_third_buffer: false
```
As we can see curr_ was expected to read 278528 but it read 48541. Also buffers are out of sync.
Also `explicit_prefetch_submitted_` is set true but prev_len not 0.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10949

Test Plan:
- Ran db_bench to make sure there is no regression.
- Ran db_stress, which fails without this fix.
- Ran build-linux-mini-crashtest 7-8 times locally + CircleCI

Reviewed By: anand1976

Differential Revision: D41257786

Pulled By: akankshamahajan15

fbshipit-source-id: 1d100f94f8c06bbbe4cc76ca27f1bbc820c2494f
2022-11-14 16:14:41 -08:00
xiaochenfan 0993c9225f Fix broken dependency: update zlib from 1.2.12 to 1.2.13 (#10833)
Summary:
zlib(https://zlib.net/) has released v1.2.13.

1.2.12 is no longer available for download, so the rocksdb Makefile breaks because it can't find the source .tar.gz.

https://nvd.nist.gov/vuln/detail/CVE-2022-37434

This PR updates the version number and the shasum of the new .tar.gz file (1.2.13).

Fixes https://github.com/facebook/rocksdb/issues/10876

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10833

Reviewed By: hx235

Differential Revision: D40575954

Pulled By: ajkr

fbshipit-source-id: 3e560e453ddf58d045214fc4e64f83bef91f22e5
2022-11-14 11:49:06 -08:00
akankshamahajan 8515437594 Update unit test to avoid timeout (#10950)
Summary:
Update unit test to avoid timeout

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10950

Reviewed By: hx235

Differential Revision: D41258892

Pulled By: akankshamahajan15

fbshipit-source-id: cbfe94da63e9e54544a307845deb79ba42458301
2022-11-14 11:39:22 -08:00
anand76 ecba6a320e Add some async read stats (#10947)
Summary:
Add stats for time spent in the ReadAsync call, and async read errors.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10947

Test Plan: Run db_bench and look at stats

Reviewed By: akankshamahajan15

Differential Revision: D41236637

Pulled By: anand1976

fbshipit-source-id: 70539b69a28491d57acead449436a761f7108acf
2022-11-13 21:38:35 -08:00
Peter Dillinger f321e8fc98 Don't attempt to use SecondaryCache on block_cache_compressed (#10944)
Summary:
Compressed block cache depends on reading the block compression marker beyond the payload block size. Only the payload bytes were being saved and loaded from SecondaryCache -> boom!

This removes some unnecessary code attempting to combine these two competing features. Note that BlockContents was previously used for block-based filter in block cache, but that support has been removed.

Also marking block_cache_compressed as deprecated in this commit as we expect it to be replaced with SecondaryCache.

This problem was discovered during refactoring but didn't want to combine bug fix with that refactoring.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10944

Test Plan: test added that fails on base revision (at least with ASAN)

Reviewed By: akankshamahajan15

Differential Revision: D41205578

Pulled By: pdillinger

fbshipit-source-id: 1b29d36c7a6552355ac6511fcdc67038ef4af29f
2022-11-11 17:35:53 -08:00
Levi Tamasi 5e8947057b Support Merge for wide-column entities in the compaction logic (#10946)
Summary:
The patch extends the compaction logic to handle `Merge`s in conjunction with wide-column entities. As usual, the merge operation is applied to the anonymous default column, and any other columns are unaffected.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10946

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D41233722

Pulled By: ltamasi

fbshipit-source-id: dfd9b1362222f01bafcecb139eb48480eb279fed
2022-11-11 16:32:32 -08:00
akankshamahajan d1aca4a5ae Fix async_io regression in scans (#10939)
Summary:
Fix async_io regression in scans due to incorrect check which was causing the valid data in buffer to be cleared during seek.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10939

Test Plan:
- stress tests  export CRASH_TEST_EXT_ARGS="--async_io=1"
    make crash_test -j32
- Ran the db_bench command which caught the regression:
./db_bench --db=/rocksdb_async_io_testing/prefix_scan --disable_wal=1 --use_existing_db=true --benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=963 -duration=30 -ops_between_duration_checks=1 --async_io=true --compaction_readahead_size=4194304 --log_readahead_size=0 --blob_compaction_readahead_size=0 --initial_auto_readahead_size=65536 --num_file_reads_for_auto_readahead=0 --max_auto_readahead_size=524288

seekrandom   :    3777.415 micros/op 264 ops/sec 30.000 seconds 7942 operations;  132.3 MB/s (7942 of 7942 found)

Reviewed By: anand1976

Differential Revision: D41173899

Pulled By: akankshamahajan15

fbshipit-source-id: 2d75b06457d65b1851c92382565d9c3fac329dfe
2022-11-11 13:34:49 -08:00
Levi Tamasi dbc4101b89 Support Merge with wide-column entities in iterator (#10941)
Summary:
The patch adds `Merge` support for wide-column entities in `DBIter`. As before, the `Merge` operation is applied to the default column of the entity; any other columns are unchanged. As a small cleanup, the PR also changes the signature of `DBIter::Merge` to simply return a boolean instead of the `Merge` operation's `Status` since the actual `Status` is already stored in a member variable.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10941

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D41195471

Pulled By: ltamasi

fbshipit-source-id: 362cf555897296e252c3de5ddfbd569ef34f85ef
2022-11-10 18:00:08 -08:00
Levi Tamasi 9460d4b77e Refactor MergeHelper::MergeUntil a bit (#10943)
Summary:
The patch untangles some nested ifs in `MergeHelper::MergeUntil`. This will come in handy when extending the compaction logic to support `Merge` for wide-column entities, and also enables us to eliminate some repeated branching on value type and to decrease the scope of some variables.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10943

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D41201946

Pulled By: ltamasi

fbshipit-source-id: 890bd3d4e31cdccadca614489a94686d76485ba9
2022-11-10 17:29:57 -08:00
Levi Tamasi 2ea109521f Revisit the interface of MergeHelper::TimedFullMerge(WithEntity) (#10932)
Summary:
The patch refines/reworks `MergeHelper::TimedFullMerge(WithEntity)`
a bit in two ways. First, it eliminates the recently introduced `TimedFullMerge`
overload, which makes the responsibilities clearer by making sure the query
result (`value` for `Get`, `columns` for `GetEntity`) is set uniformly in
`SaveValue` and `GetContext`. Second, it changes the interface of
`TimedFullMergeWithEntity` so it exposes its result in a serialized form; this
is a more decoupled design which will come in handy when adding support
for `Merge` with wide-column entities to `DBIter`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10932

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D41129399

Pulled By: ltamasi

fbshipit-source-id: 69d8da358c77d4fc7e8c40f4dafc2c129a710677
2022-11-09 12:54:05 -08:00
Levi Tamasi c62f322169 Clear saved value in DBIter::{Next, Prev} (#10934)
Summary:
`DBIter::saved_value_` stores the result of any `Merge` that was performed to compute the iterator's current value. This value can be ditched whenever the iterator's position is changed, and is already cleared in `Seek`, `SeekForPrev`, `SeekToFirst`, and `SeekToLast`. With the patch, it is also cleared in `Next` and `Prev`.
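As a rough illustration of the rule described above, the following toy iterator drops its cached merge result whenever the position changes. `ToyIter` and its members are hypothetical names for this sketch only; the real `DBIter` is considerably more involved.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy iterator that caches a "merged" value for the current position.
class ToyIter {
 public:
  explicit ToyIter(std::vector<std::string> values)
      : values_(std::move(values)) {}

  void SeekToFirst() {
    saved_value_.clear();
    pos_ = 0;
  }

  // Every position change clears the cached value, including Next and Prev.
  void Next() {
    saved_value_.clear();
    if (pos_ < values_.size()) ++pos_;
  }

  void Prev() {
    saved_value_.clear();
    if (pos_ > 0) --pos_;
  }

  bool Valid() const { return pos_ < values_.size(); }

  // Pretends to run a merge and caches the result. Requires Valid().
  const std::string& MergedValue() {
    if (saved_value_.empty()) {
      saved_value_ = values_[pos_] + "+merged";
    }
    return saved_value_;
  }

  std::size_t saved_size() const { return saved_value_.size(); }

 private:
  std::vector<std::string> values_;
  std::size_t pos_ = 0;
  std::string saved_value_;
};
```

Clearing in one place per positioning call keeps the cached buffer from outliving the entry it was computed for.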

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10934

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D41133473

Pulled By: ltamasi

fbshipit-source-id: cf9e936f48151e64e455cc1664d6e9f4a03aa308
2022-11-08 14:49:16 -08:00
Daniel Engel 55d58d91e7 Fix use of crc32c 3way on portable builds using MSVC (#10667)
Summary:
Hello,
As discussed previously in this [discussion](https://github.com/facebook/rocksdb/pull/9680#discussion_r853105163), the mentioned PR introduced a regression in portable versions that compile with MSVC - crc_3way optimization won't be used even in cases where it is supported.

This PR aims to fix just that.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10667

Reviewed By: akankshamahajan15

Differential Revision: D40644592

Pulled By: ajkr

fbshipit-source-id: dadbeb10d57c19800e74288258ec3b96095557dd
2022-11-08 11:56:55 -08:00
Jay Zhuang b8de2291ad Blog post for Aligning Compaction Output File Boundaries (#10917)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10917

Reviewed By: ajkr

Differential Revision: D41070371

Pulled By: jay-zhuang

fbshipit-source-id: f211aa4f9931d06a38b32042f73a5e207d996caa
2022-11-07 19:28:05 -08:00
Levi Tamasi fbd9077d66 Fix a bug where GetContext does not update READ_NUM_MERGE_OPERANDS (#10925)
Summary:
The patch fixes a bug where `GetContext::Merge` (and `MergeEntity`) does not update the ticker `READ_NUM_MERGE_OPERANDS` because it implicitly uses the default parameter value of `update_num_ops_stats=false` when calling `MergeHelper::TimedFullMerge`. Also, to prevent such issues going forward, the PR removes the default parameter values from the `TimedFullMerge` methods. In addition, it removes an unused/unnecessary parameter from `TimedFullMergeWithEntity`, and does some cleanup at the call sites of these methods.
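The pitfall can be shown with a small sketch, assuming a simplified merge helper. `ToyStatistics` and `ToyTimedFullMerge` are hypothetical and not the real `MergeHelper` signatures. With a defaulted `update_num_ops_stats = false`, a caller that omits the argument silently skips the stats update; without a default, every call site must state its intent.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Toy ticker storage standing in for the statistics object.
struct ToyStatistics {
  uint64_t read_num_merge_operands = 0;
};

// No default value for `update_num_ops_stats`: forgetting the argument is
// now a compile error instead of a silently missing ticker update.
std::string ToyTimedFullMerge(const std::string& base,
                              const std::vector<std::string>& operands,
                              ToyStatistics* stats,
                              bool update_num_ops_stats) {
  if (update_num_ops_stats && stats != nullptr) {
    stats->read_num_merge_operands += operands.size();
  }
  // String-append merge with '.' as the delimiter, for illustration only.
  std::string result = base;
  for (const std::string& operand : operands) {
    result += '.';
    result += operand;
  }
  return result;
}
```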

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10925

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D41096453

Pulled By: ltamasi

fbshipit-source-id: fc60646d32b4d516b8fe81e265c3f020a32fd7f8
2022-11-07 15:42:10 -08:00
Yanqin Jin 75aca74017 Replace member variable lambda with methods (#10924)
Summary:
In transaction unit tests, replace a few member variable lambdas with
non-static methods. It's easier for gdb to work with variables in methods than in lambdas.
(Seen similar things to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86675).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10924

Test Plan: make check

Reviewed By: jay-zhuang

Differential Revision: D41072241

Pulled By: riversand963

fbshipit-source-id: e4fa491de573c4656225a86a75af926c1df827f6
2022-11-07 12:31:48 -08:00
Andrew Kryczka aa0a11e1b9 Fix flush picking non-consecutive memtables (#10921)
Summary:
Prevents `MemTableList::PickMemtablesToFlush()` from picking non-consecutive memtables. It leads to wrong ordering in L0 if the files are committed, or an error like below if force_consistency_checks=true catches it:

```
Corruption: force_consistency_checks: VersionBuilder: L0 file #25 with seqno 320416 368066 vs. file #24 with seqno 336037 352068
```
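A minimal sketch of the "consecutive only" rule, under the assumption that memtables are visited from oldest to newest and that picking stops at the first gap. `ToyMemTable` and `PickConsecutive` are hypothetical names, and this is not the actual `MemTableList::PickMemtablesToFlush()` logic.

```cpp
#include <vector>

// Toy memtable descriptor: an id plus whether a flush already owns it.
struct ToyMemTable {
  int id;
  bool flush_in_progress;
};

// `oldest_first` is ordered from oldest to newest. Returns the ids of the
// first consecutive run of memtables that are not already being flushed.
std::vector<int> PickConsecutive(const std::vector<ToyMemTable>& oldest_first) {
  std::vector<int> picked;
  for (const ToyMemTable& m : oldest_first) {
    if (!m.flush_in_progress) {
      picked.push_back(m.id);
    } else if (!picked.empty()) {
      // A memtable owned by another flush sits between candidates.
      // Stop here so the picked set stays consecutive.
      break;
    }
  }
  return picked;
}
```

For the input `{1: free, 2: flushing, 3: free}` this sketch picks only memtable 1; picking 1 and 3 together is the non-consecutive case that could produce overlapping sequence number ranges in L0.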

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10921

Test Plan: fix the expectation in the existing test of this behavior

Reviewed By: riversand963

Differential Revision: D41046935

Pulled By: ajkr

fbshipit-source-id: 783696bff56115063d5dc5856dfaed6a9881d1ab
2022-11-04 15:55:54 -07:00
anand76 aafe7bd376 Add multireadwhilewriting benchmark to db_bench (#10919)
Summary:
Add the new benchmark

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10919

Reviewed By: akankshamahajan15

Differential Revision: D41017025

Pulled By: anand1976

fbshipit-source-id: 5220815d66de1f689b7f09d9c5266cebf4e345d1
2022-11-04 11:01:33 -07:00
Yanqin Jin 18cb731f27 Fix a bug in range scan with merge and deletion with timestamp (#10915)
Summary:
When performing Merge during range scan, iterator should understand value types of kDeletionWithTimestamp.

Also add an additional check in debug mode to MergeHelper, and account for the presence of compaction filter.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10915

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D40960039

Pulled By: riversand963

fbshipit-source-id: dd79d86d7c79d05755bb939a3d94e0c53ddd7f59
2022-11-03 13:02:06 -07:00
Levi Tamasi 941d834739 Support Merge for wide-column entities during point lookups (#10916)
Summary:
The patch adds `Merge` support for wide-column entities to the point lookup
APIs, i.e. `Get`, `MultiGet`, `GetEntity`, and `GetMergeOperands`. (I plan to
update the iterator and compaction logic in separate PRs.) In terms of semantics,
the `Merge` operation is applied to the default (anonymous) column; any other
columns in the entity are unaffected.
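The semantics can be sketched with a toy entity modeled as a map from column name to value. Two assumptions are made here for illustration: the default column is represented by the empty name, and the merge operator is a string append with '.' as the delimiter. `ToyEntity` and `MergeIntoEntity` are hypothetical names.

```cpp
#include <map>
#include <string>

// Toy wide-column entity: column name -> column value.
using ToyEntity = std::map<std::string, std::string>;

// Assumption for this sketch: the anonymous default column has an empty name.
const std::string kToyDefaultColumn = "";

// Applies a string-append merge operand to the default column only and
// returns the updated entity. All other columns are copied through as-is.
ToyEntity MergeIntoEntity(ToyEntity entity, const std::string& operand) {
  std::string& default_value = entity[kToyDefaultColumn];
  if (!default_value.empty()) {
    default_value += '.';
  }
  default_value += operand;
  return entity;
}
```

Merging `"1"` into an entity with default column `"v0"` and a column `attr = "x"` yields default column `"v0.1"` while `attr` stays `"x"`.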

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10916

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D40962311

Pulled By: ltamasi

fbshipit-source-id: 244bc9d172be1af2f204796b2f89104e4d2fa373
2022-11-03 08:35:42 -07:00
Peter Dillinger cc8c8f6958 Refactor (Hyper)ClockCache code (#10887)
Summary:
For clean-up and in preparation for some other anticipated changes, including
* A new dynamically-scaling variant of HyperClockCache
* SecondaryCache support for HyperClockCache

This change does some refactoring for current and future code sharing and reusability. (Including follow-up on https://github.com/facebook/rocksdb/issues/10843)

## clock_cache.h
* TBD whether new variant will be a HyperClockCache or use some other name, so namespace is just clock_cache for the family of structures.
* A number of helper functions introduced and used.
* Pre-emptively split ClockHandle (shared among lock-free clock cache variants) and HandleImpl (specific to a kind of Table), and introduce template to plug new Table implementation into ClockCacheShard.

## clock_cache.cc
* Mostly using helper functions. Some things like `Rollback()` and `FreeDataMarkEmpty()` were not combined because `Rollback()` is Table-specific while `FreeDataMarkEmpty()` can be used with different table implementations.
* Performance testing indicated that despite more opportunities for parallelism, making a local copy of handle data for processing after marking an entry empty was slower than doing that processing before marking the entry empty (but after marking it "under construction"), thus avoiding a few words of copying data. At least for now, this answers the "TODO? Delay freeing?" questions (no).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10887

Test Plan:
fixed a unit testing gap; other minor test updates for refactoring

No functionality change

## Performance
Same setup as https://github.com/facebook/rocksdb/issues/10801:

Before: `readrandom [AVG 81 runs] : 627992 (± 5124) ops/sec`
After: `readrandom [AVG 81 runs] : 637512 (± 4866) ops/sec`

I've been getting some inconsistent results on restarts like the system is not being fair to the two processes, so I'm not sure there's such a real difference.

Reviewed By: anand1976

Differential Revision: D40959240

Pulled By: pdillinger

fbshipit-source-id: 0a8f3646b3bdb5bc7aaad60b26790b0779189949
2022-11-02 22:41:39 -07:00
Tal Zussman 0d5dc5fdb9 Add rocksdb_backup_restore_example to examples/.gitignore (#10825)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10825

Reviewed By: akankshamahajan15

Differential Revision: D40419234

Pulled By: ajkr

fbshipit-source-id: 2d700154eb5b2943d10a0f944f2b414ece353e4a
2022-11-02 15:02:09 -07:00
Yanqin Jin 0547cecb81 Reduce access to atomic variables in a test (#10909)
Summary:
With TSAN build on CircleCI (see mini-tsan in .circleci/config).
Sometimes `SeqAdvanceConcurrentTest.SeqAdvanceConcurrent` will get stuck when an experimental feature called
"unordered write" is enabled. Stack trace will be the following
```
Thread 7 (Thread 0x7f2284a1c700 (LWP 481523) "write_prepared_"):
#0  0x00000000004fa3f5 in __tsan_atomic64_load () at ./db/merge_context.h:15
#1  0x00000000005e5942 in std::__atomic_base<unsigned long>::load (this=0x7b74000012f8, __m=std::memory_order_seq_cst) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:481
#2  std::__atomic_base<unsigned long>::operator unsigned long (this=0x7b74000012f8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:341
#3  0x00000000005bf001 in rocksdb::SeqAdvanceConcurrentTest_SeqAdvanceConcurrent_Test::TestBody()::$_9::operator()(void*) const (this=0x7b14000085e8) at utilities/transactions/write_prepared_transaction_test.cc:1702

Thread 6 (Thread 0x7f228421b700 (LWP 481521) "write_prepared_"):
#0  0x000000000052178c in __tsan::MetaMap::GetAndLock(__tsan::ThreadState*, unsigned long, unsigned long, bool, bool) () at ./db/merge_context.h:15
#1  0x00000000004fa48e in __tsan_atomic64_load () at ./db/merge_context.h:15
#2  0x00000000005e5942 in std::__atomic_base<unsigned long>::load (this=0x7b74000012f8, __m=std::memory_order_seq_cst) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:481
#3  std::__atomic_base<unsigned long>::operator unsigned long (this=0x7b74000012f8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:341
#4  0x00000000005bf001 in rocksdb::SeqAdvanceConcurrentTest_SeqAdvanceConcurrent_Test::TestBody()::$_9::operator()(void*) const (this=0x7b14000085e8) at utilities/transactions/write_prepared_transaction_test.cc:1702
```

This is problematic and suspicious. Two threads will get stuck in the same place trying to load from an atomic variable.
https://github.com/facebook/rocksdb/blob/7.8.fb/utilities/transactions/write_prepared_transaction_test.cc#L1694:L1707. Not sure why two threads can reach the same point.

The stack trace shows that there may be a deadlock, since the two threads are on the same write thread (one is doing Prepare, while the other is trying to commit).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10909

Test Plan:
On CircleCI mini-tsan, apply a patch first so that we have a higher chance of hitting the same problematic situation,
```
 diff --git a/utilities/transactions/write_prepared_transaction_test.cc b/utilities/transactions/write_prepared_transaction_test.cc
index 4bc1f3744..bd5dc4924 100644
 --- a/utilities/transactions/write_prepared_transaction_test.cc
+++ b/utilities/transactions/write_prepared_transaction_test.cc
@@ -1714,13 +1714,13 @@ TEST_P(SeqAdvanceConcurrentTest, SeqAdvanceConcurrent) {
       size_t d = (n % base[bi + 1]) / base[bi];
       switch (d) {
         case 0:
-          threads.emplace_back(txn_t0, bi);
+          threads.emplace_back(txn_t3, bi);
           break;
         case 1:
-          threads.emplace_back(txn_t1, bi);
+          threads.emplace_back(txn_t3, bi);
           break;
         case 2:
-          threads.emplace_back(txn_t2, bi);
+          threads.emplace_back(txn_t3, bi);
           break;
         case 3:
           threads.emplace_back(txn_t3, bi);
```
then build and run tests
```
COMPILE_WITH_TSAN=1 CC=clang-13 CXX=clang++-13 ROCKSDB_DISABLE_ALIGNED_NEW=1 USE_CLANG=1 make V=1 -j32 check
gtest-parallel -r 100 ./write_prepared_transaction_test --gtest_filter=TwoWriteQueues/SeqAdvanceConcurrentTest.SeqAdvanceConcurrent/19
```
The command above runs `SeqAdvanceConcurrent/19`. Tests 10 to 19 correspond to unordered write, in which Prepare() and Commit() can both enter the same write thread.
Before this PR, there is a high chance of hitting the deadlock. With this PR, no deadlock has been encountered so far.

Reviewed By: ltamasi

Differential Revision: D40869387

Pulled By: riversand963

fbshipit-source-id: 81e82a70c263e4f3417597a201b081ee54f1deab
2022-11-02 14:54:58 -07:00
Brord van Wierst d80baa1396 Added placeholders for MADV defines (#10881)
Summary:
Cross compiling rocksdb with rust bindings to android leads to an error since 7.4.0 (inclusion of madvise).
This is due to missing placeholders for non-linux platforms.

This PR adds the missing placeholders.

See https://github.com/rust-rocksdb/rust-rocksdb/issues/697 for the specific error thrown.

I have just completed the CLA :)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10881

Reviewed By: akankshamahajan15

Differential Revision: D40726103

Pulled By: ajkr

fbshipit-source-id: 6b391636a74ef7e20d0daf47d332ddf0c14d5c34
2022-11-02 14:42:42 -07:00
Adam Retter 781a387488 Improve musl libc detection and provide an option for the user to override (#10889)
Summary:
The user may override the detection of whether to use GNU libc (the default) or musl libc by setting the environment variable: `ROCKSDB_MUSL_LIBC=true`.

Builds upon and supersedes: https://github.com/facebook/rocksdb/pull/9977

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10889

Reviewed By: akankshamahajan15

Differential Revision: D40788431

Pulled By: ajkr

fbshipit-source-id: ef594d973fc14cbadf28bfb38434231a18a2107c
2022-11-02 14:42:23 -07:00
Brad Smith 4a6906e28c Add OpenBSD/arm64 support for detection of CRC32 and PMULL (#10902)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10902

Reviewed By: akankshamahajan15

Differential Revision: D40839659

Pulled By: ajkr

fbshipit-source-id: 06be5919622f8cce1fce1097c5e654900bf7f8fb
2022-11-02 14:35:27 -07:00
Andrew Kryczka 5cf6ab6f31 Ran clang-format on db/ directory (#10910)
Summary:
Ran `find ./db/ -type f | xargs clang-format -i`. Excluded minor changes it tried to make on db/db_impl/. Everything else it changed was directly under db/ directory. Included minor manual touchups mentioned in PR commit history.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10910

Reviewed By: riversand963

Differential Revision: D40880683

Pulled By: ajkr

fbshipit-source-id: cfe26cda05b3fb9a72e3cb82c286e21d8c5c4174
2022-11-02 14:34:24 -07:00
akankshamahajan ff9ad2c39b Fix async_io failures in case there is error in reading data (#10890)
Summary:
Fix memory corruption error in scans if async_io is enabled. Memory corruption happened if data is overlapping between two buffers. If there is an IOError while reading the data, it leads to an empty buffer, and the other buffer, which already has an async read in progress, goes to read again, causing the error.
Fix: Added check to abort IO in second buffer if curr_ got empty.

This PR also fixes db_stress failures which happened when buffers are not aligned.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10890

Test Plan:
- Ran make crash_test -j32 with async_io enabled.
-  Ran benchmarks to make sure there is no regression.

Reviewed By: anand1976

Differential Revision: D40881731

Pulled By: akankshamahajan15

fbshipit-source-id: 39fcf2134c7b1bbb08415ede3e1ef261ac2dbc58
2022-11-01 16:06:51 -07:00
Yanqin Jin 7d26e4c5a3 Basic Support for Merge with user-defined timestamp (#10819)
Summary:
This PR implements the originally disabled `Merge()` APIs when user-defined timestamp is enabled.

Simplest usage:
```cpp
// assume string append merge op is used with '.' as delimiter.
// ts1 < ts2
db->Put(WriteOptions(), "key", ts1, "v0");
db->Merge(WriteOptions(), "key", ts2, "1");
ReadOptions ro;
ro.timestamp = &ts2;
db->Get(ro, "key", &value);
ASSERT_EQ("v0.1", value);
```

Some code comments are added for clarity.

Note: support for timestamp in `DB::GetMergeOperands()` will be done in a follow-up PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10819

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D40603195

Pulled By: riversand963

fbshipit-source-id: f96d6f183258f3392d80377025529f7660503013
2022-10-31 22:28:58 -07:00
Denis Hananein 9f3475eccf Fix compilation errors, clang++-15 (#10907)
Summary:
I've tried to compile the main branch, but two minor things cause compilation errors.
I'm not sure about the second one (`num_empty_non_l0_level`); there should probably be an additional assert.

```
-c ../cache/clock_cache.cc
[build] ../cache/clock_cache.cc:855:15: error: variable 'i' set but not used [-Werror,-Wunused-but-set-variable]
[build]   for (size_t i = 0; &array_[current] != h; i++) {
[build]               ^
```

```
[build] ../db/version_set.cc:3665:7: error: variable 'num_empty_non_l0_level' set but not used [-Werror,-Wunused-but-set-variable]
[build]   int num_empty_non_l0_level = 0;
[build]       ^
[build] 1 error generated.
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10907

Reviewed By: jay-zhuang

Differential Revision: D40866667

Pulled By: ajkr

fbshipit-source-id: 963b7bd56859d0b3b2779cd36fad229425cb7b17
2022-10-31 18:24:44 -07:00
Hui Xiao 7f5e438aee Move wrong history entry out of 7.8 release (#10898)
Summary:
**Context/Summary:**

https://github.com/facebook/rocksdb/pull/10777 mistakenly added a history entry under the 7.8 release, but the PR is not included in 7.8. The mistake happened because the rebase and merge did not flag a conflict when "## Unreleased" was changed to "## 7.8.0 (10/22/2022)".

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10898

Test Plan: Make check

Reviewed By: akankshamahajan15

Differential Revision: D40861001

Pulled By: hx235

fbshipit-source-id: b2310c95490f6ebb90834a210c965a74c9560b51
2022-10-31 15:02:29 -07:00
Levi Tamasi ea1982d010 Add missing copyright headers to a couple of Java test files (#10900)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10900

Reviewed By: akankshamahajan15

Differential Revision: D40825886

Pulled By: ltamasi

fbshipit-source-id: e60f74aa8a622c3c71e1fee420fd586728fb2b7b
2022-10-31 10:05:03 -07:00
sdong d989300ad1 Avoid repeat periodic stats printing when there is no change (#10891)
Summary:
When a column family doesn't get any traffic, its stats are still dumped whenever `options.stats_dump_period_sec` triggers. This sometimes spams the info logs. With this change, we skip the printing if there is no change, until 8 periods have passed.
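
A minimal sketch of the skip logic described above, not the code in this PR. `PeriodicStatsGate`, `ShouldPrint` and the activity counter are hypothetical names, and the reading of "until 8 periods" (print again once 8 consecutive periods have been skipped) is an assumption:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: decides once per dump period whether a column
// family's stats are worth printing.
class PeriodicStatsGate {
 public:
  explicit PeriodicStatsGate(int max_skipped_periods)
      : max_skipped_periods_(max_skipped_periods) {
    assert(max_skipped_periods >= 0);
  }

  // `activity_counter` stands for any value that changes whenever the
  // column family sees traffic (an assumption for this sketch).
  bool ShouldPrint(uint64_t activity_counter) {
    const bool changed = !has_snapshot_ || activity_counter != last_counter_;
    if (changed || skipped_ >= max_skipped_periods_) {
      // Print, remember what was printed, and restart the skip count.
      last_counter_ = activity_counter;
      has_snapshot_ = true;
      skipped_ = 0;
      return true;
    }
    ++skipped_;  // Nothing changed: stay quiet for this period.
    return false;
  }

 private:
  int max_skipped_periods_;
  int skipped_ = 0;
  uint64_t last_counter_ = 0;
  bool has_snapshot_ = false;
};
```

With `PeriodicStatsGate gate(8)`, an idle column family is printed once, stays silent for 8 periods, and is then printed again.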

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10891

Test Plan: Manually test the behavior with hacked db_bench setups.

Reviewed By: jay-zhuang

Differential Revision: D40777183

fbshipit-source-id: ef0b9a793e4f6282df099b464f01d1fb4c5a2cab
2022-10-31 09:51:38 -07:00
Yanqin Jin 9079895aae Fix deletion counting in memtable stats (#10886)
Summary:
Currently, a memtable's `num_deletes_` stat is incremented only if the entry is a regular delete (kTypeDeletion). This PR fixes it by also accounting for kTypeSingleDeletion and kTypeDeletionWithTimestamp.
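
A minimal sketch of the counting rule, assuming a simplified entry-type enum. The enumerator names follow this message, but `EntryType`, `IsDeletion` and `MemtableStats` are hypothetical and the enum values are arbitrary:

```cpp
#include <cstdint>

// Hypothetical subset of entry types, for illustration only.
enum class EntryType {
  kTypeValue,
  kTypeDeletion,
  kTypeSingleDeletion,
  kTypeDeletionWithTimestamp,
  kTypeMerge,
};

// True for every flavor of point deletion, not only the regular one.
constexpr bool IsDeletion(EntryType t) {
  return t == EntryType::kTypeDeletion ||
         t == EntryType::kTypeSingleDeletion ||
         t == EntryType::kTypeDeletionWithTimestamp;
}

// Hypothetical stats holder standing in for the memtable counters.
struct MemtableStats {
  uint64_t num_entries = 0;
  uint64_t num_deletes = 0;

  void Add(EntryType t) {
    ++num_entries;
    if (IsDeletion(t)) {
      ++num_deletes;
    }
  }
};
```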

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10886

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D40740754

Pulled By: riversand963

fbshipit-source-id: 7bde62cd6df136585bc5bfb1c426c7a8276c08e1
2022-10-28 17:03:44 -07:00
Jay Zhuang 36f5e19e33 Fix a Windows build error (#10897)
Summary:
The for loop is marked as unreachable code because it will never call the increment. Switch it to `if`.

```
\table\merging_iterator.cc(823): error C2220: the following warning is treated as an error
\table\merging_iterator.cc(823): warning C4702: unreachable code
\table\merging_iterator.cc(1030): error C2220: the following warning is treated as an error
\table\merging_iterator.cc(1030): warning C4702: unreachable code
```
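
An illustrative sketch of the pattern, not the code in `merging_iterator.cc`. Both function names are hypothetical. The first loop always returns on its first iteration, so its increment can never run; the second expresses the same behavior with `if`:

```cpp
#include <cstddef>
#include <vector>

// Before: the body always returns, so `i++` is unreachable and a compiler
// that reports unreachable code (such as MSVC's C4702) may flag it.
int FirstOrDefaultLoop(const std::vector<int>& v, int fallback) {
  for (size_t i = 0; i < v.size(); i++) {
    return v[i];
  }
  return fallback;
}

// After: same behavior, with no loop increment left to be unreachable.
int FirstOrDefaultIf(const std::vector<int>& v, int fallback) {
  if (!v.empty()) {
    return v[0];
  }
  return fallback;
}
```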

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10897

Reviewed By: cbi42

Differential Revision: D40811790

Pulled By: jay-zhuang

fbshipit-source-id: fe8fd3e7cf3d6f710360c402b79763854d5120df
2022-10-28 14:24:48 -07:00
Yanqin Jin 900f79126d Pass const LockInfo& to AcquireLocked() and AcquireWithTimeout (#10874)
Summary:
The motivation for and benefit of the current behavior of passing `LockInfo&&` as an argument to AcquireLocked() and AcquireWithTimeout() are not clear to me. Furthermore, in AcquireWithTimeout(), we access members of `LockInfo&&` after it is passed to AcquireLocked() as an rvalue ref. In addition, we may call `AcquireLocked()` with `std::move(lock_info)` multiple times.

This leads to a linter warning about use-after-move. If a future implementation of AcquireLocked() does something like move-constructing a new `LockInfo` from the passed-in `LockInfo&&`, then the caller can no longer use it, because `LockInfo` has a member of type `autovector`.
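
A minimal sketch of why a `const&` parameter suits a retry loop, using hypothetical stand-ins. This `LockInfo` is a simplified struct with a `std::vector` member rather than the real type, and `TryAcquire` and `AcquireWithRetries` are illustrative names:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for the real LockInfo.
struct LockInfo {
  bool exclusive = false;
  std::vector<uint64_t> txn_ids;
};

// Taking `const LockInfo&` means the callee can copy the request but can
// never leave the caller's object in a moved-from state.
bool TryAcquire(const LockInfo& requested, std::vector<LockInfo>* held) {
  assert(held != nullptr);
  for (const LockInfo& h : *held) {
    if (h.exclusive || requested.exclusive) {
      return false;  // Conflicts with a lock that is already held.
    }
  }
  held->push_back(requested);  // Copy; `requested` stays intact.
  return true;
}

// The same `requested` object is safely passed on every attempt.
bool AcquireWithRetries(const LockInfo& requested,
                        std::vector<LockInfo>* held, int attempts) {
  for (int i = 0; i < attempts; ++i) {
    if (TryAcquire(requested, held)) {
      return true;
    }
  }
  return false;
}
```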

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10874

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D40704210

Pulled By: riversand963

fbshipit-source-id: 20091df65b4fc63b072bcec9809efc49955d6d35
2022-10-28 14:05:12 -07:00
Hui Xiao 08a63ad10b Run clang format against files under example/, memory/ and memtable/ folders (#10893)
Summary:
**Context/Summary:**
Run the following to format
```
find ./examples -iname '*.h' -o -iname '*.cc' | xargs clang-format -i
find ./memory -iname '*.h' -o -iname '*.cc' | xargs clang-format -i
find ./memtable -iname '*.h' -o -iname '*.cc' | xargs clang-format -i
```

**Test**
- Manual inspection to ensure changes are cosmetic only
- CI

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10893

Reviewed By: jay-zhuang

Differential Revision: D40779187

Pulled By: hx235

fbshipit-source-id: 529cbb0f0fbd698d95817e8c42fe3ce32254d9b0
2022-10-28 13:16:50 -07:00
Levi Tamasi 7867a1112b Handle Merges correctly in GetEntity (#10894)
Summary:
The PR fixes the handling of `Merge`s in `GetEntity`. Note that `Merge` is not yet
supported for wide-column entities written using `PutEntity`; this change is
about returning correct (i.e. consistent with `Get`) results in cases like when the
base value is a plain old key-value written using `Put` or when there is no real base
value because we hit either a tombstone or the beginning of history.

Implementation-wise, the patch introduces a new wrapper around the existing
`MergeHelper::TimedFullMerge` that can store the merge result in either a string
(for the purposes of `Get`) or a `PinnableWideColumns` instance (for `GetEntity`).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10894

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D40782708

Pulled By: ltamasi

fbshipit-source-id: 3d700d56b2ef81f02ba1e2d93f6481bf13abcc90
2022-10-28 10:48:51 -07:00
Jay Zhuang 1e6f1ef894 Upgrade CircleCI Windows Build (#10090)
Summary:
* Upgrade CircleCI orb from 2.4 to 5.0
* Set up vs2022 build
* Use the image's built-in vs2019 and vs2022
* Remove vs2017
* Remove CMAKE_CXX_STANDARD=20

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10090

Reviewed By: ajkr

Differential Revision: D40787942

Pulled By: jay-zhuang

fbshipit-source-id: cc74c02a9f28dd784a0ba5502c4bfc9ff1a26d3e
2022-10-28 09:14:47 -07:00
anand76 bf497e91ad Allow a custom DB cleanup command to be passed to db_crashtest.py (#10883)
Summary:
This option lets db_crashtest.py use a custom cleanup command line to clean up between runs on a non-POSIX file system.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10883

Test Plan: Run the whitebox crash test

Reviewed By: pdillinger

Differential Revision: D40726424

Pulled By: anand1976

fbshipit-source-id: b827f6b583ff78f9ca75ced2d96f7e58f5200432
2022-10-27 19:47:01 -07:00
Levi Tamasi 22ff8c5af7 Use malloc/free for LRUHandle instead of new[]/delete[] (#10884)
Summary:
It's unsafe to call `malloc_usable_size` with an address not returned by a function from the `malloc` family (see https://github.com/facebook/rocksdb/issues/10798). The patch switches from using `new[]` / `delete[]` for `LRUHandle` to `malloc` / `free`.
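
A minimal sketch of the allocation pattern, assuming a simplified handle with trailing key bytes. `Handle`, `NewHandle`, `KeyData` and `FreeHandle` are hypothetical and do not reflect the real `LRUHandle` layout:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical fixed-size header; the key bytes live right after it in the
// same allocation.
struct Handle {
  size_t key_length;
};

// Pointer to the key bytes stored immediately after the header.
char* KeyData(Handle* h) {
  return reinterpret_cast<char*>(h) + sizeof(Handle);
}

// Allocate with malloc so the pointer comes from the malloc family and may
// be passed to malloc-family utilities.
Handle* NewHandle(const char* key, size_t key_length) {
  assert(key != nullptr || key_length == 0);
  void* mem = std::malloc(sizeof(Handle) + key_length);
  if (mem == nullptr) {
    return nullptr;
  }
  Handle* h = static_cast<Handle*>(mem);
  h->key_length = key_length;
  if (key_length > 0) {
    std::memcpy(KeyData(h), key, key_length);
  }
  return h;
}

// Memory obtained from malloc must be released with free, not delete[].
void FreeHandle(Handle* h) { std::free(h); }
```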

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10884

Test Plan: `make check`

Reviewed By: pdillinger

Differential Revision: D40738089

Pulled By: ltamasi

fbshipit-source-id: ac5583f88125fee49c314639be6b6df85937fbee
2022-10-27 15:39:29 -07:00
Changyu Bi 56715350d9 Reduce heap operations for range tombstone keys in iterator (#10877)
Summary:
Right now in MergingIterator, for each range tombstone start and end key, we pop one end from the heap and push the other end into the heap. This incurs extra downheap and upheap cost. In the likely case that a range tombstone iterator emits relatively adjacent keys, these keys should land in similar positions among all keys in the heap. This can happen when there is a burst of consecutive range tombstones and most of the keys covered by them are already dropped. This PR uses `replace_top()` when inserting new range tombstone keys, which is more efficient in these common cases.
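
A minimal sketch of the idea on a toy integer min-heap, not RocksDB's heap implementation. `MinHeap` is hypothetical. `pop()` followed by `push()` does one sift-down and one sift-up, while `replace_top()` overwrites the root and does a single sift-down, which ends early when the new key belongs near the top:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy binary min-heap over ints, for illustration only.
class MinHeap {
 public:
  bool empty() const { return data_.empty(); }

  int top() const {
    assert(!data_.empty());
    return data_[0];
  }

  void push(int v) {
    data_.push_back(v);
    SiftUp(data_.size() - 1);
  }

  void pop() {
    assert(!data_.empty());
    data_[0] = data_.back();
    data_.pop_back();
    if (!data_.empty()) {
      SiftDown(0);
    }
  }

  // Equivalent to pop() followed by push(v), but with one sift-down only.
  void replace_top(int v) {
    assert(!data_.empty());
    data_[0] = v;
    SiftDown(0);
  }

 private:
  void SiftUp(size_t i) {
    while (i > 0) {
      const size_t parent = (i - 1) / 2;
      if (data_[parent] <= data_[i]) break;
      std::swap(data_[parent], data_[i]);
      i = parent;
    }
  }

  void SiftDown(size_t i) {
    const size_t n = data_.size();
    while (true) {
      const size_t left = 2 * i + 1;
      if (left >= n) break;
      size_t smallest = left;
      if (left + 1 < n && data_[left + 1] < data_[left]) {
        smallest = left + 1;
      }
      if (data_[i] <= data_[smallest]) break;
      std::swap(data_[i], data_[smallest]);
      i = smallest;
    }
  }

  std::vector<int> data_;
};
```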

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10877

Test Plan:
- existing UT
- ran all flavors of stress test through sandcastle
- benchmark:
```
# Set up: --writes_per_range_tombstone=1 means one point write and one delete range

TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=100000000 --writes=800000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=64

Level Files Size(MB)
--------------------
  0        8      152
  1        0        0
  2        0        0
  3        0        0
  4        0        0
  5        0        0
  6        0        0

# Benchmark
TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone/ ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472 --num=100000000 --reads=1000000 --disable_auto_compactions=true --avoid_flush_during_recovery=true

# Pre PR
readseq [AVG    5 runs] : 1432116 (± 59664) ops/sec;  224.0 (± 9.3) MB/sec
readseq [MEDIAN 5 runs] : 1454886 ops/sec;  227.5 MB/sec

# Post PR
readseq [AVG    5 runs] : 1944425 (± 29521) ops/sec;  304.1 (± 4.6) MB/sec
readseq [MEDIAN 5 runs] : 1959430 ops/sec;  306.5 MB/sec
```

Reviewed By: ajkr

Differential Revision: D40710936

Pulled By: cbi42

fbshipit-source-id: cb782fb9cdcd26c0c3eb9443215a4ef4d2f79022
2022-10-27 14:28:50 -07:00
sdong 3e686c7cbe sst_dump --command=raw to add index offset information (#10873)
Summary:
Add some extra information to the output of "sst_dump --command=raw" to help debug some issues. Right now, the encoded block handle is printed out. It is more useful to print the offset and size directly.
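
A minimal sketch of turning an encoded block handle into an offset and size, assuming the handle is two consecutive varint64 values (offset, then size). `GetVarint64`, `BlockHandleFields` and `DecodeBlockHandle` are hypothetical helpers written for this sketch, not the functions sst_dump uses:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decodes one varint64 (7 payload bits per byte, high bit set on all but the
// last byte) starting at *pos. Advances *pos past the bytes it reads.
bool GetVarint64(const std::string& in, size_t* pos, uint64_t* out) {
  uint64_t result = 0;
  for (unsigned shift = 0; shift <= 63 && *pos < in.size(); shift += 7) {
    const uint64_t byte = static_cast<unsigned char>(in[(*pos)++]);
    result |= (byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) {
      *out = result;
      return true;
    }
  }
  return false;  // Truncated or over-long input.
}

// Hypothetical plain struct holding the two decoded fields.
struct BlockHandleFields {
  uint64_t offset = 0;
  uint64_t size = 0;
};

// Assumes the encoding is varint64(offset) followed by varint64(size).
bool DecodeBlockHandle(const std::string& encoded, BlockHandleFields* h) {
  size_t pos = 0;
  return GetVarint64(encoded, &pos, &h->offset) &&
         GetVarint64(encoded, &pos, &h->size);
}
```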

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10873

Test Plan: Manually run it against a file and check the output.

Reviewed By: anand1976

Differential Revision: D40742289

fbshipit-source-id: 04d7de26e7f27e1595a7cc3ac1c1082e4e835b93
2022-10-27 11:56:09 -07:00