Commit Graph

11839 Commits

Author SHA1 Message Date
Hui Xiao d8c043f7ad Trigger FIFO file deletion in non L0 only if exceeding max_table_files_size (#10955)
Summary:
**Context**

https://github.com/facebook/rocksdb/pull/10348 allows multi-level FIFO but accidentally made change to the logic of deleting files in `FIFOCompactionPicker::PickSizeCompaction`. With [this](https://github.com/facebook/rocksdb/pull/10348/files#diff-d8fb3d50749aa69b378de447e3d9cf2f48abe0281437f010b5d61365a7b813fdR156) and [this](https://github.com/facebook/rocksdb/pull/10348/files#diff-d8fb3d50749aa69b378de447e3d9cf2f48abe0281437f010b5d61365a7b813fdR235) together, it deletes one file in non-L0 even when `total_size <= mutable_cf_options.compaction_options_fifo.max_table_files_size`, which is incorrect.

As a consequence, FIFO exercises more file deletion in our crash testing, which is not able to verify correctly on deleted keys in the file deleted by compaction. This results in errors  `error : inconsistent values for key 000000000000239F000000000000012B000000000000028B: expected state has the key, Get() returns NotFound.
Verification failed :(` or `Expected state has key 00000000000023A90000000000000003787878, iterator is at key 00000000000023A9000000000000004178
Column family: default, op_logs: S 00000000000023A90000000000000003787878`

**Summary**:
- Delete file for non-L0 only if `total_size <= mutable_cf_options.compaction_options_fifo.max_table_files_size`
- Add some helpful log to LOG file

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10955

Test Plan:
- Errors repro-ed by
```
./db_stress --preserve_unverified_changes=1 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --async_io=0 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=10 --bottommost_compression_type=none --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_style=2 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8589934591 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=1048576 --delpercent=0 --delrangepercent=0 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=2 --ingest_external_file_one_in=1000000 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --log2_keys_per_lock=10 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=0 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=40000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=7 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=0 --readpercent=65 --recycle_log_file_num=1 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=1 --use_full_merge_v1=1 --use_merge=0 --use_multiget=0 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_iterator_with_expected_state_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=20
```
is gone after this fix
- CI

Reviewed By: ajkr

Differential Revision: D41319441

Pulled By: hx235

fbshipit-source-id: 6939753767007f7449ea7055b1420aabd03d7709
2022-11-28 15:45:03 -08:00
relife22 ed23fd7591 Add Apache Spark as a user (#10993)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10993

Reviewed By: ajkr

Differential Revision: D41543962

Pulled By: cbi42

fbshipit-source-id: a895d7863543bd64734c5c9faa7b55b0732b3d60
2022-11-28 09:42:42 -08:00
Changyu Bi 534fb06dd3 Prevent iterating over range tombstones beyond `iterate_upper_bound` (#10966)
Summary:
Currently, `iterate_upper_bound` is not checked for range tombstone keys in MergingIterator. This may impact performance when there is a large number of range tombstones right after `iterate_upper_bound`. This PR fixes this issue by checking `iterate_upper_bound` in MergingIterator for range tombstone keys.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10966

Test Plan:
- added unit test
- stress test: `python3 tools/db_crashtest.py whitebox --simple --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=18 --writepercent=48 --readpercen=15 --duration=36000 --range_deletion_width=100`
- ran different stress tests over sandcastle
- Falcon team ran some test traffic and saw reduced CPU usage on processing range tombstones.

Reviewed By: ajkr

Differential Revision: D41414172

Pulled By: cbi42

fbshipit-source-id: 9b2c29eb3abb99327c6a649bdc412e70d863f981
2022-11-23 14:27:14 -08:00
Andrew Kryczka 54c2542df2 Support tiering when file endpoints overlap (#10961)
Summary:
Enabled output to penultimate level when file endpoints overlap. This is probably only possible when range tombstones span files. Otherwise the overlapping files would all be included in the penultimate level inputs thanks to our atomic compaction unit logic.

Also, corrected `penultimate_output_range_type_`, which is a minor fix as it appears only used for logging.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10961

Test Plan: updated unit test

Reviewed By: cbi42

Differential Revision: D41370615

Pulled By: ajkr

fbshipit-source-id: 7e75ec369a3b41b8382b336446c81825a4c4f572
2022-11-23 09:20:58 -08:00
Yanqin Jin 3d0d6b8140 Make best-efforts recovery verify SST unique ID before Version construction (#10962)
Summary:
The check for SST unique IDs added to best-efforts recovery (`Options::best_efforts_recovery` is true).

With best_efforts_recovery being true, RocksDB will recover to the latest point in
MANIFEST such that all valid SST files included up to this point pass unique ID checks as well.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10962

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D41378241

Pulled By: riversand963

fbshipit-source-id: a036064e2c17dec13d080a24ef2a9f85d607b16c
2022-11-22 22:53:31 -08:00
jsteemann d8e792e4cf fix compile warnings (#10976)
Summary:
Fixes lots of compile warnings related to missing override specifiers, e.g.
```
./3rdParty/rocksdb/trace_replay/block_cache_tracer.h:130:10: warning: ‘virtual rocksdb::Status rocksdb::BlockCacheTraceWriterImpl::WriteBlockAccess(const rocksdb::BlockCacheTraceRecord&, const rocksdb::Slice&, const rocksdb::Slice&, const rocksdb::Slice&)’ can be marked override [-Wsuggest-override]
  130 |   Status WriteBlockAccess(const BlockCacheTraceRecord& record,
      |          ^~~~~~~~~~~~~~~~
./3rdParty/rocksdb/trace_replay/block_cache_tracer.h:136:10: warning: ‘virtual rocksdb::Status rocksdb::BlockCacheTraceWriterImpl::WriteHeader()’ can be marked override [-Wsuggest-override]
  136 |   Status WriteHeader();
      |          ^~~~~~~~~~~
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10976

Reviewed By: riversand963

Differential Revision: D41478588

Pulled By: ajkr

fbshipit-source-id: d30b0457241999e38b16aacf6dabe3e691f7c46f
2022-11-22 15:51:01 -08:00
Alan Paxton ae115eff8f improve copying of Env in Options (#10666)
Summary:
Closes https://github.com/facebook/rocksdb/issues/9909

- Constructing an Options from a DBOptions should use the Env from the DBOptions
- DBOptions should be constructed with the default Env as the env_, rather than null. Why ever not ?

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10666

Reviewed By: riversand963

Differential Revision: D40515418

Pulled By: ajkr

fbshipit-source-id: 4122ba3f537660720262694c21ab4bfb13b6f8de
2022-11-22 15:48:59 -08:00
Andrew Kryczka db9cbddc6f Deflake DBTest2.TraceAndReplay by relaxing latency checks (#10979)
Summary:
Since the latency measurement uses real time it is possible for the operation to complete in zero microseconds and then fail these checks. We saw this with the operation that invokes Get() on an invalid CF. This PR relaxes the assertions to allow for operations completing in zero microseconds.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10979

Reviewed By: riversand963

Differential Revision: D41478300

Pulled By: ajkr

fbshipit-source-id: 50ef096bd8f0162b31adb46f54ae6ddc337d0a5e
2022-11-22 13:07:17 -08:00
anand76 f4cfcfe824 Post 7.9.0 release branch cut updates (#10974)
Summary:
Update HISTORY.md, version.h, and check_format_compatible.sh

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10974

Reviewed By: akankshamahajan15

Differential Revision: D41455289

Pulled By: anand1976

fbshipit-source-id: 99888ebcb9109e5ced80584a66b20123f8783c0b
2022-11-21 19:24:42 -08:00
Changyu Bi 6c5ec92070 Set correct temperature for range tombstone only file in penultimate level (#10972)
Summary:
before this PR, if there is a range tombstone-only file generated in penultimate level, it is marked the `last_level_temperature`. This PR fixes this issue.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10972

Test Plan: added unit test for this scenario.

Reviewed By: ajkr

Differential Revision: D41449215

Pulled By: cbi42

fbshipit-source-id: 1e06b5ae3bc0183db2991a45965a9807a7e8be0c
2022-11-21 17:08:50 -08:00
anand76 3ff6da6bd5 Update HISTORY.md for 7.9.0 (#10973)
Summary:
Update HISTORY.md for 7.9.0 release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10973

Reviewed By: pdillinger

Differential Revision: D41453720

Pulled By: anand1976

fbshipit-source-id: 47a23d4b6539ec6a9a09c9e69c026f7c8b10afa7
2022-11-21 17:00:01 -08:00
Peter Dillinger e079d562af Add a SecondaryCache::InsertSaved() API, use in CacheDumper impl (#10945)
Summary:
Can simplify some ugly code in cache_dump_load_impl.cc by having an API in SecondaryCache that can directly consume persisted data.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10945

Test Plan: existing tests for CacheDumper, added basic unit test

Reviewed By: anand1976

Differential Revision: D41231497

Pulled By: pdillinger

fbshipit-source-id: b8ec993ef7d3e7efd68aae8602fd3f858da58068
2022-11-21 16:17:36 -08:00
Andrew Kryczka 097f9f4425 Fix CompactionIterator flag for penultimate level output (#10967)
Summary:
We were not resetting it in non-debug mode so it could be true once and then stay true for future keys where it should be false. This PR adds the reset logic.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10967

Test Plan:
- built `db_bench` with DEBUG_LEVEL=0
- ran benchmark: `TEST_TMPDIR=/dev/shm/prefix ./db_bench -benchmarks=fillrandom -compaction_style=1 -preserve_internal_time_seconds=100 -preclude_last_level_data_seconds=10 -write_buffer_size=1048576 -target_file_size_base=1048576 -subcompactions=8 -duration=120`
- compared "output_to_penultimate_level: X bytes + last: Y bytes" lines in LOG output
  - Before this fix, Y was always zero
  - After this fix, Y gradually increased throughout the benchmark

Reviewed By: riversand963

Differential Revision: D41417726

Pulled By: ajkr

fbshipit-source-id: ace1e9a289e751a5b0c2fbaa8addd4eda5525329
2022-11-21 16:14:03 -08:00
Peter Dillinger 3182beeffc Observe and warn about misconfigured HyperClockCache (#10965)
Summary:
Background. One of the core risks of chosing HyperClockCache is ending up with degraded performance if estimated_entry_charge is very significantly wrong. Too low leads to under-utilized hash table, which wastes a bit of (tracked) memory and likely increases access times due to larger working set size (more TLB misses). Too high leads to fully populated hash table (at some limit with reasonable lookup performance) and not being able to cache as many objects as the memory limit would allow. In either case, performance degradation is graceful/continuous but can be quite significant. For example, cutting block size in half without updating estimated_entry_charge could lead to a large portion of configured block cache memory (up to roughly 1/3) going unused.

Fix. This change adds a mechanism through which the DB periodically probes the block cache(s) for "problems" to report, and adds diagnostics to the HyperClockCache for bad estimated_entry_charge. The periodic probing is currently done with DumpStats / stats_dump_period_sec, and diagnostics reported to info_log (normally LOG file).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10965

Test Plan:
unit test included. Doesn't cover all the implemented subtleties of reporting, but ensures basics of when to report or not.

Also manual testing with db_bench. Create db with
```
./db_bench --benchmarks=fillrandom,flush --num=3000000 --disable_wal=1
```
Use and check LOG file for HyperClockCache for various block sizes (used as estimated_entry_charge)
```
./db_bench --use_existing_db --benchmarks=readrandom --num=3000000 --duration=20 --stats_dump_period_sec=8 --cache_type=hyper_clock_cache -block_size=XXXX
```
Seeing warnings / errors or not as expected.

Reviewed By: anand1976

Differential Revision: D41406932

Pulled By: pdillinger

fbshipit-source-id: 4ca56162b73017e4b9cec2cad74466f49c27a0a7
2022-11-21 12:08:21 -08:00
Yanqin Jin a8a4ed52a4 Test Merge with timestamps in stress test (#10948)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10948

Test Plan: make crash_test_with_ts

Reviewed By: ltamasi

Differential Revision: D41390854

Pulled By: riversand963

fbshipit-source-id: 599e114da8e2b2bbff5628fb8c67fa0393a31c05
2022-11-17 20:43:50 -08:00
Peter Dillinger 8c0f5b1fcf Mark HyperClockCache as production-ready (#10963)
Summary:
After a couple minor bug fixes and successful productions roll-outs in a few places, I think we can mark this as production-ready. It has a clear value proposition for many workloads, even if we don't have clear advice for every workload yet.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10963

Test Plan: existing tests, comment changes only

Reviewed By: siying

Differential Revision: D41384083

Pulled By: pdillinger

fbshipit-source-id: 56359f01a57bb28de8697666b342382fac72ce6d
2022-11-17 14:44:59 -08:00
Levi Tamasi 8fa8780932 Mention wide-column support in HISTORY.md (#10959)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10959

Reviewed By: akankshamahajan15

Differential Revision: D41348198

Pulled By: ltamasi

fbshipit-source-id: 51e89d03c1fe87f576a766f609a7f233a519c83d
2022-11-16 12:22:35 -08:00
Peter Dillinger 32520df1d9 Remove prototype FastLRUCache (#10954)
Summary:
This was just a stepping stone to what eventually became HyperClockCache, and is now just more code to maintain.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10954

Test Plan: tests updated

Reviewed By: akankshamahajan15

Differential Revision: D41310123

Pulled By: pdillinger

fbshipit-source-id: 618ee148a1a0a29ee756ba8fe28359617b7cd67c
2022-11-16 10:15:55 -08:00
Peter Dillinger b55e70357c Re-arrange cache.h to prepare for refactoring (#10942)
Summary:
No material changes to code or comments, just re-arranging things to prepare for a big refactoring, making it easier to what changed. Some specifics:
* This groups things together in Cache in anticipation of secondary cache features being marked production-ready (vs. experimental).
* CacheEntryRole will be needed in definition of class Cache, so that has been moved above it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10942

Test Plan: existing tests

Reviewed By: anand1976

Differential Revision: D41205509

Pulled By: pdillinger

fbshipit-source-id: 3f2559ab1651c758918dc97056951fa2b5eb0348
2022-11-15 10:47:15 -08:00
Levi Tamasi b644baa1eb Support using GetMergeOperands for verification with wide columns (#10952)
Summary:
With the recent changes, `GetMergeOperands` is now supported for wide-column entities as well, so we can use it for verification purposes in the non-batched stress tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10952

Test Plan: Ran a simple non-batched ops blackbox crash test.

Reviewed By: riversand963

Differential Revision: D41292114

Pulled By: ltamasi

fbshipit-source-id: 70b4c756a4a1fecb445c16c7096aad805a51203c
2022-11-15 08:06:41 -08:00
Akanksha Mahajan 1562524e63 Fix db_stress failure in async_io in FilePrefetchBuffer (#10949)
Summary:
Fix db_stress failure in async_io in FilePrefetchBuffer.

From the logs, assertion was caused when
- prev_offset_ = offset but somehow prev_len != 0 and explicit_prefetch_submitted_ = true. That scenario is when we send async request to prefetch buffer during seek but in second seek that data is found in cache. prev_offset_ and prev_len_ get updated but we were not setting explicit_prefetch_submitted_ = false because of which buffers were getting out of sync.
It's possible a read by another thread might have loaded the block into the cache in the meantime.

Particular assertion example:
```
prev_offset: 0, prev_len_: 8097 , offset: 0, length: 8097, actual_length: 8097 , actual_offset: 0 ,
curr_: 0, bufs_[curr_].offset_: 4096 ,bufs_[curr_].CurrentSize(): 48541 , async_len_to_read: 278528, bufs_[curr_].async_in_progress_: false
second: 1, bufs_[second].offset_: 282624 ,bufs_[second].CurrentSize(): 0, async_len_to_read: 262144 ,bufs_[second].async_in_progress_: true ,
explicit_prefetch_submitted_: true , copy_to_third_buffer: false
```
As we can see curr_ was expected to read 278528 but it read 48541. Also buffers are out of sync.
Also `explicit_prefetch_submitted_` is set true but prev_len not 0.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10949

Test Plan:
- Ran db_bench for regression to make sure there is no regression;
- Ran db_stress failing without this fix,
- Ran build-linux-mini-crashtest 7- 8 times locally + CircleCI

Reviewed By: anand1976

Differential Revision: D41257786

Pulled By: akankshamahajan15

fbshipit-source-id: 1d100f94f8c06bbbe4cc76ca27f1bbc820c2494f
2022-11-14 16:14:41 -08:00
xiaochenfan 0993c9225f Fix broken dependency: update zlib from 1.2.12 to 1.2.13 (#10833)
Summary:
zlib(https://zlib.net/) has released v1.2.13.

1.2.12 is no longer available for downloading and Makefile for rocksdb will be broken due to can't find the source .tar.gz.

https://nvd.nist.gov/vuln/detail/CVE-2022-37434

This pr update the version number and the shasum of new .tar.gz file. (1.2.13)

Fixes https://github.com/facebook/rocksdb/issues/10876

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10833

Reviewed By: hx235

Differential Revision: D40575954

Pulled By: ajkr

fbshipit-source-id: 3e560e453ddf58d045214fc4e64f83bef91f22e5
2022-11-14 11:49:06 -08:00
akankshamahajan 8515437594 Update unit test to avoid timeout (#10950)
Summary:
Update unit test to avoid timeout

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10950

Reviewed By: hx235

Differential Revision: D41258892

Pulled By: akankshamahajan15

fbshipit-source-id: cbfe94da63e9e54544a307845deb79ba42458301
2022-11-14 11:39:22 -08:00
anand76 ecba6a320e Add some async read stats (#10947)
Summary:
Add stats for time spent in the ReadAsync call, and async read errors.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10947

Test Plan: Run db_bench and look at stats

Reviewed By: akankshamahajan15

Differential Revision: D41236637

Pulled By: anand1976

fbshipit-source-id: 70539b69a28491d57acead449436a761f7108acf
2022-11-13 21:38:35 -08:00
Peter Dillinger f321e8fc98 Don't attempt to use SecondaryCache on block_cache_compressed (#10944)
Summary:
Compressed block cache depends on reading the block compression marker beyond the payload block size. Only the payload bytes were being saved and loaded from SecondaryCache -> boom!

This removes some unnecessary code attempting to combine these two competing features. Note that BlockContents was previously used for block-based filter in block cache, but that support has been removed.

Also marking block_cache_compressed as deprecated in this commit as we expect it to be replaced with SecondaryCache.

This problem was discovered during refactoring but didn't want to combine bug fix with that refactoring.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10944

Test Plan: test added that fails on base revision (at least with ASAN)

Reviewed By: akankshamahajan15

Differential Revision: D41205578

Pulled By: pdillinger

fbshipit-source-id: 1b29d36c7a6552355ac6511fcdc67038ef4af29f
2022-11-11 17:35:53 -08:00
Levi Tamasi 5e8947057b Support Merge for wide-column entities in the compaction logic (#10946)
Summary:
The patch extends the compaction logic to handle `Merge`s in conjunction with wide-column entities. As usual, the merge operation is applied to the anonymous default column, and any other columns are unaffected.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10946

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D41233722

Pulled By: ltamasi

fbshipit-source-id: dfd9b1362222f01bafcecb139eb48480eb279fed
2022-11-11 16:32:32 -08:00
akankshamahajan d1aca4a5ae Fix async_io regression in scans (#10939)
Summary:
Fix async_io regression in scans due to incorrect check which was causing the valid data in buffer to be cleared during seek.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10939

Test Plan:
- stress tests  export CRASH_TEST_EXT_ARGS="--async_io=1"
    make crash_test -j32
- Ran db_bench command which was caught the regression:
./db_bench --db=/rocksdb_async_io_testing/prefix_scan --disable_wal=1 --use_existing_db=true --benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=963 -duration=30 -ops_between_duration_checks=1 --async_io=true --compaction_readahead_size=4194304 --log_readahead_size=0 --blob_compaction_readahead_size=0 --initial_auto_readahead_size=65536 --num_file_reads_for_auto_readahead=0 --max_auto_readahead_size=524288

seekrandom   :    3777.415 micros/op 264 ops/sec 30.000 seconds 7942 operations;  132.3 MB/s (7942 of 7942 found)

Reviewed By: anand1976

Differential Revision: D41173899

Pulled By: akankshamahajan15

fbshipit-source-id: 2d75b06457d65b1851c92382565d9c3fac329dfe
2022-11-11 13:34:49 -08:00
Levi Tamasi dbc4101b89 Support Merge with wide-column entities in iterator (#10941)
Summary:
The patch adds `Merge` support for wide-column entities in `DBIter`. As before, the `Merge` operation is applied to the default column of the entity; any other columns are unchanged. As a small cleanup, the PR also changes the signature of `DBIter::Merge` to simply return a boolean instead of the `Merge` operation's `Status` since the actual `Status` is already stored in a member variable.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10941

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D41195471

Pulled By: ltamasi

fbshipit-source-id: 362cf555897296e252c3de5ddfbd569ef34f85ef
2022-11-10 18:00:08 -08:00
Levi Tamasi 9460d4b77e Refactor MergeHelper::MergeUntil a bit (#10943)
Summary:
The patch untangles some nested ifs in `MergeHelper::MergeUntil`. This will come in handy when extending the compaction logic to support `Merge` for wide-column entities, and also enables us to eliminate some repeated branching on value type and to decrease the scope of some variables.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10943

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D41201946

Pulled By: ltamasi

fbshipit-source-id: 890bd3d4e31cdccadca614489a94686d76485ba9
2022-11-10 17:29:57 -08:00
Levi Tamasi 2ea109521f Revisit the interface of MergeHelper::TimedFullMerge(WithEntity) (#10932)
Summary:
The patch refines/reworks `MergeHelper::TimedFullMerge(WithEntity)`
a bit in two ways. First, it eliminates the recently introduced `TimedFullMerge`
overload, which makes the responsibilities clearer by making sure the query
result (`value` for `Get`, `columns` for `GetEntity`) is set uniformly in
`SaveValue` and `GetContext`. Second, it changes the interface of
`TimedFullMergeWithEntity` so it exposes its result in a serialized form; this
is a more decoupled design which will come in handy when adding support
for `Merge` with wide-column entities to `DBIter`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10932

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D41129399

Pulled By: ltamasi

fbshipit-source-id: 69d8da358c77d4fc7e8c40f4dafc2c129a710677
2022-11-09 12:54:05 -08:00
Levi Tamasi c62f322169 Clear saved value in DBIter::{Next, Prev} (#10934)
Summary:
`DBIter::saved_value_` stores the result of any `Merge` that was performed to compute the iterator's current value. This value can be ditched whenever the iterator's position is changed, and is already cleared in `Seek`, `SeekForPrev`, `SeekToFirst`, and `SeekToLast`. With the patch, it is also cleared in `Next` and `Prev`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10934

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D41133473

Pulled By: ltamasi

fbshipit-source-id: cf9e936f48151e64e455cc1664d6e9f4a03aa308
2022-11-08 14:49:16 -08:00
Daniel Engel 55d58d91e7 Fix use of crc32c 3way on portable builds using MSVC (#10667)
Summary:
Hello,
As discussed previously in this [discussion](https://github.com/facebook/rocksdb/pull/9680#discussion_r853105163), the mentioned PR introduced a regression in portable versions that compile with MSVC - crc_3way optimization won't be used even in cases where it is supported.

This PR aims to fix just that.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10667

Reviewed By: akankshamahajan15

Differential Revision: D40644592

Pulled By: ajkr

fbshipit-source-id: dadbeb10d57c19800e74288258ec3b96095557dd
2022-11-08 11:56:55 -08:00
Jay Zhuang b8de2291ad Blog post for Aligning Compaction Output File Boundaries (#10917)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10917

Reviewed By: ajkr

Differential Revision: D41070371

Pulled By: jay-zhuang

fbshipit-source-id: f211aa4f9931d06a38b32042f73a5e207d996caa
2022-11-07 19:28:05 -08:00
Levi Tamasi fbd9077d66 Fix a bug where GetContext does not update READ_NUM_MERGE_OPERANDS (#10925)
Summary:
The patch fixes a bug where `GetContext::Merge` (and `MergeEntity`) does not update the ticker `READ_NUM_MERGE_OPERANDS` because it implicitly uses the default parameter value of `update_num_ops_stats=false` when calling `MergeHelper::TimedFullMerge`. Also, to prevent such issues going forward, the PR removes the default parameter values from the `TimedFullMerge` methods. In addition, it removes an unused/unnecessary parameter from `TimedFullMergeWithEntity`, and does some cleanup at the call sites of these methods.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10925

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D41096453

Pulled By: ltamasi

fbshipit-source-id: fc60646d32b4d516b8fe81e265c3f020a32fd7f8
2022-11-07 15:42:10 -08:00
Yanqin Jin 75aca74017 Replace member variable lambda with methods (#10924)
Summary:
In transaction unit tests, replace a few member variable lambdas with
non-static methods. It's easier for gdb to work with variables in methods than in lambdas.
(Seen similar things to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86675).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10924

Test Plan: make check

Reviewed By: jay-zhuang

Differential Revision: D41072241

Pulled By: riversand963

fbshipit-source-id: e4fa491de573c4656225a86a75af926c1df827f6
2022-11-07 12:31:48 -08:00
Andrew Kryczka aa0a11e1b9 Fix flush picking non-consecutive memtables (#10921)
Summary:
Prevents `MemTableList::PickMemtablesToFlush()` from picking non-consecutive memtables. It leads to wrong ordering in L0 if the files are committed, or an error like below if force_consistency_checks=true catches it:

```
Corruption: force_consistency_checks: VersionBuilder: L0 file https://github.com/facebook/rocksdb/issues/25 with seqno 320416 368066 vs. file https://github.com/facebook/rocksdb/issues/24 with seqno 336037 352068
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10921

Test Plan: fix the expectation in the existing test of this behavior

Reviewed By: riversand963

Differential Revision: D41046935

Pulled By: ajkr

fbshipit-source-id: 783696bff56115063d5dc5856dfaed6a9881d1ab
2022-11-04 15:55:54 -07:00
anand76 aafe7bd376 Add multireadwhilewriting benchmark to db_bench (#10919)
Summary:
Add the new benchmark

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10919

Reviewed By: akankshamahajan15

Differential Revision: D41017025

Pulled By: anand1976

fbshipit-source-id: 5220815d66de1f689b7f09d9c5266cebf4e345d1
2022-11-04 11:01:33 -07:00
Yanqin Jin 18cb731f27 Fix a bug in range scan with merge and deletion with timestamp (#10915)
Summary:
When performing Merge during range scan, iterator should understand value types of kDeletionWithTimestamp.

Also add an additional check in debug mode to MergeHelper, and account for the presence of compaction filter.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10915

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D40960039

Pulled By: riversand963

fbshipit-source-id: dd79d86d7c79d05755bb939a3d94e0c53ddd7f59
2022-11-03 13:02:06 -07:00
Levi Tamasi 941d834739 Support Merge for wide-column entities during point lookups (#10916)
Summary:
The patch adds `Merge` support for wide-column entities to the point lookup
APIs, i.e. `Get`, `MultiGet`, `GetEntity`, and `GetMergeOperands`. (I plan to
update the iterator and compaction logic in separate PRs.) In terms of semantics,
the `Merge` operation is applied to the default (anonymous) column; any other
columns in the entity are unaffected.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10916

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D40962311

Pulled By: ltamasi

fbshipit-source-id: 244bc9d172be1af2f204796b2f89104e4d2fa373
2022-11-03 08:35:42 -07:00
Peter Dillinger cc8c8f6958 Refactor (Hyper)ClockCache code (#10887)
Summary:
For clean-up and in preparation for some other anticipated changes, including
* A new dynamically-scaling variant of HyperClockCache
* SecondaryCache support for HyperClockCache

This change does some refactoring for current and future code sharing and reusability. (Including follow-up on https://github.com/facebook/rocksdb/issues/10843)

## clock_cache.h
* TBD whether new variant will be a HyperClockCache or use some other name, so namespace is just clock_cache for the family of structures.
* A number of helper functions introduced and used.
* Pre-emptively split ClockHandle (shared among lock-free clock cache variants) and HandleImpl (specific to a kind of Table), and introduce template to plug new Table implementation into ClockCacheShard.

## clock_cache.cc
* Mostly using helper functions. Some things like `Rollback()` and `FreeDataMarkEmpty()` were not combined because `Rollback()` is Table-specific while `FreeDataMarkEmpty()` can be used with different table implementations.
* Performance testing indicated that despite more opportunities for parallelism, making a local copy of handle data for processing after marking an entry empty was slower than doing that processing before marking the entry empty (but after marking it "under construction"), thus avoiding a few words of copying data. At least for now, this answers the "TODO? Delay freeing?" questions (no).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10887

Test Plan:
fixed a unit testing gap; other minor test updates for refactoring

No functionality change

## Performance
Same setup as https://github.com/facebook/rocksdb/issues/10801:

Before: `readrandom [AVG 81 runs] : 627992 (± 5124) ops/sec`
After: `readrandom [AVG 81 runs] : 637512 (± 4866) ops/sec`

I've been getting some inconsistent results on restarts like the system is not being fair to the two processes, so I'm not sure there's such a real difference.

Reviewed By: anand1976

Differential Revision: D40959240

Pulled By: pdillinger

fbshipit-source-id: 0a8f3646b3bdb5bc7aaad60b26790b0779189949
2022-11-02 22:41:39 -07:00
Tal Zussman 0d5dc5fdb9 Add rocksdb_backup_restore_example to examples/.gitignore (#10825)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10825

Reviewed By: akankshamahajan15

Differential Revision: D40419234

Pulled By: ajkr

fbshipit-source-id: 2d700154eb5b2943d10a0f944f2b414ece353e4a
2022-11-02 15:02:09 -07:00
Yanqin Jin 0547cecb81 Reduce access to atomic variables in a test (#10909)
Summary:
With TSAN build on CircleCI (see mini-tsan in .circleci/config).
Sometimes `SeqAdvanceConcurrentTest.SeqAdvanceConcurrent` will get stuck when an experimental feature called
"unordered write" is enabled. Stack trace will be the following
```
Thread 7 (Thread 0x7f2284a1c700 (LWP 481523) "write_prepared_"):
#0  0x00000000004fa3f5 in __tsan_atomic64_load () at ./db/merge_context.h:15
https://github.com/facebook/rocksdb/issues/1  0x00000000005e5942 in std::__atomic_base<unsigned long>::load (this=0x7b74000012f8, __m=std::memory_order_seq_cst) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:481
https://github.com/facebook/rocksdb/issues/2  std::__atomic_base<unsigned long>::operator unsigned long (this=0x7b74000012f8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:341
https://github.com/facebook/rocksdb/issues/3  0x00000000005bf001 in rocksdb::SeqAdvanceConcurrentTest_SeqAdvanceConcurrent_Test::TestBody()::$_9::operator()(void*) const (this=0x7b14000085e8) at utilities/transactions/write_prepared_transaction_test.cc:1702

Thread 6 (Thread 0x7f228421b700 (LWP 481521) "write_prepared_"):
#0  0x000000000052178c in __tsan::MetaMap::GetAndLock(__tsan::ThreadState*, unsigned long, unsigned long, bool, bool) () at ./db/merge_context.h:15
https://github.com/facebook/rocksdb/issues/1  0x00000000004fa48e in __tsan_atomic64_load () at ./db/merge_context.h:15
https://github.com/facebook/rocksdb/issues/2  0x00000000005e5942 in std::__atomic_base<unsigned long>::load (this=0x7b74000012f8, __m=std::memory_order_seq_cst) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:481
https://github.com/facebook/rocksdb/issues/3  std::__atomic_base<unsigned long>::operator unsigned long (this=0x7b74000012f8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:341
https://github.com/facebook/rocksdb/issues/4  0x00000000005bf001 in rocksdb::SeqAdvanceConcurrentTest_SeqAdvanceConcurrent_Test::TestBody()::$_9::operator()(void*) const (this=0x7b14000085e8) at utilities/transactions/write_prepared_transaction_test.cc:1702
```

This is problematic and suspicious. Two threads will get stuck in the same place trying to load from an atomic variable.
https://github.com/facebook/rocksdb/blob/7.8.fb/utilities/transactions/write_prepared_transaction_test.cc#L1694:L1707. Not sure why two threads can reach the same point.

The stack trace shows that there may be a deadlock, since the two threads are on the same write thread (one is doing Prepare, while the other is trying to commit).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10909

Test Plan:
On CircleCI mini-tsan, apply a patch first so that we have a higher chance of hitting the same problematic situation,
```
 diff --git a/utilities/transactions/write_prepared_transaction_test.cc b/utilities/transactions/write_prepared_transaction_test.cc
index 4bc1f3744..bd5dc4924 100644
 --- a/utilities/transactions/write_prepared_transaction_test.cc
+++ b/utilities/transactions/write_prepared_transaction_test.cc
@@ -1714,13 +1714,13 @@ TEST_P(SeqAdvanceConcurrentTest, SeqAdvanceConcurrent) {
       size_t d = (n % base[bi + 1]) / base[bi];
       switch (d) {
         case 0:
-          threads.emplace_back(txn_t0, bi);
+          threads.emplace_back(txn_t3, bi);
           break;
         case 1:
-          threads.emplace_back(txn_t1, bi);
+          threads.emplace_back(txn_t3, bi);
           break;
         case 2:
-          threads.emplace_back(txn_t2, bi);
+          threads.emplace_back(txn_t3, bi);
           break;
         case 3:
           threads.emplace_back(txn_t3, bi);
```
then build and run tests
```
COMPILE_WITH_TSAN=1 CC=clang-13 CXX=clang++-13 ROCKSDB_DISABLE_ALIGNED_NEW=1 USE_CLANG=1 make V=1 -j32 check
gtest-parallel -r 100 ./write_prepared_transaction_test --gtest_filter=TwoWriteQueues/SeqAdvanceConcurrentTest.SeqAdvanceConcurrent/19
```
In the above, `SeqAdvanceConcurrent/19`. The tests 10 to 19 correspond to unordered write in which Prepare() and Commit() can both enter the same write thread.
Before this PR, there is a high chance of hitting the deadlock. With this PR, no deadlock has been encountered so far.

Reviewed By: ltamasi

Differential Revision: D40869387

Pulled By: riversand963

fbshipit-source-id: 81e82a70c263e4f3417597a201b081ee54f1deab
2022-11-02 14:54:58 -07:00
Brord van Wierst d80baa1396 Added placeholders for MADV defines (#10881)
Summary:
Cross compiling rocksdb with rust bindings to android leads to an error since 7.4.0 (Incusion of madvise)
This is due to missing placeholders for non-linux platforms.

This PR adds the missing placeholders.

See https://github.com/rust-rocksdb/rust-rocksdb/issues/697 for the specific error thrown.

I have just completed the CLA :)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10881

Reviewed By: akankshamahajan15

Differential Revision: D40726103

Pulled By: ajkr

fbshipit-source-id: 6b391636a74ef7e20d0daf47d332ddf0c14d5c34
2022-11-02 14:42:42 -07:00
Adam Retter 781a387488 Improve musl libc detection and provide an option for the user to override (#10889)
Summary:
The user may override the detection of whether to use GNU libc (the default) or musl libc by setting the environment variable: `ROCKSDB_MUSL_LIBC=true`.

Builds upon and supersedes: https://github.com/facebook/rocksdb/pull/9977

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10889

Reviewed By: akankshamahajan15

Differential Revision: D40788431

Pulled By: ajkr

fbshipit-source-id: ef594d973fc14cbadf28bfb38434231a18a2107c
2022-11-02 14:42:23 -07:00
Brad Smith 4a6906e28c Add OpenBSD/arm64 support for detection of CRC32 and PMULL (#10902)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10902

Reviewed By: akankshamahajan15

Differential Revision: D40839659

Pulled By: ajkr

fbshipit-source-id: 06be5919622f8cce1fce1097c5e654900bf7f8fb
2022-11-02 14:35:27 -07:00
Andrew Kryczka 5cf6ab6f31 Ran clang-format on db/ directory (#10910)
Summary:
Ran `find ./db/ -type f | xargs clang-format -i`. Excluded minor changes it tried to make on db/db_impl/. Everything else it changed was directly under db/ directory. Included minor manual touchups mentioned in PR commit history.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10910

Reviewed By: riversand963

Differential Revision: D40880683

Pulled By: ajkr

fbshipit-source-id: cfe26cda05b3fb9a72e3cb82c286e21d8c5c4174
2022-11-02 14:34:24 -07:00
akankshamahajan ff9ad2c39b Fix async_io failures in case there is error in reading data (#10890)
Summary:
Fix memory corruption error in scans if async_io is enabled. Memory corruption happened if data is overlapping between two buffers. If there is IOError while reading the data, it leads to empty buffer and other buffer already in progress of async read goes again for reading causing the error.
Fix: Added check to abort IO in second buffer if curr_ got empty.

This PR also fixes db_stress failures which happened when buffers are not aligned.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10890

Test Plan:
- Ran make crash_test -j32 with async_io enabled.
-  Ran benchmarks to make sure there is no regression.

Reviewed By: anand1976

Differential Revision: D40881731

Pulled By: akankshamahajan15

fbshipit-source-id: 39fcf2134c7b1bbb08415ede3e1ef261ac2dbc58
2022-11-01 16:06:51 -07:00
Yanqin Jin 7d26e4c5a3 Basic Support for Merge with user-defined timestamp (#10819)
Summary:
This PR implements the originally disabled `Merge()` APIs when user-defined timestamp is enabled.

Simplest usage:
```cpp
// assume string append merge op is used with '.' as delimiter.
// ts1 < ts2
db->Put(WriteOptions(), "key", ts1, "v0");
db->Merge(WriteOptions(), "key", ts2, "1");
ReadOptions ro;
ro.timestamp = &ts2;
db->Get(ro, "key", &value);
ASSERT_EQ("v0.1", value);
```

Some code comments are added for clarity.

Note: support for timestamp in `DB::GetMergeOperands()` will be done in a follow-up PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10819

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D40603195

Pulled By: riversand963

fbshipit-source-id: f96d6f183258f3392d80377025529f7660503013
2022-10-31 22:28:58 -07:00
Denis Hananein 9f3475eccf Fix compilation errors, clang++-15 (#10907)
Summary:
I've tried to compile the main branch, but there are two minor things which are make CE.
I'm not sure about the second one (`num_empty_non_l0_level`), probably there is should be additional assert.

```
-c ../cache/clock_cache.cc
[build] ../cache/clock_cache.cc:855:15: error: variable 'i' set but not used [-Werror,-Wunused-but-set-variable]
[build]   for (size_t i = 0; &array_[current] != h; i++) {
[build]               ^
```

```
[build] ../db/version_set.cc:3665:7: error: variable 'num_empty_non_l0_level' set but not used [-Werror,-Wunused-but-set-variable]
[build]   int num_empty_non_l0_level = 0;
[build]       ^
[build] 1 error generated.
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10907

Reviewed By: jay-zhuang

Differential Revision: D40866667

Pulled By: ajkr

fbshipit-source-id: 963b7bd56859d0b3b2779cd36fad229425cb7b17
2022-10-31 18:24:44 -07:00
Hui Xiao 7f5e438aee Move move wrong history entry out of 7.8 release (#10898)
Summary:
**Context/Summary:**

https://github.com/facebook/rocksdb/pull/10777 mistakenly added a history entry under 7.8 release but the PR is not included in 7.8. This mistake was due to rebase and merge didn't realize it was a conflict when "## Unreleased" was changed to "## 7.8.0 (10/22/2022)".

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10898

Test Plan: Make check

Reviewed By: akankshamahajan15

Differential Revision: D40861001

Pulled By: hx235

fbshipit-source-id: b2310c95490f6ebb90834a210c965a74c9560b51
2022-10-31 15:02:29 -07:00