Commit graph

11279 commits

Author SHA1 Message Date
Andrew Kryczka a0c63083d3 Fix explanation of XOR usage in KV checksum blog post (#10392)
Summary:
Thanks pdillinger for reminding us that we are protected from swapping corruptions due to independent seeds (and for suggesting that approach in the first place).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10392

Reviewed By: cbi42

Differential Revision: D37981819

Pulled By: ajkr

fbshipit-source-id: 3ed32982ae1dbc88eb92569010f9f2e8d190c962
2022-07-19 21:39:34 -07:00
Yanqin Jin b443d24f4d Stop operating on DB in a stress test background thread (#10373)
Summary:
Stress test background threads do not coordinate with test worker
threads for db reopen in the middle of a test run, thus accessing db
obj in a stress test bg thread can race with test workers. Remove the
TimestampedSnapshotThread.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10373

Test Plan:
```
./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_concurrent_memtable_write=1 \
--allow_data_in_errors=True --async_io=0 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 \
--backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 \
--block_size=16384 --bloom_bits=7.580319535285394 --bottommost_compression_type=disable \
--bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache \
--charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 \
--charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 \
--compact_files_one_in=1000000 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=0 \
--compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 \
--compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 \
--continuous_verification_interval=0 --create_timestamped_snapshot_one_in=20 --data_block_index_type=0 \
--db=/dev/shm/rocksdb/ --db_write_buffer_size=0 --delpercent=5 --delrangepercent=0 --destroy_db_initially=1 \
--detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=1 --enable_pipelined_write=0 \
--fail_if_options_file_error=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=2 \
--get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 \
--get_sorted_wal_files_one_in=0 --index_block_restart_interval=11 --index_type=0 --ingest_external_file_one_in=0 \
--iterpercent=0 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True \
--log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 \
--max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 \
--max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 \
--max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.5 \
--memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=0 --mock_direct_io=True \
--nooverwritepercent=1 --open_files=500000 --open_metadata_write_fault_one_in=0 \
--open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=20000 \
--optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=2 \
--pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=1 \
--prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=1000 \
--readpercent=55 --recycle_log_file_num=0 --reopen=100 --ribbon_starting_level=8 \
--secondary_cache_fault_one_in=0 --secondary_cache_uri= --snapshot_hold_ops=100000 \
--sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 \
--subcompactions=3 --sync=0 --sync_fault_injection=0 --target_file_size_base=2097152 \
--target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=1 \
--txn_write_policy=0 --unordered_write=0 --unpartitioned_pinning=0 \
--use_direct_io_for_flush_and_compaction=0 --use_direct_reads=1 --use_full_merge_v1=1 \
--use_merge=1 --use_multiget=0 --use_txn=1 --user_timestamp_size=0 --value_size_mult=32 \
--verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 \
--verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none \
--write_buffer_size=4194304 --write_dbid_to_manifest=0 --writepercent=35
```
make crash_test_with_txn
make crash_test_with_multiops_wc_txn

Reviewed By: jay-zhuang

Differential Revision: D37903189

Pulled By: riversand963

fbshipit-source-id: cd1728ad7ba4ce4cf47af23c4f65dda0956744f9
2022-07-19 11:25:43 -07:00
Andrew Kryczka e576f2ab19 Fix race conditions in GenericRateLimiter (#10374)
Summary:
Made locking strict for all accesses of `GenericRateLimiter` internal state.

`SetBytesPerSecond()` was the main problem since it had no locking, while the two updates it makes need to be done as one atomic operation.

The test case, "ConfigOptionsTest.ConfiguringOptionsDoesNotRevertRateLimiterBandwidth", is for the issue fixed in https://github.com/facebook/rocksdb/issues/10378, but I forgot to include the test there.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10374

Reviewed By: pdillinger

Differential Revision: D37906367

Pulled By: ajkr

fbshipit-source-id: ccde620d2a7f96d1401bdafd2bdb685cbefbafa5
2022-07-19 09:31:14 -07:00
Gang Liao 0b6bc101ba Charge blob cache usage against the global memory limit (#10321)
Summary:
To help service owners to manage their memory budget effectively, we have been working towards counting all major memory users inside RocksDB towards a single global memory limit (see e.g. https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache). The global limit is specified by the capacity of the block-based table's block cache, and is technically implemented by inserting dummy entries ("reservations") into the block cache. The goal of this task is to support charging the memory usage of the new blob cache against this global memory limit when the backing cache of the blob cache and the block cache are different.

This PR is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10321

Reviewed By: ltamasi

Differential Revision: D37913590

Pulled By: gangliao

fbshipit-source-id: eaacf23907f82dc7d18964a3f24d7039a2937a72
2022-07-18 23:26:57 -07:00
Jay Zhuang 18a61a1734 Fix seqno->time worker not scheduled with multi DB instances (#10383)
Summary:
`PeriodicWorkScheduler` is a global singleton, which were used to store per-instance setting `record_seqno_time_cadence_`. Move that to db_impl.h which is per-instance.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10383

Reviewed By: siying

Differential Revision: D37928009

Pulled By: jay-zhuang

fbshipit-source-id: e517754f4a9db98798ac04f72033d4b517f734e9
2022-07-18 19:08:39 -07:00
Changyu Bi 5b5144deb2 Per kv checksum blogpost (#10385)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10385

Test Plan: https://github.com/facebook/rocksdb/blob/main/docs/CONTRIBUTING.md

Reviewed By: ajkr

Differential Revision: D37944670

Pulled By: cbi42

fbshipit-source-id: 963c4d973dc748d4280b9c9d82dc31c33679f22a
2022-07-18 17:32:15 -07:00
Akanksha Mahajan f6c4d7a576 Fix hang in MultiRead with O_DIRECT and io_uring (#10368)
Summary:
Fix bug in O_DIRECT and io_uring when its EOF and bytes_read =
0 because of wrong check, it got added into incomplete list and gets stuck in an infinite loop as it will always return bytes_read = 0. The bug was introduced by PR https://github.com/facebook/rocksdb/pull/10197 and that PR is not released yet in any release branch.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10368

Test Plan: Added new unit test

Reviewed By: siying

Differential Revision: D37885184

Pulled By: akankshamahajan15

fbshipit-source-id: 35b36a44b696d29b2f6f25301aa1b19547b4e03b
2022-07-18 15:37:29 -07:00
Andrew Kryczka 25cc564ff7 Make RateLimiter not Customizable (#10378)
Summary:
(PR created for informational/testing purposes only.)

- Fixes lost dynamic updates to GenericRateLimiter bandwidth using `SetBytesPerSecond()`
- Benefit over #10374 is eliminating race conditions with Configurable framework.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10378

Reviewed By: pdillinger

Differential Revision: D37914865

fbshipit-source-id: d4f566d60ec9726d26932388c61671adf0ee0f30
2022-07-18 14:48:42 -07:00
sdong d9deffba57 Post 7.5 branch cut changes (#10376)
Summary:
After branch 7.5.fb branch is cut, following release process, upgrade version number to 7.6 and add 7.5.fb to format compatibility check.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10376

Test Plan: Watch CI

Reviewed By: ajkr

Differential Revision: D37927694

fbshipit-source-id: 71b37ae55ebb7c95a1bcc0d7eee643d6ba5f8461
2022-07-18 12:58:04 -07:00
Gang Liao ec4ebeff30 Support prepopulating/warming the blob cache (#10298)
Summary:
Many workloads have temporal locality, where recently written items are read back in a short period of time. When using remote file systems, this is inefficient since it involves network traffic and higher latencies. Because of this, we would like to support prepopulating the blob cache during flush.

This task is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10298

Reviewed By: ltamasi

Differential Revision: D37908743

Pulled By: gangliao

fbshipit-source-id: 9feaed234bc719d38f0c02975c1ad19fa4bb37d1
2022-07-17 07:13:59 -07:00
sg20180546 f5ef36a29a add sstfilewriter_delete_range (#10314)
Summary:
I add C API
db/c.cc
rocksdb_sstfilewriter_delete_range

and test it , PASS.
 can you review it ? ajkr

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10314

Reviewed By: riversand963

Differential Revision: D37657236

Pulled By: ajkr

fbshipit-source-id: c3b758daa36fbd9133210b011716914dff311278
2022-07-16 19:35:46 -07:00
Gang Liao 95ef007adc Support using secondary cache with the blob cache (#10349)
Summary:
RocksDB supports a two-level cache hierarchy (see https://rocksdb.org/blog/2021/05/27/rocksdb-secondary-cache.html), where items evicted from the primary cache can be spilled over to the secondary cache, or items from the secondary cache can be promoted to the primary one. We have a CacheLib-based non-volatile secondary cache implementation that can be used to improve read latencies and reduce the amount of network bandwidth when using distributed file systems. In addition, we have recently implemented a compressed secondary cache that can be used as a replacement for the OS page cache when e.g. direct I/O is used. The goals of this task are to add support for using a secondary cache with the blob cache and to measure the potential performance gains using `db_bench`.

This task is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10349

Reviewed By: ltamasi

Differential Revision: D37896773

Pulled By: gangliao

fbshipit-source-id: 7804619ce4a44b73d9e11ad606640f9385969c84
2022-07-16 03:54:37 -07:00
Guido Tagliavini Ponce efdb428edc Lock-free Lookup and Release in ClockCache (#10347)
Summary:
This is a prototype of a partially lock-free version of ClockCache. Roughly speaking, reads are lock-free and writes are lock-based:
- Lookup is lock-free.
- Release is lock-free, unless (i) no references to the element are left and (ii) it was marked for deletion or ``erase_if_last_ref`` is set.
- Insert and Erase still use a per-shard lock.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10347

Test Plan:
- ``make -j24 check``
- ``make -j24 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache --cache_size=1073741824 --block_size=16384" blackbox_crash_test_with_atomic_flush``

Reviewed By: pdillinger

Differential Revision: D37898776

Pulled By: guidotag

fbshipit-source-id: 6418fd980f786d69b871bf2fe959398e44cd3d80
2022-07-15 22:36:58 -07:00
Jay Zhuang faa0f9723c Tiered compaction: integrate Seqno time mapping with per key placement (#10370)
Summary:
Using the Sequence number to time mapping to decide if a key is hot or not in
compaction and place it in the corresponding level.

Note: the feature is not complete, level compaction will run indefinitely until
all penultimate level data is cold and small enough to not trigger compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10370

Test Plan:
CI
* Run basic db_bench for universal compaction manually

Reviewed By: siying

Differential Revision: D37892338

Pulled By: jay-zhuang

fbshipit-source-id: 792bbd91b1ccc2f62b5d14c53118434bcaac4bbe
2022-07-15 19:01:30 -07:00
sdong 7506c1a4ca Update HISTORY.md for the upcoming 7.5 release (#10372)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10372

Reviewed By: riversand963

Differential Revision: D37894412

fbshipit-source-id: 77055a460d662f3b89921102a16b1a726d324d84
2022-07-15 15:35:14 -07:00
Jay Zhuang fb579a221c Remove fixed TODO (#10241)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10241

Reviewed By: gitbw95

Differential Revision: D37369726

Pulled By: jay-zhuang

fbshipit-source-id: 1e94f0e2433aee42e9871043fa434291ce948eac
2022-07-15 14:47:36 -07:00
Jay Zhuang dcb6a3be4e Add helper function to get debug type name (#10243)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10243

Reviewed By: riversand963

Differential Revision: D37370236

Pulled By: jay-zhuang

fbshipit-source-id: 6e7a6fadf45fdfb5afe97b3f6fe4acf1260d4a86
2022-07-15 14:42:00 -07:00
Jay Zhuang 69a18b9bad VerifySstUniqueIds status is overrided for multi CFs (#10247)
Summary:
There's bug that basically we only report the last CF's
VerifySstUniqueIds() result:
https://github.com/facebook/rocksdb/pull/9990#discussion_r877268810

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10247

Test Plan: CI

Reviewed By: pdillinger

Differential Revision: D37384265

Pulled By: jay-zhuang

fbshipit-source-id: d462ad0eab39c9145c45a3db9c45539d5d76f7dd
2022-07-15 11:50:30 -07:00
Guido Tagliavini Ponce a543773bbc Add lean option to cache_bench (#10363)
Summary:
Sometimes we may not want to include extra computation in our cache_bench experiments. Here we add a flag to avoid any extra work. We also moved the timer start after the key generation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10363

Test Plan: Run cache_bench with and without the new flag and check that the appropriate code is being executed.

Reviewed By: pdillinger

Differential Revision: D37870416

Pulled By: guidotag

fbshipit-source-id: f853207b6643b9328e774251c3f679b1fd78a11a
2022-07-15 09:33:32 -07:00
sdong 00e68e7a30 DB::PutEntity() shouldn't be defined as =0 (#10364)
Summary:
DB::PutEntity() is defined as 0, but it is actually implemented in db/db_impl/db_impl_write.cc. It is incorrect, and might cause problems when users implement class DB themselves.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10364

Test Plan: See existing tests pass

Reviewed By: riversand963

Differential Revision: D37874886

fbshipit-source-id: b81713ddb707720b52d57a15de56a59414c24f66
2022-07-14 22:24:02 -07:00
Jay Zhuang a3acf2ef87 Add seqno to time mapping (#10338)
Summary:
Which will be used for tiered storage to preclude hot data from
compacting to the cold tier (the last level).
Internally, adding seqno to time mapping. A periodic_task is scheduled
to record the current_seqno -> current_time in certain cadence. When
memtable flush, the mapping informaiton is stored in sstable property.
During compaction, the mapping information are merged and get the
approximate time of sequence number, which is used to determine if a key
is recently inserted or not and preclude it from the last level if it's
recently inserted (within the `preclude_last_level_data_seconds`).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10338

Test Plan: CI

Reviewed By: siying

Differential Revision: D37810187

Pulled By: jay-zhuang

fbshipit-source-id: 6953be7a18a99de8b1cb3b162d712f79c2b4899f
2022-07-14 21:49:34 -07:00
Siying Dong 66685d6aa1 Fix HISTORY.md for misplaced items (#10362)
Summary:
Some items are misplaced to 7.4 but they are unreleased.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10362

Reviewed By: jay-zhuang

Differential Revision: D37859426

fbshipit-source-id: e2ad099227309ed2e0f3ca450a9a43986d681c7c
2022-07-14 11:33:47 -07:00
sdong c8b20d469d Make InternalKeyComparator not configurable (#10342)
Summary:
InternalKeyComparator is an internal class which is a simple wrapper of Comparator. https://github.com/facebook/rocksdb/pull/8336 made Comparator customizeable. As a side effect, internal key comparator was made configurable too. This introduces overhead to this simple wrapper. For example, every InternalKeyComparator will have an std::vector attached to it, which consumes memory and possible allocation overhead too.
We remove InternalKeyComparator from being customizable by making InternalKeyComparator not a subclass of Comparator.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10342

Test Plan: Run existing CI tests and make sure it doesn't fail

Reviewed By: riversand963

Differential Revision: D37771351

fbshipit-source-id: 917256ee04b2796ed82974549c734fb6c4d8ccee
2022-07-14 10:09:31 -07:00
Jay Zhuang 6ce0b2ca34 Tiered Compaction: per key placement support (#9964)
Summary:
Support per_key_placement for last level compaction, which will
be used for tiered compaction.
* compaction iterator reports which level a key should output to;
* compaction get the output level information and check if it's safe to
  output the data to penultimate level;
* all compaction output files will be installed.
* extra internal compaction stats added for penultimate level.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9964

Test Plan:
* Unittest
* db_bench, no significate difference: https://gist.github.com/jay-zhuang/3645f8fb97ec0ab47c10704bb39fd6e4
* microbench manual compaction no significate difference: https://gist.github.com/jay-zhuang/ba679b3e89e24992615ee9eef310e6dd
* run the db_stress multiple times (not covering the new feature) looks good (internal: https://fburl.com/sandcastle/9w84pp2m)

Reviewed By: ajkr

Differential Revision: D36249494

Pulled By: jay-zhuang

fbshipit-source-id: a96da57c8031c1df83e4a7a8567b657a112b80a3
2022-07-13 20:54:49 -07:00
Guido Tagliavini Ponce 7e1b417824 Revert NewClockCache signature (#10358)
Summary:
This complements https://github.com/facebook/rocksdb/issues/10351. This PR reverts NewClockCache's signature to an older version, expected by the users of the old (buggy) ClockCache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10358

Test Plan: ``make -j24 check`` and re-run the pre-release tests.

Reviewed By: siying

Differential Revision: D37832601

Pulled By: guidotag

fbshipit-source-id: 32a91d3da4119be187935003b7b897272ceb1950
2022-07-13 17:43:39 -07:00
Changyu Bi 5f9fe7f21e Added WAL compression checksum (#10319)
Summary:
Enabled zstd checksum flag in StreamingCompress so that WAL (de)compreression is protected by a checksum per compression frame.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10319

Test Plan:
- `make check`
- WAL perf: average ops/sec over 10 runs is 161226 pre PR and 159635 post PR (1% drop).
```
sudo TEST_TMPDIR=/dev/shm/memtable_write ./db_bench_checksum -benchmarks=fillseq -max_write_buffer_number=100 -num=1000000 -min_write_buffer_number_to_merge=10 -wal_compression=zstd
```

Reviewed By: ajkr

Differential Revision: D37673311

Pulled By: cbi42

fbshipit-source-id: 9f34a3bfc2a82e5c80b1ec63bb339a7465108ec9
2022-07-13 15:29:20 -07:00
Bo Wang 86c2d0a95d Add the secondary cache information into LRUCache:: GetPrintableOptions (#10346)
Summary:
If the primary cache is LRU cache and there is a secondary cache, add  Secondary Cache printable options into LRUCache::GetPrintableOptions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10346

Test Plan:
1. Current Unit Tests should pass.
2. Use db_bench (with compressed_secondary_cache ) and the LOG should includes the new printable options from Seoncdary Cache.

Reviewed By: jay-zhuang

Differential Revision: D37779310

Pulled By: gitbw95

fbshipit-source-id: 88ce1f7df6b5f25740e598d9e7fa91e4c414cb8f
2022-07-13 12:30:44 -07:00
Guido Tagliavini Ponce 9645e66fc9 Temporarily return a LRUCache from NewClockCache (#10351)
Summary:
ClockCache is still in experimental stage, and currently fails some pre-release fbcode tests. See https://www.internalfb.com/diff/D37772011. API calls to construct ClockCache are done via the function NewClockCache. For now, NewClockCache calls will return an LRUCache (with appropriate arguments), which is stable.

The idea that NewClockCache returns nullptr was also floated, but this would be interpreted as unsupported cache, and a default LRUCache would be constructed instead, potentially causing a performance regression that is harder to identify.

A new version of the NewClockCache function was created for our internal tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10351

Test Plan: ``make -j24 check`` and re-run the pre-release tests.

Reviewed By: pdillinger

Differential Revision: D37802685

Pulled By: guidotag

fbshipit-source-id: 0a8d10612ff21e576f7360cb13e20bc36e244972
2022-07-13 08:45:44 -07:00
Yanqin Jin b283f041f5 Stop tracking syncing live WAL for performance (#10330)
Summary:
With https://github.com/facebook/rocksdb/issues/10087, applications calling `SyncWAL()` or writing with `WriteOptions::sync=true` can suffer
from performance regression. This PR reverts to original behavior of tracking the syncing of closed WALs.
After we revert back to old behavior, recovery, whether kPointInTime or kAbsoluteConsistency, may fail to
detect corruption in synced WALs if the corruption is in the live WAL.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10330

Test Plan:
make check

Before https://github.com/facebook/rocksdb/issues/10087
```bash
fillsync     :     750.269 micros/op 1332 ops/sec 75.027 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync     :     776.492 micros/op 1287 ops/sec 77.649 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync [AVG 2 runs] : 1310 (± 44) ops/sec;    0.1 (± 0.0) MB/sec
fillsync     :     805.625 micros/op 1241 ops/sec 80.563 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync [AVG 3 runs] : 1287 (± 51) ops/sec;    0.1 (± 0.0) MB/sec
fillsync [AVG    3 runs] : 1287 (± 51) ops/sec;    0.1 (± 0.0) MB/sec
fillsync [MEDIAN 3 runs] : 1287 ops/sec;    0.1 MB/sec
```

Before this PR and after https://github.com/facebook/rocksdb/issues/10087
```bash
fillsync     :    1479.601 micros/op 675 ops/sec 147.960 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync     :    1626.080 micros/op 614 ops/sec 162.608 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync [AVG 2 runs] : 645 (± 59) ops/sec;    0.1 (± 0.0) MB/sec
fillsync     :    1588.402 micros/op 629 ops/sec 158.840 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync [AVG 3 runs] : 640 (± 35) ops/sec;    0.1 (± 0.0) MB/sec
fillsync [AVG    3 runs] : 640 (± 35) ops/sec;    0.1 (± 0.0) MB/sec
fillsync [MEDIAN 3 runs] : 629 ops/sec;    0.1 MB/sec
```

After this PR
```bash
fillsync     :     749.621 micros/op 1334 ops/sec 74.962 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync     :     865.577 micros/op 1155 ops/sec 86.558 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync [AVG 2 runs] : 1244 (± 175) ops/sec;    0.1 (± 0.0) MB/sec
fillsync     :     845.837 micros/op 1182 ops/sec 84.584 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync [AVG 3 runs] : 1223 (± 109) ops/sec;    0.1 (± 0.0) MB/sec
fillsync [AVG    3 runs] : 1223 (± 109) ops/sec;    0.1 (± 0.0) MB/sec
fillsync [MEDIAN 3 runs] : 1182 ops/sec;    0.1 MB/sec
```

Reviewed By: ajkr

Differential Revision: D37725212

Pulled By: riversand963

fbshipit-source-id: 8fa7d13b3c7662be5d56351c42caf3266af937ae
2022-07-12 17:16:57 -07:00
sdong 769b156e65 Remove customized naming from InternalKeyComparator (#10343)
Summary:
InternalKeyComparator is a thin wrapper around user comparator. Storing a string for name is relatively expensive to this small wrapper for both CPU and memory usage. Try to remove it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10343

Test Plan: Run existing tests

Reviewed By: ajkr

Differential Revision: D37772469

fbshipit-source-id: d2d106a8d022193058fd7f6b220108e3d94aca34
2022-07-12 13:30:35 -07:00
Yanqin Jin 7679f22a89 Add coverage for the combination of write-prepared and WAL recycling (#10350)
Summary:
as title.
Test plan
- make check
- CI on PR
- TEST_TMPDIR=/dev/shm make crash_test_with_multiops_wp_txn (tested with successful run)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10350

Reviewed By: ajkr

Differential Revision: D37792872

Pulled By: riversand963

fbshipit-source-id: ff064093b7f715d0acf387af2e3ae87b1278b52b
2022-07-12 13:17:21 -07:00
Yanqin Jin 7e2004a123 Remove unused variables (#10327)
Summary:
As title.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10327

Test Plan: make check

Reviewed By: gitbw95

Differential Revision: D37699040

Pulled By: riversand963

fbshipit-source-id: 305a88628907a47dea53c4d9aec9c2f5bb9b58df
2022-07-11 13:55:23 -07:00
Yanqin Jin 2f13f5f7d0 Add coverage for timestamped snapshot to MultiOpsTxnsStressTest (#10325)
Summary:
As title.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10325

Test Plan:
```bash
TEST_TMPDIR=/dev/shm/rocksdb/ make crash_test_with_multiops_wc_txn
TEST_TMPDIR=/dev/shm/rocksdb/ make crash_test_with_txn
```

Reviewed By: akankshamahajan15

Differential Revision: D37688742

Pulled By: riversand963

fbshipit-source-id: e198ace921898af63f99e869568c1a7bbf69f1a4
2022-07-11 12:44:08 -07:00
zczhu 96206531bc Support reservation in thread pool (#10278)
Summary:
Add `ReserveThreads` and `ReleaseThreads` functions in thread pool to support reservation in for a specific thread pool.  With this feature, a thread will be blocked if the number of waiting threads (noted by `num_waiting_threads_`) equals the number of reserved threads (noted by `reserved_threads_`), normally `reserved_threads_` is upper bounded by `num_waiting_threads_`; in rare cases (e.g. `SetBackgroundThreadsInternal` is called when some threads are already reserved), `num_waiting_threads_` can be less than `reserved_threads`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10278

Test Plan: Add `ReserveThreads` unit test in `env_test`. Update the unit test `SimpleColumnFamilyInfoTest` in `thread_list_test` with adding `ReserveThreads` related assertions.

Reviewed By: hx235

Differential Revision: D37640946

Pulled By: littlepig2013

fbshipit-source-id: 4d691f6b9a433569f96ab52d52c3defe5b065367
2022-07-08 19:48:09 -07:00
Gang Liao 28586be8ec Update HISTORY.md for blob cache (#10328)
Summary:
Update HISTORY.md for blob cache.  Implementation can be found from Github issue https://github.com/facebook/rocksdb/issues/10156 (or Github PRs https://github.com/facebook/rocksdb/issues/10155, https://github.com/facebook/rocksdb/issues/10178, https://github.com/facebook/rocksdb/issues/10225, https://github.com/facebook/rocksdb/issues/10198, and https://github.com/facebook/rocksdb/issues/10272).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10328

Reviewed By: riversand963

Differential Revision: D37732514

Pulled By: gangliao

fbshipit-source-id: 4c942a41c07914bfc8db56a0d3cf4d3e53d5963f
2022-07-08 18:35:52 -07:00
Akanksha Mahajan fc51b7f33a Fix clang error implicit conversion loses integer precision (#10323)
Summary:
Fix  error: implicit conversion loses integer precision:
'size_t' (aka 'unsigned long') to 'uint32_t' (aka 'unsigned int')
[-Werror,-Wshorten-64-to-32]

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10323

Test Plan: USE_CLANG=1 make -j32

Reviewed By: gitbw95

Differential Revision: D37688250

Pulled By: akankshamahajan15

fbshipit-source-id: 443873e41279ee8bdbe8452818549792047532fb
2022-07-07 11:35:15 -07:00
Gang Liao c987eb4712 Eliminate the copying of blobs when serving reads from the cache (#10297)
Summary:
The blob cache enables an optimization on the read path: when a blob is found in the cache, we can avoid copying it into the buffer provided by the application. Instead, we can simply transfer ownership of the cache handle to the target `PinnableSlice`. (Note: this relies on the `Cleanable` interface, which is implemented by `PinnableSlice`.)

This has the potential to save a lot of CPU, especially with large blob values.

This task is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10297

Reviewed By: riversand963

Differential Revision: D37640311

Pulled By: gangliao

fbshipit-source-id: 92de0e35cc703d06c87c5c1861cc2899ec52234a
2022-07-06 18:57:29 -07:00
Guido Tagliavini Ponce c277aeb42c Midpoint insertions in ClockCache (#10305)
Summary:
When an element is first inserted into the ClockCache, it is now assigned either medium or high clock priority, depending on whether its cache priority is low or high, respectively. This is a variant of LRUCache's midpoint insertions. The main difference is that LRUCache can specify the allocated capacity for high-priority elements via the ``high_pri_pool_ratio`` parameter. Contrarily, in ClockCache, low- and high-priority elements compete for all cache slots, and one group can take over the other (of course, it takes more low-priority insertions to push out high-priority elements). However, just as LRUCache, ClockCache provides the following guarantee: a high-priority element will not be evicted before a low-priority element that was inserted earlier in time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10305

Test Plan: ``make -j24 check``

Reviewed By: pdillinger

Differential Revision: D37607787

Pulled By: guidotag

fbshipit-source-id: 24d9f2523d2f4e6415e7f0029cc061fa275c2040
2022-07-06 18:28:35 -07:00
zczhu 8debfe2b21 Replace the output split key with its pointer in subcompaction (#10316)
Summary:
Earlier implementation of cutting the output files with a compact cursor under Round-Robin priority uses `Valid()` to determine if the `output_split_key` is valid in `ShouldStopBefore`. This contributes to excessive CPU computation, as pointed out by [this issue](https://github.com/facebook/rocksdb/issues/10315). In this PR, we change the type of `output_split_key` to be `InternalKey*` and set it as `nullptr` if it is not going to be used in `ShouldStopBefore`, `Valid()` condition checking can be avoided using that pointer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10316

Reviewed By: ajkr

Differential Revision: D37661492

Pulled By: littlepig2013

fbshipit-source-id: 66ff1105f3378e5573d3a126fdaff9bb23b5498f
2022-07-06 16:19:45 -07:00
Peter Dillinger e6c5e0ab9a Have Cache use Status::MemoryLimit (#10262)
Summary:
I noticed it would clean up some things to have Cache::Insert()
return our MemoryLimit Status instead of Incomplete for the case in
which the capacity limit is reached. I suspect this fixes some existing but
unknown bugs where this Incomplete could be confused with other uses
of Incomplete, especially no_io cases. This is the most suspicious case I
noticed, but was not able to reproduce a bug, in part because the existing
code is not covered by unit tests (FIXME added): 57adbf0e91/table/get_context.cc (L397)

I audited all the existing uses of IsIncomplete and updated those that
seemed relevant.

HISTORY updated with a clear warning to users of strict_capacity_limit=true
to update uses of `IsIncomplete()`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10262

Test Plan: updated unit tests

Reviewed By: hx235

Differential Revision: D37473155

Pulled By: pdillinger

fbshipit-source-id: 4bd9d9353ccddfe286b03ebd0652df8ce20f99cb
2022-07-06 14:41:46 -07:00
Manuel Ung 071fe39c05 Allow user to pass git command to makefile (#10318)
Summary:
This allows users to pass their git command with extra options if necessary.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10318

Reviewed By: ajkr

Differential Revision: D37661175

Pulled By: lth

fbshipit-source-id: 2a7cf27626c74f167471e6ec57e3870630a582b0
2022-07-06 14:28:00 -07:00
Akanksha Mahajan 2acbf386a3 Provide support for direct_reads with async_io (#10197)
Summary:
Provide support for use_direct_reads with async_io.

TestPlan:
-  Updated unit tests
-  db_bench: Results in https://github.com/facebook/rocksdb/pull/10197#issuecomment-1159239420
- db_stress
```
export CRASH_TEST_EXT_ARGS=" --async_io=1 --use_direct_reads=1"
make crash_test -j
```
- Ran db_bench on previous RocksDB version before any async_io implementation (as there have many changes in different PRs in this area) https://github.com/facebook/rocksdb/pull/10197#issuecomment-1160781563.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10197

Reviewed By: anand1976

Differential Revision: D37255646

Pulled By: akankshamahajan15

fbshipit-source-id: fec61ae15bf4d625f79dea56e4f86e0e307ba920
2022-07-06 11:42:59 -07:00
Mark Callaghan 177b2fa341 Set the value for --version, add --build_info (#10275)
Summary:
./db_bench --version
db_bench version 7.5.0

./db_bench --build_info
 (RocksDB) 7.5.0
    rocksdb_build_date: 2022-06-29 09:58:04
    rocksdb_build_git_sha: d96febeeaa
    rocksdb_build_git_tag: print_version_githash

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10275

Test Plan: run it

Reviewed By: ajkr

Differential Revision: D37524720

Pulled By: mdcallag

fbshipit-source-id: 0f6c819dbadf7b033a4a3ba2941992bb76b4ff99
2022-07-06 09:58:45 -07:00
Changyu Bi f9cfc6a808 Updated NewDataBlockIterator to not fetch compression dict for non-da… (#10310)
Summary:
…ta blocks

During MyShadow testing, ajkr helped me find out that with partitioned index and dictionary compression enabled, `PartitionedIndexIterator::InitPartitionedIndexBlock()` spent considerable amount of time (1-2% CPU) on fetching uncompression dictionary. Fetching uncompression dict was not needed since the index blocks were not compressed (and even if they were, they use empty dictionary). This should only affect use cases with partitioned index, dictionary compression and without uncompression dictionary pinned. This PR updates NewDataBlockIterator to not fetch uncompression dictionary when it is not for data blocks.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10310

Test Plan:
1. `make check`
2. Perf benchmark: 1.5% (143950 -> 146176) improvement in op/sec for partitioned index + dict compression benchmark.
For default config without partitioned index and without dict compression, there is no regression in readrandom perf from multiple runs of db_bench.

```
# Set up for partitioned index with dictionary compression
TEST_TMPDIR=/dev/shm ./db_bench_main -benchmarks=filluniquerandom,compact -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -partition_index=true  -compression_max_dict_bytes=16384 -compression_zstd_max_train_bytes=1638400

# Pre PR
TEST_TMPDIR=/dev/shm ./db_bench_main -use_existing_db=true -benchmarks=readrandom[-X50] -partition_index=true
readrandom [AVG    50 runs] : 143950 (± 1108) ops/sec;   15.9 (± 0.1) MB/sec
readrandom [MEDIAN 50 runs] : 144406 ops/sec;   16.0 MB/sec

# Post PR
TEST_TMPDIR=/dev/shm ./db_bench_opt -use_existing_db=true -benchmarks=readrandom[-X50] -partition_index=true
readrandom [AVG    50 runs] : 146176 (± 1121) ops/sec;   16.2 (± 0.1) MB/sec
readrandom [MEDIAN 50 runs] : 146014 ops/sec;   16.2 MB/sec

# Set up for no partitioned index and no dictionary compression
TEST_TMPDIR=/dev/shm/baseline ./db_bench_main -benchmarks=filluniquerandom,compact -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false
# Pre PR
TEST_TMPDIR=/dev/shm/baseline/ ./db_bench_main --use_existing_db=true "--benchmarks=readrandom[-X50]"
readrandom [AVG    50 runs] : 158546 (± 1000) ops/sec;   17.5 (± 0.1) MB/sec
readrandom [MEDIAN 50 runs] : 158280 ops/sec;   17.5 MB/sec

# Post PR
TEST_TMPDIR=/dev/shm/baseline/ ./db_bench_opt --use_existing_db=true "--benchmarks=readrandom[-X50]"
readrandom [AVG    50 runs] : 161061 (± 1520) ops/sec;   17.8 (± 0.2) MB/sec
readrandom [MEDIAN 50 runs] : 161596 ops/sec;   17.9 MB/sec
```

Reviewed By: ajkr

Differential Revision: D37631358

Pulled By: cbi42

fbshipit-source-id: 6ca2665e270e63871968e061ba4a99d3136785d9
2022-07-06 09:30:25 -07:00
Changyu Bi 0ff7713112 Handoff checksum during WAL replay (#10212)
Summary:
Added checksum protection for write batch content read from WAL to when per key-value checksum is computed on the write batch. This gives full coverage on write batch integrity of WAL replay to memtable.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10212

Test Plan:
- Added unit test and the existing tests (replay code path covers the change in this PR): `make -j32 check`
- Stress test: ran `db_stress` for 30min.
- Perf regression:
```
# setup
TEST_TMPDIR=/dev/shm/100MB_WAL_DB/ ./db_bench -benchmarks=fillrandom -write_buffer_size=1048576000
# benchmark db open time
TEST_TMPDIR=/dev/shm/100MB_WAL_DB/ /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=overwrite -write_buffer_size=1048576000 -writes=1 -report_open_timing=true

For 20 runs, pre-PR avg: 3734.31ms, post-PR avg: 3790.06 ms (~1.5% regression).

Pre-PR
OpenDb:     3714.36 milliseconds
OpenDb:     3622.71 milliseconds
OpenDb:     3591.17 milliseconds
OpenDb:     3674.7 milliseconds
OpenDb:     3615.79 milliseconds
OpenDb:     3982.83 milliseconds
OpenDb:     3650.6 milliseconds
OpenDb:     3809.26 milliseconds
OpenDb:     3576.44 milliseconds
OpenDb:     3638.12 milliseconds
OpenDb:     3845.68 milliseconds
OpenDb:     3677.32 milliseconds
OpenDb:     3659.64 milliseconds
OpenDb:     3837.55 milliseconds
OpenDb:     3899.64 milliseconds
OpenDb:     3840.72 milliseconds
OpenDb:     3802.71 milliseconds
OpenDb:     3573.27 milliseconds
OpenDb:     3895.76 milliseconds
OpenDb:     3778.02 milliseconds

Post-PR:
OpenDb:     3880.46 milliseconds
OpenDb:     3709.02 milliseconds
OpenDb:     3954.67 milliseconds
OpenDb:     3955.64 milliseconds
OpenDb:     3958.64 milliseconds
OpenDb:     3631.28 milliseconds
OpenDb:     3721 milliseconds
OpenDb:     3729.89 milliseconds
OpenDb:     3730.55 milliseconds
OpenDb:     3966.32 milliseconds
OpenDb:     3685.54 milliseconds
OpenDb:     3573.17 milliseconds
OpenDb:     3703.75 milliseconds
OpenDb:     3873.62 milliseconds
OpenDb:     3704.4 milliseconds
OpenDb:     3820.98 milliseconds
OpenDb:     3721.62 milliseconds
OpenDb:     3770.86 milliseconds
OpenDb:     3949.78 milliseconds
OpenDb:     3760.07 milliseconds
```

Reviewed By: ajkr

Differential Revision: D37302092

Pulled By: cbi42

fbshipit-source-id: 7346e625f453ce4c0e5d708776cd1fb2af6b068b
2022-07-05 15:44:35 -07:00
Yanqin Jin caced09e79 Expand stress test coverage for user-defined timestamp (#10280)
Summary:
Before this PR, we call `now()` to get the wall time before performing point-lookup and range
scans when user-defined timestamp is enabled.

With this PR, we expand the coverage to:
- read with an older timestamp which is larger then the wall time when the process starts but potentially smaller than now()
- add coverage for `ReadOptions::iter_start_ts != nullptr`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10280

Test Plan:
```bash
make check
```

Also,
```bash
TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_ts
```

So far, we have had four successful runs of the above

In addition,
```bash
TEST_TMPDIR=/dev/shm/rocksdb make crash_test
```
Succeeded twice showing no regression.

Reviewed By: ltamasi

Differential Revision: D37539805

Pulled By: riversand963

fbshipit-source-id: f2d9887ad95245945ce17a014d55bb93f00e1cb5
2022-07-05 13:30:15 -07:00
Mark Callaghan 9eced1a344 Add the git hash and full RocksDB version to report.tsv (#10277)
Summary:
Previously the version was displayed as $major.$minor
This changes it to $major.$minor.$path

This also adds the git hash for the time from which RocksDB was built to the end of report.tsv. I confirmed that benchmark_log_tool.py still parses it and that the people
who consume/graph these results are OK with it.

Example output:
ops_sec	mb_sec	lsm_sz	blob_sz	c_wgb	w_amp	c_mbps	c_wsecs	c_csecs	b_rgb	b_wgb	usec_op	p50	p99	p99.9	p99.99	pmax	uptime	stall%	Nstall	u_cpu	s_cpu	rss	test	date	version	job_id	githash
609488	244.1	1GB	0.0GB,	1.4	0.7	93.3	39	38	0	0	1.6	1.0	4	15	26	5365	15	0.0	0	0.1	0.0	0.5	fillseq.wal_disabled.v400	2022-06-29T13:36:05	7.5.0		6115254416

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10277

Test Plan: Run it

Reviewed By: jay-zhuang

Differential Revision: D37532418

Pulled By: mdcallag

fbshipit-source-id: 55e472640d51265819b228d3373c9fa9b62b660d
2022-07-05 11:46:36 -07:00
sdong a9565ccb26 Try to trivial move more than one files (#10190)
Summary:
In leveled compaction, try to trivial move more than one files if possible, up to 4 files or max_compaction_bytes. This is to allow higher write throughput for some use cases where data is loaded in sequential order, where appying compaction results is the bottleneck.

When pick up a file to compact and it doesn't have overlapping files in the next level, try to expand to the next file if there is still no overlapping.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10190

Test Plan:
Add some unit tests.
For performance, Try to run
./db_bench_multi_move --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000 -level_compaction_dynamic_level_bytes
Together with https://github.com/facebook/rocksdb/pull/10188 , stalling will be eliminated in this benchmark.

Reviewed By: jay-zhuang

Differential Revision: D37230647

fbshipit-source-id: 42b260f545c46abc5d90335ac2bbfcd09602b549
2022-07-05 10:10:37 -07:00
Yanqin Jin d6b9c4ae26 Update code comment and logging for secondary instance (#10260)
Summary:
Before this PR, it is required that application open RocksDB secondary
instance with `max_open_files = -1`. This is a hacky workaround that
prevents IOErrors on the seconary instance during point-lookup or range
scan caused by primary instance deleting the table files. This is not
necessary if the application can coordinate the primary and secondaries
so that primary does not delete files that are still being used by the
secondaries. Or users can provide a custom Env/FS implementation that
deletes the files only after all primary and secondary instances
indicate files are obsolete and deleted.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10260

Test Plan: make check

Reviewed By: jay-zhuang

Differential Revision: D37462633

Pulled By: riversand963

fbshipit-source-id: 9c2fc939f49663efa61e3d60c8f1e01d64b9d72c
2022-07-05 10:09:44 -07:00
yite.gu a9117a3490 BackupEngine: we can return immediately if GetFileSize failed (#10176)
Summary:
In some case, GetFileSize would be failure in copy_file_cb.
If failure, we can return immediately, the subsequent code
is meaningless, and add a log info let user know that problem
happen here.

Singed-off-by: Yite Gu <ess_gyt@qq.com>

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10176

Reviewed By: cbi42

Differential Revision: D37510888

Pulled By: ajkr

fbshipit-source-id: 044ad8c45852fd19b8cd564b11f65d40c39e296f
2022-07-03 23:16:09 -07:00