rocksdb

Commit Graph

Author	SHA1	Message	Date
Changyu Bi	cd15331711	Print status when VerifyOrSyncValue() fails with non-OK status (#12217 ) Summary: This should print more helpful message when a non-ok status like Corruption is returned. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12217 Test Plan: CI passes. Reviewed By: jaykorean Differential Revision: D52637595 Pulled By: cbi42 fbshipit-source-id: e810eeb4cba633d4d4c5d198da4468995e4ed427	2024-01-09 14:20:08 -08:00
akankshamahajan	1de6940980	Fix heap use after free error in FilePrefetchBuffer (#12211 ) Summary: Fix heap use after free error in FilePrefetchBuffer Fix heap use after free error in FilePrefetchBuffer Pull Request resolved: https://github.com/facebook/rocksdb/pull/12211 Test Plan: Ran db_stress in ASAN mode ``` ==652957==ERROR: AddressSanitizer: heap-use-after-free on address 0x6150006d8578 at pc 0x7f91f74ae85b bp 0x7f91c25f90c0 sp 0x7f91c25f90b8 READ of size 8 at 0x6150006d8578 thread T48 #0 0x7f91f74ae85a in void __gnu_cxx::new_allocator<rocksdb::BufferInfo>::construct<rocksdb::BufferInfo, rocksdb::BufferInfo&>(rocksdb::BufferInfo, rocksdb::BufferInfo&) /mnt/gvfs/third-party2/libgcc/c00dcc6a3e4125c7e8b248e9a79c14b78ac9e0ca/11.x/platform010/5684a5a/include/c++/trunk/ext/new_allocator.h:163 https://github.com/facebook/rocksdb/issues/1 0x7f91f74ae85a in void std::allocator_traits<std::allocator<rocksdb::BufferInfo> >::construct<rocksdb::BufferInfo, rocksdb::BufferInfo&>(std::allocator<rocksdb::BufferInfo>&, rocksdb::BufferInfo*, rocksdb::BufferInfo&) /mnt/gvfs/third-party2/libgcc/c00dcc6a3e4125c7e8b248e9a79c14b78ac9e0ca/11.x/platform010/5684a5a/include/c++/trunk/bits/alloc_traits.h:512 https://github.com/facebook/rocksdb/issues/2 0x7f91f74ae85a in rocksdb::BufferInfo& std::deque<rocksdb::BufferInfo, std::allocator<rocksdb::BufferInfo> >::emplace_back<rocksdb::BufferInfo&>(rocksdb::BufferInfo*&) /mnt/gvfs/third-party2/libgcc/c00dcc6a3e4125c7e8b248e9a79c14b78ac9e0ca/11.x/platform010/5684a5a/include/c++/trunk/bits/deque.tcc:170 https://github.com/facebook/rocksdb/issues/3 0x7f91f74b93d8 in rocksdb::FilePrefetchBuffer::FreeAllBuffers() file/file_prefetch_buffer.h:557 ``` Reviewed By: ajkr Differential Revision: D52575217 Pulled By: akankshamahajan15 fbshipit-source-id: 6811ec10a393f5a62fedaff0fab5fd6e823c2687	2024-01-05 18:10:58 -08:00
Andrew Kryczka	5a9ecf6614	Automated modernization (#12210 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12210 Reviewed By: hx235 Differential Revision: D52559771 Pulled By: ajkr fbshipit-source-id: 1ccdd3a0180cc02bc0441f20b0e4a1db50841b03	2024-01-05 11:53:57 -08:00
Peter Dillinger	5da900f28a	Fix a case of ignored corruption in creating backups (#12200 ) Summary: We often need to read the table properties of an SST file when taking a backup. However, we currently do not check checksums for this step, and even with that enabled, we ignore failures. This change ensures we fail creating a backup if corruption is detected in that step of reading table properties. To get this working properly (with existing unit tests), we also add some temperature handling logic like already exists in BackupEngineImpl::ReadFileAndComputeChecksum and elsewhere in BackupEngine. Also, SstFileDumper needed a fix to its error handling logic. This was originally intended to help diagnose some mysterious failures (apparent corruptions) seen in taking backups in the crash test, though that is now fixed in https://github.com/facebook/rocksdb/pull/12206 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12200 Test Plan: unit test added that corrupts table properties, along with existing tests Reviewed By: ajkr Differential Revision: D52520674 Pulled By: pdillinger fbshipit-source-id: 032cfc0791428f3b8147d34c7d424ab128e28f42	2024-01-05 09:48:19 -08:00
akankshamahajan	5cb2d09d47	Refactor FilePrefetchBuffer code (#12097 ) Summary: Summary - Refactor FilePrefetchBuffer code - Implementation: FilePrefetchBuffer maintains a deque of free buffers (free_bufs_) of size num_buffers_ and buffers (bufs_) which contains the prefetched data. Whenever a buffer is consumed or is outdated (w.r.t. to requested offset), that buffer is cleared and returned to free_bufs_. If a buffer is available in free_bufs_, it's moved to bufs_ and is sent for prefetching. num_buffers_ defines how many buffers are maintained that contains prefetched data. If num_buffers_ == 1, it's a sequential read flow. Read API will be called on that one buffer whenever the data is requested and is not in the buffer. If num_buffers_ > 1, then the data is prefetched asynchronosuly in the buffers whenever the data is consumed from the buffers and that buffer is freed. If num_buffers > 1, then requested data can be overlapping between 2 buffers. To return the continuous buffer overlap_bufs_ is used. The requested data is copied from 2 buffers to the overlap_bufs_ and overlap_bufs_ is returned to the caller. - Merged Sync and Async code flow into one in FilePrefetchBuffer. Test Plan - - Crash test passed - Unit tests - Pending - Benchmarks Pull Request resolved: https://github.com/facebook/rocksdb/pull/12097 Reviewed By: ajkr Differential Revision: D51759552 Pulled By: akankshamahajan15 fbshipit-source-id: 69a352945affac2ed22be96048d55863e0168ad5	2024-01-05 09:29:01 -08:00
Peter Dillinger	ed46981bea	Fix and defend against FilePrefetchBuffer combined with mmap reads (#12206 ) Summary: FilePrefetchBuffer makes an unchecked assumption about the behavior of RandomAccessFileReader::Read: that it will write to the provided buffer rather than returning the data in an alternate buffer. FilePrefetchBuffer has been quietly incompatible with mmap reads (e.g. allow_mmap_reads / use_mmap_reads) because in that case an alternate buffer is returned (mmapped memory). This incompatibility currently leads to quiet data corruption, as seen in amplified crash test failure in https://github.com/facebook/rocksdb/issues/12200. In this change, * Check whether RandomAccessFileReader::Read has the expected behavior, and fail if not. (Assertion failure in debug build, return Corruption in release build.) This will detect future regressions synchronously and precisely, rather than relying on debugging downstream data corruption. * Why not recover? My understanding is that FilePrefetchBuffer is not intended for use when RandomAccessFileReader::Read uses an alternate buffer, so quietly recovering could lead to undesirable (inefficient) behavior. * Mention incompatibility with mmap-based readers in the internal API comments for FilePrefetchBuffer * Fix two cases where FilePrefetchBuffer could be used with mmap, both stemming from SstFileDumper, though one fix is in BlockBasedTableReader. There is currently no way to ask a RandomAccessFileReader whether it's using mmap, so we currently have to rely on other options as clues. Keeping separate from https://github.com/facebook/rocksdb/issues/12200 in part because this change is more appropriate for backport than that one. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12206 Test Plan: * Manually verified that the new check aids in debugging. * Unit test added, that fails if either fix is missed. * Ran blackbox_crash_test for hours, with and without https://github.com/facebook/rocksdb/issues/12200 Reviewed By: akankshamahajan15 Differential Revision: D52551701 Pulled By: pdillinger fbshipit-source-id: dea87c5782b7c484a6c6e424585c8832dfc580dc	2024-01-04 18:39:05 -08:00
git-hulk	f11a0237b6	sst_dump: display metaindex_handle and the index_handle's offset and size in footer information (#12204 ) Summary: Before applying this PR, the footer details: ``` Footer Details: -------------------------------------- metaindex handle: B0E499405C index handle: 8AC49940CD17 table_magic_number: 9863518390377041911 format version: 5 ``` and after ``` Footer Details: -------------------------------------- metaindex handle: B0E499405C offset: 134640176 size: 92 index handle: 8AC49940CD17 offset: 134636042 size: 3021 table_magic_number: 9863518390377041911 format version: 5 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12204 Reviewed By: cbi42 Differential Revision: D52547832 Pulled By: ajkr fbshipit-source-id: 5ff58ed347f9caf919bbdc6b242e3306d2525653	2024-01-04 14:11:15 -08:00
Peter Dillinger	ea6ed0d56e	Re-enable ingest_external_file with mmap_read in crash test (#12201 ) Summary: I suspect the issue called out in https://github.com/facebook/rocksdb/issues/9357 was fixed in https://github.com/facebook/rocksdb/issues/11328 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12201 Test Plan: `make blackbox_crash_test` for hours Reviewed By: ajkr Differential Revision: D52543075 Pulled By: pdillinger fbshipit-source-id: b705a6bdb2799a5f51ad2746df2083aa82f360a2	2024-01-04 13:46:07 -08:00
Hui Xiao	81b6296c7e	Pass flush IO activity enum in FlushJob::MaybeIncreaseFullHistoryTsLowToAboveCutoffUDT...() (#12197 ) Summary: Context/Summary: as titled Pull Request resolved: https://github.com/facebook/rocksdb/pull/12197 Test Plan: ``` ./db_stress --acquire_snapshot_one_in=100 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --async_io=1 --atomic_flush=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_protection_bytes_per_key=0 --block_size=16384 --bloom_before_level=2147483647 --bloom_bits=4.393039399748979 --bottommost_compression_type=disable --bottommost_file_compaction_delay=86400 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=33554432 --cache_type=fixed_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=1000 --compact_range_one_in=1000 --compaction_pri=3 --compaction_readahead_size=1048576 --compaction_ttl=0 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=0 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_blob_files=0 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=1000 --format_version=6 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=13 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --iterpercent=0 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=10000 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=8388608 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.1 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=100 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_memory=0 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=2 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --persist_user_defined_timestamps=0 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=55 --recycle_log_file_num=0 --reopen=0 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --top_level_index_pinning=3 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_txn=0 --use_write_buffer_manager=0 --user_timestamp_size=8 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=10000 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=zstd --write_buffer_size=1048576 --write_dbid_to_manifest=1 --write_fault_one_in=128 --writepercent=35 ``` Before fix: ``` db_stress_tool/db_stress_env_wrapper.h:92: virtual rocksdb::IOStatus rocksdb::DbStressWritableFileWrapper::Append(const rocksdb::Slice &, const rocksdb::IOOptions &, rocksdb::IODebugContext *): Assertion `io_activity == Env::IOActivity::kUnknown \|\| io_activity == options.io_activity' failed. ``` After fix: Succeed Reviewed By: ajkr Differential Revision: D52492030 Pulled By: hx235 fbshipit-source-id: 842a0dcbdf135838b57ddb4a3a6f1effc8dd3e82	2024-01-02 17:33:00 -08:00
haobo sun	09411e199d	Format async io for Java API (#12192 ) Summary: Format https://github.com/facebook/rocksdb/issues/12184 according to adamretter 's comments. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12192 Reviewed By: cbi42 Differential Revision: D52457427 Pulled By: ajkr fbshipit-source-id: 75b1be5d89687be4e58e618d693a6a120c5efc78	2024-01-02 13:19:08 -08:00
leipeng	d411fc4dd6	column_family.cc: SanitizeOptions(dbo, cfo): WARN msg: add missing spaces (#12193 ) Summary: Fix for multi line strings missing spaces. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12193 Reviewed By: cbi42 Differential Revision: D52457430 Pulled By: ajkr fbshipit-source-id: 4ca75a14e61c09819e5d821da6137f4536e9e76e	2024-01-02 11:18:11 -08:00
leipeng	906c6683ed	InternalKey::Set: remove redundant assign (#12194 ) Summary: InternalKey::Set: remove redundant assign Pull Request resolved: https://github.com/facebook/rocksdb/pull/12194 Reviewed By: cbi42 Differential Revision: D52457542 Pulled By: ajkr fbshipit-source-id: 329983a8734ff38ffd93018bbbe112b4a23b5c11	2024-01-02 11:17:39 -08:00
Hui Xiao	06e593376c	Group SST write in flush, compaction and db open with new stats (#11910 ) Summary: ## Context/Summary Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity. For that, this PR does the following: - Tag different write IOs by passing down and converting WriteOptions to IOOptions - Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH\|COMPACTION\|DB_OPEN}_MICROS Some related code refactory to make implementation cleaner: - Blob stats - Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH\|COMPACTION\|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info. - Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write. - Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority - Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification - Build table - TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables - Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder. This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more - Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority ## Test ### db bench Flush ``` ./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100 rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 ``` compaction, db oopen ``` Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279 rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213 rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66 ``` blob stats - just to make sure they aren't broken by this PR ``` Integrated Blob DB Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600 rocksdb.blobdb.blob.file.synced COUNT : 1 rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same) ``` ``` Stacked Blob DB Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876 rocksdb.blobdb.blob.file.synced COUNT : 8 rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same) ``` ### Rehearsal CI stress test Trigger 3 full runs of all our CI stress tests ### Performance Flush ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark; enable_statistics = true Pre-pr: avg 507515519.3 ns 497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908, Post-pr: avg 511971266.5 ns, regressed 0.88% 502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408, ``` Compaction ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre\|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 495346098.30 ns 492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846 Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97% 502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007 ``` Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats) ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 3848.10 ns 3814,3838,3839,3848,3854,3854,3854,3860,3860,3860 Post-pr: avg 3874.20 ns, regressed 0.68% 3863,3867,3871,3874,3875,3877,3877,3877,3880,3881 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910 Reviewed By: ajkr Differential Revision: D49788060 Pulled By: hx235 fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff	2023-12-29 15:29:23 -08:00
anand76	a036525809	Lightweight verification of MANIFEST file after close on shutdown (#12174 ) Summary: Do a size verification on the MANIFEST file during DB shutdown, after closing the file. If the verification fails, write a new MANIFEST file. In the future, we can do a more thorough verification if we want to. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12174 Test Plan: Unit test, and some manual verification Reviewed By: ajkr Differential Revision: D52451184 Pulled By: anand1976 fbshipit-source-id: fc3bc170e22f6c9a9c482ee5ff592abab889df83	2023-12-28 18:25:29 -08:00
Peter Dillinger	5a1fb5ccd6	Disable GitHub Actions jobs on forks (#12191 ) Summary: See new comment in pr-jobs.yml for context. I tried avoiding the massive copy-paste through some trial and error in https://github.com/facebook/rocksdb/issues/12156, but was unsuccessful. Also upgrading actions/setup-python to v5 to fix a warning. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12191 Test Plan: Here's an example of a bad run from my fork, prior to this change: https://github.com/pdillinger/rocksdb/actions/runs/7303363126 Here's the "skipped" run associated with this change on my fork: https://github.com/pdillinger/rocksdb/actions/runs/7352251207 Here's the actual run associated with this PR: https://github.com/facebook/rocksdb/actions/runs/7352262420 Reviewed By: ajkr Differential Revision: D52451292 Pulled By: pdillinger fbshipit-source-id: 9e0d3db8a40e3257e6f912a5cba72de76f4827fa	2023-12-28 17:23:18 -08:00
hulk	b7ecbe309d	Trigger compaction to the next level if the data age exceeds periodic_compaction_seconds (#12175 ) Summary: Currently, the data are always compacted to the same level if exceed periodic_compaction_seconds which may confuse users, so we change it to allow trigger compaction to the next level here. It's a behavior change to users, and may affect users who have disabled their ttl or ttl > periodic_compaction_seconds. Relate issue: https://github.com/facebook/rocksdb/issues/12165 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12175 Reviewed By: ajkr Differential Revision: D52446722 Pulled By: cbi42 fbshipit-source-id: ccd3d2c6434ed77055735a03408d4a62d119342f	2023-12-28 12:50:08 -08:00
Changyu Bi	3d81f175b4	Prioritize marked file in level compaction (#12187 ) Summary: When ranking file by compaction priority in a level, prioritize files marked for compaction over files that are not marked. This only applies to default CompactPri kMinOverlappingRatio for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12187 Test Plan: * New unit tests Reviewed By: ajkr Differential Revision: D52437194 Pulled By: cbi42 fbshipit-source-id: 65ea9ce5bb421e598d539a55c8219b70844b82b3	2023-12-28 10:28:37 -08:00
darionyaphet	01f2edd145	Replace push_back by emplace_back in wal manager (#10805 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10805 Reviewed By: ajkr Differential Revision: D52424928 Pulled By: hx235 fbshipit-source-id: 548e3304ca721a3907be3696d12735929aca8490	2023-12-27 10:40:33 -08:00
Qiaolin Yu	f799c73d28	Trace analyzer: replace number with enumeration type (#10827 ) Summary: Currently, some numbers in the `tracer_analyzer_tool` may be a little confusing and unfriendly for people who want to add new query types. It may be better to replace them with the existing enumeration type to improve readability. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10827 Reviewed By: ajkr Differential Revision: D40576023 Pulled By: hx235 fbshipit-source-id: 0eb16820a15f365d53e848a3a8efd92928420429	2023-12-27 10:38:53 -08:00
Andrew Kryczka	4fefe1fed9	Downgrade warning for dynamic leveling with non-leveled compaction (#12186 ) Summary: Now that `level_compaction_dynamic_level_bytes`'s default value is true, users who do not touch that setting and use non-leveled compaction will also see this log message. It can be info level rather than warning since, in the case mentioned, there is nothing the user needs to be warned about. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12186 Reviewed By: cbi42 Differential Revision: D52422499 Pulled By: ajkr fbshipit-source-id: 8dbfcd102aab671b881ba047fb4a0a555b3e0a78	2023-12-26 15:13:42 -08:00
haobo sun	2a8b2df383	Add async_io for Java API (#12184 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/12183 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12184 Reviewed By: hx235 Differential Revision: D52421787 Pulled By: ajkr fbshipit-source-id: ad3bdae9be51bef5a208b02ceb08f6feb9fac8e4	2023-12-26 14:33:11 -08:00
Jason Volk	83e38c0a58	Fix SystemClock not passed from environment to PERF_CPU_TIMER_GUARD. (#12180 ) Summary: The hardcoded nullptr argument for SystemClock to PERF_CPU_TIMER_GUARD ignored any SystemClock instance provided by the env; this was probably an oversight. In practice, the defaulting SystemClock could lead to excessive `clock_gettime(CLOCK_THREAD_CPUTIME_ID)` syscalls if `report_bg_io_stats=true` which cannot be mitigated by the embedder. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12180 Reviewed By: hx235 Differential Revision: D52421750 Pulled By: ajkr fbshipit-source-id: 92f8a93cebe9f8030ea5f6c3bf35398078e6bdfe	2023-12-26 14:32:53 -08:00
Nicolas Pepin-Perreault	5b073a7daa	Access SST full file checksum via RocksDB#getLiveFilesMetadata (#11770 ) Summary: Description This PR passes along the native `LiveFileMetaData#file_checksum` field from the C++ class to the Java API as a copied byte array. If there is no file checksum generator factory set beforehand, then the array will empty. Please advise if you'd rather it be null - an empty array means one extra allocation, but it avoids possible null pointer exceptions. > Note > This functionality complements but does not supersede https://github.com/facebook/rocksdb/issues/11736 It's outside the scope here to add support for Java based `FileChecksumGenFactory` implementations. As a workaround, users can already use the built-in one by creating their initial `DBOptions` via properties: ```java final Properties props = new Properties(); props.put("file_checksum_gen_factory", "FileChecksumGenCrc32cFactory"); try (final DBOptions dbOptions = DBOptions.getDBOptionsFromProps(props); final ColumnFamilyOptions cfOptions = new ColumnFamilyOptions(); final Options options = new Options(dbOptions, cfOptions).setCreateIfMissing(true)) { // do stuff } ``` I wanted to add a better test, but unfortunately there's no available CRC32C implementation available in Java 8 without adding a dependency or adding a JNI helper for RocksDB's own implementation (or bumping the minimum version for tests to Java 9). That said, I understand the test is rather poor, so happy to change it to whatever you'd like. Context To give some context, we replicate RocksDB checkpoints to other nodes. Part of this is verifying the integrity of each file during replication. With a large enough RocksDB, computing the checksum ourselves is prohibitively expensive. Since SST files comprise the bulk of the data, we'd much rather delegate this to RocksDB on file write, and read it back after to compare. It's likely we will provide a follow up to read the file checksum list directly from the manifest without having to open the DB, but this was the easiest first step to get it working for us. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11770 Reviewed By: hx235 Differential Revision: D52420729 Pulled By: ajkr fbshipit-source-id: a873de35a48aaf315e125733091cd221a97b9073	2023-12-26 14:02:36 -08:00
Peter Dillinger	a771a47a1b	Fix leak or crash on failure in automatic atomic flush (#12176 ) Summary: Through code inspection in debugging an apparent leak of ColumnFamilyData in the crash test, I found a case where too few UnrefAndTryDelete() could be called on a cfd. This fixes that case, which would fail like this in the new unit test: ``` db_flush_test: db/column_family.cc:1648: rocksdb::ColumnFamilySet::~ColumnFamilySet(): Assertion `last_ref' failed. ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12176 Test Plan: unit test added Reviewed By: cbi42 Differential Revision: D52417071 Pulled By: pdillinger fbshipit-source-id: 4ee33c918409cf9c1968f138e273d3347a6cc8e5	2023-12-26 11:04:25 -08:00
Peter Dillinger	106058c076	Initial CircleCI -> GitHub Actions migration (#12163 ) Summary: * Largely based on https://github.com/facebook/rocksdb/issues/12085 but grouped into one large workflow because of bad GHA UI design (see comments). * Windows job details consolidated into an action file so that those jobs can easily move between per-pr-push and nightly. * Simplify some handling of "CIRCLECI" environment and add "GITHUB_ACTIONS" in the same places * For jobs that we want to go in pr-jobs or nightly there are disabled "candidate" workflows with draft versions of those jobs. * ARM jobs are disabled waiting on full GHA support. * build-linux-java-static needed some special attention to work, due to GLIBC compatibility issues (see comments). Pull Request resolved: https://github.com/facebook/rocksdb/pull/12163 Test Plan: Nightly jobs can be seen passing between these two links: https://github.com/facebook/rocksdb/actions/runs/7266835435/job/19799390061?pr=12163 https://github.com/facebook/rocksdb/actions/runs/7269697823/job/19807724471?pr=12163 And per-PR jobs of course passing on this PR. Reviewed By: hx235 Differential Revision: D52335810 Pulled By: pdillinger fbshipit-source-id: bbb95196f33eabad8cddf3c6b52f4413c80e034d	2023-12-21 15:40:21 -08:00
zaidoon	ad0362ac92	Expose Options::ttl through C API (#12170 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12170 Reviewed By: jaykorean Differential Revision: D52378902 Pulled By: cbi42 fbshipit-source-id: 0bac94b8785d5149df86e7317e69c0e64beab887	2023-12-21 15:04:53 -08:00
Andrew Kryczka	15487b84e4	fix ldb_cmd_test.cc build with nondefault -DROCKSDB_NAMESPACE (#12173 ) Summary: I landed https://github.com/facebook/rocksdb/issues/12159 which had the below compiler error when using `-DROCKSDB_NAMESPACE`, which broke the CircleCI "build-linux-static_lib-alt_namespace-status_checked" job: ``` tools/ldb_cmd_test.cc:1213:21: error: 'rocksdb' does not name a type 1213 \| int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const override { \| ^~~~~~~ tools/ldb_cmd_test.cc:1213:35: error: expected unqualified-id before '&' token 1213 \| int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const override { \| ^ tools/ldb_cmd_test.cc:1213:35: error: expected ')' before '&' token 1213 \| int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const override { \| ~ ^ \| ) tools/ldb_cmd_test.cc:1213:35: error: expected ';' at end of member declaration 1213 \| int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const override { \| ^ \| ; tools/ldb_cmd_test.cc:1213:37: error: 'a' does not name a type 1213 \| int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const override { \| ^ ... ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12173 Test Plan: ``` $ make clean && make OPT="-DROCKSDB_NAMESPACE=alternative_rocksdb_ns" ldb_cmd_test -j56 ``` Reviewed By: pdillinger Differential Revision: D52373797 Pulled By: ajkr fbshipit-source-id: 8597aaae65a5333831fef66d85072827c5fb1187	2023-12-21 12:22:02 -08:00
chuhao zeng	8d50a7c9df	Fix ldbcmd cant use custom comparator (#12159 ) Summary: According to this [Q&A](https://github.com/facebook/rocksdb/wiki/RocksDB-FAQ#:~:text=Q%3A%20If%20I%20use%20non%2Ddefault%20comparators%20or%20merge%20operators%2C%20can%20I%20still%20use%20ldb%20tool%3F), user should be able to use LDB with passing a customized comparator into the option. In the process of opening DB in order to perform ldb commands, there is a exception saying comparator not match even if a option with customized comparator is provided. After initializing the column family to open DB, the `LDBCommand::OverrideBaseCFOptions` method does not update the comparator inside column family descriptor using the passed in options. This can cause a mismatch while doing version edit, and in function `ToggleUDT CompareComparator` it will failed and return a exception saying comparator not match. Propose fix by updating the column family descriptor's option using the user passed in option. Also a test case is provided to illustrate the steps. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12159 Reviewed By: hx235 Differential Revision: D52267367 Pulled By: ajkr fbshipit-source-id: c240f93f440e02cb485893de058a46c6dbf9654b	2023-12-20 18:04:08 -08:00
Adam Retter	d8c1ab8b2d	Add Iterator::Refresh(Snapshot) to RocksJava (#12145 ) Summary: Adds the API to RocksJava. Also improves the C++ doc for Iterator::Refresh(Snapshot) Closes https://github.com/facebook/rocksdb/issues/12095 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12145 Reviewed By: hx235 Differential Revision: D52266452 Pulled By: ajkr fbshipit-source-id: 6b72b41672081b966b0c5dd07d9bf151ed009122	2023-12-20 18:03:42 -08:00
akankshamahajan	7b24dec25d	Fix header files to meet Open source requirements (#12164 ) Summary: Same as title Pull Request resolved: https://github.com/facebook/rocksdb/pull/12164 Reviewed By: hx235 Differential Revision: D52302234 Pulled By: akankshamahajan15 fbshipit-source-id: d4724fc944c773242788f5a47d1c7eadbbc5a522	2023-12-19 13:43:17 -08:00
Radek Hubner	f7486ff6a3	Add deletion-triggered compaction to RocksJava (#12028 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12028 Reviewed By: akankshamahajan15 Differential Revision: D52264983 Pulled By: ajkr fbshipit-source-id: 02d08015b4bffac06d889dc1be50a51d03f891b3	2023-12-18 13:43:01 -08:00
maztheman	66ef68bec8	Update CMakeLists.txt (#12140 ) Summary: check is way too common to use as a target (https://cmake.org/cmake/help/latest/policy/CMP0002.html) Pull Request resolved: https://github.com/facebook/rocksdb/pull/12140 Reviewed By: akankshamahajan15 Differential Revision: D52265318 Pulled By: ajkr fbshipit-source-id: 2d16257dc4620f4dd4e7debc1a420f0681b3b559	2023-12-18 13:17:45 -08:00
Andrew Kryczka	8c568bac61	Sync a source file license from percona/PerconaFT (#12103 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/10478 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12103 Reviewed By: cbi42 Differential Revision: D51623089 Pulled By: ajkr fbshipit-source-id: 81f88262ed247144ae063a0552e0162db90c0e43	2023-12-18 11:53:27 -08:00
Hui Xiao	5b981b64f4	Intensify operations on same key in crash test (#12148 ) Summary: Context/Summary: Continued from https://github.com/facebook/rocksdb/pull/12127, we can randomly reduce the # max key to coerce more operations on the same key. My experimental run shows it surfaced more issue than just https://github.com/facebook/rocksdb/pull/12127. I also randomly reduce the related parameters, write buffer size and target file base, to adapt to randomly lower number of # max key. This creates 4 situations of testing, 3 of which are new: 1. high # max key with high write buffer size and target file base (existing) 2. high # max key with low write buffer size and target file base (new, will go through some rehearsal testing to ensure we don't run out of space with many files) 3. low # max key with high write buffer size and target file base (new, keys will stay in memory longer) 4. low # max key with low write buffer size and target file base (new, experimental runs show it surfaced even more issues) Pull Request resolved: https://github.com/facebook/rocksdb/pull/12148 Test Plan: - [Ongoing] Rehearsal stress test - Monitor production stress test Reviewed By: jaykorean Differential Revision: D52174980 Pulled By: hx235 fbshipit-source-id: bd5e11280826819ca9314c69bbbf05d481c6d105	2023-12-17 10:46:26 -08:00
Levi Tamasi	81765866c4	Update HISTORY/version/format compatibility script for the 8.10 release (#12154 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12154 Reviewed By: jaykorean, akankshamahajan15 Differential Revision: D52216271 Pulled By: ltamasi fbshipit-source-id: 13bab72802eeec8f6e3544be9ebcd7f725a64d2e	2023-12-15 14:44:23 -08:00
anand76	cc069f25b3	Add some compressed and tiered secondary cache stats (#12150 ) Summary: Add statistics for more visibility. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12150 Reviewed By: akankshamahajan15 Differential Revision: D52184633 Pulled By: anand1976 fbshipit-source-id: 9969e05d65223811cd12627102b020bb6d229352	2023-12-15 11:34:08 -08:00
Peter Dillinger	88bc91f3cc	Cap eviction effort (CPU under stress) in HyperClockCache (#12141 ) Summary: HyperClockCache is intended to mitigate performance problems under stress conditions (as well as optimizing average-case parallel performance). In LRUCache, the biggest such problem is lock contention when one or a small number of cache entries becomes particularly hot. Regardless of cache sharding, accesses to any particular cache entry are linearized against a single mutex, which is held while each access updates the LRU list. All HCC variants are fully lock/wait-free for accessing blocks already in the cache, which fully mitigates this contention problem. However, HCC (and CLOCK in general) can exhibit extremely degraded performance under a different stress condition: when no (or almost no) entries in a cache shard are evictable (they are pinned). Unlike LRU which can find any evictable entries immediately (at the cost of more coordination / synchronization on each access), CLOCK has to search for evictable entries. Under the right conditions (almost exclusively MB-scale caches not GB-scale), the CPU cost of each cache miss could fall off a cliff and bog down the whole system. To effectively mitigate this problem (IMHO), I'm introducing a new default behavior and tuning parameter for HCC, `eviction_effort_cap`. See the comments on the new config parameter in the public API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12141 Test Plan: unit test included ## Performance test We can use cache_bench to validate no regression (CPU and memory) in normal operation, and to measure change in behavior when cache is almost entirely pinned. (TODO: I'm not sure why I had to get the pinned ratio parameter well over 1.0 to see truly bad performance, but the behavior is there.) Build with `make DEBUG_LEVEL=0 USE_CLANG=1 PORTABLE=0 cache_bench`. We also set MALLOC_CONF="narenas:1" for all these runs to essentially remove jemalloc variances from the results, so that the max RSS given by /usr/bin/time is essentially ideal (assuming the allocator minimizes fragmentation and other memory overheads well). Base command reproducing bad behavior: ``` ./cache_bench -cache_type=auto_hyper_clock_cache -threads=12 -histograms=0 -pinned_ratio=1.7 ``` ``` Before, LRU (alternate baseline not exhibiting bad behavior): Rough parallel ops/sec = 2290997 1088060 maxresident Before, AutoHCC (bad behavior): Rough parallel ops/sec = 141011 <- Yes, more than 10x slower 1083932 maxresident ``` Now let us sample a range of values in the solution space: ``` After, AutoHCC, eviction_effort_cap = 1: Rough parallel ops/sec = 3212586 2402216 maxresident After, AutoHCC, eviction_effort_cap = 10: Rough parallel ops/sec = 2371639 1248884 maxresident After, AutoHCC, eviction_effort_cap = 30: Rough parallel ops/sec = 1981092 1131596 maxresident After, AutoHCC, eviction_effort_cap = 100: Rough parallel ops/sec = 1446188 1090976 maxresident After, AutoHCC, eviction_effort_cap = 1000: Rough parallel ops/sec = 549568 1084064 maxresident ``` I looks like `cap=30` is a sweet spot balancing acceptable CPU and memory overheads, so is chosen as the default. ``` Change to -pinned_ratio=0.85 Before, LRU: Rough parallel ops/sec = 2108373 1078232 maxresident Before, AutoHCC, averaged over ~20 runs: Rough parallel ops/sec = 2164910 1077312 maxresident After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs: Rough parallel ops/sec = 2145542 1077216 maxresident ``` The slight CPU improvement above is consistent with the cap, with no measurable memory overhead under moderate stress. ``` Change to -pinned_ratio=0.25 (low stress) Before, AutoHCC, averaged over ~20 runs: Rough parallel ops/sec = 2221149 1076540 maxresident After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs: Rough parallel ops/sec = 2224521 1076664 maxresident ``` No measurable difference under normal circumstances. Some tests repeated with FixedHCC, with similar results. Reviewed By: anand1976 Differential Revision: D52174755 Pulled By: pdillinger fbshipit-source-id: d278108031b1220c1fa4c89c5a9d34b7cf4ef1b8	2023-12-14 22:13:32 -08:00
Akanksha Mahajan	cd577f6059	Fix WRITE_STALL start_time (#12147 ) Summary: `Delayed` is set true in two cases. One is when `delay` is specified. Other one is in the `while` loop - `cd21e4e69d/db/db_impl/db_impl_write.cc (L1876)` However start_time is not initialized in second case, resulting in time_delayed = immutable_db_options_.clock->NowMicros() - 0(start_time); Pull Request resolved: https://github.com/facebook/rocksdb/pull/12147 Test Plan: Existing CircleCI Reviewed By: cbi42 Differential Revision: D52173481 Pulled By: akankshamahajan15 fbshipit-source-id: fb9183b24c191d287a1d715346467bee66190f98	2023-12-14 13:45:06 -08:00
Ludovic Henry	5502f06729	Add support for linux-riscv64 (#12139 ) Summary: Following https://github.com/evolvedbinary/docker-rocksjava/pull/2, we can now build rocksdb on riscv64. I've verified this works as expected with `make rocksdbjavastaticdockerriscv64`. Also fixes https://github.com/facebook/rocksdb/issues/10500 https://github.com/facebook/rocksdb/issues/11994 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12139 Reviewed By: jaykorean Differential Revision: D52128098 Pulled By: akankshamahajan15 fbshipit-source-id: 706d36a3f8a9e990b76f426bc450937a0cd1a537	2023-12-14 11:27:17 -08:00
akankshamahajan	e7c6259447	Make auto_readahead_size default true (#12080 ) Summary: Make auto_readahead_size option default true Pull Request resolved: https://github.com/facebook/rocksdb/pull/12080 Test Plan: benchmarks and exisiting tests Reviewed By: anand1976 Differential Revision: D52152132 Pulled By: akankshamahajan15 fbshipit-source-id: f1515563564e77df457dff2e865e4ede8c3ddf44	2023-12-14 11:25:51 -08:00
Levi Tamasi	cd21e4e69d	Some further cleanup in WriteBatchWithIndex::MultiGetFromBatchAndDB (#12143 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12143 https://github.com/facebook/rocksdb/pull/11982 changed `WriteBatchWithIndex::MultiGetFromBatchDB` to preallocate space in the `autovector`s `key_contexts` and `merges` in order to prevent any reallocations, both as an optimization and in order to prevent pointers into the container from being invalidated during subsequent insertions. On second thought, this preallocation can actually be a pessimization in cases when only a small subset of keys require querying the underlying database. To prevent any memory regressions, the PR reverts this preallocation. In addition, it makes some small code hygiene improvements like incorporating the `PinnableWideColumns` object into `MergeTuple`. Reviewed By: jaykorean Differential Revision: D52136513 fbshipit-source-id: 21aa835084433feab27b501d9d1fc5434acea609	2023-12-13 17:34:18 -08:00
Peter Dillinger	c74531b1d2	Fix a nuisance compiler warning from clang (#12144 ) Summary: Example: ``` cache/clock_cache.cc:56:7: error: fallthrough annotation in unreachable code [-Werror,-Wimplicit-fallthrough] FALLTHROUGH_INTENDED; ^ ./port/lang.h:10:30: note: expanded from macro 'FALLTHROUGH_INTENDED' ^ ``` In clang < 14, this is annoyingly generated from -Wimplicit-fallthrough, but was changed to -Wunreachable-code-fallthrough (implied by -Wunreachable-code) in clang 14. See https://reviews.llvm.org/D107933 for how this nuisance pattern generated false positives similar to ours in the Linux kernel. Just to underscore the ridiculousness of this warning, here an error is reported on the annotation, not the call to do_something(), depending on the constexpr value (https://godbolt.org/z/EvxqdPTdr): ``` #include <atomic> void do_something(); void test(int v) { switch (v) { case 1: if constexpr (std::atomic<long>::is_always_lock_free) { return; } else { do_something(); [[fallthrough]]; } case 2: return; } } ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12144 Test Plan: Added the warning to our Makefile for USE_CLANG, which reproduced the warning-as-error as shown above, but is now fixed. Reviewed By: jaykorean Differential Revision: D52139615 Pulled By: pdillinger fbshipit-source-id: ba967ae700c0916d1a478bc465cf917633e337d9	2023-12-13 15:58:46 -08:00
akankshamahajan	d926593df5	Fix stress tests failure for auto_readahead_size (#12131 ) Summary: When auto_readahead_size is enabled, Prev operation calls SeekForPrev in db_iter so that - BlockBasedTableIterator can point index_iter_ to the right block. - disable readahead_cache_lookup. However, there can be cases where SeekForPrev might not go through Version_set and call BlockBasedTableIterator SeekForPrev. In that case, when BlockBasedTableIterator::Prev is called, it returns NotSupported error. This more like a corner case. So to handle that case, removed SeekForPrev calling from db_iter and reseeking index_iter_ in Prev operation. block_iter_'s key already point to right block. So reseeking to index_iter_ solves the issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12131 Test Plan: - Tested on db_stress command that was failing - `./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=0 --atomic_flush=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --best_efforts_recovery=1 --block_protection_bytes_per_key=1 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=12 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=33554432 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=10 --compressed_secondary_cache_size=16777216 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/home/akankshamahajan/rocksdb_auto_tune/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=134217728 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --expected_values_dir=/home/akankshamahajan/rocksdb_auto_tune/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --flush_one_in=1000000 --format_version=6 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --long_running_snapshots=1 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_max_range_deletions=1000 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=1000000 --periodic_compaction_seconds=10 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=50 --recycle_log_file_num=0 --reopen=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=10000 --skip_verifydb=1 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=3 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=1 --use_merge=1 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=10 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=0 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=4194304 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=35` - make crash_test -j32 Reviewed By: anand1976 Differential Revision: D51986326 Pulled By: akankshamahajan15 fbshipit-source-id: 90e11e63d1f1894770b457a44d8b213ae5512df9	2023-12-13 12:15:04 -08:00
Andrew Kryczka	d8e47620d7	Speedup based on pending compaction bytes relative to data size (#12130 ) Summary: RocksDB self throttles per-DB compaction parallelism until it detects compaction pressure. The pressure detection based on pending compaction bytes was only comparing against the slowdown trigger (`soft_pending_compaction_bytes_limit`). Online services tend to set that extremely high to avoid stalling at all costs. Perhaps they should have set it to zero, but we never documented that zero disables stalling so I have been telling everyone to increase it for years. This PR adds pressure detection based on pending compaction bytes relative to the size of bottommost data. The size of bottommost data should be fairly stable and proportional to the logical data size Pull Request resolved: https://github.com/facebook/rocksdb/pull/12130 Reviewed By: hx235 Differential Revision: D52000746 Pulled By: ajkr fbshipit-source-id: 7e1fd170901a74c2d4a69266285e3edf6e7631c7	2023-12-13 10:37:27 -08:00
anand76	ebb5242d55	Sanitize the secondary_cache option in TieredCacheOptions (#12137 ) Summary: Sanitize the `secondary_cache` field in the `cache_opts` option of `TieredCacheOptions` to `nullptr` if set by the user. The nvm secondary cache should be directly set in `TieredCacheOptions`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12137 Reviewed By: akankshamahajan15 Differential Revision: D52063817 Pulled By: anand1976 fbshipit-source-id: 255116c665a9b908c8f44109a2d331d4b73e7591	2023-12-12 10:58:00 -08:00
Yu Zhang	c2ab4e754b	Add initial support to stress test persist_user_defined_timestamps (#12124 ) Summary: This PR adds initial stress testing for the user-defined timestamps in memtable only feature. Each flavor of the `*_ts` crash test get a 1 in 3 chance to run with timestamps not persisted, this setting is initialized once and kept consistent across the following re-runs. This initial stress test included these things besides disabling incompatible feature combinations to make the test run more stably: 1) It currently only run test methods that validates db state with expected state. Not the ones that validate db state by comparing result from one API to another API. Such as `TestMultiGet` (compared with `Get`), similarly `TestMultiGetEntity`, `TestIterate` (compare src iterator to a control iterator). Due to timestamps being removed, results from one API to another API is not directly comparable as it is now. More test logic to handle that need to be added, will do that in a follow up. 2) Even when comparing db state to expected state, sometimes the db can receive `InvalidArgument` too due to timestamps getting flushed and removed. Added some logic to handle that. 3) When timestamps are not persisted, we don't try to read with older timestamp. Since that's making it easier to get `InvalidArgument`. And this capability is not yet needed by our customer so it's disabled for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12124 Test Plan: running multiple flavor of this test on continuous run for sometime before checkin Reviewed By: ltamasi Differential Revision: D51916267 Pulled By: jowlyzhang fbshipit-source-id: 3f3eb5f9618d05d296062820e0ef5cb8edc7c2b2	2023-12-12 09:35:29 -08:00
anand76	c1b84d0437	Fix false negative in TieredSecondaryCache nvm cache lookup (#12134 ) Summary: There is a bug in the `TieredSecondaryCache` that can result in a false negative. This can happen when a MultiGet does a cache lookup that gets a hit in the `TieredSecondaryCache` local nvm cache tier, and the result is available before MultiGet calls `WaitAll` (i.e the nvm cache `SecondaryCacheResultHandle` `IsReady` returns true). Pull Request resolved: https://github.com/facebook/rocksdb/pull/12134 Test Plan: Add a new unit test in tiered_secondary_cache_test Reviewed By: akankshamahajan15 Differential Revision: D52023309 Pulled By: anand1976 fbshipit-source-id: e5ae681226a0f12753fecb2f6acc7e5f254ae72b	2023-12-11 16:59:59 -08:00
Peter Dillinger	c96d9a0fbb	Allow TablePropertiesCollectorFactory to return null collector (#12129 ) Summary: As part of building another feature, I wanted this: * Custom implementations of `TablePropertiesCollectorFactory` may now return a `nullptr` collector to decline processing a file, reducing callback overheads in such cases. * Polished, clarified some related API comments. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12129 Test Plan: unit test added Reviewed By: ltamasi Differential Revision: D51966667 Pulled By: pdillinger fbshipit-source-id: 2991c08fe6ce3a8c9f14c68f1495f5a17bca2770	2023-12-11 12:02:56 -08:00
Radek Hubner	5c5e018943	Fix JNI lazy load regression. (#12133 ) Summary: A small regression that conflicted with PR https://github.com/facebook/rocksdb/pull/12133 was later merged in commit `2296c624fa (diff-26d3ab8a3d764183d0ea3aea834fe481eec2347c334b918ebd7bdce4bcabcc19R35)` This PR addresses that regression. Closes https://github.com/facebook/rocksdb/issues/12132 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12133 Reviewed By: jowlyzhang Differential Revision: D52041736 Pulled By: ltamasi fbshipit-source-id: 33db57035154c833ae00b5d921b17b3be80c8dd7	2023-12-11 11:21:52 -08:00
Alan Paxton	5a063ecd34	Java API consistency between RocksDB.put() , .merge() and Transaction.put() , .merge() (#11019 ) Summary: ### Implement new Java API get()/put()/merge() methods, and transactional variants. The Java API methods are very inconsistent in terms of how they pass parameters (byte[], ByteBuffer), and what variants and defaulted parameters they support. We try to bring some consistency to this. * All APIs should support calls with ByteBuffer parameters. * Similar methods (RocksDB.get() vs Transaction.get()) should support as similar as possible sets of parameters for predictability. * get()-like methods should provide variants where the caller supplies the target buffer, for the sake of efficiency. Allocation costs in Java can be significant when large buffers are repeatedly allocated and freed. ### API Additions 1. RockDB.get implement indirect ByteBuffers. Added indirect ByteBuffers and supporting native methods for get(). 2. RocksDB.Iterator implement missing (byte[], offset, length) variants for key() and value() parameters. 3. Transaction.get() implement missing methods, based on RocksDB.get. Added ByteBuffer.get with and without column family. Added byte[]-as-target get. 4. Transaction.iterator() implement a getIterator() which defaults ReadOptions; as per RocksDB.iterator(). Rationalize support API for this and RocksDB.iterator() 5. RocksDB.merge implement ByteBuffer methods; both direct and indirect buffers. Shadow the methods of RocksDB.put; RocksDB.put only offers ByteBuffer API with explicit WriteOptions. Duplicated this with RocksDB.merge 6. Transaction.merge implement methods as per RocksDB.merge methods. Transaction is already constructed with WriteOptions, so no explicit WriteOptions methods required. 7. Transaction.mergeUntracked implement the same API methods as Transaction.merge except the ones that use assumeTracked, because that’s not a feature of merge untracked. ### Support Changes (C++) The current JNI code in C++ supports multiple variants of methods through a number of helper functions. There are numerous TODO suggestions in the code proposing that the helpers be re-factored/shared. We have taken a different approach for the new methods; we have created wrapper classes `JDirectBufferSlice`, `JDirectBufferPinnableSlice`, `JByteArraySlice` and `JByteArrayPinnableSlice` RAII classes which construct slices from JNI parameters and can then be passed directly to RocksDB methods. For instance, the `Java_org_rocksdb_Transaction_getDirect` method is implemented like this: ``` try { ROCKSDB_NAMESPACE::JDirectBufferSlice key(env, jkey_bb, jkey_off, jkey_part_len); ROCKSDB_NAMESPACE::JDirectBufferPinnableSlice value(env, jval_bb, jval_off, jval_part_len); ROCKSDB_NAMESPACE::KVException::ThrowOnError( env, txn->Get(*read_options, column_family_handle, key.slice(), &value.pinnable_slice())); return value.Fetch(); } catch (const ROCKSDB_NAMESPACE::KVException& e) { return e.Code(); } ``` Notice the try/catch mechanism with the `KVException` class, which combined with RAII and the wrapper classes means that there is no ad-hoc cleanup necessary in the JNI methods. We propose to extend this mechanism to existing JNI methods as further work. ### Support Changes (Java) Where there are multiple parameter-variant versions of the same method, we use fewer or just one supporting native method for all of them. This makes maintenance a bit easier and reduces the opportunity for coding errors mixing up (untyped) object handles. In order to support this efficiently, some classes need to have default values for column families and read options added and cached so that they are not re-constructed on every method call. This PR closes https://github.com/facebook/rocksdb/issues/9776 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11019 Reviewed By: ajkr Differential Revision: D52039446 Pulled By: jowlyzhang fbshipit-source-id: 45d0140a4887e42134d2e56520e9b8efbd349660	2023-12-11 11:03:17 -08:00

1 2 3 4 5 ...

12364 Commits All Branches Search

12364 Commits

All Branches