rocksdb

Commit Graph

Author	SHA1	Message	Date
Levi Tamasi	55ac0b729e	Support building db_bench using buck (#13004 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13004 The patch extends the buckifier script so it generates a target for `db_bench` as well. Reviewed By: cbi42 Differential Revision: D62407071 fbshipit-source-id: 0cb98a324ce0598ad84a8675aa77b7d0f91bf40c	2024-09-10 11:37:29 -07:00
anand76	f48b64460e	Provide a way to invoke a callback for a Cache handle (#12987 ) Summary: Add the `ApplyToHandle` method to the `Cache` interface to allow a caller to request the invocation of a callback on the given cache handle. The goal here is to allow a cache that manages multiple cache instances to use a callback on a handle to determine which instance it belongs to. For example, the callback can hash the key and use that to pick the correct target instance. This is useful to redirect methods like `Ref` and `Release`, which don't know the cache key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12987 Reviewed By: pdillinger Differential Revision: D62151907 Pulled By: anand1976 fbshipit-source-id: e4ffbbb96eac9061d2ab0e7e1739eea5ebb1cd58	2024-09-06 14:54:09 -07:00
Yu Zhang	0c6e9c036a	Make compaction always use the input version with extra ref protection (#12992 ) Summary: `Compaction` is already creating its own ref for the input Version: `4b1d595306/db/compaction/compaction.cc (L73)` And properly Unref it during destruction: `4b1d595306/db/compaction/compaction.cc (L450)` This PR redirects compaction's access of `cfd->current()` to this input `Version`, to prepare for when a column family's data can be replaced all together, and `cfd->current()` is not safe to access for a compaction job. Because a new `Version` with just some other external files could be installed as `cfd->current()`. The compaction job's expectation of the current `Version` and the corresponding storage info to always have its input files will no longer be guaranteed. My next follow up is to do a similar thing for flush, also to prepare it for when a column family's data can be replaced. I will make it create its own reference of the current `MemTableListVersion` and use it as input, all flush job's access of memtables will be wired to that input `MemTableListVersion`. Similarly this reference will be unreffed during a flush job's destruction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12992 Test Plan: Existing tests Reviewed By: pdillinger Differential Revision: D62212625 Pulled By: jowlyzhang fbshipit-source-id: 9a781213469cf366857a128d50a702af683a046a	2024-09-06 14:07:33 -07:00
Yu Zhang	a24574e80a	Add documentation for background job's state transition (#12994 ) Summary: The `SchedulePending` API is a bit confusing since it doesn't immediately schedule the work and can be confused with the actual scheduling. So I have changed these to be `EnqueuePending` and added some documentation for the corresponding state transitions of these background work. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12994 Test Plan: existing tests Reviewed By: cbi42 Differential Revision: D62252746 Pulled By: jowlyzhang fbshipit-source-id: ee68be6ed33070cad9a5004b7b3e16f5bcb041bf	2024-09-06 13:08:34 -07:00
Changyu Bi	0bea5a2cfe	Disable WAL fault injection in some case (#13000 ) Summary: when manual_wal_flush is true and when there are more than 1 CF, WAL fault injection can cause CFs to be inconsistent. See more explanation and repro in T199157789. Disable the combination for now until we have a fix that allows auto recovery. This also helps to see if there's other cause of stress test failures. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13000 Test Plan: the following command could repro db consistency failure in a few runs before this PR. From stress test output we can also see that exclude_wal_from_write_fault_injection and metadata_write_fault_one_in are sanitized to 0. ``` python3 ./tools/db_crashtest.py blackbox --interval=60 --metadata_write_fault_one_in=1000 --column_families=10 --exclude_wal_from_write_fault_injection=0 --manual_wal_flush_one_in=1000 --WAL_size_limit_MB=10240 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=1 --allow_data_in_errors=True --allow_fallocate=1 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=100 --block_align=1 --block_protection_bytes_per_key=0 --block_size=16384 --bloom_before_level=2147483647 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_readahead_size=1048576 --compaction_ttl=0 --compress_format_version=1 --compressed_secondary_cache_size=8388608 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=4 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0 --db_write_buffer_size=0 --decouple_partitioned_filters=1 --default_temperature=kCold --default_write_temperature=kWarm --delete_obsolete_files_period_micros=30000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=1000000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=1 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=0 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=1 --error_recovery_with_no_fault_injection=1 --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --fill_cache=1 --flush_one_in=1000000 --format_version=6 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --index_block_restart_interval=4 --index_shortening=1 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=10000 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=0 --lowest_used_cache_tier=2 --manifest_preallocation_size=5120 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=16 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=2097152 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=0 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=100 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --optimize_filters_for_hits=0 --optimize_filters_for_memory=0 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --paranoid_memory_checks=0 --partition_filters=0 --partition_pinning=2 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=524288 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=10000 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=10000 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=1048576 --sqfc_name=bar --sqfc_version=1 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=2 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=6 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --uncache_aggressiveness=8 --universal_max_read_amp=-1 --unpartitioned_pinning=2 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=0 --use_attribute_group=1 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_cf_iterator=1 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=1 --use_sqfc_for_range_queries=0 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=1 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=4194304 --write_dbid_to_manifest=0 --write_fault_one_in=50 --writepercent=35 --ops_per_thread=100000 --preserve_unverified_changes=1 ``` Reviewed By: hx235 Differential Revision: D62303631 Pulled By: cbi42 fbshipit-source-id: d9441188ee84d53e5e7916f3305e50843fe9fde2	2024-09-06 11:05:49 -07:00
Peter Dillinger	4577b672d5	More valgrind fixes (#12990 ) Summary: * https://github.com/facebook/rocksdb/issues/12936 was insufficient to fix the std::optional false positives. Making a fix validated in CI this time (see https://github.com/facebook/rocksdb/issues/12991) * valgrind grinds to a halt on startup on my dev machine apparently because it expects internet access. Disable its attempts to access the internet when git is using a proxy. * Move PORTABLE=1 from CI job to the Makefile. Without it, valgrind complains about illegal instructions (too new) Pull Request resolved: https://github.com/facebook/rocksdb/pull/12990 Test Plan: manual, watch nightly valgrind job Reviewed By: ltamasi Differential Revision: D62203242 Pulled By: pdillinger fbshipit-source-id: a611b08da7dbd173b0709ed7feb0578729553a17	2024-09-06 10:11:34 -07:00
Peter Dillinger	0d3aaf7c0f	Ensure SSTs compressed in tiered_secondary_cache_test (#12993 ) Summary: It appears the arm testsuite is failing because it is building without snappy, which is causing the SST files not to be compressed, which somehow causes these tests to fail. Manually setting LZ4 which is already required. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12993 Test Plan: reproduced and verified fix on ARM laptop Reviewed By: anand1976 Differential Revision: D62216451 Pulled By: pdillinger fbshipit-source-id: 3f21fcd9be0edaa66c7eca0cb7d56b998171e263	2024-09-05 10:36:29 -07:00
Peter Dillinger	4b1d595306	Fix and clarify ignore_unknown_options (#12989 ) Summary: `ignore_unknown_options=true` had an undocumented behavior of having no effect (disallow unknown options) if reading from the same or older major.minor version. Presumably this was intended to catch unintentional addition of new options in a patch release, but there is no automated version compatibility testing between patch releases. So this was a bad choice without such testing support, because it just means users would hit the failure in case of adding features to a patch release. In this diff we respect ignore_unknown_options when reading a file from any newer version, even patch versions, and document this behavior in the API. I don't think it's practical or necessary to test among patch releases in check_format_compatible.sh. This seems like an exceptional case of applying a different semantics to patch version updates than to minor/major versions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12989 Test Plan: unit test updated (and refactored) Reviewed By: jaykorean Differential Revision: D62168738 Pulled By: pdillinger fbshipit-source-id: fb3c3ef30f0bbad0d5ffcc4570fb9ef963e7daac	2024-09-04 11:42:04 -07:00
Jay Huh	064c0ad53d	Fix the check format compatible change (#12988 ) Summary: `check_format_compatible` script was broken due to extra comma added in `5b8f5cbcf4` e.g. https://github.com/facebook/rocksdb/actions/runs/10505042711/job/29101787220 ``` ... 2024-08-23T11:44:15.0175202Z == Building 9.5.fb, debug 2024-08-23T11:44:15.0190592Z fatal: ambiguous argument '_tmp_origin/9.5.fb,': unknown revision or path not in the working tree. ... ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12988 Test Plan: ``` tools/check_format_compatible.sh ``` ``` == Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 8.6.fb... == Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/8.6.fb to /tmp/rocksdb_format_compatible_jewoongh/db/8.6.fb/db_dump.txt == Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt == Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 8.7.fb... == Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/8.7.fb to /tmp/rocksdb_format_compatible_jewoongh/db/8.7.fb/db_dump.txt == Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt == Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 8.8.fb... == Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/8.8.fb to /tmp/rocksdb_format_compatible_jewoongh/db/8.8.fb/db_dump.txt == Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt == Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 8.9.fb... == Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/8.9.fb to /tmp/rocksdb_format_compatible_jewoongh/db/8.9.fb/db_dump.txt == Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt == Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 8.10.fb... == Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/8.10.fb to /tmp/rocksdb_format_compatible_jewoongh/db/8.10.fb/db_dump.txt == Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt == Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 8.11.fb... == Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/8.11.fb to /tmp/rocksdb_format_compatible_jewoongh/db/8.11.fb/db_dump.txt == Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt == Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 9.0.fb... == Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/9.0.fb to /tmp/rocksdb_format_compatible_jewoongh/db/9.0.fb/db_dump.txt == Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt == Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 9.1.fb... == Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/9.1.fb to /tmp/rocksdb_format_compatible_jewoongh/db/9.1.fb/db_dump.txt == Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt == Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 9.2.fb... == Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/9.2.fb to /tmp/rocksdb_format_compatible_jewoongh/db/9.2.fb/db_dump.txt == Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt == Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 9.3.fb... == Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/9.3.fb to /tmp/rocksdb_format_compatible_jewoongh/db/9.3.fb/db_dump.txt == Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt == Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 9.4.fb... == Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/9.4.fb to /tmp/rocksdb_format_compatible_jewoongh/db/9.4.fb/db_dump.txt == Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt == Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 9.5.fb... == Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/9.5.fb to /tmp/rocksdb_format_compatible_jewoongh/db/9.5.fb/db_dump.txt == Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt == Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 9.6.fb... == Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/9.6.fb to /tmp/rocksdb_format_compatible_jewoongh/db/9.6.fb/db_dump.txt == Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt ==== Compatibility Test PASSED ==== ``` Reviewed By: pdillinger Differential Revision: D62162454 Pulled By: jaykorean fbshipit-source-id: 562225c6cb27e0eb66f241a6f9424dc624d8c837	2024-09-03 22:17:31 -07:00
Symious	c989c51ed7	Fix non-ASCII character (#12972 ) Summary: Met the following error while compiling the project. ``` build_tools/check-sources.sh utilities/fault_injection_fs.cc:509: // If there<E2><80><99>s no injected error, then cb will be called asynchronously when utilities/fault_injection_fs.cc:510: // target_ actually finishes the read. But if there<E2><80><99>s an injected error, it utilities/fault_injection_fs.cc:512: // isn<E2><80><99>t invoked at all. ^^^^ Use only ASCII characters in source files make[1]: * [Makefile:1291: check-sources] Error 1 make[1]: Leaving directory '/home/janus/Github/symious/rocksdb' make: * [Makefile:1084: check] Error 2 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12972 Reviewed By: hx235 Differential Revision: D61923865 Pulled By: cbi42 fbshipit-source-id: 63af0a38fea15e09a860895bdd5ed0a57700e447	2024-09-03 14:41:55 -07:00
Changyu Bi	cd6f802ccb	Add a new file ingestion option `link_files` (#12980 ) Summary: Add option `IngestExternalFileOptions::link_files` that hard links input files and preserves original file links after ingestion, unlike `move_files` which will unlink input files after ingestion. This can be useful when being used together with `allow_db_generated_files` to ingest files from another DB. Also reverted the change to `move_files` in https://github.com/facebook/rocksdb/issues/12959 to simplify the contract so that it will always unlink input files without exception. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12980 Test Plan: updated unit test `ExternSSTFileLinkFailFallbackTest.LinkFailFallBackExternalSst` to test that input files will not be unlinked. Reviewed By: pdillinger Differential Revision: D61925111 Pulled By: cbi42 fbshipit-source-id: eadaca72e1ae5288bdd195d57158466e5656fa62	2024-09-03 13:06:25 -07:00
Changyu Bi	92ad4a88f3	Small CPU optimization in InlineSkipList::Insert() (#12975 ) Summary: reuse decode key in more places to avoid decoding length prefixed key x->Key(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/12975 Test Plan: ran benchmarks simultaneously for "before" and "after" * fillseq: ``` (for I in $(seq 1 50); do ./db_bench --benchmarks=fillseq --disable_auto_compactions=1 --min_write_buffer_number_to_merge=100 --max_write_buffer_number=1000 --write_buffer_size=268435456 --num=5000000 --seed=1723056275 --disable_wal=1 2>&1 \| grep "fillseq" done;) \| awk '{ t += $5; c++; print } END { printf ("%9.3f\n", 1.0 * t / c) }'; before: 1483191 after: 1490555 (+0.5%) ``` * fillrandom: ``` (for I in $(seq 1 2); do ./db_bench_imain --benchmarks=fillrandom --disable_auto_compactions=1 --min_write_buffer_number_to_merge=100 --max_write_buffer_number=1000 --write_buffer_size=268435456 --num=2500000 --seed=1723056275 --disable_wal=1 2>&1 \| grep "fillrandom" before: 255463 after: 256128 (+0.26%) ``` Reviewed By: anand1976 Differential Revision: D61835340 Pulled By: cbi42 fbshipit-source-id: 70345510720e348bacd51269acb5d2dd5a62bf0a	2024-08-27 13:57:40 -07:00
anand76	f31b4d80ff	Retain previous trace file in db_stress for debugging purposes (#12978 ) Summary: There are several crash test failures due to DB verification failure. Retain some trace history in the expected state directory to make debugging easier. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12978 Reviewed By: cbi42 Differential Revision: D61864921 Pulled By: anand1976 fbshipit-source-id: 9f3f37b7e1e958bc89a3cf0373182354c2c1aa3b	2024-08-27 12:43:47 -07:00
Jay Huh	0082907bf2	Scope down workflow permissions (#12973 ) Summary: Followed instruction per https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions#defining-access-for-the-github_token-scopes It turns out that we did not need any of these except `Metadata: read`. Before ``` GITHUB_TOKEN Permissions Actions: write Attestations: write Checks: write Contents: write Deployments: write Discussions: write Issues: write Metadata: read Packages: write Pages: write PullRequests: write RepositoryProjects: write SecurityEvents: write Statuses: write ``` After ``` GITHUB_TOKEN Permissions Metadata: read ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12973 Test Plan: GitHub Actions triggered by this PR Reviewed By: cbi42 Differential Revision: D61812651 Pulled By: jaykorean fbshipit-source-id: 4413756c93f503e8b2fb77eb8b684ef9e6a6c13d	2024-08-26 15:28:17 -07:00
Peter Dillinger	d96e67c2bf	Fix flaky test DBTest2.VariousFileTemperatures (#12974 ) Summary: ... apparently due to potentially not purging obsolete files after CompactRange Example: https://github.com/facebook/rocksdb/actions/runs/10564621261/job/29267393711?pr=12959 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12974 Test Plan: reproduced failure with USE_CLANG=1 COERCE_CONTEXT_SWITCH=1, now fixed Reviewed By: cbi42 Differential Revision: D61812600 Pulled By: pdillinger fbshipit-source-id: d4b23e1a179bb8ec39875ed7a8ce1649fa3344bd	2024-08-26 14:08:21 -07:00
Changyu Bi	4eb5878ab2	Support ingesting db generated files using hard link (#12959 ) Summary: so `IngestExternalFileOptions::move_files` and `IngestExternalFileOptions::allow_db_generated_files` are now compatible. The original file links won't be removed if `allow_db_generated_files` is true. This is to prevent deleting files from another DB. There was a [comment](https://github.com/facebook/rocksdb/pull/12750#discussion_r1684509620) in https://github.com/facebook/rocksdb/issues/12750 about how exactly-once ingestion would work with `move_files`. I've discussed with customer and decided that it can be done by reading the target DB to see if it contains any ingested key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12959 Test Plan: updated unit tests `IngestDBGeneratedFileTest*` to enable `move_files`. Reviewed By: jowlyzhang Differential Revision: D61703480 Pulled By: cbi42 fbshipit-source-id: 6b4294369767f989a2f36bbace4ca3c0257aeaf7	2024-08-26 12:25:16 -07:00
Peter Dillinger	8ac14f525e	Add Temperature to FileAttributes (#12965 ) Summary: .. so that appropriate implementations can return temperature information from GetChildrenFileAttributes Pull Request resolved: https://github.com/facebook/rocksdb/pull/12965 Test Plan: just an API placeholder for now Reviewed By: anand1976 Differential Revision: D61748199 Pulled By: pdillinger fbshipit-source-id: b457e324cb451e836611a0bf630c3da0f30a8abf	2024-08-23 20:06:41 -07:00
Peter Dillinger	96340dbce2	Options for file temperature for more files (#12957 ) Summary: We have a request to use the cold tier as primary source of truth for the DB, and to best support such use cases and to complement the existing options controlling SST file temperatures, we add two new DB options: * `metadata_write_temperature` for DB "small" files that don't contain much user data * `wal_write_temperature` for WALs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12957 Test Plan: Unit test included, though it's hard to be sure we've covered all the places Reviewed By: jowlyzhang Differential Revision: D61664815 Pulled By: pdillinger fbshipit-source-id: 8e19c9dd8fd2db059bb15f74938d6bc12002e82b	2024-08-23 19:49:25 -07:00
Jay Huh	d6aed64de4	Disable benchmark-linux (#12964 ) Summary: Disabling the job temporarily. We will re-enable this when ready again Pull Request resolved: https://github.com/facebook/rocksdb/pull/12964 Reviewed By: ltamasi Differential Revision: D61740941 Pulled By: jaykorean fbshipit-source-id: 167e50c4f5e38d508a8e56633261611467f30690	2024-08-23 18:39:16 -07:00
eniac1024	f5c5f881d2	Fix MultiGet with timestamps (#12943 ) Summary: Issue: MultiGet(PinnableSlice) can't read out all timestamps. Fixed the impl, and added an UT as well. In the original impl, if MultiGet reads multiple column families, a later column family would clean up timestamps of previous column family. Fix: https://github.com/facebook/rocksdb/issues/12950#issue-2476996580 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12943 Reviewed By: anand1976 Differential Revision: D61729257 Pulled By: pdillinger fbshipit-source-id: 55267c26076c8a59acedd27e14714711729a40df	2024-08-23 13:58:04 -07:00
Changyu Bi	c62de54c7c	Record largest seqno in table properties and verify in file ingestion (#12951 ) Summary: this helps to avoid scanning input files when ingesting db generated files: `ecb844babd/db/external_sst_file_ingestion_job.cc (L917-L935)` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12951 Test Plan: * `IngestDBGeneratedFileTest.FailureCase` is updated to verify that this table property is verified during ingestion * existing unit tests for other ingestion use cases. Reviewed By: jowlyzhang Differential Revision: D61608285 Pulled By: cbi42 fbshipit-source-id: b5b7aae9741531349ab247be6ffaa3f3628b76ca	2024-08-21 16:24:18 -07:00
Jay Huh	4df71db246	Fix build for macos-arm64-macosx-clang17-no-san (#12949 ) Summary: When merged into internal code base we see the following error. This should fix it. ``` Actions failed: [2024-08-20T07:45:53.879-07:00] Action failed: fbcode//rocksdb/src:rocksdb_lib (cfg:macos-arm64-macosx-clang17-no-san#e5847010950663ca) (cxx_compile util/write_batch_util.cc) [2024-08-20T07:45:53.879-07:00] Remote command returned non-zero exit code 1 [2024-08-20T07:45:53.879-07:00] Remote action, reproduce with: `frecli cas download-action 2fe3749f2d3ea6107cce103d4e2be1dcc76a9df797bae308cde5eaccc65201b7:145` fbcode/rocksdb/src/include/rocksdb/write_batch.h:460:14: error: no template named 'unordered_map' in namespace 'std'; did you mean 'unordered_set'? const std::unordered_map<uint32_t, size_t>& GetColumnFamilyToTimestampSize() { ~~~~~^~~~~~~~~~~~~ fbcode/rocksdb/src/include/rocksdb/write_batch.h:540:8: error: no template named 'unordered_map' in namespace 'std'; did you mean 'unordered_set'? std::unordered_map<uint32_t, size_t> cf_id_to_ts_sz_; ~~~~~^~~~~~~~~~~~~ /paragon/pods/259551525/home/execution/3/202ac945754041b6bc424b0c35e42c9d/work/buck-out/v2/gen/fbsource/a90614bbe22ec1d7/xplat/toolchains/minimal_xcode/__clang_genrule__/out/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/__memory/compressed_pair.h:113:3: error: static_assert failed due to requirement '!is_same<unsigned long, unsigned long>::value' "__compressed_pair cannot be instantiated when T1 and T2 are the same type; The current implementation is NOT ABI-compatible with the previous implementation for this configuration" static_assert((!is_same<_T1, _T2>::value), ^ ~~~~~~~~~~~~~~~~~~~~~~~~~ ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12949 Test Plan: CI Reviewed By: jowlyzhang, cbi42 Differential Revision: D61577604 Pulled By: jaykorean fbshipit-source-id: 3584a2cd550a303346d80ccc5cc90f4a9b3e2da2	2024-08-21 10:27:50 -07:00
Yu Zhang	945f60b157	Add some documentation for version edit handlers (#12948 ) Summary: As titled. No functional change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12948 Reviewed By: hx235 Differential Revision: D61551254 Pulled By: jowlyzhang fbshipit-source-id: ccf53d78bd2f18f174d7e61972e5de467c96ce76	2024-08-21 10:09:10 -07:00
Radek Hubner	ecb844babd	Enable Continuous Benchmarking of RocksDB (#12885 ) Summary: This pull request transitions the benchmarking process from CircleCI to GitHub Actions. The benchmarking jobs will now be executed on a self-hosted runner. Unlike the previous CircleCI configuration, where jobs were queued due to the long execution time (nearly 60 minutes per job), the new setup schedules the benchmarking tasks to run every two hours. Closes https://github.com/facebook/rocksdb/issues/12615 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12885 Reviewed By: pdillinger Differential Revision: D61422468 Pulled By: jaykorean fbshipit-source-id: 10535865c849797825f9652e4e9ef367b3d73599	2024-08-20 11:47:52 -07:00
Yu Zhang	81d52bdc1a	Fix UDT in memtable only assertions (#12946 ) Summary: Empty memtables can be legitimately created and flushed, for example by error recovery flush attempts: `273b3eadf0/db/db_impl/db_impl_compaction_flush.cc (L2309-L2312)` This check is updated to be considerate of this. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12946 Reviewed By: hx235 Differential Revision: D61492477 Pulled By: jowlyzhang fbshipit-source-id: 7d16fcaea457948546072f85b3650fd1cc24f9db	2024-08-20 09:19:52 -07:00
Jay Huh	d223d34bf3	Fix HISTORY.md file for 9.6 release (#12947 ) Summary: # Summary Mistakenly double-updated the HISTORY.md file by running `unreleased_history/release.sh` after the first commit in https://github.com/facebook/rocksdb/issues/12945. Manually fixing the file to reflect the correct content Pull Request resolved: https://github.com/facebook/rocksdb/pull/12947 Test Plan: N/A. History file change. Reviewed By: cbi42 Differential Revision: D61512756 Pulled By: jaykorean fbshipit-source-id: 50dc7e92a945fa80c7dfd01cc89243fd5eaf0548	2024-08-19 22:39:17 -07:00
Jay Huh	5b8f5cbcf4	Update main branch for 9.6 release (#12945 ) Summary: Main branch cut at `defd97bc9`. Updated HISTORY.md, version and format compatibility test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12945 Test Plan: CI Reviewed By: jowlyzhang Differential Revision: D61482149 Pulled By: jaykorean fbshipit-source-id: 4edf7c0a8c6e4df8fcc938bc778dfd02981d0c55	2024-08-19 17:36:23 -07:00
Changyu Bi	defd97bc9d	Add an option to verify memtable key order during reads (#12889 ) Summary: add a new CF option `paranoid_memory_checks` that allows additional data integrity validations during read/scan. Currently, skiplist-based memtable will validate the order of keys visited. Further data validation can be added in different layers. The option will be opt-in due to performance overhead. The motivation for this feature is for services where data correctness is critical and want to detect in-memory corruption earlier. For a corrupted memtable key, this feature can help to detect it during during reads instead of during flush with existing protections (OutputValidator that verifies key order or per kv checksum). See internally linked task for more context. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12889 Test Plan: * new unit test added for paranoid_memory_checks=true. * existing unit test for paranoid_memory_checks=false. * enable in stress test. Performance Benchmark: we check for performance regression in read path where data is in memtable only. For each benchmark, the script was run at the same time for main and this PR: * Memtable-only randomread ops/sec: ``` (for I in $(seq 1 50);do ./db_bench --benchmarks=fillseq,readrandom --write_buffer_size=268435456 --writes=250000 --num=250000 --reads=500000 --seed=1723056275 2>&1 \| grep "readrandom"; done;) \| awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; Main: 608146 PR with paranoid_memory_checks=false: 607727 (- %0.07) PR with paranoid_memory_checks=true: 521889 (-%14.2) ``` * Memtable-only sequential scan ops/sec: ``` (for I in $(seq 1 50); do ./db_bench--benchmarks=fillseq,readseq[-X10] --write_buffer_size=268435456 --num=1000000 --seed=1723056275 2>1 \| grep "\[AVG 10 runs\]"; done;) \| awk '{ t += $6; c++; print; } END { printf "%.0f\n", 1.0 * t / c }'; Main: 9180077 PR with paranoid_memory_checks=false: 9536241 (+%3.8) PR with paranoid_memory_checks=true: 7653934 (-%16.6) ``` * Memtable-only reverse scan ops/sec: ``` (for I in $(seq 1 20); do ./db_bench --benchmarks=fillseq,readreverse[-X10] --write_buffer_size=268435456 --num=1000000 --seed=1723056275 2>1 \| grep "\[AVG 10 runs\]"; done;) \| awk '{ t += $6; c++; print; } END { printf "%.0f\n", 1.0 * t / c }'; Main: 1285719 PR with integrity_checks=false: 1431626 (+%11.3) PR with integrity_checks=true: 811031 (-%36.9) ``` The `readrandom` benchmark shows no regression. The scanning benchmarks show improvement that I can't explain. Reviewed By: pdillinger Differential Revision: D60414267 Pulled By: cbi42 fbshipit-source-id: a70b0cbeea131f1a249a5f78f9dc3a62dacfaa91	2024-08-19 13:53:25 -07:00
Jay Huh	273b3eadf0	Add Remote Compaction Installation Callback Function (#12940 ) Summary: Add an optional callback function upon remote compaction temp output installation. This will be internally used for setting the final status in the Offload Infra. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12940 Test Plan: Unit Test added ``` ./compaction_service_test ``` _Also internally tested by manually merging into internal code base_ Reviewed By: anand1976 Differential Revision: D61419157 Pulled By: jaykorean fbshipit-source-id: 66831685bc403949c26bfc65840dd1900d2a5a67	2024-08-19 11:22:43 -07:00
Yu Zhang	295326b6ee	Best efforts recovery recover seqno prefix (#12938 ) Summary: This PR make best efforts recovery more permissive by allowing it to recover incomplete Version that presents a valid point in time view from the user's perspective. Currently, a Version is only valid and saved if all files consisting that Version can be found. With this change, if only a suffix of L0 files (and their associated blob files) are missing, a valid Version is also available to be saved and recover to. Note that we don't do this if the column family was atomically flushed. Because atomic flush also need a consistent view across the column families, we cannot guarantee that if we are recovering to incomplete version. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12938 Test Plan: Existing tests and added unit tests. Reviewed By: anand1976 Differential Revision: D61414381 Pulled By: jowlyzhang fbshipit-source-id: f9b73deb34d35ad696ab42315928b656d586262a	2024-08-16 17:18:54 -07:00
Peter Dillinger	4d3518951a	Option to decouple index and filter partitions (#12939 ) Summary: Partitioned metadata blocks were introduced back in 2017 to deal more gracefully with large DBs where RAM is relatively scarce and some data might be much colder than other data. The feature allows metadata blocks to compete for memory in the block cache against data blocks while alleviating tail latencies and thrash conditions that can arise with large metadata blocks (sometimes megabytes each) that can arise with large SST files. In general, the cost to partitioned metadata is more CPU in accesses (especially for filters where more binary search is needed before hashing can be used) and a bit more memory fragmentation and related overheads. However the feature has always had a subtle limitation with a subtle effect on performance: index partitions and filter partitions must be cut at the same time, regardless of which wins the space race (hahaha) to metadata_block_size. Commonly filters will be a few times larger than indexes, so index partitions will be under-sized compared to filter (and data) blocks. While this does affect fragmentation and related overheads a bit, I suspect the bigger impact on performance is in the block cache. The coupling of the partition cuts would be defensible if the binary search done to find the filter block was used (on filter hit) to short-circuit binary search to an index partition, but that optimization has not been developed. Consider two metadata blocks, an under-sized one and a normal-sized one, covering proportional sections of the key space with the same density of read queries. The under-sized one will be more prone to eviction from block cache because it is used less often. This is unfair because of its despite its proportionally smaller cost of keeping in block cache, and most of the cost of a miss to re-load it (random IO) is not proportional to the size (similar latency etc. up to ~32KB). ## This change Adds a new table option decouple_partitioned_filters allows filter blocks and index blocks to be cut independently. To make this work, the partitioned filter block builder needs to know about the previous key, to generate an appropriate separator for the partition index. In most cases, BlockBasedTableBuilder already has easy access to the previous key to provide to the filter block builder. This change includes refactoring to pass that previous key to the filter builder when available, with the filter building caching the previous key itself when unavailable, such as during compression dictionary training and some unit tests. Access to the previous key eliminates the need to track the previous prefix, which results in a small SST construction CPU win in prefix filtering cases, regardless of coupling, and possibly a small regression for some non-prefix cases, regardless of coupling, but still overall improvement especially with https://github.com/facebook/rocksdb/issues/12931. Suggested follow-up: * Update confusing use of "last key" to refer to "previous key" * Expand unit test coverage with parallel compression and dictionary training * Consider an option or enhancement to alleviate under-sized metadata blocks "at the end" of an SST file due to no coordination or awareness of when files are cut. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12939 Test Plan: unit tests updated. Also did some unit test runs with "hard wired" usage of parallel compression and dictionary training code paths to ensure they were working. Also ran blackbox_crash_test for a while with the new feature. ## SST write performance (CPU) Using the same testing setup as in https://github.com/facebook/rocksdb/issues/12931 but with -decouple_partitioned_filters=1 in the "after" configuration, which benchmarking shows makes almost no difference in terms of SST write CPU. "After" vs. "before" this PR ``` -partition_index_and_filters=0 -prefix_size=0 -whole_key_filtering=1 923691 vs. 924851 (-0.13%) -partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=0 921398 vs. 922973 (-0.17%) -partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=1 902259 vs. 908756 (-0.71%) -partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0 917932 vs. 916901 (+0.60%) -partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0 912755 vs. 907298 (+0.60%) -partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=1 899754 vs. 892433 (+0.82%) ``` I think this is a pretty good trade, especially in attracting more movement toward partitioned configurations. ## Read performance Let's see how decoupling affects read performance across various degrees of memory constraint. To simplify LSM structure, we're using FIFO compaction. Since decoupling will overall increase metadata block size, we control for this somewhat with an extra "before" configuration with larger metadata block size setting (8k instead of 4k). Basic setup: ``` (for CS in 0300 1200; do TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillrandom,flush,readrandom,block_cache_entry_stats -num=5000000 -duration=30 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=10 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters=1 -statistics=1 -cache_size=${CS}000000 -metadata_block_size=4096 -decouple_partitioned_filters=1 2>&1 \| tee results-$CS; done) ``` And read ops/s results: ```CSV Cache size MB,After/decoupled/4k,Before/4k,Before/8k 3,15593,15158,12826 6,16295,16693,14134 10,20427,20813,18459 20,27035,26836,27384 30,33250,31810,33846 60,35518,32585,35329 100,36612,31805,35292 300,35780,31492,35481 1000,34145,31551,35411 1100,35219,31380,34302 1200,35060,31037,34322 ``` If you graph this with log scale on the X axis (internal link: https://pxl.cl/5qKRc), you see that the decoupled/4k configuration is essentially the best of both the before/4k and before/8k configurations: handles really tight memory closer to the old 4k configuration and handles generous memory closer to the old 8k configuration. Reviewed By: jowlyzhang Differential Revision: D61376772 Pulled By: pdillinger fbshipit-source-id: fc2af2aee44290e2d9620f79651a30640799e01f	2024-08-16 15:34:31 -07:00
Hui Xiao	75a1230ce8	Fix improper ExpectedValute::Exists() usages and disable compaction during VerifyDB() in crash test (#12933 ) Summary: Context: Adding assertion `!PendingPut()&&!PendingDelete()` in `ExpectedValute::Exists()` surfaced a couple improper usages of `ExpectedValute::Exists()` in the crash test - Commit phase of `ExpectedValue::Delete()`/`SyncDelete()`: When we issue delete to expected value during commit phase or `SyncDelete()` (used in crash recovery verification) as below, we don't really care about the result. `d458331ee9/db_stress_tool/expected_state.cc (L73)` `d458331ee9/db_stress_tool/expected_value.cc (L52)` That means, we don't really need to check for `Exists()` `d458331ee9/db_stress_tool/expected_value.cc (L24-L26)`. This actually gives an alternative solution to `b65e29a4a9` to solve false-positive assertion violation. - TestMultiGetXX() path: `Exists()` is called without holding the lock as required `f63428bcc7/db_stress_tool/no_batched_ops_stress.cc (L2688)` ``` void MaybeAddKeyToTxnForRYW( ThreadState* thread, int column_family, int64_t key, Transaction* txn, std::unordered_map<std::string, ExpectedValue>& ryw_expected_values) { assert(thread); assert(txn); SharedState* const shared = thread->shared; assert(shared); if (!shared->AllowsOverwrite(key) && shared->Exists(column_family, key)) { // Just do read your write checks for keys that allow overwrites. return; } // With a 1 in 10 probability, insert the just added key in the batch // into the transaction. This will create an overlap with the MultiGet // keys and exercise some corner cases in the code if (thread->rand.OneIn(10)) { ``` `f63428bcc7/db_stress_tool/expected_state.h (L74-L76)` The assertion also failed if db stress compaction filter was invoked before crash recovery verification (`VerifyDB()`->`VerifyOrSyncValue()`) finishes. `f63428bcc7/db_stress_tool/db_stress_compaction_filter.h (L53)` It failed because it can encounter a key with pending state when checking for `Exists()` since that key's expected state has not been sync-ed with db state in `VerifyOrSyncValue()`. `f63428bcc7/db_stress_tool/no_batched_ops_stress.cc (L2579-L2591)` Summary: This PR fixes above issues by - not checking `Exists()` in commit phase/`SyncDelete()` - using the concurrent version of key existence check like in other read - conditionally temporarily disabling compaction till after crash recovery verification succeeds() And add back the assertion `!PendingPut()&&!PendingDelete()` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12933 Test Plan: Rehearsal CI Reviewed By: cbi42 Differential Revision: D61214889 Pulled By: hx235 fbshipit-source-id: ef25ba896e64330ddf330182314981516880c3e4	2024-08-15 12:32:59 -07:00
Peter Dillinger	21da4ba4aa	Attempt fix valgrind FP on std::optional (#12936 ) Summary: ``` [ RUN ] BlockBasedTableReaderTest/BlockBasedTableReaderTest.MultiGet/347 ==49577== Thread 4: ==49577== Conditional jump or move depends on uninitialised value(s) ==49577== at 0x518AF93: operator!=<long unsigned int, long unsigned int> (optional:1115) ==49577== by 0x518AF93: rocksdb::(anonymous namespace)::XXPH3FilterBitsBuilder::AddKeyAndAlt(rocksdb::Slice const&, rocksdb::Slice const&) (filter_policy.cc:100) ==49577== by 0x5192722: Add (full_filter_block.cc:37) ==49577== by 0x5192722: rocksdb::FullFilterBlockBuilder::Add(rocksdb::Slice const&) (full_filter_block.cc:33) ==49577== by 0x5125DDB: rocksdb::BlockBasedTableBuilder::BGWorkWriteMaybeCompressedBlock() (block_based_table_builder.cc:1473) ==49577== by 0x570C6B3: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.29) ==49577== by 0x5617608: start_thread (pthread_create.c:477) ==49577== by 0x5988132: clone (clone.S:95) ==49577== ``` Seems to be explained by ASM that valgrind doesn't like. https://stackoverflow.com/q/51616179 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12936 Test Plan: Wasn't able to reproduce locally Reviewed By: hx235 Differential Revision: D61338401 Pulled By: pdillinger fbshipit-source-id: b5b10f7f5c6a8c9eb088c00e5699046100167cb7	2024-08-15 10:55:29 -07:00
Jay Huh	b99baa5ae7	Do not add unprep_seqs when WriteImpl() fails in unprepared txn (#12927 ) Summary: With the following scenario, we see assertion failure in write_unprepared_txn 1. The write is the first one in the transaction (`log_number_` is not set yet - https://github.com/facebook/rocksdb/blob/main/utilities/transactions/write_unprepared_txn.cc#L376-L379) 2. `WriteToWAL()` fails inside `db_impl_->WriteImpl()` due to fault injection. `last_log_number_`is 0. https://github.com/facebook/rocksdb/blob/main/utilities/transactions/write_unprepared_txn.cc#L386 3. `prepare_batch_cnt_` is still added to `unprep_seqs_` https://github.com/facebook/rocksdb/blob/main/utilities/transactions/write_unprepared_txn.cc#L395-L398 4. When the transaction gets destructed after failed, it expects `log_number_ > 0` if `unprep_seqs_` is not empty - https://github.com/facebook/rocksdb/blob/main/utilities/transactions/write_unprepared_txn.cc#L54-L55. Step 3 is wrong. `unprep_seqs_` is for the ordered list of unprep sequence numbers that we have already written to the DB. If `db_impl_->WriteImpl()` failed, `unprep_seqs_` shouldn't be set. `prepare_seq` value would be wrong anyway. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12927 Test Plan: Following command steadily repros the issue. This no longer fails after the fix ``` ./db_stress \ --WAL_size_limit_MB=0 \ --WAL_ttl_seconds=60 \ --acquire_snapshot_one_in=10000 \ --adaptive_readahead=1 \ --adm_policy=2 \ --advise_random_on_open=0 \ --allow_data_in_errors=True \ --allow_fallocate=1 \ --async_io=1 \ --auto_readahead_size=0 \ --avoid_flush_during_recovery=0 \ --avoid_flush_during_shutdown=1 \ --avoid_unnecessary_blocking_io=0 \ --backup_max_size=104857600 \ --backup_one_in=100000 \ --batch_protection_bytes_per_key=8 \ --bgerror_resume_retry_interval=100 \ --block_align=1 \ --block_protection_bytes_per_key=8 \ --block_size=16384 \ --bloom_before_level=7 \ --bloom_bits=4.1280979878467345 \ --bottommost_compression_type=none \ --bottommost_file_compaction_delay=600 \ --bytes_per_sync=262144 \ --cache_index_and_filter_blocks=1 \ --cache_index_and_filter_blocks_with_high_priority=0 \ --cache_size=33554432 \ --cache_type=lru_cache \ --charge_compression_dictionary_building_buffer=1 \ --charge_file_metadata=1 \ --charge_filter_construction=1 \ --charge_table_reader=1 \ --check_multiget_consistency=0 \ --check_multiget_entity_consistency=1 \ --checkpoint_one_in=0 \ --checksum_type=kXXH3 \ --clear_column_family_one_in=0 \ --compact_files_one_in=1000000 \ --compact_range_one_in=1000000 \ --compaction_pri=4 \ --compaction_readahead_size=1048576 \ --compaction_ttl=0 \ --compress_format_version=1 \ --compressed_secondary_cache_ratio=0.0 \ --compressed_secondary_cache_size=0 \ --compression_checksum=1 \ --compression_max_dict_buffer_bytes=4294967295 \ --compression_max_dict_bytes=16384 \ --compression_parallel_threads=8 \ --compression_type=none \ --compression_use_zstd_dict_trainer=1 \ --compression_zstd_max_train_bytes=0 \ --continuous_verification_interval=0 \ --create_timestamped_snapshot_one_in=0 \ --daily_offpeak_time_utc=04:00-08:00 \ --data_block_index_type=0 \ --db=$db \ --db_write_buffer_size=8388608 \ --default_temperature=kHot \ --default_write_temperature=kUnknown \ --delete_obsolete_files_period_micros=21600000000 \ --delpercent=5 \ --delrangepercent=0 \ --destroy_db_initially=0 \ --detect_filter_construct_corruption=0 \ --disable_file_deletions_one_in=10000 \ --disable_manual_compaction_one_in=10000 \ --disable_wal=0 \ --dump_malloc_stats=1 \ --enable_checksum_handoff=1 \ --enable_compaction_filter=0 \ --enable_custom_split_merge=1 \ --enable_do_not_compress_roles=0 \ --enable_index_compression=1 \ --enable_memtable_insert_with_hint_prefix_extractor=0 \ --enable_pipelined_write=0 \ --enable_sst_partitioner_factory=0 \ --enable_thread_tracking=1 \ --enable_write_thread_adaptive_yield=1 \ --error_recovery_with_no_fault_injection=1 \ --exclude_wal_from_write_fault_injection=0 \ --expected_values_dir=$exp \ --fail_if_options_file_error=0 \ --fifo_allow_compaction=1 \ --file_checksum_impl=crc32c \ --fill_cache=1 \ --flush_one_in=1000 \ --format_version=2 \ --get_all_column_family_metadata_one_in=1000000 \ --get_current_wal_file_one_in=0 \ --get_live_files_apis_one_in=1000000 \ --get_properties_of_all_tables_one_in=100000 \ --get_property_one_in=1000000 \ --get_sorted_wal_files_one_in=0 \ --hard_pending_compaction_bytes_limit=274877906944 \ --high_pri_pool_ratio=0 \ --index_block_restart_interval=15 \ --index_shortening=0 \ --index_type=2 \ --ingest_external_file_one_in=0 \ --initial_auto_readahead_size=524288 \ --inplace_update_support=0 \ --iterpercent=10 \ --key_len_percent_dist=1,30,69 \ --key_may_exist_one_in=100000 \ --kill_random_test=888887 \ --last_level_temperature=kCold \ --level_compaction_dynamic_level_bytes=1 \ --lock_wal_one_in=1000000 \ --log2_keys_per_lock=10 \ --log_file_time_to_roll=0 \ --log_readahead_size=0 \ --long_running_snapshots=1 \ --low_pri_pool_ratio=0 \ --lowest_used_cache_tier=1 \ --manifest_preallocation_size=5120 \ --manual_wal_flush_one_in=0 \ --mark_for_compaction_one_file_in=10 \ --max_auto_readahead_size=0 \ --max_background_compactions=20 \ --max_bytes_for_level_base=10485760 \ --max_key=100000 \ --max_key_len=3 \ --max_log_file_size=1048576 \ --max_manifest_file_size=1073741824 \ --max_sequential_skip_in_iterations=2 \ --max_total_wal_size=0 \ --max_write_batch_group_size_bytes=16 \ --max_write_buffer_number=10 \ --max_write_buffer_size_to_maintain=0 \ --memtable_insert_hint_per_batch=0 \ --memtable_max_range_deletions=1000 \ --memtable_prefix_bloom_size_ratio=0.1 \ --memtable_protection_bytes_per_key=8 \ --memtable_whole_key_filtering=1 \ --memtablerep=skip_list \ --metadata_charge_policy=0 \ --metadata_read_fault_one_in=32 \ --metadata_write_fault_one_in=0 \ --min_write_buffer_number_to_merge=2 \ --mmap_read=1 \ --mock_direct_io=False \ --nooverwritepercent=1 \ --num_file_reads_for_auto_readahead=2 \ --open_files=-1 \ --open_metadata_read_fault_one_in=0 \ --open_metadata_write_fault_one_in=0 \ --open_read_fault_one_in=0 \ --open_write_fault_one_in=0 \ --ops_per_thread=20000000 \ --optimize_filters_for_hits=1 \ --optimize_filters_for_memory=0 \ --optimize_multiget_for_io=0 \ --paranoid_file_checks=1 \ --partition_filters=0 \ --partition_pinning=0 \ --pause_background_one_in=1000000 \ --periodic_compaction_seconds=1 \ --prefix_size=5 \ --prefixpercent=5 \ --prepopulate_block_cache=0 \ --preserve_internal_time_seconds=60 \ --progress_reports=0 \ --promote_l0_one_in=0 \ --read_amp_bytes_per_bit=32 \ --read_fault_one_in=1000 \ --readahead_size=0 \ --readpercent=45 \ --recycle_log_file_num=1 \ --reopen=20 \ --report_bg_io_stats=1 \ --reset_stats_one_in=10000 \ --sample_for_compression=0 \ --secondary_cache_fault_one_in=0 \ --sync=0 \ --sync_fault_injection=0 \ --table_cache_numshardbits=6 \ --target_file_size_base=524288 \ --target_file_size_multiplier=2 \ --test_batches_snapshots=0 \ --top_level_index_pinning=0 \ --txn_write_policy=2 \ --uncache_aggressiveness=126 \ --universal_max_read_amp=4 \ --unordered_write=0 \ --unpartitioned_pinning=1 \ --use_adaptive_mutex=0 \ --use_adaptive_mutex_lru=1 \ --use_attribute_group=1 \ --use_delta_encoding=1 \ --use_direct_io_for_flush_and_compaction=0 \ --use_direct_reads=0 \ --use_full_merge_v1=0 \ --use_get_entity=0 \ --use_merge=1 \ --use_multi_cf_iterator=0 \ --use_multi_get_entity=0 \ --use_multiget=1 \ --use_optimistic_txn=0 \ --use_put_entity_one_in=0 \ --use_timed_put_one_in=0 \ --use_txn=1 \ --use_write_buffer_manager=0 \ --user_timestamp_size=0 \ --value_size_mult=32 \ --verification_only=0 \ --verify_checksum=1 \ --verify_checksum_one_in=1000000 \ --verify_compression=0 \ --verify_db_one_in=10000 \ --verify_file_checksums_one_in=1000000 \ --verify_iterator_with_expected_state_one_in=5 \ --verify_sst_unique_id_in_manifest=1 \ --wal_bytes_per_sync=0 \ --wal_compression=none \ --write_buffer_size=4194304 \ --write_dbid_to_manifest=1 \ --write_fault_one_in=128 \ --writepercent=35 ``` Reviewed By: cbi42 Differential Revision: D61048774 Pulled By: jaykorean fbshipit-source-id: 22200d55fd0b22b68732b12516e681a6c6e2c601	2024-08-15 09:16:29 -07:00
Peter Dillinger	f63428bcc7	Optimize, simplify filter block building (fix regression) (#12931 ) Summary: This is in part a refactoring / simplification to set up for "decoupled" partitioned filters and in part to fix an intentional regression for a correctness fix in https://github.com/facebook/rocksdb/issues/12872. Basically, we are taking out some complexity of the filter block builders, and pushing part of it (simultaneous de-duplication of prefixes and whole keys) into the filter bits builders, where it is more efficient by operating on hashes (rather than copied keys). Previously, the FullFilterBlockBuilder had a somewhat fragile and confusing set of conditions under which it would keep a copy of the most recent prefix and most recent whole key, along with some other state that is essentially redundant. Now we just track (always) the previous prefix in the PartitionedFilterBlockBuilder, to deal with the boundary prefix Seek filtering problem. (Btw, the next PR will optimize this away since BlockBasedTableReader already tracks the previous key.) And to deal with the problem of de-duplicating both whole keys and prefixes going into a single filter, we add a new function to FilterBitsBuilder that has that extra de-duplication capabilty, which is relatively efficient because we only have to cache an extra 64-bit hash, not a copied key or prefix. (The API of this new function is somewhat awkward to avoid a small CPU regression in some cases.) Also previously, there was awkward logic split between FullFilterBlockBuilder and PartitionedFilterBlockBuilder to deal with some things specific to partitioning. And confusing names like Add vs. AddKey. FullFilterBlockBuilder is much cleaner and simplified now. The splitting of PartitionedFilterBlockBuilder::MaybeCutAFilterBlock into DecideCutAFilterBlock and CutAFilterBlock is to address what would have been a slight performance regression in some cases. The split allows for more intruction-level parallelism by reducing unnecessary control dependencies. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12931 Test Plan: existing tests (with some minor updates) Also manually ported over the pre-broken regression test described in https://github.com/facebook/rocksdb/issues/12870 and ran it (passed). Performance: Here we validate that an entire series of recent related PRs are a net improvement in aggregate. "Before" is with these PRs reverted: https://github.com/facebook/rocksdb/issues/12872 #12911 https://github.com/facebook/rocksdb/issues/12874 #12867 https://github.com/facebook/rocksdb/issues/12903 #12904. "After" includes this PR (and all of those, with base revision `16c21af`). Simultaneous test script designed to maximally depend on SST construction efficiency: ``` for PF in 0 1; do for PS in 0 8; do for WK in 0 1; do [ "$PS" == "$WK" ] \|\| (for I in `seq 1 20`; do TEST_TMPDIR=/dev/shm/rocksdb2 ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -memtablerep=vector -allow_concurrent_memtable_write=0 -bloom_bits=10 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK 2>&1 \| grep micros/op; done) \| awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; echo "Was -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK"; done; done; done) \| tee results ``` Showing average ops/sec of "after" vs. "before" ``` -partition_index_and_filters=0 -prefix_size=0 -whole_key_filtering=1 935586 vs. 928176 (+0.79%) -partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=0 930171 vs. 926801 (+0.36%) -partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=1 910727 vs. 894397 (+1.8%) -partition_index_and_filters=1 -prefix_size=0 -whole_key_filtering=1 929795 vs. 922007 (+0.84%) -partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0 921924 vs. 917285 (+0.51%) -partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=1 903393 vs. 887340 (+1.8%) ``` As one would predict, the most improvement is seen in cases where we have optimized away copying the whole key. Reviewed By: jowlyzhang Differential Revision: D61138271 Pulled By: pdillinger fbshipit-source-id: 427cef0b1465017b45d0a507bfa7720fa20af043	2024-08-14 15:13:16 -07:00
Yu Zhang	d458331ee9	Move file tracking in VersionEditHandlerPointInTime to VersionBuilder (#12928 ) Summary: `VersionEditHandlerPointInTime` is tracking found files, missing files, intermediate files in order to decide to build a `Version` on negative edge trigger (transition from valid to invalid) without applying the current `VersionEdit`. However, applying `VersionEdit` and check completeness of a `Version` are specialization of `VersionBuilder`. More importantly, when we augment best efforts recovery to recover not just complete point in time Version but also a prefix of seqno for a point in time Version, such checks need to be duplicated in `VersionEditHandlerPointInTime` and `VersionBuilder`. To avoid this, this refactor move all the file tracking functionality in `VersionEditHandlerPointInTime` into `VersionBuilder`. To continue to let `VersionEditHandlerPIT` do the edge trigger check and build a `Version` before applying the current `VersionEdit`, a suite of APIs to supporting creating a save point and its associated functions are added in `VersionBuilder` to achieve this. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12928 Test Plan: Existing tests Reviewed By: anand1976 Differential Revision: D61171320 Pulled By: jowlyzhang fbshipit-source-id: 604f66f8b1e3a3e13da59d8ba357c74e8a366dbc	2024-08-12 21:09:37 -07:00
anand76	c21fe1a47f	Add ticker stats for read corruption retries (#12923 ) Summary: Add a couple of ticker stats for corruption retry count and successful retries. This PR also eliminates an extra read attempt when there's a checksum mismatch in a block read from the prefetch buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12923 Test Plan: Update existing tests Reviewed By: jowlyzhang Differential Revision: D61024687 Pulled By: anand1976 fbshipit-source-id: 3a08403580ab244000e0d480b7ee0f5a03d76b06	2024-08-12 15:32:07 -07:00
Hui Xiao	b65e29a4a9	Loosen a strong assertion in ExpectedValue::Exists() (#12932 ) Summary: Context/Summary: .... since it won't work in the PrepareDelete() path Pull Request resolved: https://github.com/facebook/rocksdb/pull/12932 Test Plan: CI Reviewed By: cbi42 Differential Revision: D61155155 Pulled By: hx235 fbshipit-source-id: 99b0784f6c903d70c7b3b88b53ae8e2c885de96f	2024-08-12 15:21:27 -07:00
SGZW	6727f0f58a	fix compaction_picker_test asan heap use after free (#12908 ) Summary: ![image](https://github.com/user-attachments/assets/3290fe18-aca2-4691-b072-fbbc96a15fb1) this testcase set syncpoint function which reference this test case heap variable "enable_per_key_placement_" and this sync point function will be triggered by another testcase, so asan will report asan heap use after free error Pull Request resolved: https://github.com/facebook/rocksdb/pull/12908 Reviewed By: hx235 Differential Revision: D60973363 Pulled By: cbi42 fbshipit-source-id: df4f488f51e7741784d5a92fc0a5fc538c5d5b1a	2024-08-09 15:06:37 -07:00
SGZW	5c456c4c08	fix compaction speedup for marked files ut (#12912 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12912 Reviewed By: hx235 Differential Revision: D60973460 Pulled By: cbi42 fbshipit-source-id: ebaa343757f09f7281884a512ebe3a7d6845c8b3	2024-08-09 15:05:02 -07:00
Hui Xiao	112bf15dca	Fix false-positive TestBackupRestore corruption (#12917 ) Summary: Context: https://github.com/facebook/rocksdb/pull/12838 allows a write thread encountered certain injected error to release the lock and sleep before retrying write in order to reduce performance cost. This requires adding checks like [this](`b26b395e0a/db_stress_tool/expected_value.cc (L29-L31)`) to prevent writing to the same key from another thread. The added check causes a false-positive failure when delete range + file ingestion + backup is used. Consider the following scenario: (1) Issue a delete range covering some key that do not exist and a key does exist (named as k1). k1 will have "pending delete" state while the keys that does not exit will have whatever state they already have since we don't delete a key that does not exist already. (2) After https://github.com/facebook/rocksdb/pull/12838, `PrepareDeleteRange(... &prepared)` will return `prepared = false`. So below logic will be executed and k1's "pending delete" won't get roll-backed nor committed. ``` std::vector<PendingExpectedValue> pending_expected_values = shared->PrepareDeleteRange(rand_column_family, rand_key, rand_key + FLAGS_range_deletion_width, &prepared); if (!prepared) { for (PendingExpectedValue& pending_expected_value : pending_expected_values) { pending_expected_value.PermitUnclosedPendingState(); } return s; } ``` (3) Issue an file ingestion covering k1 and another key k2. Similar to (2), we will have `shared->PreparePut(column_family, key, &prepared)` return `prepared = false` for k1 while k2 will have a "pending put" state. So below logic will be executed and k2's "pending put" state won't get roll-backed nor committed. ``` for (int64_t key = key_base; s.ok() && key < shared->GetMaxKey() && static_cast<int32_t>(keys.size()) < FLAGS_ingest_external_file_width; ++key) PendingExpectedValue pending_expected_value = shared->PreparePut(column_family, key, &prepared); if (!prepared) { pending_expected_value.PermitUnclosedPendingState(); for (PendingExpectedValue& pev : pending_expected_values) { pev.PermitUnclosedPendingState(); } return; } } ``` (4) Issue a backup and verify on k2. Below logic decides that k2 should exist in restored DB since it has a pending write state while k2 is never ingested into the original DB as (3) returns early. ``` bool Exists() const { return PendingPut() \|\| !IsDeleted(); } TestBackupRestore() { ... Status get_status = restored_db->Get( read_opts, restored_cf_handles[rand_column_families[i]], key, &restored_value); bool exists = thread->shared->Exists(rand_column_families[i], rand_keys[0]); if (get_status.ok()) { if (!exists && from_latest && ShouldAcquireMutexOnKey()) { std::ostringstream oss; oss << "0x" << key.ToString(true) << " exists in restore but not in original db"; s = Status::Corruption(oss.str()); } } else if (get_status.IsNotFound()) { if (exists && from_latest && ShouldAcquireMutexOnKey()) { std::ostringstream oss; oss << "0x" << key.ToString(true) << " exists in original db but not in restore"; s = Status::Corruption(oss.str()); } } ... } ``` So we see false-positive corruption like `Failure in a backup/restore operation with: Corruption: 0x000000000000017B0000000000000073787878 exists in original db but not in restore` A simple fix is to remove `PendingPut()` from `bool Exists() ` since it's called under a lock and should never see a pending write. However, in order for "under a lock and should never see a pending write" to be true, we need to remove the logic of releasing the lock during sleep in the write thread, which expose pending write to other thread that can call Exists() like back up thread. The downside of holding lock during sleep is blocking other write thread of the same key to proceed cuz they need to wait for the lock. This should happen rarely as the key of a thread is selected randomly in crash test like below. ``` void StressTest::OperateDb(ThreadState* thread) { for (uint64_t i = 0; i < ops_per_open; i++) { ... int64_t rand_key = GenerateOneKey(thread, i); ... } } ``` Summary: - Removed the "lock release" part and related checks - Printed recovery time if the write thread waited more than 10 seconds - Reverted regression in testing coverage when deleting a non-existent key Pull Request resolved: https://github.com/facebook/rocksdb/pull/12917 Test Plan: Below command repro-ed frequently before the fix and not after. ``` ./db_stress --WAL_size_limit_MB=1 --WAL_ttl_seconds=60 --acquire_snapshot_one_in=0 --adaptive_readahead=0 --adm_policy=1 --advise_random_on_open=1 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=0 --allow_setting_blob_options_dynamically=1 --async_io=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=100 --blob_cache_size=8388608 --blob_compaction_readahead_size=1048576 --blob_compression_type=none --blob_file_size=1073741824 --blob_file_starting_level=1 --blob_garbage_collection_age_cutoff=0.0 --blob_garbage_collection_force_threshold=0.75 --block_align=0 --block_protection_bytes_per_key=8 --block_size=16384 --bloom_before_level=2147483647 --bloom_bits=16.216959977115277 --bottommost_compression_type=xpress --bottommost_file_compaction_delay=600 --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=1 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=1000000 --checksum_type=kXXH3 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000 --compact_range_one_in=0 --compaction_pri=3 --compaction_readahead_size=0 --compaction_ttl=10 --compress_format_version=2 --compressed_secondary_cache_size=8388608 --compression_checksum=0 --compression_max_dict_buffer_bytes=2097151 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc=04:00-08:00 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=0 --default_temperature=kUnknown --default_write_temperature=kWarm --delete_obsolete_files_period_micros=21600000000 --delpercent=0 --delrangepercent=5 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=0 --enable_blob_files=0 --enable_blob_garbage_collection=1 --enable_checksum_handoff=1 --enable_compaction_filter=1 --enable_custom_split_merge=1 --enable_do_not_compress_roles=0 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=1 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=1 --exclude_wal_from_write_fault_injection=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=big --fill_cache=1 --flush_one_in=1000000 --format_version=2 --get_all_column_family_metadata_one_in=10000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=100000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=2097152 --high_pri_pool_ratio=0.5 --index_block_restart_interval=1 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=1000 --initial_auto_readahead_size=0 --inplace_update_support=0 --iterpercent=0 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=10000 --log2_keys_per_lock=10 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=1 --low_pri_pool_ratio=0.5 --lowest_used_cache_tier=1 --manifest_preallocation_size=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=16 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=8388608 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=1000 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=0 --metadata_write_fault_one_in=0 --min_blob_size=16 --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=20000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=0 --optimize_multiget_for_io=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=10000 --periodic_compaction_seconds=10 --prefix_size=8 --prefixpercent=0 --prepopulate_blob_cache=1 --prepopulate_block_cache=1 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=524288 --readpercent=60 --recycle_log_file_num=1 --reopen=20 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=1 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=2 --sync=0 --sync_fault_injection=0 --table_cache_numshardbits=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=3 --uncache_aggressiveness=118 --universal_max_read_amp=-1 --unpartitioned_pinning=0 --use_adaptive_mutex=0 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_blob_cache=0 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --use_shared_block_and_blob_cache=1 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=0 --verify_db_one_in=10000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=35 ``` Reviewed By: cbi42 Differential Revision: D60890580 Pulled By: hx235 fbshipit-source-id: 401f90d6d351c7ee11088cad06fb00e54062d416	2024-08-09 14:51:36 -07:00
Hui Xiao	16c21afc06	Fix failure to clean the temporary directory due to NotFound in crash test checkpoint creation (#12919 ) Summary: Context/Summary: `b26b395e0a` propagates `CleanStagingDirectory()` status to `CreateCheckpoint()`. However, we didn't return early when `Status s = db_->GetEnv()->FileExists(full_private_path);` return non-NotFound non-ok stratus in `CleanStagingDirectory()`. Therefore we can proceed to the next step when `full_private_path` doesn't exist. ``` Verification failed: Checkpoint failed: Operation aborted: Failed to clean the temporary directory /dev/shm/rocksdb.J4Su/rocksdb_crashtest_blackbox/.checkpoint28.tmp needed before checkpoint creation : NotFound: db_stress: db_stress_tool/db_stress_test_base.cc:549: void rocksdb::StressTest::ProcessStatus(rocksdb::SharedState*, std::string, const rocksdb::Status&, bool) const: Assertion `false' failed. ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12919 Test Plan: Below failed before the fix and passes after ``` ./db_stress --WAL_size_limit_MB=1 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=100 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=0 --allow_data_in_errors=True --allow_fallocate=1 --async_io=1 --auto_readahead_size=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 --bgerror_resume_retry_interval=1000000 --block_align=0 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=2 --bloom_bits=4 --bottommost_compression_type=snappy --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=10000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=1000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_readahead_size=1048576 --compaction_ttl=0 --compress_format_version=2 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0 --db=/dev/shm/rocksdb.J4Su/rocksdb_crashtest_blackbox --db_write_buffer_size=134217728 --default_temperature=kUnknown --default_write_temperature=kHot --delete_obsolete_files_period_micros=21600000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=1000000 --disable_manual_compaction_one_in=10000 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=1 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=0 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=1 --error_recovery_with_no_fault_injection=1 --exclude_wal_from_write_fault_injection=0 --expected_values_dir=/dev/shm/rocksdb.J4Su/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=xxh64 --fill_cache=1 --flush_one_in=1000000 --format_version=6 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=13 --index_shortening=0 --index_type=3 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=0 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=2500000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=8 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=0 --memtable_insert_hint_per_batch=0 --memtable_max_range_deletions=100 --memtable_prefix_bloom_size_ratio=0.1 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=128 --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=500000 --open_metadata_read_fault_one_in=8 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_hits=0 --optimize_filters_for_memory=1 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prefix_size=7 --prefixpercent=5 --prepopulate_block_cache=1 --preserve_internal_time_seconds=36000 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=0 --readpercent=45 --recycle_log_file_num=1 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=10000 --sample_for_compression=5 --secondary_cache_fault_one_in=32 --set_options_one_in=10000 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=0 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=3 --sync=0 --sync_fault_injection=0 --table_cache_numshardbits=6 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=1 --top_level_index_pinning=1 --uncache_aggressiveness=0 --universal_max_read_amp=0 --unpartitioned_pinning=0 --use_adaptive_mutex=0 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=1 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=1 --use_write_buffer_manager=1 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000 --verify_compression=1 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000 --verify_iterator_with_expected_state_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=1048576 --write_dbid_to_manifest=0 --write_fault_one_in=128 --writepercent=35 ``` Reviewed By: cbi42 Differential Revision: D60938952 Pulled By: hx235 fbshipit-source-id: 5696cd6b00f33c9f9a256944fecb4e2f4d52a2e6	2024-08-08 15:37:19 -07:00
Changyu Bi	b32d899482	Fix MultiGet dropping memtable kv checksum corruption (#12842 ) Summary: Corruption status returned by `GetFromTable()` could be overwritten here: `b6c3495a71/db/version_set.cc (L2614)` This PR fixes this issue by setting `*(s->found_final_value) = true;` in SaveValue. Also makes the handling of the return value of `GetFromTable()` more robust and added asserts in a couple places. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12842 Test Plan: Updated an existing unit test to cover MultiGet. It fails the assertion here before this PR: `b6c3495a71/db/version_set.cc (L2601)` Reviewed By: anand1976 Differential Revision: D59498203 Pulled By: cbi42 fbshipit-source-id: 1f071c1b2c5b66fb71264b547a9e670d1cf592f0	2024-08-08 13:34:11 -07:00
Peter Dillinger	d33d25f903	Disable WAL recycling in crash test; reproducer for recovery data loss (#12918 ) Summary: I was investigating a crash test failure with "Corruption: SST file is ahead of WALs" which I haven't reproduced, but I did reproduce a data loss issue on recovery which I suspect could be the same root problem. The problem is already somewhat known (see https://github.com/facebook/rocksdb/issues/12403 and https://github.com/facebook/rocksdb/issues/12639) where it's only safe to recovery multiple recycled WAL files with trailing old data if the sequence numbers between them are adjacent (to ensure we didn't lose anything in the corrupt/obsolete WAL tail). However, aside from disableWAL=true, there are features like external file ingestion that can increment the sequence numbers without writing to the WAL. It is simply unsustainable to worry about this kind of feature interaction limiting where we can consume sequence numbers. It is very hard to test and audit as well. For reliable crash recovery of recycled WALs, we need a better way of detecting that we didn't drop data from one WAL to the next. Until then, let's disable WAL recycling in the crash test, to help stabilize it. Ideas for follow-up to fix the underlying problem: (a) With recycling, we could always sync the WAL before opening the next one. HOWEVER, this potentially very large sync could cause a big hiccup in writes (vs. O(1) sized manifest sync). (a1) The WAL sync could ensure it is truncated to size, or (a2) By requiring track_and_verify_wals_in_manifest, we could assume that the last synced size in the manifest is the final usable size of the WAL. (It might also be worth avoiding truncating recycled WALs.) (b) Add a new mechanism to record and verify the final size of a WAL without requiring a sync. (b1) By requiring track_and_verify_wals_in_manifest, this could be new WAL metadata recorded in the manifest (at the time of switching WALs). Note that new fields of WalMetadata are not forward-compatible, but a new kind of manifest record (next to WalAddition, WalDeletion; e.g. WalCompletion) is IIRC forward-compatible. (b2) A new kind of WAL header entry (not forward compatible, unfortunately) could record the final size of the previous WAL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12918 Test Plan: Added disabled reproducer for non-linear data loss on recovery Reviewed By: hx235 Differential Revision: D60917527 Pulled By: pdillinger fbshipit-source-id: 3663d79aec81851f5cf41669f84a712bb4563fd7	2024-08-07 14:20:45 -07:00
Peter Dillinger	b15f8c7f0e	Refactor db_bloom_filter_test (#12911 ) Summary: Ahead of a "decoupled" variant of partitioned filters, refactoring this unit test file to make it easier to incorporate that new variant. * bool test param to new enum class FilterPartitioning * Some cases of iterating over that bool to new parameterized test * Combine some common functionality for configuring parameterized options Pull Request resolved: https://github.com/facebook/rocksdb/pull/12911 Test Plan: no production changes, and no intentional changes to scope or conditions of tests Differential Revision: D60701287 fbshipit-source-id: 3497e3230e29a4f62c934bcb75693965a2df41d8	2024-08-07 11:28:16 -07:00
Hui Xiao	b26b395e0a	Fix CreateCheckpoint not handling failed CleanStagingDirectory well (#12894 ) Summary: Context/Summary: `CleanStagingDirectory()` is called when the temporary .tmp folder we use to create checkpoint is not empty to begin with. Expanded fault injection can make this call fail e.g, `Delete file /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox/.checkpoint17.tmp/012393.sst -- IO error: injected metadata write error`. But The result of `CleanStagingDirectory()` is ignored in `CreateCheckpoint()`. So the injected IO error can't be propagated to db stress test and handled correctly. Hence we see `While mkdir: /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox/.checkpoint17.tmp: File exists` when we try to re-use a non-empty .tmp folder for new snapshots. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12894 Test Plan: Monitor CI Reviewed By: ltamasi Differential Revision: D60422849 Pulled By: hx235 fbshipit-source-id: 6f735c98eaa05d2b97ba4f781e0928357a50377a	2024-08-06 11:24:29 -07:00
Yu Zhang	719c96125c	Add a TransactionOptions to enable tracking timestamp size info inside WriteBatch (#12864 ) Summary: In normal use cases, meta info like column family's timestamp size is tracked at the transaction layer, so it's not necessary and even detrimental to track such info inside the internal WriteBatch because it may let anti-patterns like bypassing Transaction write APIs and directly write to its internal WriteBatch like this: `9d0a754dc9/storage/rocksdb/ha_rocksdb.cc (L4949-L4950)` Setting this option to true will keep aforementioned use case continue to work before it's refactored out. This option is only for this purpose and it will be gradually deprecated after aforementioned MyRocks use case are refactored. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12864 Test Plan: Added unit tests Reviewed By: cbi42 Differential Revision: D60194094 Pulled By: jowlyzhang fbshipit-source-id: 64a98822167e99aa7e4fa2a60085d44a5deaa45c	2024-08-05 13:06:45 -07:00
Yu Zhang	36b061a6c7	Fix test breakage (#12915 ) Summary: https://github.com/facebook/rocksdb/issues/12891 updated this deletion rate in the test to be much higher, which makes the test flaky. The rate is being intentionally set to very low to maximize the retention of a ".log.trash" file after DB closes. This PR just change it back. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12915 Reviewed By: ltamasi Differential Revision: D60776312 Pulled By: jowlyzhang fbshipit-source-id: d193557a042c65816fcc337cceb09905e042e9f6	2024-08-05 12:26:18 -07:00
Yu Zhang	d12aaf23ca	Fix file deletions in DestroyDB not rate limited (#12891 ) Summary: Make `DestroyDB` slowly delete files if it's configured and enabled via `SstFileManager`. It's currently not available mainly because of DeleteScheduler's logic related to tracked total_size_ and total_trash_size_. These accounting and logic should not be applied to `DestroyDB`. This PR adds a `DeleteUnaccountedDBFile` util for this purpose which deletes files without accounting it. This util also supports assigning a file to a specified trash bucket so that user can later wait for a specific trash bucket to be empty. For `DestroyDB`, files with more than 1 hard links will be deleted immediately. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12891 Test Plan: Added unit tests, existing tests. Reviewed By: anand1976 Differential Revision: D60300220 Pulled By: jowlyzhang fbshipit-source-id: 8b18109a177a3a9532f6dc2e40e08310c08ca3c7	2024-08-02 19:31:55 -07:00
Peter Dillinger	9d5c8c89a1	Fix filter partition size logic (#12904 ) Summary: Was checking == a desired number of entries added to a filter, when the combination of whole key and prefix filtering could add more than one entry per table internal key. This could lead to unnecessarily large filter partitions, which could affect performance and block cache fairness. Also (only somewhat related because of other work in progress): * Some variable renaming and a new assertion in BlockBasedTableBuilder, to add some clarity. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12904 Test Plan: If you add assertion logic to the base revision checking that the partition cut is requested whenever `keys_added_to_partition_ >= keys_per_partition_`, it fails on a number of db_bloom_filter_test tests. However, such an assertion in the revised code would be essentially redundant with the new logic. If I added a regression test for this, it would be tricky and fragile, so I don't think it's important enough to chase and maintain. (Open to suggestions / input.) Reviewed By: jowlyzhang Differential Revision: D60557827 Pulled By: pdillinger fbshipit-source-id: 77a56097d540da6e7851941a26d26ced2d944373	2024-08-02 14:49:02 -07:00

1 2 3 4 5 ...

13005 Commits All Branches Search

13005 Commits

All Branches