mirror of
https://github.com/facebook/rocksdb.git
synced 2024-11-29 18:33:58 +00:00
12753 commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Hui Xiao | a2f772910e |
Fix manual WAL flush causing false-positive inconsistent values in TestBackupRestore() (#12758)
Summary: **Context/Summary:** When manual WAL flush is used, the following can happen: t1: Issued Put(k1) to original DB. It entered WAL buffer since manual_wal_flush_one_in > 0. It never made it to WAL file without FlushWAL() t2: The same WAL got back-up and restored to restore DB. So the restore DB's WAL does not contain this Put() t3: The same WAL in the original DB got FlushWAL() so it got the Put() entry Querying k1 in original and restored DB will give different result and fail our consistency check in stress test. ``` Failure in a backup/restore operation with: Corruption: 0x000000000000000178 exists in original db but not in restore ``` This PR fixed it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12758 Test Plan: ``` ./db_stress --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=1 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=1 --async_io=1 --auto_readahead_size=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100 --batch_protection_bytes_per_key=8 --bgerror_resume_retry_interval=1000000 --block_align=1 --block_protection_bytes_per_key=0 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=13 --bottommost_compression_type=none --bottommost_file_compaction_delay=600 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=33554432 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_readahead_size=0 --compaction_ttl=0 --compress_format_version=1 --compressed_secondary_cache_size=8388608 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=4 --compression_type=none --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox_1 --db_write_buffer_size=0 --default_temperature=kCold --default_write_temperature=kHot --delete_obsolete_files_period_micros=21600000000 --delpercent=40 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_file_deletions_one_in=1000000 --disable_manual_compaction_one_in=10000 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=1 --enable_do_not_compress_roles=1 --enable_index_compression=0 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=1 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected_1 --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=none --fill_cache=0 --flush_one_in=1000000 --format_version=2 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=5 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=0 --log_file_time_to_roll=60 --log_readahead_size=16777216 --long_running_snapshots=0 --low_pri_pool_ratio=0.5 --lowest_used_cache_tier=1 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=10 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=16 --max_total_wal_size=0 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=2097152 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.1 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_hits=0 --optimize_filters_for_memory=0 --optimize_multiget_for_io=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=1000000 --periodic_compaction_seconds=2 --prefix_size=7 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=1000 --readahead_size=16384 --readpercent=0 --recycle_log_file_num=0 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=10000 --sample_for_compression=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=0 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --stats_history_buffer_size=0 --strict_bytes_per_sync=1 --subcompactions=2 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=-1 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=2 --uncache_aggressiveness=709 --universal_max_read_amp=0 --unpartitioned_pinning=0 --use_adaptive_mutex=0 --use_adaptive_mutex_lru=1 --use_attribute_group=1 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=335544 --write_dbid_to_manifest=1 --write_fault_one_in=128 --writepercent=45 ``` Repro-ed quickly before the fix and stably run after the fix. Reviewed By: jowlyzhang Differential Revision: D58426535 Pulled By: hx235 fbshipit-source-id: 611e56086e76f8c06d292624e60fd96e511ce723 |
||
Peter Dillinger | 0646ec6e2d |
Ensure Close() before LinkFile() for WALs in Checkpoint (#12734)
Summary: POSIX semantics for LinkFile (hard links) allow linking a file that is still being written two, with both the source and destination showing any subsequent writes to the source. This may not be practical semantics for some FileSystem implementations such as remote storage. They might only link the flushed or sync-ed file contents at time of LinkFile, or might even have undefined behavior if LinkFile is called on a file still open for write (not yet "sealed"). This change builds on https://github.com/facebook/rocksdb/issues/12731 to bring more hygiene to our handling of WAL files in Checkpoint. Specifically, we now Close WAL files as soon as they are either (a) inactive and fully synced, or (b) inactive and obsolete (so maybe never fully synced), rather than letting Close() happen in handling obsolete files (maybe a background thread). This should not be a performance issue as Close() should be trivial cost relative to other IO ops, but just in case: * We don't Close() while holding a mutex, to avoid blocking, and * The old behavior is available with a new kill switch option `background_close_inactive_wals`. Stacked on https://github.com/facebook/rocksdb/issues/12731 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12734 Test Plan: Extended existing unit test, especially adding a hygiene check to FaultInjectionTestFS to detect LinkFile() on a file still open for writes. FaultInjectionTestFS already has relevant tracking data, and tests can opt out of the new check, as in a smoke test I have left for the old, deprecated functionality `background_close_inactive_wals=true`. Also ran lengthy blackbox_crash_test to ensure the hygiene check is OK with the crash test. (The only place I can find we use LinkFile in production is Checkpoint.) Reviewed By: cbi42 Differential Revision: D58295284 Pulled By: pdillinger fbshipit-source-id: 64d90ed8477e2366c19eaf9c4c5ad60b82cac5c6 |
||
Peter Dillinger | d64eac28d3 |
Fix a failure to propagate ReadOptions (#12757)
Summary: The crash test revealed a case in which the uncache functionality in ~BlockBasedTableReader could initiate an block read (IO), despite setting ReadOptions::read_tier = kBlockCacheTier. The root cause is a place in the code where many people have over time decided to opt-in propagating ReadOptions and no one took the initiative to propagate ReadOptions by default (opt out / override only as needed). The fix is in partitioned_index_reader.cc. Here, ReadOptions::readahead_size is opted-out to avoid churn in prefetch_test that is not clearly an improvement or regression. It's hard to tell given the poor state of relevant documentation https://github.com/facebook/rocksdb/issues/12756. The affected unit test was added in https://github.com/facebook/rocksdb/issues/10602. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12757 Test Plan: (Now postponed to a follow-up diff) I have added some new infrastructure to DEBUG builds to catch this specific kind of violation in unit tests and in the stress/crash test. `EnforceReadOpts` establishes a thread-local context under which we assert no IOs are performed if ReadOptions said it should be forbidden. With this new checking, the Uncache unit test would catch the critical step toward a violation (inner ReadOptions allowing IO, even if no IO is actually performed), which is fixed with the production code change. Reviewed By: hx235 Differential Revision: D58421526 Pulled By: pdillinger fbshipit-source-id: 9e9917a0e320c78967e751bd887926a2ed231d37 |
||
Peter Dillinger | 961468f92e |
Fix TSAN-reported data race with uncache_aggressiveness (#12753)
Summary: Data race reported on BlockBasedTableReader::Rep::uncache_aggressiveness because apparently a file can be marked obsolete through multiple table cache references in parallel. Using a relaxed atomic should resolve the race quite reasonably, especially considering this is a rare case and the racing writes should be storing the same value anyway. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12753 Test Plan: watch for TSAN crash test results Reviewed By: ltamasi Differential Revision: D58397473 Pulled By: pdillinger fbshipit-source-id: 3e78b6adac4f7a7056790754bee42b3cb244f037 |
||
Peter Dillinger | 21eb82ebec |
Disable "uncache" behavior in DB shutdown (#12751)
Summary: Crash test showed a potential use-after-free where a file marked as obsolete and eligible for uncache on destruction is destroyed in the VersionSet destructor, which only happens as part of DB shutdown. At that point, the in-memory column families have already been destroyed, so attempting to uncache could use-after-free on stuff like getting the `user_comparator()` from the `internal_comparator()`. I attempted to make it smarter, but wasn't able to untangle the destruction dependencies in a way that was safe, understandable, and maintainable. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12751 Test Plan: Reproduced by adding uncache_aggressiveness to an existing (but otherwise unrelated) test. This makes it a fair regression test. Also added testing to ensure that trivial moves and DB close & reopen are well behaved with uncache_aggressiveness. Specifically, this issue doesn't seem to be because things are uncached inappropriately in those cases. Reviewed By: ltamasi Differential Revision: D58390058 Pulled By: pdillinger fbshipit-source-id: 66ac9cb13bf02638fa80ee5b7218153d8bc7cfd3 |
||
Evan Jones | af50823069 |
c.h: Add set_track_and_verify_wals_in_manifest to C API (#12749)
Summary: This option is recommended to be set for production use: We recommend to set track_and_verify_wals_in_manifest to true for production https://github.com/facebook/rocksdb/wiki/Track-WAL-in-MANIFEST This adds this setting to the C API, so it can be used by other languages. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12749 Reviewed By: ltamasi Differential Revision: D58382892 Pulled By: ajkr fbshipit-source-id: 885de4539745a3119b6b2a162ab4fca9fa975283 |
||
Peter Dillinger | 68112b3beb |
Attempt fix rare failure in DBBlockCacheTypeTest.Uncache (#12748)
Summary: I haven't been able to reproduce the failure, seen in https://github.com/facebook/rocksdb/actions/runs/9420830905/job/25953696902?pr=12734 ``` [ RUN ] DBBlockCacheTypeTestInstance/DBBlockCacheTypeTest.Uncache/2 db/db_block_cache_test.cc:1415: Failure Expected equality of these values: cache->GetOccupancyCount() Which is: 37 kBaselineCount + kNumDataBlocks + meta_blocks_per_file Which is: 15 Google Test trace: db/db_block_cache_test.cc:1346: ua=10000 db/db_block_cache_test.cc:1344: partitioned=1 db/db_block_cache_test.cc:1418: Failure ... ``` But it's consistent with a SuperVersion reference sticking around beyond the CompactRange, as I can reproduce the result with a dangling Iterator. Like some other tests have had trouble with periodic stats popping up randomly, I suspect that could be the explanation in this case. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12748 Test Plan: Watch for similar future failures Reviewed By: ltamasi Differential Revision: D58366031 Pulled By: pdillinger fbshipit-source-id: b812ca8837b8c8b9cbda1b201d76316d145fa3ec |
||
Hui Xiao | d3c4b7fe0b |
Enable reopen with un-synced data loss in crash test (#12746)
Summary: **Context/Summary:** https://github.com/facebook/rocksdb/pull/12567 disabled reopen with un-synced data loss in crash test since we discovered un-synced WAL loss and we currently don't support prefix recovery in reopen. This PR explicitly sync WAL data before close to avoid such data loss case from happening and add back the testing coverage. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12746 Test Plan: CI Reviewed By: ajkr Differential Revision: D58326890 Pulled By: hx235 fbshipit-source-id: 0865f715e97c5948d7cb3aea62fe2a626cb6522a |
||
Eduardo Menges Mattje | 43906597f5 |
Fixed CMake builds for iOS (#12744)
Summary: This [ports the flags which were already defined in `build_tools`]( |
||
Evan Jones | 32e6825bc6 |
c.h: Add GetDbIdentity, Options::write_dbid_to_manifest (#12736)
Summary: The write_dbid_to_manifest option is documented as "We recommend setting this flag to true". However, there is no way to set this flag from the C API. Add the following functions to the C API: * rocksdb_get_db_identity * rocksdb_options_get_write_dbid_to_manifest * rocksdb_options_set_write_dbid_to_manifest Add a test that this option preserves the ID across checkpoints. c.cc: * Remove outdated comments about missing C API functions that exist. * Document that CopyString is intended for binary data and is not NUL terminated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12736 Reviewed By: ltamasi Differential Revision: D58202117 Pulled By: ajkr fbshipit-source-id: 707b110df5c4bd118d65548327428a53a9dc3019 |
||
Peter Dillinger | b34cef57b7 |
Support pro-actively erasing obsolete block cache entries (#12694)
Summary: Currently, when files become obsolete, the block cache entries associated with them just age out naturally. With pure LRU, this is not too bad, as once you "use" enough cache entries to (re-)fill the cache, you are guranteed to have purged the obsolete entries. However, HyperClockCache is a counting clock cache with a somewhat longer memory, so could be more negatively impacted by previously-hot cache entries becoming obsolete, and taking longer to age out than newer single-hit entries. Part of the reason we still have this natural aging-out is that there's almost no connection between block cache entries and the file they are associated with. Everything is hashed into the same pool(s) of entries with nothing like a secondary index based on file. Keeping track of such an index could be expensive. This change adds a new, mutable CF option `uncache_aggressiveness` for erasing obsolete block cache entries. The process can be speculative, lossy, or unproductive because not all potential block cache entries associated with files will be resident in memory, and attempting to remove them all could be wasted CPU time. Rather than a simple on/off switch, `uncache_aggressiveness` basically tells RocksDB how much CPU you're willing to burn trying to purge obsolete block cache entries. When such efforts are not sufficiently productive for a file, we stop and move on. The option is in ColumnFamilyOptions so that it is dynamically changeable for already-open files, and customizeable by CF. Note that this block cache removal happens as part of the process of purging obsolete files, which is often in a background thread (depending on `background_purge_on_iterator_cleanup` and `avoid_unnecessary_blocking_io` options) rather than along CPU critical paths. Notable auxiliary code details: * Possibly fixing some issues with trivial moves with `only_delete_metadata`: unnecessary TableCache::Evict in that case and missing from the ObsoleteFileInfo move operator. (Not able to reproduce an current failure.) * Remove suspicious TableCache::Erase() from VersionSet::AddObsoleteBlobFile() (TODO follow-up item) Marked EXPERIMENTAL until more thorough validation is complete. Direct stats of this functionality are omitted because they could be misleading. Block cache hit rate is a better indicator of benefit, and CPU profiling a better indicator of cost. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12694 Test Plan: * Unit tests added, including refactoring an existing test to make better use of parameterized tests. * Added to crash test. * Performance, sample command: ``` for I in `seq 1 10`; do for UA in 300; do for CT in lru_cache fixed_hyper_clock_cache auto_hyper_clock_cache; do rm -rf /dev/shm/test3; TEST_TMPDIR=/dev/shm/test3 /usr/bin/time ./db_bench -benchmarks=readwhilewriting -num=13000000 -read_random_exp_range=6 -write_buffer_size=10000000 -bloom_bits=10 -cache_type=$CT -cache_size=390000000 -cache_index_and_filter_blocks=1 -disable_wal=1 -duration=60 -statistics -uncache_aggressiveness=$UA 2>&1 | grep -E 'micros/op|rocksdb.block.cache.data.(hit|miss)|rocksdb.number.keys.(read|written)|maxresident' | awk '/rocksdb.block.cache.data.miss/ { miss = $4 } /rocksdb.block.cache.data.hit/ { hit = $4 } { print } END { print "hit rate = " ((hit * 1.0) / (miss + hit)) }' | tee -a results-$CT-$UA; done; done; done ``` Averaging 10 runs each case, block cache data block hit rates ``` lru_cache UA=0 -> hit rate = 0.327, ops/s = 87668, user CPU sec = 139.0 UA=300 -> hit rate = 0.336, ops/s = 87960, user CPU sec = 139.0 fixed_hyper_clock_cache UA=0 -> hit rate = 0.336, ops/s = 100069, user CPU sec = 139.9 UA=300 -> hit rate = 0.343, ops/s = 100104, user CPU sec = 140.2 auto_hyper_clock_cache UA=0 -> hit rate = 0.336, ops/s = 97580, user CPU sec = 140.5 UA=300 -> hit rate = 0.345, ops/s = 97972, user CPU sec = 139.8 ``` Conclusion: up to roughly 1 percentage point of improved block cache hit rate, likely leading to overall improved efficiency (because the foreground CPU cost of cache misses likely outweighs the background CPU cost of erasure, let alone I/O savings). Reviewed By: ajkr Differential Revision: D57932442 Pulled By: pdillinger fbshipit-source-id: 84a243ca5f965f731f346a4853009780a904af6c |
||
Yu Zhang | 44aceb88d0 |
Add a OnManualFlushScheduled callback in event listener (#12631)
Summary: As titled. Also added the newest user-defined timestamp into the `MemTableInfo`. This can be a useful info in the callback. Added some unit tests as examples for how users can use two separate approaches to allow manual flush / manual compactions to go through when the user-defined timestamps in memtable only feature is enabled. One approach relies on selectively increase cutoff timestamp in `OnMemtableSeal` callback when it's initiated by a manual flush. Another approach is to increase cutoff timestamp in `OnManualFlushScheduled` callback. The caveats of the approaches are also documented in the unit test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12631 Reviewed By: ajkr Differential Revision: D58260528 Pulled By: jowlyzhang fbshipit-source-id: bf446d7140affdf124744095e0a179fa6e427532 |
||
Hui Xiao | 390fc55ba1 |
Revert PR 12684 and 12556 (#12738)
Summary: **Context/Summary:** a better API design is decided lately so we decided to revert these two changes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12738 Test Plan: - CI Reviewed By: ajkr Differential Revision: D58162165 Pulled By: hx235 fbshipit-source-id: 9bbe4d2fe9fbe39213f4cf137a2d419e6ffb8e16 |
||
Peter Dillinger | 98393f0139 |
Fix Checkpoint hard link of inactive but unsynced WAL (#12731)
Summary: Background: there is one active WAL file but there can be several more WAL files in various states. Those other WALs are always in a "flushed" state but could be on the `logs_` list not yet fully synced. We currently allow any WAL that is not the active WAL to be hard-linked when creating a Checkpoint, as although it might still be open for write, we are not appending any more data to it. The problem is that a created Checkpoint is supposed to be fully synced on return of that function, and a hard-linked WAL in the state described above might not be fully synced. (Through some prudence in https://github.com/facebook/rocksdb/issues/10083, it would synced if using track_and_verify_wals_in_manifest=true.) The fix is a step toward a long term goal of removing the need to query the filesystem to determine WAL files and their state. (I consider it dubious any time we independently read from or query metadata from a file we have open for writing, as this makes us more susceptible to FileSystem deficiencies or races.) More specifically: * Detect which WALs might not be fully synced, according to our DBImpl metadata, and prevent hard linking those (with `trim_to_size=true` from `GetLiveFilesStorageInfo()`. And while we're at it, use our known flushed sizes for those WALs. * To avoid a race between that and GetSortedWalFiles(), track a maximum needed WAL number for the Checkpoint/GetLiveFilesStorageInfo. * Because of the level of consistency provided by those two, we no longer need to consider syncing as part of the FlushWAL in GetLiveFilesStorageInfo. (We determine the max WAL number consistent with the manifest file size, while holding DB mutex. Should make track_and_verify_wals_in_manifest happy.) This makes the premise of test PutRaceWithCheckpointTrackedWalSync obsolete (sync point callback no longer hit) so the test is removed, with crash test as backstop for related issues. See https://github.com/facebook/rocksdb/issues/10185 Stacked on https://github.com/facebook/rocksdb/issues/12729 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12731 Test Plan: Expanded an existing test, which now fails before fix. Also long runs of blackbox_crash_test with amplified checkpoint frequency. Reviewed By: cbi42 Differential Revision: D58199629 Pulled By: pdillinger fbshipit-source-id: 376e55f4a2b082cd2adb6408a41209de14422382 |
||
Adam Kupczyk | a211e06552 |
Remove close when fd == -1. (#12732)
Summary: Its polluting my valgrind runs: ==3733139== Warning: invalid file descriptor -1 in syscall close() ==3733139== Warning: invalid file descriptor -1 in syscall close() ==3733139== Warning: invalid file descriptor -1 in syscall close() Pull Request resolved: https://github.com/facebook/rocksdb/pull/12732 Reviewed By: ltamasi Differential Revision: D58170009 Pulled By: ajkr fbshipit-source-id: 1fc6944c2667641996676a75aa3e91984070ba49 |
||
wuruilong | 61d10fe0c3 |
Fix compile errors on loongarch (#12739)
Summary: Failed to compile when using cmake on loongarch architecture with the following error details:[https://buildd.debian.org/status/fetch.php?pkg=rocksdb&arch=loong64&ver=9.2.1-2&stamp= 1717362107&raw=0](url). The reason for the error is that loongarch does not support the mcpu option, refer to the link for details: [https://gcc.gnu.org/onlinedocs/gcc-13.2.0/gcc/LoongArch-Options.html](url) Pull Request resolved: https://github.com/facebook/rocksdb/pull/12739 Reviewed By: ltamasi Differential Revision: D58200695 Pulled By: ajkr fbshipit-source-id: 00e1a51e15defaa8983524cdd3fc25240833c08b |
||
Peter Dillinger | 9f4c597d83 |
FaultInjectionTestFS read unsynced data by default (#12729)
Summary: In places (e.g. GetSortedWals()) RocksDB relies on querying the file size or even reading the contents of files currently open for writing, and as in POSIX semantics, expects to see the flushed size and contents regardless of what has been synced. FaultInjectionTestFS historically did not emulate this behavior, only showing synced data from such read operations. (Different from FaultInjectionTestEnv--sigh.) This change makes the "proper" behavior the default behavior, at least for GetFileSize and FSSequentialFile. However, this new functionality is disabled in db_stress because of undiagnosed, unresolved issues. Also removes unused and confusing field `pos_at_last_flush_` This change is needed to support testing a relevant bug fix (in a follow-up diff). Other suggested follow-up: * Fix db_stress not to rely on the old behavior, and fix a related FIXME in db_stress_test_base.cc in LockWAL testing. * Fill in some corner cases in the FileSystem API for reading unsynced data (see new TODO items). * Consider deprecating and removing Flush() API functions from FileSystem APIs. It is not clear to me that there is a supported scenario in which they do anything but confuse API users and developers. If there is a use for them, it doesn't appear to be tested. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12729 Test Plan: applies to all unit tests successfully, just updating the unit test from https://github.com/facebook/rocksdb/issues/12556 due to relying on the errant behavior. Also added a specific unit test Reviewed By: hx235 Differential Revision: D58091835 Pulled By: pdillinger fbshipit-source-id: f47a63b2b000f5875b6293a98577bff663d7fd33 |
||
Yu Zhang | 8523f0a86a |
Use extended file boundary for key range overlap check during file ingestion (#12735)
Summary: When https://github.com/facebook/rocksdb/issues/12343 added support to bulk load external files while column family enables user-defined timestamps, it's a requirement that the external file doesn't overlap with the DB in key ranges. More specifically, the external file should not contain a user key (without timestamp) that already have some entries in the DB. All the `*Overlap*` functions like `RangeOverlapWithMemtable`, `RangeOverlapWithCompaction` are using `CompareWithoutTimestamp` to check for overlap already. One thing that is missing here is we need to extend the external file's user key boundary for this check to avoid missing the checks for the boundary user keys. For example, with the current way of checking things where `external_file_info.smallest.user_key()` is used as the left boundary, and `external_file_info.largest.user_key()` is used as the right boundary, a file with this entry: (b, 40) can fit into a DB with these two entries: (b, 30), (c, 20). To avoid this, we extend the user key boundaries used for overlap check, by updating the left boundary with the maximum timestamp and the right boundary with the minimum timestamp. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12735 Test Plan: Added unit test Reviewed By: ltamasi Differential Revision: D58152117 Pulled By: jowlyzhang fbshipit-source-id: 9cba61e7357f6d76ad44c258381c35073ebbf347 |
||
Valery Mironov | a8a52e5b4d |
Fix AddressSanitizer container-overflow (#12722)
Summary: ``` ERROR: AddressSanitizer: container-overflow on address 0x506000682221 at pc 0x5583da569f76 bp 0x7f0ec8a9ffb0 sp 0x7f0ec8a9f780 WRITE of size 53 at 0x506000682221 thread T29 #0 0x5583da569f75 in pread https://github.com/facebook/rocksdb/issues/1 0x5583e334fde4 in rocksdb::PosixRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::IOOptions const&, rocksdb::Slice*, char*, rocksdb::IODebugContext*) const /rocksdb/env/io_posix.cc:580:9 https://github.com/facebook/rocksdb/issues/2 0x5583e2cac42b in rocksdb::(anonymous namespace)::CompositeRandomAccessFileWrapper::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const /rocksdb/env/composite_env.cc:61:21 https://github.com/facebook/rocksdb/issues/3 0x5583e2c8a8e4 in rocksdb::(anonymous namespace)::LegacyRandomAccessFileWrapper::Read(unsigned long, unsigned long, rocksdb::IOOptions const&, rocksdb::Slice*, char*, rocksdb::IODebugContext*) const /rocksdb/env/env.cc:152:41 https://github.com/facebook/rocksdb/issues/4 0x5583e2d6cbfb in rocksdb::RandomAccessFileReader::Read(rocksdb::IOOptions const&, unsigned long, unsigned long, rocksdb::Slice*, char*, std::__2::unique_ptr<char [], std::__2::default_delete<char []>>*, rocksdb::Env::IOPriority) const /rocksdb/file/random_access_file_reader.cc:204:25 https://github.com/facebook/rocksdb/issues/5 0x5583e307c614 in rocksdb::ReadFooterFromFile(rocksdb::IOOptions const&, rocksdb::RandomAccessFileReader*, rocksdb::FilePrefetchBuffer*, unsigned long, rocksdb::Footer*, unsigned long) /rocksdb/table/format.cc:383:17 https://github.com/facebook/rocksdb/issues/6 0x5583e2f88456 in rocksdb::BlockBasedTable::Open(rocksdb::ReadOptions const&, rocksdb::ImmutableOptions const&, rocksdb::EnvOptions const&, rocksdb::BlockBasedTableOptions const&, rocksdb::InternalKeyComparator const&, std::__2::unique_ptr<rocksdb::RandomAccessFileReader, std::__2::default_delete<rocksdb::RandomAccessFileReader>>&&, unsigned long, std::__2::unique_ptr<rocksdb::TableReader, std::__2::default_delete<rocksdb::TableReader>>*, std::__2::shared_ptr<rocksdb::CacheReservationManager>, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, bool, bool, int, bool, unsigned long, bool, rocksdb::TailPrefetchStats*, rocksdb::BlockCacheTracer*, unsigned long, std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const&, unsigned long) /rocksdb/table/block_based/block_based_table_reader.cc:610:9 https://github.com/facebook/rocksdb/issues/7 0x5583e2ef7837 in rocksdb::BlockBasedTableFactory::NewTableReader(rocksdb::ReadOptions const&, rocksdb::TableReaderOptions const&, std::__2::unique_ptr<rocksdb::RandomAccessFileReader, std::__2::default_delete<rocksdb::RandomAccessFileReader>>&&, unsigned long, std::__2::unique_ptr<rocksdb::TableReader, std::__2::default_delete<rocksdb::TableReader>>*, bool) const /rocksdb/table/block_based/block_based_table_factory.cc:599:10 https://github.com/facebook/rocksdb/issues/8 0x5583e2ab873c in rocksdb::TableCache::GetTableReader(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, bool, bool, rocksdb::HistogramImpl*, std::__2::unique_ptr<rocksdb::TableReader, std::__2::default_delete<rocksdb::TableReader>>*, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, bool, int, bool, unsigned long, rocksdb::Temperature) /rocksdb/db/table_cache.cc:142:34 https://github.com/facebook/rocksdb/issues/9 0x5583e2aba5f6 in rocksdb::TableCache::FindTable(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, rocksdb::Cache::Handle**, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, bool, bool, rocksdb::HistogramImpl*, bool, int, bool, unsigned long, rocksdb::Temperature) /rocksdb/db/table_cache.cc:190:16 https://github.com/facebook/rocksdb/issues/10 0x5583e2abb7e1 in rocksdb::TableCache::NewIterator(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::RangeDelAggregator*, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, rocksdb::TableReader**, rocksdb::HistogramImpl*, rocksdb::TableReaderCaller, rocksdb::Arena*, bool, int, unsigned long, rocksdb::InternalKey const*, rocksdb::InternalKey const*, bool) /rocksdb/db/table_cache.cc:235:9 https://github.com/facebook/rocksdb/issues/11 0x5583e28d14cf in rocksdb::BuildTable(std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const&, rocksdb::VersionSet*, rocksdb::ImmutableDBOptions const&, rocksdb::TableBuilderOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::__2::vector<std::__2::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::__2::default_delete<rocksdb::FragmentedRangeTombstoneIterator>>, std::__2::allocator<std::__2::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::__2::default_delete<rocksdb::FragmentedRangeTombstoneIterator>>>>, rocksdb::FileMetaData*, std::__2::vector<rocksdb::BlobFileAddition, std::__2::allocator<rocksdb::BlobFileAddition>>*, std::__2::vector<unsigned long, std::__2::allocator<unsigned long>>, unsigned long, unsigned long, rocksdb::SnapshotChecker*, bool, rocksdb::InternalStats*, rocksdb::IOStatus*, std::__2::shared_ptr<rocksdb::IOTracer> const&, rocksdb::BlobFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, rocksdb::Env::WriteLifeTimeHint, std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const*, rocksdb::BlobFileCompletionCallback*, unsigned long*, unsigned long*, unsigned long*) /rocksdb/db/builder.cc:335:57 https://github.com/facebook/rocksdb/issues/12 0x5583e29bf29d in rocksdb::FlushJob::WriteLevel0Table() /rocksdb/db/flush_job.cc:919:11 https://github.com/facebook/rocksdb/issues/13 0x5583e29b33ac in rocksdb::FlushJob::Run(rocksdb::LogsWithPrepTracker*, rocksdb::FileMetaData*, bool*) /rocksdb/db/flush_job.cc:276:9 https://github.com/facebook/rocksdb/issues/14 0x5583e27a4781 in rocksdb::DBImpl::FlushMemTableToOutputFile(rocksdb::ColumnFamilyData*, rocksdb::MutableCFOptions const&, bool*, rocksdb::JobContext*, rocksdb::SuperVersionContext*, std::__2::vector<unsigned long, std::__2::allocator<unsigned long>>&, unsigned long, rocksdb::SnapshotChecker*, rocksdb::LogBuffer*, rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:258:19 https://github.com/facebook/rocksdb/issues/15 0x5583e27a7a96 in rocksdb::DBImpl::FlushMemTablesToOutputFiles(rocksdb::autovector<rocksdb::DBImpl::BGFlushArg, 8ul> const&, bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:377:14 https://github.com/facebook/rocksdb/issues/16 0x5583e27d6777 in rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::FlushReason*, rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:2778:14 https://github.com/facebook/rocksdb/issues/17 0x5583e27d14e2 in rocksdb::DBImpl::BackgroundCallFlush(rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:2817:16 https://github.com/facebook/rocksdb/issues/18 0x5583e323d353 in std::__2::__function::__policy_func<void ()>::operator()[abi:ne180100]() const /root/build/3rdParty/llvm/runtimes/include/c++/v1/__functional/function.h:714:12 https://github.com/facebook/rocksdb/issues/19 0x5583e323d353 in std::__2::function<void ()>::operator()() const /root/build/3rdParty/llvm/runtimes/include/c++/v1/__functional/function.h:981:10 https://github.com/facebook/rocksdb/issues/20 0x5583e323d353 in rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long) /rocksdb/util/threadpool_imp.cc:266:5 https://github.com/facebook/rocksdb/issues/21 0x5583e3243d18 in decltype(std::declval<void (*)(void*)>()(std::declval<rocksdb::BGThreadMetadata*>())) std::__2::__invoke[abi:ne180100]<void (*)(void*), rocksdb::BGThreadMetadata*>(void (*&&)(void*), rocksdb::BGThreadMetadata*&&) /root/build/3rdParty/llvm/runtimes/include/c++/v1/__type_traits/invoke.h:344:25 https://github.com/facebook/rocksdb/issues/22 0x5583e3243d18 in void std::__2::__thread_execute[abi:ne180100]<std::__2::unique_ptr<std::__2::__thread_struct, std::__2::default_delete<std::__2::__thread_struct>>, void (*)(void*), rocksdb::BGThreadMetadata*, 2ul>(std::__2::tuple<std::__2::unique_ptr<std::__2::__thread_struct, std::__2::default_delete<std::__2::__thread_struct>>, void (*)(void*), rocksdb::BGThreadMetadata*>&, std::__2::__tuple_indices<2ul>) /root/build/3rdParty/llvm/runtimes/include/c++/v1/__thread/thread.h:193:3 https://github.com/facebook/rocksdb/issues/23 0x5583e3243d18 in void* std::__2::__thread_proxy[abi:ne180100]<std::__2::tuple<std::__2::unique_ptr<std::__2::__thread_struct, std::__2::default_delete<std::__2::__thread_struct>>, void (*)(void*), rocksdb::BGThreadMetadata*>>(void*) /root/build/3rdParty/llvm/runtimes/include/c++/v1/__thread/thread.h:202:3 https://github.com/facebook/rocksdb/issues/24 0x5583da5e819e in asan_thread_start(void*) crtstuff.c https://github.com/facebook/rocksdb/issues/25 0x7f0eda362a93 in start_thread nptl/pthread_create.c:447:8 https://github.com/facebook/rocksdb/issues/26 0x7f0eda3efc3b in clone3 misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:78 0x506000682221 is located 1 bytes inside of 56-byte region [0x506000682220,0x506000682258) allocated by thread T29 here: #0 0x5583da6281d1 in operator new(unsigned long) https://github.com/facebook/rocksdb/issues/1 0x5583da6c987d in __libcpp_operator_new<unsigned long> /root/build/3rdParty/llvm/runtimes/include/c++/v1/new:271:10 https://github.com/facebook/rocksdb/issues/2 0x5583da6c987d in __libcpp_allocate /root/build/3rdParty/llvm/runtimes/include/c++/v1/new:295:10 https://github.com/facebook/rocksdb/issues/3 0x5583da6c987d in allocate /root/build/3rdParty/llvm/runtimes/include/c++/v1/__memory/allocator.h:125:32 https://github.com/facebook/rocksdb/issues/4 0x5583da6c987d in allocate_at_least /root/build/3rdParty/llvm/runtimes/include/c++/v1/__memory/allocator.h:131:13 https://github.com/facebook/rocksdb/issues/5 0x5583da6c987d in allocate_at_least<std::__2::allocator<char> > /root/build/3rdParty/llvm/runtimes/include/c++/v1/__memory/allocate_at_least.h:34:20 https://github.com/facebook/rocksdb/issues/6 0x5583da6c987d in __allocate_at_least<std::__2::allocator<char> > /root/build/3rdParty/llvm/runtimes/include/c++/v1/__memory/allocate_at_least.h:42:10 https://github.com/facebook/rocksdb/issues/7 0x5583da6c987d in std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>>::__shrink_or_extend[abi:ne180100](unsigned long) /root/build/3rdParty/llvm/runtimes/include/c++/v1/string:3236:27 https://github.com/facebook/rocksdb/issues/8 0x5583e307c5aa in std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>>::reserve(unsigned long) /root/build/3rdParty/llvm/runtimes/include/c++/v1/string:3207:3 https://github.com/facebook/rocksdb/issues/9 0x5583e307c5aa in rocksdb::ReadFooterFromFile(rocksdb::IOOptions const&, rocksdb::RandomAccessFileReader*, rocksdb::FilePrefetchBuffer*, unsigned long, rocksdb::Footer*, unsigned long) /rocksdb/table/format.cc:382:18 https://github.com/facebook/rocksdb/issues/10 0x5583e2f88456 in rocksdb::BlockBasedTable::Open(rocksdb::ReadOptions const&, rocksdb::ImmutableOptions const&, rocksdb::EnvOptions const&, rocksdb::BlockBasedTableOptions const&, rocksdb::InternalKeyComparator const&, std::__2::unique_ptr<rocksdb::RandomAccessFileReader, std::__2::default_delete<rocksdb::RandomAccessFileReader>>&&, unsigned long, std::__2::unique_ptr<rocksdb::TableReader, std::__2::default_delete<rocksdb::TableReader>>*, std::__2::shared_ptr<rocksdb::CacheReservationManager>, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, bool, bool, int, bool, unsigned long, bool, rocksdb::TailPrefetchStats*, rocksdb::BlockCacheTracer*, unsigned long, std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const&, unsigned long) /rocksdb/table/block_based/block_based_table_reader.cc:610:9 https://github.com/facebook/rocksdb/issues/11 0x5583e2ef7837 in rocksdb::BlockBasedTableFactory::NewTableReader(rocksdb::ReadOptions const&, rocksdb::TableReaderOptions const&, std::__2::unique_ptr<rocksdb::RandomAccessFileReader, std::__2::default_delete<rocksdb::RandomAccessFileReader>>&&, unsigned long, std::__2::unique_ptr<rocksdb::TableReader, std::__2::default_delete<rocksdb::TableReader>>*, bool) const /rocksdb/table/block_based/block_based_table_factory.cc:599:10 https://github.com/facebook/rocksdb/issues/12 0x5583e2ab873c in rocksdb::TableCache::GetTableReader(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, bool, bool, rocksdb::HistogramImpl*, std::__2::unique_ptr<rocksdb::TableReader, std::__2::default_delete<rocksdb::TableReader>>*, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, bool, int, bool, unsigned long, rocksdb::Temperature) /rocksdb/db/table_cache.cc:142:34 https://github.com/facebook/rocksdb/issues/13 0x5583e2aba5f6 in rocksdb::TableCache::FindTable(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, rocksdb::Cache::Handle**, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, bool, bool, rocksdb::HistogramImpl*, bool, int, bool, unsigned long, rocksdb::Temperature) /rocksdb/db/table_cache.cc:190:16 https://github.com/facebook/rocksdb/issues/14 0x5583e2abb7e1 in rocksdb::TableCache::NewIterator(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::RangeDelAggregator*, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, rocksdb::TableReader**, rocksdb::HistogramImpl*, rocksdb::TableReaderCaller, rocksdb::Arena*, bool, int, unsigned long, rocksdb::InternalKey const*, rocksdb::InternalKey const*, bool) /rocksdb/db/table_cache.cc:235:9 https://github.com/facebook/rocksdb/issues/15 0x5583e28d14cf in rocksdb::BuildTable(std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const&, rocksdb::VersionSet*, rocksdb::ImmutableDBOptions const&, rocksdb::TableBuilderOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::__2::vector<std::__2::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::__2::default_delete<rocksdb::FragmentedRangeTombstoneIterator>>, std::__2::allocator<std::__2::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::__2::default_delete<rocksdb::FragmentedRangeTombstoneIterator>>>>, rocksdb::FileMetaData*, std::__2::vector<rocksdb::BlobFileAddition, std::__2::allocator<rocksdb::BlobFileAddition>>*, std::__2::vector<unsigned long, std::__2::allocator<unsigned long>>, unsigned long, unsigned long, rocksdb::SnapshotChecker*, bool, rocksdb::InternalStats*, rocksdb::IOStatus*, std::__2::shared_ptr<rocksdb::IOTracer> const&, rocksdb::BlobFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, rocksdb::Env::WriteLifeTimeHint, std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const*, rocksdb::BlobFileCompletionCallback*, unsigned long*, unsigned long*, unsigned long*) /rocksdb/db/builder.cc:335:57 https://github.com/facebook/rocksdb/issues/16 0x5583e29bf29d in rocksdb::FlushJob::WriteLevel0Table() /rocksdb/db/flush_job.cc:919:11 https://github.com/facebook/rocksdb/issues/17 0x5583e29b33ac in rocksdb::FlushJob::Run(rocksdb::LogsWithPrepTracker*, rocksdb::FileMetaData*, bool*) /rocksdb/db/flush_job.cc:276:9 https://github.com/facebook/rocksdb/issues/18 0x5583e27a4781 in rocksdb::DBImpl::FlushMemTableToOutputFile(rocksdb::ColumnFamilyData*, rocksdb::MutableCFOptions const&, bool*, rocksdb::JobContext*, rocksdb::SuperVersionContext*, std::__2::vector<unsigned long, std::__2::allocator<unsigned long>>&, unsigned long, rocksdb::SnapshotChecker*, rocksdb::LogBuffer*, rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:258:19 https://github.com/facebook/rocksdb/issues/19 0x5583e27a7a96 in rocksdb::DBImpl::FlushMemTablesToOutputFiles(rocksdb::autovector<rocksdb::DBImpl::BGFlushArg, 8ul> const&, bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:377:14 https://github.com/facebook/rocksdb/issues/20 0x5583e27d6777 in rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::FlushReason*, rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:2778:14 https://github.com/facebook/rocksdb/issues/21 0x5583e27d14e2 in rocksdb::DBImpl::BackgroundCallFlush(rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:2817:16 https://github.com/facebook/rocksdb/issues/22 0x5583e323d353 in std::__2::__function::__policy_func<void ()>::operator()[abi:ne180100]() const /root/build/3rdParty/llvm/runtimes/include/c++/v1/__functional/function.h:714:12 https://github.com/facebook/rocksdb/issues/23 0x5583e323d353 in std::__2::function<void ()>::operator()() const /root/build/3rdParty/llvm/runtimes/include/c++/v1/__functional/function.h:981:10 https://github.com/facebook/rocksdb/issues/24 0x5583e323d353 in rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long) /rocksdb/util/threadpool_imp.cc:266:5 https://github.com/facebook/rocksdb/issues/25 0x5583e3243d18 in decltype(std::declval<void (*)(void*)>()(std::declval<rocksdb::BGThreadMetadata*>())) std::__2::__invoke[abi:ne180100]<void (*)(void*), rocksdb::BGThreadMetadata*>(void (*&&)(void*), rocksdb::BGThreadMetadata*&&) /root/build/3rdParty/llvm/runtimes/include/c++/v1/__type_traits/invoke.h:344:25 https://github.com/facebook/rocksdb/issues/26 0x5583e3243d18 in void std::__2::__thread_execute[abi:ne180100]<std::__2::unique_ptr<std::__2::__thread_struct, std::__2::default_delete<std::__2::__thread_struct>>, void (*)(void*), rocksdb::BGThreadMetadata*, 2ul>(std::__2::tuple<std::__2::unique_ptr<std::__2::__thread_struct, std::__2::default_delete<std::__2::__thread_struct>>, void (*)(void*), rocksdb::BGThreadMetadata*>&, std::__2::__tuple_indices<2ul>) /root/build/3rdParty/llvm/runtimes/include/c++/v1/__thread/thread.h:193:3 https://github.com/facebook/rocksdb/issues/27 0x5583e3243d18 in void* std::__2::__thread_proxy[abi:ne180100]<std::__2::tuple<std::__2::unique_ptr<std::__2::__thread_struct, std::__2::default_delete<std::__2::__thread_struct>>, void (*)(void*), rocksdb::BGThreadMetadata*>>(void*) /root/build/3rdParty/llvm/runtimes/include/c++/v1/__thread/thread.h:202:3 https://github.com/facebook/rocksdb/issues/28 0x5583da5e819e in asan_thread_start(void*) crtstuff.c HINT: if you don't care about these errors you may set ASAN_OPTIONS=detect_container_overflow=0. If you suspect a false positive see also: https://github.com/google/sanitizers/wiki/AddressSanitizerContainerOverflow. AddressSanitizer:container-overflow in pread Shadow bytes around the buggy address: 0x506000681f80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x506000682000: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x506000682080: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x506000682100: fa fa fa fa fa fa fa fa fa fa fa fa 00 00 00 00 0x506000682180: 00 00 00 fa fa fa fa fa fa fa fa fa fa fa fa fa =>0x506000682200: fa fa fa fa[01]fc fc fc fc fc fc fa fa fa fa fa 0x506000682280: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x506000682300: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 01 0x506000682380: fa fa fa fa fd fd fd fd fd fd fd fd fa fa fa fa 0x506000682400: fd fd fd fd fd fd fd fa fa fa fa fa fd fd fd fd 0x506000682480: fd fd fd fd fa fa fa fa fd fd fd fd fd fd fd fd Shadow byte legend (one shadow byte represents 8 application bytes): Addressable: 00 Partially addressable: 01 02 03 04 05 06 07 Heap left redzone: fa Freed heap region: fd Stack left redzone: f1 Stack mid redzone: f2 Stack right redzone: f3 Stack after return: f5 Stack use after scope: f8 Global redzone: f9 Global init order: f6 Poisoned by user: f7 Container overflow: fc Array cookie: ac Intra object redzone: bb ASan internal: fe Left alloca redzone: ca Right alloca redzone: cb ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12722 Reviewed By: hx235 Differential Revision: D58118264 Pulled By: ajkr fbshipit-source-id: 0dd914c886c022d82697b769d664ba52de0770de |
||
Po-Chuan Hsieh | b03d415660 |
Fix build on i386 (#12719)
Summary: Cited from https://pkg-status.freebsd.org/beefy21/data/140i386-default/02faf78f4c9b/logs/rocksdb-9.2.1.log The error message is as follows: ``` mkdir -p db_stress_tool && clang++ -O2 -pipe -DOS_FREEBSD -fstack-protector-strong -isystem /usr/local/include -fno-strict-aliasing -Wno-inconsistent-missing-override -Wno-unused-parameter -Wno-unused-variable -Wno-unused-private-field -isystem /usr/local/include -std=c++17 -fPIC -DROCKSDB_DLL -DROCKSDB_USE_RTTI -g -W -Wextra -Wall -Wsign-compare -Wshadow -Wunused-parameter -Werror -I. -I./include -std=c++17 -O2 -pipe -DOS_FREEBSD -fstack-protector-strong -isystem /usr/local/include -fno-strict-aliasing -Wno-inconsistent-missing-override -Wno-unused-parameter -Wno-unused-variable -Wno-unused-private-field -isystem /usr/local/include -std=c++17 -faligned-new -DHAVE_ALIGNED_NEW -DROCKSDB_PLATFORM_POSIX -DROCKSDB_LIB_IO_POSIX -O2 -pipe -DOS_FREEBSD -fstack-protector-strong -isystem /usr/local/include -fno-strict-aliasing -D_REENTRANT -DOS_FREEBSD -DSNAPPY -DGFLAGS=1 -DZLIB -DBZIP2 -DLZ4 -DZSTD -DROCKSDB_MALLOC_USABLE_SIZE -DROCKSDB_PTHREAD_ADAPTIVE_MUTEX -DROCKSDB_BACKTRACE -DROCKSDB_SCHED_GETCPU_PRESENT -isystem third-party/gtest-1.8.1/fused-src -O2 -fno-omit-frame-pointer -momit-leaf-frame-pointer -DNDEBUG -Woverloaded-virtual -Wnon-virtual-dtor -Wno-missing-field-initializers -Wno-invalid-offsetof -c db_stress_tool/db_stress_common.cc -o db_stress_tool/db_stress_common.o db_stress_tool/db_stress_common.cc:204:17: error: format specifies type 'unsigned long' but the argument has type 'size_t' (aka 'unsigned int') [-Werror,-Wformat] block_cache->GetCapacity()); ^~~~~~~~~~~~~~~~~~~~~~~~~~ 1 error generated. ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12719 Reviewed By: hx235 Differential Revision: D58093539 Pulled By: jaykorean fbshipit-source-id: 400cae3a4b0d23b168937a5388065ef1c4b8b56e |
||
Andrii Lysenko | e4428b7eb9 |
More details for 'tail prefetch size is calculated based on' (#12667)
Summary: These messages indicate that SST file was created by a pre-9.0.0 RocksDB. Eventually, `TailPrefetchStats` might be removed, so it would be more informative if log message also included name of the affected SST file. Issue: https://github.com/facebook/rocksdb/issues/12664 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12667 Reviewed By: ajkr Differential Revision: D57464025 Pulled By: hx235 fbshipit-source-id: 12f2f2635e3092f8c29362aa132462492b5c1417 |
||
hgy@ruijie.com.cn | 21a16f9e64 |
add c-api to get default cf handle (#12514)
Summary: rocksdb_batched_multi_get_cf has performance improvement than normal multi_get, however it needs a cf_handle arg, so add a C-API to get and destroy the default cf_handle, as many user only use the default cf. Fixes https://github.com/facebook/rocksdb/issues/12316 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12514 Reviewed By: hx235 Differential Revision: D55922517 Pulled By: ajkr fbshipit-source-id: c4cc4289f2cfd9efbb8f390a44a9d8d1ed08d9f0 |
||
Levi Tamasi | b9e82f5162 |
Mention two fixes (#12677 and #12681) in the changelog (#12730)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12730 Reviewed By: jowlyzhang Differential Revision: D58092079 fbshipit-source-id: 708437e705f9d9f770c9ba1a1a9b3c369b0a4b79 |
||
Andrew Kryczka | c3ae569792 |
Update the main branch for the 9.3 release (#12726)
Summary: Cut the 9.3.fb branch as of 5/17 11:59pm. Also, cherry-picked all bug fixes that have happened since then. Removed their files from unreleased_history/ since those fixes will appear in 9.3.0, so there seems no use repeating them in any later release. Release branch: https://github.com/facebook/rocksdb/tree/9.3.fb Tests: https://github.com/facebook/rocksdb/actions/runs/9342097111 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12726 Reviewed By: ltamasi Differential Revision: D58069263 Pulled By: ajkr fbshipit-source-id: c4f557bc8dbc20ce53021ac7e97a24f930542bf9 |
||
Jay Huh | b7fc9ada9e |
Temporarily disable multi_cf_iter in stress test (#12728)
Summary: We plan to re-enable the test after fixing the test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12728 Test Plan: N/A. Disabling the test Reviewed By: hx235 Differential Revision: D58071284 Pulled By: jaykorean fbshipit-source-id: af6b45ec7654f9c7b40c36d3b59c7087e27a7af9 |
||
Levi Tamasi | 023a808417 |
Disable iterator refresh for CoalescingIterator in TestIterateAgainstExpected (#12723)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12723 `CoalescingIterator` doesn't support `Refresh` currently; the patch adds a check that was missing from https://github.com/facebook/rocksdb/pull/12721 to disable this operation when multi-CF iterators are in use in the stress test. Reviewed By: jaykorean Differential Revision: D58053334 fbshipit-source-id: 3146f0e7e87230b49b244cecdfcee345c0ce78fa |
||
Yu Zhang | fc59d8f9c6 |
Add public API WriteWithCallback to support custom callbacks (#12603)
Summary: This PR adds a `DB::WriteWithCallback` API that does the same things as `DB::Write` while takes an argument `UserWriteCallback` to execute custom callback functions during the write. We currently support two types of callback functions: `OnWriteEnqueued` and `OnWalWriteFinish`. The former is invoked after the write is enqueued, and the later is invoked after WAL write finishes when applicable. These callback functions are intended for users to use to improve synchronization between concurrent writes, their execution is on the write's critical path so it will impact the write's latency if not used properly. The documentation for the callback interface mentioned this and suggest user to keep these callback functions' implementation minimum. Although transaction interfaces' writes doesn't yet allow user to specify such a user write callback argument, the `DBImpl::Write*` type of APIs do not differentiate between regular DB writes or writes coming from the transaction layer when it comes to supporting this `UserWriteCallback`. These callbacks works for all the write modes including: default write mode, Options.two_write_queues, Options.unordered_write, Options.enable_pipelined_write Pull Request resolved: https://github.com/facebook/rocksdb/pull/12603 Test Plan: Added unit test in ./write_callback_test Reviewed By: anand1976 Differential Revision: D58044638 Pulled By: jowlyzhang fbshipit-source-id: 87a84a0221df8f589ec8fc4d74597e72ce97e4cd |
||
Jay Huh | f3b7e959b3 |
Add CoalescingIterator to TestIterateAgainstExpected (#12721)
Summary: Continuing from https://github.com/facebook/rocksdb/pull/12706. Adding the CoalescingIterator to `TestIterateAgainstExpected` as well when `use_multi_cf_iterator` is set True Pull Request resolved: https://github.com/facebook/rocksdb/pull/12721 Test Plan: ``` python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=0 --use_put_entity_one_in=1 --use_multi_get=1 --use_multi_cf_iterator=1 --verify_iterator_with_expected_state_one_in=2 ``` Reviewed By: ltamasi Differential Revision: D58033811 Pulled By: jaykorean fbshipit-source-id: 7caf39883e277e695b653e295ad72b1004169ca0 |
||
Levi Tamasi | 6f17056e40 |
Add transactional/read-your-own-write MultiGetEntity stress test (#12717)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12717 The PR adds `Transaction::MultiGetEntity` to the stress tests. Similarly to what we do for `Transaction::MultiGet`, in this mode we open a transaction and randomly add writes for some of the queried keys to it while keeping track of the values written on a per-key basis. The results of `Transaction::MultiGetEntity` can then be validated against these expected values (in order to test the read-your-own-writes functionality) as well as the results returned by `Transaction::GetEntity` for the same keys. Reviewed By: jaykorean Differential Revision: D57990210 fbshipit-source-id: 9bf3bb292051c2c57757f86b517919197b03c524 |
||
Jay Huh | a901ef48f0 |
Introduce use_multi_cf_iterator in stress test (#12706)
Summary: Introduce `use_multi_cf_iterator`, and when it's set, use `CoalescingIterator` in `TestIterate()`. Because all the column families contain the same data in today's Stress Test, we can compare `CoalescingIterator` against any `DBIter` from any of the column families. Currently, coalescing logic verification is done by unit tests, but we can extend the stress test to support different data in different column families in the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12706 Test Plan: ``` python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=0 --use_put_entity_one_in=1 --use_multi_get=1 --use_multi_cf_iterator=1 ``` **More PRs to come** - Use `AttributeGroupIterator` when both `use_multi_cf_iterator` and `use_attribute_group` are true - Support `Refresh()` in `CoalescingIterator` - Extend Stress Test to support different data in different CFs (Long-term) Reviewed By: ltamasi Differential Revision: D58020247 Pulled By: jaykorean fbshipit-source-id: 8e2483b85cf2bb0f5a9bb44851601bbf063484ec |
||
Po-Chuan Hsieh | 76aa0d9ee2 |
Fix build on FreeBSD (#12714)
Summary: The error message is as follows: ``` port/stack_trace.cc:286:7: error: use of undeclared identifier 'waitpid' waitpid(child_pid, &wstatus, 0); ^ port/stack_trace.cc:287:11: error: use of undeclared identifier 'WIFEXITED' if (WIFEXITED(wstatus) && WEXITSTATUS(wstatus) == EXIT_SUCCESS) { ^ port/stack_trace.cc:287:33: error: use of undeclared identifier 'WEXITSTATUS' if (WIFEXITED(wstatus) && WEXITSTATUS(wstatus) == EXIT_SUCCESS) { ^ 3 errors generated. ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12714 Reviewed By: ajkr Differential Revision: D57970244 Pulled By: jaykorean fbshipit-source-id: afdad9af16b4bfe5e059bc82180f74b2c3260ed9 |
||
Yu Zhang | 8a462eefae |
Add user timestamp support into interactive query command (#12716)
Summary: As titled. This PR also makes the interactive query tool more permissive by allowing the user to continue to try out a different command after the previous command received some allowed errors, such as `Status::NotFound`, `Status::InvalidArgument`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12716 Test Plan: Manually tested: ``` yuzhangyu@yuzhangyu-mbp rocksdb % ./ldb --db=$TEST_DB --key_hex --value_hex query get 0x0000000000000000 --read_timestamp=1115559245398440 0x0000000000000000|timestamp:1115559245398440 ==> 0x07000000000102030C0D0E0F08090A0B14151617101112131C1D1E1F18191A1B24252627202122232C2D2E2F28292A2B34353637303132333C3D3E3F38393A3B put 0x0000000000000000 0x0000 put 0x0000000000000000 => 0x0000 failed: Invalid argument: cannot call this method on column family default that enables timestamp put 0x0000000000000000 aha 0x0000 put gets invalid argument: Invalid argument: user provided timestamp is not a valid uint64 value. put 0x0000000000000000 1115559245398441 0x08000000000102030C0D0E0F08090A0B14151617101112131C1D1E1F18191A1B24252627202122232C2D2E2F28292A2B34353637303132333C3D3E3F38393A3B put 0x0000000000000000 write_ts: 1115559245398441 => 0x08000000000102030C0D0E0F08090A0B14151617101112131C1D1E1F18191A1B24252627202122232C2D2E2F28292A2B34353637303132333C3D3E3F38393A3B succeeded delete 0x0000000000000000 delete 0x0000000000000000 failed: Invalid argument: cannot call this method on column family default that enables timestamp delete 0x0000000000000000 1115559245398442 delete 0x0000000000000000 write_ts: 1115559245398442 succeeded get 0x0000000000000000 --read_timestamp=1115559245398442 get 0x0000000000000000 read_timestamp: 1115559245398442 status: NotFound: get 0x0000000000000000 --read_timestamp=1115559245398441 0x0000000000000000|timestamp:1115559245398441 ==> 0x08000000000102030C0D0E0F08090A0B14151617101112131C1D1E1F18191A1B24252627202122232C2D2E2F28292A2B34353637303132333C3D3E3F38393A3B count --from=0x0000000000000000 --to=0x0000000000000001 scan from 0x0000000000000000 to 0x0000000000000001failed: Invalid argument: cannot call this method on column family default that enables timestamp count --from=0x0000000000000000 --to=0x0000000000000001 --read_timestamp=1115559245398442 0 count --from=0x0000000000000000 --to=0x0000000000000001 --read_timestamp=1115559245398441 1 ``` Reviewed By: ltamasi Differential Revision: D57992183 Pulled By: jowlyzhang fbshipit-source-id: 720525de22412d16aa952870e088f2c371459ece |
||
Peter Dillinger | 7127119ae9 |
Refactor SyncWAL and SyncClosedLogs for code sharing (#12707)
Summary: These functions were very similar and did not make sense for maintaining separately. This is not a pure refactor but I think bringing the behaviors closer together should reduce long term risk of unintentionally divergent behavior. This change is motivated by some forthcoming WAL handling fixes for Checkpoint and Backups. * Sync() is always used on closed WALs, like the old SyncClosedWals. SyncWithoutFlush() is only used on the active (maybe) WAL. Perhaps SyncWithoutFlush() should be used whenever available, but I don't know which is preferred, as the previous state of the code was inconsistent. * Syncing the WAL dir is selective based on need, like old SyncWAL, rather than done always like old SyncClosedLogs. This could be a performance improvement that was never applied to SyncClosedLogs but now is. We might still sync the dir more times than necessary in the case of parallel SyncWAL variants, but on a good FileSystem that's probably not too different performance-wise from us implementing something to have threads wait on each other. Cosmetic changes: * Rename internal function SyncClosedLogs to SyncClosedWals * Merging the sync points into the common implementation between the two entry points isn't pretty, but should be fine. Recommended follow-up: * Clean up more confusing naming like log_dir_synced_ Pull Request resolved: https://github.com/facebook/rocksdb/pull/12707 Test Plan: existing tests Reviewed By: anand1976 Differential Revision: D57870856 Pulled By: pdillinger fbshipit-source-id: 5455fba016d25dd5664fa41b253f18db2ca8919a |
||
Levi Tamasi | 01179678b2 |
Refactor the non-attribute-group/attribute-group code paths in NonBatchedOpsStressTest::TestMultiGetEntity (#12715)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12715 The patch refactors/deduplicates the non-attribute-group and attribute-group code paths in `NonBatchedOpsStressTest::TestMultiGetEntity` by introducing two new generic lambdas `verify_expected_errors` and `check_results` (the latter of which subsumes the existing `handle_results`) that can handle both types of APIs. This change also serves as groundwork for the upcoming transactional `MultiGetEntity` stress tests. Reviewed By: jaykorean Differential Revision: D57977700 fbshipit-source-id: 83a18a9e57f46ea92ba07b2f0dca3e9bc353f257 |
||
anand76 | 0ae3d9f98d |
Fix stale memory access with FSBuffer and tiered sec cache (#12712)
Summary: A `BlockBasedTable` with `TieredSecondaryCache` containing a NVM cache inserts blocks into the compressed cache and the corresponding compressed block into the NVM cache. The `BlockFetcher` is used to get the uncompressed and compressed blocks by calling `ReadBlockContents()` and `GetUncompressedBlock()` respectively. If the file system supports FSBuffer (i.e returning a FS allocated buffer rather than caller provided), that buffer gets freed between the two calls. This PR fixes it by making the FSBuffer unique pointer a member rather than local variable. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12712 Test Plan: 1. Add a unit test 2. Release validation stress test Reviewed By: jaykorean Differential Revision: D57974026 Pulled By: anand1976 fbshipit-source-id: cfa895914e74b4f628413b40e6e39d8d8e5286bd |
||
Rulin Huang | 20777b96cb |
Optimizations in notify-one (#12545)
Summary: We tested on icelake server (vcpu=160). The default configuration is allow_concurrent_memtable_write=1, thread number =activate core number. With our optimizations, the improvement can reach up to 184% in fillseq case. op/s is as the performance indicator in db_bench, and the following are performance improvements in some cases in db_bench. | case name | optimized/original | |-------------------:|--------------------:| | fillrandom | 182% | | fillseq | 184% | | fillsync | 136% | | overwrite | 179% | | randomreplacekeys | 180% | | randomtransaction | 161% | | updaterandom | 163% | | xorupdaterandom | 165% | With analysis, we find that although the process of writing memtable is processed in parallel, the process of waking up the writers is not processed in parallel, which means that only one writers is responsible for the sequential waking up other writers. The following is our method to optimize this process. Assume that there are currently n threads in total, we parallelize SetState in LaunchParallelMemTableWriters. To wake up each writer to write its own memtable, the leader writer first wakes up the (n^0.5-1) caller writers, and then those callers and the leader will wake up n/x separately to write to the memtable. This reduces the number for the leader's to SetState n-1 writers to 2*(n^0.5) writers in turn. A reproduction script: ./db_bench --benchmarks="fillrandom" --threads ${number of all activate vcpu} --seed 1708494134896523 --duration 60 ![image](https://github.com/facebook/rocksdb/assets/22110918/c5eca02f-93b3-4434-bba2-5155fc892a97) Pull Request resolved: https://github.com/facebook/rocksdb/pull/12545 Reviewed By: ajkr Differential Revision: D57422827 Pulled By: cbi42 fbshipit-source-id: 94127937c0c61e4241720bd902c82c607b7b2431 |
||
Levi Tamasi | b6ea246333 |
Fix NonBatchOpsStressTest::TestGetEntity by adding fuzzy match (#12711)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12711 The patch adds the missing other half of https://github.com/facebook/rocksdb/pull/12709: when there is no locking in a read test, we have to be more permissive when it comes to values returned by queries. In particular, any expected state value in a small window around the read call should be allowed, and discrepancies in the presence/absence of a key should only be treated as a failure if the key is guaranteed to have not existed/existed during the above window. Reviewed By: hx235 Differential Revision: D57938678 fbshipit-source-id: cd5c8bc2e014ec12ea4daf441965f3ec2115663e |
||
Changyu Bi | af3be5255a |
Fail DeleteRange() early when row_cache is configured (#12710)
Summary: https://github.com/facebook/rocksdb/issues/12512 added the sanity check for this incompatible combination. However, it does the check during memtable insertion which can turn the DB into read-only mode. This PR moves the check earlier so that this write failure will not turn the DB into read-only mode and affect other DB operations. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12710 Test Plan: * updated unit test `DBRangeDelTest.RowCache` to write to DB after a failed DeleteRange(). The test fails before this PR. Reviewed By: ajkr Differential Revision: D57925188 Pulled By: cbi42 fbshipit-source-id: 8bf001bd3fcf05635411ba28bc4a037321942879 |
||
Levi Tamasi | d9316260e4 |
Remove unneccessary MutexLock from NonBatchedOpStressTest::TestGetEntity (#12709)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12709 This is most likely copypasta from `TestGet` from before https://github.com/facebook/rocksdb/pull/11058 . There is no need to lock the mutex for the key for reads; in fact, doing so is detrimental to test coverage since it locks out concurrent writers. Reviewed By: jowlyzhang Differential Revision: D57915207 fbshipit-source-id: eb0dbf6b84e5408b87d96dd47597511996e206a7 |
||
anand76 | 9cc6168c98 |
Add LDB command and option for follower instances (#12682)
Summary: Add the `--leader_path` option to specify the directory path of the leader for a follower RocksDB instance. This PR also adds a `count` command to the repl shell. While not specific to followers, it is useful for testing purposes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12682 Reviewed By: jowlyzhang Differential Revision: D57642296 Pulled By: anand1976 fbshipit-source-id: 53767d496ecadc363ff92cd958b8e15a7bf3b151 |
||
Levi Tamasi | 5cec4bbcab |
Support PutEntity as write method in the transactional MultiGet stress test (#12699)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12699 The patch adds `PutEntity` to the potential write operations used in the read-your-own-writes tests for `Transaction::MultiGet`. Note that since the stress test generates wide-column structures which have the value returned by `GenerateValue` in the default column, this does not affect the results returned by the `MultiGet` API (unless we have a bug). The wide-column entity is generated according to the usual rules based on the value base and the `use_put_entity_one_in` flag. The entire entity structure will be validated by the upcoming stress test for `Transaction::MultiGetEntity`, where we also plan to leverage this logic. Reviewed By: jowlyzhang Differential Revision: D57799075 fbshipit-source-id: 5f86c2b2b3ceee8e1b8bf7453c02f1f1b1b00751 |
||
HypenZou | 8765a0f546 |
Fix version edit dump in json (#12703)
Summary: **Context/Summary:** the flag --json of manifest_dump in ldb tool has no effect The bug may be introduced by pr https://github.com/facebook/rocksdb/pull/8378 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12703 Reviewed By: cbi42 Differential Revision: D57848094 Pulled By: ajkr fbshipit-source-id: 3d1ce65528bf4ce9c53593a7208406ab90e8994b |
||
muthukrishnan24 | 259f21e695 |
Add WB, WBWI Create, UpdateTimestamp, Iterator::Refresh in C API (#10529)
Summary: This PR adds UpdateTimestamp API of WriteBatch and WBWI, create WB, WBWI with all options and Iterator Refresh in C API Pull Request resolved: https://github.com/facebook/rocksdb/pull/10529 Reviewed By: cbi42 Differential Revision: D57826913 Pulled By: ajkr fbshipit-source-id: d2ec840129f61a1d3a5a12e859728be98ebbad2f |
||
Jaepil Jeong | c115eb6162 |
Fix compile errors in C++23 (#12106)
Summary: This PR fixes compile errors in C++23. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12106 Reviewed By: cbi42 Differential Revision: D57826279 Pulled By: ajkr fbshipit-source-id: 594abfd8eceaf51eaf3bbabf7696c0bb5e0e9a68 |
||
Daniel Vasquez Lopez | 7c6c632ea9 |
Use std::optional instead of std::unique_ptr to conditionally create a read lock. (#12704)
Summary: This change replaces the use of `std::unique_ptr` with `std::optional` for conditionally constructing a `ReadLock` object. The read lock object was recently introduced in PR https://github.com/facebook/rocksdb/issues/12624. This change makes the code more concise and clarifies that the lock is not meant to be transferred (as `std::unique_ptr` is movable). It also avoids a heap allocation. There are no functional changes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12704 Reviewed By: cbi42 Differential Revision: D57848192 Pulled By: ajkr fbshipit-source-id: da48c77aac33b51ba5dcc238f98fc48ccf234a21 |
||
Peter Dillinger | d2ef70872f |
Rename, deprecate LogFile and VectorLogPtr (#12695)
Summary: These names are confusing with `Logger` etc. so moving to `WalFile` etc. Other small, related name refactorings. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12695 Test Plan: Left most unit tests using old names as an API compatibility test. Non-test code compiles with deprecated names removed. No functional changes. Reviewed By: ajkr Differential Revision: D57747458 Pulled By: pdillinger fbshipit-source-id: 7b77596b9c20d865d43b9dc66c30c8bd2b3b424f |
||
Changyu Bi | 0ee7f8bacb |
Fix max_read_amp value in crash test (#12701)
Summary: It should be no less than `level0_file_num_compaction_trigger`(which defaults to 4) when set to a positive value. Otherwise DB open will fail. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12701 Test Plan: crash test not failing DB open due to this option value. Reviewed By: ajkr Differential Revision: D57825062 Pulled By: cbi42 fbshipit-source-id: 22d8e12aeceb5cef815157845995a8448552e2d2 |
||
Fan Zhang(DevX) | 0e5ed2e0c8 |
add export_file to rockdb TARGETS generator and re-gen
Summary: we are converting the implicit loads to explicit loads, then remove the hidden loads in fbcode macroes. details see https://fb.workplace.com/groups/devx.build.bffs/permalink/7481848805183560/ Reviewed By: JakobDegen Differential Revision: D57800976 fbshipit-source-id: a893aa2aa9237704ba9eb998cba210222c95dd2f |
||
Levi Tamasi | bd801bd98c |
Factor out the RYW transaction building logic into a helper (#12697)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12697 As groundwork for stress testing `Transaction::MultiGetEntity`, the patch factors out the logic for adding transactional writes for some of the keys in a `MultiGet` batch into a separate helper method called `MaybeAddKeyToTxnForRYW`. Reviewed By: jowlyzhang Differential Revision: D57791830 fbshipit-source-id: ef347ba6e6e82dfe5cedb4cf67dd6d1503901d89 |
||
Changyu Bi | fecb10c2fa |
Improve universal compaction sorted-run trigger (#12477)
Summary: Universal compaction currently uses `level0_file_num_compaction_trigger` for two purposes: 1. the trigger for checking if there is any compaction to do, and 2. the limit on the number of sorted runs. RocksDB will do compaction to keep the number of sorted runs no more than the value of this option. This can make the option inflexible. A value that is too small causes higher write amp: more compactions to reduce the number of sorted runs. A value that is too big delays potential compaction work and causes worse read performance. This PR introduce an option `CompactionOptionsUniversal::max_read_amp` for only the second purpose: to specify the hard limit on the number of sorted runs. For backward compatibility, `max_read_amp = -1` by default, which means to fallback to the current behavior. When `max_read_amp > 0`,`level0_file_num_compaction_trigger` will only serve as a trigger to find potential compaction. When `max_read_amp = 0`, RocksDB will auto-tune the limit on the number of sorted runs. The estimation is based on DB size, write_buffer_size and size_ratio, so it is adaptive to the size change of the DB. See more in `UniversalCompactionBuilder::PickCompaction()`. Alternatively, users now can configure `max_read_amp` to a very big value and keep `level0_file_num_compaction_trigger` small. This will allow `size_ratio` and `max_size_amplification_percent` to control the number of sorted runs. This essentially disables compactions with reason kUniversalSortedRunNum. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12477 Test Plan: * new unit test * existing unit test for default behavior * updated crash test with the new option * benchmark: * Create a DB that is roughly 24GB in the last level. When `max_read_amp = 0`, we estimate that the DB needs 9 levels to avoid excessive compactions to reduce the number of sorted runs. * We then run fillrandom to ingest another 24GB data to compare write amp. * case 1: small level0 trigger: `level0_file_num_compaction_trigger=5, max_read_amp=-1` * write-amp: 4.8 * case 2: auto-tune: `level0_file_num_compaction_trigger=5, max_read_amp=0` * write-amp: 3.6 * case 3: auto-tune with minimal trigger: `level0_file_num_compaction_trigger=1, max_read_amp=0` * write-amp: 3.8 * case 4: hard-code a good value for trigger: `level0_file_num_compaction_trigger=9` * write-amp: 2.8 ``` Case 1: ** Compaction Stats [default] ** Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 0/0 0.00 KB 1.0 0.0 0.0 0.0 22.6 22.6 0.0 1.0 0.0 163.2 141.94 111.10 108 1.314 0 0 0.0 0.0 L45 8/0 1.81 GB 0.0 39.6 11.1 28.5 39.3 10.8 0.0 3.5 209.0 207.3 194.25 191.29 43 4.517 348M 2498K 0.0 0.0 L46 13/0 3.12 GB 0.0 15.3 9.5 5.8 15.0 9.3 0.0 1.6 203.1 199.3 77.13 75.88 16 4.821 134M 2362K 0.0 0.0 L47 19/0 4.68 GB 0.0 15.4 10.5 4.9 14.7 9.8 0.0 1.4 204.0 194.9 77.38 76.15 8 9.673 135M 5920K 0.0 0.0 L48 38/0 9.42 GB 0.0 19.6 11.7 7.9 17.3 9.4 0.0 1.5 206.5 182.3 97.15 95.02 4 24.287 172M 20M 0.0 0.0 L49 91/0 22.70 GB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0 0.0 0.0 Sum 169/0 41.74 GB 0.0 89.9 42.9 47.0 109.0 61.9 0.0 4.8 156.7 189.8 587.85 549.45 179 3.284 791M 31M 0.0 0.0 Case 2: ** Compaction Stats [default] ** Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 1/0 214.47 MB 1.2 0.0 0.0 0.0 22.6 22.6 0.0 1.0 0.0 164.5 140.81 109.98 108 1.304 0 0 0.0 0.0 L44 0/0 0.00 KB 0.0 1.3 1.3 0.0 1.2 1.2 0.0 1.0 206.1 204.9 6.24 5.98 3 2.081 11M 51K 0.0 0.0 L45 4/0 844.36 MB 0.0 7.1 5.4 1.7 7.0 5.4 0.0 1.3 194.6 192.9 37.41 36.00 13 2.878 62M 489K 0.0 0.0 L46 11/0 2.57 GB 0.0 14.6 9.8 4.8 14.3 9.5 0.0 1.5 193.7 189.8 77.09 73.54 17 4.535 128M 2411K 0.0 0.0 L47 24/0 5.81 GB 0.0 19.8 12.0 7.8 18.8 11.0 0.0 1.6 191.4 181.1 106.19 101.21 9 11.799 174M 9166K 0.0 0.0 L48 38/0 9.42 GB 0.0 19.6 11.8 7.9 17.3 9.4 0.0 1.5 197.3 173.6 101.97 97.23 4 25.491 172M 20M 0.0 0.0 L49 91/0 22.70 GB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0 0.0 0.0 Sum 169/0 41.54 GB 0.0 62.4 40.3 22.1 81.3 59.2 0.0 3.6 136.1 177.2 469.71 423.94 154 3.050 549M 32M 0.0 0.0 Case 3: ** Compaction Stats [default] ** Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 0/0 0.00 KB 5.0 0.0 0.0 0.0 22.6 22.6 0.0 1.0 0.0 163.8 141.43 111.13 108 1.310 0 0 0.0 0.0 L44 0/0 0.00 KB 0.0 0.8 0.8 0.0 0.8 0.8 0.0 1.0 201.4 200.2 4.26 4.19 2 2.130 7360K 33K 0.0 0.0 L45 4/0 844.38 MB 0.0 6.3 5.0 1.2 6.2 5.0 0.0 1.2 202.0 200.3 31.81 31.50 12 2.651 55M 403K 0.0 0.0 L46 7/0 1.62 GB 0.0 13.3 8.8 4.6 13.1 8.6 0.0 1.5 198.9 195.7 68.72 67.89 17 4.042 117M 1696K 0.0 0.0 L47 24/0 5.81 GB 0.0 21.7 12.9 8.8 20.6 11.8 0.0 1.6 198.5 188.6 112.04 109.97 12 9.336 191M 9352K 0.0 0.0 L48 41/0 10.14 GB 0.0 24.8 13.0 11.8 21.9 10.1 0.0 1.7 198.6 175.6 127.88 125.36 6 21.313 218M 25M 0.0 0.0 L49 91/0 22.70 GB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0 0.0 0.0 Sum 167/0 41.10 GB 0.0 67.0 40.5 26.4 85.4 58.9 0.0 3.8 141.1 179.8 486.13 450.04 157 3.096 589M 36M 0.0 0.0 Case 4: ** Compaction Stats [default] ** Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 0/0 0.00 KB 0.7 0.0 0.0 0.0 22.6 22.6 0.0 1.0 0.0 158.6 146.02 114.68 108 1.352 0 0 0.0 0.0 L42 0/0 0.00 KB 0.0 1.7 1.7 0.0 1.7 1.7 0.0 1.0 185.4 184.3 9.25 8.96 4 2.314 14M 67K 0.0 0.0 L43 0/0 0.00 KB 0.0 2.5 2.5 0.0 2.5 2.5 0.0 1.0 197.8 195.6 13.01 12.65 4 3.253 22M 202K 0.0 0.0 L44 4/0 844.40 MB 0.0 4.2 4.2 0.0 4.1 4.1 0.0 1.0 188.1 185.1 22.81 21.89 5 4.562 36M 503K 0.0 0.0 L45 13/0 3.12 GB 0.0 7.5 6.5 1.0 7.2 6.2 0.0 1.1 188.7 181.8 40.69 39.32 5 8.138 65M 2282K 0.0 0.0 L46 17/0 4.18 GB 0.0 8.3 7.1 1.2 7.9 6.6 0.0 1.1 192.2 181.8 44.23 43.06 4 11.058 73M 3846K 0.0 0.0 L47 22/0 5.34 GB 0.0 8.9 7.5 1.4 8.2 6.8 0.0 1.1 189.1 174.1 48.12 45.37 3 16.041 78M 6098K 0.0 0.0 L48 27/0 6.58 GB 0.0 9.2 7.6 1.6 8.2 6.6 0.0 1.1 195.2 172.9 48.52 47.11 2 24.262 81M 9217K 0.0 0.0 L49 91/0 22.70 GB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0 0.0 0.0 Sum 174/0 42.74 GB 0.0 42.3 37.0 5.3 62.4 57.1 0.0 2.8 116.3 171.3 372.66 333.04 135 2.760 372M 22M 0.0 0.0 setup: ./db_bench --benchmarks=fillseq,compactall,waitforcompaction --num=200000000 --compression_type=none --disable_wal=1 --compaction_style=1 --num_levels=50 --target_file_size_base=268435456 --max_compaction_bytes=6710886400 --level0_file_num_compaction_trigger=10 --write_buffer_size=268435456 --seed 1708494134896523 benchmark: ./db_bench --benchmarks=overwrite,waitforcompaction,stats --num=200000000 --compression_type=none --disable_wal=1 --compaction_style=1 --write_buffer_size=268435456 --level0_file_num_compaction_trigger=5 --target_file_size_base=268435456 --use_existing_db=1 --num_levels=50 --writes=200000000 --universal_max_read_amp=-1 --seed=1716488324800233 ``` Reviewed By: ajkr Differential Revision: D55370922 Pulled By: cbi42 fbshipit-source-id: 9be69979126b840d08e93e7059260e76a878bb2a |