mirror of https://github.com/facebook/rocksdb.git
11841 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
codeoos | f4a5446cab |
Fix error maybe-uninitialized #11100 (#11101)
Summary: In this issue [11100](https://github.com/facebook/rocksdb/issues/11100) I try to upgrade dependencies of [BaikalDB](https://github.com/baidu/BaikalDB) and tool chain to gcc-12.I found that when I build rocksdb v6.26.0(maybe I can use newer version),I found that in file trace_replay/trace_replay.cc,the compiler tell me "error mybe-uninitialized".I dound that it can be fixed very easy,so I make this pull request. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11101 Reviewed By: ajkr Differential Revision: D42583031 Pulled By: cbi42 fbshipit-source-id: 7f399f09441a30fe88b83cec5e2fd9885bad5c06 |
|
leipeng | a5bcbcd8be |
remove unused InternalIteratorBase::is_mutable_ (#11104)
Summary: `InternalIteratorBase::is_mutable_` is not used any more, remove it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11104 Reviewed By: ajkr Differential Revision: D42582747 Pulled By: cbi42 fbshipit-source-id: d30bf75151fc8414df0ae112a6ec4943b5b7330b |
|
Peter Dillinger | fd911f9655 |
Upgrade xxhash.h to latest dev (#11098)
Summary: Upgrading xxhash.h to latest dev version as of 1/17/2023, which is d7197ddea81364a539051f116ca77926100fc77f This should improve performance on some ARM machines. I allowed some of our RocksDB-specific changes to be made obsolete where it seemed appropriate, for example * xxhash.h has its own fallthrough marker (which I hope works for us) * As in https://github.com/Cyan4973/xxHash/pull/549 Merging and resolving conflicts one way or the other was all that went into this diff. Except I had to mix the two sides around `defined(__loongarch64)` How I did the upgrade (for future reference), so that I could use usual merge conflict resolution: ``` # New branch to help with merging git checkout -b xxh_merge_base # Check out RocksDB revision before last xxhash.h upgrade git reset --hard 22161b7547652af82a5dc67458de9ca8946ac83d^ # Create a commit with the raw base version from xxHash repo (from xxHash repo) git show 2c611a76f914828bed675f0f342d6c4199ffee1e:xxhash.h > ../rocksdb/util/xxhash.h # In RocksDB repo git commit -a # Merge in the last xxhash.h upgrade git merge |
|
Changyu Bi | e9d6a0d7ce |
Fix asan failure caused by range tombstone start key use-after-free (#11106)
Summary: the `last_tombstone_start_user_key` variable in `BuildTable()` and in `CompactionOutputs::AddRangeDels()` may point to a start key that is freed if user-defined timestamp is enabled. This was causing ASAN failure and this PR fixes this issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11106 Test Plan: Added UT for repro. Reviewed By: ajkr Differential Revision: D42590862 Pulled By: cbi42 fbshipit-source-id: c493265ececdf89636d801d55ae929806c4d4b2c |
|
akankshamahajan | bd4b8d6487 |
Fix crash in block_cache_trace_analyzer if reference key is null in case of MultiGet (#11042)
Summary: Same as title Error: ``` block_cache_trace_analyzer: ./db/dbformat.h:421: uint64_t rocksdb::GetInternalKeySeqno(const rocksdb::Slice&): Assertion `n >= kNumInternalBytes' failed. Aborted (core dumped) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11042 Test Plan: - Added new unit test which fails without the fix. - Also ran manually on traces to confirm. Reviewed By: anand1976 Differential Revision: D42481587 Pulled By: akankshamahajan15 fbshipit-source-id: 7c33eb03a4a4d8ffbabcfbe0efa1e4d11bde3ba2 |
|
Changyu Bi | 4d0f9a995c |
Consider TTL compaction file cutting earlier to prevent small output file (#11075)
Summary: in `CompactionOutputs::ShouldStopBefore()`, TTL-related states, `cur_files_to_cut_for_ttl_` and `next_files_to_cut_for_ttl_`, are not updated if the function returns early. This can cause unnecessary compaction output file cuttings and hence produce smaller output files, which may hurt write amp. See the example in the unit test for how this "unnecessary file cutting" can happen. This PR fixes this issue by moving the code for updating TTL states earlier in `CompactionOutputs::ShouldStopBefore()` so that the states are updated for each key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11075 Test Plan: - Added new unit test. Reviewed By: hx235 Differential Revision: D42398739 Pulled By: cbi42 fbshipit-source-id: 09fab66679c1a734abcfc31bcea33dd9aeb9dbc7 |
|
Changyu Bi | 6a82b68788 |
Avoid counting extra range tombstone compensated size in `AddRangeDels()` (#11091)
Summary: in `CompactionOutputs::AddRangeDels()`, range tombstones with the same start and end key but different sequence numbers all contribute to compensated range tombstone size. This PR removes this redundancy. This PR also includes a fix from https://github.com/facebook/rocksdb/issues/11067 where a range tombstone that is not within a file's range was being added to the file. This fixes an assertion failure for `icmp.Compare(start, end) <= 0` in VersionSet::ApproximateSize() when calculating compensated range tombstone size. Assertions and a comment/essay was added to reason that no such range tombstone will be added after this fix. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11091 Test Plan: - Added unit tests - Stress test with small key range: `python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=600 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10` Reviewed By: ajkr Differential Revision: D42521588 Pulled By: cbi42 fbshipit-source-id: 5bda3fe38997995314e1f7592319af12b69bc4f8 |
|
Changyu Bi | f515d9d203 |
Revert #10802 Consider range tombstone in compaction output file cutting (#11089)
Summary:
This reverts commit
|
|
leipeng | 3941c34950 |
db_bench: let -benchmark=compact respect -subcompactions (#11077)
Summary: When running `-benchmarks=compact`, `-subcompactions` does not take effect. `-subcompactions` option comment says it is for L0-L1 compactions, it is natural to extend it to CompactionRangeOptions.max_subcompactions. This PR set CompactionRangeOptions.max_subcompactions = FLAGS_subcompactions Pull Request resolved: https://github.com/facebook/rocksdb/pull/11077 Reviewed By: akankshamahajan15 Differential Revision: D42506251 Pulled By: ajkr fbshipit-source-id: f77c9a99d32ff7af59f3c452c9e16aaeb0360304 |
|
Wenlong Zhang | 1cfe3528a2 |
support loongarch64 for rocksdb (#10036)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10036 Reviewed By: hx235 Differential Revision: D42424074 Pulled By: ajkr fbshipit-source-id: 004adb75005a26bd01c5d568d1ec6ac442cd59dd |
|
anand76 | a510880346 |
Add a unit test for async prefetch fix in #11049 (#11084)
Summary: Add a unit test in prefetch_test for https://github.com/facebook/rocksdb/issues/11049 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11084 Test Plan: Verify the test fails without https://github.com/facebook/rocksdb/issues/11049 and passes with it Reviewed By: akankshamahajan15 Differential Revision: D42485828 Pulled By: anand1976 fbshipit-source-id: ae512f2d121745a1f5212645a9b58868976c1f83 |
|
Peter Dillinger | 9f7801c5f1 |
Major Cache refactoring, CPU efficiency improvement (#10975)
Summary: This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache). The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below. * static_cast lines of code +29 -35 (net removed 6) * reinterpret_cast lines of code +6 -32 (net removed 26) ## cache.h and secondary_cache.h * Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications: * Simpler for implementations to deal with just one Insert and one Lookup. * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428. * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks). * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below). * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc. * Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation. * Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.) * Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.) * Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774) * Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object. * Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change. ## typed_cache.h Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae). The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used. * PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value. * BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter. * FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue. * For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`. These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.) Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it. ## block_cache.h This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table. ## block_based_table_reader.cc Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation. The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions. ## block_based_table_builder.cc, cache_dump_load_impl.cc Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.) ## Everything else Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975 Test Plan: tests updated Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache): 34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844 34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594 34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297 34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523 34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602 34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293 34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926 34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488 233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984 233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922 233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559 233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93 233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418 233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273 233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691 233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82 1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55 1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02 1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45 1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24 1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92 1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78 1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36 1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83 Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn. Reviewed By: anand1976 Differential Revision: D42417818 Pulled By: pdillinger fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432 |
|
Changyu Bi | 0a2d3b663a |
Fix some unit test failure in ExternalSSTFileBasicTest (#11070)
Summary: valgrind build for `ExternalSSTFileBasicTest/ExternalSSTFileBasicTest.IngestFileWithMixedValueType` and `ExternalSSTFileBasicTest/ExternalSSTFileBasicTest.IngestFileWithGlobalSeqnoPickedSeqno` started failing (see error message in T141554665). I could not repro but I suspect it is due to file ingestion range overlapping with ongoing compaction, which caused a new global seqno being assigned after https://github.com/facebook/rocksdb/issues/10988. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11070 Test Plan: monitor future valgrind tests result. Reviewed By: hx235 Differential Revision: D42319056 Pulled By: cbi42 fbshipit-source-id: acbcd841a2a15e36b278f39ba514f4b9a6ee43ca |
|
Niklas Fiekas | ff04fb154b |
Add C API for ReadOptions::async_io (#11062)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11062 Reviewed By: hx235 Differential Revision: D42297489 Pulled By: ajkr fbshipit-source-id: 03fe1477c1ae1f8af73dc77a6986fdc7025edf4f |
|
ehds | 4737e1d41b |
fix shared state used after free (#11059)
Summary: Before this pr, the destruction order is `shared` -> `db_`(StressTest destruction) -> `stress`, but `compaction_filter` of `db_` will hold the `shared` pointer, so `shared` maybe used after free. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11059 Reviewed By: hx235 Differential Revision: D42297366 Pulled By: ajkr fbshipit-source-id: 17b314635359acacd5ba62f9db5f955f451133f7 |
|
Hui Xiao | b965a5a80e |
Add back Options::CompactionOptionsFIFO::allow_compaction to stress/crash test (#11063)
Summary: **Context/Summary:** https://github.com/facebook/rocksdb/pull/10777 was reverted (https://github.com/facebook/rocksdb/pull/10999) due to internal blocker and replaced with a better fix https://github.com/facebook/rocksdb/pull/10922. However, the revert also reverted the `Options::CompactionOptionsFIFO::allow_compaction` stress/crash coverage added by the PR. It's an useful coverage cuz setting `Options::CompactionOptionsFIFO::allow_compaction=true` will [increase](https://github.com/facebook/rocksdb/blob/7.8.fb/db/version_set.cc#L3255) the compaction score of L0 files for FIFO and then trigger more FIFO compaction. This speed up discovery of bug related to FIFO compaction like https://github.com/facebook/rocksdb/pull/10955. To see the speedup, compare the failure occurrence in following commands with `Options::CompactionOptionsFIFO::allow_compaction=true/false` ``` --fifo_allow_compaction=1 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=8.869062094789008 --bottommost_compression_type=none --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_style=2 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8589934591 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=2 --ingest_external_file_one_in=100 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --log2_keys_per_lock=10 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=40000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=7 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=0 --readpercent=15 --recycle_log_file_num=1 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=1 --use_full_merge_v1=1 --use_merge=0 --use_multiget=0 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=65 ``` Therefore this PR is adding it back to stress/crash test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11063 Test Plan: Rehearsal stress test to make sure stress/crash test is stable Reviewed By: ajkr Differential Revision: D42283650 Pulled By: hx235 fbshipit-source-id: 132e6396ab6e24d8dcb8fe51c62dd5211cdf53ef |
|
Changyu Bi | f24ef5d6ab |
Fix BackupEngineTest.ExcludeFiles memory leak (#11066)
Summary: Valgrind was complaining about the test BackupEngineTest.ExcludeFiles. The cause is backup_engine not being freed similar to https://github.com/facebook/rocksdb/issues/9610. ``` ==18228== Command: ./backup_engine_test --gtest_filter=BackupEngineTest.ExcludeFiles ==18228== Note: Google Test filter = BackupEngineTest.ExcludeFiles [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from BackupEngineTest [ RUN ] BackupEngineTest.ExcludeFiles [ OK ] BackupEngineTest.ExcludeFiles (16264 ms) [----------] 1 test from BackupEngineTest (16273 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (16306 ms total) [ PASSED ] 1 test. ==18228== ==18228== HEAP SUMMARY: ==18228== in use at exit: 14,099 bytes in 159 blocks ==18228== total heap usage: 255,328 allocs, 255,169 frees, 497,538,546 bytes allocated ==18228== ==18228== 19 bytes in 1 blocks are possibly lost in loss record 4 of 67 ==18228== at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==18228== by 0x1E752D: void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char*>(char*, char*, std::forward_iterator_tag) [clone .constprop.0] (basic_string.tcc:219) ==18228== by 0x1F1898: _M_construct_aux<char*> (basic_string.h:251) ==18228== by 0x1F1898: _M_construct<char*> (basic_string.h:270) ==18228== by 0x1F1898: basic_string (basic_string.h:455) ==18228== by 0x1F1898: construct<std::__cxx11::basic_string<char>, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (new_allocator.h:146) ==18228== by 0x1F1898: construct<std::__cxx11::basic_string<char>, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (alloc_traits.h:483) ==18228== by 0x1F1898: push_back (stl_vector.h:1189) ==18228== by 0x1F1898: rocksdb::(anonymous namespace)::TestFs::NewWritableFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::FileOptions const&, std::unique_ptr<rocksdb::FSWritableFile, std::default_delete<rocksdb::FSWritableFile> >*, rocksdb::IODebugContext*) (backup_engine_test.cc:208) ==18228== by 0x4B3583: rocksdb::NewWritableFile(rocksdb::FileSystem*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unique_ptr<rocksdb::FSWritableFile, std::default_delete<rocksdb::FSWritableFile> >*, rocksdb::FileOptions const&) (read_write_util.cc:23) ==18228== by 0x31C3A8: rocksdb::DBImpl::CreateWAL(unsigned long, unsigned long, unsigned long, rocksdb::log::Writer**) (db_impl_open.cc:1752) ==18228== by 0x321A8C: rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool) (db_impl_open.cc:1852) ==18228== by 0x322E7F: Open (db_impl_open.cc:1660) ==18228== by 0x322E7F: rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**) (db_impl_open.cc:1637) ==18228== by 0x1EE1CD: InitializeDBAndBackupEngine (backup_engine_test.cc:724) ==18228== by 0x1EE1CD: rocksdb::(anonymous namespace)::BackupEngineTest::OpenDBAndBackupEngine(bool, bool, rocksdb::(anonymous namespace)::BackupEngineTest::ShareOption) (backup_engine_test.cc:732) ==18228== by 0x217585: rocksdb::(anonymous namespace)::BackupEngineTest_ExcludeFiles_Test::TestBody() (backup_engine_test.cc:4232) ==18228== by 0x296143: HandleSehExceptionsInMethodIfSupported<testing::Test, void> (gtest-all.cc:3899) ==18228== by 0x296143: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest-all.cc:3935) ==18228== by 0x28A0A5: testing::Test::Run() [clone .part.0] (gtest-all.cc:3973) ==18228== by 0x28A364: Run (gtest-all.cc:3965) ==18228== by 0x28A364: testing::TestInfo::Run() [clone .part.0] (gtest-all.cc:4149) ... ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11066 Test Plan: make -j24 J=24 ROCKSDBTESTS_SUBSET=backup_engine_test valgrind_check_some Reviewed By: ajkr Differential Revision: D42297791 Pulled By: cbi42 fbshipit-source-id: db67982b27b91cc78e1a9f4a96da0cba7c9785b7 |
|
mrambacher | 559aaa3577 |
Add ability to have unit tests for ROCKSDB_PLUGINS (#11052)
Summary: This is based on speedb PR [143](https://github.com/speedb-io/speedb/pull/143). This PR adds the ability to add a xxx_TESTS variable to the make or cmake files for a plugin. When set, those files will be added to the unit tests built and executed by the corresponding make system. Note that the rule for building plugin tests via make could be expanded to almost every other unit test in RocksDB. This expansion would allow for a much smaller/simpler Makefile and make it easier to add new test files to RocksDB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11052 Reviewed By: cbi42 Differential Revision: D42212269 Pulled By: ajkr fbshipit-source-id: d02668f7f4466900d63c90bb4f7962d23fcc7114 |
|
ywave | 7f71880de9 |
Fix typo in flushing stats CF (#11055)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11055 Test Plan: make check Reviewed By: cbi42 Differential Revision: D42232828 Pulled By: ajkr fbshipit-source-id: 3b46514aebff4da7e47b9954b90800ba4a3ba30b |
|
HuangYi | 33aca893c2 |
add c-api for setting option optimize_filters_for_memory (#11044)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11044 Reviewed By: cbi42 Differential Revision: D42152851 Pulled By: ajkr fbshipit-source-id: 81710d9503ba4f23f112c72ebf16a48112e27158 |
|
Hui Xiao | 9502856edd |
Add missing range conflict check between file ingestion and RefitLevel() (#10988)
Summary: **Context:** File ingestion never checks whether the key range it acts on overlaps with an ongoing RefitLevel() (used in `CompactRange()` with `change_level=true`). That's because RefitLevel() doesn't register and make its key range known to file ingestion. Though it checks overlapping with other compactions by https://github.com/facebook/rocksdb/blob/7.8.fb/db/external_sst_file_ingestion_job.cc#L998. RefitLevel() (used in `CompactRange()` with `change_level=true`) doesn't check whether the key range it acts on overlaps with an ongoing file ingestion. That's because file ingestion does not register and make its key range known to other compactions. - Note that non-refitlevel-compaction (e.g, manual compaction w/o RefitLevel() or general compaction) also does not check key range overlap with ongoing file ingestion for the same reason. - But it's fine. Credited to cbi42's discovery, `WaitForIngestFile` was called by background and foreground compactions. They were introduced in |
|
Changyu Bi | cc6f323705 |
Include estimated bytes deleted by range tombstones in compensated file size (#10734)
Summary: compensate file sizes in compaction picking so files with range tombstones are preferred, such that they get compacted down earlier as they tend to delete a lot of data. This PR adds a `compensated_range_deletion_size` field in FileMeta that is computed during Flush/Compaction and persisted in MANIFEST. This value is added to `compensated_file_size` which will be used for compaction picking. Currently, for a file in level L, `compensated_range_deletion_size` is set to the estimated bytes deleted by range tombstone of this file in all levels > L. This helps to reduce space amp when data in older levels are covered by range tombstones in level L. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10734 Test Plan: - Added unit tests. - benchmark to check if the above definition `compensated_range_deletion_size` is reducing space amp as intended, without affecting write amp too much. The experiment set up favorable for this optimization: large range tombstone issued infrequently. Command used: ``` ./db_bench -benchmarks=fillrandom,waitforcompaction,stats,levelstats -use_existing_db=false -avoid_flush_during_recovery=true -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -max_bytes_for_level_base=134217728 -target_file_size_base=33554432 -writes_per_range_tombstone=500000 -range_tombstone_width=5000000 -num=50000000 -benchmark_write_rate_limit=8388608 -threads=16 -duration=1800 --max_num_range_tombstones=1000000000 ``` In this experiment, each thread wrote 16 range tombstones over the duration of 30 minutes, each range tombstone has width 5M that is the 10% of the key space width. Results shows this PR generates a smaller DB size. Compaction stats from this PR: ``` Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 2/0 31.54 MB 0.5 0.0 0.0 0.0 8.4 8.4 0.0 1.0 0.0 63.4 135.56 110.94 544 0.249 0 0 0.0 0.0 L4 3/0 96.55 MB 0.8 18.5 6.7 11.8 18.4 6.6 0.0 2.7 65.3 64.9 290.08 284.03 108 2.686 284M 1957K 0.0 0.0 L5 15/0 404.41 MB 1.0 19.1 7.7 11.4 18.8 7.4 0.3 2.5 66.6 65.7 292.93 285.34 220 1.332 293M 3808K 0.0 0.0 L6 143/0 4.12 GB 0.0 45.0 7.5 37.5 41.6 4.1 0.0 5.5 71.2 65.9 647.00 632.66 251 2.578 739M 47M 0.0 0.0 Sum 163/0 4.64 GB 0.0 82.6 21.9 60.7 87.2 26.5 0.3 10.4 61.9 65.4 1365.58 1312.97 1123 1.216 1318M 52M 0.0 0.0 ``` Compaction stats from main: ``` Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 0/0 0.00 KB 0.0 0.0 0.0 0.0 8.4 8.4 0.0 1.0 0.0 60.5 142.12 115.89 569 0.250 0 0 0.0 0.0 L4 3/0 85.68 MB 1.0 17.7 6.8 10.9 17.6 6.7 0.0 2.6 62.7 62.3 289.05 281.79 112 2.581 272M 2309K 0.0 0.0 L5 11/0 293.73 MB 1.0 18.8 7.5 11.2 18.5 7.2 0.5 2.5 64.9 63.9 296.07 288.50 220 1.346 288M 4365K 0.0 0.0 L6 130/0 3.94 GB 0.0 51.5 7.6 43.9 47.9 3.9 0.0 6.3 67.2 62.4 784.95 765.92 258 3.042 848M 51M 0.0 0.0 Sum 144/0 4.31 GB 0.0 88.0 21.9 66.0 92.3 26.3 0.5 11.0 59.6 62.5 1512.19 1452.09 1159 1.305 1409M 58M 0.0 0.0``` Reviewed By: ajkr Differential Revision: D39834713 Pulled By: cbi42 fbshipit-source-id: fe9341040b8704a8fbb10cad5cf5c43e962c7e6b |
|
Peter Dillinger | 02f2b20864 |
Add BackupEngine feature to exclude files (#11030)
Summary: We have a request for RocksDB to essentially support disconnected incremental backup. In other words, if there is limited or no connectivity to the primary backup dir, we should still be able to take an incremental backup relative to that primary backup dir, assuming some metadata about that primary dir is available (and obviously anticipating primary backup dir will be fully available if restore is needed). To support that, this feature allows the API user to "exclude" DB files from backup. This only applies to files that can be shared between backups (sst and blob files), and excluded files are tracked in the backup metadata sufficiently to ensure they are restored at restore time. At restore time, the user provides a set of alternate backup directories (as open BackupEngines, which can be read-only), and excluded files must be found in one of the backup directories ("included" in some backup). This feature depends on backup schema version 2 features, though schema version 2.0 support is not sufficient to read / restore a backup with exclusions. This change updates the schema version to 2.1 because of this feature, so that it's easy to recognize whether a RocksDB release supports this feature, while backups not using the feature are fully compatible with 2.0. Also in this PR: * Stacked on https://github.com/facebook/rocksdb/pull/11029 * Allow progress_callback to be empty, not just no-op function, and recover from exceptions thrown by BackupEngine callbacks. * The internal-only `AsBackupEngine()` function is working around the diamond hierarchy of `BackupEngineImplThreadSafe` to get to the internals, without using confusing features like virtual inheritance. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11030 Test Plan: unit tests added / updated Reviewed By: ajkr Differential Revision: D42004388 Pulled By: pdillinger fbshipit-source-id: 31b6e533d308a5462e528d9012d650482d974077 |
|
anand76 | bec4264813 |
Avoid mixing sync and async prefetch (#11050)
Summary: Reading uncompression dict block always uses sync reads, while data blocks may use async reads and prefetching. This causes problems in FilePrefetchBuffer. So avoid mixing the two by reading the uncompression dict straight from the file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11050 Test Plan: Crash test Reviewed By: akankshamahajan15 Differential Revision: D42194682 Pulled By: anand1976 fbshipit-source-id: aaa8b396fdfe966b157e210f5ef8501c45b7b69e |
|
Peter Dillinger | e6b6e74154 |
Make CompactRange() more aware of SstPartitionerFactory (#11032)
Summary: Some users are at least considering using SstPartitioner to support efficient physical migration of specific key ranges between RocksDB instances. One might expect manual `CompactRange()` over a narrow key range across some partition to enforce partitioning of any SST files crossing that partition boundary, but that currently only works if there are keys within that range. This change makes the overlap logic in CompactRange more aware of the partitioner to automatically select relevant files crossing a partition boundary, even when they otherwise would not be selected due to the compaction range falling in a gap between entries. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11032 Test Plan: unit test included Reviewed By: hx235 Differential Revision: D41981380 Pulled By: pdillinger fbshipit-source-id: 2fe445bdddc73c00276c20f295cc1fa33d15b05a |
|
Alan Paxton | f8969ad7d4 |
Improve Java API get() performance by reducing copies (#10970)
Summary: Performance improvements for `get()` paths in the RocksJava API (JNI). Document describing the performance results. Replace uses of the legacy `DB::Get()` method wrapper returning data in a `std::string` with direct calls to `DB::Get()` passing a pinnable slice to receive this data. Copying from a pinned slice direct to the destination java byte array, without going via an intervening std::string, is a major performance gain for this code path. Note that this gain only comes where `DB::Get()` is able to return a pinned buffer; where it has to copy into the buffer owned by the slice, there is still the intervening copy and no performance gain. It may be possible to address this case too, but it is not trivial. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10970 Reviewed By: pdillinger Differential Revision: D42125567 Pulled By: ajkr fbshipit-source-id: b7a4df7523b0420cadb1e9b6c7da3ec030a8da34 |
|
anand76 | dbf37c290a |
Fix async prefetch heap use after free (#11049)
Summary: This PR fixes a heap use after free bug in the async prefetch code that happens in the following scenario - 1. Scan thread starts 2 async reads for Seek, one for the seek block and one for prefetching 2. Before the first read in https://github.com/facebook/rocksdb/issues/1 completes, another thread reads and loads the block in cache 3. The first scan thread finds the block in cache, continues and the next block cache miss is for a block that spans the boundary of the 2 prefetch buffers, and the 1st read is complete but the 2nd one is not complete yet 4. The scan thread will reallocate (i.e free the old buffer and allocate a new one) the 2nd prefetch buffer, and the in-progress prefetch is orphaned 5. The orphaned prefetch finally completes, resulting in a use after free Also add a few asserts to surface bugs earlier in the crash tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11049 Test Plan: Repro with db_stress and verify the fix Reviewed By: akankshamahajan15 Differential Revision: D42181118 Pulled By: anand1976 fbshipit-source-id: 1ac55d2f64a89ce128c1c574262b8aa7d82eb8cc |
|
Changyu Bi | 53b703eafe |
Fix an assertion failure in `CompactionOutputs::AddRangeDels()` (#11040)
Summary: the [assertion]( |
|
ehds | ddad943c29 |
snapshots of FragmentedRangeTombstoneList must in ascending order (#11046)
Summary: `snapshots` argument of `FragmentedRangeTombstoneList` should be in ascending order. If we pass it in descending order order, it will not work. for example: ``` auto range_del_iter = MakeRangeDelIter({{"a", "e", 3},{"a","e", 6}}); FragmentedRangeTombstoneList fragment_list( std::move(range_del_iter), bytewise_icmp, true /* for_compaction */, {8 ,7 ,4} /* snapshots */); FragmentedRangeTombstoneIterator iter(&fragment_list, bytewise_icmp, kMaxSequenceNumber /* upper_bound */); VerifyFragmentedRangeDels(&iter, {{"a", "e", 6}, {"a", "e", 3}}); ``` VerifyFragmentedRangeDels will fail. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11046 Reviewed By: ajkr Differential Revision: D42148654 Pulled By: cbi42 fbshipit-source-id: a2e76f96dccf56fcca1a91cb8da9b99145f68026 |
|
anand76 | 692d6be358 |
Prevent db_stress failure when io_uring is disabled (#11045)
Summary: The IO uring usage is disabled in RocksDB by default and, as a result, PosixRandomAccessFile::ReadAsync returns a NotSupported() status. This was causing stress test failures with MultiGet and async_io combination. Fix it by relying on redirection of ReadAsync to Read when default Env is used in db_stress. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11045 Reviewed By: akankshamahajan15 Differential Revision: D42136213 Pulled By: anand1976 fbshipit-source-id: fc7904d8ece74d7e8f2e1a34c3d70bd5774fb45f |
|
anand76 | c3f720c60d |
Enable ReadAsync testing and fault injection in db_stress (#11037)
Summary: The db_stress code uses a wrapper Env on top of the raw/fault injection Env. The wrapper, DbStressEnvWrapper, is a legacy Env and thus has a default implementation of ReadAsync that just does a sync read. As a result, the ReadAsync implementations of PosixFileSystem and other file systems weren't being tested. Also, the ReadAsync interface wasn't implemented in FaultInjectionTestFS. This change implements the necessary interfaces in FaultInjectionTestFS and derives DbStressEnvWrapper from FileSystemWrapper rather than EnvWrapper. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11037 Test Plan: Run db_stress standalone and crash test. With this change, db_stress is able to repro the bug fixed in https://github.com/facebook/rocksdb/issues/10890. Reviewed By: akankshamahajan15 Differential Revision: D42061290 Pulled By: anand1976 fbshipit-source-id: 7f0331fd15ee33fb4f7f0f4b22b206fe801ba074 |
|
Changyu Bi | f02c708aa3 |
Consider range tombstone in compaction output file cutting (#10802)
Summary: This PR is the first step for Issue https://github.com/facebook/rocksdb/issues/4811. Currently compaction output files are cut at point keys, and the decision is made mainly in `CompactionOutputs::ShouldStopBefore()`. This makes it possible for range tombstones to cause large compactions that does not respect `max_compaction_bytes`. For example, we can have a large range tombstone that overlaps with too many files from the next level. Another example is when there is a gap between a range tombstone and another key. The first issue may be more acceptable, as a lot of data is deleted. This PR address the second issue by calling `ShouldStopBefore()` for range tombstone start keys. The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new `CompactionMergingIterator` is introduced and only used under `CompactionIterator` for this purpose. Further improvement after this PR include 1) cut compaction output at some grandparent boundary key instead of at the next point key or range tombstone start key and 2) cut compaction output file within a large range tombstone (it may be easier and reasonable to only do it for range tombstones at the end of a compaction output). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10802 Test Plan: - added unit tests in db_range_del_test. - stress test: `python3 tools/db_crashtest.py whitebox --[simple|enable_ts] --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=2 --writepercent=58 --readpercen=21 --duration=36000 --range_deletion_width=1000000` Reviewed By: ajkr, jay-zhuang Differential Revision: D40308827 Pulled By: cbi42 fbshipit-source-id: a8fd6f70a3f09d0ef7a40e006f6c964bba8c00df |
|
sdong | 1928902a6f |
~SleepingBackgroundTask() to wake up the sleeping task (#11036)
Summary: Right now, in unit tests, when background tests are sleeping using SleepingBackgroundTask, and the test exits with test assertion failure, the process will hang and it might prevent us to see the test failure message in CI runs. Try to wake up the thread so that the test can exit correctly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11036 Test Plan: Watch CI succeeds Reviewed By: riversand963 Differential Revision: D42020489 fbshipit-source-id: 5b8441b18d5f67bbb3ade59a1225a8d3c860c2eb |
|
Alan Paxton | 6a8920f988 |
JNI native memory leak - release array elements (#10981)
Summary: Closes https://github.com/facebook/rocksdb/issues/10980 Reproduced as per the suggestion in the ticket, and `$ jcmd <PID> VM.native_memory | grep Internal` reports that we are no longer leaking internal memory with the suggested fix. I did the repro in `MultiGetTest.java` which I have optimized imports on. It did not seem helpful to leave the test code around as it would be onerous to build a memory leak reproducer, and regression seems a remote possibility. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10981 Reviewed By: riversand963 Differential Revision: D41498748 Pulled By: ajkr fbshipit-source-id: 8c6dd0d608172879c8bda479c7c9c05c12d34e70 |
|
Yanqin Jin | c93ba7db5d |
Revise LockWAL/UnlockWAL implementation (#11020)
Summary: RocksDB has two public APIs: `DB::LockWAL()`/`DB::UnlockWAL()`. The current implementation acquires and releases the internal `DBImpl::log_write_mutex_`. According to the comment on `DBImpl::log_write_mutex_`: https://github.com/facebook/rocksdb/blob/7.8.fb/db/db_impl/db_impl.h#L2287:L2288 > Note: to avoid dealock, if needed to acquire both log_write_mutex_ and mutex_, the order should be first mutex_ and then log_write_mutex_. This puts limitations on how applications can use the `LockWAL()` API. After `LockWAL()` returns ok, then application should not perform any operation that acquires `mutex_`. Currently, the use case of `LockWAL()` is MyRocks implementing the MySQL storage engine handlerton `lock_hton_log` interface. The operation that MyRocks performs after `LockWAL()` is `GetSortedWalFiless()` which not only acquires mutex_, but also `log_write_mutex_`. There are two issues: 1. Applications using these two APIs may hang if one thread calls `GetSortedWalFiles()` after calling `LockWAL()` because log_write_mutex is not recursive. 2. Two threads may dead lock due to lock order inversion. To fix these issues, we can modify the implementation of LockWAL so that it does not keep `log_write_mutex_` held until UnlockWAL. To achieve the goal of locking the WAL, we can instead manually inject a write stall so that all future writes will be stopped. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11020 Test Plan: make check Reviewed By: ajkr Differential Revision: D41785203 Pulled By: riversand963 fbshipit-source-id: 5ccb7a9c6eb9a2c3fa80fd2c399cc2568b8f89ce |
|
Hui Xiao | 98d5db5c2e |
Sort L0 files by newly introduced epoch_num (#10922)
Summary:
**Context:**
Sorting L0 files by `largest_seqno` has at least two inconvenience:
- File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap.
- For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
- insert k1@1 to memtable m1
- ingest file s1 with k2@2, ingest file s2 with k3@3
- insert k4@4 to m1
- compact files s1, s2 and result in new file s3 of seqno range [2, 3]
- flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1
- However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption.
- Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption
- For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example)
- an existing SST s1 contains only k1@1
- insert k1@2 to memtable m1
- ingest file s2 with k3@3, ingest file s3 with k4@4
- insert single delete k5@5 in m1
- flush m1 and result in new file s4 of seqno range [2, 5]
- compact s1, s2, s3 and result in new file s5 of seqno range [1, 4]
- compact s4 and result in new file s6 of seqno range [2] due to single delete
- By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno`
Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways:
- In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more.
- In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption.
**Summary:**
- Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`.
- `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`)
- Compaction output file is assigned with the minimum `epoch_number` among input files'
- Refit level: reuse refitted file's epoch_number
- Other paths needing `epoch_number` treatment:
- Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`
- Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`.
- Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair).
- Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder.
- Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery
- Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more
- Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag`
- Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above
- Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`.
- Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR.
- Misc:
- update existing tests with `epoch_number` so make check will pass
- update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases
- assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922
Test Plan:
- `make check`
- New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
- Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930
- [Ongoing] Compatibility test: manually run
|
|
Peter Dillinger | 9b34c097a1 |
Fix bug updating latest backup on delete (#11029)
Summary: Previously, the "latest" valid backup would not be updated on delete. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11029 Test Plan: unit test included (added to an existing test for efficiency) Reviewed By: hx235 Differential Revision: D41967925 Pulled By: pdillinger fbshipit-source-id: ca143354d281eb979557ea421902cd26803a1137 |
|
Arvid Lunnemark | 00238a386b |
replace sprintf with its safe version snprintf (v2) (#11011)
Summary: same motivations as https://github.com/facebook/rocksdb/pull/5475, applied to the last remaining `sprintf`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11011 Reviewed By: pdillinger Differential Revision: D41673500 Pulled By: ajkr fbshipit-source-id: 88618ea791cafad86a9a491799c45979d46e3544 |
|
Jay Zhuang | 1078d860a9 |
Add an unittest for Periodic compaction conflict with ongoing compaction (#10908)
Summary: Add a tiered storage migration test which would conflict with an ongoing penultimate level compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10908 Test Plan: Test only change Reviewed By: anand1976 Differential Revision: D40864509 Pulled By: ajkr fbshipit-source-id: e316e849a01a6c71a41be130101f909b6c0498cb |
|
Andrew Kryczka | 4d60cbc629 |
Use VersionBuilder for CF import ordering/validation (#11028)
Summary: Besides the existing ordering and validation, more is coming to VersionBuilder/VersionStorageInfo, like migration of epoch_numbers from older RocksDB versions. We should start using those common classes for importing CFs, instead of duplicating their ordering, validation, and migration logic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11028 Test Plan: rely on existing tests Reviewed By: hx235 Differential Revision: D41865427 Pulled By: ajkr fbshipit-source-id: 873f5cd87b8902a2380c3b71373ce0b0db3a0c50 |
|
Peter Dillinger | 433d7e4594 |
Improve error messages for SST footer and size errors (#11009)
Summary: Previously, you could get a format_version error if SST file size was too small in manifest, or a weird "too short" error if too big in manifest. Now we ensure: * Magic number error is reported first if we attempt to open an SST file and the footer is completely bad. * Footer errors are reported with affected file. * If manifest file size doesn't match actual, then the error includes expected and actual sizes (if an error is reported; in some cases we allow the file to be too big) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11009 Test Plan: unit tests added, some manual Previously, the code for "file too short" in footer processing was only covered by some tests attempting to verify SST checksums on non-SST files (fixed). Reviewed By: siying Differential Revision: D41656272 Pulled By: pdillinger fbshipit-source-id: 3da32702eb5aaedbea0e5e74742ad57edd7ad3df |
|
dependabot[bot] | 6648dec0a3 |
Bump nokogiri from 1.13.9 to 1.13.10 in /docs (#11024)
Summary: Bumps [nokogiri](https://github.com/sparklemotion/nokogiri) from 1.13.9 to 1.13.10. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/sparklemotion/nokogiri/releases">nokogiri's releases</a>.</em></p> <blockquote> <h2>1.13.10 / 2022-12-07</h2> <h3>Security</h3> <ul> <li>[CRuby] Address CVE-2022-23476, unchecked return value from <code>xmlTextReaderExpand</code>. See <a href="https://github.com/sparklemotion/nokogiri/security/advisories/GHSA-qv4q-mr5r-qprj">GHSA-qv4q-mr5r-qprj</a> for more information.</li> </ul> <h3>Improvements</h3> <ul> <li>[CRuby] <code>XML::Reader#attribute_hash</code> now returns <code>nil</code> on parse errors. This restores the behavior of <code>#attributes</code> from v1.13.7 and earlier. [<a href="https://github-redirect.dependabot.com/sparklemotion/nokogiri/issues/2715">https://github.com/facebook/rocksdb/issues/2715</a>]</li> </ul> <hr /> <p>sha256 checksums:</p> <pre><code>777ce2e80f64772e91459b943e531dfef387e768f2255f9bc7a1655f254bbaa1 nokogiri-1.13.10-aarch64-linux.gem b432ff47c51386e07f7e275374fe031c1349e37eaef2216759063bc5fa5624aa nokogiri-1.13.10-arm64-darwin.gem 73ac581ddcb680a912e92da928ffdbac7b36afd3368418f2cee861b96e8c830b nokogiri-1.13.10-java.gem 916aa17e624611dddbf2976ecce1b4a80633c6378f8465cff0efab022ebc2900 nokogiri-1.13.10-x64-mingw-ucrt.gem 0f85a1ad8c2b02c166a6637237133505b71a05f1bb41b91447005449769bced0 nokogiri-1.13.10-x64-mingw32.gem 91fa3a8724a1ce20fccbd718dafd9acbde099258183ac486992a61b00bb17020 nokogiri-1.13.10-x86-linux.gem d6663f5900ccd8f72d43660d7f082565b7ffcaade0b9a59a74b3ef8791034168 nokogiri-1.13.10-x86-mingw32.gem 81755fc4b8130ef9678c76a2e5af3db7a0a6664b3cba7d9fe8ef75e7d979e91b nokogiri-1.13.10-x86_64-darwin.gem 51d5246705dedad0a09b374d09cc193e7383a5dd32136a690a3cd56e95adf0a3 nokogiri-1.13.10-x86_64-linux.gem d3ee00f26c151763da1691c7fc6871ddd03e532f74f85101f5acedc2d099e958 nokogiri-1.13.10.gem </code></pre> </blockquote> </details> <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/sparklemotion/nokogiri/blob/main/CHANGELOG.md">nokogiri's changelog</a>.</em></p> <blockquote> <h2>1.13.10 / 2022-12-07</h2> <h3>Security</h3> <ul> <li>[CRuby] Address CVE-2022-23476, unchecked return value from <code>xmlTextReaderExpand</code>. See <a href="https://github.com/sparklemotion/nokogiri/security/advisories/GHSA-qv4q-mr5r-qprj">GHSA-qv4q-mr5r-qprj</a> for more information.</li> </ul> <h3>Improvements</h3> <ul> <li>[CRuby] <code>XML::Reader#attribute_hash</code> now returns <code>nil</code> on parse errors. This restores the behavior of <code>#attributes</code> from v1.13.7 and earlier. [<a href="https://github-redirect.dependabot.com/sparklemotion/nokogiri/issues/2715">https://github.com/facebook/rocksdb/issues/2715</a>]</li> </ul> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href=" |
|
Hui Xiao | 5f52dbc505 |
Move two history entries mistake out of 7.9 section (#11013)
Summary: https://github.com/facebook/rocksdb/pull/10892 and https://github.com/facebook/rocksdb/pull/10955 mistakenly added two entries under sealed 7.9.history section. This PR fixes these two. No need to update 7.9 branch (https://github.com/facebook/rocksdb/blob/7.9.fb/HISTORY.md) cuz it's cut before these two PRs landed Pull Request resolved: https://github.com/facebook/rocksdb/pull/11013 Reviewed By: cbi42 Differential Revision: D41666514 Pulled By: hx235 fbshipit-source-id: c4bc7a29ff663664bf0be1ba1c7eab4d00a61528 |
|
Hui Xiao | 15bb4ea084 |
Deflake DBWALTest.FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL (#11016)
Summary: **Context/Summary:** Credit to ajkr's https://github.com/facebook/rocksdb/pull/11016#pullrequestreview-1205020134, flaky test https://app.circleci.com/pipelines/github/facebook/rocksdb/21985/workflows/5f6cc355-78c1-46d8-89ee-0fd679725a8a/jobs/540878 is due to `Flush()` called in the test returned earlier than obsoleted WAL being found in background flush and SyncWAL() was called (i.e, "sync_point_called" sets to true). Fix this by making checking `sync_point_called == true` after obsoleted WAL is found and `SyncWAL()` is called. Also rename the "sync_point_called" to be something more specific. Also, fix a potential flakiness due to manually setting a log threshold to force new manifest creation. This is unreliable so I decided to use sync point to force new manifest creation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11016 Test Plan: make check Reviewed By: pdillinger Differential Revision: D41717786 Pulled By: hx235 fbshipit-source-id: ad1e4701a987285bbe6c8e7d9b05c4db06b4edf4 |
|
Changyu Bi | 23af6786a9 |
Fix an assertion failure in `TimestampTablePropertiesCollector` for empty output (#11015)
Summary: when the compaction output file is empty, the assertion in `TimestampTablePropertiesCollector::Finish()` breaks. This PR fixes this assert and added unit test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11015 Test Plan: added UT. Reviewed By: ajkr Differential Revision: D41716719 Pulled By: cbi42 fbshipit-source-id: d891d46be4c4805e3d49be6b80c9d75f1bd51080 |
|
anand76 | 8ffabdc226 |
Fix table cache leak in MultiGet with async_io (#10997)
Summary: When MultiGet with the async_io option encounters an IO error in TableCache::GetTableReader, it may result in leakage of table cache handles due to queued coroutines being abandoned. This PR fixes it by ensuring any queued coroutines are run before aborting the MultiGet. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10997 Test Plan: 1. New unit test in db_basic_test 2. asan_crash Reviewed By: pdillinger Differential Revision: D41587244 Pulled By: anand1976 fbshipit-source-id: 900920cd3fba47cb0fc744a62facc5ffe2eccb64 |
|
Peter Dillinger | 95bf302189 |
Fix use of make_unique in Arena::AllocateNewBlock (#11012)
Summary: The change to `make_unique<char[]>` in https://github.com/facebook/rocksdb/issues/10810 inadvertently started initializing data in Arena blocks, which could lead to increased memory use due to (at least on our implementation) force-mapping pages as a result. This change goes back to `new char[]` while keeping all the other good parts of https://github.com/facebook/rocksdb/issues/10810. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11012 Test Plan: unit test added (fails on Linux before fix) Reviewed By: anand1976 Differential Revision: D41658893 Pulled By: pdillinger fbshipit-source-id: 267b7dccfadaeeb1be767d43c602a6abb0e71cd0 |
|
WLeoo | be3a62a2e7 |
Fix an uninitialized variable warning for g++ 12.2.0 (#10995)
Summary: /home/wl/rocksdbtry/rocksdb-WL/util/bloom_test.cc: In constructor ‘rocksdb::RawFilterTester::RawFilterTester()’: /home/wl/rocksdbtry/rocksdb-WL/util/bloom_test.cc:813:40: error: member ‘rocksdb::RawFilterTester::data_’ is used uninitialized [-Werror=uninitialized] Pull Request resolved: https://github.com/facebook/rocksdb/pull/10995 Reviewed By: cbi42 Differential Revision: D41620186 Pulled By: ajkr fbshipit-source-id: a6ebd3820ef12e0af322cbfb7eb553de5bdfcb29 |
|
Jay Zhuang | c068799d0c |
Blog: Time Aware Tiered Storage in RocksDB (#11005)
Summary: Blog for time aware tiered storage Pull Request resolved: https://github.com/facebook/rocksdb/pull/11005 Test Plan: document only Reviewed By: siying Differential Revision: D41620430 Pulled By: ajkr fbshipit-source-id: d85406a50a7b1553495f2f2f72143dbd90101b01 |
|
Hui Xiao | 2f76ab150d |
Fix missing WAL in new manifest by rolling over the WAL deletion record from prev manifest (#10892)
Summary: **Context** `Options::track_and_verify_wals_in_manifest = true` verifies each of the WALs tracked in manifest indeed presents in the WAL folder. If not, a corruption "Missing WAL with log number" will be thrown. `DB::SyncWAL()` called at a specific timing (i.e, at the `TEST_SYNC_POINT("FindObsoleteFiles::PostMutexUnlock")`) can record in a new manifest the WAL addition of a WAL file that already had a WAL deletion recorded in the previous manifest. And the WAL deletion record is not rollover-ed to the new manifest. So the new manifest creates the illusion of such WAL never gets deleted and should presents at db re/open. - Such WAL deletion record can be caused by flushing the memtable associated with that WAL and such WAL deletion can actually happen in` PurgeObsoleteFiles()`. As a consequence, upon `DB::Reopen()`, this WAL file can be deleted while manifest still has its WAL addition record , which causes a false alarm of corruption "Missing WAL with log number" to be thrown. **Summary** This PR fixes this false alarm by rolling over the WAL deletion record from prev manifest to the new manifest by adding the WAL deletion record to the new manifest. **Test** - Make check - Added new unit test `TEST_F(DBWALTest, FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL)` that failed before the fix and passed after - [Ongoing]CI stress test + aggressive value as in https://github.com/facebook/rocksdb/pull/10761 , which is how this false alarm was first surfaced, to confirm such false alarm disappears - [Ongoing]Regular CI stress test to confirm such fix didn't harm anything Pull Request resolved: https://github.com/facebook/rocksdb/pull/10892 Reviewed By: ajkr Differential Revision: D40778965 Pulled By: hx235 fbshipit-source-id: a512364bfdeb0b1a55c171890e60d856c528f37f |