rocksdb

mirror of https://github.com/facebook/rocksdb.git synced 2024-11-28 05:43:50 +00:00

Author	SHA1	Message	Date
Peter Dillinger	96340dbce2	Options for file temperature for more files (#12957 ) Summary: We have a request to use the cold tier as primary source of truth for the DB, and to best support such use cases and to complement the existing options controlling SST file temperatures, we add two new DB options: * `metadata_write_temperature` for DB "small" files that don't contain much user data * `wal_write_temperature` for WALs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12957 Test Plan: Unit test included, though it's hard to be sure we've covered all the places Reviewed By: jowlyzhang Differential Revision: D61664815 Pulled By: pdillinger fbshipit-source-id: 8e19c9dd8fd2db059bb15f74938d6bc12002e82b	2024-08-23 19:49:25 -07:00
eniac1024	f5c5f881d2	Fix MultiGet with timestamps (#12943 ) Summary: Issue: MultiGet(PinnableSlice) can't read out all timestamps. Fixed the impl, and added an UT as well. In the original impl, if MultiGet reads multiple column families, a later column family would clean up timestamps of previous column family. Fix: https://github.com/facebook/rocksdb/issues/12950#issue-2476996580 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12943 Reviewed By: anand1976 Differential Revision: D61729257 Pulled By: pdillinger fbshipit-source-id: 55267c26076c8a59acedd27e14714711729a40df	2024-08-23 13:58:04 -07:00
Changyu Bi	c62de54c7c	Record largest seqno in table properties and verify in file ingestion (#12951 ) Summary: this helps to avoid scanning input files when ingesting db generated files: `ecb844babd/db/external_sst_file_ingestion_job.cc (L917-L935)` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12951 Test Plan: * `IngestDBGeneratedFileTest.FailureCase` is updated to verify that this table property is verified during ingestion * existing unit tests for other ingestion use cases. Reviewed By: jowlyzhang Differential Revision: D61608285 Pulled By: cbi42 fbshipit-source-id: b5b7aae9741531349ab247be6ffaa3f3628b76ca	2024-08-21 16:24:18 -07:00
Yu Zhang	945f60b157	Add some documentation for version edit handlers (#12948 ) Summary: As titled. No functional change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12948 Reviewed By: hx235 Differential Revision: D61551254 Pulled By: jowlyzhang fbshipit-source-id: ccf53d78bd2f18f174d7e61972e5de467c96ce76	2024-08-21 10:09:10 -07:00
Yu Zhang	81d52bdc1a	Fix UDT in memtable only assertions (#12946 ) Summary: Empty memtables can be legitimately created and flushed, for example by error recovery flush attempts: `273b3eadf0/db/db_impl/db_impl_compaction_flush.cc (L2309-L2312)` This check is updated to be considerate of this. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12946 Reviewed By: hx235 Differential Revision: D61492477 Pulled By: jowlyzhang fbshipit-source-id: 7d16fcaea457948546072f85b3650fd1cc24f9db	2024-08-20 09:19:52 -07:00
Changyu Bi	defd97bc9d	Add an option to verify memtable key order during reads (#12889 ) Summary: add a new CF option `paranoid_memory_checks` that allows additional data integrity validations during read/scan. Currently, skiplist-based memtable will validate the order of keys visited. Further data validation can be added in different layers. The option will be opt-in due to performance overhead. The motivation for this feature is for services where data correctness is critical and want to detect in-memory corruption earlier. For a corrupted memtable key, this feature can help to detect it during during reads instead of during flush with existing protections (OutputValidator that verifies key order or per kv checksum). See internally linked task for more context. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12889 Test Plan: * new unit test added for paranoid_memory_checks=true. * existing unit test for paranoid_memory_checks=false. * enable in stress test. Performance Benchmark: we check for performance regression in read path where data is in memtable only. For each benchmark, the script was run at the same time for main and this PR: * Memtable-only randomread ops/sec: ``` (for I in $(seq 1 50);do ./db_bench --benchmarks=fillseq,readrandom --write_buffer_size=268435456 --writes=250000 --num=250000 --reads=500000 --seed=1723056275 2>&1 \| grep "readrandom"; done;) \| awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; Main: 608146 PR with paranoid_memory_checks=false: 607727 (- %0.07) PR with paranoid_memory_checks=true: 521889 (-%14.2) ``` * Memtable-only sequential scan ops/sec: ``` (for I in $(seq 1 50); do ./db_bench--benchmarks=fillseq,readseq[-X10] --write_buffer_size=268435456 --num=1000000 --seed=1723056275 2>1 \| grep "\[AVG 10 runs\]"; done;) \| awk '{ t += $6; c++; print; } END { printf "%.0f\n", 1.0 * t / c }'; Main: 9180077 PR with paranoid_memory_checks=false: 9536241 (+%3.8) PR with paranoid_memory_checks=true: 7653934 (-%16.6) ``` * Memtable-only reverse scan ops/sec: ``` (for I in $(seq 1 20); do ./db_bench --benchmarks=fillseq,readreverse[-X10] --write_buffer_size=268435456 --num=1000000 --seed=1723056275 2>1 \| grep "\[AVG 10 runs\]"; done;) \| awk '{ t += $6; c++; print; } END { printf "%.0f\n", 1.0 * t / c }'; Main: 1285719 PR with integrity_checks=false: 1431626 (+%11.3) PR with integrity_checks=true: 811031 (-%36.9) ``` The `readrandom` benchmark shows no regression. The scanning benchmarks show improvement that I can't explain. Reviewed By: pdillinger Differential Revision: D60414267 Pulled By: cbi42 fbshipit-source-id: a70b0cbeea131f1a249a5f78f9dc3a62dacfaa91	2024-08-19 13:53:25 -07:00
Jay Huh	273b3eadf0	Add Remote Compaction Installation Callback Function (#12940 ) Summary: Add an optional callback function upon remote compaction temp output installation. This will be internally used for setting the final status in the Offload Infra. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12940 Test Plan: Unit Test added ``` ./compaction_service_test ``` _Also internally tested by manually merging into internal code base_ Reviewed By: anand1976 Differential Revision: D61419157 Pulled By: jaykorean fbshipit-source-id: 66831685bc403949c26bfc65840dd1900d2a5a67	2024-08-19 11:22:43 -07:00
Yu Zhang	295326b6ee	Best efforts recovery recover seqno prefix (#12938 ) Summary: This PR make best efforts recovery more permissive by allowing it to recover incomplete Version that presents a valid point in time view from the user's perspective. Currently, a Version is only valid and saved if all files consisting that Version can be found. With this change, if only a suffix of L0 files (and their associated blob files) are missing, a valid Version is also available to be saved and recover to. Note that we don't do this if the column family was atomically flushed. Because atomic flush also need a consistent view across the column families, we cannot guarantee that if we are recovering to incomplete version. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12938 Test Plan: Existing tests and added unit tests. Reviewed By: anand1976 Differential Revision: D61414381 Pulled By: jowlyzhang fbshipit-source-id: f9b73deb34d35ad696ab42315928b656d586262a	2024-08-16 17:18:54 -07:00
Peter Dillinger	4d3518951a	Option to decouple index and filter partitions (#12939 ) Summary: Partitioned metadata blocks were introduced back in 2017 to deal more gracefully with large DBs where RAM is relatively scarce and some data might be much colder than other data. The feature allows metadata blocks to compete for memory in the block cache against data blocks while alleviating tail latencies and thrash conditions that can arise with large metadata blocks (sometimes megabytes each) that can arise with large SST files. In general, the cost to partitioned metadata is more CPU in accesses (especially for filters where more binary search is needed before hashing can be used) and a bit more memory fragmentation and related overheads. However the feature has always had a subtle limitation with a subtle effect on performance: index partitions and filter partitions must be cut at the same time, regardless of which wins the space race (hahaha) to metadata_block_size. Commonly filters will be a few times larger than indexes, so index partitions will be under-sized compared to filter (and data) blocks. While this does affect fragmentation and related overheads a bit, I suspect the bigger impact on performance is in the block cache. The coupling of the partition cuts would be defensible if the binary search done to find the filter block was used (on filter hit) to short-circuit binary search to an index partition, but that optimization has not been developed. Consider two metadata blocks, an under-sized one and a normal-sized one, covering proportional sections of the key space with the same density of read queries. The under-sized one will be more prone to eviction from block cache because it is used less often. This is unfair because of its despite its proportionally smaller cost of keeping in block cache, and most of the cost of a miss to re-load it (random IO) is not proportional to the size (similar latency etc. up to ~32KB). ## This change Adds a new table option decouple_partitioned_filters allows filter blocks and index blocks to be cut independently. To make this work, the partitioned filter block builder needs to know about the previous key, to generate an appropriate separator for the partition index. In most cases, BlockBasedTableBuilder already has easy access to the previous key to provide to the filter block builder. This change includes refactoring to pass that previous key to the filter builder when available, with the filter building caching the previous key itself when unavailable, such as during compression dictionary training and some unit tests. Access to the previous key eliminates the need to track the previous prefix, which results in a small SST construction CPU win in prefix filtering cases, regardless of coupling, and possibly a small regression for some non-prefix cases, regardless of coupling, but still overall improvement especially with https://github.com/facebook/rocksdb/issues/12931. Suggested follow-up: * Update confusing use of "last key" to refer to "previous key" * Expand unit test coverage with parallel compression and dictionary training * Consider an option or enhancement to alleviate under-sized metadata blocks "at the end" of an SST file due to no coordination or awareness of when files are cut. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12939 Test Plan: unit tests updated. Also did some unit test runs with "hard wired" usage of parallel compression and dictionary training code paths to ensure they were working. Also ran blackbox_crash_test for a while with the new feature. ## SST write performance (CPU) Using the same testing setup as in https://github.com/facebook/rocksdb/issues/12931 but with -decouple_partitioned_filters=1 in the "after" configuration, which benchmarking shows makes almost no difference in terms of SST write CPU. "After" vs. "before" this PR ``` -partition_index_and_filters=0 -prefix_size=0 -whole_key_filtering=1 923691 vs. 924851 (-0.13%) -partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=0 921398 vs. 922973 (-0.17%) -partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=1 902259 vs. 908756 (-0.71%) -partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0 917932 vs. 916901 (+0.60%) -partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0 912755 vs. 907298 (+0.60%) -partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=1 899754 vs. 892433 (+0.82%) ``` I think this is a pretty good trade, especially in attracting more movement toward partitioned configurations. ## Read performance Let's see how decoupling affects read performance across various degrees of memory constraint. To simplify LSM structure, we're using FIFO compaction. Since decoupling will overall increase metadata block size, we control for this somewhat with an extra "before" configuration with larger metadata block size setting (8k instead of 4k). Basic setup: ``` (for CS in 0300 1200; do TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillrandom,flush,readrandom,block_cache_entry_stats -num=5000000 -duration=30 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=10 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters=1 -statistics=1 -cache_size=${CS}000000 -metadata_block_size=4096 -decouple_partitioned_filters=1 2>&1 \| tee results-$CS; done) ``` And read ops/s results: ```CSV Cache size MB,After/decoupled/4k,Before/4k,Before/8k 3,15593,15158,12826 6,16295,16693,14134 10,20427,20813,18459 20,27035,26836,27384 30,33250,31810,33846 60,35518,32585,35329 100,36612,31805,35292 300,35780,31492,35481 1000,34145,31551,35411 1100,35219,31380,34302 1200,35060,31037,34322 ``` If you graph this with log scale on the X axis (internal link: https://pxl.cl/5qKRc), you see that the decoupled/4k configuration is essentially the best of both the before/4k and before/8k configurations: handles really tight memory closer to the old 4k configuration and handles generous memory closer to the old 8k configuration. Reviewed By: jowlyzhang Differential Revision: D61376772 Pulled By: pdillinger fbshipit-source-id: fc2af2aee44290e2d9620f79651a30640799e01f	2024-08-16 15:34:31 -07:00
Peter Dillinger	f63428bcc7	Optimize, simplify filter block building (fix regression) (#12931 ) Summary: This is in part a refactoring / simplification to set up for "decoupled" partitioned filters and in part to fix an intentional regression for a correctness fix in https://github.com/facebook/rocksdb/issues/12872. Basically, we are taking out some complexity of the filter block builders, and pushing part of it (simultaneous de-duplication of prefixes and whole keys) into the filter bits builders, where it is more efficient by operating on hashes (rather than copied keys). Previously, the FullFilterBlockBuilder had a somewhat fragile and confusing set of conditions under which it would keep a copy of the most recent prefix and most recent whole key, along with some other state that is essentially redundant. Now we just track (always) the previous prefix in the PartitionedFilterBlockBuilder, to deal with the boundary prefix Seek filtering problem. (Btw, the next PR will optimize this away since BlockBasedTableReader already tracks the previous key.) And to deal with the problem of de-duplicating both whole keys and prefixes going into a single filter, we add a new function to FilterBitsBuilder that has that extra de-duplication capabilty, which is relatively efficient because we only have to cache an extra 64-bit hash, not a copied key or prefix. (The API of this new function is somewhat awkward to avoid a small CPU regression in some cases.) Also previously, there was awkward logic split between FullFilterBlockBuilder and PartitionedFilterBlockBuilder to deal with some things specific to partitioning. And confusing names like Add vs. AddKey. FullFilterBlockBuilder is much cleaner and simplified now. The splitting of PartitionedFilterBlockBuilder::MaybeCutAFilterBlock into DecideCutAFilterBlock and CutAFilterBlock is to address what would have been a slight performance regression in some cases. The split allows for more intruction-level parallelism by reducing unnecessary control dependencies. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12931 Test Plan: existing tests (with some minor updates) Also manually ported over the pre-broken regression test described in https://github.com/facebook/rocksdb/issues/12870 and ran it (passed). Performance: Here we validate that an entire series of recent related PRs are a net improvement in aggregate. "Before" is with these PRs reverted: https://github.com/facebook/rocksdb/issues/12872 #12911 https://github.com/facebook/rocksdb/issues/12874 #12867 https://github.com/facebook/rocksdb/issues/12903 #12904. "After" includes this PR (and all of those, with base revision `16c21af`). Simultaneous test script designed to maximally depend on SST construction efficiency: ``` for PF in 0 1; do for PS in 0 8; do for WK in 0 1; do [ "$PS" == "$WK" ] \|\| (for I in `seq 1 20`; do TEST_TMPDIR=/dev/shm/rocksdb2 ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -memtablerep=vector -allow_concurrent_memtable_write=0 -bloom_bits=10 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK 2>&1 \| grep micros/op; done) \| awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; echo "Was -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK"; done; done; done) \| tee results ``` Showing average ops/sec of "after" vs. "before" ``` -partition_index_and_filters=0 -prefix_size=0 -whole_key_filtering=1 935586 vs. 928176 (+0.79%) -partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=0 930171 vs. 926801 (+0.36%) -partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=1 910727 vs. 894397 (+1.8%) -partition_index_and_filters=1 -prefix_size=0 -whole_key_filtering=1 929795 vs. 922007 (+0.84%) -partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0 921924 vs. 917285 (+0.51%) -partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=1 903393 vs. 887340 (+1.8%) ``` As one would predict, the most improvement is seen in cases where we have optimized away copying the whole key. Reviewed By: jowlyzhang Differential Revision: D61138271 Pulled By: pdillinger fbshipit-source-id: 427cef0b1465017b45d0a507bfa7720fa20af043	2024-08-14 15:13:16 -07:00
Yu Zhang	d458331ee9	Move file tracking in VersionEditHandlerPointInTime to VersionBuilder (#12928 ) Summary: `VersionEditHandlerPointInTime` is tracking found files, missing files, intermediate files in order to decide to build a `Version` on negative edge trigger (transition from valid to invalid) without applying the current `VersionEdit`. However, applying `VersionEdit` and check completeness of a `Version` are specialization of `VersionBuilder`. More importantly, when we augment best efforts recovery to recover not just complete point in time Version but also a prefix of seqno for a point in time Version, such checks need to be duplicated in `VersionEditHandlerPointInTime` and `VersionBuilder`. To avoid this, this refactor move all the file tracking functionality in `VersionEditHandlerPointInTime` into `VersionBuilder`. To continue to let `VersionEditHandlerPIT` do the edge trigger check and build a `Version` before applying the current `VersionEdit`, a suite of APIs to supporting creating a save point and its associated functions are added in `VersionBuilder` to achieve this. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12928 Test Plan: Existing tests Reviewed By: anand1976 Differential Revision: D61171320 Pulled By: jowlyzhang fbshipit-source-id: 604f66f8b1e3a3e13da59d8ba357c74e8a366dbc	2024-08-12 21:09:37 -07:00
anand76	c21fe1a47f	Add ticker stats for read corruption retries (#12923 ) Summary: Add a couple of ticker stats for corruption retry count and successful retries. This PR also eliminates an extra read attempt when there's a checksum mismatch in a block read from the prefetch buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12923 Test Plan: Update existing tests Reviewed By: jowlyzhang Differential Revision: D61024687 Pulled By: anand1976 fbshipit-source-id: 3a08403580ab244000e0d480b7ee0f5a03d76b06	2024-08-12 15:32:07 -07:00
SGZW	6727f0f58a	fix compaction_picker_test asan heap use after free (#12908 ) Summary: ![image](https://github.com/user-attachments/assets/3290fe18-aca2-4691-b072-fbbc96a15fb1) this testcase set syncpoint function which reference this test case heap variable "enable_per_key_placement_" and this sync point function will be triggered by another testcase, so asan will report asan heap use after free error Pull Request resolved: https://github.com/facebook/rocksdb/pull/12908 Reviewed By: hx235 Differential Revision: D60973363 Pulled By: cbi42 fbshipit-source-id: df4f488f51e7741784d5a92fc0a5fc538c5d5b1a	2024-08-09 15:06:37 -07:00
SGZW	5c456c4c08	fix compaction speedup for marked files ut (#12912 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12912 Reviewed By: hx235 Differential Revision: D60973460 Pulled By: cbi42 fbshipit-source-id: ebaa343757f09f7281884a512ebe3a7d6845c8b3	2024-08-09 15:05:02 -07:00
Changyu Bi	b32d899482	Fix MultiGet dropping memtable kv checksum corruption (#12842 ) Summary: Corruption status returned by `GetFromTable()` could be overwritten here: `b6c3495a71/db/version_set.cc (L2614)` This PR fixes this issue by setting `*(s->found_final_value) = true;` in SaveValue. Also makes the handling of the return value of `GetFromTable()` more robust and added asserts in a couple places. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12842 Test Plan: Updated an existing unit test to cover MultiGet. It fails the assertion here before this PR: `b6c3495a71/db/version_set.cc (L2601)` Reviewed By: anand1976 Differential Revision: D59498203 Pulled By: cbi42 fbshipit-source-id: 1f071c1b2c5b66fb71264b547a9e670d1cf592f0	2024-08-08 13:34:11 -07:00
Peter Dillinger	d33d25f903	Disable WAL recycling in crash test; reproducer for recovery data loss (#12918 ) Summary: I was investigating a crash test failure with "Corruption: SST file is ahead of WALs" which I haven't reproduced, but I did reproduce a data loss issue on recovery which I suspect could be the same root problem. The problem is already somewhat known (see https://github.com/facebook/rocksdb/issues/12403 and https://github.com/facebook/rocksdb/issues/12639) where it's only safe to recovery multiple recycled WAL files with trailing old data if the sequence numbers between them are adjacent (to ensure we didn't lose anything in the corrupt/obsolete WAL tail). However, aside from disableWAL=true, there are features like external file ingestion that can increment the sequence numbers without writing to the WAL. It is simply unsustainable to worry about this kind of feature interaction limiting where we can consume sequence numbers. It is very hard to test and audit as well. For reliable crash recovery of recycled WALs, we need a better way of detecting that we didn't drop data from one WAL to the next. Until then, let's disable WAL recycling in the crash test, to help stabilize it. Ideas for follow-up to fix the underlying problem: (a) With recycling, we could always sync the WAL before opening the next one. HOWEVER, this potentially very large sync could cause a big hiccup in writes (vs. O(1) sized manifest sync). (a1) The WAL sync could ensure it is truncated to size, or (a2) By requiring track_and_verify_wals_in_manifest, we could assume that the last synced size in the manifest is the final usable size of the WAL. (It might also be worth avoiding truncating recycled WALs.) (b) Add a new mechanism to record and verify the final size of a WAL without requiring a sync. (b1) By requiring track_and_verify_wals_in_manifest, this could be new WAL metadata recorded in the manifest (at the time of switching WALs). Note that new fields of WalMetadata are not forward-compatible, but a new kind of manifest record (next to WalAddition, WalDeletion; e.g. WalCompletion) is IIRC forward-compatible. (b2) A new kind of WAL header entry (not forward compatible, unfortunately) could record the final size of the previous WAL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12918 Test Plan: Added disabled reproducer for non-linear data loss on recovery Reviewed By: hx235 Differential Revision: D60917527 Pulled By: pdillinger fbshipit-source-id: 3663d79aec81851f5cf41669f84a712bb4563fd7	2024-08-07 14:20:45 -07:00
Peter Dillinger	b15f8c7f0e	Refactor db_bloom_filter_test (#12911 ) Summary: Ahead of a "decoupled" variant of partitioned filters, refactoring this unit test file to make it easier to incorporate that new variant. * bool test param to new enum class FilterPartitioning * Some cases of iterating over that bool to new parameterized test * Combine some common functionality for configuring parameterized options Pull Request resolved: https://github.com/facebook/rocksdb/pull/12911 Test Plan: no production changes, and no intentional changes to scope or conditions of tests Differential Revision: D60701287 fbshipit-source-id: 3497e3230e29a4f62c934bcb75693965a2df41d8	2024-08-07 11:28:16 -07:00
Yu Zhang	719c96125c	Add a TransactionOptions to enable tracking timestamp size info inside WriteBatch (#12864 ) Summary: In normal use cases, meta info like column family's timestamp size is tracked at the transaction layer, so it's not necessary and even detrimental to track such info inside the internal WriteBatch because it may let anti-patterns like bypassing Transaction write APIs and directly write to its internal WriteBatch like this: `9d0a754dc9/storage/rocksdb/ha_rocksdb.cc (L4949-L4950)` Setting this option to true will keep aforementioned use case continue to work before it's refactored out. This option is only for this purpose and it will be gradually deprecated after aforementioned MyRocks use case are refactored. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12864 Test Plan: Added unit tests Reviewed By: cbi42 Differential Revision: D60194094 Pulled By: jowlyzhang fbshipit-source-id: 64a98822167e99aa7e4fa2a60085d44a5deaa45c	2024-08-05 13:06:45 -07:00
Yu Zhang	36b061a6c7	Fix test breakage (#12915 ) Summary: https://github.com/facebook/rocksdb/issues/12891 updated this deletion rate in the test to be much higher, which makes the test flaky. The rate is being intentionally set to very low to maximize the retention of a ".log.trash" file after DB closes. This PR just change it back. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12915 Reviewed By: ltamasi Differential Revision: D60776312 Pulled By: jowlyzhang fbshipit-source-id: d193557a042c65816fcc337cceb09905e042e9f6	2024-08-05 12:26:18 -07:00
Yu Zhang	d12aaf23ca	Fix file deletions in DestroyDB not rate limited (#12891 ) Summary: Make `DestroyDB` slowly delete files if it's configured and enabled via `SstFileManager`. It's currently not available mainly because of DeleteScheduler's logic related to tracked total_size_ and total_trash_size_. These accounting and logic should not be applied to `DestroyDB`. This PR adds a `DeleteUnaccountedDBFile` util for this purpose which deletes files without accounting it. This util also supports assigning a file to a specified trash bucket so that user can later wait for a specific trash bucket to be empty. For `DestroyDB`, files with more than 1 hard links will be deleted immediately. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12891 Test Plan: Added unit tests, existing tests. Reviewed By: anand1976 Differential Revision: D60300220 Pulled By: jowlyzhang fbshipit-source-id: 8b18109a177a3a9532f6dc2e40e08310c08ca3c7	2024-08-02 19:31:55 -07:00
Levi Tamasi	2e8a1a14ef	Fix a data race affecting the background error status (#12910 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12910 There is currently a call to `GetBGError()` in `DBImpl::WriteImplWALOnly()` where the DB mutex is (incorrectly) not held, leading to a data race. Technically, we could acquire the mutex here but instead, the patch removes the affected check altogether, since the same check is already performed (in a thread-safe manner) in the subsequent call to `PreprocessWrite()`. Reviewed By: cbi42 Differential Revision: D60682008 fbshipit-source-id: 54b67975dcf57d67c068cac71e8ada09a1793ec5	2024-08-02 14:11:08 -07:00
Changyu Bi	8be824e316	Use compensated file size for intra-L0 compaction (#12878 ) Summary: In leveled compaction, we pick intra-L0 compaction instead of L0->Lbase whenever L0 size is small. When L0 files contain many deletions, it makes more sense to compact then down instead of accumulating tombstones in L0. This PR uses compensated_file_size when computing L0 size for determining intra-L0 compaction. Also scale down the limit on total L0 size further to be more cautious about accumulating data in L0. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12878 Test Plan: updated unit test. Reviewed By: hx235 Differential Revision: D59932421 Pulled By: cbi42 fbshipit-source-id: 9de973ac51eb7df81b38b8c68110072b1aa06321	2024-08-01 17:49:34 -07:00
Yu Zhang	319374ae67	Add some checks at property block creation side (#12898 ) Summary: Crash test encountered this failure: ```file ingestion error: Corruption: properties unsorted under specified IngestExternalFileOptions: move_files: 0, verify_checksums_before_ingest: 1, verify_checksums_readahead_size: 1048576 (Empty string or missing field indicates default option or value is used``` Further inspection showed out of order table properties in an external file created by `SstFileWriter` for ingestion, and the file is likely created like this because it passed the initial checksum check. This change added some assertions to check invariant at the properties creation and collecting side. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12898 Test Plan: Existing tests Reviewed By: hx235 Differential Revision: D60459817 Pulled By: jowlyzhang fbshipit-source-id: 91474943d2f9d7795f00b6031c08a13ab91e2470	2024-07-31 13:28:17 -07:00
Peter Dillinger	2595476541	Fix rare WAL handling crash (#12899 ) Summary: A crash test failure in log sync in DBImpl::WriteToWAL is due to a missed case in https://github.com/facebook/rocksdb/issues/12734. Just need to apply similar logic from DBImpl::SyncWalImpl to check for an already closed WAL (nullptr writer). This is extremely rare because it only comes from failed Sync on a closed WAL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12899 Test Plan: watch crash test Reviewed By: cbi42 Differential Revision: D60481652 Pulled By: pdillinger fbshipit-source-id: 4a176bb6a53dcf077f88344710a110c2f946c386	2024-07-30 17:38:30 -07:00
Peter Dillinger	9058fd037c	Small CPU optimization to experimental range filters (#12893 ) Summary: By reusing an object that owns a vector. The vector allocation/sizing was substantial in a CPU profile. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12893 Test Plan: existing tests Reviewed By: jowlyzhang Differential Revision: D60405139 Pulled By: pdillinger fbshipit-source-id: 8bfbc07cd9b4829f2ac9015e90f2b4eba61fd984	2024-07-29 14:23:35 -07:00
Hui Xiao	408e8d4c85	Handle injected write error after successful WAL write in crash test + misc (#12838 ) Summary: Context/Summary: We discovered the following false positive in our crash test lately: (1) PUT() writes k/v to WAL but fails in `ApplyWALToManifest()`. The k/v is in the WAL (2) Current stress test logic will rollback the expected state of such k/v since PUT() fails (3) If the DB crashes before recovery finishes and reopens, the WAL will be replayed and the k/v is in the DB while the expected state have been roll-backed. We decided to leave those expected state to be pending until the loop-write of the same key succeeds. Bonus: Now that I realized write to manifest can also fail the write which faces the similar problem as https://github.com/facebook/rocksdb/pull/12797, I decided to disable fault injection on user write per thread (instead of globally) when tracing is needed for prefix recovery; some refactory Pull Request resolved: https://github.com/facebook/rocksdb/pull/12838 Test Plan: Rehearsal CI Run below command (varies on sync_fault_injection=1,0 to verify ExpectedState behavior) for a while to ensure crash recovery validation works fine ``` python3 tools/db_crashtest.py --simple blackbox --interval=30 --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --block_align=1 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=4 --bloom_bits=56.810257702625165 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=1 --checkpoint_one_in=10000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000 --compact_range_one_in=1000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=10 --compress_format_version=1 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc=04:00-08:00 --data_block_index_type=1 --db_write_buffer_size=0 --default_temperature=kWarm --default_write_temperature=kCold --delete_obsolete_files_period_micros=30000000 --delpercent=20 --delrangepercent=20 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=0 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=1 --exclude_wal_from_write_fault_injection=0 --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --fill_cache=1 --flush_one_in=1000000 --format_version=3 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=4 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=10000 --log_file_time_to_roll=60 --log_readahead_size=16777216 --long_running_snapshots=1 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=32768 --max_sequential_skip_in_iterations=1 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=8388608 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=0 --metadata_write_fault_one_in=8 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=0 --open_write_fault_one_in=8 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=2 --prefix_size=7 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=10 --recycle_log_file_num=1 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=0 --secondary_cache_fault_one_in=0 --set_options_one_in=0 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=0 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=0 --strict_bytes_per_sync=1 --subcompactions=4 --sync=1 --sync_fault_injection=0 --table_cache_numshardbits=6 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=2 --uncache_aggressiveness=239 --universal_max_read_amp=-1 --unpartitioned_pinning=1 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=8 --writepercent=40 ``` Reviewed By: cbi42 Differential Revision: D59377075 Pulled By: hx235 fbshipit-source-id: 91f602fd67e2d339d378cd28b982095fd073dcb6	2024-07-29 13:51:49 -07:00
Yu Zhang	9883b5f497	Fix manifest_number_ point to invalid file (#12882 ) Summary: This PR fix `VersionSet`'s `manifest_number_` could be pointing to an invalid number intermediately. This happens when a new manifest roll is attempted but fast failed after loading table handlers and before the new manifest file creation/writing is actually attempted. In theory, a later manifest roll effort will overthrow this intermediate invalid in memory state. There is on harm when the DB crashes in this invalid state either. But efforts that takes a file snapshot of the DB like backup will incorrectly try to copy a non existing manifest file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12882 Reviewed By: cbi42 Differential Revision: D60204956 Pulled By: jowlyzhang fbshipit-source-id: effbdb124b582f879d114988af06ac63867fc549	2024-07-24 17:50:08 -07:00
Yu Zhang	05c9c9aeed	Fix race between test and recovery flush switch memtable (#12884 ) Summary: As titled, to fix this type of data race: https://github.com/facebook/rocksdb/actions/runs/10066814221/job/27829003372?pr=12882 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12884 Test Plan: COMPILE_WITH_TSAN=1 make -j10 db_wal_test ./db_wal_test --gtest_filter=DBWALTest.RecoveryFlushSwitchWALOnEmptyMemtable --gtest_repeat=100 Reviewed By: anand1976 Differential Revision: D60197834 Pulled By: jowlyzhang fbshipit-source-id: 89524cdb4d17a1b647295bcccf5eb2d7d425bc6a	2024-07-24 17:06:16 -07:00
Peter Dillinger	f456a7213f	Refactor IndexBuilder::AddIndexEntry (#12867 ) Summary: Something I am working on is going to expand usage of `BlockBasedTableBuilder::Rep::last_key`, but the existing code contract for `IndexBuilder::AddIndexEntry` makes that difficult because it modifies its `last_key` parameter to be the separator value recorded in the index, often something between the two boundary keys. This change primarily changes the contract of that function and related functions to separate function inputs and outputs, without sacrificing efficiency. For efficiency, a reusable scratch string buffer is provided by the caller, which the callee can use (or not) in returning a result Slice. That should yield a performance improvement as we are reusing a buffer for keys rather than copying into a new one each time in the FindShort* functions, without any additional string copies or conditional branches. Additional improvements in PartitionedIndexBuilder specifically: * Reduce string copies by eliminating `sub_index_last_key_` and instead tracking the key for the next partition in a placeholder Entry. * Simplify code and improve code quality by changing `sub_index_builder_` to unique_ptr. * Eliminate unnecessary NewFlushBlockPolicy call/object. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12867 Test Plan: existing tests, crash test. Will validate performance along with the change this is setting up. Reviewed By: anand1976 Differential Revision: D59793119 Pulled By: pdillinger fbshipit-source-id: 556da75cf13b967511f84702b2713d152f536a07	2024-07-22 14:27:31 -07:00
Hui Xiao	349b1ec08f	Fix duplicate WAL entries caused by write after error recovery (#12873 ) Summary: Context/Summary: We recently discovered a case where write of the same key right after error recovery of a previous failed write of the same key finishes causes two same WAL entries, violating our assertion. This is because we don't advance seqno on failed write and reuse the same WAL containing the failed write for the new write if the memtable at the time is empty. This PR reuses the flush path for an empty memtable to switch WAL and update min WAL to keep in error recovery flush as well as updates the INFO log message for clarity. ``` 2024/07/17-15:01:32.271789 327757 (Original Log Time 2024/07/17-15:01:25.942234) [/flush_job.cc:1017] [default] [JOB 2] Level-0 flush table https://github.com/facebook/rocksdb/issues/9: 0 bytes OK It's an empty SST file from a successful flush so won't be kept in the DB 2024/07/17-15:01:32.271798 327757 (Original Log Time 2024/07/17-15:01:32.269954) [/memtable_list.cc:560] [default] Level-0 commit flush result of table https://github.com/facebook/rocksdb/issues/9 started 2024/07/17-15:01:32.271802 327757 (Original Log Time 2024/07/17-15:01:32.271217) [/memtable_list.cc:760] [default] Level-0 commit flush result of table https://github.com/facebook/rocksdb/issues/9: memtable https://github.com/facebook/rocksdb/issues/1 done ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12873 Test Plan: New UT that failed before this PR with following assertion failure (i.e, duplicate WAL entries) and passes after ``` db_wal_test: db/write_batch.cc:2254: rocksdb::Status rocksdb::{anonymous}::MemTableInserter::PutCFImpl(uint32_t, const rocksdb::Slice&, const rocksdb::Slice&, rocksdb::ValueType, RebuildTxnOp, const ProtectionInfoKVOS64) [with RebuildTxnOp = rocksdb::{anonymous}::MemTableInserter::PutCF(uint32_t, const rocksdb::Slice&, const rocksdb::Slice&)::<lambda(rocksdb::WriteBatch, uint32_t, const rocksdb::Slice&, const rocksdb::Slice&)>; uint32_t = unsigned int; rocksdb::ProtectionInfoKVOS64 = rocksdb::ProtectionInfoKVOS<long unsigned int>]: Assertion `seq_per_batch_' failed. ``` Reviewed By: anand1976 Differential Revision: D59884468 Pulled By: hx235 fbshipit-source-id: 5d854b719092552c69727a979f269fb7f6c39756	2024-07-22 12:40:25 -07:00
Changyu Bi	c064ac3bc5	Avoid opening table files and reading table properties under mutex (#12879 ) Summary: InitInputTableProperties() can open and do IOs and is called under mutex_. This PR removes it from FinalizeInputInfo(). It is now called in CompactionJob::Run() and BuildCompactionJobInfo() (called in NotifyOnCompactionBegin()) without holding mutex_. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12879 Test Plan: existing unit tests. Added assert in GetInputTableProperties() to ensure that input_table_properties_ is initialized whenever it's called. Reviewed By: hx235 Differential Revision: D59933195 Pulled By: cbi42 fbshipit-source-id: c8089e13af8567fa3ab4b94d9ec384ae98ab2ec8	2024-07-19 19:12:45 -07:00
Changyu Bi	4384dd5eee	Support ingesting SST files generated by a live DB (#12750 ) Summary: ... to enable use cases like using RocksDB to merge sort data for ingestion. A new file ingestion option `IngestExternalFileOptions::allow_db_generated_files` is introduced to allows users to ingest SST files generated by live DBs instead of SstFileWriter. For now this only works if the SST files being ingested have zero as their largest sequence number AND do not overlap with any data in the DB (so we can assign seqno 0 which matches the seqno of all ingested keys). The feature is marked the option as experimental for now. Main changes needed to enable this: - ignore CF id mismatch during ingestion - ignore the missing external file version table property Rest of the change is mostly in new unit tests. A previous attempt is in https://github.com/facebook/rocksdb/issues/5602. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12750 Test Plan: - new unit tests Reviewed By: ajkr, jowlyzhang Differential Revision: D58396673 Pulled By: cbi42 fbshipit-source-id: aae513afad7b1ff5d4faa48104df5f384926bf03	2024-07-19 16:14:54 -07:00
anand76	0fca5e31b4	Fix race between manifest error recovery and file ingestion (#12871 ) Summary: This PR fixes an assertion failure in `DBImpl::ResumeImpl` - `assert(!versions_->descriptor_log_)`. In `VersionSet`, `descriptor_log_` has a pointer to the current MANIFEST writer. When there's an error updating the manifest, `descriptor_log_` is reset, and the error recovery thread checks `io_status()` in `VersionSet` and attempts to write a new MANIFEST. If another DB manipulation happens at the same time (like external file ingestion, column family manipulation etc), it calls `LogAndApply`, which also attempts to write a new MANIFEST. The assertion in `ResumeImpl` might fail in this case since the other MANIFEST writer may have updated `descriptor_log_`. To prevent the assertion, this fix updates both `io_status_` and `descriptor_log_` while holding the DB mutex. The other option would have been to simply remove the assert. But I think its important to have it to ensure the invariant that `io_status_` is cleared if the MANIFEST is written successfully, and this fix makes things easier to reason about. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12871 Test Plan: Existing tests and crash test Reviewed By: hx235 Differential Revision: D59926947 Pulled By: anand1976 fbshipit-source-id: af9ad18da3e29fc62c7ec2e30e0738aa33d4e5f1	2024-07-19 10:37:51 -07:00
Peter Dillinger	93b163d1a2	Fix major bug with prefixes, SeekForPrev, and partitioned filters (#12872 ) Summary: Basically, the fix in https://github.com/facebook/rocksdb/issues/8137 was incomplete (and I missed it in the review), because if `whole_key_filtering` is false, then `last_prefix_str_` will never be set to non-empty and the fix doesn't work. Also related to https://github.com/facebook/rocksdb/issues/5835. This is intended as a safe, simple fix that will regress CPU efficiency slightly (for `whole_key_filtering=false` cases, because of extra prefix string copies during flush & compaction). An efficient fix is not possible without some substantial refactoring. Also in this PR: new test DBBloomFilterTest.FilterNumEntriesCoalesce tests an adjacent code path that was previously untested for its effect of ensuring the number of unique prefixes and keys is tracked properly when both prefixes and whole keys are going into a filter. (Test fails when either of the two code segments checking for duplicates is disabled.) In addition, the same test would fail before the main bug fix here because the code would inappropriately add the empty string to the filter (because of unmodified `last_prefix_str_`). Pull Request resolved: https://github.com/facebook/rocksdb/pull/12872 Test Plan: In addition to DBBloomFilterTest.FilterNumEntriesCoalesce, extended DBBloomFilterTest.SeekForPrevWithPartitionedFilters to cover the broken case. (Mostly whitespace change.) Reviewed By: jowlyzhang Differential Revision: D59873793 Pulled By: pdillinger fbshipit-source-id: 2a7b7f09ca73dc188fb4dab833826ad6da7ebb11	2024-07-17 14:08:35 -07:00
Hui Xiao	21db55f816	Move WAL sync before memtable insertion (#12869 ) Summary: Context/Summary: WAL sync currently happens after memtable write. This causes inconvenience in stress test as we can't simply rollback the ExpectedState when write fails due to injected WAL sync error so something complicated like https://github.com/facebook/rocksdb/pull/12838 might be needed. After moving WAL sync before memtable insertion, there should not be injected IO error after memtable insertion so we can keep the current simple way of handling failed write in stress test with ExpectedState rollback. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12869 Test Plan: 1. Below command failed with `iterator has key 0000000000000207000000000000012B0000000000000013, but expected state does not.` before this PR and passes after ``` ./db_stress --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --block_align=1 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=4 --bloom_bits=56.810257702625165 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=1 --checkpoint_one_in=10000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000 --compact_range_one_in=1000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=10 --compress_format_version=1 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc=04:00-08:00 --data_block_index_type=1 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=0 --default_temperature=kWarm --default_write_temperature=kCold --delete_obsolete_files_period_micros=30000000 --delpercent=0 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=0 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=1 --exclude_wal_from_write_fault_injection=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --fill_cache=1 --flush_one_in=1000000 --format_version=3 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=4 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=50 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=10000 --log_file_time_to_roll=60 --log_readahead_size=16777216 --long_running_snapshots=1 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=32768 --max_sequential_skip_in_iterations=1 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=8388608 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=0 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=2 --prefix_size=7 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=0 --recycle_log_file_num=1 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=0 --secondary_cache_fault_one_in=0 --set_options_one_in=0 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=0 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=0 --strict_bytes_per_sync=1 --subcompactions=4 --sync=1 --sync_fault_injection=0 --table_cache_numshardbits=6 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=2 --uncache_aggressiveness=239 --universal_max_read_amp=-1 --unpartitioned_pinning=1 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=128 --writepercent=50 Reviewed By: jowlyzhang Differential Revision: D59825730 Pulled By: hx235 fbshipit-source-id: 7d77aaf177ded2f99bf1ce19f5a4bd0783b9ca92	2024-07-17 13:39:14 -07:00
Hui Xiao	9e4ee7f0c6	Fix non-okay status being ignored in write path under two_write_queues_ (#12866 ) Summary: Context/Summary: see above, though the impact is small. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12866 Test Plan: exiting UT Reviewed By: anand1976 Differential Revision: D59782913 Pulled By: hx235 fbshipit-source-id: ec02843645cce49466bde602035d2e61c31965b8	2024-07-16 10:55:08 -07:00
anand76	5aa675457e	Fix unhandled MANIFEST write errors (#12865 ) Summary: The failure of `WriteCurrentStateToManifest()` in `VersionSet::ProcessManifestWrites()` was not handled properly. If it failed, `manifest_io_status` was not updated, leading to `manifest_file_number_` being updated to the newly created manifest even though its bad. This would lead to the bad manifest immediately getting deleted, and also the good manifest (referenced by `CURRENT`) getting deleted by obsolete file deletion because of `manifest_file_number_` not referencing its number. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12865 Reviewed By: hx235 Differential Revision: D59782940 Pulled By: anand1976 fbshipit-source-id: f752fb9a1c23fd3d734616e273613cbac204301b	2024-07-15 19:13:29 -07:00
Hui Xiao	4ff35afb42	Fix a bug where `OnErrorRecoveryBegin()` is not called before auto-recovery (#12860 ) Summary: Context/Summary: `*auto_recovery` needs to be set true in order for `OnErrorRecoveryBegin()` to be called before auto-recovery `3db030d7ee/db/event_helpers.cc (L64-L66)` Currently it's set false for auto-recovery. This PR fixes it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12860 Test Plan: - Manual observation that it is called - Existing UT Reviewed By: jowlyzhang Differential Revision: D59693315 Pulled By: hx235 fbshipit-source-id: 3f428c5b1e9818bb7697fdcd7f245d11378eb14a	2024-07-15 17:00:14 -07:00
WangQian	755010f8d3	Fix the bug with using the user comparator to compare prefix. (#12862 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/12855 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12862 Reviewed By: cbi42 Differential Revision: D59771651 Pulled By: jowlyzhang fbshipit-source-id: ffe0025143f51f9ce1b46900c3fef6a20eb34f4a	2024-07-15 15:13:29 -07:00
Peter Dillinger	0e3e43f4d1	FaultInjectionTestFS follow-up and clean-up (#12861 ) Summary: In follow-up to https://github.com/facebook/rocksdb/issues/12852: * Use std::copy in place of copy_n for potentially overlapping buffer * Get rid of troublesome -1 idiom from `pos_at_last_append_` and `pos_at_last_sync_` * Small improvements to test FaultInjectionFSTest.ReadUnsyncedData Pull Request resolved: https://github.com/facebook/rocksdb/pull/12861 Test Plan: CI, crash test, etc. Reviewed By: cbi42 Differential Revision: D59757484 Pulled By: pdillinger fbshipit-source-id: c6fbdc2e97c959983184925a855cc8b0285fa23f	2024-07-15 10:28:34 -07:00
Peter Dillinger	72438a6788	Support read & write with unsynced data in FaultInjectionTestFS (#12852 ) Summary: Follow-up to https://github.com/facebook/rocksdb/issues/12729 and others to fix FaultInjectionTestFS handling the case where a live WAL is being appended to and synced while also being copied for checkpoint or backup, up to a known flushed (but not necessarily synced) prefix of the file. It was tricky to structure the code in a way that could handle a tricky race with Sync in another thread (see code comments, thanks Changyu) while maintaining good performance and test-ability. For more context, see the call to FlushWAL() in DBImpl::GetLiveFilesStorageInfo(). Also, the unit test for https://github.com/facebook/rocksdb/issues/12729 was neutered by https://github.com/facebook/rocksdb/issues/12797, and this re-enables the functionality it is testing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12852 Test Plan: unit test expanded/updated. Local runs of blackbox_crash_test. The implementation is structured so that a multi-threaded unit test is not needed to cover at least the code lines, as the race handling is folded into "catch up after returning unsynced and then a sync." Reviewed By: cbi42 Differential Revision: D59594045 Pulled By: pdillinger fbshipit-source-id: 94667bb72255e2952586c53bae2c2dd384e85a50	2024-07-12 16:01:57 -07:00
Konstantin Ilin	5ecb92760a	Create C API function to iterate over WriteBatch for custom Column Families (#12718 ) Summary: Create C API function for iterating over WriteBatch for custom Column Families Adding function to C API that exposes column family specific methods to iterate over WriteBatch: put_cf, delete_cf and merge_cf. This is required when the one needs to read changes for any non-default column family. Without that functionality it is impossible to iterate over changes in WAL that are relevant to custom column families. Fixes https://github.com/facebook/rocksdb/issues/12790 Testing: Added WriteBatch iteration test to "columnfamilies" section of C API unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/12718 Reviewed By: cbi42 Differential Revision: D59483601 Pulled By: ajkr fbshipit-source-id: b68b900636304528a38620a8c3ad82fdce4b60cb	2024-07-09 12:05:08 -07:00
w41ter	b837d41ab1	Expose SizeApproximationFlags to C API (#12836 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12836 Reviewed By: cbi42 Differential Revision: D59502673 Pulled By: ajkr fbshipit-source-id: fc9f77d6740d8efa45d9357662f0f827dbd0511f	2024-07-09 12:00:50 -07:00
Chdy	110ce5f4a3	fix: Round-Robin pri under leveled compaction allows subcompactions b… (#12843 ) Summary: ### Summary: Round-Robin pri under leveled compaction allows subcompactions by default is not compatible with PlainTable ```c++ bool Compaction::ShouldFormSubcompactions() const { if (cfd_ == nullptr) { return false; } // Round-Robin pri under leveled compaction allows subcompactions by default // and the number of subcompactions can be larger than max_subcompactions_ if (cfd_->ioptions()->compaction_pri == kRoundRobin && cfd_->ioptions()->compaction_style == kCompactionStyleLevel) { return output_level_ > 0; } if (max_subcompactions_ <= 1) { return false; } ``` PlainTable does not support Subcompaction, including when AdaptiveTable is applied to PlainTable. subcompaction by default will result in the following error in some scenarios. ```c++ void PlainTableIterator::Seek(const Slice& target) { if (use_prefix_seek_ != !table_->IsTotalOrderMode()) { // This check is done here instead of NewIterator() to permit creating an // iterator with total_order_seek = true even if we won't be able to Seek() // it. This is needed for compaction: it creates iterator with // total_order_seek = true but usually never does Seek() on it, // only SeekToFirst(). status_ = Status::InvalidArgument( "total_order_seek not implemented for PlainTable."); offset_ = next_offset_ = table_->file_info_.data_end_offset; return; } ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12843 Reviewed By: ajkr Differential Revision: D59433477 Pulled By: cbi42 fbshipit-source-id: fb780ba7f7e8efdfedb7480abf14dd38e0b63677	2024-07-08 12:25:11 -07:00
Peter Dillinger	0bb939611d	Avoid unnecessary work in internal calls to GetSortedWalFiles (#12831 ) Summary: We are seeing a number of crash test failures coming from checkpoint and backup code, likely from WalManager::GetSortedWalFiles -> ... -> WalManager::ReadFirstLine and this code path is not needed, because we don't need to know the sequence numbers of WAL files going into a checkpoint or backup. We can minimize the impact of whatever inconsistency is causing that problem by not relying on it where it's not needed. Similarly, when we only need a roughly accurate set of current WAL files, we don't need to query all the archived WAL files (and redundantly the live ones again). So this reduces filesystem queries and DB mutex acquires in creating backups and checkpoints. Needed follow-up: Figure out what is causing various failures with an apparent inconsistency where GetSortedWalFiles fails on reading a WAL file. If it's an injected failure, perhaps it's not propagating that injected failure appropriately. It might also be an inconsistency between what the DB knows is flushed and what WalManager reads from the filesystem (which we know is dubious and should be phased out, which this is arguably another step toward). Or completing that phase-out might solve the problem without a full diagnosis. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12831 Test Plan: existing tests (easily caught when I went too far in initally developing this change) Update to BackupUsingDirectIO test so that there's a WAL file in what is backed up. (Was relying on some oddity.) Reviewed By: cbi42 Differential Revision: D59252649 Pulled By: pdillinger fbshipit-source-id: 7ad4187a1c70caa59a6d6c1c643ef95232b929f5	2024-07-01 23:29:02 -07:00
Jeffery	093f4ef82c	Fix db_rate_limiter_test for win (#12816 ) Summary: We didn't implement file system prefetch for OS Win. During table open, it uses `FilePrefetchBuffer` instead and only do 1 read instead of 4 in BufferedIO. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12816 Reviewed By: jaykorean Differential Revision: D59181835 Pulled By: ajkr fbshipit-source-id: 18b8f0247408cd1a80f289357ede5232ae5a3c66	2024-07-01 16:14:19 -07:00
HypenZou	5bb7f95ed6	Don't take archived log size into account when calculating log size for flush (#12680 ) Summary: Context/Summary: It seems unreasonable to take the archived log size into account when calculating log size for flush in method CreateCheckpoint. If the user sets WAL_ttl_seconds or WAL_size_limit_MB, the argument _log_size_for_flush_ can easily be reached due to the size of the archived dir. As a result, the flush may always be triggered. Test corverd by ./checkpoint_test Pull Request resolved: https://github.com/facebook/rocksdb/pull/12680 Reviewed By: jaykorean Differential Revision: D59097904 Pulled By: ajkr fbshipit-source-id: 0ed29c1b078d8f40b85288541b008e00dbc517d3	2024-06-28 11:56:26 -07:00
Changyu Bi	a31fe52173	Remove the return value of `SetBGError()` (#12792 ) Summary: the return value for `ErrorHandler::SetBGError(error)` seems to be not well-defined, it can be `bg_error_` (no matter if the `bg_error_` is set to the input error), ok status or [`recovery_error_`](`3ee4d5a11a/db/error_handler.cc (L669)`) from `StartRecoverFromRetryableBGIOError()`. The `recovery_error_` returned may be an OK status. We have only a few places that use the return value of `SetBGError()` and they don't need to do so. Using the return value may even be wrong for example in `3ee4d5a11a/db/db_impl/db_impl_write.cc (L2365)` where a non-ok `s` could be overwritten to OK. This PR changes SetBGError() to return void and clean up relevant code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12792 Test Plan: existing unit tests and go over all places where return value of `SetBGError()` is used. Reviewed By: hx235 Differential Revision: D58904898 Pulled By: cbi42 fbshipit-source-id: d58a20ba5a40e3f35367c6034a32c755088c3653	2024-06-26 18:17:05 -07:00
Jason Volk	8bf1f6f87f	Add info logging via callback to C api. (#12537 ) Summary: I'd like to get this in so the Rust folks can integrate with their splendid logging/tracing frameworks; will be hugely appreciated. 🙏🏻 The infolog capabilities for C embeddings are quite spartan. LOG files were generated involuntarily until redirection to stderr was added by https://github.com/facebook/rocksdb/issues/12262; still insufficient for apps which cannot tolerate pollution of their stdio and tend to have existing logging frameworks to tie into for that. Adds a very minimal derive of Logger around a C callback, written in the spirit, useful for FFI interfaces from other languages to integrate infolog. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12537 Reviewed By: ajkr Differential Revision: D57597766 Pulled By: cbi42 fbshipit-source-id: ec684ce4ddf77a0a6ebbf013a1bacb4ff2e49eb0	2024-06-26 10:29:05 -07:00
Peter Dillinger	39455974cb	Fix possible double-free on TruncatedRangeDelIterator (#12805 ) Summary: Not sure where or how it happens, but using a recent CircleCI failure I got a reliable db_stress reproducer. Using std::unique_ptr appropriately for managing them has apparently (and unsurprisingly) fixed the problem without needing to know exactly where the problem was. Suggested follow-up: * Three or even four levels of pointers is very confusing to work with. Surely this part can be cleaned up to be simpler. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12805 Test Plan: Reproducer passes, plus ASAN test and crash test runs. I don't think it's worth the extra work to track down the details and create a careful unit test. ``` ./db_stress --WAL_size_limit_MB=1 --WAL_ttl_seconds=60 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=2 --advise_random_on_open=1 --allow_data_in_errors=True --allow_fallocate=1 --async_io=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --block_align=1 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=15 --bottommost_compression_type=none --bottommost_file_compaction_delay=3600 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=33554432 --cache_type=tiered_lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --check_multiget_consistency=1 --check_multiget_entity_consistency=1 --checkpoint_one_in=10000 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000 --compaction_pri=0 --compaction_readahead_size=0 --compaction_ttl=0 --compress_format_version=2 --compressed_secondary_cache_ratio=0.2 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0 --db=/dev/shm/rocksdb.gpxs/rocksdb_crashtest_blackbox --db_write_buffer_size=0 --default_temperature=kWarm --default_write_temperature=kCold --delete_obsolete_files_period_micros=21600000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=1 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=0 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=0 --expected_values_dir=/dev/shm/rocksdb.gpxs/rocksdb_crashtest_expected --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --fill_cache=1 --flush_one_in=1000000 --format_version=3 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=100000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0 --index_block_restart_interval=4 --index_shortening=0 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kHot --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=1000000 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=1 --low_pri_pool_ratio=0 --lowest_used_cache_tier=2 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=2500000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=1 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=100 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=0 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=0 --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=100 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=0 --optimize_multiget_for_io=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=1 --preserve_internal_time_seconds=60 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=32 --readahead_size=524288 --readpercent=50 --recycle_log_file_num=1 --reopen=0 --report_bg_io_stats=1 --reset_stats_one_in=10000 --sample_for_compression=5 --secondary_cache_fault_one_in=32 --secondary_cache_uri= --set_options_one_in=10000 --skip_stats_update_on_db_open=0 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=bar --sqfc_version=1 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=3 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=1 --top_level_index_pinning=1 --uncache_aggressiveness=5 --universal_max_read_amp=-1 --unpartitioned_pinning=2 --use_adaptive_mutex=0 --use_adaptive_mutex_lru=0 --use_attribute_group=1 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=1 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=1 --verify_db_one_in=100000 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=1048576 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=35 ``` Reviewed By: cbi42 Differential Revision: D58958390 Pulled By: pdillinger fbshipit-source-id: 1271cfdcc3c574f78cd59f3c68148f7ed4a19c47	2024-06-24 11:51:16 -07:00

1 2 3 4 5 ...

5712 commits