Summary:
We have a request to use the cold tier as primary source of truth for the DB, and to best support such use cases and to complement the existing options controlling SST file temperatures, we add two new DB options:
* `metadata_write_temperature` for DB "small" files that don't contain much user data
* `wal_write_temperature` for WALs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12957
Test Plan: Unit test included, though it's hard to be sure we've covered all the places
Reviewed By: jowlyzhang
Differential Revision: D61664815
Pulled By: pdillinger
fbshipit-source-id: 8e19c9dd8fd2db059bb15f74938d6bc12002e82b
Summary:
Issue: MultiGet(PinnableSlice) can't read out all timestamps.
Fixed the impl, and added an UT as well. In the original impl, if MultiGet reads multiple column families, a later column family would clean up timestamps of previous column family.
Fix: https://github.com/facebook/rocksdb/issues/12950#issue-2476996580
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12943
Reviewed By: anand1976
Differential Revision: D61729257
Pulled By: pdillinger
fbshipit-source-id: 55267c26076c8a59acedd27e14714711729a40df
Summary:
this helps to avoid scanning input files when ingesting db generated files: ecb844babd/db/external_sst_file_ingestion_job.cc (L917-L935)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12951
Test Plan:
* `IngestDBGeneratedFileTest.FailureCase` is updated to verify that this table property is verified during ingestion
* existing unit tests for other ingestion use cases.
Reviewed By: jowlyzhang
Differential Revision: D61608285
Pulled By: cbi42
fbshipit-source-id: b5b7aae9741531349ab247be6ffaa3f3628b76ca
Summary:
add a new CF option `paranoid_memory_checks` that allows additional data integrity validations during read/scan. Currently, skiplist-based memtable will validate the order of keys visited. Further data validation can be added in different layers. The option will be opt-in due to performance overhead.
The motivation for this feature is for services where data correctness is critical and want to detect in-memory corruption earlier. For a corrupted memtable key, this feature can help to detect it during during reads instead of during flush with existing protections (OutputValidator that verifies key order or per kv checksum). See internally linked task for more context.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12889
Test Plan:
* new unit test added for paranoid_memory_checks=true.
* existing unit test for paranoid_memory_checks=false.
* enable in stress test.
Performance Benchmark: we check for performance regression in read path where data is in memtable only. For each benchmark, the script was run at the same time for main and this PR:
* Memtable-only randomread ops/sec:
```
(for I in $(seq 1 50);do ./db_bench --benchmarks=fillseq,readrandom --write_buffer_size=268435456 --writes=250000 --num=250000 --reads=500000 --seed=1723056275 2>&1 | grep "readrandom"; done;) | awk '{ t += $5; c++; print } END { print 1.0 * t / c }';
Main: 608146
PR with paranoid_memory_checks=false: 607727 (- %0.07)
PR with paranoid_memory_checks=true: 521889 (-%14.2)
```
* Memtable-only sequential scan ops/sec:
```
(for I in $(seq 1 50); do ./db_bench--benchmarks=fillseq,readseq[-X10] --write_buffer_size=268435456 --num=1000000 --seed=1723056275 2>1 | grep "\[AVG 10 runs\]"; done;) | awk '{ t += $6; c++; print; } END { printf "%.0f\n", 1.0 * t / c }';
Main: 9180077
PR with paranoid_memory_checks=false: 9536241 (+%3.8)
PR with paranoid_memory_checks=true: 7653934 (-%16.6)
```
* Memtable-only reverse scan ops/sec:
```
(for I in $(seq 1 20); do ./db_bench --benchmarks=fillseq,readreverse[-X10] --write_buffer_size=268435456 --num=1000000 --seed=1723056275 2>1 | grep "\[AVG 10 runs\]"; done;) | awk '{ t += $6; c++; print; } END { printf "%.0f\n", 1.0 * t / c }';
Main: 1285719
PR with integrity_checks=false: 1431626 (+%11.3)
PR with integrity_checks=true: 811031 (-%36.9)
```
The `readrandom` benchmark shows no regression. The scanning benchmarks show improvement that I can't explain.
Reviewed By: pdillinger
Differential Revision: D60414267
Pulled By: cbi42
fbshipit-source-id: a70b0cbeea131f1a249a5f78f9dc3a62dacfaa91
Summary:
Add an optional callback function upon remote compaction temp output installation. This will be internally used for setting the final status in the Offload Infra.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12940
Test Plan:
Unit Test added
```
./compaction_service_test
```
_Also internally tested by manually merging into internal code base_
Reviewed By: anand1976
Differential Revision: D61419157
Pulled By: jaykorean
fbshipit-source-id: 66831685bc403949c26bfc65840dd1900d2a5a67
Summary:
This PR make best efforts recovery more permissive by allowing it to recover incomplete Version that presents a valid point in time view from the user's perspective. Currently, a Version is only valid and saved if all files consisting that Version can be found. With this change, if only a suffix of L0 files (and their associated blob files) are missing, a valid Version is also available to be saved and recover to. Note that we don't do this if the column family was atomically flushed. Because atomic flush also need a consistent view across the column families, we cannot guarantee that if we are recovering to incomplete version.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12938
Test Plan: Existing tests and added unit tests.
Reviewed By: anand1976
Differential Revision: D61414381
Pulled By: jowlyzhang
fbshipit-source-id: f9b73deb34d35ad696ab42315928b656d586262a
Summary:
Partitioned metadata blocks were introduced back in 2017 to deal more gracefully with large DBs where RAM is relatively scarce and some data might be much colder than other data. The feature allows metadata blocks to compete for memory in the block cache against data blocks while alleviating tail latencies and thrash conditions that can arise with large metadata blocks (sometimes megabytes each) that can arise with large SST files. In general, the cost to partitioned metadata is more CPU in accesses (especially for filters where more binary search is needed before hashing can be used) and a bit more memory fragmentation and related overheads.
However the feature has always had a subtle limitation with a subtle effect on performance: index partitions and filter partitions must be cut at the same time, regardless of which wins the space race (hahaha) to metadata_block_size. Commonly filters will be a few times larger than indexes, so index partitions will be under-sized compared to filter (and data) blocks. While this does affect fragmentation and related overheads a bit, I suspect the bigger impact on performance is in the block cache. The coupling of the partition cuts would be defensible if the binary search done to find the filter block was used (on filter hit) to short-circuit binary search to an index partition, but that optimization has not been developed.
Consider two metadata blocks, an under-sized one and a normal-sized one, covering proportional sections of the key space with the same density of read queries. The under-sized one will be more prone to eviction from block cache because it is used less often. This is unfair because of its despite its proportionally smaller cost of keeping in block cache, and most of the cost of a miss to re-load it (random IO) is not proportional to the size (similar latency etc. up to ~32KB).
## This change
Adds a new table option decouple_partitioned_filters allows filter blocks and index blocks to be cut independently. To make this work, the partitioned filter block builder needs to know about the previous key, to generate an appropriate separator for the partition index. In most cases, BlockBasedTableBuilder already has easy access to the previous key to provide to the filter block builder.
This change includes refactoring to pass that previous key to the filter builder when available, with the filter building caching the previous key itself when unavailable, such as during compression dictionary training and some unit tests. Access to the previous key eliminates the need to track the previous prefix, which results in a small SST construction CPU win in prefix filtering cases, regardless of coupling, and possibly a small regression for some non-prefix cases, regardless of coupling, but still overall improvement especially with https://github.com/facebook/rocksdb/issues/12931.
Suggested follow-up:
* Update confusing use of "last key" to refer to "previous key"
* Expand unit test coverage with parallel compression and dictionary training
* Consider an option or enhancement to alleviate under-sized metadata blocks "at the end" of an SST file due to no coordination or awareness of when files are cut.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12939
Test Plan:
unit tests updated. Also did some unit test runs with "hard wired" usage of parallel compression and dictionary training code paths to ensure they were working. Also ran blackbox_crash_test for a while with the new feature.
## SST write performance (CPU)
Using the same testing setup as in https://github.com/facebook/rocksdb/issues/12931 but with -decouple_partitioned_filters=1 in the "after" configuration, which benchmarking shows makes almost no difference in terms of SST write CPU. "After" vs. "before" this PR
```
-partition_index_and_filters=0 -prefix_size=0 -whole_key_filtering=1
923691 vs. 924851 (-0.13%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=0
921398 vs. 922973 (-0.17%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=1
902259 vs. 908756 (-0.71%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0
917932 vs. 916901 (+0.60%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0
912755 vs. 907298 (+0.60%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=1
899754 vs. 892433 (+0.82%)
```
I think this is a pretty good trade, especially in attracting more movement toward partitioned configurations.
## Read performance
Let's see how decoupling affects read performance across various degrees of memory constraint. To simplify LSM structure, we're using FIFO compaction. Since decoupling will overall increase metadata block size, we control for this somewhat with an extra "before" configuration with larger metadata block size setting (8k instead of 4k). Basic setup:
```
(for CS in 0300 1200; do TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillrandom,flush,readrandom,block_cache_entry_stats -num=5000000 -duration=30 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=10 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters=1 -statistics=1 -cache_size=${CS}000000 -metadata_block_size=4096 -decouple_partitioned_filters=1 2>&1 | tee results-$CS; done)
```
And read ops/s results:
```CSV
Cache size MB,After/decoupled/4k,Before/4k,Before/8k
3,15593,15158,12826
6,16295,16693,14134
10,20427,20813,18459
20,27035,26836,27384
30,33250,31810,33846
60,35518,32585,35329
100,36612,31805,35292
300,35780,31492,35481
1000,34145,31551,35411
1100,35219,31380,34302
1200,35060,31037,34322
```
If you graph this with log scale on the X axis (internal link: https://pxl.cl/5qKRc), you see that the decoupled/4k configuration is essentially the best of both the before/4k and before/8k configurations: handles really tight memory closer to the old 4k configuration and handles generous memory closer to the old 8k configuration.
Reviewed By: jowlyzhang
Differential Revision: D61376772
Pulled By: pdillinger
fbshipit-source-id: fc2af2aee44290e2d9620f79651a30640799e01f
Summary:
This is in part a refactoring / simplification to set up for "decoupled" partitioned filters and in part to fix an intentional regression for a correctness fix in https://github.com/facebook/rocksdb/issues/12872. Basically, we are taking out some complexity of the filter block builders, and pushing part of it (simultaneous de-duplication of prefixes and whole keys) into the filter bits builders, where it is more efficient by operating on hashes (rather than copied keys).
Previously, the FullFilterBlockBuilder had a somewhat fragile and confusing set of conditions under which it would keep a copy of the most recent prefix and most recent whole key, along with some other state that is essentially redundant. Now we just track (always) the previous prefix in the PartitionedFilterBlockBuilder, to deal with the boundary prefix Seek filtering problem. (Btw, the next PR will optimize this away since BlockBasedTableReader already tracks the previous key.) And to deal with the problem of de-duplicating both whole keys and prefixes going into a single filter, we add a new function to FilterBitsBuilder that has that extra de-duplication capabilty, which is relatively efficient because we only have to cache an extra 64-bit hash, not a copied key or prefix. (The API of this new function is somewhat awkward to avoid a small CPU regression in some cases.)
Also previously, there was awkward logic split between FullFilterBlockBuilder and PartitionedFilterBlockBuilder to deal with some things specific to partitioning. And confusing names like Add vs. AddKey. FullFilterBlockBuilder is much cleaner and simplified now.
The splitting of PartitionedFilterBlockBuilder::MaybeCutAFilterBlock into DecideCutAFilterBlock and CutAFilterBlock is to address what would have been a slight performance regression in some cases. The split allows for more intruction-level parallelism by reducing unnecessary control dependencies.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12931
Test Plan:
existing tests (with some minor updates)
Also manually ported over the pre-broken regression test described in
https://github.com/facebook/rocksdb/issues/12870 and ran it (passed).
Performance:
Here we validate that an entire series of recent related PRs are a net improvement in aggregate. "Before" is with these PRs reverted: https://github.com/facebook/rocksdb/issues/12872#12911https://github.com/facebook/rocksdb/issues/12874#12867https://github.com/facebook/rocksdb/issues/12903#12904. "After" includes this PR (and all
of those, with base revision 16c21af). Simultaneous test script designed to maximally depend on SST construction efficiency:
```
for PF in 0 1; do for PS in 0 8; do for WK in 0 1; do [ "$PS" == "$WK" ] || (for I in `seq 1 20`; do TEST_TMPDIR=/dev/shm/rocksdb2 ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -memtablerep=vector -allow_concurrent_memtable_write=0 -bloom_bits=10 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK 2>&1 | grep micros/op; done) | awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; echo "Was -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK"; done; done; done) | tee results
```
Showing average ops/sec of "after" vs. "before"
```
-partition_index_and_filters=0 -prefix_size=0 -whole_key_filtering=1
935586 vs. 928176 (+0.79%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=0
930171 vs. 926801 (+0.36%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=1
910727 vs. 894397 (+1.8%)
-partition_index_and_filters=1 -prefix_size=0 -whole_key_filtering=1
929795 vs. 922007 (+0.84%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0
921924 vs. 917285 (+0.51%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=1
903393 vs. 887340 (+1.8%)
```
As one would predict, the most improvement is seen in cases where we have optimized away copying the whole key.
Reviewed By: jowlyzhang
Differential Revision: D61138271
Pulled By: pdillinger
fbshipit-source-id: 427cef0b1465017b45d0a507bfa7720fa20af043
Summary:
`VersionEditHandlerPointInTime` is tracking found files, missing files, intermediate files in order to decide to build a `Version` on negative edge trigger (transition from valid to invalid) without applying the current `VersionEdit`. However, applying `VersionEdit` and check completeness of a `Version` are specialization of `VersionBuilder`. More importantly, when we augment best efforts recovery to recover not just complete point in time Version but also a prefix of seqno for a point in time Version, such checks need to be duplicated in `VersionEditHandlerPointInTime` and `VersionBuilder`.
To avoid this, this refactor move all the file tracking functionality in `VersionEditHandlerPointInTime` into `VersionBuilder`. To continue to let `VersionEditHandlerPIT` do the edge trigger check and build a `Version` before applying the current `VersionEdit`, a suite of APIs to supporting creating a save point and its associated functions are added in `VersionBuilder` to achieve this.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12928
Test Plan: Existing tests
Reviewed By: anand1976
Differential Revision: D61171320
Pulled By: jowlyzhang
fbshipit-source-id: 604f66f8b1e3a3e13da59d8ba357c74e8a366dbc
Summary:
Add a couple of ticker stats for corruption retry count and successful retries. This PR also eliminates an extra read attempt when there's a checksum mismatch in a block read from the prefetch buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12923
Test Plan: Update existing tests
Reviewed By: jowlyzhang
Differential Revision: D61024687
Pulled By: anand1976
fbshipit-source-id: 3a08403580ab244000e0d480b7ee0f5a03d76b06
Summary:
![image](https://github.com/user-attachments/assets/3290fe18-aca2-4691-b072-fbbc96a15fb1)
this testcase set syncpoint function which reference this test case heap variable "enable_per_key_placement_" and this sync point function will be triggered by another testcase, so asan will report asan heap use after free error
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12908
Reviewed By: hx235
Differential Revision: D60973363
Pulled By: cbi42
fbshipit-source-id: df4f488f51e7741784d5a92fc0a5fc538c5d5b1a
Summary:
Corruption status returned by `GetFromTable()` could be overwritten here: b6c3495a71/db/version_set.cc (L2614)
This PR fixes this issue by setting `*(s->found_final_value) = true;` in SaveValue. Also makes the handling of the return value of `GetFromTable()` more robust and added asserts in a couple places.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12842
Test Plan: Updated an existing unit test to cover MultiGet. It fails the assertion here before this PR: b6c3495a71/db/version_set.cc (L2601)
Reviewed By: anand1976
Differential Revision: D59498203
Pulled By: cbi42
fbshipit-source-id: 1f071c1b2c5b66fb71264b547a9e670d1cf592f0
Summary:
I was investigating a crash test failure with "Corruption: SST file is ahead of WALs" which I haven't reproduced, but I did reproduce a data loss issue on recovery which I suspect could be the same root problem. The problem is already somewhat known (see https://github.com/facebook/rocksdb/issues/12403 and https://github.com/facebook/rocksdb/issues/12639) where it's only safe to recovery multiple recycled WAL files with trailing old data if the sequence numbers between them are adjacent (to ensure we didn't lose anything in the corrupt/obsolete WAL tail).
However, aside from disableWAL=true, there are features like external file ingestion that can increment the sequence numbers without writing to the WAL. It is simply unsustainable to worry about this kind of feature interaction limiting where we can consume sequence numbers. It is very hard to test and audit as well. For reliable crash recovery of recycled WALs, we need a better way of detecting that we didn't drop data from one WAL to the next.
Until then, let's disable WAL recycling in the crash test, to help stabilize it.
Ideas for follow-up to fix the underlying problem:
(a) With recycling, we could always sync the WAL before opening the next one. HOWEVER, this potentially very large sync could cause a big hiccup in writes (vs. O(1) sized manifest sync).
(a1) The WAL sync could ensure it is truncated to size, or
(a2) By requiring track_and_verify_wals_in_manifest, we could assume that the last synced size in the manifest is the final usable size of the WAL. (It might also be worth avoiding truncating recycled WALs.)
(b) Add a new mechanism to record and verify the final size of a WAL without requiring a sync.
(b1) By requiring track_and_verify_wals_in_manifest, this could be new WAL metadata recorded in the manifest (at the time of switching WALs). Note that new fields of WalMetadata are not forward-compatible, but a new kind of manifest record (next to WalAddition, WalDeletion; e.g. WalCompletion) is IIRC forward-compatible.
(b2) A new kind of WAL header entry (not forward compatible, unfortunately) could record the final size of the previous WAL.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12918
Test Plan: Added disabled reproducer for non-linear data loss on recovery
Reviewed By: hx235
Differential Revision: D60917527
Pulled By: pdillinger
fbshipit-source-id: 3663d79aec81851f5cf41669f84a712bb4563fd7
Summary:
Ahead of a "decoupled" variant of partitioned filters, refactoring this unit test file to make it easier to incorporate that new variant.
* bool test param to new enum class FilterPartitioning
* Some cases of iterating over that bool to new parameterized test
* Combine some common functionality for configuring parameterized options
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12911
Test Plan: no production changes, and no intentional changes to scope or conditions of tests
Differential Revision: D60701287
fbshipit-source-id: 3497e3230e29a4f62c934bcb75693965a2df41d8
Summary:
In normal use cases, meta info like column family's timestamp size is tracked at the transaction layer, so it's not necessary and even detrimental to track such info inside the internal WriteBatch because it may let anti-patterns like bypassing Transaction write APIs and directly write to its internal WriteBatch like this:
9d0a754dc9/storage/rocksdb/ha_rocksdb.cc (L4949-L4950)
Setting this option to true will keep aforementioned use case continue to work before it's refactored out. This option is only for this purpose and it will be gradually deprecated after aforementioned MyRocks use case are refactored.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12864
Test Plan: Added unit tests
Reviewed By: cbi42
Differential Revision: D60194094
Pulled By: jowlyzhang
fbshipit-source-id: 64a98822167e99aa7e4fa2a60085d44a5deaa45c
Summary:
https://github.com/facebook/rocksdb/issues/12891 updated this deletion rate in the test to be much higher, which makes the test flaky. The rate is being intentionally set to very low to maximize the retention of a ".log.trash" file after DB closes. This PR just change it back.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12915
Reviewed By: ltamasi
Differential Revision: D60776312
Pulled By: jowlyzhang
fbshipit-source-id: d193557a042c65816fcc337cceb09905e042e9f6
Summary:
Make `DestroyDB` slowly delete files if it's configured and enabled via `SstFileManager`.
It's currently not available mainly because of DeleteScheduler's logic related to tracked total_size_ and total_trash_size_. These accounting and logic should not be applied to `DestroyDB`. This PR adds a `DeleteUnaccountedDBFile` util for this purpose which deletes files without accounting it. This util also supports assigning a file to a specified trash bucket so that user can later wait for a specific trash bucket to be empty. For `DestroyDB`, files with more than 1 hard links will be deleted immediately.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12891
Test Plan: Added unit tests, existing tests.
Reviewed By: anand1976
Differential Revision: D60300220
Pulled By: jowlyzhang
fbshipit-source-id: 8b18109a177a3a9532f6dc2e40e08310c08ca3c7
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12910
There is currently a call to `GetBGError()` in `DBImpl::WriteImplWALOnly()` where the DB mutex is (incorrectly) not held, leading to a data race. Technically, we could acquire the mutex here but instead, the patch removes the affected check altogether, since the same check is already performed (in a thread-safe manner) in the subsequent call to `PreprocessWrite()`.
Reviewed By: cbi42
Differential Revision: D60682008
fbshipit-source-id: 54b67975dcf57d67c068cac71e8ada09a1793ec5
Summary:
In leveled compaction, we pick intra-L0 compaction instead of L0->Lbase whenever L0 size is small. When L0 files contain many deletions, it makes more sense to compact then down instead of accumulating tombstones in L0. This PR uses compensated_file_size when computing L0 size for determining intra-L0 compaction. Also scale down the limit on total L0 size further to be more cautious about accumulating data in L0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12878
Test Plan: updated unit test.
Reviewed By: hx235
Differential Revision: D59932421
Pulled By: cbi42
fbshipit-source-id: 9de973ac51eb7df81b38b8c68110072b1aa06321
Summary:
Crash test encountered this failure:
```file ingestion error: Corruption: properties unsorted under specified IngestExternalFileOptions: move_files: 0, verify_checksums_before_ingest: 1, verify_checksums_readahead_size: 1048576 (Empty string or missing field indicates default option or value is used```
Further inspection showed out of order table properties in an external file created by `SstFileWriter` for ingestion, and the file is likely created like this because it passed the initial checksum check. This change added some assertions to check invariant at the properties creation and collecting side.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12898
Test Plan: Existing tests
Reviewed By: hx235
Differential Revision: D60459817
Pulled By: jowlyzhang
fbshipit-source-id: 91474943d2f9d7795f00b6031c08a13ab91e2470
Summary:
A crash test failure in log sync in DBImpl::WriteToWAL is due to a missed case in https://github.com/facebook/rocksdb/issues/12734. Just need to apply similar logic from DBImpl::SyncWalImpl to check for an already closed WAL (nullptr writer). This is extremely rare because it only comes from failed Sync on a closed WAL.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12899
Test Plan: watch crash test
Reviewed By: cbi42
Differential Revision: D60481652
Pulled By: pdillinger
fbshipit-source-id: 4a176bb6a53dcf077f88344710a110c2f946c386
Summary:
By reusing an object that owns a vector. The vector allocation/sizing was substantial in a CPU profile.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12893
Test Plan: existing tests
Reviewed By: jowlyzhang
Differential Revision: D60405139
Pulled By: pdillinger
fbshipit-source-id: 8bfbc07cd9b4829f2ac9015e90f2b4eba61fd984
Summary:
**Context/Summary:**
We discovered the following false positive in our crash test lately:
(1) PUT() writes k/v to WAL but fails in `ApplyWALToManifest()`. The k/v is in the WAL
(2) Current stress test logic will rollback the expected state of such k/v since PUT() fails
(3) If the DB crashes before recovery finishes and reopens, the WAL will be replayed and the k/v is in the DB while the expected state have been roll-backed.
We decided to leave those expected state to be pending until the loop-write of the same key succeeds.
Bonus: Now that I realized write to manifest can also fail the write which faces the similar problem as https://github.com/facebook/rocksdb/pull/12797, I decided to disable fault injection on user write per thread (instead of globally) when tracing is needed for prefix recovery; some refactory
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12838
Test Plan:
Rehearsal CI
Run below command (varies on sync_fault_injection=1,0 to verify ExpectedState behavior) for a while to ensure crash recovery validation works fine
```
python3 tools/db_crashtest.py --simple blackbox --interval=30 --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --block_align=1 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=4 --bloom_bits=56.810257702625165 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=1 --checkpoint_one_in=10000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000 --compact_range_one_in=1000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=10 --compress_format_version=1 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc=04:00-08:00 --data_block_index_type=1 --db_write_buffer_size=0 --default_temperature=kWarm --default_write_temperature=kCold --delete_obsolete_files_period_micros=30000000 --delpercent=20 --delrangepercent=20 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=0 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=1 --exclude_wal_from_write_fault_injection=0 --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --fill_cache=1 --flush_one_in=1000000 --format_version=3 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=4 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=10000 --log_file_time_to_roll=60 --log_readahead_size=16777216 --long_running_snapshots=1 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=32768 --max_sequential_skip_in_iterations=1 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=8388608 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=0 --metadata_write_fault_one_in=8 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=0 --open_write_fault_one_in=8 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=2 --prefix_size=7 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=10 --recycle_log_file_num=1 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=0 --secondary_cache_fault_one_in=0 --set_options_one_in=0 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=0 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=0 --strict_bytes_per_sync=1 --subcompactions=4 --sync=1 --sync_fault_injection=0 --table_cache_numshardbits=6 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=2 --uncache_aggressiveness=239 --universal_max_read_amp=-1 --unpartitioned_pinning=1 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=8 --writepercent=40
```
Reviewed By: cbi42
Differential Revision: D59377075
Pulled By: hx235
fbshipit-source-id: 91f602fd67e2d339d378cd28b982095fd073dcb6
Summary:
This PR fix `VersionSet`'s `manifest_number_` could be pointing to an invalid number intermediately. This happens when a new manifest roll is attempted but fast failed after loading table handlers and before the new manifest file creation/writing is actually attempted.
In theory, a later manifest roll effort will overthrow this intermediate invalid in memory state. There is on harm when the DB crashes in this invalid state either. But efforts that takes a file snapshot of the DB like backup will incorrectly try to copy a non existing manifest file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12882
Reviewed By: cbi42
Differential Revision: D60204956
Pulled By: jowlyzhang
fbshipit-source-id: effbdb124b582f879d114988af06ac63867fc549
Summary:
Something I am working on is going to expand usage of `BlockBasedTableBuilder::Rep::last_key`, but the existing code contract for `IndexBuilder::AddIndexEntry` makes that difficult because it modifies its `last_key` parameter to be the separator value recorded in the index, often something between the two boundary keys.
This change primarily changes the contract of that function and related functions to separate function inputs and outputs, without sacrificing efficiency. For efficiency, a reusable scratch string buffer is provided by the caller, which the callee can use (or not) in returning a result Slice. That should yield a performance improvement as we are reusing a buffer for keys rather than copying into a new one each time in the FindShort* functions, without any additional string copies or conditional branches.
Additional improvements in PartitionedIndexBuilder specifically:
* Reduce string copies by eliminating `sub_index_last_key_` and instead tracking the key for the next partition in a placeholder Entry.
* Simplify code and improve code quality by changing `sub_index_builder_` to unique_ptr.
* Eliminate unnecessary NewFlushBlockPolicy call/object.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12867
Test Plan: existing tests, crash test. Will validate performance along with the change this is setting up.
Reviewed By: anand1976
Differential Revision: D59793119
Pulled By: pdillinger
fbshipit-source-id: 556da75cf13b967511f84702b2713d152f536a07
Summary:
**Context/Summary:**
We recently discovered a case where write of the same key right after error recovery of a previous failed write of the same key finishes causes two same WAL entries, violating our assertion. This is because we don't advance seqno on failed write and reuse the same WAL containing the failed write for the new write if the memtable at the time is empty.
This PR reuses the flush path for an empty memtable to switch WAL and update min WAL to keep in error recovery flush
as well as updates the INFO log message for clarity.
```
2024/07/17-15:01:32.271789 327757 (Original Log Time 2024/07/17-15:01:25.942234) [/flush_job.cc:1017] [default] [JOB 2] Level-0 flush table https://github.com/facebook/rocksdb/issues/9: 0 bytes OK It's an empty SST file from a successful flush so won't be kept in the DB
2024/07/17-15:01:32.271798 327757 (Original Log Time 2024/07/17-15:01:32.269954) [/memtable_list.cc:560] [default] Level-0 commit flush result of table https://github.com/facebook/rocksdb/issues/9 started
2024/07/17-15:01:32.271802 327757 (Original Log Time 2024/07/17-15:01:32.271217) [/memtable_list.cc:760] [default] Level-0 commit flush result of table https://github.com/facebook/rocksdb/issues/9: memtable https://github.com/facebook/rocksdb/issues/1 done
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12873
Test Plan:
New UT that failed before this PR with following assertion failure (i.e, duplicate WAL entries) and passes after
```
db_wal_test: db/write_batch.cc:2254: rocksdb::Status rocksdb::{anonymous}::MemTableInserter::PutCFImpl(uint32_t, const rocksdb::Slice&, const rocksdb::Slice&, rocksdb::ValueType, RebuildTxnOp, const ProtectionInfoKVOS64*) [with RebuildTxnOp = rocksdb::{anonymous}::MemTableInserter::PutCF(uint32_t, const rocksdb::Slice&, const rocksdb::Slice&)::<lambda(rocksdb::WriteBatch*, uint32_t, const rocksdb::Slice&, const rocksdb::Slice&)>; uint32_t = unsigned int; rocksdb::ProtectionInfoKVOS64 = rocksdb::ProtectionInfoKVOS<long unsigned int>]: Assertion `seq_per_batch_' failed.
```
Reviewed By: anand1976
Differential Revision: D59884468
Pulled By: hx235
fbshipit-source-id: 5d854b719092552c69727a979f269fb7f6c39756
Summary:
InitInputTableProperties() can open and do IOs and is called under mutex_. This PR removes it from FinalizeInputInfo(). It is now called in CompactionJob::Run() and BuildCompactionJobInfo() (called in NotifyOnCompactionBegin()) without holding mutex_.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12879
Test Plan: existing unit tests. Added assert in GetInputTableProperties() to ensure that input_table_properties_ is initialized whenever it's called.
Reviewed By: hx235
Differential Revision: D59933195
Pulled By: cbi42
fbshipit-source-id: c8089e13af8567fa3ab4b94d9ec384ae98ab2ec8
Summary:
... to enable use cases like using RocksDB to merge sort data for ingestion. A new file ingestion option `IngestExternalFileOptions::allow_db_generated_files` is introduced to allows users to ingest SST files generated by live DBs instead of SstFileWriter. For now this only works if the SST files being ingested have zero as their largest sequence number AND do not overlap with any data in the DB (so we can assign seqno 0 which matches the seqno of all ingested keys).
The feature is marked the option as experimental for now.
Main changes needed to enable this:
- ignore CF id mismatch during ingestion
- ignore the missing external file version table property
Rest of the change is mostly in new unit tests.
A previous attempt is in https://github.com/facebook/rocksdb/issues/5602.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12750
Test Plan: - new unit tests
Reviewed By: ajkr, jowlyzhang
Differential Revision: D58396673
Pulled By: cbi42
fbshipit-source-id: aae513afad7b1ff5d4faa48104df5f384926bf03
Summary:
This PR fixes an assertion failure in `DBImpl::ResumeImpl` - `assert(!versions_->descriptor_log_)`. In `VersionSet`, `descriptor_log_` has a pointer to the current MANIFEST writer. When there's an error updating the manifest, `descriptor_log_` is reset, and the error recovery thread checks `io_status()` in `VersionSet` and attempts to write a new MANIFEST. If another DB manipulation happens at the same time (like external file ingestion, column family manipulation etc), it calls `LogAndApply`, which also attempts to write a new MANIFEST. The assertion in `ResumeImpl` might fail in this case since the other MANIFEST writer may have updated `descriptor_log_`. To prevent the assertion, this fix updates both `io_status_` and `descriptor_log_` while holding the DB mutex.
The other option would have been to simply remove the assert. But I think its important to have it to ensure the invariant that `io_status_` is cleared if the MANIFEST is written successfully, and this fix makes things easier to reason about.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12871
Test Plan: Existing tests and crash test
Reviewed By: hx235
Differential Revision: D59926947
Pulled By: anand1976
fbshipit-source-id: af9ad18da3e29fc62c7ec2e30e0738aa33d4e5f1
Summary:
Basically, the fix in https://github.com/facebook/rocksdb/issues/8137 was incomplete (and I missed it in the review), because if `whole_key_filtering` is false, then `last_prefix_str_` will never be set to non-empty and the fix doesn't work. Also related to https://github.com/facebook/rocksdb/issues/5835.
This is intended as a safe, simple fix that will regress CPU efficiency slightly (for `whole_key_filtering=false` cases, because of extra prefix string copies during flush & compaction). An efficient fix is not possible without some substantial refactoring.
Also in this PR: new test DBBloomFilterTest.FilterNumEntriesCoalesce tests an adjacent code path that was previously untested for its effect of ensuring the number of unique prefixes and keys is tracked properly when both prefixes and whole keys are going into a filter. (Test fails when either of the two code segments checking for duplicates is disabled.) In addition, the same test would fail before the main bug fix here because the code would inappropriately add the empty string to the filter (because of unmodified `last_prefix_str_`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12872
Test Plan: In addition to DBBloomFilterTest.FilterNumEntriesCoalesce, extended DBBloomFilterTest.SeekForPrevWithPartitionedFilters to cover the broken case. (Mostly whitespace change.)
Reviewed By: jowlyzhang
Differential Revision: D59873793
Pulled By: pdillinger
fbshipit-source-id: 2a7b7f09ca73dc188fb4dab833826ad6da7ebb11
Summary:
Context/Summary: see above, though the impact is small.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12866
Test Plan: exiting UT
Reviewed By: anand1976
Differential Revision: D59782913
Pulled By: hx235
fbshipit-source-id: ec02843645cce49466bde602035d2e61c31965b8
Summary:
The failure of `WriteCurrentStateToManifest()` in `VersionSet::ProcessManifestWrites()` was not handled properly. If it failed, `manifest_io_status` was not updated, leading to `manifest_file_number_` being updated to the newly created manifest even though its bad. This would lead to the bad manifest immediately getting deleted, and also the good manifest (referenced by `CURRENT`) getting deleted by obsolete file deletion because of `manifest_file_number_` not referencing its number.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12865
Reviewed By: hx235
Differential Revision: D59782940
Pulled By: anand1976
fbshipit-source-id: f752fb9a1c23fd3d734616e273613cbac204301b
Summary:
**Context/Summary:**
`*auto_recovery` needs to be set true in order for `OnErrorRecoveryBegin()` to be called before auto-recovery
3db030d7ee/db/event_helpers.cc (L64-L66)
Currently it's set false for auto-recovery. This PR fixes it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12860
Test Plan:
- Manual observation that it is called
- Existing UT
Reviewed By: jowlyzhang
Differential Revision: D59693315
Pulled By: hx235
fbshipit-source-id: 3f428c5b1e9818bb7697fdcd7f245d11378eb14a
Summary:
In follow-up to https://github.com/facebook/rocksdb/issues/12852:
* Use std::copy in place of copy_n for potentially overlapping buffer
* Get rid of troublesome -1 idiom from `pos_at_last_append_` and `pos_at_last_sync_`
* Small improvements to test FaultInjectionFSTest.ReadUnsyncedData
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12861
Test Plan: CI, crash test, etc.
Reviewed By: cbi42
Differential Revision: D59757484
Pulled By: pdillinger
fbshipit-source-id: c6fbdc2e97c959983184925a855cc8b0285fa23f
Summary:
Follow-up to https://github.com/facebook/rocksdb/issues/12729 and others to fix FaultInjectionTestFS handling the case where a live WAL is being appended to and synced while also being copied for checkpoint or backup, up to a known flushed (but not necessarily synced) prefix of the file. It was tricky to structure the code in a way that could handle a tricky race with Sync in another thread (see code comments, thanks Changyu) while maintaining good performance and test-ability.
For more context, see the call to FlushWAL() in DBImpl::GetLiveFilesStorageInfo().
Also, the unit test for https://github.com/facebook/rocksdb/issues/12729 was neutered by https://github.com/facebook/rocksdb/issues/12797, and this re-enables the functionality it is testing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12852
Test Plan:
unit test expanded/updated. Local runs of blackbox_crash_test.
The implementation is structured so that a multi-threaded unit test is not needed to cover at least the code lines, as the race handling is folded into "catch up after returning unsynced and then a sync."
Reviewed By: cbi42
Differential Revision: D59594045
Pulled By: pdillinger
fbshipit-source-id: 94667bb72255e2952586c53bae2c2dd384e85a50
Summary:
Create C API function for iterating over WriteBatch for custom Column Families
Adding function to C API that exposes column family specific methods to iterate over WriteBatch: put_cf, delete_cf and merge_cf. This is required when the one needs to read changes for any non-default column family. Without that functionality it is impossible to iterate over changes in WAL that are relevant to custom column families.
Fixes https://github.com/facebook/rocksdb/issues/12790
Testing:
Added WriteBatch iteration test to "columnfamilies" section of C API unit tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12718
Reviewed By: cbi42
Differential Revision: D59483601
Pulled By: ajkr
fbshipit-source-id: b68b900636304528a38620a8c3ad82fdce4b60cb
Summary:
### Summary: Round-Robin pri under leveled compaction allows subcompactions by default is not compatible with PlainTable
```c++
bool Compaction::ShouldFormSubcompactions() const {
if (cfd_ == nullptr) {
return false;
}
// Round-Robin pri under leveled compaction allows subcompactions by default
// and the number of subcompactions can be larger than max_subcompactions_
if (cfd_->ioptions()->compaction_pri == kRoundRobin &&
cfd_->ioptions()->compaction_style == kCompactionStyleLevel) {
return output_level_ > 0;
}
if (max_subcompactions_ <= 1) {
return false;
}
```
PlainTable does not support Subcompaction, including when AdaptiveTable is applied to PlainTable. subcompaction by default will result in the following error in some scenarios.
```c++
void PlainTableIterator::Seek(const Slice& target) {
if (use_prefix_seek_ != !table_->IsTotalOrderMode()) {
// This check is done here instead of NewIterator() to permit creating an
// iterator with total_order_seek = true even if we won't be able to Seek()
// it. This is needed for compaction: it creates iterator with
// total_order_seek = true but usually never does Seek() on it,
// only SeekToFirst().
status_ = Status::InvalidArgument(
"total_order_seek not implemented for PlainTable.");
offset_ = next_offset_ = table_->file_info_.data_end_offset;
return;
}
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12843
Reviewed By: ajkr
Differential Revision: D59433477
Pulled By: cbi42
fbshipit-source-id: fb780ba7f7e8efdfedb7480abf14dd38e0b63677
Summary:
We are seeing a number of crash test failures coming from checkpoint and backup code, likely from WalManager::GetSortedWalFiles -> ... -> WalManager::ReadFirstLine and this code path is not needed, because we don't need to know the sequence numbers of WAL files going into a checkpoint or backup. We can minimize the impact of whatever inconsistency is causing that problem by not relying on it where it's not needed.
Similarly, when we only need a roughly accurate set of current WAL files, we don't need to query all the archived WAL files (and redundantly the live ones again).
So this reduces filesystem queries and DB mutex acquires in creating backups and checkpoints.
Needed follow-up:
Figure out what is causing various failures with an apparent inconsistency where GetSortedWalFiles fails on reading a WAL file. If it's an injected failure, perhaps it's not propagating that injected failure appropriately. It might also be an inconsistency between what the DB knows is flushed and what WalManager reads from the filesystem (which we know is dubious and should be phased out, which this is arguably another step toward). Or completing that phase-out might solve the problem without a full diagnosis.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12831
Test Plan:
existing tests (easily caught when I went too far in initally developing this change)
Update to BackupUsingDirectIO test so that there's a WAL file in what is backed up. (Was relying on some oddity.)
Reviewed By: cbi42
Differential Revision: D59252649
Pulled By: pdillinger
fbshipit-source-id: 7ad4187a1c70caa59a6d6c1c643ef95232b929f5
Summary:
We didn't implement file system prefetch for OS Win. During table open, it uses `FilePrefetchBuffer` instead and only do 1 read instead of 4 in BufferedIO.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12816
Reviewed By: jaykorean
Differential Revision: D59181835
Pulled By: ajkr
fbshipit-source-id: 18b8f0247408cd1a80f289357ede5232ae5a3c66
Summary:
**Context/Summary:**
It seems unreasonable to take the archived log size into account when calculating log size **for flush** in method CreateCheckpoint. If the user sets WAL_ttl_seconds or WAL_size_limit_MB, the argument _log_size_for_flush_ can easily be reached due to the size of the archived dir. As a result, the flush may always be triggered.
**Test**
corverd by ./checkpoint_test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12680
Reviewed By: jaykorean
Differential Revision: D59097904
Pulled By: ajkr
fbshipit-source-id: 0ed29c1b078d8f40b85288541b008e00dbc517d3
Summary:
the return value for `ErrorHandler::SetBGError(error)` seems to be not well-defined, it can be `bg_error_` (no matter if the `bg_error_` is set to the input error), ok status or [`recovery_error_`](3ee4d5a11a/db/error_handler.cc (L669)) from `StartRecoverFromRetryableBGIOError()`. The `recovery_error_` returned may be an OK status.
We have only a few places that use the return value of `SetBGError()` and they don't need to do so. Using the return value may even be wrong for example in 3ee4d5a11a/db/db_impl/db_impl_write.cc (L2365) where a non-ok `s` could be overwritten to OK. This PR changes SetBGError() to return void and clean up relevant code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12792
Test Plan: existing unit tests and go over all places where return value of `SetBGError()` is used.
Reviewed By: hx235
Differential Revision: D58904898
Pulled By: cbi42
fbshipit-source-id: d58a20ba5a40e3f35367c6034a32c755088c3653
Summary:
I'd like to get this in so the Rust folks can integrate with their splendid logging/tracing frameworks; will be hugely appreciated. 🙏🏻
The infolog capabilities for C embeddings are quite spartan. LOG files were generated involuntarily until redirection to stderr was added by https://github.com/facebook/rocksdb/issues/12262; still insufficient for apps which cannot tolerate pollution of their stdio and tend to have existing logging frameworks to tie into for that.
Adds a very minimal derive of Logger around a C callback, written in the spirit, useful for FFI interfaces from other languages to integrate infolog.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12537
Reviewed By: ajkr
Differential Revision: D57597766
Pulled By: cbi42
fbshipit-source-id: ec684ce4ddf77a0a6ebbf013a1bacb4ff2e49eb0