Summary:
For the user defined timestamps in memtable only feature, some special handling for range deletion blocks are needed since both the key (start_key) and the value (end_key) of a range tombstone can contain user-defined timestamps. Handling for the key is taken care of in the same way as the other data blocks in the block based table. This PR adds the special handling needed for the value (end_key) part. This includes:
1) On the write path, when L0 SST files are first created from flush, user-defined timestamps are removed from an end key of a range tombstone. There are places where it's logically removed (replaced with a min timestamp) because there is still logic with the running comparator that expects a user key that contains timestamp. And in the block based builder, it is eventually physically removed before persisted in a block.
2) On the read path, when range deletion block is being read, we artificially pad a min timestamp to the end key of a range tombstone in `BlockBasedTableReader`.
3) For file boundary `FileMetaData.largest`, we artificially pad a max timestamp to it if it contains a range deletion sentinel. Anytime when range deletion end_key is used to update file boundaries, it's using max timestamp instead of the range tombstone's actual timestamp to mark it as an exclusive end. d69628e6ce/db/dbformat.h (L923-L935)
This max timestamp is removed when in memory `FileMetaData.largest` is persisted into Manifest, we pad it back when it's read from Manifest while handling related `VersionEdit` in `VersionEditHandler`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12254
Test Plan: Added unit test and enabled this feature combination's stress test.
Reviewed By: cbi42
Differential Revision: D52965527
Pulled By: jowlyzhang
fbshipit-source-id: e8315f8a2c5268e2ae0f7aec8012c266b86df985
Summary:
While working on Meta's internal test triaging process, I found that `db_crashtest.py` was printing out `stdout` and `stderr` altogether. Adding an option to print `stderr` separately so that it's easy to extract only `stderr` from the test run.
`print_stderr_separately` is introduced as an optional parameter with default value `False` to keep the existing behavior as is (except a few minor changes).
Minor changes to the existing behavior
- We no longer print `stderr has error message:` and `***` prefix to each line. We simply print `stderr:` before printing `stderr` if stderr is printed in stdout and print `stderr` as is.
- We no longer print `times error occurred in output is ...` which doesn't appear to have any values
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12301
Test Plan:
**Default Behavior (blackbox)**
Run printed everything as is
```
$> python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 2> /tmp/error.log
Running blackbox-crash-test with
interval_between_crash=120
total-duration=6000
...
Integrated BlobDB: blob files enabled 0, min blob size 0, blob file size 268435456, blob compression type NoCompression, blob GC enabled 0, cutoff 0.250000, force threshold 1.000000, blob compaction readahead size 0, blob file starting level 0
Integrated BlobDB: blob cache disabled
DB path: [/tmp/jewoongh/rocksdb_crashtest_blackboxwh7yxpec]
(Re-)verified 0 unique IDs
2024/01/29-09:16:30 Initializing worker threads
Crash-recovery verification passed :)
2024/01/29-09:16:35 Starting database operations
2024/01/29-09:16:35 Starting verification
Stress Test : 543.600 micros/op 8802 ops/sec
: Wrote 0.00 MB (0.27 MB/sec) (50% of 10 ops)
: Wrote 5 times
: Deleted 1 times
: Single deleted 0 times
: 4 read and 0 found the key
: Prefix scanned 0 times
: Iterator size sum is 0
: Iterated 0 times
: Deleted 0 key-ranges
: Range deletions covered 0 keys
: Got errors 0 times
: 0 CompactFiles() succeed
: 0 CompactFiles() did not succeed
stderr:
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
Error : jewoongh injected test error This is not a real failure.
Verification failed :(
```
Nothing in stderr
```
$> cat /tmp/error.log
```
**Default Behavior (whitebox)**
Run printed everything as is
```
$> python3 tools/db_crashtest.py whitebox --simple --max_key=25000000 --write_buffer_size=4194304 2> /tmp/error.log
Running whitebox-crash-test with
total-duration=10000
...
(Re-)verified 571 unique IDs
2024/01/29-09:33:53 Initializing worker threads
Crash-recovery verification passed :)
2024/01/29-09:35:16 Starting database operations
2024/01/29-09:35:16 Starting verification
Stress Test : 97248.125 micros/op 10 ops/sec
: Wrote 0.00 MB (0.00 MB/sec) (12% of 8 ops)
: Wrote 1 times
: Deleted 0 times
: Single deleted 0 times
: 4 read and 1 found the key
: Prefix scanned 1 times
: Iterator size sum is 120868
: Iterated 4 times
: Deleted 0 key-ranges
: Range deletions covered 0 keys
: Got errors 0 times
: 0 CompactFiles() succeed
: 0 CompactFiles() did not succeed
stderr:
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
Error : jewoongh injected test error This is not a real failure.
New cache capacity = 4865393
Verification failed :(
TEST FAILED. See kill option and exit code above!!!
```
Nothing in stderr
```
$> cat /tmp/error.log
```
**New option added (blackbox)**
```
$> python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --print_stderr_separately 2> /tmp/error.log
Running blackbox-crash-test with
interval_between_crash=120
total-duration=6000
...
Integrated BlobDB: blob files enabled 0, min blob size 0, blob file size 268435456, blob compression type NoCompression, blob GC enabled 0, cutoff 0.250000, force threshold 1.000000, blob compaction readahead size 0, blob file starting level 0
Integrated BlobDB: blob cache disabled
DB path: [/tmp/jewoongh/rocksdb_crashtest_blackbox7ybna32z]
(Re-)verified 0 unique IDs
Compaction filter factory: DbStressCompactionFilterFactory
2024/01/29-09:05:39 Initializing worker threads
Crash-recovery verification passed :)
2024/01/29-09:05:46 Starting database operations
2024/01/29-09:05:46 Starting verification
Stress Test : 235.917 micros/op 16000 ops/sec
: Wrote 0.00 MB (0.16 MB/sec) (16% of 12 ops)
: Wrote 2 times
: Deleted 1 times
: Single deleted 0 times
: 9 read and 0 found the key
: Prefix scanned 0 times
: Iterator size sum is 0
: Iterated 0 times
: Deleted 0 key-ranges
: Range deletions covered 0 keys
: Got errors 0 times
: 0 CompactFiles() succeed
: 0 CompactFiles() did not succeed
```
stderr printed separately
```
$> cat /tmp/error.log
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
Error : jewoongh injected test error This is not a real failure.
New cache capacity = 19461571
Verification failed :(
```
**New option added (whitebox)**
```
$> python3 tools/db_crashtest.py whitebox --simple --max_key=25000000 --write_buffer_size=4194304 --print_stderr_separately 2> /tmp/error.log
Running whitebox-crash-test with
total-duration=10000
...
Integrated BlobDB: blob files enabled 0, min blob size 0, blob file size 268435456, blob compression type NoCompression, blob GC enabled 0, cutoff 0.250000, force threshold 1.000000, blob compaction readahead size 0, blob file starting level 0
Integrated BlobDB: blob cache disabled
DB path: [/tmp/jewoongh/rocksdb_crashtest_whiteboxtwj0ihn6]
(Re-)verified 157 unique IDs
2024/01/29-09:39:59 Initializing worker threads
Crash-recovery verification passed :)
2024/01/29-09:40:16 Starting database operations
2024/01/29-09:40:16 Starting verification
Stress Test : 742.474 micros/op 11801 ops/sec
: Wrote 0.00 MB (0.27 MB/sec) (36% of 19 ops)
: Wrote 7 times
: Deleted 1 times
: Single deleted 0 times
: 8 read and 0 found the key
: Prefix scanned 0 times
: Iterator size sum is 0
: Iterated 4 times
: Deleted 0 key-ranges
: Range deletions covered 0 keys
: Got errors 0 times
: 0 CompactFiles() succeed
: 0 CompactFiles() did not succeed
TEST FAILED. See kill option and exit code above!!!
```
stderr printed separately
```
$> cat /tmp/error.log
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
Error : jewoongh injected test error This is not a real failure.
Error : jewoongh injected test error This is not a real failure.
Error : jewoongh injected test error This is not a real failure.
New cache capacity = 4865393
Verification failed :(
```
Reviewed By: akankshamahajan15
Differential Revision: D53187491
Pulled By: jaykorean
fbshipit-source-id: 76f9100d08b96d014e41b7b88b206d69f0ae932b
Summary:
Right now crash_test relies on std::errors too to check for only errors/failures along with verification. However, that's not a reliable solution and many internal services logs benign errors/warnings in which case our test script fails.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12265
Test Plan: Keep std::errors but printout instead of failing and will monitor crash tests internally to see if there is any scenario which solely relies on std::error, in which case stress tests can be improve.
Reviewed By: ajkr, cbi42
Differential Revision: D52967000
Pulled By: akankshamahajan15
fbshipit-source-id: 5328c8b69480c7946fe6a9c72f9ffeede70ac2ad
Summary:
The current implementation of the ldb_cmd tool involves commenting out the user-passed column_family_descriptors, resulting in the tool consistently constructing its column_family_descriptors from the pre-existing OPTIONS file.
The proposed fix prioritizes user-passed column family descriptors, ensuring they take precedence over those specified in the OPTIONS file. This modification enhances the tool's adaptability and responsiveness to user configurations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12261
Reviewed By: cbi42
Differential Revision: D52965877
Pulled By: ajkr
fbshipit-source-id: 334a83a8e1004c271b19e7ca09381a0e7cf87b03
Summary:
While investigating test failures due to the inconsistency between `Get()` and `MultiGet()`, I realized that LDB currently doesn't support `MultiGet()`. This PR introduces the `MultiGet()` support in LDB. Tested the command manually. Unit test will follow in a separate PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12283
Test Plan:
When key not found
```
$> ./ldb --db=/data/users/jewoongh/rocksdb_test/T173992396/rocksdb_crashtest_blackbox --hex multi_get 0x0000000000000009000000000000012B00000000000002AB
Status for key 0x0000000000000009000000000000012B00000000000002AB: NotFound:
```
Compare the same key with get
```
$> ./ldb --db=/data/users/jewoongh/rocksdb_test/T173992396/rocksdb_crashtest_blackbox --hex get 0x0000000000000009000000000000012B00000000000002AB
Failed: Get failed: NotFound:
```
Multiple keys not found
```
$> ./ldb --db=/data/users/jewoongh/rocksdb_test/T173992396/rocksdb_crashtest_blackbox --hex multi_get 0x0000000000000009000000000000012B00000000000002AB 0x0000000000000009000000000000012B00000000000002AC Status for key 0x0000000000000009000000000000012B00000000000002AB: NotFound:
Status for key 0x0000000000000009000000000000012B00000000000002AC: NotFound:
```
One of the keys found
```
$> ./ldb --db=/data/users/jewoongh/rocksdb_test/T173992396/rocksdb_crashtest_blackbox --hex multi_get 0x0000000000000009000000000000012B00000000000002AB 0x00000000000000090000000000000026787878787878
Status for key 0x0000000000000009000000000000012B00000000000002AB: NotFound:
0x22000000262724252A2B28292E2F2C2D32333031363734353A3B38393E3F3C3D02030001060704050A0B08090E0F0C0D12131011161714151A1B18191E1F1C1D
```
All of the keys found
```
$> ./ldb --db=/data/users/jewoongh/rocksdb_test/T173992396/rocksdb_crashtest_blackbox --hex multi_get 0x0000000000000009000000000000012B00000000000000D8 0x00000000000000090000000000000026787878787878 15:57:03
0x47000000434241404F4E4D4C4B4A494857565554535251505F5E5D5C5B5A595867666564636261606F6E6D6C6B6A696877767574737271707F7E7D7C7B7A797807060504030201000F0E0D0C0B0A090817161514131211101F1E1D1C1B1A1918
0x22000000262724252A2B28292E2F2C2D32333031363734353A3B38393E3F3C3D02030001060704050A0B08090E0F0C0D12131011161714151A1B18191E1F1C1D
```
Reviewed By: hx235
Differential Revision: D53048519
Pulled By: jaykorean
fbshipit-source-id: a6217905464c5f460a222e2b883bdff47b9dd9c7
Summary:
with release notes for 8.11.fb, format_compatible test update, and version.h update.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12256
Test Plan: CI
Reviewed By: cbi42
Differential Revision: D52926051
Pulled By: pdillinger
fbshipit-source-id: adcf7119b065758599e904c16cbdf1d28811e0b4
Summary:
Currently, we treat the long-running whitebox_crash_test as passing. However, we were not cleaning up after ourselves when we killed the running test for running too long, which often caused out-of-space errors in subsequent tests (e.g., blackbox_crash_test after whitebox_crash_test).
Unless we want to start treating these timeouts as failures and need the DB output for investigation now, we should properly clean up the tmp dir.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12248
Test Plan:
```
$> make crash_test -j
```
Reviewed By: ajkr
Differential Revision: D52885342
Pulled By: jaykorean
fbshipit-source-id: 7c1f2ca7cf03d0705bb14155ee44d5d7a411c132
Summary:
Add ```CompressionOptions``` to ```CompressedSecondaryCacheOptions``` to allow users to set options such as compression level. It allows performance to be fine tuned.
Tests -
Run db_bench and verify compression options in the LOG file
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12234
Reviewed By: ajkr
Differential Revision: D52758133
Pulled By: anand1976
fbshipit-source-id: af849fbffce6f84704387c195d8edba40d9548f6
Summary:
We test LockWAL() and UnlockWAL() by checking that latest sequence number is not changed: 1a1f9f1660/db_stress_tool/db_stress_test_base.cc (L920-L937). With writeprepared transaction, sequence number can be advanced in SwitchMemtable::WriteRecoverableState() when writing recoverable state: 1a1f9f1660/db/db_impl/db_impl_write.cc (L1560)
This PR disables LockWAL() tests for writeprepared transaction for now. We probably need to change how we test LockWAL() for writeprepared before re-enabling this test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12221
Reviewed By: ajkr
Differential Revision: D52677076
Pulled By: cbi42
fbshipit-source-id: 27ee694878edf63e8f4ad52f769d4db401f511bc
Summary:
Currently, when `block_cache_trace_analyzer` analyzes the cache miss ratio, it only analyzes the total miss ratio.
But it seems also important to analyze the cache miss ratio of each caller. To achieve this, we can calculate and print the miss ratio of each caller in the analyzer.
## Before modification
```
Running for 1 seconds: Processed 85732 records/second. Trace duration 58 seconds. Observed miss ratio 7.97
```
## After modification
```
Running for 1 seconds: Processed 85732 records/second. Trace duration 58 seconds. Observed miss ratio 7.97
Caller Get: Observed miss ratio 6.31
Caller Iterator: Observed miss ratio 11.86
***************************************************************
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10823
Reviewed By: ajkr
Differential Revision: D52632764
Pulled By: hx235
fbshipit-source-id: 40994d6039b73dc38fe78ea1b4adce187bb98909
Summary:
This feature combination is not fully working yet. Disable them so the stress tests have less noise.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12218
Reviewed By: cbi42
Differential Revision: D52643957
Pulled By: jowlyzhang
fbshipit-source-id: 8815a18a3b5814cad4f7ec41f3fb94869302081e
Summary:
FilePrefetchBuffer makes an unchecked assumption about the behavior of RandomAccessFileReader::Read: that it will write to the provided buffer rather than returning the data in an alternate buffer. FilePrefetchBuffer has been quietly incompatible with mmap reads (e.g. allow_mmap_reads / use_mmap_reads) because in that case an alternate buffer is returned (mmapped memory). This incompatibility currently leads to quiet data corruption, as seen in amplified crash test failure in https://github.com/facebook/rocksdb/issues/12200.
In this change,
* Check whether RandomAccessFileReader::Read has the expected behavior, and fail if not. (Assertion failure in debug build, return Corruption in release build.) This will detect future regressions synchronously and precisely, rather than relying on debugging downstream data corruption.
* Why not recover? My understanding is that FilePrefetchBuffer is not intended for use when RandomAccessFileReader::Read uses an alternate buffer, so quietly recovering could lead to undesirable (inefficient) behavior.
* Mention incompatibility with mmap-based readers in the internal API comments for FilePrefetchBuffer
* Fix two cases where FilePrefetchBuffer could be used with mmap, both stemming from SstFileDumper, though one fix is in BlockBasedTableReader. There is currently no way to ask a RandomAccessFileReader whether it's using mmap, so we currently have to rely on other options as clues.
Keeping separate from https://github.com/facebook/rocksdb/issues/12200 in part because this change is more appropriate for backport than that one.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12206
Test Plan:
* Manually verified that the new check aids in debugging.
* Unit test added, that fails if either fix is missed.
* Ran blackbox_crash_test for hours, with and without https://github.com/facebook/rocksdb/issues/12200
Reviewed By: akankshamahajan15
Differential Revision: D52551701
Pulled By: pdillinger
fbshipit-source-id: dea87c5782b7c484a6c6e424585c8832dfc580dc
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactory to make implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
Summary:
Do a size verification on the MANIFEST file during DB shutdown, after closing the file. If the verification fails, write a new MANIFEST file. In the future, we can do a more thorough verification if we want to.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12174
Test Plan: Unit test, and some manual verification
Reviewed By: ajkr
Differential Revision: D52451184
Pulled By: anand1976
fbshipit-source-id: fc3bc170e22f6c9a9c482ee5ff592abab889df83
Summary:
Currently, some numbers in the `tracer_analyzer_tool` may be a little confusing and unfriendly for people who want to add new query types.
It may be better to replace them with the existing enumeration type to improve readability.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10827
Reviewed By: ajkr
Differential Revision: D40576023
Pulled By: hx235
fbshipit-source-id: 0eb16820a15f365d53e848a3a8efd92928420429
Summary:
I landed https://github.com/facebook/rocksdb/issues/12159 which had the below compiler error when using `-DROCKSDB_NAMESPACE`, which broke the CircleCI "build-linux-static_lib-alt_namespace-status_checked" job:
```
tools/ldb_cmd_test.cc:1213:21: error: 'rocksdb' does not name a type
1213 | int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const override {
| ^~~~~~~
tools/ldb_cmd_test.cc:1213:35: error: expected unqualified-id before '&' token
1213 | int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const override {
| ^
tools/ldb_cmd_test.cc:1213:35: error: expected ')' before '&' token
1213 | int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const override {
| ~ ^
| )
tools/ldb_cmd_test.cc:1213:35: error: expected ';' at end of member declaration
1213 | int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const override {
| ^
| ;
tools/ldb_cmd_test.cc:1213:37: error: 'a' does not name a type
1213 | int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const override {
| ^
...
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12173
Test Plan:
```
$ make clean && make OPT="-DROCKSDB_NAMESPACE=alternative_rocksdb_ns" ldb_cmd_test -j56
```
Reviewed By: pdillinger
Differential Revision: D52373797
Pulled By: ajkr
fbshipit-source-id: 8597aaae65a5333831fef66d85072827c5fb1187
Summary:
According to this [Q&A](https://github.com/facebook/rocksdb/wiki/RocksDB-FAQ#:~:text=Q%3A%20If%20I%20use%20non%2Ddefault%20comparators%20or%20merge%20operators%2C%20can%20I%20still%20use%20ldb%20tool%3F), user should be able to use LDB with passing a customized comparator into the option.
In the process of opening DB in order to perform ldb commands, there is a exception saying comparator not match even if a option with customized comparator is provided. After initializing the column family to open DB, the `LDBCommand::OverrideBaseCFOptions` method does not update the comparator inside column family descriptor using the passed in options. This can cause a mismatch while doing version edit, and in function `ToggleUDT CompareComparator` it will failed and return a exception saying comparator not match.
Propose fix by updating the column family descriptor's option using the user passed in option. Also a test case is provided to illustrate the steps.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12159
Reviewed By: hx235
Differential Revision: D52267367
Pulled By: ajkr
fbshipit-source-id: c240f93f440e02cb485893de058a46c6dbf9654b
Summary:
**Context/Summary:**
Continued from https://github.com/facebook/rocksdb/pull/12127, we can randomly reduce the # max key to coerce more operations on the same key. My experimental run shows it surfaced more issue than just https://github.com/facebook/rocksdb/pull/12127.
I also randomly reduce the related parameters, write buffer size and target file base, to adapt to randomly lower number of # max key. This creates 4 situations of testing, 3 of which are new:
1. **high** # max key with **high** write buffer size and target file base (existing)
2. **high** # max key with **low** write buffer size and target file base (new, will go through some rehearsal testing to ensure we don't run out of space with many files)
3. **low** # max key with **high** write buffer size and target file base (new, keys will stay in memory longer)
4. **low** # max key with **low** write buffer size and target file base (new, experimental runs show it surfaced even more issues)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12148
Test Plan:
- [Ongoing] Rehearsal stress test
- Monitor production stress test
Reviewed By: jaykorean
Differential Revision: D52174980
Pulled By: hx235
fbshipit-source-id: bd5e11280826819ca9314c69bbbf05d481c6d105
Summary:
This PR adds initial stress testing for the user-defined timestamps in memtable only feature. Each flavor of the `*_ts` crash test get a 1 in 3 chance to run with timestamps not persisted, this setting is initialized once and kept consistent across the following re-runs.
This initial stress test included these things besides disabling incompatible feature combinations to make the test run more stably:
1) It currently only run test methods that validates db state with expected state. Not the ones that validate db state by comparing result from one API to another API. Such as `TestMultiGet` (compared with `Get`), similarly `TestMultiGetEntity`, `TestIterate` (compare src iterator to a control iterator). Due to timestamps being removed, results from one API to another API is not directly comparable as it is now. More test logic to handle that need to be added, will do that in a follow up.
2) Even when comparing db state to expected state, sometimes the db can receive `InvalidArgument` too due to timestamps getting flushed and removed. Added some logic to handle that.
3) When timestamps are not persisted, we don't try to read with older timestamp. Since that's making it easier to get `InvalidArgument`. And this capability is not yet needed by our customer so it's disabled for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12124
Test Plan: running multiple flavor of this test on continuous run for sometime before checkin
Reviewed By: ltamasi
Differential Revision: D51916267
Pulled By: jowlyzhang
fbshipit-source-id: 3f3eb5f9618d05d296062820e0ef5cb8edc7c2b2
Summary:
**Context/Summary:**
My experimental stress runs with more frequent "xxx_one_in" surfaced a couple interesting bugs/issues with RocksDB or crash test framework in the past. We now consider changing the default value so they are run more frequently in production testing environment.
Increase frequency by 2 orders of magnitude for most parameters, except for error-prone features e.g, manual compaction and file ingestion (increased by 3 orders) and expensive features e.g, checksum verification (increased by 1 order)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12127
Test Plan: Monitor CI to see if it did surface more interesting bugs/issues. If not, we may consider intensify even more.
Reviewed By: pdillinger
Differential Revision: D51954235
Pulled By: hx235
fbshipit-source-id: 92046cb7c52a37212f19ab7965b40f77b90b08b1
Summary:
This is a simple refactor for the crash test script to put shared logic for parsing stderr into a function. There is no functional change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12109
Test Plan: manually tested the script
Reviewed By: ajkr
Differential Revision: D51692172
Pulled By: jowlyzhang
fbshipit-source-id: d346d64e981d9c489c380ff6ce33296a224b5877
Summary:
Add the option to have a 3-tier block cache (uncompressed RAM, compressed RAM, and local flash) in db_bench, as well as specifying secondary cache admission policy.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12104
Reviewed By: ajkr
Differential Revision: D51629092
Pulled By: anand1976
fbshipit-source-id: 6a208f853bc85d3d8b437d91cb1b0142d9a99e53
Summary:
Part of the procedures to handle manifest IO error is to disable file deletion in case some files in limbo state get deleted prematurely. This is not ideal because: 1) not all the VersionEdits whose commit encounter such an error contain updates for files, disabling file deletion sometimes are not necessary. 2) `EnableFileDeletion` has a force mode that could make other threads accidentally disrupt this procedure in recovery. 3) Disabling file deletion as a whole is also not as efficient as more precisely tracking impacted files from being prematurely deleted. This PR replaces this mechanism with tracking such files and quarantine them from being deleted in `ErrorHandler`.
These are the types of files being actively tracked in quarantine in this PR:
1) new table files and blob files from a background job
2) old manifest file whose immediately following new manifest file's CURRENT file creation gets into unclear state. Current handling is not sufficient to make sure the old manifest file is kept in case it's needed.
Note that WAL logs are not part of the quarantine because `min_log_number_to_keep` is a safe mechanism and it's only updated after successful manifest commits so it can prevent this premature deletion issue from happening.
We track these files' file numbers because they share the same file number space.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12030
Test Plan: Modified existing unit tests
Reviewed By: ajkr
Differential Revision: D51036774
Pulled By: jowlyzhang
fbshipit-source-id: 84ef26271fbbc888ef70da5c40fe843bd7038716
Summary:
Disabling file deletion can be critical for operations like making a backup, recovery from manifest IO error (for now). Ideally as long as there is one caller requesting file deletion disabled, it should be kept disabled until all callers agree to re-enable it. So this PR removes the default forcing behavior for the `EnableFileDeletion` API, and users need to explicitly pass the argument if they insisted on doing so knowing the consequence of what can be potentially disrupted.
This PR removes the API's default argument value so it will cause breakage for all users that are relying on the default value, regardless of whether the forcing behavior is critical for them. When fixing this breakage, it's good to check if the forcing behavior is indeed needed and potential disruption is OK.
This PR also makes unit test that do not need force behavior to do a regular enable file deletion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12001
Reviewed By: ajkr
Differential Revision: D51214683
Pulled By: jowlyzhang
fbshipit-source-id: ca7b1ebf15c09eed00f954da2f75c00d2c6a97e4
Summary:
I have finally tracked down and fixed a bug affecting AutoHCC that was causing CI crash test assertion failures in AutoHCC when using secondary cache, but I was only able to reproduce locally a couple of times, after very long runs/repetitions.
It turns out that the essential feature used by secondary cache to trigger the bug is Insert without keeping a handle, which is otherwise rarely used in RocksDB and not incorporated into cache_bench (also used for targeted correctness stress testing) until this change (new option `-blind_insert_percent`).
The problem was in copying some logic from FixedHCC that makes the entry "sharable" but unreferenced once populated, if no reference is to be saved. The problem in AutoHCC is that we can only add the entry to a chain after it is in the sharable state, and must be removed from the chain while in the "under (de)construction" state and before it is back in the "empty" state. Also, it is possible for Lookup to find entries that are not connected to any chain, by design for efficiency, and for Release to erase_if_last_ref. Therefore, we could have
* Thread 1 starts to Insert a cache entry without keeping ref, and pauses before adding to the chain.
* Thread 2 finds it with Lookup optimizations, and then does Release with `erase_if_last_ref=true` causing it to trigger erasure on the entry. It successfully locks the home chain for the entry and purges any entries pending erasure. It is OK that this entry is not found on the chain, as another thread is allowed to remove it from the chain before we are able to (but after is it marked for (de)construction). And after the purge of the chain, the entry is marked empty.
* Thread 1 resumes in adding the slot (presumed entry) to the home chain for what was being inserted, but that now violates invariants and sets up a race or double-chain-reference as another thread could insert a new entry in the slot and try to insert into a different chain.
This is easily fixed by holding on to a reference until inserted onto the chain.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12046
Test Plan:
As I don't have a reliable local reproducer, I triggered 20 runs of internal CI on fbcode_blackbox_crash_test that were previously failing in AutoHCC with about 1/3 probability, and they all passed.
Also re-enabling AutoHCC in the crash test with this change. (Revert https://github.com/facebook/rocksdb/issues/12000)
Reviewed By: jowlyzhang
Differential Revision: D51016979
Pulled By: pdillinger
fbshipit-source-id: 3840fb829d65b97c779d8aed62a4a4a433aeff2b
Summary:
db_stress flag `verify_iterator_with_expected_state_one_in` is only enabled for in crash test if --simple flag is set. This PR enables it for all supported crash tests by enabling it by default. This adds coverage for --txn and --enable_ts crash tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12040
Test Plan:
ran crash tests that disabled this flag before for a few hours
```
python3 ./tools/db_crashtest.py blackbox --verify_iterator_with_expected_state_one_in=1 --txn --txn_write_policy=[0,1,2]
python3 ./tools/db_crashtest.py blackbox --verify_iterator_with_expected_state_one_in=1 --enable_ts
```
Reviewed By: ajkr, hx235
Differential Revision: D50980001
Pulled By: cbi42
fbshipit-source-id: 3daf6b4c32bdddc5df057240068162aa1a907587
Summary:
As mentioned in https://github.com/facebook/rocksdb/issues/11893, we are going to use the offpeak time information to pre-process TTL-based compactions. To do so, we need to access `daily_offpeak_time_utc` in `VersionStorageInfo::ComputeCompactionScore()` where we pick the files to compact. This PR is to make the offpeak time information available at the time of compaction-scoring. We are not changing any compaction scoring logic just yet. Will follow up in a separate PR.
There were two ways to achieve what we want.
1. Make `MutableDBOptions` available in `ColumnFamilyData` and `ComputeCompactionScore()` take `MutableDBOptions` along with `ImmutableOptions` and `MutableCFOptions`.
2. Make `daily_offpeak_time_utc` and `IsNowOffpeak()` available in `VersionStorageInfo`.
We chose the latter as it involves smaller changes.
This change includes the following
- Introduction of `OffpeakTimeInfo` and `IsNowOffpeak()` has been moved from `MutableDBOptions`
- `OffpeakTimeInfo` added to `VersionSet` and it can be set during construction and by `ChangeOffpeakTimeInfo()`
- During `SetDBOptions()`, if offpeak time info needs to change, it calls `MaybeScheduleFlushOrCompaction()` to re-compute compaction scores and process compactions as needed
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12018
Test Plan:
- `DBOptionsTest::OffpeakTimes` changed to include checks for `MaybeScheduleFlushOrCompaction()` calls and `VersionSet`'s OffpeakTimeInfo value change during `SetDBOptions()`.
- `VersionSetTest::OffpeakTimeInfoTest` added to test `ChangeOffpeakTimeInfo()`. `IsNowOffpeak()` tests moved from `DBOptionsTest::OffpeakTimes`
Reviewed By: pdillinger
Differential Revision: D50723881
Pulled By: jaykorean
fbshipit-source-id: 3cff0291936f3729c0e9c7750834b9378fb435f6
Summary:
- Right now in blackbox test we don't exit if there are std::error as we do in whitebox crash tests. As result those errors are swallowed.
It only errors out if state is unexpected.
One example that was noticed in blackbox crash test -
```
stderr has error message:
***Error restoring historical expected values: Corruption: DB is older than any restorable expected state***
Running db_stress with pid=30454: /packages/rocksdb_db_stress_internal_repo/rocks_db_stress ....
```
- This diff also provided support to export files - db_crashtest.py file to be used by different repo.
Reviewed By: ajkr
Differential Revision: D50564889
fbshipit-source-id: 7bafbbc6179dc79467ca2b680fe83afc7850616a
Summary:
**Context/Summary:**
We ignore trace writing status e.g, 543191f2ea/db/db_impl/db_impl_write.cc (L221-L222)
If a write into the trace file fails, subsequent trace write will continue onto the same file.
This will trigger the assertion `assert(sync_without_flush_called_)` intended to catch write to a file that has previously seen error, added in https://github.com/facebook/rocksdb/pull/10489, https://github.com/facebook/rocksdb/pull/10555
Alternative (rejected) is to handle trace writing status at a higher level at e.g, 543191f2ea/db/db_impl/db_impl_write.cc (L221-L222). However, it makes sense to ignore such status considering tracing is not a critical but assistant component to db operation. And this alternative requires more code change. So it's better to handle the failure at a lower level as this PR
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11996
Test Plan: Add new UT failed before this PR and pass after
Reviewed By: akankshamahajan15
Differential Revision: D50532467
Pulled By: hx235
fbshipit-source-id: f2032abafd94917adbf89a20841d15b448782a33
Summary:
... until I can reproduce and resolve assertion failures (mostly in PurgeImplLocked) seen in crash test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12000
Test Plan: make blackbox_crash_test
Reviewed By: hx235
Differential Revision: D50565984
Pulled By: pdillinger
fbshipit-source-id: 5eea1638ff2683c41b4f65ee1ffc2398071911e7
Summary:
When `secondary_cache_uri` is non-empty and the `cache_type` is not a tiered cache, then sanitize `compressed_secondary_cache_size` to 0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11967
Test Plan: Run crash test
Reviewed By: akankshamahajan15
Differential Revision: D50346157
Pulled By: anand1976
fbshipit-source-id: 57bcbad2ec81fa736f1539a0a41ed6854ded2077
Summary:
Thanks ltamasi and ajkr for initial investigations on the test failure. Per the investigations, the following scenario is likely causing the test to fail.
1. Recovery is needed (could be any reason during crash test)
2. Trying to recover from the latest manifest fails (likely due to read error injection)
3. DB opens with recovery from the next manifest which is different from step 2.
4. Expected state is based on the manifest we tried and failed in step 2.
5. Two manifests used in step 2 and 3 are confirmed to have difference in LSM trees (Thanks ltamasi again for the finding).
```
2023/10/05-11:24:18.942189 56341 [db/version_set.cc:6079] Trying to recover from manifest: /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox/MANIFEST-007184
...
2023/10/05-11:24:18.978007 56341 [db/version_set.cc:6079] Trying to recover from manifest: /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox/MANIFEST-007180
```
```
[ltamasi@devbig1024.prn1 /tmp/x]$ ldb manifest_dump --hex --path=MANIFEST-007184_renamed_ > 2
[ltamasi@devbig1024.prn1 /tmp/x]$ ldb manifest_dump --hex --path=MANIFEST-007180_renamed_ > 1
[ltamasi@devbig1024.prn1 /tmp/x]$ diff 1 2
--- 1 2023-10-09 10:29:16.966215207 -0700
+++ 2 2023-10-09 10:29:11.984241645 -0700
@@ -13,7 +13,7 @@
7174:3950254[1875617 .. 2203952]['000000000003415B000000000000012B000000000000007D' seq:1906214, type:1 .. '000000000003CA59000000000000012B000000000000005C' seq:2039838, type:1]
7175:88060[2074748 .. 2203892]['000000000003CA6300000000000000CF78787878787878' seq:2167539, type:2 .. '000000000003D08F000000000000012B0000000000000130' seq:2112478, type:0]
--- level 6 --- version# 1 ---
- 7057:3132633[0 .. 2046144]['0000000000000009000000000000000978' seq:0, type:1 .. '0000000000005F8B000000000000012B00000000000002AC' seq:0, type:1]
+ 7219:2135565[0 .. 2046144]['0000000000000009000000000000000978' seq:0, type:1 .. '0000000000005F8B000000000000012B00000000000002AC' seq:0, type:1]
7061:827724[0 .. 2046131]['0000000000005F95000000000000000778787878787878' seq:0, type:1 .. '000000000000784F000000000000012B0000000000000113' seq:0, type:1]
6763:1352[0 .. 0]['000000000000784F000000000000012B0000000000000129' seq:0, type:1 .. '000000000000784F000000000000012B0000000000000129' seq:0, type:1]
7173:4812291[0 .. 2203957]['000000000000784F000000000000012B0000000000000138' seq:0, type:1 .. '0000000000020FAE787878787878' seq:0, type:1]
@@ -77,4 +77,4 @@
--- level 61 --- version# 1 ---
--- level 62 --- version# 1 ---
--- level 63 --- version# 1 ---
-next_file_number 7182 last_sequence 2203963 prev_log_number 0 max_column_family 0 min_log_number_to_keep 7015
+next_file_number 7221 last_sequence 2203963 prev_log_number 0 max_column_family 0 min_log_number_to_keep 7015
```
We have two options to fix this. Either skip verification against expected state or disable read injection when BE recovery is enabled. I chose to skip verification against expected state per discussion. (See comments in this PR)
Please note that some linter changes were included in this PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11938
Test Plan:
```
TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_best_efforts_recovery
```
Reviewed By: ltamasi
Differential Revision: D50136341
Pulled By: jaykorean
fbshipit-source-id: ac7434d592aebc148bfc3a4fcaa34936f136b95c
Summary:
This PR depends on https://github.com/facebook/rocksdb/issues/11879 . Enable write fault injection for the basic whitebox, blackbox, and cf_consistency modes. For other test modes like multiops_txn, best_efforts_recovery etc., leave it disabled for now until we can do more testing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11924
Reviewed By: ajkr
Differential Revision: D50178252
Pulled By: anand1976
fbshipit-source-id: 5794f81c14cded1eb28762b2de818dfff1c1a34c
Summary:
In preparing some seqno_to_time_mapping improvements, I found that some of the wrap-up work for creating column families was unnecessarily repeated in the case of DB::Open with create_missing_column_families. This change fixes that (`CreateColumnFamily()` -> `CreateColumnFamilyImpl()` in `DBImpl::Open()`), motivated by avoiding repeated calls to `RegisterRecordSeqnoTimeWorker()` but with the side benefit of avoiding repeated calls to `WriteOptionsFile()` for each CF.
Also in this change:
* Add a `Status::UpdateIfOk()` function for combining statuses in a common pattern
* Rename `max_time_duration` -> `min_preserve_seconds` (include units as much as possible)
* Improved comments in several places
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11920
Test Plan: tests added / updated
Reviewed By: jaykorean
Differential Revision: D49919147
Pulled By: pdillinger
fbshipit-source-id: 3d0318c1d070c842c5331da0a5b415caedc104f1
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11913
The `max_successive_merges` logic currently does not handle wide-column base values correctly, since it uses the `Get` API, which only returns the value of the default column. The patch fixes this by switching to `GetEntity` and passing all columns (if applicable) to the merge operator.
Reviewed By: jaykorean
Differential Revision: D49795097
fbshipit-source-id: 75eb7cc9476226255062cdb3d43ab6bd1cc2faa3
Summary:
Users may run into an issue when running ldb on db that's in a different version and they have different set of options: `Failed: Invalid argument: Could not find option: <MISSING_OPTION>`
They can work around this by setting `--ignore_unknown_options`, but the error message is not clear for users to find why the option is missing. It's also hard for the users to find the `ignore_unknown_options` option especially if they are not familiar with the codebase or `ldb` tool.
This PR changes the error message to help users to find out what's wrong and possible workaround for the issue
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11907
Test Plan:
Testing by reproducing the issue locally
```
❯./ldb --db=/data/users/jewoongh/db_crash_whitebox_T164195541/ get a
Failed: Invalid argument: Could not find option: : unknown_option_test
This tool was built with version 8.8.0. If your db is in a different version, please try again with option --ignore_unknown_options.
```
Reviewed By: jowlyzhang
Differential Revision: D49762291
Pulled By: jaykorean
fbshipit-source-id: 895570150fde886d5ec524908c4b2664c9230ac9
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11906
The patch adds stress test coverage for the wide-column aware `FullMergeV3` API by implementing a new `DBStressWideMergeOperator`. This operator is similar to `PutOperator` / `PutOperatorV2` in the sense that its result is based on the last merge operand; however, the merge result can be either a plain value or a wide-column entity, depending on the value base encoded into the operand and the value of the `use_put_entity_one_in` stress test parameter. Following the same rule for merge results that we do for writes ensures that the queries issued by the validation logic receive the expected results. The new operator is used instead of `PutOperatorV2` whenever `use_put_entity_one_in` is positive. Note that the patch also makes it possible to set `use_put_entity_one_in` and `use_merge` (but not `use_full_merge_v1`) at the same time, giving `use_put_entity_one_in` precedence, so the stress test will use `PutEntity` for writes passing the `use_put_entity_one_in` check described above and `Merge` for any other writes.
Reviewed By: jaykorean
Differential Revision: D49760024
fbshipit-source-id: 3893602c3e7935381b484f4f5026f1983e3a04a9
Summary:
Crash tests are failing with recent change of auto_readahead_size. Disable it in stress tests and enable it with fix to clear the crash tests failures.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11883
Reviewed By: pdillinger
Differential Revision: D49597854
Pulled By: akankshamahajan15
fbshipit-source-id: 0af8ca7414ee9b92f244ee0fb811579c3c052b41
Summary:
This PR implements support for a three tier cache - primary block cache, compressed secondary cache, and a nvm (local flash) secondary cache. This allows more effective utilization of the nvm cache, and minimizes the number of reads from local flash by caching compressed blocks in the compressed secondary cache.
The basic design is as follows -
1. A new secondary cache implementation, ```TieredSecondaryCache```, is introduced. It keeps the compressed and nvm secondary caches and manages the movement of blocks between them and the primary block cache. To setup a three tier cache, we allocate a ```CacheWithSecondaryAdapter```, with a ```TieredSecondaryCache``` instance as the secondary cache.
2. The table reader passes both the uncompressed and compressed block to ```FullTypedCacheInterface::InsertFull```, allowing the block cache to optionally store the compressed block.
3. When there's a miss, the block object is constructed and inserted in the primary cache, and the compressed block is inserted into the nvm cache by calling ```InsertSaved```. This avoids the overhead of recompressing the block, as well as avoiding putting more memory pressure on the compressed secondary cache.
4. When there's a hit in the nvm cache, we attempt to insert the block in the compressed secondary cache and the primary cache, subject to the admission policy of those caches (i.e admit on second access). Blocks/items evicted from any tier are simply discarded.
We can easily implement additional admission policies if desired.
Todo (In a subsequent PR):
1. Add to db_bench and run benchmarks
2. Add to db_stress
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11812
Reviewed By: pdillinger
Differential Revision: D49461842
Pulled By: anand1976
fbshipit-source-id: b40ac1330ef7cd8c12efa0a3ca75128e602e3a0b
Summary:
This PR contains two fixes:
1. disable write fault injection since it caused several other kinds of internal stress test failures. I'll try to fix those separately before enabling it again.
2. Fix segfault like
```
https://github.com/facebook/rocksdb/issues/5 0x000000000083dc43 in rocksdb::port::Mutex::Lock (this=0x30) at internal_repo_rocksdb/repo/port/port_posix.cc:80
80 internal_repo_rocksdb/repo/port/port_posix.cc: No such file or directory.
https://github.com/facebook/rocksdb/issues/6 0x0000000000465142 in rocksdb::MutexLock::MutexLock (mu=0x30, this=<optimized out>) at internal_repo_rocksdb/repo/util/mutexlock.h:37
37 internal_repo_rocksdb/repo/util/mutexlock.h: No such file or directory.
https://github.com/facebook/rocksdb/issues/7 rocksdb::FaultInjectionTestFS::DisableWriteErrorInjection (this=0x0) at internal_repo_rocksdb/repo/utilities/fault_injection_fs.h:505
505 internal_repo_rocksdb/repo/utilities/fault_injection_fs.h: No such file or directory.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11859
Test Plan: db_stress with no fault injection: `./db_stress --write_fault_one_in=0 --read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --sync_fault_injection=0`
Reviewed By: jaykorean
Differential Revision: D49408247
Pulled By: cbi42
fbshipit-source-id: 0ca01f20e6e81bf52af77818b50d562ef7462165
Summary:
* db_crashtest.py now may set `write_fault_one_in` to 500 for blackbox and whitebox simple test.
* Error injection only applies to writing to SST files. Flush error will cause DB to pause background operations and auto-resume. Compaction error will just re-schedule later.
* File ingestion and back up tests are updated to check if the result status is due to an injected error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11829
Test Plan:
a full round of whitebox simple and blackbox simple crash test
* `python3 ./tools/db_crashtest.py whitebox/blackbox --simple --write_fault_one_in=500`
Reviewed By: ajkr
Differential Revision: D49256962
Pulled By: cbi42
fbshipit-source-id: 68e0c9648d8e03bad39c7672b25d5500fc286d97